Four generations of Nvidia GPUs compared
November 30, 2022
5 mins read
Testing
This post continues the series of benchmark results published previously, with results for GPUs of the last four Nvidia generations. The selected hardware traces the progress of the high-end consumer-grade graphics adapters that I have been using in my Deep Learning work over the past five years.
Method
The GPUs were evaluated with the benchmark script from the pytorch-image-models repo by Ross Wightman. This test code was selected because it seems to represent the real workload of Computer Vision Deep Learning tasks quite accurately. All tests were performed in the same Docker environment, based on the nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 image with PyTorch 1.13.0.
In all evaluations, the models were given the same input image size of 224x224 pixels. The batch size was 256 in all inference experiments, while for training the batch size varied depending on the VRAM available on each GPU (see the table below).
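To give an idea of what the benchmark measures, here is a minimal sketch (not the actual timm code) that times forward passes of a timm model on random 224x224 inputs and reports the inference rate in samples per second; the warm-up and iteration counts are arbitrary illustrative choices.

```python
import time

import timm
import torch


def eval_throughput(model_name: str, batch_size: int, img_size: int = 224,
                    dtype: torch.dtype = torch.float16, steps: int = 50) -> float:
    """Measure inference throughput (samples/sec) for a single timm model."""
    device = torch.device("cuda")
    model = timm.create_model(model_name, pretrained=False).to(device=device, dtype=dtype)
    model.eval()
    x = torch.randn(batch_size, 3, img_size, img_size, device=device, dtype=dtype)

    with torch.no_grad():
        for _ in range(10):          # warm-up iterations before timing
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(steps):       # timed forward passes
            model(x)
        torch.cuda.synchronize()
    return steps * batch_size / (time.perf_counter() - start)


if __name__ == "__main__":
    for name in ["vgg16", "resnet50", "tf_efficientnetv2_b0",
                 "swin_base_patch4_window7_224"]:
        print(f"{name}: {eval_throughput(name, batch_size=256):.0f} samples/sec (float16 eval)")
```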
Training batch sizes (samples per batch) for each model and precision:

| GPU | vgg16 float16 | vgg16 float32 | resnet50 float16 | resnet50 float32 | tf_efficientnetv2_b0 float16 | tf_efficientnetv2_b0 float32 | swin_base_patch4_window7_224 float16 | swin_base_patch4_window7_224 float32 |
|---|---|---|---|---|---|---|---|---|
| GTX 1080 Ti | 192 | 96 | 192 | 96 | 256 | 128 | 64 | 32 |
| RTX 2080 Ti | 192 | 96 | 192 | 96 | 256 | 128 | 64 | 32 |
| RTX 3090 | 256 | 256 | 256 | 256 | 256 | 256 | 192 | 96 |
| RTX 4090 | 256 | 256 | 256 | 256 | 256 | 256 | 192 | 96 |
All GPUs were tested at stock clock speeds, without any overclocking applied.
| GPU | Number of CUDA cores | Base Clock, MHz | Number of Tensor Cores | VRAM, GB | TDP, W | Release date |
|---|---|---|---|---|---|---|
| GTX 1080 Ti | 3584 | 1480 | - | 11 | 250 | Mar 2017 |
| RTX 2080 Ti | 4352 | 1350 | 544 | 11 | 250 | Sep 2018 |
| RTX 3090 | 10496 | 1395 | 328 | 24 | 350 | Sep 2020 |
| RTX 4090 | 16384 | 2230 | 512 | 24 | 450 | Sep 2022 |
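As a quick sanity check against the table above, the properties of the GPU actually visible to PyTorch can be printed directly; this is a generic snippet rather than part of the benchmark script.

```python
import torch

# Print basic properties of the first visible CUDA device.
props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}")
print(f"VRAM: {props.total_memory / 1024**3:.1f} GB")
print(f"Compute capability: {props.major}.{props.minor}")
print(f"Multiprocessors: {props.multi_processor_count}")
```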
It is important to note that Tensor Core technology has been updated with each GPU generation, in particular with support for a wider range of precisions and improved throughput, which has made using Tensor Cores easier and more rewarding.
Results
The tables below contain the inference (`eval`) and training (`train`) rates in samples per second, first for float32 precision and then for float16. The Nvidia GTX 1080 Ti is used as the reference.
float32

| GPU | vgg16 eval | vgg16 train | resnet50 eval | resnet50 train | tf_efficientnetv2_b0 eval | tf_efficientnetv2_b0 train | swin_base_patch4_window7_224 eval | swin_base_patch4_window7_224 train | Average |
|---|---|---|---|---|---|---|---|---|---|
| GTX 1080 Ti | 405 | 104 | 685 | 185 | 1730 | 418 | 129 | 50 | 0% |
| RTX 2080 Ti | 513 (+26.7%) | 132 (+26.9%) | 912 (+33.1%) | 252 (+36.2%) | 2456 (+42.0%) | 609 (+45.7%) | 234 (+81.4%) | 76 (+52.0%) | +43.0% |
| RTX 3090 | 997 (+146.2%) | 285 (+174.0%) | 1708 (+149.3%) | 535 (+189.2%) | 4211 (+143.4%) | 1118 (+167.5%) | 370 (+186.8%) | 129 (+158.0%) | +164.3% |
| RTX 4090 | 1388 (+242.7%) | 457 (+339.4%) | 2310 (+237.2%) | 721 (+289.7%) | 6027 (+248.4%) | 1543 (+269.1%) | 674 (+422.5%) | 404 (+708.0%) | +344.6% |
float16

| GPU | vgg16 eval | vgg16 train | resnet50 eval | resnet50 train | tf_efficientnetv2_b0 eval | tf_efficientnetv2_b0 train | swin_base_patch4_window7_224 eval | swin_base_patch4_window7_224 train | Average |
|---|---|---|---|---|---|---|---|---|---|
| GTX 1080 Ti | 417 | 94 | 887 | 235 | 2136 | 499 | 152 | 57 | 0% |
| RTX 2080 Ti | 966 (+131.7%) | 309 (+228.7%) | 1995 (+124.9%) | 554 (+135.7%) | 4617 (+116.2%) | 1124 (+125.3%) | 680 (+347.4%) | 225 (+294.7%) | +229.6% |
| RTX 3090 | 1394 (+234.3%) | 442 (+370.2%) | 3017 (+240.1%) | 890 (+278.7%) | 7059 (+230.5%) | 1706 (+241.9%) | 1026 (+575.0%) | 341 (+500.0%) | +333.8% |
| RTX 4090 | 2359 (+465.7%) | 729 (+675.5%) | 4495 (+406.8%) | 1285 (+446.8%) | 11856 (+455.1%) | 2598 (+420.6%) | 1692 (+1013.2%) | 563 (+887.7%) | +596.4% |
Observations
- The most recent model architecture in the set, the Swin Transformer (swin_base_patch4_window7_224), benefits the most from newer hardware, showing the largest relative gains in both tables.
- It appears that Tensor Cores play a significant role in the performance increase. The most prominent boost in float32 performance occurred when moving from the RTX 20 series to the RTX 30 series, which coincides with the introduction of TF32 support in Tensor Cores. Similarly, the most significant jump in float16 performance came with the first introduction of Tensor Cores in the RTX 2080 Ti. (A short snippet showing the PyTorch TF32 switches follows this list.)
- Interestingly, the performance of older GPUs has also improved through software updates. In previous tests with PyTorch 1.0 and CUDA 10.0, the GTX 1080 Ti performed only slightly better in float16 mode than in float32, whereas the current results show an average improvement of about 15%. Additionally, the advantage of the RTX 2080 Ti over the GTX 1080 Ti in float16 mode used to be below 2x, while the current results show a significantly larger benefit. This suggests that software updates (from both Nvidia and PyTorch) also improve performance, and that older GPUs benefit from them as well.
- The performance increase with newer hardware is slightly larger for training than for inference. This may be because training is less dependent on data transfer, since the GPU computation per batch takes longer.
- While the average performance increase is impressive, the higher power consumption of the RTX 4090 compared to the GTX 1080 Ti (an 80% increase in TDP) may make the results less appealing to some users.
- If we take Moore's law as a doubling of computational performance every two years, the roughly 5.5 years between the oldest and the newest GPU in this study (Mar 2017 to Sep 2022) should result in about a 6.75x performance increase. The demonstrated average improvement on Deep Learning tasks in float16 mode is approximately 7x (+596.4%), which is in good agreement, given the imperfections of the measurements and the empirical nature of the relationship. Therefore, it can be concluded that Moore's law is still not dead, at least for float16 GPU computations :).
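For completeness, here is a minimal sketch of the TF32 switches mentioned in the Tensor Core observation above, as exposed by PyTorch. The default values of these flags have changed between PyTorch releases, so the settings shown are illustrative rather than the exact configuration used by the benchmark script.

```python
import torch

# On Ampere (RTX 30 series) and newer GPUs, PyTorch can execute float32 matmuls
# and convolutions on Tensor Cores using the TF32 format. These flags control
# that behaviour; their defaults differ between PyTorch versions.
torch.backends.cuda.matmul.allow_tf32 = True   # TF32 for matrix multiplications
torch.backends.cudnn.allow_tf32 = True         # TF32 for cuDNN convolutions

print(torch.backends.cuda.matmul.allow_tf32, torch.backends.cudnn.allow_tf32)
```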
Acknowledgments
Many thanks to Ruslan Baikulov, who contributed some of the test results.