Four generations of Nvidia GPUs compared

November 30, 2022


This post continues the series of benchmark results published previously with results for the latest four generations of Nvidia GPUs. The selected hardware traces the progress of the high-end consumer-grade graphics adapter, which I have been using in my Deep Learning work over the past five years.

Method

The GPUs were evaluated with a benchmark script from the pytorch-image-models repo by Ross Wightman. This test code was chosen because it represents the real workload of Computer Vision Deep Learning tasks quite accurately. All tests were performed in the same Docker environment, based on the nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 image with PyTorch 1.13.0.
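For illustration, here is a minimal sketch of the kind of inference throughput measurement the benchmark performs. It is not the repository's actual benchmark script (which also covers training, mixed precision, channels-last layout and many other options); the function name, warm-up count and step count are assumptions made for this example.

```python
import time

import timm
import torch


def eval_throughput(model_name="resnet50", batch_size=256, img_size=224,
                    dtype=torch.float16, steps=50):
    """Rough inference throughput in samples per second for a timm model."""
    device = torch.device("cuda")
    model = timm.create_model(model_name, pretrained=False)
    model = model.to(device=device, dtype=dtype).eval()
    batch = torch.randn(batch_size, 3, img_size, img_size,
                        device=device, dtype=dtype)

    with torch.no_grad():
        for _ in range(10):              # warm-up iterations
            model(batch)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(steps):
            model(batch)
        torch.cuda.synchronize()         # wait for all queued GPU work
        elapsed = time.perf_counter() - start

    return steps * batch_size / elapsed


if __name__ == "__main__":
    print(f"{eval_throughput():.0f} samples/s")
```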

In all evaluations, the models were given the same input image size of 224x224 pixels. All inference experiments used a batch size of 256, while for training the batch size was adjusted to the VRAM available on each GPU, as listed in the table below (a sketch of how such a limit can be probed follows the table).

Train batch size (float16 / float32)

| GPU         | vgg16     | resnet50  | tf_efficientnetv2_b0 | swin_base_patch4_window7_224 |
|-------------|-----------|-----------|----------------------|------------------------------|
| GTX 1080 Ti | 192 / 96  | 192 / 96  | 256 / 128            | 64 / 32                      |
| RTX 2080 Ti | 192 / 96  | 192 / 96  | 256 / 128            | 64 / 32                      |
| RTX 3090    | 256 / 256 | 256 / 256 | 256 / 256            | 192 / 96                     |
| RTX 4090    | 256 / 256 | 256 / 256 | 256 / 256            | 192 / 96                     |
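As a rough illustration of how such a per-GPU limit can be found, the sketch below halves the batch size until one forward and backward pass fits into VRAM; the helper name and the halving strategy are assumptions for this example, not the procedure used for the table above.

```python
import timm
import torch


def max_train_batch_size(model_name="vgg16", img_size=224, start=256,
                         dtype=torch.float16):
    """Halve the batch size until one forward+backward pass fits in VRAM."""
    device = torch.device("cuda")
    model = timm.create_model(model_name, pretrained=False)
    model = model.to(device=device, dtype=dtype).train()
    batch_size = start
    while batch_size >= 1:
        try:
            batch = torch.randn(batch_size, 3, img_size, img_size,
                                device=device, dtype=dtype)
            model(batch).sum().backward()    # one training-like step
            return batch_size
        except RuntimeError as err:          # CUDA OOM surfaces as RuntimeError
            if "out of memory" not in str(err):
                raise
            model.zero_grad(set_to_none=True)
            torch.cuda.empty_cache()
            batch_size //= 2
    return 0
```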

All GPUs were tested with stock clock speed, without overclocking applied.

GPU specs
| GPU         | CUDA cores | Base clock, MHz | Tensor Cores | VRAM, GB | TDP, W | Release date |
|-------------|------------|-----------------|--------------|----------|--------|--------------|
| GTX 1080 Ti | 3584       | 1480            | -            | 11       | 250    | Mar 2017     |
| RTX 2080 Ti | 4352       | 1350            | 544          | 11       | 250    | Sep 2018     |
| RTX 3090    | 10496      | 1395            | 328          | 24       | 350    | Sep 2020     |
| RTX 4090    | 16384      | 2230            | 512          | 24       | 450    | Sep 2022     |

It is important to note that Tensor Core technology has been updated with each GPU generation, in particular by supporting a wider range of precisions and delivering higher throughput, which has made using Tensor Cores easier and more rewarding.
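As an illustration of how these precisions are exposed in PyTorch, the snippet below uses real PyTorch settings (their default values differ between versions) to let float32 matrix multiplications run on Tensor Cores via TF32 on Ampere and newer GPUs, and autocast for float16:

```python
import torch

# On Ampere (RTX 30) and newer GPUs, allow float32 matmuls and convolutions
# to run on Tensor Cores using the TF32 format.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# float16 work has been able to use Tensor Cores since the Volta/Turing
# generations, e.g. through autocast:
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.float16):
    c = a @ b    # matmul runs in float16 and can use Tensor Cores
```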

Results

The tables contain inference (eval) and training (train) rates in samples per second. The Nvidia GTX 1080 Ti is used as the reference.
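The percentages in parentheses below are speed-ups relative to that reference; a small helper (the name is just for illustration) makes the convention explicit:

```python
def relative_gain(rate, baseline):
    """Speed-up over the GTX 1080 Ti reference, in percent."""
    return (rate / baseline - 1) * 100


# e.g. RTX 2080 Ti vs GTX 1080 Ti, vgg16 eval, FP32:
print(f"{relative_gain(513, 405):+.1f}%")   # +26.7%
```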

FP32 results (samples/s; change vs GTX 1080 Ti in parentheses)

| GPU         | vgg16 eval     | vgg16 train   | resnet50 eval  | resnet50 train | tf_efficientnetv2_b0 eval | tf_efficientnetv2_b0 train | swin_base_patch4_window7_224 eval | swin_base_patch4_window7_224 train | Average |
|-------------|----------------|---------------|----------------|----------------|---------------------------|----------------------------|-----------------------------------|------------------------------------|---------|
| GTX 1080 Ti | 405            | 104           | 685            | 185            | 1730                      | 418                        | 129                               | 50                                 | 0%      |
| RTX 2080 Ti | 513 (+26.7%)   | 132 (+26.9%)  | 912 (+33.1%)   | 252 (+36.2%)   | 2456 (+42.0%)             | 609 (+45.7%)               | 234 (+81.4%)                      | 76 (+52.0%)                        | +43.0%  |
| RTX 3090    | 997 (+146.2%)  | 285 (+174.0%) | 1708 (+149.3%) | 535 (+189.2%)  | 4211 (+143.4%)            | 1118 (+167.5%)             | 370 (+186.8%)                     | 129 (+158.0%)                      | +164.3% |
| RTX 4090    | 1388 (+242.7%) | 457 (+339.4%) | 2310 (+237.2%) | 721 (+289.7%)  | 6027 (+248.4%)            | 1543 (+269.1%)             | 674 (+422.5%)                     | 404 (+708.0%)                      | +344.6% |

FP16 results (samples/s; change vs GTX 1080 Ti in parentheses)

| GPU         | vgg16 eval     | vgg16 train   | resnet50 eval  | resnet50 train | tf_efficientnetv2_b0 eval | tf_efficientnetv2_b0 train | swin_base_patch4_window7_224 eval | swin_base_patch4_window7_224 train | Average |
|-------------|----------------|---------------|----------------|----------------|---------------------------|----------------------------|-----------------------------------|------------------------------------|---------|
| GTX 1080 Ti | 417            | 94            | 887            | 235            | 2136                      | 499                        | 152                               | 57                                 | 0%      |
| RTX 2080 Ti | 966 (+131.7%)  | 309 (+228.7%) | 1995 (+124.9%) | 554 (+135.7%)  | 4617 (+116.2%)            | 1124 (+125.3%)             | 680 (+347.4%)                     | 225 (+294.7%)                      | +229.6% |
| RTX 3090    | 1394 (+234.3%) | 442 (+370.2%) | 3017 (+240.1%) | 890 (+278.7%)  | 7059 (+230.5%)            | 1706 (+241.9%)             | 1026 (+575.0%)                    | 341 (+500.0%)                      | +333.8% |
| RTX 4090    | 2359 (+465.7%) | 729 (+675.5%) | 4495 (+406.8%) | 1285 (+446.8%) | 11856 (+455.1%)           | 2598 (+420.6%)             | 1692 (+1013.2%)                   | 563 (+887.7%)                      | +596.4% |

Observations

  • The newest model architecture in the set, the Swin Transformer, benefits the most from recent hardware advances.
  • Tensor Cores appear to play a significant role in the performance increase. The most prominent boost in float32 performance occurred when moving from the RTX 20 series to the RTX 30 series, which coincides with Tensor Core support for TF32. Similarly, the biggest float16 jump came with the first introduction of Tensor Cores in the RTX 2080 Ti.
  • Interestingly, the performance of older GPUs has also improved through software updates. In previous tests with PyTorch 1.0 and CUDA 10.0, the GTX 1080 Ti performed only slightly better in float16 mode than in float32, whereas the current results show an average improvement of about 15%. Likewise, the float16 advantage of the RTX 2080 Ti over the GTX 1080 Ti was below 2x previously but is significantly higher now. This suggests that software updates (both from Nvidia and in PyTorch) also improve performance, and that older GPUs benefit from them.
  • The performance increase on newer hardware is slightly larger for training than for inference. This may be because training is less dependent on data transfer, since the GPU computation takes longer per batch.
  • While the average performance increase is impressive, the 80% higher power consumption of the RTX 4090 compared to the GTX 1080 Ti (450 W vs 250 W TDP) may make the results less appealing to some users.
  • If we take Moore's law as a doubling of computational performance every two years, the time between the oldest and the newest GPU in this study should translate into roughly a 6.75x performance increase. The demonstrated average improvement on Deep Learning tasks is approximately 6x, which is in good agreement, given the imperfections of the measurements and the empirical nature of the relationship. Therefore, it can be concluded that Moore's law is still not dead, at least for float16 GPU computations :).
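A quick back-of-the-envelope check of that expectation, assuming roughly 5.5 years between March 2017 and September 2022 and one doubling every two years:

```python
years = 5.5                   # Mar 2017 (GTX 1080 Ti) to Sep 2022 (RTX 4090)
expected = 2 ** (years / 2)   # one doubling every two years
print(f"{expected:.2f}x")     # ~6.7x; the exact figure depends on the dates used
```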

Acknowledgments

Many thanks to Ruslan Baikulov, who contributed some of the test results.