Four generations of Nvidia GPUs compared

November 30, 2022


This post continues the series of benchmark results published previously with results for the latest four generations of Nvidia GPUs. The selected hardware traces the progress of the high-end consumer-grade graphics adapter, which I have been using in my Deep Learning work over the past five years.

Method

The GPUs were evaluated with a benchmark script from the pytorch-image-models repo by Ross Wightman. This test code was chosen because it represents the real workload of Computer Vision Deep Learning tasks quite accurately. All tests were performed in the same Docker environment, based on the nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 image with PyTorch 1.13.0.
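For illustration, here is a minimal sketch of the kind of inference throughput measurement the benchmark performs. It is not the repository's actual benchmark script (which also covers training, mixed precision, channels-last layout and many other options); the function name, warm-up count and step count are assumptions made for this example.

```python
import time

import timm
import torch


def eval_throughput(model_name="resnet50", batch_size=256, img_size=224,
                    dtype=torch.float16, steps=50):
    """Rough inference throughput in samples per second for a timm model."""
    device = torch.device("cuda")
    model = timm.create_model(model_name, pretrained=False)
    model = model.to(device=device, dtype=dtype).eval()
    batch = torch.randn(batch_size, 3, img_size, img_size,
                        device=device, dtype=dtype)

    with torch.no_grad():
        for _ in range(10):              # warm-up iterations
            model(batch)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(steps):
            model(batch)
        torch.cuda.synchronize()         # wait for all queued GPU work
        elapsed = time.perf_counter() - start

    return steps * batch_size / elapsed


if __name__ == "__main__":
    print(f"{eval_throughput():.0f} samples/s")
```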

In all evaluations, the models were given the same input image size of 224x224 pixels. All inference experiments used a batch size of 256, while for training the batch size was adjusted to the VRAM available on each GPU, as listed in the table below (a sketch of how such a limit can be probed follows the table).

Train batch size (float16 / float32)

| GPU         | vgg16     | resnet50  | tf_efficientnetv2_b0 | swin_base_patch4_window7_224 |
|-------------|-----------|-----------|----------------------|------------------------------|
| GTX 1080 Ti | 192 / 96  | 192 / 96  | 256 / 128            | 64 / 32                      |
| RTX 2080 Ti | 192 / 96  | 192 / 96  | 256 / 128            | 64 / 32                      |
| RTX 3090    | 256 / 256 | 256 / 256 | 256 / 256            | 192 / 96                     |
| RTX 4090    | 256 / 256 | 256 / 256 | 256 / 256            | 192 / 96                     |
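As a rough illustration of how such a per-GPU limit can be found, the sketch below halves the batch size until one forward and backward pass fits into VRAM; the helper name and the halving strategy are assumptions for this example, not the procedure used for the table above.

```python
import timm
import torch


def max_train_batch_size(model_name="vgg16", img_size=224, start=256,
                         dtype=torch.float16):
    """Halve the batch size until one forward+backward pass fits in VRAM."""
    device = torch.device("cuda")
    model = timm.create_model(model_name, pretrained=False)
    model = model.to(device=device, dtype=dtype).train()
    batch_size = start
    while batch_size >= 1:
        try:
            batch = torch.randn(batch_size, 3, img_size, img_size,
                                device=device, dtype=dtype)
            model(batch).sum().backward()    # one training-like step
            return batch_size
        except RuntimeError as err:          # CUDA OOM surfaces as RuntimeError
            if "out of memory" not in str(err):
                raise
            model.zero_grad(set_to_none=True)
            torch.cuda.empty_cache()
            batch_size //= 2
    return 0
```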

All GPUs were tested with stock clock speed, without overclocking applied.

GPU specs
| GPU         | CUDA cores | Base clock, MHz | Tensor Cores | VRAM, GB | TDP, W | Release date |
|-------------|------------|-----------------|--------------|----------|--------|--------------|
| GTX 1080 Ti | 3584       | 1480            | -            | 11       | 250    | Mar 2017     |
| RTX 2080 Ti | 4352       | 1350            | 544          | 11       | 250    | Sep 2018     |
| RTX 3090    | 10496      | 1395            | 328          | 24       | 350    | Sep 2020     |
| RTX 4090    | 16384      | 2230            | 512          | 24       | 450    | Sep 2022     |

It is important to note that Tensor Core technology has been updated with each GPU generation, in particular by supporting a wider range of precisions and delivering higher throughput, which has made using Tensor Cores easier and more rewarding.
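As an illustration of how these precisions are exposed in PyTorch, the snippet below uses real PyTorch settings (their default values differ between versions) to let float32 matrix multiplications run on Tensor Cores via TF32 on Ampere and newer GPUs, and autocast for float16:

```python
import torch

# On Ampere (RTX 30) and newer GPUs, allow float32 matmuls and convolutions
# to run on Tensor Cores using the TF32 format.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# float16 work has been able to use Tensor Cores since the Volta/Turing
# generations, e.g. through autocast:
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.float16):
    c = a @ b    # matmul runs in float16 and can use Tensor Cores
```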

Results

The tables contain inference (eval) and training (train) rates in samples per second. The Nvidia GTX 1080 Ti is used as the reference.
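The percentages in parentheses below are speed-ups relative to that reference; a small helper (the name is just for illustration) makes the convention explicit:

```python
def relative_gain(rate, baseline):
    """Speed-up over the GTX 1080 Ti reference, in percent."""
    return (rate / baseline - 1) * 100


# e.g. RTX 2080 Ti vs GTX 1080 Ti, vgg16 eval, FP32:
print(f"{relative_gain(513, 405):+.1f}%")   # +26.7%
```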

FP32 results (samples/s; change vs GTX 1080 Ti in parentheses)

| GPU         | vgg16 eval     | vgg16 train   | resnet50 eval  | resnet50 train | tf_efficientnetv2_b0 eval | tf_efficientnetv2_b0 train | swin_base_patch4_window7_224 eval | swin_base_patch4_window7_224 train | Average |
|-------------|----------------|---------------|----------------|----------------|---------------------------|----------------------------|-----------------------------------|------------------------------------|---------|
| GTX 1080 Ti | 405            | 104           | 685            | 185            | 1730                      | 418                        | 129                               | 50                                 | 0%      |
| RTX 2080 Ti | 513 (+26.7%)   | 132 (+26.9%)  | 912 (+33.1%)   | 252 (+36.2%)   | 2456 (+42.0%)             | 609 (+45.7%)               | 234 (+81.4%)                      | 76 (+52.0%)                        | +43.0%  |
| RTX 3090    | 997 (+146.2%)  | 285 (+174.0%) | 1708 (+149.3%) | 535 (+189.2%)  | 4211 (+143.4%)            | 1118 (+167.5%)             | 370 (+186.8%)                     | 129 (+158.0%)                      | +164.3% |
| RTX 4090    | 1388 (+242.7%) | 457 (+339.4%) | 2310 (+237.2%) | 721 (+289.7%)  | 6027 (+248.4%)            | 1543 (+269.1%)             | 674 (+422.5%)                     | 404 (+708.0%)                      | +344.6% |

FP16 results (samples/s; change vs GTX 1080 Ti in parentheses)

| GPU         | vgg16 eval     | vgg16 train   | resnet50 eval  | resnet50 train | tf_efficientnetv2_b0 eval | tf_efficientnetv2_b0 train | swin_base_patch4_window7_224 eval | swin_base_patch4_window7_224 train | Average |
|-------------|----------------|---------------|----------------|----------------|---------------------------|----------------------------|-----------------------------------|------------------------------------|---------|
| GTX 1080 Ti | 417            | 94            | 887            | 235            | 2136                      | 499                        | 152                               | 57                                 | 0%      |
| RTX 2080 Ti | 966 (+131.7%)  | 309 (+228.7%) | 1995 (+124.9%) | 554 (+135.7%)  | 4617 (+116.2%)            | 1124 (+125.3%)             | 680 (+347.4%)                     | 225 (+294.7%)                      | +229.6% |
| RTX 3090    | 1394 (+234.3%) | 442 (+370.2%) | 3017 (+240.1%) | 890 (+278.7%)  | 7059 (+230.5%)            | 1706 (+241.9%)             | 1026 (+575.0%)                    | 341 (+500.0%)                      | +333.8% |
| RTX 4090    | 2359 (+465.7%) | 729 (+675.5%) | 4495 (+406.8%) | 1285 (+446.8%) | 11856 (+455.1%)           | 2598 (+420.6%)             | 1692 (+1013.2%)                   | 563 (+887.7%)                      | +596.4% |

Observations

  • The newest model architecture in the set, the Swin Transformer, benefits the most from recent hardware advances.
  • Tensor Cores appear to play a significant role in the performance increase. The most prominent boost in float32 performance occurred when moving from the RTX 20 series to the RTX 30 series, which coincides with Tensor Core support for TF32. Similarly, the biggest float16 jump came with the first introduction of Tensor Cores in the RTX 2080 Ti.
  • Interestingly, the performance of older GPUs has also improved through software updates. In previous tests with PyTorch 1.0 and CUDA 10.0, the GTX 1080 Ti performed only slightly better in float16 mode than in float32, whereas the current results show an average improvement of about 15%. Likewise, the float16 advantage of the RTX 2080 Ti over the GTX 1080 Ti was below 2x previously but is significantly higher now. This suggests that software updates (both from Nvidia and in PyTorch) also improve performance, and that older GPUs benefit from them.
  • The performance increase on newer hardware is slightly larger for training than for inference. This may be because training is less dependent on data transfer, since the GPU computation takes longer per batch.
  • While the average performance increase is impressive, the 80% higher power consumption of the RTX 4090 compared to the GTX 1080 Ti (450 W vs 250 W TDP) may make the results less appealing to some users.
  • If we take Moore's law as a doubling of computational performance every two years, the time between the oldest and the newest GPU in this study should translate into roughly a 6.75x performance increase. The demonstrated average improvement on Deep Learning tasks is approximately 6x, which is in good agreement, given the imperfections of the measurements and the empirical nature of the relationship. Therefore, it can be concluded that Moore's law is still not dead, at least for float16 GPU computations :).
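A quick back-of-the-envelope check of that expectation, assuming roughly 5.5 years between March 2017 and September 2022 and one doubling every two years:

```python
years = 5.5                   # Mar 2017 (GTX 1080 Ti) to Sep 2022 (RTX 4090)
expected = 2 ** (years / 2)   # one doubling every two years
print(f"{expected:.2f}x")     # ~6.7x; the exact figure depends on the dates used
```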

Acknowledgments

Many thanks to Ruslan Baikulov, who contributed some of the test results.