Benchmarking Nvidia RTX 5090

February 17, 2025

Nvidia RTX 5090. Official image from Nvidia.

Nvidia RTX 5090 tests were performed on a system with an AMD Ryzen 9 9950X using Nvidia’s proprietary driver 570.86.16 and CUDA 12.8 in a Docker environment. Note that the driver is marked as ‘beta’, so GPU performance may differ with future releases. All hardware ran at default settings in every test case, without overclocking.

GPUs

GPU	Number of CUDA Cores	Base Clock (MHz)	Number of Tensor Cores	VRAM (GB)	VRAM Bandwidth (GB/s)	Memory Bus Width (bits)	TDP (W)	Lithography (nm)	Release Date
GTX 1080 Ti	3584	1480	-	11	484	352	250	16	Mar 2017
RTX 2080 Ti	4352	1350	544	11	616	352	250	12	Sep 2018
RTX 3090	10496	1395	328	24	936	384	350	8	Sep 2020
RTX 4090	16384	2230	512	24	1018	384	450	5	Sep 2022
RTX 5090	21760	2017	576	32	1792	512	575	4	Jan 2025

Note that Tensor Cores were redesigned with each architecture generation, adding support for new precisions and operations as well as optimizing existing ones. Therefore, the Tensor Core count should not be treated as a direct proxy for performance.

Computer Vision models

Computer vision models benchmarks results

The tests were performed using benchmarks from timm, version 1.0.14, a collection of computer vision models. The models were chosen partly to match previous benchmarks, which keeps the results comparable with older results for previous generations of GPUs [1]. The benchmark was performed using nightly builds of PyTorch 2.6.0 with CUDA 12.8 support.

The results are based on a batch size of 256, which is most relevant to training scenarios and to inference in concurrent applications. If the desired batch size did not fit into VRAM, it was reduced in steps of 32 until it fit. The image size for all models was set to 224x224. This setup can also be viewed as an upper-bound estimate of each GPU’s throughput. At the same time, the tests are not meant to demonstrate the absolute peak performance of the hardware, as advanced optimization techniques were not applied; instead, they aim to compare different generations of GPUs under roughly equal settings.
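The batch size fallback described above can be sketched as follows. This is a minimal illustration rather than the code used for the benchmark: the helper name `find_fitting_batch_size`, the `run_step` callable and the `oom_errors` parameter are all hypothetical.

```python
def find_fitting_batch_size(run_step, start=256, step=32, oom_errors=(MemoryError,)):
    """Return the largest batch size, starting at `start` and going down by
    `step`, for which `run_step(batch_size)` completes without running out of memory.

    `run_step` is a hypothetical callable that executes one benchmark step.
    """
    batch_size = start
    while batch_size > 0:
        try:
            run_step(batch_size)
            return batch_size
        except oom_errors:
            # Assumed policy from the text: shrink the batch by a fixed step and retry.
            batch_size -= step
    raise RuntimeError("even the smallest tested batch size does not fit into memory")


def fake_step(batch_size):
    # Stand-in for a real forward/backward pass: pretend that only batches
    # of up to 200 samples fit into memory.
    if batch_size > 200:
        raise MemoryError("simulated out-of-memory")


print(find_fitting_batch_size(fake_step))
```

With PyTorch, one would likely pass `oom_errors=(torch.cuda.OutOfMemoryError,)` and call `torch.cuda.empty_cache()` between attempts.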

Note that these results do not include additional hardware-specific optimizations or the use of torch.compile, either of which is expected to change the results, given the different generations of Tensor Cores and the differences in their features.

The reported increase percentage is calculated using RTX 3090 as the baseline. All results are in samples per second.
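For clarity, the increase percentage is the relative gain over the baseline. A small sketch (the function name `increase_pct` is ours, for illustration only), using the vgg16 FP32 inference numbers for RTX 4090 and RTX 3090:

```python
def increase_pct(value, baseline):
    """Relative increase of `value` over `baseline`, in percent."""
    return (value / baseline - 1.0) * 100.0


# vgg16, FP32 inference: RTX 4090 (1454.6 samples/s) vs RTX 3090 (841.0 samples/s)
print(f"{increase_pct(1454.6, 841.0):+.1f}%")  # prints +73.0%
```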

FP32 Comparison

GPU	vgg16		resnet50		tf_efficientnetv2_b0		swin_base_patch4_window7_224		efficientvit_m4
	Inference	Train	Inference	Train	Inference	Train	Inference	Train	Inference	Train
RTX 3090	841.0	260.8	1679.9	523.0	4358.6	1145.8	493.9	158.0	10600.6	2730.0
RTX 4090	1454.6 (+73.0%)	456.5 (+75.1%)	2433.1 (+44.8%)	757.5 (+44.8%)	6477.3 (+48.6%)	1643.8 (+43.5%)	855.3 (+73.2%)	293.2 (+85.6%)	18975.9 (+79.0%)	3866.7 (+41.6%)
RTX 5090	1867.5 (+122.1%)	594.7 (+128.1%)	3576.8 (+112.9%)	1128.6 (+115.8%)	9254.5 (+112.3%)	2448.9 (+113.7%)	1315.8 (+166.4%)	450.2 (+185.0%)	23555.6 (+122.2%)	6940.8 (+154.3%)

FP16 Comparison

GPU	vgg16		resnet50		tf_efficientnetv2_b0		swin_base_patch4_window7_224		efficientvit_m4
	Inference	Train	Inference	Train	Inference	Train	Inference	Train	Inference	Train
RTX 3090	1387.6	438.2	2973.1	888.7	7010.4	1818.3	979.1	337.0	11087.8	3114.9
RTX 4090	2418.6 (+74.3%)	837.5 (+91.1%)	4601.8 (+54.8%)	1360.6 (+53.1%)	12393.6 (+76.8%)	2823.5 (+55.3%)	1762.2 (+80.0%)	597.1 (+77.2%)	17223.6 (+55.3%)	3810.7 (+22.3%)
RTX 5090	3350.1 (+141.4%)	1161.0 (+164.9%)	5741.6 (+93.1%)	1623.9 (+82.7%)	15907.3 (+126.9%)	3446.1 (+89.5%)	2471.9 (+152.5%)	822.3 (+144.1%)	31682.2 (+185.7%)	7310.4 (+134.7%)

On average, switching from Ampere to Blackwell gives a roughly equal boost of about 132% for both precisions (or about 44% when switching from Ada Lovelace to Blackwell). As a speculation, it is notable that the boost is smaller for the convolution-dominant models in the test (ResNet and EfficientNet): about 113% for FP32 and 98% for FP16 for RTX 5090 vs RTX 3090. This may indicate that the newer GPU’s architecture is better optimized for models dominated by matrix multiplication, or that such models benefit more from the updated memory subsystem. For the remaining models (VGG and Swin Transformer), the boost in FP16 is more significant (roughly 141% to 165%), which is not surprising given that modern training pipelines are often optimized for half precision. Although the test provides no direct evidence for this hypothesis, the very fast EfficientViT model may be more sensitive to VRAM bandwidth, which could explain its outlier results.
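The averages quoted above can be reproduced from the tables with a short script. This is a sketch under one assumption: each average is a plain arithmetic mean over the per-benchmark gains, with inference and training counted separately. The helper `mean_gain_pct` is a hypothetical name.

```python
from statistics import mean

MODELS = ["vgg16", "resnet50", "tf_efficientnetv2_b0",
          "swin_base_patch4_window7_224", "efficientvit_m4"]
CONV_MODELS = ["resnet50", "tf_efficientnetv2_b0"]

# Samples per second copied from the tables above: (inference, train) per model,
# listed in the same order as MODELS.
FP32 = {
    "RTX 3090": [(841.0, 260.8), (1679.9, 523.0), (4358.6, 1145.8),
                 (493.9, 158.0), (10600.6, 2730.0)],
    "RTX 5090": [(1867.5, 594.7), (3576.8, 1128.6), (9254.5, 2448.9),
                 (1315.8, 450.2), (23555.6, 6940.8)],
}
FP16 = {
    "RTX 3090": [(1387.6, 438.2), (2973.1, 888.7), (7010.4, 1818.3),
                 (979.1, 337.0), (11087.8, 3114.9)],
    "RTX 5090": [(3350.1, 1161.0), (5741.6, 1623.9), (15907.3, 3446.1),
                 (2471.9, 822.3), (31682.2, 7310.4)],
}


def mean_gain_pct(table, new, old, models=MODELS):
    """Mean percentage gain of GPU `new` over GPU `old` for the selected models."""
    gains = []
    for name, new_pair, old_pair in zip(MODELS, table[new], table[old]):
        if name in models:
            gains.extend((n / o - 1.0) * 100.0 for n, o in zip(new_pair, old_pair))
    return mean(gains)


for label, table in (("FP32", FP32), ("FP16", FP16)):
    overall = mean_gain_pct(table, "RTX 5090", "RTX 3090")
    conv = mean_gain_pct(table, "RTX 5090", "RTX 3090", CONV_MODELS)
    print(f"{label}: all models {overall:.0f}%, convolution-dominant {conv:.0f}%")
```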

LLMs

Ollama models benchmarks results

All tests were performed using Ollama 0.5.11 with an 8k context length and Q4_K_M quantisation, which is the default recommended quantisation level for Ollama.

All results are reported in tokens per second. The increase percentage is calculated using RTX 3090 as the baseline.
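As an illustration of how such a number can be obtained, the Ollama REST API reports `eval_count` (generated tokens) and `eval_duration` (in nanoseconds) in its generate response, and the context length can be set through the `num_ctx` option. The sketch below is not necessarily how the figures in this post were measured; the helper names are hypothetical, and interpreting "8k" as 8192 tokens is our assumption.

```python
def build_request(model, prompt, num_ctx=8192):
    """Payload for Ollama's /api/generate endpoint.

    num_ctx=8192 is an assumed reading of the "8k context length" above.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON response that includes timing fields
        "options": {"num_ctx": num_ctx},
    }


def tokens_per_second(response):
    """Generation speed from the timing fields of an Ollama response."""
    seconds = response["eval_duration"] / 1e9  # eval_duration is in nanoseconds
    return response["eval_count"] / seconds


# Example with a made-up response: 120 tokens generated in 2 seconds.
print(tokens_per_second({"eval_count": 120, "eval_duration": 2_000_000_000}))
```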

Model	RTX 3090	RTX 4090 (Increase %)	RTX 5090 (Increase %)
deepseek-r1:32b	30.85	37.44 (+21.36%)	60.66 (+96.63%)
qwen2.5:32b	32.12	38.15 (+18.78%)	62.81 (+95.54%)
qwen2.5:7b	100.32	119.56 (+19.18%)	213.48 (+112.80%)
mistral-small:24b	45.78	54.04 (+17.99%)	91.29 (+99.37%)
phi4:14b	64.40	77.84 (+20.87%)	130.31 (+102.35%)
phi3.5:3.8b	170.24	217.32 (+27.69%)	346.65 (+103.62%)
llama3.1:8b	100.53	121.74 (+21.10%)	210.79 (+109.68%)
llama3.2:3b	152.83	182.11 (+19.24%)	339.51 (+122.33%)
qwen2.5:1.5b	170.29	214.98 (+26.26%)	402.32 (+136.26%)

Interestingly, the average improvement of RTX 4090 over RTX 3090 is smaller than that observed for Computer Vision models, which may be related to a stronger influence of memory bandwidth on language models, or to other features of the test setup or of the models themselves.

On average, RTX 4090 outperforms RTX 3090 by about 21.4%, while the latest-generation GPU (RTX 5090) is faster than RTX 4090 by about 72%, which is a significant improvement between generations and may justify an upgrade. The observed difference may be attributed to the fact that language models are more demanding on memory bandwidth, and the latest generation’s VRAM offers a substantial improvement in bandwidth (~1.76x over RTX 4090).
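To put the bandwidth hypothesis next to the measurements, the sketch below compares the mean per-model speedup between consecutive generations with the ratio of their VRAM bandwidths, using the numbers from the tables above. The averaging method (arithmetic mean of per-model ratios) and the helper names are our assumptions.

```python
from statistics import mean

GPUS = ("RTX 3090", "RTX 4090", "RTX 5090")

# Tokens per second from the table above, in the order of GPUS.
TOKENS_PER_S = {
    "deepseek-r1:32b": (30.85, 37.44, 60.66),
    "qwen2.5:32b": (32.12, 38.15, 62.81),
    "qwen2.5:7b": (100.32, 119.56, 213.48),
    "mistral-small:24b": (45.78, 54.04, 91.29),
    "phi4:14b": (64.40, 77.84, 130.31),
    "phi3.5:3.8b": (170.24, 217.32, 346.65),
    "llama3.1:8b": (100.53, 121.74, 210.79),
    "llama3.2:3b": (152.83, 182.11, 339.51),
    "qwen2.5:1.5b": (170.29, 214.98, 402.32),
}

# VRAM bandwidth in GB/s, as listed in the GPU table.
BANDWIDTH = {"RTX 3090": 936, "RTX 4090": 1018, "RTX 5090": 1792}


def mean_speedup(new, old):
    """Mean per-model ratio of tokens per second between two GPUs."""
    i, j = GPUS.index(new), GPUS.index(old)
    return mean(row[i] / row[j] for row in TOKENS_PER_S.values())


def bandwidth_ratio(new, old):
    """Ratio of VRAM bandwidths between two GPUs."""
    return BANDWIDTH[new] / BANDWIDTH[old]


for new, old in (("RTX 4090", "RTX 3090"), ("RTX 5090", "RTX 4090")):
    print(f"{new} vs {old}: speed x{mean_speedup(new, old):.2f}, "
          f"bandwidth x{bandwidth_ratio(new, old):.2f}")
```

In both generational steps the speedup is of the same order as the bandwidth ratio, which is consistent with the hypothesis but does not prove it.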

Conclusion

To sum up, the generational gap between RTX 4090 and RTX 5090 is about 44% in Computer Vision tasks and about 72% in Natural Language Processing tasks, achieved at the cost of a ~28% increase in TDP. In addition, transitioning to Blackwell offers faster and larger VRAM, which may provide further benefits for many applications. At the same time, upgrading from RTX 3090 generally more than doubles performance across all task types (a boost of about 132% in Computer Vision and about 108% on average in Ollama LLM inference). Of course, whether this upgrade is worthwhile depends on individual or organisational needs, desired features (considering the VRAM upgrade), and budget constraints.

The main question we still have to answer: is Moore’s law dead or not? Consider a simplified formulation: computational performance doubles every two years. If we compare the performance of the most recent GPU with the Nvidia GTX 1080 Ti, the oldest one tested on this blog [1], we can see an FP16 training improvement of about 14.4x (for the Swin model). Given the time between the releases of GTX 1080 Ti (March 2017) and RTX 5090 (January 2025), just under eight years, we should expect a roughly 15-fold increase in compute. This suggests that mankind’s progress in semiconductors still roughly keeps pace with Moore’s law, with the caveat that this may not hold for FP32 compute or convolution-based models.

The Nvidia RTX 5090 Founders Edition GPU’s convenient two-slot design makes it an excellent solution for dual-GPU workstations. With its notable TDP, when paired with a decent CPU, such a setup is not only a desirable tool for many Deep Learning developers but can also double as an efficient home heater during those chilly winter months.
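For reference, the Moore’s law expectation mentioned above follows from the release dates in the GPU table. The table gives only month and year, so using the first day of each month is an assumption of this sketch, as is the function name `expected_fold`.

```python
from datetime import date


def expected_fold(start, end, doubling_years=2.0):
    """Expected performance multiplier if performance doubles every `doubling_years`."""
    years = (end - start).days / 365.25
    return 2.0 ** (years / doubling_years)


# GTX 1080 Ti (Mar 2017) to RTX 5090 (Jan 2025); the day of month is assumed.
fold = expected_fold(date(2017, 3, 1), date(2025, 1, 1))
print(f"expected: ~{fold:.0f}x, measured (Swin, FP16 training): 14.4x")
```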

References:

[1]: Four generations of Nvidia GPUs compared

Computer Vision Lab

Nikolay Falaleev