Benchmarking Nvidia RTX 5090
February 17, 2025

Nvidia RTX 5090. Official image from Nvidia.
All tests were performed on a system with an AMD Ryzen 9 9950X using Nvidia’s proprietary driver 570.86.16 and CUDA 12.8 in a Docker environment. Note that the driver is marked as ‘beta’, so GPU performance may differ in future releases. Hardware settings were left at their defaults for all test cases, with no overclocking.
GPUs
GPU | Number of CUDA Cores | Base Clock (MHz) | Number of Tensor Cores | VRAM (GB) | VRAM Bandwidth (GB/s) | Memory Bus Width (bits) | TDP (W) | Lithography (nm) | Release Date |
---|---|---|---|---|---|---|---|---|---|
GTX 1080 Ti | 3584 | 1480 | - | 11 | 484 | 352 | 250 | 16 | Mar 2017 |
RTX 2080 Ti | 4352 | 1350 | 544 | 11 | 616 | 352 | 250 | 12 | Sep 2018 |
RTX 3090 | 10496 | 1395 | 328 | 24 | 936 | 384 | 350 | 8 | Sep 2020 |
RTX 4090 | 16384 | 2230 | 512 | 24 | 1008 | 384 | 450 | 5 | Sep 2022 |
RTX 5090 | 21760 | 2017 | 680 | 32 | 1792 | 512 | 575 | 4 | Jan 2025 |
Note that Tensor Cores were revised with each new architecture, adding support for additional precisions and operations and optimizing existing ones. The Tensor Core count should therefore not be treated as a direct proxy for performance.
Computer Vision models
The tests were performed using benchmarks from timm, version 1.0.14, a collection of computer vision models. The selection of models is partially conditioned by previous benchmarks to provide some level of comparability with older results for previous generations of GPUs [1]. The benchmark was performed using nightly builds of PyTorch 2.6.0 with CUDA 12.8 support.
The results are based on a batch size of 256, which is most relevant to training scenarios and to inference in concurrent applications. If the desired batch size did not fit into VRAM, it was reduced in steps of 32 until it fit. The image size for all models was set to 224x224. The results can also be viewed as an upper-bound estimate of each GPU's throughput. At the same time, the tests are not meant to demonstrate the absolute highest performance of the hardware, as advanced optimization techniques were not applied; rather, they attempt to compare different generations of video accelerators under roughly equal settings.
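The batch-size fallback described above can be sketched as follows. This is a minimal illustration rather than the code used for these benchmarks: `run_benchmark` is a hypothetical callable, and the out-of-memory exception type is an assumption passed in by the caller.

```python
from typing import Callable, Optional, Tuple, Type


def find_fitting_batch_size(
    run_benchmark: Callable[[int], float],
    start: int = 256,
    step: int = 32,
    oom_errors: Tuple[Type[BaseException], ...] = (MemoryError,),
) -> Optional[Tuple[int, float]]:
    """Try `start`, then `start - step`, and so on until a run succeeds.

    `run_benchmark` (hypothetical) takes a batch size and returns the
    throughput in samples per second. It is assumed to raise one of
    `oom_errors` when the batch does not fit into VRAM.
    Returns (batch_size, throughput), or None if no batch size fits.
    """
    batch_size = start
    while batch_size > 0:
        try:
            return batch_size, run_benchmark(batch_size)
        except oom_errors:
            # Out of memory: shrink the batch and retry.
            batch_size -= step
    return None
```

With PyTorch, one would likely pass `oom_errors=(torch.cuda.OutOfMemoryError,)` and free cached memory between attempts; both details are assumptions about the setup, not something stated here.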
Note that these results do not include additional hardware-specific optimizations or the use of torch.compile, either of which is expected to change the results given the different Tensor Core generations and their differing feature sets.
The reported increase percentage is calculated using RTX 3090 as the baseline. All results are in samples per second.
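As a small worked example of how the percentages in the tables are derived, the snippet below recomputes the increase for the vgg16 FP32 inference column. The helper name `increase_pct` is introduced here only for illustration.

```python
def increase_pct(value: float, baseline: float) -> float:
    """Relative increase of `value` over `baseline`, in percent."""
    return (value / baseline - 1.0) * 100.0


# vgg16 FP32 inference throughput in samples per second, from the FP32 table.
rtx_3090, rtx_4090, rtx_5090 = 841.0, 1454.6, 1867.5

print(f"RTX 4090: {increase_pct(rtx_4090, rtx_3090):+.1f}%")  # +73.0%
print(f"RTX 5090: {increase_pct(rtx_5090, rtx_3090):+.1f}%")  # +122.1%
```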
FP32 Comparison
GPU | vgg16 (Inference) | vgg16 (Train) | resnet50 (Inference) | resnet50 (Train) | tf_efficientnetv2_b0 (Inference) | tf_efficientnetv2_b0 (Train) | swin_base_patch4_window7_224 (Inference) | swin_base_patch4_window7_224 (Train) | efficientvit_m4 (Inference) | efficientvit_m4 (Train) |
---|---|---|---|---|---|---|---|---|---|---|
RTX 3090 | 841.0 | 260.8 | 1679.9 | 523.0 | 4358.6 | 1145.8 | 493.9 | 158.0 | 10600.6 | 2730.0 |
RTX 4090 | 1454.6 (+73.0%) | 456.5 (+75.1%) | 2433.1 (+44.8%) | 757.5 (+44.8%) | 6477.3 (+48.6%) | 1643.8 (+43.5%) | 855.3 (+73.2%) | 293.2 (+85.6%) | 18975.9 (+79.0%) | 3866.7 (+41.6%) |
RTX 5090 | 1867.5 (+122.1%) | 594.7 (+128.1%) | 3576.8 (+112.9%) | 1128.6 (+115.8%) | 9254.5 (+112.3%) | 2448.9 (+113.7%) | 1315.8 (+166.4%) | 450.2 (+185.0%) | 23555.6 (+122.2%) | 6940.8 (+154.3%) |
FP16 Comparison
GPU | vgg16 (Inference) | vgg16 (Train) | resnet50 (Inference) | resnet50 (Train) | tf_efficientnetv2_b0 (Inference) | tf_efficientnetv2_b0 (Train) | swin_base_patch4_window7_224 (Inference) | swin_base_patch4_window7_224 (Train) | efficientvit_m4 (Inference) | efficientvit_m4 (Train) |
---|---|---|---|---|---|---|---|---|---|---|
RTX 3090 | 1387.6 | 438.2 | 2973.1 | 888.7 | 7010.4 | 1818.3 | 979.1 | 337.0 | 11087.8 | 3114.9 |
RTX 4090 | 2418.6 (+74.3%) | 837.5 (+91.1%) | 4601.8 (+54.8%) | 1360.6 (+53.1%) | 12393.6 (+76.8%) | 2823.5 (+55.3%) | 1762.2 (+80.0%) | 597.1 (+77.2%) | 17223.6 (+55.3%) | 3810.7 (+22.3%) |
RTX 5090 | 3350.1 (+141.4%) | 1161.0 (+164.9%) | 5741.6 (+93.1%) | 1623.9 (+82.7%) | 15907.3 (+126.9%) | 3446.1 (+89.5%) | 2471.9 (+152.5%) | 822.3 (+144.1%) | 31682.2 (+185.7%) | 7310.4 (+134.7%) |
On average, switching from Ampere to Blackwell gives a roughly equal boost of about 132% for both precisions (or about 44% when switching from Ada Lovelace to Blackwell). Speculatively, a notable feature is that the boost is smaller (113% and 98% for FP32 and FP16, RTX 5090 vs RTX 3090) for the convolution-dominant models (ResNet and EfficientNet in this test), which may indicate that the newer GPU’s architecture is better optimized for models dominated by matrix multiplication. Among the matrix-multiplication-dominant models (VGG and Swin Transformer), we can see a more significant boost for FP16, which is not surprising given that modern training pipelines are often optimized for half precision. Finally, although the tests do not provide evidence for this hypothesis, EfficientViT is a very fast model and may therefore be more sensitive to VRAM bandwidth, which could explain its outlier results.
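The averages quoted above can be reproduced from the table percentages. The sketch below assumes a plain arithmetic mean over the inference and training columns, which is one plausible reading of how the figures were aggregated; it should land close to 132% overall and to roughly 113% and 98% for the convolution-dominant subset.

```python
from statistics import mean

# RTX 5090 increase over RTX 3090 in percent, as (inference, train) per model,
# copied from the FP32 and FP16 tables.
GAINS = {
    "fp32": {
        "vgg16": (122.1, 128.1),
        "resnet50": (112.9, 115.8),
        "tf_efficientnetv2_b0": (112.3, 113.7),
        "swin_base_patch4_window7_224": (166.4, 185.0),
        "efficientvit_m4": (122.2, 154.3),
    },
    "fp16": {
        "vgg16": (141.4, 164.9),
        "resnet50": (93.1, 82.7),
        "tf_efficientnetv2_b0": (126.9, 89.5),
        "swin_base_patch4_window7_224": (152.5, 144.1),
        "efficientvit_m4": (185.7, 134.7),
    },
}
CONV_DOMINANT = ("resnet50", "tf_efficientnetv2_b0")


def mean_gain(precision: str, models=None) -> float:
    """Mean percentage gain over the selected models (all models by default)."""
    table = GAINS[precision]
    names = list(models) if models is not None else list(table)
    return mean(value for name in names for value in table[name])


overall = mean(mean_gain(precision) for precision in GAINS)
conv_fp32 = mean_gain("fp32", CONV_DOMINANT)
conv_fp16 = mean_gain("fp16", CONV_DOMINANT)

print(f"overall: {overall:.1f}%")
print(f"conv-dominant FP32: {conv_fp32:.1f}%, FP16: {conv_fp16:.1f}%")
```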
LLMs
All tests were performed using Ollama 0.5.11 with 8k context length and using Q4_K_M quantisation, which is the default recommended quantisation level for Ollama.
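For readers who want to take a comparable measurement, here is a hedged sketch based on Ollama's REST API. It assumes a local server on the default port and relies on the `eval_count` and `eval_duration` fields of the `/api/generate` response; the exact procedure used for the numbers in this post is not specified, so treat this as one possible approach.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generation speed from `eval_count` and `eval_duration` (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(
    model: str,
    prompt: str,
    num_ctx: int = 8192,
    url: str = "http://localhost:11434/api/generate",  # assumed default address
) -> dict:
    """Send one non-streaming generation request and return the JSON reply."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # 8k context, as in the tests above
    }
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

With a running server, `tokens_per_second(generate("llama3.1:8b", "Explain Moore's law."))` would return the generation speed for a single prompt. Averaging over several prompts and discarding the first run, which includes model loading, would likely give more stable numbers.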
All results are reported in tokens per second. The increase percentage is calculated using RTX 3090 as the baseline.
Model | RTX 3090 | RTX 4090 (Increase %) | RTX 5090 (Increase %) |
---|---|---|---|
deepseek-r1:32b | 30.85 | 37.44 (+21.36%) | 60.66 (+96.63%) |
qwen2.5:32b | 32.12 | 38.15 (+18.78%) | 62.81 (+95.54%) |
qwen2.5:7b | 100.32 | 119.56 (+19.18%) | 213.48 (+112.80%) |
mistral-small:24b | 45.78 | 54.04 (+17.99%) | 91.29 (+99.37%) |
phi4:14b | 64.40 | 77.84 (+20.87%) | 130.31 (+102.35%) |
phi3.5:3.8b | 170.24 | 217.32 (+27.69%) | 346.65 (+103.62%) |
llama3.1:8b | 100.53 | 121.74 (+21.10%) | 210.79 (+109.68%) |
llama3.2:3b | 152.83 | 182.11 (+19.24%) | 339.51 (+122.33%) |
qwen2.5:1.5b | 170.29 | 214.98 (+26.26%) | 402.32 (+136.26%) |
Interestingly enough, the average performance improvement of the RTX 4090 over the RTX 3090 is smaller than that observed for the Computer Vision models, which may be related to a more significant influence of memory bandwidth on language models, or to other features of the test setup or the models themselves.
On average, the RTX 4090 outperforms the RTX 3090 by about 21.4%, while the latest-generation GPU (RTX 5090) is faster than the RTX 4090 by about 72%, which is a significant improvement between generations and may justify an upgrade. The observed difference may be attributed to the fact that language models are more demanding on memory bandwidth, and the latest generation’s VRAM offers a substantial (~1.8x) bandwidth improvement over the previous generation.
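These averages can be recomputed from the raw tokens-per-second figures in the table. The sketch below assumes the mean of per-model ratios, which matches the quoted values closely; the dictionary simply restates the table.

```python
from statistics import mean

# Tokens per second from the table: (RTX 3090, RTX 4090, RTX 5090).
RESULTS = {
    "deepseek-r1:32b": (30.85, 37.44, 60.66),
    "qwen2.5:32b": (32.12, 38.15, 62.81),
    "qwen2.5:7b": (100.32, 119.56, 213.48),
    "mistral-small:24b": (45.78, 54.04, 91.29),
    "phi4:14b": (64.40, 77.84, 130.31),
    "phi3.5:3.8b": (170.24, 217.32, 346.65),
    "llama3.1:8b": (100.53, 121.74, 210.79),
    "llama3.2:3b": (152.83, 182.11, 339.51),
    "qwen2.5:1.5b": (170.29, 214.98, 402.32),
}


def mean_gain_pct(new_index: int, base_index: int) -> float:
    """Mean per-model speedup of one GPU column over another, in percent."""
    ratios = [row[new_index] / row[base_index] for row in RESULTS.values()]
    return (mean(ratios) - 1.0) * 100.0


gain_4090_vs_3090 = mean_gain_pct(1, 0)
gain_5090_vs_4090 = mean_gain_pct(2, 1)
gain_5090_vs_3090 = mean_gain_pct(2, 0)

print(f"RTX 4090 vs RTX 3090: {gain_4090_vs_3090:+.1f}%")
print(f"RTX 5090 vs RTX 4090: {gain_5090_vs_4090:+.1f}%")
print(f"RTX 5090 vs RTX 3090: {gain_5090_vs_3090:+.1f}%")
```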
Conclusion
To sum up, the generational gap between the RTX 4090 and the RTX 5090 is about 44% in Computer Vision tasks and about 72% in Natural Language Processing tasks, achieved at the cost of a ~28% increase in rated power (TDP). In addition, transitioning to Blackwell offers faster and larger VRAM, which may provide further benefits for many applications. At the same time, upgrading from the RTX 3090 generally more than doubles performance across all task types (a ~132% boost in Computer Vision and ~108% on average in Ollama LLM inference). Of course, whether this upgrade is worthwhile depends on individual or organisational needs, desired features (considering the VRAM upgrade), and budget constraints.
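To put the power figure in context, the following sketch derives the ~28% increase from the TDP values in the GPU table and divides the quoted speedups by it. Note that this uses rated TDP as a stand-in for actual power draw, which was not measured here, so the per-watt ratios are only a rough indication.

```python
# TDP in watts, from the GPU table.
TDP_W = {"RTX 4090": 450, "RTX 5090": 575}

# RTX 5090 vs RTX 4090 speedups quoted in the conclusion.
SPEEDUP = {"Computer Vision": 1.44, "LLM inference": 1.72}

power_ratio = TDP_W["RTX 5090"] / TDP_W["RTX 4090"]
power_increase_pct = (power_ratio - 1.0) * 100.0
print(f"TDP increase: {power_increase_pct:.0f}%")

# Throughput per TDP watt relative to the RTX 4090 (assumption: power draw
# scales with TDP, which real workloads may not follow).
per_watt = {task: speedup / power_ratio for task, speedup in SPEEDUP.items()}
for task, ratio in per_watt.items():
    print(f"{task}: {ratio:.2f}x throughput per TDP watt")
```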
The main question we still have to answer: is Moore’s law dead or not? We can consider a simplified formulation: a doubling of computational performance every two years. If we compare the performance of the most recent GPU with that of the Nvidia GTX 1080 Ti, the oldest one tested in this blog post [1], we can see an FP16 training improvement of about 14.4x (for the Swin model). Given the time between the releases of the GTX 1080 Ti and the RTX 5090 (a little under eight years), we should expect a roughly 15-fold increase in compute. This suggests that progress in semiconductors is still close to keeping pace with Moore’s law, with the caveat that this may not hold for FP32 compute or convolution-based models.
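The expected factor follows directly from the simplified formulation: performance doubles every two years, so the expected gain is 2 raised to (elapsed years / 2). The sketch below uses the release months from the GPU table; the day of the month is an assumption made only to build the dates.

```python
from datetime import date
from math import log2


def elapsed_years(start: date, end: date) -> float:
    """Elapsed time in years, using an average year length of 365.25 days."""
    return (end - start).days / 365.25


def moore_factor(years: float, doubling_years: float = 2.0) -> float:
    """Expected performance multiple if performance doubles every `doubling_years`."""
    return 2.0 ** (years / doubling_years)


# GTX 1080 Ti: Mar 2017, RTX 5090: Jan 2025 (first of the month assumed).
years = elapsed_years(date(2017, 3, 1), date(2025, 1, 1))
expected = moore_factor(years)

# Observed FP16 training speedup for the Swin model, as quoted above.
observed = 14.4
implied_doubling_years = years / log2(observed)

print(f"elapsed: {years:.2f} years, expected gain: {expected:.1f}x")
print(f"observed {observed}x implies doubling every {implied_doubling_years:.2f} years")
```

Under these assumptions, the expected gain comes out at roughly 15x, and the observed 14.4x corresponds to a doubling period only slightly longer than two years.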
Given the convenience of its two-slot design, the Nvidia RTX 5090 Founders Edition makes an excellent choice for workstations with two GPUs. With its notable TDP, when paired with a decent CPU, such a setup is not only a desirable tool for many Deep Learning developers but can also double as an efficient house heater during those chilly winter months.
References: