Benchmarking RTX 2080 Ti vs Pascal GPUs vs Tesla V100 with DL tasks

November 06, 2018


Testing

This post presents benchmark results for Turing and Pascal GPUs obtained with a popular Deep Learning Benchmark. PyTorch-based tests in both floating-point precisions (FP32 and FP16) were chosen for the comparison.
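The benchmark's own code is not reproduced here, but the measurement boils down to timing forward (eval) and forward-plus-backward (train) passes over the torchvision models. Below is a minimal sketch of such a timing loop; the batch size, input resolution, warm-up and step counts are assumptions, not the benchmark's exact settings:

```python
import time
import torch
import torchvision.models as models

def benchmark(model, train=False, half=False, batch=16, steps=20):
    """Average time in ms of one forward (eval) or forward+backward (train) pass."""
    model = model.cuda()
    x = torch.randn(batch, 3, 224, 224, device='cuda')
    if half:
        model, x = model.half(), x.half()
    model.train(train)

    def step():
        if train:
            model.zero_grad()
            model(x).sum().backward()
        else:
            with torch.no_grad():
                model(x)

    for _ in range(5):           # warm-up: exclude CUDA init and cuDNN autotuning
        step()
    torch.cuda.synchronize()     # wait for queued kernels before starting the clock
    t0 = time.time()
    for _ in range(steps):
        step()
    torch.cuda.synchronize()
    return (time.time() - t0) / steps * 1000.0

for name, ctor in [('VGG-16', models.vgg16),
                   ('ResNet-152', models.resnet152),
                   ('DenseNet-161', models.densenet161)]:
    net = ctor()
    print('%s  eval: %.1f ms  train: %.1f ms'
          % (name, benchmark(net, train=False), benchmark(net, train=True)))
```

The explicit torch.cuda.synchronize() calls matter: CUDA kernels launch asynchronously, so timing without synchronization would only measure how fast the CPU can enqueue work.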

For reproducibility, all tests were performed within the docker container nvcr.io/nvidia/pytorch:18.10-py3 (an nvidia-docker image with PyTorch 1.0a0, CUDA 10, and cuDNN 7400), which can be obtained from the Nvidia NGC Registry. The proprietary Nvidia driver version was 410.73. All other test parameters were kept the same as in the original tests, which allows a direct comparison between the results.

The testing machine was powered by an AMD Ryzen 7 1700X CPU with 64 GB of RAM. All hardware ran at stock frequencies, without overclocking. The only exception is the Tesla V100 setup: an AWS p3.2xlarge cloud instance was used for that test.

Results

The tables contain the time of a forward (eval) pass or a forward and backward (train) pass for different models. Relative results, with the GTX 1080 Ti as the reference, are given in brackets; for example, the Tesla V100's 21.4 ms VGG-16 eval time is (21.4 - 39.6) / 39.6 ≈ -46.0% relative to the 1080 Ti's 39.6 ms.

FP32 results
| GPU | VGG-16 eval | VGG-16 train | ResNet-152 eval | ResNet-152 train | DenseNet-161 eval | DenseNet-161 train | Average |
|---|---|---|---|---|---|---|---|
| Tesla V100 | 21.4 ms (-46.0%) | 74.4 ms (-41.1%) | 36.9 ms (-37.5%) | 151.6 ms (-24.0%) | 37.6 ms (-41.3%) | 156.7 ms (-25.5%) | -35.9% |
| RTX 2080 Ti | 28.4 ms (-28.3%) | 97.5 ms (-22.9%) | 42.7 ms (-27.6%) | 151.2 ms (-24.2%) | 46.6 ms (-27.2%) | 155.9 ms (-25.9%) | -26.0% |
| GTX 1080 Ti | 39.6 ms | 126.4 ms | 59.0 ms | 199.5 ms | 64.0 ms | 210.4 ms | 0.0% |
| GTX 1070 | 65.9 ms (+66.4%) | 205.6 ms (+62.7%) | 102.4 ms (+73.6%) | 333.9 ms (+67.4%) | 109.0 ms (+70.3%) | 348.7 ms (+65.7%) | +67.7% |
FP16 results
| GPU | VGG-16 eval | VGG-16 train | ResNet-152 eval | ResNet-152 train | DenseNet-161 eval | DenseNet-161 train | Average |
|---|---|---|---|---|---|---|---|
| Tesla V100 | 11.9 ms (-67.3%) | 42.2 ms (-63.7%) | 30.4 ms (-38.5%) | 110.5 ms (-43.7%) | 32.6 ms (-38.1%) | 121.3 ms (-37.0%) | -48.0% |
| RTX 2080 Ti | 19.3 ms (-40.0%) | 70.7 ms (-39.1%) | 25.0 ms (-49.4%) | 101.8 ms (-48.1%) | 30.7 ms (-41.7%) | 116.4 ms (-39.6%) | -44.1% |
| GTX 1080 Ti | 36.4 ms | 116.1 ms | 49.4 ms | 196.2 ms | 52.7 ms | 192.6 ms | 0.0% |
| GTX 1070 | 61.2 ms (+68.1%) | 190.9 ms (+64.4%) | 86.1 ms (+74.3%) | 309.3 ms (+57.6%) | 88.2 ms (+67.4%) | 306.2 ms (+59.0%) | +65.1% |

To sum up, the new generation of GPUs shows a less-than-30% increase in FP32 computational power over the Pascal 1080 Ti. However, one should note the up-to-50% increase in FP16 performance, achieved through hardware support for half-precision arithmetic. Such an increase can make a huge difference in practical applications, especially for inference speed-ups.
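For reference, switching a PyTorch model to half precision for inference comes down to casting the weights and the input batch to FP16. A minimal sketch follows (the model choice and batch size are arbitrary; FP16 training additionally needs techniques such as loss scaling, which are out of scope here):

```python
import torch
import torchvision.models as models

# Cast both the weights and the input batch to FP16.
model = models.resnet152().cuda().half().eval()
x = torch.randn(16, 3, 224, 224, device='cuda').half()

with torch.no_grad():
    out = model(x)  # on Volta/Turing, cuDNN can dispatch FP16 convolutions to Tensor Cores

print(out.dtype)  # torch.float16
```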

An interesting point is that the Nvidia RTX 2080 Ti's performance in this test is on par with the Nvidia Titan V's results (see here, but mind the difference in software versions). Software versions make a big difference: for instance, see an older benchmark of the Tesla V100 run in a docker container with CUDA 9.0. It is also worth mentioning that the Tesla V100 performs significantly better on the VGG-16 network, probably due to architecture-specific optimizations.

[UPDATE] 19.01.2019: Added Tesla V100 test results.