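Comparing GPU Performance with SGEMM
A performance comparison using a sample program bundled with CUDA
I bought a new GPU, so I ran a performance comparison.
The program I used is the CUDA sample "matrixMulCUBLAS" (single-precision matrix multiplication), run under nvprof with -sizemult=10.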
First, the GTX 980:
    [Matrix Multiply CUBLAS] - Starting...
    ==1468== NVPROF is profiling process 1468, command: ./matrixMulCUBLAS -sizemult=10
    GPU Device 0: "GeForce GTX 980" with compute capability 5.2
    MatrixA(640,1280), MatrixB(640,1280), MatrixC(640,1280)
    Computing result using CUBLAS...done.
    Performance= 3871.70 GFlop/s, Time= 0.271 msec, Size= 1048576000 Ops
    Computing result using host CPU...done.
    Comparing CUBLAS Matrix Multiply with CPU results: PASS
    ==1468== Profiling application: ./matrixMulCUBLAS -sizemult=10
    ==1468== Profiling result:
    Time(%)      Time  Calls       Avg       Min       Max  Name
     82.80%  8.3273ms     31  268.62us  266.21us  273.63us  maxwell_sgemm_128x64_nn
      8.88%  893.37us      1  893.37us  893.37us  893.37us  [CUDA memcpy DtoH]
      8.32%  836.47us      3  278.82us     896ns  434.30us  [CUDA memcpy HtoD]
It reaches about 3.87 TFlops.
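The Performance figure in the log follows from the Size and Time fields. Here is a minimal sketch of that arithmetic, assuming the sample counts 2·m·n·k floating-point operations per multiplication; that assumption reproduces Size= 1048576000 for the dimensions 640, 1280 and 640. The function names are illustrative and are not part of the sample.

```python
def sgemm_flops(m: int, n: int, k: int) -> int:
    """Floating-point operations for one (m x k) * (k x n) product.

    Assumes one multiply and one add per inner-product term.
    """
    return 2 * m * n * k


def gflops(ops: int, time_msec: float) -> float:
    """Throughput in GFlop/s from an operation count and a time in milliseconds."""
    return ops / (time_msec * 1e-3) / 1e9


# Values taken from the GTX 980 log.
ops = sgemm_flops(640, 1280, 640)
print(ops)                 # 1048576000, matching "Size=" in the log
print(gflops(ops, 0.271))  # close to the reported 3871.70 GFlop/s
```

The small gap from the reported 3871.70 GFlop/s is consistent with Time being printed rounded to three decimal places.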
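The profile shows that CUBLAS ships a kernel built specifically for Maxwell (maxwell_sgemm_128x64_nn), and that this kernel accounts for 82.80% of the profiled GPU time.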
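To read the same information programmatically, the profile rows can be parsed and the dominant kernel picked out. This is a rough sketch: the regular expression assumes the summary layout shown in this post (other nvprof versions or modes may print different columns), and the helper names `to_seconds` and `parse_row` are made up for illustration.

```python
import re

# Unit suffixes that appear in the summary table, converted to seconds.
_UNITS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}

_ROW = re.compile(
    r"^\s*(?P<pct>[\d.]+)%\s+(?P<time>\S+)\s+(?P<calls>\d+)\s+"
    r"(?P<avg>\S+)\s+(?P<min>\S+)\s+(?P<max>\S+)\s+(?P<name>.+?)\s*$"
)


def to_seconds(text: str) -> float:
    """Convert a value such as '268.62us' or '8.3273ms' to seconds."""
    match = re.fullmatch(r"([\d.]+)(ns|us|ms|s)", text)
    if match is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def parse_row(line: str) -> dict:
    """Parse one row of the 'Profiling result' table into a dict."""
    match = _ROW.match(line)
    if match is None:
        raise ValueError(f"not a profile row: {line!r}")
    return {
        "percent": float(match.group("pct")),
        "calls": int(match.group("calls")),
        "avg_s": to_seconds(match.group("avg")),
        "name": match.group("name"),
    }


# Rows copied from the GTX 980 profile.
rows = [
    parse_row(line)
    for line in [
        "82.80% 8.3273ms 31 268.62us 266.21us 273.63us maxwell_sgemm_128x64_nn",
        "8.88% 893.37us 1 893.37us 893.37us 893.37us [CUDA memcpy DtoH]",
        "8.32% 836.47us 3 278.82us 896ns 434.30us [CUDA memcpy HtoD]",
    ]
]
top = max(rows, key=lambda row: row["percent"])
print(top["name"])                      # maxwell_sgemm_128x64_nn
print(1048576000 / top["avg_s"] / 1e9)  # kernel-only GFlop/s from the Avg column
```

Dividing the operation count by the kernel's average time gives a kernel-only throughput slightly above the figure the sample reports, since the sample's own timing (0.271 msec) is a little longer than the profiled kernel average (268.62 us).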
Next, the GTX 970:
    [Matrix Multiply CUBLAS] - Starting...
    ==14894== NVPROF is profiling process 14894, command: ./matrixMulCUBLAS -sizemult=10
    GPU Device 0: "GeForce GTX 970" with compute capability 5.2
    MatrixA(640,1280), MatrixB(640,1280), MatrixC(640,1280)
    Computing result using CUBLAS...done.
    Performance= 2946.91 GFlop/s, Time= 0.356 msec, Size= 1048576000 Ops
    Computing result using host CPU...done.
    Comparing CUBLAS Matrix Multiply with CPU results: PASS
    ==14894== Profiling application: ./matrixMulCUBLAS -sizemult=10
    ==14894== Profiling result:
    Time(%)      Time  Calls       Avg       Min       Max  Name
     86.90%  10.953ms     31  353.32us  346.20us  360.37us  maxwell_sgemm_128x64_nn
      6.65%  838.71us      1  838.71us  838.71us  838.71us  [CUDA memcpy DtoH]
      6.44%  811.82us      3  270.61us     928ns  427.58us  [CUDA memcpy HtoD]
Like the GTX 980, it uses the Maxwell CUBLAS kernel. Performance drops by nearly 1 TFlops (about 0.92 TFlops), to 2.95 TFlops.
Finally, the GTX 780 Ti:
    [Matrix Multiply CUBLAS] - Starting...
    ==4391== NVPROF is profiling process 4391, command: ./matrixMulCUBLAS -sizemult=10
    GPU Device 0: "GeForce GTX 780 Ti" with compute capability 3.5
    MatrixA(640,1280), MatrixB(640,1280), MatrixC(640,1280)
    Computing result using CUBLAS...done.
    Performance= 3056.64 GFlop/s, Time= 0.343 msec, Size= 1048576000 Ops
    Computing result using host CPU...done.
    Comparing CUBLAS Matrix Multiply with CPU results: PASS
    ==4391== Profiling application: ./matrixMulCUBLAS -sizemult=10
    ==4391== Profiling result:
    Time(%)      Time  Calls       Avg       Min       Max  Name
     86.05%  10.568ms     31  340.91us  338.24us  345.25us  sgemm_sm35_ldg_nn_128x8x128x16x16
      7.02%  862.62us      3  287.54us  1.0560us  445.70us  [CUDA memcpy HtoD]
      6.93%  850.94us      1  850.94us  850.94us  850.94us  [CUDA memcpy DtoH]
Here CUBLAS runs an sm_35 kernel (sgemm_sm35_ldg_nn_128x8x128x16x16) instead of the Maxwell one. The result is almost the same as the GTX 970, at 3.06 TFlops.
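To put the three runs side by side, here is a small sketch that expresses each result relative to the GTX 980. The numbers are the Performance values copied from the logs; the choice of the GTX 980 as the baseline is only for illustration.

```python
# Performance figures (GFlop/s) copied from the three logs in this post.
results = {
    "GTX 980": 3871.70,
    "GTX 970": 2946.91,
    "GTX 780 Ti": 3056.64,
}

baseline = results["GTX 980"]
relative = {gpu: perf / baseline for gpu, perf in results.items()}

# One line per GPU: absolute TFlops and the share of the GTX 980 result.
for gpu, perf in results.items():
    print(f"{gpu:<11} {perf / 1000:5.2f} TFlops  {relative[gpu]:6.1%} of GTX 980")
```

With these numbers, the GTX 970 and the GTX 780 Ti land at roughly 76% and 79% of the GTX 980 on this benchmark.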