Benchmarking CUDA vs Eigen

Here are some benchmarking notes on CUDA versus the Eigen library for the two most common operations in my research. The first is the SAXPY operation, Y = a * X + Y, where X and Y are vectors and a is a scalar. The second is the dot product, sum(X * Y), where X * Y denotes element-wise multiplication of the two vectors.
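For reference, here is roughly what these two operations look like on the CPU with Eigen (a minimal sketch; the vector length and values are just for illustration):

#include <Eigen/Dense>
#include <iostream>

int main() {
   const int n = 1000;                               // illustrative length
   Eigen::VectorXf x = Eigen::VectorXf::Random(n);
   Eigen::VectorXf y = Eigen::VectorXf::Random(n);
   float a = 2.0f;

   y = a * x + y;                                    // SAXPY: Y = a * X + Y
   float d = x.dot(y);                               // dot product: sum(X * Y)

   std::cout << d << "\n";                           // use the result so nothing is optimized away
   return 0;
}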

SAXPY

| Benchmark (times in seconds) | 1,000 | 10,000 | 100,000 | 1,000,000 | 10,000,000 | 100,000,000 |
|---|---|---|---|---|---|---|
| Eigen SAXPY | 7.6e-07 | 6.717e-06 | 7.6082e-05 | 0.000609743 | 0.00827539 | 0.0677491 |
| CUDA 1 Vector SAXPY, 128 threads | 1.0112e-05 | 2.1504e-05 | 2.9568e-05 | 0.000153792 | 0.00141053 | 0.0139648 |
| CUDA 2 Vector SAXPY, 128 threads | 1.936e-05 | 1.68e-05 | 2.8672e-05 | 0.000154656 | 0.0014087 | 0.0139671 |
| CUDA 1 Vector SAXPY with 2 loads/thread, 128 threads | 2.6368e-05 | 3.1136e-05 | 3.3344e-05 | 0.000134112 | 0.00116026 | 0.0113349 |
| CUDA 1 Vector SAXPY with 4 loads/thread, 128 threads | 2.5824e-05 | 2.3104e-05 | 2.5632e-05 | 8.0096e-05 | 0.000589056 | 0.00569706 |
| CUDA 2 Vector SAXPY with 2 loads/thread, 128 threads | 2.768e-05 | 2.4832e-05 | 3.2832e-05 | 0.000130656 | 0.00111782 | 0.0108092 |
| CUDA 2 Vector SAXPY with 4 loads/thread, 128 threads | 3.3536e-05 | 2.752e-05 | 3.1456e-05 | 0.000118656 | 0.00101552 | 0.00985734 |
| CUDA 1 Vector SAXPY, 256 threads | 1.8752e-05 | 1.5072e-05 | 2.9792e-05 | 0.000159648 | 0.00144925 | 0.014379 |
| CUDA 2 Vector SAXPY, 256 threads | 1.8752e-05 | 1.664e-05 | 2.8896e-05 | 0.000161568 | 0.00145043 | 0.0143882 |
| CUDA 1 Vector SAXPY with 2 loads/thread, 256 threads | 3.1168e-05 | 2.4704e-05 | 3.4048e-05 | 0.00013888 | 0.00120445 | 0.0116761 |
| CUDA 1 Vector SAXPY with 4 loads/thread, 256 threads | 3.408e-05 | 2.368e-05 | 2.688e-05 | 7.8656e-05 | 0.000612 | 0.00583763 |
| CUDA 2 Vector SAXPY with 2 loads/thread, 256 threads | 2.7232e-05 | 2.4864e-05 | 3.3568e-05 | 0.000134112 | 0.00112749 | 0.0110074 |
| CUDA 2 Vector SAXPY with 4 loads/thread, 256 threads | 3.3248e-05 | 2.8096e-05 | 3.2064e-05 | 0.000121152 | 0.0010169 | 0.00988438 |
| CUDA 1 Vector SAXPY, 512 threads | 2.1824e-05 | 1.7504e-05 | 3.008e-05 | 0.0001688 | 0.00152813 | 0.0151885 |
| CUDA 2 Vector SAXPY, 512 threads | 2.5984e-05 | 1.6768e-05 | 3.056e-05 | 0.000167008 | 0.00152986 | 0.0151532 |
| CUDA 1 Vector SAXPY with 2 loads/thread, 512 threads | 3.0592e-05 | 2.5504e-05 | 3.5168e-05 | 0.000151168 | 0.00124963 | 0.0122253 |
| CUDA 1 Vector SAXPY with 4 loads/thread, 512 threads | 2.7552e-05 | 2.4544e-05 | 2.8224e-05 | 8.1312e-05 | 0.000642976 | 0.00612454 |
| CUDA 2 Vector SAXPY with 2 loads/thread, 512 threads | 2.7232e-05 | 2.512e-05 | 3.4016e-05 | 0.000138048 | 0.00117328 | 0.011491 |
| CUDA 2 Vector SAXPY with 4 loads/thread, 512 threads | 3.4272e-05 | 2.7744e-05 | 3.8752e-05 | 0.000124672 | 0.00103123 | 0.0099735 |
| CUDA 1 Vector SAXPY, 1024 threads | 2.0064e-05 | 1.7504e-05 | 3.3184e-05 | 0.000202144 | 0.00188838 | 0.0187485 |
| CUDA 2 Vector SAXPY, 1024 threads | 2.656e-05 | 1.6928e-05 | 3.1808e-05 | 0.000190048 | 0.00174509 | 0.017351 |
| CUDA 1 Vector SAXPY with 2 loads/thread, 1024 threads | 3.4944e-05 | 2.5696e-05 | 3.5424e-05 | 0.000150464 | 0.00132205 | 0.0129062 |
| CUDA 1 Vector SAXPY with 4 loads/thread, 1024 threads | 2.608e-05 | 2.48e-05 | 2.7072e-05 | 8.6592e-05 | 0.000673856 | 0.00646515 |
| CUDA 2 Vector SAXPY with 2 loads/thread, 1024 threads | 2.6656e-05 | 2.6048e-05 | 4.1376e-05 | 0.000148896 | 0.00125997 | 0.0123311 |
| CUDA 2 Vector SAXPY with 4 loads/thread, 1024 threads | 3.52e-05 | 2.8512e-05 | 3.3024e-05 | 0.000123584 | 0.0010383 | 0.0100846 |

There's a lot of data to parse here. The main takeaway is that CUDA catches up with Eigen at vectors of length 100,000, and from 1,000,000 elements onward it outperforms Eigen by roughly an order of magnitude.

I tried a bunch of different parameters, but the clear overall winner was 128 threads per block with 4 loads per thread. Using 128 threads per block creates more blocks, and more blocks are essential to maximizing the GPU's performance because they fully occupy all of the streaming multiprocessors (SMs). With 4 loads per thread, each thread works on 4 elements, each strided by the width of the block. Working this way lets warps be swapped out while they wait on loads from global memory, so the addition/multiplication instructions can execute as soon as the data is ready, effectively pipelining the whole operation.

Code sample for 2 loads/thread:

__global__ void instruction_level_parallelism(float *x, float *y, float scale, unsigned numThreads) {
   // Each block covers 2 * blockDim.x elements; each thread handles two,
   // spaced one block width apart, so each warp's loads stay coalesced.
   unsigned idx = threadIdx.x + blockIdx.x * blockDim.x * 2;

   // numThreads is half the number of elements (see below),
   // so 2 * numThreads is the array length.
   if (idx < 2 * numThreads)
      x[idx] += y[idx] * scale;

   if (idx + blockDim.x < 2 * numThreads)
      x[idx + blockDim.x] += y[idx + blockDim.x] * scale;
}

When launching the kernel, numThreads should be the number of elements divided by 2, since each thread handles two elements. The two loads per thread are strided by the block size, so each block covers a contiguous chunk of 2 * blockDim.x elements.
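For comparison, here is a minimal sketch of how the 4 loads/thread variant (the overall winner above) might look using the same block-strided pattern. The kernel name and the n parameter (total number of elements) are my own labels; this is a reconstruction, not the exact benchmarked code.

__global__ void saxpy_4loads(float *x, float *y, float scale, unsigned n) {
   // Each block covers 4 * blockDim.x elements; each thread handles four,
   // spaced one block width apart so every warp's loads stay coalesced.
   unsigned idx = threadIdx.x + blockIdx.x * blockDim.x * 4;

   for (unsigned i = 0; i < 4; ++i) {
      unsigned j = idx + i * blockDim.x;
      if (j < n)
         x[j] += y[j] * scale;
   }
}

// Launched with n / 4 threads in total, e.g. with 128-thread blocks:
// saxpy_4loads<<<(n + 4 * 128 - 1) / (4 * 128), 128>>>(d_x, d_y, scale, n);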

And finally, the difference between the 1 vector and 2 vector layouts. Normally you would store the two vectors separately:

X = X1, X2, X3, …, XN
Y = Y1, Y2, Y3, …, YN

and SAXPY them together. The one-vector approach instead interleaves them in a single array:

XY = X1, Y1, X2, Y2, X3, Y3, …, XN, YN

CUDA cards coalesce memory accesses when consecutive threads access contiguous memory locations. With this layout, thread 1 accesses location 1, thread 2 accesses location 2, and so on, which maximizes load efficiency.
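One way the interleaved kernel might look (my sketch, not the benchmarked code) is to treat each adjacent X/Y pair as a float2, so every thread issues a single coalesced 8-byte load per pair:

__global__ void saxpy_interleaved(float2 *xy, float scale, unsigned numPairs) {
   unsigned idx = threadIdx.x + blockIdx.x * blockDim.x;
   if (idx < numPairs) {
      // xy[idx].x holds Xi and xy[idx].y holds Yi; consecutive threads read
      // consecutive float2 elements, so the warp's accesses coalesce.
      float2 p = xy[idx];
      p.x += p.y * scale;
      xy[idx] = p;   // write the whole pair back so the store is coalesced too
   }
}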

Dot Products

| Benchmark (times in seconds) | 1,000 | 10,000 | 100,000 | 1,000,000 | 10,000,000 | 100,000,000 |
|---|---|---|---|---|---|---|
| Eigen Dot Product | 5.26e-07 | 3.352e-06 | 4.4561e-05 | 0.000423223 | 0.00529679 | 0.0534771 |
| CUDA 1 Vector Dot Product, 128 threads | 0.000112256 | 0.00012656 | 0.00016688 | 0.00040752 | 0.00259299 | 0.0240384 |
| CUDA 1 Vector Dot Product, 256 threads | 9.7664e-05 | 0.000126432 | 0.000167488 | 0.000410912 | 0.00264259 | 0.0245879 |
| CUDA 1 Vector Dot Product, 512 threads | 0.000102656 | 0.000132 | 0.00017136 | 0.000426784 | 0.00275974 | 0.025626 |
| CUDA 1 Vector Dot Product, 1024 threads | 0.000105312 | 0.000138944 | 0.000183456 | 0.000469216 | 0.00309581 | 0.0290027 |
| CUDA 2 Vector Dot Product, 128 threads | 9.9136e-05 | 0.000126688 | 0.000165888 | 0.000403584 | 0.00256003 | 0.0238534 |
| CUDA 2 Vector Dot Product, 256 threads | 9.9488e-05 | 0.000127424 | 0.000167488 | 0.000409216 | 0.00261757 | 0.0243956 |
| CUDA 2 Vector Dot Product, 512 threads | 9.616e-05 | 0.000126496 | 0.000171104 | 0.000424 | 0.00272925 | 0.0255431 |
| CUDA 2 Vector Dot Product, 1024 threads | 0.00010288 | 0.000136544 | 0.000180192 | 0.00046144 | 0.00305926 | 0.0286854 |
| CUDA Templated 2 Vector Dot Product, 128 threads | 5.1968e-05 | 4.496e-05 | 6.9248e-05 | 0.000211712 | 0.00161814 | 0.0155401 |
| CUDA Templated 2 Vector Dot Product, 256 threads | 5.1552e-05 | 4.5568e-05 | 5.8912e-05 | 0.000223712 | 0.00169581 | 0.0164204 |
| CUDA Templated 2 Vector Dot Product, 512 threads | 3.8016e-05 | 4.8544e-05 | 6.144e-05 | 0.000237696 | 0.00183901 | 0.0178052 |
| CUDA Templated 2 Vector Dot Product, 1024 threads | 3.4848e-05 | 4.784e-05 | 6.5568e-05 | 0.00025344 | 0.00216915 | 0.0210177 |
| CUDA Templated 1 Vector Dot Product, 128 threads | 5.152e-05 | 4.4416e-05 | 6.8736e-05 | 0.000212512 | 0.00164326 | 0.0158 |
| CUDA Templated 1 Vector Dot Product, 256 threads | 5.3856e-05 | 4.56e-05 | 5.9168e-05 | 0.000223424 | 0.00169162 | 0.0163945 |
| CUDA Templated 1 Vector Dot Product, 512 threads | 3.4432e-05 | 4.528e-05 | 5.8304e-05 | 0.000234912 | 0.00183699 | 0.0177262 |
| CUDA Templated 1 Vector Dot Product, 1024 threads | 3.7472e-05 | 5.2224e-05 | 6.8064e-05 | 0.000262496 | 0.00212323 | 0.0205787 |
| CUDA Product Kernel then Templated Reduce, 128 threads | 6.2656e-05 | 8.2208e-05 | 0.000114752 | 0.000308896 | 0.0019601 | 0.0181938 |
| CUDA Product Kernel then Templated Reduce, 256 threads | 5.5872e-05 | 7.4528e-05 | 0.000110528 | 0.00031584 | 0.00202419 | 0.0188202 |
| CUDA Product Kernel then Templated Reduce, 512 threads | 5.4016e-05 | 6.8576e-05 | 0.000105824 | 0.00031616 | 0.00211914 | 0.0199164 |
| CUDA Product Kernel then Templated Reduce, 1024 threads | 5.5104e-05 | 6.3296e-05 | 0.000103744 | 0.000339424 | 0.00235552 | 0.0221885 |

The dot product results are even more interesting. CUDA doesn't reach a comparable order of magnitude until the 100,000 length mark, and even then Eigen is slightly faster, by about 1e-5 to 2e-5 seconds per dot product. That isn't much, but in GenSel, with 50,000 markers and a Markov chain length of 40,000, that's 2,000,000 dot products, which adds up to whole seconds. Beyond 1,000,000 elements CUDA is roughly 2 to 3.5x faster; not an order of magnitude, but still a win that matters.

Here the 2 Vector dot product appears to be slightly faster. The "Templated Dot Product" is Mark Harris' Optimizing Parallel Reduction in CUDA applied to two vectors. The other dot product is a simple loop that accumulates one sum per thread, halving the number of active elements on each reduction iteration. The "Product Kernel then Templated Reduce" variant runs an element-wise multiplication kernel followed by the templated reduce; I find it odd that this is slower, but so be it. Also worth noting: Eigen evaluates expressions lazily, so I had to print the dot product values at the end (and strip that last field when analyzing the data) to force evaluation and get honest timings.
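To give a feel for the reduction-style kernels, here is a simplified shared-memory dot product sketch in the spirit of Harris' reduction (not the exact templated kernel benchmarked above). It produces one partial sum per block, which still needs a final sum on the host or in a second kernel:

__global__ void dot_partial(const float *x, const float *y, float *blockSums, unsigned n) {
   extern __shared__ float cache[];
   unsigned tid = threadIdx.x;
   unsigned idx = threadIdx.x + blockIdx.x * blockDim.x;

   // Grid-stride loop: each thread accumulates its own partial product sum.
   float sum = 0.0f;
   for (unsigned i = idx; i < n; i += blockDim.x * gridDim.x)
      sum += x[i] * y[i];
   cache[tid] = sum;
   __syncthreads();

   // Tree reduction in shared memory, halving the active threads each step.
   for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
      if (tid < s)
         cache[tid] += cache[tid + s];
      __syncthreads();
   }

   if (tid == 0)
      blockSums[blockIdx.x] = cache[0];   // this block's partial sum
}

// Launch with shared memory sized to the block, e.g.:
// dot_partial<<<numBlocks, 128, 128 * sizeof(float)>>>(d_x, d_y, d_blockSums, n);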

Conclusion

My main conclusion is that once you get to vectors at or above lengths of 100,000, CUDA is an excellent choice for matrix math. Below that, you can stick to libraries like Eigen, which optimize the code extremely well. There are lots of different optimizations you can make, but the main things are maximizing memory bandwidth and distributing enough work across all of the multiprocessors.