Benchmarking CUDA vs Eigen
Here are some benchmarking notes on CUDA vs the Eigen library for the two most common operations in my research. The first is the SAXPY operation, Y = a * X + Y, where X and Y are vectors and a is a scalar. The second is the dot product, sum(X * Y), where X * Y is the element-wise product of the two vectors.
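For reference, the Eigen baselines are just the natural expression-template versions of these two operations. Here is a minimal sketch (the function and variable names are illustrative, not the exact benchmark code):

```cpp
#include <Eigen/Dense>

// SAXPY and dot product as they typically look in Eigen (single precision).
// Sketch only: eigen_baselines, x, y, and a are illustrative names.
void eigen_baselines(Eigen::VectorXf &x, Eigen::VectorXf &y, float a) {
    y = a * x + y;        // SAXPY: Eigen fuses this into a single pass over the data
    float d = x.dot(y);   // dot product: sum of element-wise products
    (void)d;              // use the result so the lazy expression is actually evaluated
}
```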
SAXPY
Benchmark | 1,000 | 10,000 | 100,000 | 1,000,000 | 10,000,000 | 100,000,000 |
---|---|---|---|---|---|---|
Eigen SAXPY | 7.6e-07 | 6.717e-06 | 7.6082e-05 | 0.000609743 | 0.00827539 | 0.0677491 |
CUDA 1 Vector SAXPY 128 threads | 1.0112e-05 | 2.1504e-05 | 2.9568e-05 | 0.000153792 | 0.00141053 | 0.0139648 |
CUDA 2 Vector SAXPY 128 threads | 1.936e-05 | 1.68e-05 | 2.8672e-05 | 0.000154656 | 0.0014087 | 0.0139671 |
CUDA 1 Vector SAXPY with 2 loads/thread 128 threads | 2.6368e-05 | 3.1136e-05 | 3.3344e-05 | 0.000134112 | 0.00116026 | 0.0113349 |
CUDA 1 Vector SAXPY with 4 loads/thread 128 threads | 2.5824e-05 | 2.3104e-05 | 2.5632e-05 | 8.0096e-05 | 0.000589056 | 0.00569706 |
CUDA 2 Vector SAXPY with 2 loads/thread 128 threads | 2.768e-05 | 2.4832e-05 | 3.2832e-05 | 0.000130656 | 0.00111782 | 0.0108092 |
CUDA 2 Vector SAXPY with 4 loads/thread 128 threads | 3.3536e-05 | 2.752e-05 | 3.1456e-05 | 0.000118656 | 0.00101552 | 0.00985734 |
CUDA 1 Vector SAXPY 256 threads | 1.8752e-05 | 1.5072e-05 | 2.9792e-05 | 0.000159648 | 0.00144925 | 0.014379 |
CUDA 2 Vector SAXPY 256 threads | 1.8752e-05 | 1.664e-05 | 2.8896e-05 | 0.000161568 | 0.00145043 | 0.0143882 |
CUDA 1 Vector SAXPY with 2 loads/thread 256 threads | 3.1168e-05 | 2.4704e-05 | 3.4048e-05 | 0.00013888 | 0.00120445 | 0.0116761 |
CUDA 1 Vector SAXPY with 4 loads/thread 256 threads | 3.408e-05 | 2.368e-05 | 2.688e-05 | 7.8656e-05 | 0.000612 | 0.00583763 |
CUDA 2 Vector SAXPY with 2 loads/thread 256 threads | 2.7232e-05 | 2.4864e-05 | 3.3568e-05 | 0.000134112 | 0.00112749 | 0.0110074 |
CUDA 2 Vector SAXPY with 4 loads/thread 256 threads | 3.3248e-05 | 2.8096e-05 | 3.2064e-05 | 0.000121152 | 0.0010169 | 0.00988438 |
CUDA 1 Vector SAXPY 512 threads | 2.1824e-05 | 1.7504e-05 | 3.008e-05 | 0.0001688 | 0.00152813 | 0.0151885 |
CUDA 2 Vector SAXPY 512 threads | 2.5984e-05 | 1.6768e-05 | 3.056e-05 | 0.000167008 | 0.00152986 | 0.0151532 |
CUDA 1 Vector SAXPY with 2 loads/thread 512 threads | 3.0592e-05 | 2.5504e-05 | 3.5168e-05 | 0.000151168 | 0.00124963 | 0.0122253 |
CUDA 1 Vector SAXPY with 4 loads/thread 512 threads | 2.7552e-05 | 2.4544e-05 | 2.8224e-05 | 8.1312e-05 | 0.000642976 | 0.00612454 |
CUDA 2 Vector SAXPY with 2 loads/thread 512 threads | 2.7232e-05 | 2.512e-05 | 3.4016e-05 | 0.000138048 | 0.00117328 | 0.011491 |
CUDA 2 Vector SAXPY with 4 loads/thread 512 threads | 3.4272e-05 | 2.7744e-05 | 3.8752e-05 | 0.000124672 | 0.00103123 | 0.0099735 |
CUDA 1 Vector SAXPY 1024 threads | 2.0064e-05 | 1.7504e-05 | 3.3184e-05 | 0.000202144 | 0.00188838 | 0.0187485 |
CUDA 2 Vector SAXPY 1024 threads | 2.656e-05 | 1.6928e-05 | 3.1808e-05 | 0.000190048 | 0.00174509 | 0.017351 |
CUDA 1 Vector SAXPY with 2 loads/thread 1024 threads | 3.4944e-05 | 2.5696e-05 | 3.5424e-05 | 0.000150464 | 0.00132205 | 0.0129062 |
CUDA 1 Vector SAXPY with 4 loads/thread 1024 threads | 2.608e-05 | 2.48e-05 | 2.7072e-05 | 8.6592e-05 | 0.000673856 | 0.00646515 |
CUDA 2 Vector SAXPY with 2 loads/thread 1024 threads | 2.6656e-05 | 2.6048e-05 | 4.1376e-05 | 0.000148896 | 0.00125997 | 0.0123311 |
CUDA 2 Vector SAXPY with 4 loads/thread 1024 threads | 3.52e-05 | 2.8512e-05 | 3.3024e-05 | 0.000123584 | 0.0010383 | 0.0100846 |
Lots of data to parse here. The main takeaway is that CUDA catches up to Eigen at vector lengths of around 100,000. Beyond that, CUDA outperforms Eigen by roughly an order of magnitude.
I tried a bunch of different parameters, but 128 threads per block with 4 loads per thread was the clear overall winner. Fewer threads per block means more blocks, and more blocks are essential to maximizing GPU performance because they keep all of the streaming multiprocessors (SMs) fully occupied. 4 loads per thread means each thread works on 4 elements, with a thread's elements strided apart by the width of the block. Working this way lets warps be swapped out while their loads from global memory are in flight, so the addition/multiplication instructions can issue as soon as the data is ready, pipelining everything in a way.
Code sample for 2 loads/thread:
```cpp
// numThreads is the total number of threads launched (number of elements / 2);
// each thread updates two elements, so each block covers 2 * blockDim.x elements.
__global__ void instruction_level_parallelism(float *x, float *y, float scale,
                                               unsigned numThreads) {
    unsigned idx = threadIdx.x + blockIdx.x * blockDim.x * 2;
    unsigned numElements = 2 * numThreads;
    // first element handled by this thread
    if (idx < numElements)
        x[idx] += y[idx] * scale;
    // second element, strided by the block width
    if (idx + blockDim.x < numElements)
        x[idx + blockDim.x] += y[idx + blockDim.x] * scale;
}
```
When launching the kernel, numThreads should be the number of elements divided by 2, since each thread handles two elements; the two loads are strided apart by the block width. A sketch of the launch, plus the 4 loads/thread variant, is below.
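The names here (saxpy_4_loads, launch_saxpy_2_loads, d_x, d_y) are my own illustrative choices rather than the original benchmark code; this is only a sketch of how the 4 loads/thread variant and the host-side launch might look.

```cpp
// Hypothetical 4 loads/thread variant: each block covers 4 * blockDim.x elements.
__global__ void saxpy_4_loads(float *x, float *y, float scale, unsigned numElements) {
    unsigned base = threadIdx.x + blockIdx.x * blockDim.x * 4;
    for (unsigned k = 0; k < 4; ++k) {
        unsigned idx = base + k * blockDim.x;   // this thread's elements, a block width apart
        if (idx < numElements)
            x[idx] += y[idx] * scale;
    }
}

// Host-side launch sketch: n elements, 128 threads per block, 2 elements per thread.
// d_x and d_y are assumed to be device pointers already holding the input vectors.
void launch_saxpy_2_loads(float *d_x, float *d_y, float scale, unsigned n) {
    unsigned numThreads = n / 2;                              // assumes n is even
    unsigned blockSize  = 128;
    unsigned gridSize   = (numThreads + blockSize - 1) / blockSize;
    instruction_level_parallelism<<<gridSize, blockSize>>>(d_x, d_y, scale, numThreads);
}
```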
And finally, the difference between 1 and 2 vectors. Normally, you might have the vectors stored as:
X = X1, X2, X3, …, XN
Y = Y1, Y2, Y3, …, YN
and SAXPY them together. The one vector approach instead interleaves them as:
XY = X1, Y1, X2, Y2, X3, Y3, …, XN, YN
CUDA cards can coalesce memory accesses when consecutive threads access contiguous memory locations. By laying the memory out like this, thread 1 can access memory location 1, thread 2 memory location 2, and so on, contiguously, which maximizes load efficiency.
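One way to exploit that interleaved layout is to read each X/Y pair as a single float2, so every thread issues one coalesced 8-byte load and one coalesced store. This is a sketch under my own naming (saxpy_interleaved), not necessarily the kernel behind the 1 Vector numbers above.

```cpp
// Interleaved layout: xy = {X1, Y1, X2, Y2, ..., XN, YN}, viewed as n float2 pairs.
__global__ void saxpy_interleaved(float2 *xy, float scale, unsigned n) {
    unsigned i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        float2 p = xy[i];     // one coalesced load brings in X_i and Y_i together
        p.x += scale * p.y;   // same update as the 2-vector kernel: X += scale * Y
        xy[i] = p;            // one coalesced store writes the pair back
    }
}
```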
Dot Products
Benchmark | 1,000 | 10,000 | 100,000 | 1,000,000 | 10,000,000 | 100,000,000 |
---|---|---|---|---|---|---|
Eigen Dot Product | 5.26e-07 | 3.352e-06 | 4.4561e-05 | 0.000423223 | 0.00529679 | 0.0534771 |
CUDA 1 Vector Dot Product 128 threads | 0.000112256 | 0.00012656 | 0.00016688 | 0.00040752 | 0.00259299 | 0.0240384 |
CUDA 1 Vector Dot Product 256 threads | 9.7664e-05 | 0.000126432 | 0.000167488 | 0.000410912 | 0.00264259 | 0.0245879 |
CUDA 1 Vector Dot Product 512 threads | 0.000102656 | 0.000132 | 0.00017136 | 0.000426784 | 0.00275974 | 0.025626 |
CUDA 1 Vector Dot Product 1024 threads | 0.000105312 | 0.000138944 | 0.000183456 | 0.000469216 | 0.00309581 | 0.0290027 |
CUDA 2 Vector Dot Product 128 threads | 9.9136e-05 | 0.000126688 | 0.000165888 | 0.000403584 | 0.00256003 | 0.0238534 |
CUDA 2 Vector Dot Product 256 threads | 9.9488e-05 | 0.000127424 | 0.000167488 | 0.000409216 | 0.00261757 | 0.0243956 |
CUDA 2 Vector Dot Product 512 threads | 9.616e-05 | 0.000126496 | 0.000171104 | 0.000424 | 0.00272925 | 0.0255431 |
CUDA 2 Vector Dot Product 1024 threads | 0.00010288 | 0.000136544 | 0.000180192 | 0.00046144 | 0.00305926 | 0.0286854 |
CUDA Templated 2 Vector Dot Product 128 threads | 5.1968e-05 | 4.496e-05 | 6.9248e-05 | 0.000211712 | 0.00161814 | 0.0155401 |
CUDA Templated 2 Vector Dot Product 256 threads | 5.1552e-05 | 4.5568e-05 | 5.8912e-05 | 0.000223712 | 0.00169581 | 0.0164204 |
CUDA Templated 2 Vector Dot Product 512 threads | 3.8016e-05 | 4.8544e-05 | 6.144e-05 | 0.000237696 | 0.00183901 | 0.0178052 |
CUDA Templated 2 Vector Dot Product 1024 threads | 3.4848e-05 | 4.784e-05 | 6.5568e-05 | 0.00025344 | 0.00216915 | 0.0210177 |
CUDA Templated 1 Vector Dot Product 128 threads | 5.152e-05 | 4.4416e-05 | 6.8736e-05 | 0.000212512 | 0.00164326 | 0.0158 |
CUDA Templated 1 Vector Dot Product 256 threads | 5.3856e-05 | 4.56e-05 | 5.9168e-05 | 0.000223424 | 0.00169162 | 0.0163945 |
CUDA Templated 1 Vector Dot Product 512 threads | 3.4432e-05 | 4.528e-05 | 5.8304e-05 | 0.000234912 | 0.00183699 | 0.0177262 |
CUDA Templated 1 Vector Dot Product 1024 threads | 3.7472e-05 | 5.2224e-05 | 6.8064e-05 | 0.000262496 | 0.00212323 | 0.0205787 |
CUDA Product Kernel then Templated Reduce 128 threads | 6.2656e-05 | 8.2208e-05 | 0.000114752 | 0.000308896 | 0.0019601 | 0.0181938 |
CUDA Product Kernel then Templated Reduce 256 threads | 5.5872e-05 | 7.4528e-05 | 0.000110528 | 0.00031584 | 0.00202419 | 0.0188202 |
CUDA Product Kernel then Templated Reduce 512 threads | 5.4016e-05 | 6.8576e-05 | 0.000105824 | 0.00031616 | 0.00211914 | 0.0199164 |
CUDA Product Kernel then Templated Reduce 1024 threads | 5.5104e-05 | 6.3296e-05 | 0.000103744 | 0.000339424 | 0.00235552 | 0.0221885 |
Here it’s even more interesting. CUDA doesn’t reach a similar magnitude of performance until the 100,000 length mark, and even then Eigen is slightly faster, by about 1e-5 to 2e-5 seconds. That isn’t a lot on its own, but in GenSel, with 50,000 markers and a Markov chain length of 40,000, that’s 2,000,000 dot products, which can add up to whole seconds. After 1,000,000 elements CUDA is roughly 2-3x faster: not an order of magnitude, but still a meaningful win.
Here it appears the 2 Vector dot product is actually slightly faster. The “Templated Dot Product” is just Mark Harris’ Optimizing Parallel Reduction in CUDA applied to two vectors. The other dot product is a simple loop that computes one product per thread and then halves the number of partial sums on each accumulation iteration (a sketch of that pattern is below). The Product Kernel then Templated Reduce performs an element-wise multiplication kernel followed by the templated reduce; it’s odd to me that that’s slower, but so be it. Also interesting to note: Eigen evaluates lazily, so I had to print the values of the dot products at the end (and then remove that last field when analyzing the data) just to get the timings right.
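For context, here is a minimal sketch of that simple reduction-style dot product (my reconstruction, not the exact benchmarked kernel): each thread computes one product into shared memory, the block halves the partial sums until one value remains, and the per-block results still have to be summed afterwards.

```cpp
// One partial result per block; partial[blockIdx.x] must still be summed on the host
// or by a second kernel. Launch with blockDim.x * sizeof(float) bytes of dynamic
// shared memory, and blockDim.x must be a power of two.
__global__ void dot_product_partial(const float *x, const float *y,
                                    float *partial, unsigned n) {
    extern __shared__ float cache[];
    unsigned i   = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned tid = threadIdx.x;

    cache[tid] = (i < n) ? x[i] * y[i] : 0.0f;   // one element-wise product per thread
    __syncthreads();

    // halve the number of active threads each iteration until one sum per block remains
    for (unsigned stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = cache[0];
}
```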
Conclusion
My main conclusion is that once you get to vectors at or above lengths of 100,000, CUDA is an excellent choice for matrix math. Below that, you can stick with libraries like Eigen, which optimize the code very well. There are lots of different optimizations you can make, but the main things are maximizing memory bandwidth and distributing enough work across all of the multiprocessors.