GPU Computing

Matrix Transpose in CUDA

Project Repo

These assignments are about learning the basics of GPU programming using CUDA. In all assignments we benchmark non-symmetric matrices of size $N \times N$, where $N$ is a power of 2, with different kernels and block sizes.
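Effective bandwidth is presumably computed in the standard way for a transpose (this is my assumption; the report has the exact definition): each element is read once and written once, so for elapsed time $t$,

$BW_{\text{eff}} = \dfrac{2 \, N^2 \cdot \text{sizeof(element)}}{t}$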

Click to download report

First Assignment

Develop a few algorithms to transpose the matrix on the CPU, measure the effective bandwidth under different compiler optimization levels (-O0, -O1, -O2, -O3), and analyze cache behavior and efficiency.
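As a rough sketch of the kind of CPU baselines this assignment calls for (the function names, `float` elements, and block size are my assumptions; the real code is in the repo), a naive and a cache-blocked transpose might look like:

```cpp
#include <cstddef>

// Naive transpose: contiguous reads of `in`, strided writes into `out`.
// The write pattern is cache-unfriendly once N exceeds the cache size.
void transpose_naive(const float* in, float* out, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            out[j * N + i] = in[i * N + j];
}

// Blocked transpose: work on BLOCK x BLOCK tiles so that both the
// source and destination tiles stay resident in cache.
void transpose_blocked(const float* in, float* out, std::size_t N,
                       std::size_t BLOCK = 64) {
    for (std::size_t ii = 0; ii < N; ii += BLOCK)
        for (std::size_t jj = 0; jj < N; jj += BLOCK)
            for (std::size_t i = ii; i < ii + BLOCK && i < N; ++i)
                for (std::size_t j = jj; j < jj + BLOCK && j < N; ++j)
                    out[j * N + i] = in[i * N + j];
}
```

Since $N$ is a power of 2, the `< N` guards in the inner loops are redundant for block sizes that divide $N$, but they keep the sketch safe for any input.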

Second Assignment

Develop a few algorithms to transpose the matrix on the GPU and measure effective bandwidth and efficiency. The developed algorithms are listed below, with a sketch of the kernels after the list:

  • simple transposition
  • with shared memory
  • with shared memory and coalesced memory access
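As a rough sketch (kernel names, `TILE = 32`, `float` elements, and a `TILE` × `TILE` thread block are my assumptions; the actual kernels are in the repo), the simple and the shared-memory coalesced variants typically look like this:

```cuda
#define TILE 32

// Simple transposition: reads are coalesced, writes are strided.
__global__ void transposeNaive(const float* in, float* out, int N) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < N && y < N)
        out[x * N + y] = in[y * N + x];
}

// Shared memory + coalesced access: the tile is transposed while it
// sits in shared memory, so both the global read and the global write
// are coalesced. The +1 padding avoids shared-memory bank conflicts.
__global__ void transposeCoalesced(const float* in, float* out, int N) {
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < N && y < N)
        tile[threadIdx.y][threadIdx.x] = in[y * N + x];

    __syncthreads();

    // Swap the block indices so the writes are also row-contiguous.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < N && y < N)
        out[y * N + x] = tile[threadIdx.x][threadIdx.y];
}
```

The intermediate shared-memory-only variant is the same as `transposeCoalesced` but without the index swap before the write, so its writes remain strided.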

Results were also compared against a plain copy of the matrix on the GPU, the simplest kernel we could use, which serves as an upper bound on achievable bandwidth.
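The copy baseline is presumably something like the following sketch: it moves the same amount of data as a transpose (one read plus one write per element) but with ideal access patterns on both sides.

```cuda
// Copy baseline: same memory traffic as a transpose, but both the
// reads and the writes are fully coalesced.
__global__ void copy(const float* in, float* out, int N) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < N && y < N)
        out[y * N + x] = in[y * N + x];
}
```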

Third Assignment

Develop some algorithms to transpose the matrix on the GPU using NVIDIA’s Cooperative Groups, and compare the results against cuBLAS. This wasn’t straightforward: Cooperative Groups are not a good fit for dense matrix transposition, so I had to work against much of what was studied during the course. Three algorithms were developed (ignoring the copy baseline), listed below with a sketch of the grid-wide variant after the list:

  • Intra-block synchronization (good old __syncthreads())
  • Intra-block synchronization, coalesced
  • Inter-block synchronization (grid-wide sync, expected and demonstrated to be slower)
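For the inter-block variant, a minimal sketch of grid-wide synchronization with Cooperative Groups follows (the two-phase in-place scheme and the scratch buffer are my assumptions; the real kernels are in the repo). Note that `grid.sync()` is only valid when the kernel is launched with `cudaLaunchCooperativeKernel`:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// In-place transpose in two phases separated by a grid-wide sync:
// phase 1 stages the matrix into a scratch buffer, phase 2 writes it
// back transposed. No block may start phase 2 before every block has
// finished phase 1, hence the grid-level barrier.
__global__ void transposeGridSync(float* data, float* scratch, int N) {
    cg::grid_group grid = cg::this_grid();

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < N && y < N)
        scratch[y * N + x] = data[y * N + x];

    grid.sync();  // every block must reach this point before any proceeds

    if (x < N && y < N)
        data[y * N + x] = scratch[x * N + y];
}
```

A cooperative launch also requires the whole grid to be co-resident on the device, which caps the grid size; that constraint is part of why this variant is expected to be slower.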