A Tool for Performance Analysis of GPU-Accelerated Applications


Architectures for High-Performance Computing (HPC) now commonly employ accelerators such as Graphics Processing Units (GPUs). High-level programming abstractions for accelerated computing include OpenMP as well as RAJA and Kokkos, both of which are based on C++ templates. Such programming models hide GPU architectural details and generate sophisticated GPU code organized as many small procedures. For example, a dot product kernel expressed using RAJA atop NVIDIA’s Thrust templates yields 35 procedures. Existing performance tools are ill-suited for analyzing such complex kernels because they lack a comprehensive profile view. At best, tools such as NVIDIA’s nvvp provide a profile view that shows only limited CPU calling contexts and omits both calling contexts and loops in GPU code. To address this problem, we extended Rice University’s HPCToolkit to build a complete profile view for GPU-accelerated applications.

IEEE/ACM International Symposium on Code Generation and Optimization (CGO)