Low Overhead and Context Sensitive Profiling of GPU-Accelerated Applications


As we near the end of Moore’s law scaling, the next-generation computing platforms are increasingly exploring heterogeneous processors for acceleration. Graphics Processing Units (GPUs) are the most widely used accelerators. Meanwhile, applications are evolving by adopting new programming models and algorithms for emerging platforms. To harness the full power of GPUs, performance tools serve a critical role in understanding and tuning application performance, especially for those that involve complex executions spanning both CPU and GPU. To help developers analyze and tune applications, performance tools need to associate performance metrics with calling contexts. However, existing performance tools incur high overhead collecting and attributing performance metrics to full calling contexts. To address the problem, we developed a tool that constructs both CPU and GPU calling contexts with low overhead and high accuracy. With an innovative call path memoization mechanism, our tool can obtain call paths for GPU operations with negligible cost. For GPU calling contexts, our tool uses an adaptive epoch profiling method to collect GPU instruction samples to reduce the synchronization cost and reconstruct the calling contexts using postmortem analysis. We have evaluated our tool on nine HPC and machine learning applications on a machine equipped with an NVIDIA GPU. Compared with the state-of-the-art GPU profilers, our tool reduces the overhead for coarse-grained profiling of GPU operations from 2.07X to 1.42X and the overhead for fine-grained profiling of GPU instructions from 27.51X to 4.61X with an accuracy of 99.93% and 96.16% in each mode.

Proceedings of the 36th ACM International Conference on Supercomputing (ICS’22)