We implemented GVProf, the first value profiler that locates value redundancy problems in applications running on GPU-based clusters. Our experiments show that GVProf incurs acceptable overhead and scales to large executions. GVProf provides useful insights to guide performance optimization. Under the guidance of GVProf, we optimized several HPC and machine learning workloads, obtaining speedups up to 1.93x.
Our tool provides a profile view and a trace view for GPU-accelerated applications. The profile view identifies where GPU APIs are invoked in CPU calling context, approximates calling context for GPU execution, and analyzes instruction mix for GPU kernels. The tool traces CPU and GPU activities for a large number of processes and threads with minimal overhead.