A Performance Analysis Framework for Exploiting GPU Microarchitectural Capability

Abstract

GPUs are widely used to accelerate deep neural networks (DNNs) because of their high bandwidth and parallelism. However, tuning the performance of DNN computations is challenging, as it requires a thorough understanding of both the underlying architecture and the algorithm implementation. Prior work, which analyzes performance at the CUDA C or PTX level, does not tie hardware features closely to the source code. In this paper, we present a performance analysis framework that operates at the assembly level. First, an instruction parser takes assembly source code, benchmark results, and hardware features as input to identify each instruction's efficiency and latency. Then, a DAG constructor builds a directed acyclic graph (DAG) that models instruction execution. Finally, a performance advisor incorporates block partitions, occupancy, and the generated DAG to predict the running cycles of the source code and to report its potential bottlenecks. We demonstrate the effectiveness of our framework by optimizing DNNs' performance-critical kernels: GEMM and convolution. After applying optimizations that reduce the identified bottlenecks, our GEMM is 20% faster than cuBLAS, and our convolution outperforms cuDNN by 40%-60%. Because the framework works directly on assembly instructions, it predicts performance with an average error as low as 2%.
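
As a rough illustration of the DAG-based prediction step described above (this is a minimal sketch, not the paper's implementation; the instruction names, latencies, and data structures are hypothetical), the estimated running cycles can be taken as the critical-path length through the dependency DAG, with each node weighted by a microbenchmarked instruction latency:

    from dataclasses import dataclass, field

    @dataclass
    class Instr:
        name: str
        latency: int                               # cycles, e.g. from microbenchmarks
        deps: list = field(default_factory=list)   # indices of producer instructions

    def predict_cycles(instrs):
        """Estimate running cycles as the critical-path length of the DAG."""
        finish = [0] * len(instrs)
        for i, ins in enumerate(instrs):           # assumes program (topological) order
            start = max((finish[d] for d in ins.deps), default=0)
            finish[i] = start + ins.latency
        return max(finish, default=0)

    # Toy example: two independent global loads feed a fused multiply-add.
    kernel = [
        Instr("LDG.E R0", latency=200),
        Instr("LDG.E R1", latency=200),
        Instr("FFMA R2, R0, R1, R2", latency=6, deps=[0, 1]),
    ]
    print(predict_cycles(kernel))  # 206

A full advisor would additionally weight this estimate by occupancy and block partitioning, as the abstract describes.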

Publication
Proceedings of the International Conference on Supercomputing (ICS)