May 2020 – August 2020

Software Engineering Intern

Google Inc

Performance Regression Analysis of Feedback-Directed Optimization (FDO) Based Programs

June 2018 – August 2018

Research Intern

Facebook Inc

Neural Network Optimization on Mobiles

April 2017 – July 2017

Research Intern

Nvidia Inc

Neural Network Quantization

October 2013 – February 2014

SDE Intern

Baidu Inc

Hadoop Workflow Optimization



GPA is a performance advisor for NVIDIA GPUs that suggests potential code optimization opportunities at a hierarchy of levels, including individual lines, loops, and functions. GPA uses data flow analysis to approximately attribute measured instruction stalls to their root causes and uses information about a program’s structure and the GPU to match inefficiency patterns with suggestions for optimization. GPA estimates each optimization’s speedup based on a PC sampling-based performance model.

We implemented GVProf, the first value profiler that locates value redundancy problems in applications running on GPU-based clusters. Our experiments show that GVProf incurs acceptable overhead and scales to large executions. GVProf provides useful insights to guide performance optimization. Under the guidance of GVProf, we optimized several HPC and machine learning workloads, obtaining speedups of up to 1.93x.

Our tool provides a profile view and a trace view for GPU-accelerated applications. The profile view identifies where GPU APIs are invoked in CPU calling context, approximates calling context for GPU execution, and analyzes instruction mix for GPU kernels. The tool traces CPU and GPU activities for a large number of processes and threads with minimal overhead.

gBolt is a fast, memory-efficient, and lightweight implementation of the gSpan algorithm for data mining. With multi-threading on a single machine, gBolt is up to 100x faster than the original implementation. It also reduces memory usage by more than 200-fold, so it runs efficiently on personal computers.

Recent Publications

(2021). GPA: A GPU Performance Advisor Based on Instruction Sampling. IEEE/ACM International Symposium on Code Generation and Optimization (CGO’21).

(2020). GVProf: A Value Profiler for GPU-Based Clusters. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’20).

(2020). Tools for Top-down Performance Analysis of GPU-Accelerated Applications. Proceedings of the 34th ACM International Conference on Supercomputing (ICS’20).

(2020). A tool for top-down performance analysis of GPU-accelerated applications. Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’20).

(2019). A tool for performance analysis of GPU-accelerated applications. Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO’19).

Recent & Upcoming Talks

Using HPCToolkit to Measure and Analyze the Performance of GPU-accelerated Applications Tutorial, Mar-Apr 2021

Presented our CGO’21 work.

Presented our SC’20 work.

Presented our ICS’20 work.

Presented a poster and a short talk about HPCToolkit’s GPU support at PPoPP’20.