Publications

(2025). Comprehensive Evaluation of LLMs in HPC Code Performance Optimization. Proceedings of the Workshop on AI Assisted Software Development for HPC (AI4Dev).

(2025). DeepContext: A Context-aware, Cross-platform, and Cross-framework Tool for Performance Profiling and Analysis of Deep Learning Workloads. Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).

(2025). Triton-Viz: Visualizing GPU Programming in AI Courses. Proceedings of the 56th ACM Technical Symposium on Computer Science Education (SIGCSE).

(2023). DrGPUM: Guiding Memory Optimization for GPU-Accelerated Applications. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).

(2022). ValueExpert: Exploring Value Patterns in GPU-Accelerated Applications. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).

(2021). GPA: A GPU Performance Advisor Based on Instruction Sampling. IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

(2021). Measurement and Analysis of GPU-Accelerated OpenCL Computations on Intel GPUs. IEEE/ACM International Workshop on Programming and Performance Visualization Tools (ProTools).

(2020). Tools for Top-down Performance Analysis of GPU-Accelerated Applications. Proceedings of the 34th ACM International Conference on Supercomputing (ICS).

(2020). GVPROF: A Value Profiler for GPU-Based Clusters. International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

(2020). A Tool for Top-down Performance Analysis of GPU-Accelerated Applications. Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP).

(2019). A Tool for Performance Analysis of GPU-Accelerated Applications. IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

(2018). Quadboost: A Scalable Concurrent Quadtree. IEEE Transactions on Parallel and Distributed Systems (TPDS).

(2017). Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning. Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP).

(2015). BF-MapReduce: A Bloom Filter Based Efficient Lightweight Search. IEEE Conference on Collaboration and Internet Computing (CIC).
