Linear Layouts: Robust Code Generation of Efficient Tensor Computation Using F_2

Keren Zhou, Mario Lezcano, Adam Goucher, Akhmed Rakhmati, Jeff Niu, Justin Lebar, Pawel Szczerbuk, Peter Bell, Phil Tillet, Thomas Raoux, Zahi Moudallal

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

Hao Wu, Qidong Zhao, Songqing Chen, Yang Chen, Yueming Hao, Tony C. W. Liu, Sijia Chen, Adnan Aziz, Keren Zhou

PASTA: A Modular Program Analysis Tool Framework for Accelerators

Mao Lin, Hyeran Jeon, Keren Zhou

Proton: Towards Multi-level, Adaptive Profiling for Triton

Keren Zhou, Tianle Zhong, Hao Wu, Jihyeong Lee, Yue Guan, Yufei Ding, Corbin Robeck, Yuanwei Fang, Jeff Niu, Philippe Tillet

Mercury: Unlocking Multi-GPU Operator Optimization for LLMs via Remote Memory Scheduling

Remote memory scheduling framework that optimizes LLM operators across multi-GPU deployments.

Yue Guan, Xinwei Qiang, Zaifeng Pan, Daniels Johnson, Yuanwei Fang, Keren Zhou, Yuke Wang, Wanlu Li, Yufei Ding, Adnan Aziz

Comprehensive Evaluation of LLMs in HPC Code Performance Optimization

Benchmarks and evaluates large language models for optimizing high-performance computing code.

Bowen Cui, Tejas Ramesh, Oscar Hernandez, Keren Zhou

KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads

An open, compiler-focused infrastructure for profiling and optimizing GPU kernels on AI workloads.

Yue Guan, Yuanwei Fang, Keren Zhou, Corbin Robeck, Manman Ren, Zhongkai Yu, Yufei Ding, Adnan Aziz

DeepContext: A Context-aware, Cross-platform, and Cross-framework Tool for Performance Profiling and Analysis of Deep Learning Workloads

Performance profiling toolkit that unifies deep learning workload analysis across platforms and frameworks.

Qidong Zhao, Hao Wu, Yueming Hao, Zilingfeng Ye, Jiajia Li, Xu Liu, Keren Zhou

Triton-Viz: Visualizing GPU Programming in AI Courses

GPU programming is a critical component of AI systems courses, yet it is notoriously difficult to learn and teach, given its unique …

Tejas Ramesh, Alexander Rush, Xu Liu, Binqian Yin, Keren Zhou, Shuyin Jiao

SS1: Accelerating Inference with Fast and Expressive Sketch Structured Transform

Tensor multiplication with learned weight matrices is the fundamental building block in deep learning models. These matrices can often …

Aditya Desai, Kimia Saedi, Apoorv Walia, Jihyeong Lee, Keren Zhou, Anshumali Shrivastava