Keren Zhou
Mercury: Unlocking Multi-GPU Operator Optimization for LLMs via Remote Memory Scheduling
Remote memory scheduling framework that optimizes LLM operators across multi-GPU deployments.
Yue Guan, Xinwei Qiang, Zaifeng Pan, Daniels Johnson, Yuanwei Fang, Keren Zhou, Yuke Wang, Wanlu Li, Yufei Ding, Adnan Aziz
Comprehensive Evaluation of LLMs in HPC Code Performance Optimization
Benchmarks and evaluates large language models for optimizing high-performance computing code.
Bowen Cui, Tejas Ramesh, Oscar Hernandez, Keren Zhou
KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads
An open, compiler-focused infrastructure for profiling and optimizing GPU kernels on AI workloads.
Yue Guan, Yuanwei Fang, Keren Zhou, Corbin Robeck, Manman Ren, Zhongkai Yu, Yufei Ding, Adnan Aziz
DeepContext: A Context-aware, Cross-platform, and Cross-framework Tool for Performance Profiling and Analysis of Deep Learning Workloads
Performance profiling toolkit that unifies deep learning workload analysis across platforms and frameworks.
Qidong Zhao, Hao Wu, Yuming Hao, Zilingfeng Ye, Jiajia Li, Xu Liu, Keren Zhou
Triton-Viz: Visualizing GPU Programming in AI Courses
GPU programming is a critical component in AI system courses, which is notoriously difficult to learn and teach, given its unique …
Tejas Ramesh, Alexander Rush, Xu Liu, Binqian Yin, Keren Zhou, Shuyin Jiao
SS1: Accelerating Inference with Fast and Expressive Sketch Structured Transform
Tensor multiplication with learned weight matrices is the fundamental building block in deep learning models. These matrices can often …
Aditya Desai, Kimia Saedi, Apoorv Walia, Jihyeong Lee, Keren Zhou, Anshumali Shrivastava
Centimani: Enabling Fast AI Accelerator Selection for DNN Training with a Novel Performance Predictor
For an extended period, graphics processing units (GPUs) have stood as the exclusive choice for training deep neural network (DNN) …
Zhen Xie, Murali Emani, Xiaodong Yu, Dingwen Tao, Xin He, Pengfei Su, Keren Zhou, Venkatram Vishwanath
FASTEN: Fast GPU-accelerated Segmented Matrix Multiplication for Heterogenous Graph Neural Networks
This paper introduces FASTEN, a cutting-edge library developed to address the computational challenges inherent in Heterogeneous Graph …
Keren Zhou, Karthik Ganapathi Subramanian, Po-Hsun Lin, Matthias Fey, Binqian Yin, Jiajia Li
PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation
This paper introduces two extensions to the popular PyTorch machine learning framework, TorchDynamo and TorchInductor, which implement …
Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Laurent Kirsch, Michael Lazos, Mario Lezcano, Yanbo Liang, Jason Liang, Yinghai Lu, C. K. Luk, Bert Maher, Yunjie Pan, Christian Puhrsch, Matthias Reso, Mark Saroufim, Marcos Yukio Siraichi, Helen Suk, Shunting Zhang, Michael Suo, Phil Tillet, Xu Zhao, Eikan Wang, Keren Zhou, Richard Zou, Xiaodong Wang, Ajit Mathews, William Wen, Gregory Chanan, Peng Wu, Soumith Chintala
Hardware-Aware Compression with Random Operation Access Specific Tile (ROAST) Hashing
Advancements in deep learning are often associated with increasing model sizes. Training and deploying large models require …
Aditya Desai, Keren Zhou, Anshumali Shrivastava