
Keren Zhou

Latest

Linear Layouts: Robust Code Generation of Efficient Tensor Computation Using F_2
Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context
Triton, Gluon, and the Future of Tile-Based Programming Models
PASTA: A Modular Program Analysis Tool Framework for Accelerators
Proton: Towards Multi-level, Adaptive Profiling for Triton
Mercury: Unlocking Multi-GPU Operator Optimization for LLMs via Remote Memory Scheduling
Comprehensive Evaluation of LLMs in HPC Code Performance Optimization
KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads
Tile-based Programming Models for AI
DeepContext: A Context-aware, Cross-platform, and Cross-framework Tool for Performance Profiling and Analysis of Deep Learning Workloads
Proton: Adaptive and Lightweight Profiling for Deep Learning Workloads
The Proton Dialect: An MLIR Dialect For AI Compiler GPU Kernel Profiling
Triton-Viz: Visualizing GPU Programming in AI Courses
SS1: Accelerating Inference with Fast and Expressive Sketch Structured Transform
Profiling and Debugging GPU-accelerated AI Applications
Proton: Introduction and Development
Dev Tools: Proton/Interpreter
Triton Update
FASTEN: Fast GPU-accelerated Segmented Matrix Multiplication for Heterogenous Graph Neural Networks
Update on Triton's Interpreter
Proton: A Profiler for Triton
Centimani: Enabling Fast AI Accelerator Selection for DNN Training with a Novel Performance Predictor
FASTEN: Fast GPU-accelerated Segmented Matrix Multiplication for Heterogenous Graph Neural Networks
PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation
Technical Review on PyTorch 2.0 and Triton
Hardware-Aware Compression with Random Operation Access Specific Tile (ROAST) Hashing
Towards Agile Development of Efficient Deep Learning Operators (Hardware Insights)
Towards Agile Development of Efficient Deep Learning Operators (Call for Contributions)
DrGPUM: Guiding Memory Optimization for GPU-Accelerated Applications
Semi-supervised learning for shale image segmentation with fast normalized cut loss
Towards Agile Development of Efficient Deep Learning Operators (Pre-MLIR)
Practical Performance Optimization for Deep Learning Applications
ValueExpert: Exploring Value Patterns in GPU-accelerated Applications
Accelerating High-order Stencils on GPUs
An Automated Tool for Analysis and Tuning of GPU-Accelerated Code in HPC Applications
Low Overhead and Context Sensitive Profiling of GPU-Accelerated Applications
Paw-Net: Stacking Ensemble Deep Learning for Segmenting Scanning Electron Microscopy Images of Fine-grained Shale Samples
ValueExpert: Exploring Value Patterns in GPU-Accelerated Applications
Performance Measurement, Analysis, and Optimization of GPU-accelerated Applications
Analyzing GPU-accelerated Applications Using HPCToolkit
GPA: A GPU Performance Advisor Based on Instruction Sampling
Measurement and Analysis of GPU-accelerated Applications with HPCToolkit
Measurement and Analysis of GPU-Accelerated OpenCL Computations on Intel GPUs
Outcomes of OpenMP Hackathon: OpenMP Application Experiences with the Offloading Model
GVProf: A Value Profiler for GPU-Based Clusters
Tools for Top-down Performance Analysis of GPU-Accelerated Applications
A Tool for Top-down Performance Analysis of GPU-accelerated Applications
A Tool for Top-down Performance Analysis of GPU-Accelerated Applications
GVPROF: A Value Profiler for GPU-Based Clusters
Tools for Top-down Performance Analysis of GPU-Accelerated Applications
Optimizing GPU-accelerated Applications with HPCToolkit
A Tool for Performance Analysis of GPU-accelerated Applications
A Tool for Performance Analysis of GPU-Accelerated Applications
Quadboost: A Scalable Concurrent Quadtree
A Performance Analysis Framework for Exploiting GPU Microarchitectural Capability
Deep Learning on Modern Architectures
A Performance Analysis Framework for Exploiting GPU Microarchitectural Capability
Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning
Convolution Methods
BF-MapReduce: A Bloom Filter Based Efficient Lightweight Search
Multi-Classes Feature Engineering with Sliding Window for Purchase Prediction in Mobile Commerce

© 2026 Keren Zhou. This work is licensed under CC BY-NC-ND 4.0.
