GPU

FASTEN: Fast GPU-accelerated Segmented Matrix Multiplication for Heterogenous Graph Neural Networks

Presented the FASTEN work for accelerating segmented matrix multiplication

Jun 1, 2024 9:41 PM — 9:41 PM Virtual

Update on Triton's Interpreter

Reviewed the progress of Triton's interpreter and future plans

Apr 3, 2024 10:03 PM — 10:03 PM Virtual

Proton: A Profiler for Triton

Presented an overview of Proton's design

Feb 20, 2024 10:03 PM — 10:03 PM Virtual

PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation

This paper introduces two extensions to the popular PyTorch machine learning framework, TorchDynamo and TorchInductor, which implement …

Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Laurent Kirsch, Michael Lazos, Mario Lezcano, Yanbo Liang, Jason Liang, Yinghai Lu, C. K. Luk, Bert Maher, Yunjie Pan, Christian Puhrsch, Matthias Reso, Mark Saroufim, Marcos Yukio Siraichi, Helen Suk, Shunting Zhang, Michael Suo, Phil Tillet, Xu Zhao, Eikan Wang, Keren Zhou, Richard Zou, Xiaodong Wang, Ajit Mathews, William Wen, Gregory Chanan, Peng Wu, Soumith Chintala

Technical Review on PyTorch 2.0 and Triton

High-level overview of PyTorch 2.0 and Triton integration

Aug 7, 2023 10:03 PM — 10:03 PM Virtual

HPCToolkit

Our tool provides a profile view and a trace view for GPU-accelerated applications. The profile view identifies where GPU APIs are invoked in CPU calling context, approximates calling context for GPU execution, and analyzes instruction mix for GPU kernels. The tool traces CPU and GPU activities for a large number of processes and threads with minimal overhead.

Triton

Triton is a language and compiler for writing highly efficient custom Deep-Learning primitives. The aim of Triton is to provide an open-source environment for expressing tensor math workloads that offers high flexibility, developer productivity, and end-to-end performance.

Hardware-Aware Compression with Random Operation Access Specific Tile (ROAST) Hashing

Advancements in deep learning are often associated with increasing model sizes. Training and deploying large models require …

Aditya Desai, Keren Zhou, Anshumali Shrivastava

Towards Agile Development of Efficient Deep Learning Operators (Hardware Insights)

Presented a talk about Triton and requested feedback from Intel engineers

Jun 29, 2023 10:56 PM — 10:56 PM Virtual

Towards Agile Development of Efficient Deep Learning Operators (Call for Contributions)

Presented a talk about Triton and called for contributions to improving the language

Jun 19, 2023 10:56 PM — 10:56 PM Lake Tahoe, California