KEYNOTE TALK
Rethinking Network Fabric Architectures for Distributed AI/ML Platforms
Associate Professor
School of Electrical and Computer Engineering
Georgia Institute of Technology 🇺🇸
Abstract: The surge of artificial intelligence, specifically large language models, is driving an ever-growing demand to train and serve these models efficiently. The large size of models and datasets has necessitated distributed execution over hundreds to thousands of customized GPU/TPU-based platforms connected via high-speed network fabrics. Examples of such platforms today include Google’s Cloud TPU, NVIDIA’s HGX, Intel’s Habana, Cerebras’ Andromeda, Tesla Dojo, and many more. This, in turn, brings the communication overheads of exchanging gradients and activations into the critical path, making the design and optimization of the network fabric a crucial component for overall performance and efficiency.
Designing an optimized network fabric for AI platforms is an open and active challenge today – with co-design opportunities spanning technology (e.g., waferscale, photonics), hardware architectures (i.e., network topologies) and software scheduling (e.g., optimal collective algorithms). This talk will introduce our work in (i) modeling diverse distributed AI platforms to identify communication bottlenecks, (ii) designing scalable fabric topologies leveraging diverse technologies, and (iii) optimizing collective scheduling to enhance network bandwidth efficiency.
Bio: Tushar Krishna is an Associate Professor in the School of Electrical and Computer Engineering at Georgia Tech. He has a Ph.D. in Electrical Engineering and Computer Science from MIT (2014), an M.S.E. in Electrical Engineering from Princeton University (2009), and a B.Tech. in Electrical Engineering from the Indian Institute of Technology (IIT) Delhi (2007).
He has also been a visiting professor at MIT (2023-24) and Harvard University (2024-25), and a researcher at Intel (2014).
Dr. Krishna’s research spans computer architecture, interconnection networks, networks-on-chip (NoC), and AI/ML accelerator systems – with a focus on optimizing data movement in modern computing platforms. His research is funded via multiple awards from NSF, DARPA, IARPA, SRC (including JUMP2.0), Department of Energy, Intel, Google, Meta/Facebook, Qualcomm and TSMC. His papers have been cited over 17,000 times. Three of his papers have been selected for IEEE Micro’s Top Picks from Computer Architecture, one more received an honorable mention, and four have won best paper awards.
Dr. Krishna was inducted into the HPCA Hall of Fame in 2022. At Georgia Tech, he has been honored with the “Class of 1940 Course Survey Teaching Effectiveness Award” in 2018, the “Roger P. Webb Outstanding Junior Faculty Award” from the School of ECE in 2021, the “Richard M. Bass/Eta Kappa Nu Outstanding Junior Teacher Award” in 2023, and the “Roger P. Webb Outstanding Mid-career Faculty Award” from the School of ECE in 2024.
Dr. Krishna currently serves as an Associate Director for the Center for Research into Novel Computing Hierarchies (CRNCH) – a cross-disciplinary research center at Georgia Tech. He is also a co-chair of the Chakra Execution Traces and Benchmarks Working Group within MLCommons.