XTensor 2024
1st Workshop on Cross-stack Optimization of Tensor Methods
April 27, 2024, 1:30-5:30 pm @ San Diego, CA, USA
Held in conjunction with ASPLOS'24
Tensor methods are becoming ever more important for representing data and analyzing its inherent properties, as correlations within data gain importance in many domains. Applications of tensor methods cover machine learning/deep learning, quantum chemistry/physics, quantum circuit simulation, social networks, and healthcare, to name a few. Research on tensor methods comes from multiple domains, including computer architecture, programming languages, compilers, and parallel computing. This workshop aims to gather researchers from diverse computer-systems backgrounds to present and discuss their work on tensor methods, and then to seek cross-stack solutions that improve the performance of tensor algorithms.
Agenda
- 1:30 - 1:40 pm: Welcome and Opening Remarks [Slides]
- 1:40 - 2:30 pm: Keynote Talk by Ang Li @ PNNL
Title: What do GPU tensor cores really bring to the table?
Abstract: Tensor cores, or matrix cores, have become the major computing units of modern GPUs. In this talk, I will go through several lines of research we have performed on the functionality, performance, and numerical precision of GPU tensor cores at the lower wmma level (a minimal illustration of this level appears after the bio below). Some relevant work will also be reviewed, followed by discussions on new research opportunities.
Bio: Dr. Ang Li is a senior computer scientist in the Physical and Computational Sciences Directorate (PCSD) of Pacific Northwest National Laboratory (PNNL) and an affiliate Associate Professor in the ECE department of the University of Washington (UW). He received PhD degrees from the ECE department of NUS, Singapore, and the EE department of TU/e, Netherlands, and joined PNNL in 2016. His research interests include software-hardware codesign for HPC, computer architecture, and quantum computing. His work won best paper awards at ICCD’21 and Cluster’22 and was a best paper finalist at SC’15/17/20, IISWC’18, and ISCA’22. He is an associate editor of TPDS and serves as a review committee member for many HPC and architecture conferences.
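For readers unfamiliar with the wmma level the abstract refers to, the CUDA sketch below (our illustration, not the speaker's code) shows a single warp issuing one 16×16×16 tensor-core multiply-accumulate through the wmma API; it requires a GPU of compute capability 7.0 (Volta) or newer.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 tile product on the tensor cores:
// D = A * B + C, with half-precision inputs and float accumulation.
// Launch with exactly one warp, e.g., tile_mma<<<1, 32>>>(dA, dB, dC, dD);
__global__ void tile_mma(const half *A, const half *B,
                         const float *C, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, A, 16);               // 16 = leading dimension
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::load_matrix_sync(acc_frag, C, 16, wmma::mem_row_major);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // the tensor-core MMA
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}
```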
- 2:30 - 2:45 pm: A Novel Sparse Tensor Representation for Quantum Simulations, by Srikar Chundury
Abstract: Digital simulation of quantum systems is important for algorithmic development and quantum device verification. We present a novel sparse tensor format, DiaQ, specifically catering to state-vector quantum simulation. We develop a set of numerical methods fundamental to state simulation with lower algorithmic complexity than their dense counterparts, and we provide a C++ implementation with parallelization across multiple cores and vectorization. Experiments with sparse DiaQ indicate significant performance benefits over equivalent quantum simulations using a dense tensor format on various benchmark circuits, particularly for larger numbers of qubits. (A generic background sketch of diagonal-style sparse storage follows this entry.)
Bio: Srikar is a graduate student at North Carolina State University, working under Dr. Frank Mueller and Dr. Jiajia Li on efficient quantum simulations. He holds a Bachelor's degree in Computer Science from PES University, Bangalore, India, and worked briefly as a Software Engineer at Walmart Labs before joining NC State for his graduate studies. Srikar's primary research interests lie at the intersection of High-Performance Computing and Quantum Simulation.
In his leisure time, Srikar enjoys watching sports, playing cricket and table tennis, and exploring new places.
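The abstract does not detail DiaQ's internal layout. As generic background, the host-side sketch below shows a diagonal-offset (DIA-style) sparse matrix applied to a state vector, the kind of structure sparse quantum operators often exhibit; the types and names are our illustrative assumptions, not the DiaQ format itself.

```cuda
#include <complex>
#include <vector>

// Generic DIA-style storage: keep only the non-zero diagonals of an n x n
// operator. diags[d][i] holds M(i, i + offsets[d]); entries falling outside
// the matrix are simply never touched.
using cplx = std::complex<double>;

struct DiaMatrix {
    long n;                                 // matrix dimension (2^qubits)
    std::vector<long> offsets;              // diagonal offsets; 0 = main diagonal
    std::vector<std::vector<cplx>> diags;   // one value array per stored diagonal
};

// y = M * x: cost is O(#diagonals * n) instead of O(n^2) for a dense matvec.
std::vector<cplx> dia_matvec(const DiaMatrix &M, const std::vector<cplx> &x) {
    std::vector<cplx> y(M.n, cplx{0.0, 0.0});
    for (std::size_t d = 0; d < M.offsets.size(); ++d) {
        const long off = M.offsets[d];
        for (long i = 0; i < M.n; ++i) {
            const long j = i + off;          // column index on this diagonal
            if (j >= 0 && j < M.n) y[i] += M.diags[d][i] * x[j];
        }
    }
    return y;
}
```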
- 2:45 - 3:00 pm: Accelerating GEMV via Targeted Tensor Core Optimizations, by Sounder Rajendran
Abstract: This work explores the acceleration of General Matrix-Vector Multiplication (GEMV) through the use of specialized matrix processing units, known as Tensor Cores, available in NVIDIA GPUs since the Volta architecture. With GEMV’s extensive application in High-Performance Computing (HPC) and Artificial Intelligence/Machine Learning (AI/ML), from backpropagation in neural networks to iterative solvers in computational fluid dynamics, the operation’s efficiency is pivotal. Our research centers on novel matrix augmentation and transformation techniques that adapt GEMV computations for Tensor Cores, focusing on minimizing wasted FLOPs and leveraging shared-memory optimizations. This adaptation seeks to exploit the high-throughput capabilities of Tensor Cores, traditionally utilized for matrix-matrix multiplication (GEMM), thereby extending their utility to accelerate GEMV operations. Preliminary findings demonstrate significant speedups in GEMV execution compared to conventional cuBLAS implementations, underscoring the potential of our approach to unlock new dimensions of computational performance for a wide array of HPC and AI/ML applications. Through this exploration, we aim to contribute further to the optimization of GEMV, facilitating more efficient utilization of advanced hardware accelerators. (A host-side sketch of the basic augmentation idea follows this entry.)
Bio: Sounder Rajendran is a graduate student in Computer Engineering at North Carolina State University, where he works under Dr. Jiajia Li focusing on GPU architecture-aware algorithm optimizations, particularly in tensor cores.
He holds a Bachelor's degree in Electrical and Electronics Engineering from Coimbatore Institute of Technology, India. In his professional career, Sounder contributed to research and development of low-power embedded systems and physical test simulations at Bosch India, within its HMI Centre of Excellence, and to the development of core-level chip pervasive logic modules supporting CPU power management at AMD.
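The basic augmentation idea can be shown without GPU code: tensor cores natively compute tile GEMMs, so a GEMV y = A·v can be recast as a GEMM by widening v into a skinny matrix whose first column is v. The host-side reference sketch below uses our own naming and a plain loop GEMM to show the equivalence; the talk's actual kernels, tile shapes, and shared-memory optimizations are more involved.

```cuda
#include <vector>

// GEMV as GEMM: build B (n x tile) with column 0 = v and the rest zero
// padding, compute C = A * B, and read y out of column 0 of C. On a GPU,
// the GEMM step is what gets dispatched to tensor cores (via wmma/cuBLAS);
// the zero columns are the wasted FLOPs the work tries to minimize.
std::vector<float> gemv_via_gemm(const std::vector<float> &A,  // m x n, row-major
                                 const std::vector<float> &v,  // length n
                                 int m, int n, int tile = 16) {
    std::vector<float> B(static_cast<size_t>(n) * tile, 0.0f);
    for (int k = 0; k < n; ++k) B[static_cast<size_t>(k) * tile] = v[k];

    std::vector<float> C(static_cast<size_t>(m) * tile, 0.0f);
    for (int i = 0; i < m; ++i)                       // reference GEMM loop
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < tile; ++j)
                C[static_cast<size_t>(i) * tile + j] +=
                    A[static_cast<size_t>(i) * n + k] *
                    B[static_cast<size_t>(k) * tile + j];

    std::vector<float> y(m);
    for (int i = 0; i < m; ++i) y[i] = C[static_cast<size_t>(i) * tile];
    return y;                                         // y = A * v
}
```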
- 3:00 - 3:30 pm: Break
- 3:30 - 4:10 pm: Invited Talk by Xupeng Miao @ CMU
Title: Toward Fast and Affordable Serving Systems for Large Language Models
Abstract: In the rapidly evolving field of generative artificial intelligence, efficient deployment of large language models (LLMs) is a critical challenge. In this talk, I will introduce three of our approaches to enhancing the efficiency and cost-effectiveness of LLM inference and serving systems. First, I will present SpecInfer, the inaugural tree-based speculative inference system, which reduces LLM serving latency by 1.5-3.5x compared to existing solutions by leveraging a novel token-tree speculation and verification mechanism. Next, I will describe SpotServe, the first LLM serving system on spot instances, which handles preemptions with dynamic reparallelization, ensures relatively low tail latency, and reduces monetary cost by 54%. Finally, I will present FlexLLM, a low-cost co-serving system for LLM inference and parameter-efficient fine-tuning (PEFT) that achieves higher fine-tuning throughput while preserving the SLA for inference jobs. (A background sketch of speculative verification follows this entry.)
Bio: Xupeng Miao is currently a postdoctoral researcher at Carnegie Mellon University working with Prof. Zhihao Jia and Prof. Tianqi Chen. Before that, he received his Ph.D. from Peking University, advised by Prof. Bin Cui. He is broadly interested in machine learning systems, data management, and distributed computing. His past research has been published in top-tier system and machine learning conferences, including OSDI, ASPLOS, SIGMOD, VLDB, NSDI, and NeurIPS. Recently, he has focused on building efficient, scalable, and affordable software systems (e.g., FlexFlow Serve) for large language models. His work has been recognized with the 2022 ACM China Doctoral Dissertation Award, the Best Scalable Data Science Paper Award at VLDB 2022, and the Distinguished Artifact Award at ASPLOS 2024.
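As background on the speculate-then-verify idea behind SpecInfer, the sketch below shows the verify step for a linear chain of drafted tokens: the large model scores all drafted positions in one batched pass, and the longest prefix matching its own greedy choices is committed. SpecInfer generalizes this to token trees, which this simplified sketch (with our own names) does not capture.

```cuda
#include <vector>

using Token = int;

// drafted[i]       : token i proposed by the small draft model.
// target_greedy[i] : the large model's greedy token at position i, obtained
//                    from a single batched verification pass over the draft.
// Returns how many drafted tokens can be committed: the longest prefix on
// which the draft agrees with the target model. Several tokens may thus be
// accepted per large-model invocation, which is the source of the speedup.
std::size_t verify_linear_draft(const std::vector<Token> &drafted,
                                const std::vector<Token> &target_greedy) {
    std::size_t accepted = 0;
    while (accepted < drafted.size() &&
           drafted[accepted] == target_greedy[accepted]) {
        ++accepted;  // draft token matches what the target would have emitted
    }
    return accepted; // decoding resumes from the first mismatch
}
```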
- 4:10 - 4:25 pm: Optimizing Sparse Tensor-times-Vector on Cerebras WSE-2, by Sai Krishna Teja Varma Manthena
Abstract: Sparse tensor algebra plays a crucial role in domains such as scientific computing, machine learning, and data analytics, particularly in computations like Sparse Tensor-times-Vector (SpTTV), which is essential for popular tensor decompositions such as Tucker decomposition and tensor power iteration. This work delves into the optimization of the SpTTV kernel on the Cerebras Wafer-Scale Engine (WSE), a powerful architecture initially designed for machine learning training but increasingly recognized for its versatility across different workloads. Our evaluation compares our optimized implementation against the PASTA library’s OpenMP implementation of the SpTTV kernel, run on a 13th Gen Intel(R) Core(TM) i9-13900F with a clock speed of 2000 MHz. The results demonstrate significant performance improvements: up to 4× speedup for a synthetic 50×50×50 tensor generated with non-zero counts ranging from 128 to 32768, using up to 256 Processing Elements (PEs), and 3.5× speedup for a synthetic 68×68×68 tensor with non-zero counts ranging from 128 to 16384, using up to 128 PEs. This work aims to showcase the potential of the Cerebras WSE in accelerating sparse tensor computations and advancing high-performance computing paradigms. (A generic sketch of the SpTTV kernel follows this entry.)
Bio: Sai Krishna Teja Varma Manthena is a second-year Master’s student in the Computer Science Department at NC State University, advised by Dr. Jiajia Li. He completed his undergraduate degree at Jawaharlal Nehru Technological University, Hyderabad, India. His current research focuses on optimizing sparse tensor computations for AI architectures. He has hands-on experience in GPU programming and other parallel processing paradigms.
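For reference, the sketch below shows what a mode-3 SpTTV kernel computes, Y(i,j) = Σ_k X(i,j,k)·v(k), over a COO-format sparse tensor. This is a generic serial illustration with our own naming, not the PASTA kernel or the Cerebras WSE implementation from the talk.

```cuda
#include <vector>

// COO storage for a 3-way sparse tensor: parallel arrays of coordinates
// and values, one entry per non-zero.
struct CooTensor3 {
    std::vector<int> i, j, k;    // coordinates of each non-zero
    std::vector<float> val;      // value of each non-zero
    int dim_i, dim_j, dim_k;     // tensor dimensions
};

// Mode-3 SpTTV: contract the last mode with a dense vector v (length dim_k),
// producing a dense dim_i x dim_j matrix Y in row-major order. Work is
// proportional to the number of non-zeros, not the dense tensor size.
std::vector<float> spttv_mode3(const CooTensor3 &X, const std::vector<float> &v) {
    std::vector<float> Y(static_cast<size_t>(X.dim_i) * X.dim_j, 0.0f);
    for (std::size_t nz = 0; nz < X.val.size(); ++nz) {
        Y[static_cast<size_t>(X.i[nz]) * X.dim_j + X.j[nz]] +=
            X.val[nz] * v[X.k[nz]];
    }
    return Y;
}
```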
- 4:30 - 5:30 pm: Panel & Open Discussion (Ang Li, Xupeng Miao)
Call for Papers
XTensor is a venue for discussion and brainstorming at the intersection of software and hardware research.
Research topics include, but are not limited to:
- Research angles:
- Programming abstractions
- Compiler techniques
- Runtime optimization
- Libraries/frameworks
- High-performance algorithms
- Hardware architecture
- Performance modeling
- Tensor methods:
- Data: dense, sparse, structured, symmetric layout; random or non-negative values; etc.
- Randomized, approximate, etc.
- Tensor operators, such as tensor products, tensor-matrix multiplication, tensor-tensor multiplication, matrix-matrix multiplication
- Tensor decompositions, such as canonical polyadic decomposition (CPD), Tucker decomposition, etc.
- Tensor networks, such as tensor train, tensor ring, hierarchical Tucker, the projected entangled pair states (PEPS), etc.
- Tensor regression, tensor component analysis, tensor-structured dictionary learning, etc.
- Platforms:
- CPUs
- GPUs (e.g., NVIDIA, AMD, Intel)
- AI accelerators (e.g., SambaNova, Graphcore, Habana, GroqRack)
- Wafer-scale systems (e.g., Cerebras)
- FPGAs
- ASICs
- Simulators of any kind
Position Paper Submission
Discussion and communication are the primary goals of the workshop; thus, we ask only for 2-page position papers. Submissions are flexible in content: they could present early, in-progress research; report experiments or results that are not yet complete; survey recent research trends or tools; propose an important open question and call for solutions; or share an experience report, even of failed approaches, with lessons learned. Submitted papers will undergo peer review by a program committee of experts from diverse research domains working on tensor problems.
Papers should follow the two-column SIGPLAN formatting guidelines (the acmart format with the sigplan two-column option) and may be up to 2 pages, excluding references. Review is single-blind; please include authors' names on the submitted PDF.
Paper submission will be via EasyChair. Accepted papers will not be published in proceedings; presentation slides will be posted on the workshop website.
Important Dates
- Paper submission: April 3, 2024 (extended)
- Author Notification: April 17, 2024
- Workshop: April 27, 2024, 1:30-5:30 pm
All deadlines are 11:59 pm AoE (Anywhere on Earth).
Organization
Chairs:
Frank Mueller, North Carolina State University
Dong Li, University of California, Merced
Lizhong Chen, Oregon State University
Jiajia Li, North Carolina State University (jiajia.li@ncsu.edu)
Committee:
Jee Choi, University of Oregon
Bastian Hagedorn, NVIDIA
Ahmed Helal, Intel
Bin Ren, William & Mary
Keren Zhou, George Mason University
Please contact Jiajia Li at jiajia.li@ncsu.edu if you have any questions.