Samyam Rajbhandari

ORCID: 0000-0002-0386-8759

According to our database, Samyam Rajbhandari authored at least 35 papers between 2012 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2024
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference.
CoRR, 2024

2023
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models.
CoRR, 2023

DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention.
CoRR, 2023

DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales.
CoRR, 2023

ZeRO++: Extremely Efficient Collective Communication for Giant Model Training.
CoRR, 2023

A Novel Tensor-Expert Hybrid Parallelism Approach to Scale Mixture-of-Experts Training.
CoRR, 2023

A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training.
Proceedings of the 37th International Conference on Supercomputing, 2023

2022
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model.
CoRR, 2022

DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale.
Proceedings of the International Conference on Machine Learning, 2022

1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed.
Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022

2021
Scalable and Efficient MoE Training for Multitask Multilingual Models.
CoRR, 2021

ZeRO-Offload: Democratizing Billion-Scale Model Training.
Proceedings of the 2021 USENIX Annual Technical Conference, 2021

ZeRO-infinity: breaking the GPU memory wall for extreme scale deep learning.
Proceedings of the International Conference for High Performance Computing, 2021

SimiGrad: Fine-Grained Adaptive Batching for Large Scale Training using Gradient Similarity Measurement.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed.
Proceedings of the 38th International Conference on Machine Learning, 2021

2020
Fast LSTM by dynamic decomposition on cloud and distributed systems.
Knowl. Inf. Syst., 2020

APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm.
CoRR, 2020

ZeRO: memory optimizations toward training trillion parameter models.
Proceedings of the International Conference for High Performance Computing, 2020

DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters.
Proceedings of the KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2020

2019
ZeRO: Memory Optimization Towards Training A Trillion Parameter Models.
CoRR, 2019

AntMan: Sparse Low-Rank Compression to Accelerate RNN inference.
CoRR, 2019

Accelerating Large Scale Deep Learning Inference through DeepCPU at Microsoft.
Proceedings of the 2019 USENIX Conference on Operational Machine Learning, 2019

Fast LSTM Inference by Dynamic Decomposition on Cloud Systems.
Proceedings of the 2019 IEEE International Conference on Data Mining, 2019

2018
DeepCPU: Serving RNN-based Deep Learning Models 10x Faster.
Proceedings of the 2018 USENIX Annual Technical Conference, 2018

Learning Intrinsic Sparse Structures within Long Short-Term Memory.
Proceedings of the 6th International Conference on Learning Representations, 2018

2017
Learning Intrinsic Sparse Structures within Long Short-term Memory.
CoRR, 2017

Optimizing the Four-Index Integral Transform Using Data Movement Lower Bounds Analysis.
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017

Optimizing CNNs on Multicores for Scalability, Performance and Goodput.
Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, 2017

2016
A domain-specific compiler for a parallel multiresolution adaptive numerical simulation environment.
Proceedings of the International Conference for High Performance Computing, 2016

On fusing recursive traversals of K-d trees.
Proceedings of the 25th International Conference on Compiler Construction, 2016

2014
A Communication-Optimal Framework for Contracting Distributed Tensors.
Proceedings of the International Conference for High Performance Computing, 2014

CAST: Contraction Algorithm for Symmetric Tensors.
Proceedings of the 43rd International Conference on Parallel Processing, 2014

2013
A framework for load balancing of tensor contraction expressions via dynamic task partitioning.
Proceedings of the International Conference for High Performance Computing, 2013

2012
International Conference on Computational Science, ICCS 2012.
Proceedings of the International Conference on Computational Science, 2012
