Samyam Rajbhandari

ORCID: 0000-0002-0386-8759

According to our database, Samyam Rajbhandari authored at least 35 papers between 2012 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2024
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference.
CoRR, 2024

2023
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models.
CoRR, 2023

DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention.
CoRR, 2023

DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales.
CoRR, 2023

ZeRO++: Extremely Efficient Collective Communication for Giant Model Training.
CoRR, 2023

A Novel Tensor-Expert Hybrid Parallelism Approach to Scale Mixture-of-Experts Training.
CoRR, 2023

A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training.
Proceedings of the 37th International Conference on Supercomputing, 2023

2022
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model.
CoRR, 2022

DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale.
Proceedings of the International Conference on Machine Learning, 2022

1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed.
Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022

2021
Scalable and Efficient MoE Training for Multitask Multilingual Models.
CoRR, 2021

ZeRO-Offload: Democratizing Billion-Scale Model Training.
Proceedings of the 2021 USENIX Annual Technical Conference, 2021

ZeRO-infinity: breaking the GPU memory wall for extreme scale deep learning.
Proceedings of the International Conference for High Performance Computing, 2021

SimiGrad: Fine-Grained Adaptive Batching for Large Scale Training using Gradient Similarity Measurement.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed.
Proceedings of the 38th International Conference on Machine Learning, 2021

2020
Fast LSTM by dynamic decomposition on cloud and distributed systems.
Knowl. Inf. Syst., 2020

APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm.
CoRR, 2020

ZeRO: memory optimizations toward training trillion parameter models.
Proceedings of the International Conference for High Performance Computing, 2020

DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters.
Proceedings of the KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2020

2019
ZeRO: Memory Optimization Towards Training A Trillion Parameter Models.
CoRR, 2019

AntMan: Sparse Low-Rank Compression to Accelerate RNN inference.
CoRR, 2019

Accelerating Large Scale Deep Learning Inference through DeepCPU at Microsoft.
Proceedings of the 2019 USENIX Conference on Operational Machine Learning, 2019

Fast LSTM Inference by Dynamic Decomposition on Cloud Systems.
Proceedings of the 2019 IEEE International Conference on Data Mining, 2019

2018
DeepCPU: Serving RNN-based Deep Learning Models 10x Faster.
Proceedings of the 2018 USENIX Annual Technical Conference, 2018

Learning Intrinsic Sparse Structures within Long Short-Term Memory.
Proceedings of the 6th International Conference on Learning Representations, 2018

2017
Learning Intrinsic Sparse Structures within Long Short-term Memory.
CoRR, 2017

Optimizing the Four-Index Integral Transform Using Data Movement Lower Bounds Analysis.
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017

Optimizing CNNs on Multicores for Scalability, Performance and Goodput.
Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, 2017

2016
A domain-specific compiler for a parallel multiresolution adaptive numerical simulation environment.
Proceedings of the International Conference for High Performance Computing, 2016

On fusing recursive traversals of K-d trees.
Proceedings of the 25th International Conference on Compiler Construction, 2016

2014
A Communication-Optimal Framework for Contracting Distributed Tensors.
Proceedings of the International Conference for High Performance Computing, 2014

CAST: Contraction Algorithm for Symmetric Tensors.
Proceedings of the 43rd International Conference on Parallel Processing, 2014

2013
A framework for load balancing of tensor contraction expressions via dynamic task partitioning.
Proceedings of the International Conference for High Performance Computing, 2013

2012
International Conference on Computational Science, ICCS 2012.
Proceedings of the International Conference on Computational Science, 2012
