Ammar Ahmad Awan

ORCID: 0000-0002-6272-3760

Affiliations:
  • The Ohio State University, Columbus, OH, USA


According to our database, Ammar Ahmad Awan authored at least 46 papers between 2012 and 2024.


Bibliography

2024
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference.
CoRR, 2024

2023
DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies.
CoRR, 2023

DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention.
CoRR, 2023

DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales.
CoRR, 2023

A Novel Tensor-Expert Hybrid Parallelism Approach to Scale Mixture-of-Experts Training.
CoRR, 2023

MCR-DL: Mix-and-Match Communication Runtime for Deep Learning.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training.
Proceedings of the 37th International Conference on Supercomputing, 2023

2022
DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale.
Proceedings of the International Conference on Machine Learning, 2022

1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed.
Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022

2021
Scalable and Efficient MoE Training for Multitask Multilingual Models.
CoRR, 2021

1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed.
Proceedings of the 38th International Conference on Machine Learning, 2021

2020
Communication Profiling and Characterization of Deep-Learning Workloads on Clusters With High-Performance Interconnects.
IEEE Micro, 2020

HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training with TensorFlow.
Proceedings of the High Performance Computing - 35th International Conference, 2020

GEMS: GPU-enabled memory-aware model-parallelism system for distributed DNN training.
Proceedings of the International Conference for High Performance Computing, 2020

Efficient Training of Semantic Image Segmentation on Summit using Horovod and MVAPICH2-GDR.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops, 2020

NV-group: link-efficient reduction for distributed deep learning on modern dense GPU systems.
Proceedings of the ICS '20: 2020 International Conference on Supercomputing, 2020

2019
Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast.
IEEE Trans. Parallel Distributed Syst., 2019

Optimized large-message broadcast for deep learning workloads: MPI, MPI+NCCL, or NCCL2?
Parallel Comput., 2019

HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training using TensorFlow.
CoRR, 2019

OMB-UM: Design, Implementation, and Evaluation of CUDA Unified Memory Aware MPI Benchmarks.
Proceedings of the 2019 IEEE/ACM Performance Modeling, 2019

Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera.
Proceedings of the Third IEEE/ACM Workshop on Deep Learning on Supercomputers, 2019

High performance distributed deep learning: a beginner's guide.
Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2019

Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters.
Proceedings of the 2019 IEEE International Conference on Cluster Computing, 2019

Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation.
Proceedings of the 19th IEEE/ACM International Symposium on Cluster, 2019

2018
Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
Proceedings of the 25th European MPI Users' Group Meeting, 2018

OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training.
Proceedings of the 25th IEEE International Conference on High Performance Computing, 2018

2017
An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures.
Proceedings of the Machine Learning on HPC Environments, 2017

S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters.
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017

Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning.
Proceedings of the 46th International Conference on Parallel Processing, 2017

2016
CUDA-Aware OpenSHMEM: Extensions and Designs for High Performance OpenSHMEM on GPU Clusters.
Parallel Comput., 2016

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning.
Proceedings of the 23rd European MPI Users' Group Meeting, EuroMPI 2016, 2016

CUDA M3: Designing Efficient CUDA Managed Memory-Aware MPI by Exploiting GDR and IPC.
Proceedings of the 23rd IEEE International Conference on High Performance Computing, 2016

CUDA Kernel Based Collective Reduction Operations on Large-scale GPU Clusters.
Proceedings of the IEEE/ACM 16th International Symposium on Cluster, 2016

2015
Designing Non-blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters.
Proceedings of the High Performance Computing - 30th International Conference, 2015

GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks.
Proceedings of the 22nd European MPI Users' Group Meeting, 2015

A Case for Non-blocking Collectives in OpenSHMEM: Design, Implementation, and Performance Evaluation using MVAPICH2-X.
Proceedings of the OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies, 2015

On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015

Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

2013
Privacy-aware searching with oblivious term matching for cloud storage.
J. Supercomput., 2013

An Energy Efficient Decoding Scheme for Wireless Body Area Sensor Networks.
CoRR, 2013

An MPI-IO Compliant Java Based Parallel I/O Library.
Proceedings of the 13th IEEE/ACM International Symposium on Cluster, 2013

REECH-ME: Regional Energy Efficient Cluster Heads Based on Maximum Energy Routing Protocol for WSNs.
Proceedings of the 2013 Eighth International Conference on Broadband and Wireless Computing, 2013

DREEM-ME: Distributed Regional Energy Efficient Multi-hop Routing Protocol Based on Maximum Energy in WSNs.
Proceedings of the 2013 Eighth International Conference on Broadband and Wireless Computing, 2013

2012
Towards Efficient Support for Parallel I/O in Java HPC.
Proceedings of the 13th International Conference on Parallel and Distributed Computing, 2012

Intercloud message exchange middleware.
Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication, 2012
