Ammar Ahmad Awan

ORCID: 0000-0002-6272-3760

Affiliations:
  • The Ohio State University, Columbus, OH, USA


According to our database, Ammar Ahmad Awan authored at least 46 papers between 2012 and 2024.


Bibliography

2024
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference.
CoRR, 2024

2023
DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies.
CoRR, 2023

DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention.
CoRR, 2023

DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales.
CoRR, 2023

A Novel Tensor-Expert Hybrid Parallelism Approach to Scale Mixture-of-Experts Training.
CoRR, 2023

MCR-DL: Mix-and-Match Communication Runtime for Deep Learning.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training.
Proceedings of the 37th International Conference on Supercomputing, 2023

2022
DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale.
Proceedings of the International Conference on Machine Learning, 2022

1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed.
Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022

2021
Scalable and Efficient MoE Training for Multitask Multilingual Models.
CoRR, 2021

1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed.
Proceedings of the 38th International Conference on Machine Learning, 2021

2020
Communication Profiling and Characterization of Deep-Learning Workloads on Clusters With High-Performance Interconnects.
IEEE Micro, 2020

HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training with TensorFlow.
Proceedings of the High Performance Computing - 35th International Conference, 2020

GEMS: GPU-enabled memory-aware model-parallelism system for distributed DNN training.
Proceedings of the International Conference for High Performance Computing, 2020

Efficient Training of Semantic Image Segmentation on Summit using Horovod and MVAPICH2-GDR.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops, 2020

NV-group: link-efficient reduction for distributed deep learning on modern dense GPU systems.
Proceedings of the ICS '20: 2020 International Conference on Supercomputing, 2020

2019
Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast.
IEEE Trans. Parallel Distributed Syst., 2019

Optimized large-message broadcast for deep learning workloads: MPI, MPI+NCCL, or NCCL2?
Parallel Comput., 2019

HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training using TensorFlow.
CoRR, 2019

OMB-UM: Design, Implementation, and Evaluation of CUDA Unified Memory Aware MPI Benchmarks.
Proceedings of the 2019 IEEE/ACM Performance Modeling, 2019

Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera.
Proceedings of the Third IEEE/ACM Workshop on Deep Learning on Supercomputers, 2019

High performance distributed deep learning: a beginner's guide.
Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2019

Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters.
Proceedings of the 2019 IEEE International Conference on Cluster Computing, 2019

Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation.
Proceedings of the 19th IEEE/ACM International Symposium on Cluster, 2019

2018
Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
Proceedings of the 25th European MPI Users' Group Meeting, 2018

OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training.
Proceedings of the 25th IEEE International Conference on High Performance Computing, 2018

2017
An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures.
Proceedings of the Machine Learning on HPC Environments, 2017

S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters.
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017

Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning.
Proceedings of the 46th International Conference on Parallel Processing, 2017

2016
CUDA-Aware OpenSHMEM: Extensions and Designs for High Performance OpenSHMEM on GPU Clusters.
Parallel Comput., 2016

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning.
Proceedings of the 23rd European MPI Users' Group Meeting, EuroMPI 2016, 2016

CUDA M3: Designing Efficient CUDA Managed Memory-Aware MPI by Exploiting GDR and IPC.
Proceedings of the 23rd IEEE International Conference on High Performance Computing, 2016

CUDA Kernel Based Collective Reduction Operations on Large-scale GPU Clusters.
Proceedings of the IEEE/ACM 16th International Symposium on Cluster, 2016

2015
Designing Non-blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters.
Proceedings of the High Performance Computing - 30th International Conference, 2015

GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks.
Proceedings of the 22nd European MPI Users' Group Meeting, 2015

A Case for Non-blocking Collectives in OpenSHMEM: Design, Implementation, and Performance Evaluation using MVAPICH2-X.
Proceedings of the OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies, 2015

On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015

Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

2013
Privacy-aware searching with oblivious term matching for cloud storage.
J. Supercomput., 2013

An Energy Efficient Decoding Scheme for Wireless Body Area Sensor Networks.
CoRR, 2013

An MPI-IO Compliant Java Based Parallel I/O Library.
Proceedings of the 13th IEEE/ACM International Symposium on Cluster, 2013

REECH-ME: Regional Energy Efficient Cluster Heads Based on Maximum Energy Routing Protocol for WSNs.
Proceedings of the 2013 Eighth International Conference on Broadband and Wireless Computing, 2013

DREEM-ME: Distributed Regional Energy Efficient Multi-hop Routing Protocol Based on Maximum Energy in WSNs.
Proceedings of the 2013 Eighth International Conference on Broadband and Wireless Computing, 2013

2012
Towards Efficient Support for Parallel I/O in Java HPC.
Proceedings of the 13th International Conference on Parallel and Distributed Computing, 2012

Intercloud message exchange middleware.
Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication, 2012
