Ching-Hsiang Chu

Orcid: 0000-0002-6752-3135

According to our database, Ching-Hsiang Chu authored at least 36 papers between 2011 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2024
Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large-Scale Recommendation.
CoRR, 2024

2023
Better Together: Jointly Optimizing ML Collective Scheduling and Execution Planning using SYNDICATE.
Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation, 2023

2021
The MVAPICH project: Transforming research into high-performance MPI library for HPC community.
J. Comput. Sci., 2021

High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models.
CoRR, 2021

Designing a ROCm-Aware MPI Library for AMD GPUs: Early Experiences.
Proceedings of the High Performance Computing - 36th International Conference, 2021

Adaptive and Hierarchical Large Message All-to-all Communication Algorithms for Large-scale Dense GPU Systems.
Proceedings of the 21st IEEE/ACM International Symposium on Cluster, 2021

2020
Communication Profiling and Characterization of Deep-Learning Workloads on Clusters With High-Performance Interconnects.
IEEE Micro, 2020

FALCON-X: Zero-copy MPI derived datatype processing on modern CPU and GPU architectures.
J. Parallel Distributed Comput., 2020

NV-group: link-efficient reduction for distributed deep learning on modern dense GPU systems.
Proceedings of the ICS '20: 2020 International Conference on Supercomputing, 2020

Dynamic Kernel Fusion for Bulk Non-contiguous Data Transfer on GPU Clusters.
Proceedings of the IEEE International Conference on Cluster Computing, 2020

2019
Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast.
IEEE Trans. Parallel Distributed Syst., 2019

Optimized large-message broadcast for deep learning workloads: MPI, MPI+NCCL, or NCCL2?
Parallel Comput., 2019

Performance Evaluation of MPI Libraries on GPU-Enabled OpenPOWER Architectures: Early Experiences.
Proceedings of the High Performance Computing, 2019

OMB-UM: Design, Implementation, and Evaluation of CUDA Unified Memory Aware MPI Benchmarks.
Proceedings of the 2019 IEEE/ACM Performance Modeling, 2019

C-GDR: High-Performance Container-Aware GPUDirect MPI Communication Schemes on RDMA Networks.
Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019

Designing a Profiling and Visualization Tool for Scalable and In-depth Analysis of High-Performance GPU Clusters.
Proceedings of the 26th IEEE International Conference on High Performance Computing, 2019

High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems.
Proceedings of the 26th IEEE International Conference on High Performance Computing, 2019

Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation.
Proceedings of the 19th IEEE/ACM International Symposium on Cluster, 2019

Characterizing CUDA Unified Memory (UM)-Aware MPI Designs on Modern GPU Architectures.
Proceedings of the 12th Workshop on General Purpose Processing Using GPUs, 2019

2018
Distributed Topology Control for Energy-Efficient and Reliable Wireless Communications.
IEEE Syst. J., 2018

Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
Proceedings of the 25th European MPI Users' Group Meeting, 2018

Designing High-Performance In-Memory Key-Value Operations with Persistent GPU Kernels and OpenSHMEM.
Proceedings of the OpenSHMEM and Related Technologies. OpenSHMEM in the Era of Extreme Heterogeneity, 2018

OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training.
Proceedings of the 25th IEEE International Conference on High Performance Computing, 2018

2017
MPI-GDS: High Performance MPI Designs with GPUDirect-aSync for CPU-GPU Control Flow Decoupling.
Proceedings of the 46th International Conference on Parallel Processing, 2017

Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning.
Proceedings of the 46th International Conference on Parallel Processing, 2017

2016
CUDA-Aware OpenSHMEM: Extensions and Designs for High Performance OpenSHMEM on GPU Clusters.
Parallel Comput., 2016

Efficient Reliability Support for Hardware Multicast-Based Broadcast in GPU-enabled Streaming Applications.
Proceedings of the First International Workshop on Communication Optimizations in HPC, 2016

Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters.
Proceedings of the 28th International Symposium on Computer Architecture and High Performance Computing, 2016

Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-Enabled Systems.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

CUDA Kernel Based Collective Reduction Operations on Large-scale GPU Clusters.
Proceedings of the IEEE/ACM 16th International Symposium on Cluster, 2016

2015
A Case for Non-blocking Collectives in OpenSHMEM: Design, Implementation, and Performance Evaluation using MVAPICH2-X.
Proceedings of the OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies, 2015

Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

2014
Measurement of long-distance Wi-Fi connections: An empirical study.
Proceedings of the IEEE International Conference on Communications, 2014

2013
Channel condition self-clocked packet scheduling scheme for wireless networks.
EURASIP J. Wirel. Commun. Netw., 2013

2011
Improving SCTP Performance by Jitter-Based Congestion Control over Wired-Wireless Networks.
EURASIP J. Wirel. Commun. Netw., 2011
