Hari Subramoni

Aswathnarayan Radhakrishnan

CoRR, April, 2026

Tracking Phenological Status and Ecological Interactions in a Hawaiian Cloud Forest Understory using Low-Cost Camera Traps and Visual Foundation Models.

CoRR, March, 2026

From Bands to Depth: Understanding Bathymetry Decisions on Sentinel-2.

Satyaki Roy Chowdhury

Hsiao Jou Hsu

Aswathnarayan Radhakrishnan

Joachim Moortgat

CoRR, January, 2026

From Bands to Depth: Understanding Bathymetry Decisions on Sentinel-2.

Satyaki Roy Chowdhury

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2026

2025

SmartWilds: Multimodal Wildlife Monitoring Dataset.

CoRR, September, 2025

OHIO: Enhancing RDMA Scalability in Alltoall With Optimized Communication Overlap.

Tu Tran

IEEE Micro, 2025

Understanding and Characterizing Communication Characteristics for Distributed Transformer Models.

IEEE Micro, 2025

A Streaming Collectives Interface Targeting Dataflow Acceleration and HPC Workloads.

Proceedings of the International Conference for High Performance Computing, 2025

Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer.

Proceedings of the Eighth Conference on Machine Learning and Systems, 2025

Unified Designs of Multi-Rail-Aware MPI Allreduce and Alltoall Operations Across Diverse GPU and Interconnect Systems.

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2025

Design and Optimization of GPU-Aware MPI Allreduce Using Direct Sendrecv Communication.

Proceedings of the 54th International Conference on Parallel Processing, 2025

Towards Dynamic Message Passing Protocols for Stencil-Based Communication Patterns.

Proceedings of the IEEE International Conference on Cluster Computing, 2025

2024

Cyberinfrastructure for machine learning applications in agriculture: experiences, analysis, and vision.

Frontiers Artif. Intell., 2024

Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer.

CoRR, 2024

Accelerating communication with multi-HCA aware collectives in MPI.

Concurr. Comput. Pract. Exp., 2024

OMB-CXL: A Micro-Benchmark Suite for Evaluating MPI Communication Utilizing Compute Express Link Memory Devices.

Proceedings of the Practice and Experience in Advanced Research Computing 2024: Human Powered Computing, 2024

Infer-HiRes: Accelerating Inference for High-Resolution Images with Quantization and Distributed Deep Learning.

Proceedings of the Practice and Experience in Advanced Research Computing 2024: Human Powered Computing, 2024

OMB-FPGA: A Microbenchmark Suite for FPGA-aware MPIs using OpenCL and SYCL.

Proceedings of the Practice and Experience in Advanced Research Computing 2024: Human Powered Computing, 2024

Design and Implementation of an IPC-based Collective MPI Library for Intel GPUs.

Proceedings of the Practice and Experience in Advanced Research Computing 2024: Human Powered Computing, 2024

Accelerating MPI AllReduce Communication with Efficient GPU-Based Compression Schemes on Modern GPU Clusters.

Proceedings of the ISC High Performance 2024 Research Paper Proceedings (39th International Conference), 2024

Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference.

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

PML-MPI: A Pre-Trained ML Framework for Efficient Collective Algorithm Selection in MPI.

Mingzhe Han

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

HINT: Designing Cache-Efficient MPI_Alltoall using Hybrid Memory Copy Ordering and Non-Temporal Instructions.

Nick Contini

Nawras Alnaasan

Mustafa Abduljabbar

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

Message from the HCW 2024 Technical Program Committee Co-Chairs.

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

The Case for Co-Designing Model Architectures with Hardware.

Proceedings of the 53rd International Conference on Parallel Processing, 2024

OHIO: Improving RDMA Network Scalability in MPI_Alltoall Through Optimized Hierarchical and Intra/Inter-Node Communication Overlap Design.

Tu Tran

Proceedings of the IEEE Symposium on High-Performance Interconnects, 2024

Demystifying the Communication Characteristics for Distributed Transformer Models.

Proceedings of the IEEE Symposium on High-Performance Interconnects, 2024

Characterizing Communication in Distributed Parameter-Efficient Fine-Tuning for Large Language Models.

Proceedings of the IEEE Symposium on High-Performance Interconnects, 2024

Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning.

Proceedings of the 31st IEEE International Conference on High Performance Computing, 2024

Using BlueField-3 SmartNICs to Offload Vector Operations in Krylov Subspace Methods.

Proceedings of the 31st IEEE International Conference on High Performance Computing, 2024

Effective and Efficient Offloading Designs for One-Sided Communication to SmartNICs.

Proceedings of the 31st IEEE International Conference on High Performance Computing, 2024

Design and Implementation of Kernel-based MPI Reduction Operations for Intel GPUs.

Proceedings of the 31st IEEE International Conference on High Performance Computing, 2024

HyperSack: Distributed Hyperparameter Optimization for Deep Learning using Resource-Aware Scheduling on Heterogeneous GPU Systems.

Proceedings of the 31st IEEE International Conference on High Performance Computing, 2024

Accelerating Large Language Model Training with Hybrid GPU-based Compression.

Proceedings of the 24th IEEE International Symposium on Cluster, 2024

MPI Allgather Utilizing CXL Shared Memory Pool in Multi-Node Computing Systems.

Proceedings of the IEEE International Conference on Big Data, 2024

2023

High Performance MPI over the Slingshot Interconnect.

J. Comput. Sci. Technol., February, 2023

Network-Assisted Noncontiguous Transfers for GPU-Aware MPI Libraries.

IEEE Micro, 2023

Performance Characterization of using Quantization for DNN Inference on Edge Devices: Extended Version.

CoRR, 2023

DPU-Bench: A Micro-Benchmark Suite to Measure Offload Efficiency Of SmartNICs.

Steve Poole

Proceedings of the Practice and Experience in Advanced Research Computing, 2023

Optimizing Amber for Device-to-Device GPU Communication.

Proceedings of the Practice and Experience in Advanced Research Computing, 2023

SAI: AI-Enabled Speech Assistant Interface for Science Gateways in HPC.

Proceedings of the High Performance Computing - 38th International Conference, 2023

Democratizing HPC Access and Use with Knowledge Graphs.

Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023

MPI-xCCL: A Portable MPI Library over Collective Communication Libraries for Various Accelerators.

Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023

Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication.

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

A Novel Framework for Efficient Offloading of Communication Operations to Bluefield SmartNICs.

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

In-Depth Evaluation of a Lower-Level Direct-Verbs API on InfiniBand-based Clusters: Early Experiences.

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

Designing and Optimizing GPU-aware Nonblocking MPI Neighborhood Collective Communication for PETSc.

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

MCR-DL: Mix-and-Match Communication Runtime for Deep Learning.

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

Enabling Reconfigurable HPC through MPI-based Inter-FPGA Communication.

Nicholas Contini

Proceedings of the 37th International Conference on Supercomputing, 2023

Performance Characterization of Using Quantization for DNN Inference on Edge Devices.

Proceedings of the 7th IEEE International Conference on Fog and Edge Computing, 2023

Designing In-network Computing Aware Reduction Collectives in MPI.

Proceedings of the IEEE Symposium on High-Performance Interconnects, 2023

Battle of the BlueFields: An In-Depth Comparison of the BlueField-2 and BlueField-3 SmartNICs.

Stephen W. Poole

Proceedings of the IEEE Symposium on High-Performance Interconnects, 2023

Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference.

Proceedings of the 30th IEEE International Conference on High Performance Computing, 2023

Optimized All-to-All Connection Establishment for High-Performance MPI Libraries Over InfiniBand.

Mustafa Abduljabbar

Proceedings of the 30th IEEE International Conference on High Performance Computing, 2023

Implementing and Optimizing a GPU-aware MPI Library for Intel GPUs: Early Experiences.

Proceedings of the 23rd IEEE/ACM International Symposium on Cluster, 2023

ScaMP: Scalable Meta-Parallelism for Deep Learning Search.

Proceedings of the 23rd IEEE/ACM International Symposium on Cluster, 2023

HARVEST: High-Performance Artificial Vision Framework for Expert Labeling using Semi-Supervised Training.

Proceedings of the IEEE International Conference on Big Data, 2023

MPI4Spark Meets YARN: Enhancing MPI4Spark through YARN support for HPC.

Proceedings of the IEEE International Conference on Big Data, 2023

Benchmarking Modern Databases for Storing and Profiling Very Large Scale HPC Communication Data.

Proceedings of the Benchmarking, Measuring, and Optimizing, 2023

2022

Optimizing Distributed DNN Training Using CPUs and BlueField-2 DPUs.

IEEE Micro, 2022

High Performance MPI over the Slingshot Interconnect: Early Experiences.

Proceedings of the PEARC '22: Practice and Experience in Advanced Research Computing, Boston, MA, USA, July 10, 2022

Accelerating MPI All-to-All Communication with Online Compression on Modern GPU Clusters.

Qinghua Zhou

Quentin Anthony

Proceedings of the High Performance Computing - 37th International Conference, 2022

"Hey CAI" - Conversational AI Enabled User Interface for HPC Tools.

Proceedings of the High Performance Computing - 37th International Conference, 2022

Hy-Fi: Hybrid Five-Dimensional Parallel DNN Training on High-Performance GPU Clusters.

Proceedings of the High Performance Computing - 37th International Conference, 2022

Arm meets Cloud: A Case Study of MPI Library Performance on AWS Arm-based HPC Cloud with Elastic Fabric Adapter.

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2022

Highly Efficient Alltoall and Alltoallv Communication Algorithms for GPU Systems.

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2022

OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems.

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2022

Towards Java-based HPC using the MVAPICH2 Library: Early Experiences.

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2022

Designing Hierarchical Multi-HCA Aware Allgather in MPI.

Proceedings of the Workshop Proceedings of the 51st International Conference on Parallel Processing, 2022

Network Assisted Non-Contiguous Transfers for GPU-Aware MPI Libraries.

Proceedings of the IEEE Symposium on High-Performance Interconnects, 2022

Accelerating Broadcast Communication with GPU Compression for Deep Learning Workloads.

Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022

Efficient Personalized and Non-Personalized Alltoall Communication for Modern Multi-HCA GPU-Based Clusters.

Akshay Paniraja Guptha

Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022

Designing Efficient Pipelined Communication Schemes using Compression in MPI Libraries.

Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022

AccDP: Accelerated Data-Parallel Distributed DNN Training for Modern GPU-Based HPC Clusters.

Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022

Lightning Talks of EduHPC 2022.

Proceedings of the IEEE/ACM International Workshop on Education for High Performance Computing, 2022

Spark Meets MPI: Towards High-Performance Communication Framework for Spark using MPI.

Proceedings of the IEEE International Conference on Cluster Computing, 2022

2021

The MVAPICH project: Transforming research into high-performance MPI library for HPC community.

Dhabaleswar Kumar Panda

J. Comput. Sci., 2021

Cross-layer Visualization and Profiling of Network and I/O Communication for HPC Clusters.

CoRR, 2021

INAM: Cross-stack Profiling and Analysis of Communication in MPI-based Applications.

Kamal Raj Sankarapandian Dayala Ganesh Ram

Proceedings of the PEARC '21: Practice and Experience in Advanced Research Computing, 2021

Designing a ROCm-Aware MPI Library for AMD GPUs: Early Experiences.

Proceedings of the High Performance Computing - 36th International Conference, 2021

BluesMPI: Efficient MPI Non-blocking Alltoall Offloading Designs on Modern BlueField Smart NICs.

Nick Sarkauskas

Proceedings of the High Performance Computing - 36th International Conference, 2021

Designing High-Performance MPI Libraries with On-the-fly Compression for Modern GPU Clusters.

Proceedings of the 35th IEEE International Parallel and Distributed Processing Symposium, 2021

SUPER: SUb-Graph Parallelism for TransformERs.

Proceedings of the 35th IEEE International Parallel and Distributed Processing Symposium, 2021

Scaling Single-Image Super-Resolution Training on Modern HPC Clusters: Early Experiences.

Quentin Anthony

Lang Xu

Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops, 2021

Accelerating CPU-based Distributed DNN Training on Modern HPC Clusters using BlueField-2 DPUs.

Proceedings of the IEEE Symposium on High-Performance Interconnects, 2021

Layout-aware Hardware-assisted Designs for Derived Data Types in MPI.

Proceedings of the 28th IEEE International Conference on High Performance Computing, 2021

Large-Message Nonblocking MPI_Iallgather and MPI_Ibcast Offload via BlueField-2 DPU.

Nick Sarkauskas

Proceedings of the 28th IEEE International Conference on High Performance Computing, 2021

DistMILE: A Distributed Multi-Level Framework for Scalable Graph Embedding.

Srinivasan Parthasarathy

Proceedings of the 28th IEEE International Conference on High Performance Computing, 2021

Towards Architecture-aware Hierarchical Communication Trees on Modern HPC Systems.

Proceedings of the 28th IEEE International Conference on High Performance Computing, 2021

Efficient MPI-based Communication for GPU-Accelerated Dask Applications.

Proceedings of the 21st IEEE/ACM International Symposium on Cluster, 2021

Adaptive and Hierarchical Large Message All-to-all Communication Algorithms for Large-scale Dense GPU Systems.

Proceedings of the 21st IEEE/ACM International Symposium on Cluster, 2021

2020

FALCON-X: Zero-copy MPI derived datatype processing on modern CPU and GPU architectures.

J. Parallel Distributed Comput., 2020

EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications.

Concurr. Comput. Pract. Exp., 2020

Accelerated Real-time Network Monitoring and Profiling at Scale using OSU INAM.

Proceedings of the PEARC '20: Practice and Experience in Advanced Research Computing, 2020

Communication-Aware Hardware-Assisted MPI Overlap Engine.

Proceedings of the High Performance Computing - 35th International Conference, 2020

HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training with TensorFlow.

Proceedings of the High Performance Computing - 35th International Conference, 2020

MPI Meets Cloud: Case Study with Amazon EC2 and Microsoft Azure.

Proceedings of the Fourth IEEE/ACM Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, 2020

GEMS: GPU-enabled memory-aware model-parallelism system for distributed DNN training.

Arpan Jain

Asmaa M. Aljuhani

Proceedings of the International Conference for High Performance Computing, 2020

Accelerating GPU-based Machine Learning in Python using MPI Library: A Case Study with MVAPICH2-GDR.

Quentin Anthony

Proceedings of the 6th IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments, 2020

Scalable MPI Collectives using SHARP: Large Scale Performance Evaluation on the TACC Frontera System.

Nick Sarkauskas

Proceedings of the Workshop on Exascale MPI, 2020

Performance Characterization of Network Mechanisms for Non-Contiguous Data Transfers in MPI.

Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops, 2020

Analyzing and Understanding the Impact of Interconnect Performance on HPC, Big Data, and Deep Learning Applications: A Case Study with InfiniBand EDR and HDR.

Amit Ruhela

Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops, 2020

Machine-agnostic and Communication-aware Designs for MPI on Emerging Architectures.

Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

Efficient Training of Semantic Image Segmentation on Summit using Horovod and MVAPICH2-GDR.

Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops, 2020

NV-group: link-efficient reduction for distributed deep learning on modern dense GPU systems.

Proceedings of the ICS '20: 2020 International Conference on Supercomputing, 2020

Blink: Towards Efficient RDMA-based Communication Coroutines for Parallel Python Applications.

Proceedings of the 27th IEEE International Conference on High Performance Computing, 2020

Dynamic Kernel Fusion for Bulk Non-contiguous Data Transfer on GPU Clusters.

Qinghua Zhou

Proceedings of the IEEE International Conference on Cluster Computing, 2020

Design and Characterization of InfiniBand Hardware Tag Matching in MPI.

Proceedings of the 20th IEEE/ACM International Symposium on Cluster, 2020

2019

Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast.

IEEE Trans. Parallel Distributed Syst., 2019

Efficient design for MPI asynchronous progress without dedicated resources.

Amit Ruhela

Parallel Comput., 2019

Optimized large-message broadcast for deep learning workloads: MPI, MPI+NCCL, or NCCL2?

Parallel Comput., 2019

HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training using TensorFlow.

CoRR, 2019

Performance Evaluation of MPI Libraries on GPU-Enabled OpenPOWER Architectures: Early Experiences.

Proceedings of the High Performance Computing, 2019

Design and Evaluation of Shared Memory Communication Benchmarks on Emerging Architectures using MVAPICH2.

Proceedings of the IEEE/ACM Third Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, 2019

Leveraging Network-level parallelism with Multiple Process-Endpoints for MPI Broadcast.

Proceedings of the IEEE/ACM Third Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, 2019

OMB-UM: Design, Implementation, and Evaluation of CUDA Unified Memory Aware MPI Benchmarks.

Proceedings of the 2019 IEEE/ACM Performance Modeling, 2019

Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera.

Proceedings of the Third IEEE/ACM Workshop on Deep Learning on Supercomputers, 2019

High performance distributed deep learning: a beginner's guide.

Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2019

FALCON: Efficient Designs for Zero-Copy MPI Datatype Processing on Emerging Architectures.

Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019

Communication Profiling and Characterization of Deep Learning Workloads on Clusters with High-Performance Interconnects.

Proceedings of the 2019 IEEE Symposium on High-Performance Interconnects, 2019

Designing Scalable and High-Performance MPI Libraries on Amazon Elastic Fabric Adapter.

Proceedings of the 2019 IEEE Symposium on High-Performance Interconnects, 2019

Designing a Profiling and Visualization Tool for Scalable and In-depth Analysis of High-Performance GPU Clusters.

Proceedings of the 26th IEEE International Conference on High Performance Computing, 2019

High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems.

Proceedings of the 26th IEEE International Conference on High Performance Computing, 2019

Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters.

Proceedings of the 2019 IEEE International Conference on Cluster Computing, 2019

Design and Characterization of Shared Address Space MPI Collectives on Modern Architectures.

Proceedings of the 19th IEEE/ACM International Symposium on Cluster, 2019

Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation.

Proceedings of the 19th IEEE/ACM International Symposium on Cluster, 2019

Characterizing CUDA Unified Memory (UM)-Aware MPI Designs on Modern GPU Architectures.

Proceedings of the 12th Workshop on General Purpose Processing Using GPUs, 2019

2018

MPI performance engineering with the MPI tool interface: The integration of MVAPICH and TAU.

Parallel Comput., 2018

Networking and communication challenges for post-exascale systems.

Xiaoyi Lu

Frontiers Inf. Technol. Electron. Eng., 2018

Cooperative rendezvous protocols for improved performance and overlap.

Proceedings of the International Conference for High Performance Computing, 2018

Efficient Asynchronous Communication Progress for MPI without Dedicated Resources.

Amit Ruhela

Proceedings of the 25th European MPI Users' Group Meeting, 2018

Multi-Threading and Lock-Free MPI RMA Based Graph Processing on KNL and POWER Architectures.

Proceedings of the 25th European MPI Users' Group Meeting, 2018

Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?

Proceedings of the 25th European MPI Users' Group Meeting, 2018

Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores.

Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018

OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training.

Proceedings of the 25th IEEE International Conference on High Performance Computing, 2018

SALaR: Scalable and Adaptive Designs for Large Message Reduction Collectives.

Proceedings of the IEEE International Conference on Cluster Computing, 2018

2017

Designing Dynamic and Adaptive MPI Point-to-Point Communication Protocols for Efficient Overlap of Computation and Communication.

Proceedings of the High Performance Computing - 32nd International Conference, 2017

Scalable reduction collectives with data partitioning-based multi-leader design.

Proceedings of the International Conference for High Performance Computing, 2017

An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures.

Proceedings of the Machine Learning on HPC Environments, 2017

MPI performance engineering with the MPI tool interface: the integration of MVAPICH and TAU.

Proceedings of the 24th European MPI Users' Group Meeting, 2017

Exploiting and Evaluating OpenSHMEM on KNL Architecture.

Mingzhe Li

Proceedings of the OpenSHMEM and Related Technologies. Big Compute and Big Data Convergence, 2017

Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning.

Bracy Elton

Proceedings of the 46th International Conference on Parallel Processing, 2017

Designing Registration Caching Free High-Performance MPI Library with Implicit On-Demand Paging (ODP) of InfiniBand.

Proceedings of the 24th IEEE International Conference on High Performance Computing, 2017

Kernel-Assisted Communication Engine for MPI on Emerging Manycore Processors.

Khaled Hamidouche

Proceedings of the 24th IEEE International Conference on High Performance Computing, 2017

A Scalable Network-Based Performance Analysis Tool for MPI on Large-Scale HPC Systems.

Xiaoyi Lu

Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

Contention-Aware Kernel-Assisted MPI Collectives for Multi-/Many-Core Systems.

Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

2016

CUDA-Aware OpenSHMEM: Extensions and Designs for High Performance OpenSHMEM on GPU Clusters.

Parallel Comput., 2016

INAM2: InfiniBand Network Analysis and Monitoring with MPI.

Albert Mathews Augustine

Proceedings of the High Performance Computing - 31st International Conference, 2016

Designing MPI library with on-demand paging (ODP) of infiniband: challenges and benefits.

Proceedings of the International Conference for High Performance Computing, 2016

Efficient Reliability Support for Hardware Multicast-Based Broadcast in GPU-enabled Streaming Applications.

Proceedings of the First International Workshop on Communication Optimizations in HPC, 2016

Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters.

Proceedings of the 28th International Symposium on Computer Architecture and High Performance Computing, 2016

Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-Enabled Systems.

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

System-Level Scalable Checkpoint-Restart for Petascale Computing.

Proceedings of the 22nd IEEE International Conference on Parallel and Distributed Systems, 2016

Adaptive and Dynamic Design for MPI Tag Matching.

Proceedings of the 2016 IEEE International Conference on Cluster Computing, 2016

Impact of HPC Cloud Networking Technologies on Accelerating Hadoop RPC and HBase.

Proceedings of the 2016 IEEE International Conference on Cloud Computing Technology and Science, 2016

SHMEMPMI - Shared Memory Based PMI for Improved Performance and Scalability.

Proceedings of the IEEE/ACM 16th International Symposium on Cluster, 2016

2015

Designing Non-blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters.

Proceedings of the High Performance Computing - 30th International Conference, 2015

GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks.

Proceedings of the 22nd European MPI Users' Group Meeting, 2015

On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI.

Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015

Impact of InfiniBand DC Transport Protocol on Energy Consumption of All-to-All Collective Algorithms.

Proceedings of the 23rd IEEE Annual Symposium on High-Performance Interconnects, 2015

Offloaded GPU Collectives Using CORE-Direct and CUDA Capabilities on InfiniBand Clusters.

Proceedings of the 22nd IEEE International Conference on High Performance Computing, 2015

High Performance MPI Datatype Support with User-Mode Memory Registration: Challenges, Designs, and Benefits.

Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters.

Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

Non-Blocking PMI Extensions for Fast MPI Startup.

Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015

2014

Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences.

Proceedings of the Supercomputing - 29th International Conference, 2014

PMI Extensions for Scalable MPI Startup.

Proceedings of the 21st European MPI Users' Group Meeting, 2014

Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models.

Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, 2014

Designing Topology-Aware Communication Schedules for Alltoall Operations in Large InfiniBand Clusters.

Proceedings of the 43rd International Conference on Parallel Processing, 2014

Wide-area overlay networking to manage science DMZ accelerated flows.

Proceedings of the International Conference on Computing, Networking and Communications, 2014

A high performance broadcast design with hardware multicast and GPUDirect RDMA for streaming applications on Infiniband clusters.

Proceedings of the 21st International Conference on High Performance Computing, 2014

2013

MVAPICH-PRISM: a proxy-based communication framework using InfiniBand and SCIF for Intel MIC clusters.

Proceedings of the International Conference for High Performance Computing, 2013

High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand.

Proceedings of the 2013 IEEE International Symposium on Parallel & Distributed Processing, 2013

Extending OpenSHMEM for GPU Computing.

Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

MIC-RO: enabling efficient remote offload on heterogeneous many integrated core (MIC) clusters with InfiniBand.

Khaled Hamidouche

Proceedings of the International Conference on Supercomputing, 2013

High-Performance Design of Hadoop RPC with RDMA over InfiniBand.

Proceedings of the 42nd International Conference on Parallel Processing, 2013

A Novel Functional Partitioning Approach to Design High-Performance MPI-3 Non-blocking Alltoallv Collective on Multi-core Systems.

Proceedings of the 42nd International Conference on Parallel Processing, 2013

Design of network topology aware scheduling services for large InfiniBand clusters.

Devendar Bureddy

Proceedings of the 2013 IEEE International Conference on Cluster Computing, 2013

2012

Design of a scalable InfiniBand topology service to enable network-topology-aware placement of processes.

Raghunath Rajachandrasekar

Proceedings of the SC Conference on High Performance Computing Networking, 2012

High performance RDMA-based design of HDFS over InfiniBand.

Nusrat S. Islam

Md. Wasi-ur-Rahman

Jithin Jose

Proceedings of the SC Conference on High Performance Computing Networking, 2012

Understanding the communication characteristics in HBase: What are the fundamental bottlenecks?

Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems & Software, 2012

Designing Network Failover and Recovery in MPI for Multi-Rail InfiniBand Clusters.

S. Pai Raikar

Jérôme Vienne

Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012

Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers.

Bronis R. de Supinski

Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

High-Performance Design of HBase with RDMA over InfiniBand.

Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

Performance Analysis and Evaluation of InfiniBand FDR and 40GigE RoCE on HPC and Cloud Computing Systems.

Proceedings of the IEEE 20th Annual Symposium on High-Performance Interconnects, 2012

A Scalable InfiniBand Network Topology-Aware Performance Analysis Tool for MPI.

Jérôme Vienne

Raghunath Rajachandrasekar

Proceedings of the Euro-Par 2012: Parallel Processing Workshops, 2012

Minimizing Network Contention in InfiniBand Clusters with a QoS-Aware Data-Staging Framework.

Jai Jaswani

Proceedings of the 2012 IEEE International Conference on Cluster Computing, 2012

Can Network-Offload Based Non-blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms?

Proceedings of the 2012 IEEE International Conference on Cluster Computing Workshops, 2012

Scalable Memcached Design for InfiniBand Clusters Using Hybrid Transports.

Jithin Jose

Proceedings of the 12th IEEE/ACM International Symposium on Cluster, 2012

2011

Collective Communication, Network Support For.

Proceedings of the Encyclopedia of Parallel Computing, 2011

High-performance and scalable non-blocking all-to-all with collective offload on InfiniBand clusters: a study with parallel 3D FFT.

Comput. Sci. Res. Dev., 2011

Codesign for InfiniBand Clusters.

Karen Tomko

Computer, 2011

Memcached Design on High Performance RDMA Capable Interconnects.

Proceedings of the International Conference on Parallel Processing, 2011

Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL.

Proceedings of the IEEE 19th Annual Symposium on High Performance Interconnects, 2011

INAM - A Scalable InfiniBand Network Analysis and Monitoring Tool.

N. Dandapanthula

Jérôme Vienne

Ron Brightwell

Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

Design and Evaluation of Network Topology-/Speed- Aware Broadcast Algorithms for InfiniBand Clusters.

Proceedings of the 2011 IEEE International Conference on Cluster Computing (CLUSTER), 2011

2010

Intra-Socket and Inter-Socket Communication in Multi-core Systems.

IEEE Comput. Archit. Lett., 2010

Streaming, low-latency communication in on-line trading systems.

Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Designing topology-aware collective communication algorithms for large scale InfiniBand clusters: Case studies with Scatter and Gather.

Abhinav Vishnu

Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

High Performance Design and Implementation of Nemesis Communication Layer for Two-Sided and One-Sided MPI Semantics in MVAPICH2.

Miao Luo

Ping Lai

Emilio Pasquale Mancini

Proceedings of the 39th International Conference on Parallel Processing, 2010

Improving Application Performance and Predictability Using Multiple Virtual Lanes in Modern Multi-core InfiniBand Clusters.

Proceedings of the 39th International Conference on Parallel Processing, 2010

Design and Evaluation of Generalized Collective Communication Primitives with Overlap Using ConnectX-2 Offload Engine.

Proceedings of the IEEE 18th Annual Symposium on High Performance Interconnects, 2010

High Performance Data Transfer in Grid Environment Using GridFTP over InfiniBand.

Proceedings of the 10th IEEE/ACM International Conference on Cluster, 2010

High Performance Topology-Aware Communication in Multicore Processors.

Proceedings of the Scientific Computing with Multicore and Accelerators, 2010

2009

Designing multi-leader-based Allgather algorithms for multi-core clusters.

Gopalakrishnan Santhanaraman

Matthew J. Koop

Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

Designing Efficient FTP Mechanisms for High Performance Data-Transfer over InfiniBand.

Proceedings of the ICPP 2009, 2009

Designing Next Generation Clusters: Evaluation of InfiniBand DDR/QDR on Intel Computing Platforms.

Matthew J. Koop