Ahmad Afsahi

Ilias S. Kotsireas

Int. J. Parallel Program., September, 2026

Accelerating Intra-Node GPU-to-GPU Communication Through Multi-Path Transfers with CUDA Graphs.

Amirreza Barati Sedeh

Hamed Sharifian

CoRR, April, 2026

2025

Accelerating Intra-Node GPU Communication: A Performance Model for Multi-Path Transfers.

Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, 2025

Utilizing Network Hardware Parallelism for MPI Partitioned Collective Communication.

Amirreza Barati Sedeh

Whit Schonbein

Proceedings of the 33rd Euromicro International Conference on Parallel, 2025

Collaborative Bandwidth-Efficient Intra-Node Allreduce.

Proceedings of the 2025 IEEE International Parallel and Distributed Processing Symposium, 2025

Cascade: a Collaborative Algorithm for Scalable and Efficient Neighborhood Allgather.

Hamed Sharifian

Proceedings of the IEEE International Conference on Cluster Computing, 2025

2024

ROCm-Aware Leader-based Designs for MPI Neighbourhood Collectives.

Mahdieh Gazimirsaeed

Proceedings of the ISC High Performance 2024 Research Paper Proceedings (39th International Conference), 2024

Design and Implementation of MPI-Native GPU-Initiated MPI Partitioned Communication.

Proceedings of the SC24-W: Workshops of the International Conference for High Performance Computing, 2024

Enhancing Intra-Node GPU-to-GPU Performance in MPI+UCX through Multi-Path Communication.

Yiltan Hassan Temucin

Proceedings of the 3rd International Workshop on Extreme Heterogeneity Solutions, 2024

A Topology- and Load-Aware Design for Neighborhood Allgather.

Hamed Sharifian

Proceedings of the IEEE International Conference on Cluster Computing, 2024

2023

A Dynamic Network-Native MPI Partitioned Aggregation Over InfiniBand Verbs.

Proceedings of the IEEE International Conference on Cluster Computing, 2023

2022

Accelerating Deep Learning Using Interconnect-Aware UCX Communication for MPI Collectives.

IEEE Micro, 2022

Efficient Process Arrival Pattern Aware Collective Communication for Deep Learning.

Pedram Alizadeh

Proceedings of the EuroMPI/USA'22: 29th European MPI Users' Group Meeting, Chattanooga, TN, USA, September 26, 2022

Micro-Benchmarking MPI Partitioned Point-to-Point Communication.

Proceedings of the 51st International Conference on Parallel Processing, 2022

2021

Efficient Multi-Path NVLink/PCIe-Aware UCX based Collective Communication for Deep Learning.

Pedram Alizadeh

Proceedings of the IEEE Symposium on High-Performance Interconnects, 2021

2020

Communication-aware message matching in MPI.

Concurr. Comput. Pract. Exp., 2020

2019

A dynamic, unified design for dedicated message matching engines for collective and point-to-point communications.

Parallel Comput., 2019

An Efficient Collaborative Communication Mechanism for MPI Neighborhood Collectives.

Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019

Fuzzy Matching: Hardware Accelerated MPI Communication Middleware.

Matthew G. F. Dosanjh

Whit Schonbein

Patrick G. Bridges

Proceedings of the 19th IEEE/ACM International Symposium on Cluster, 2019

2018

Design considerations for GPU-aware collective communications in MPI.

Concurr. Comput. Pract. Exp., 2018

A Dedicated Message Matching Mechanism for Collective Communications.

Proceedings of the 47th International Conference on Parallel Processing, 2018

The Case for Semi-Permanent Cache Occupancy: Understanding the Impact of Data Locality on Network Processing.

Matthew G. F. Dosanjh

Whit Schonbein

Michael J. Levenhagen

Patrick G. Bridges

Proceedings of the 47th International Conference on Parallel Processing, 2018

2017

Exploiting heterogeneity of communication channels for efficient GPU selection on multi-GPU nodes.

Parallel Comput., 2017

Exploiting Common Neighborhoods to Optimize MPI Neighborhood Collectives.

Jesper Larsson Träff

Pavan Balaji

Proceedings of the 24th IEEE International Conference on High Performance Computing, 2017

2016

MAGC: A Mapping Approach for GPU Clusters.

Proceedings of the 28th International Symposium on Computer Architecture and High Performance Computing, 2016

Topology-Aware Rank Reordering for MPI Collectives.

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

PTRAM: A Parallel Topology- and Routing-Aware Mapping Framework for Large-Scale HPC Systems.

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

Topology-Aware GPU Selection on Multi-GPU Nodes.

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

2015

Scalable Network Communication Using Unreliable RDMA.

Proceedings of the Handbook on Data Centers, 2015

Scalable connectionless RDMA over unreliable datagrams.

Parallel Comput., 2015

Hyper-Q aware intranode MPI collectives on the GPU.

Proceedings of the First International Workshop on Extreme Scale Programming Models and Middleware, 2015

2014

Extreme-scale computing services over MPI: Experiences, observations and features proposal for next-generation message passing interface.

Int. J. High Perform. Comput. Appl., 2014

A fast and resource-conscious MPI message queue mechanism for large-scale jobs.

Future Gener. Comput. Syst., 2014

Nonblocking Epochs in MPI One-Sided Communication.

Proceedings of the International Conference for High Performance Computing, 2014

Intra-Epoch Message Scheduling To Exploit Unused or Residual Overlapping Potential.

Proceedings of the 21st European MPI Users' Group Meeting, 2014

GPU-Aware Intranode MPI_Allreduce.

Proceedings of the 21st European MPI Users' Group Meeting, 2014

2013

Using MPI in high-performance computing services.

Proceedings of the 20th European MPI Users' Group Meeting, 2013

Mercury: Enabling remote procedure call for high-performance computing.

Proceedings of the 2013 IEEE International Conference on Cluster Computing, 2013

Toward Asynchronous and MPI-Interoperable Active Messages.

Proceedings of the 13th IEEE/ACM International Symposium on Cluster, 2013

2012

An Efficient MPI Message Queue Mechanism for Large-scale Jobs.

Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems, 2012

A study of hardware performance monitoring counter selection in power modeling of computing systems.

Proceedings of the 2012 International Green Computing Conference, 2012

Designing an Offloaded Nonblocking MPI_Allgather Collective Using CORE-Direct.

Grigori Inozemtsev

Proceedings of the 2012 IEEE International Conference on Cluster Computing, 2012

2011

Process Arrival Pattern Aware Alltoall and Allgather on InfiniBand Clusters.

Int. J. Parallel Program., 2011

Exploiting application buffer reuse to improve MPI small message transfer protocols over RDMA-enabled networks.

Clust. Comput., 2011

Multi-core and Network Aware MPI Topology Functions.

Proceedings of the Recent Advances in the Message Passing Interface, 2011

RDMA Capable iWARP over Datagrams.

Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

Investigating Scenario-Conscious Asynchronous Rendezvous over RDMA.

Proceedings of the 2011 IEEE International Conference on Cluster Computing (CLUSTER), 2011

2010

Adaptive estimation and prediction of power and performance in high performance computing.

Comput. Sci. Res. Dev., 2010

A study of hardware assisted IP over InfiniBand and its impact on enterprise data center performance.

Pavan Balaji

Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2010

iWARP redefined: Scalable connectionless communication over high-speed Ethernet.

Proceedings of the 2010 International Conference on High Performance Computing, 2010

2009

A Speculative and Adaptive MPI Rendezvous Protocol Over RDMA-enabled Interconnects.

Int. J. Parallel Program., 2009

Improving energy efficiency of asymmetric chip multithreaded multiprocessors through reduced OS noise scheduling.

Concurr. Comput. Pract. Exp., 2009

Process Arrival Pattern and Shared Memory Aware Alltoall on InfiniBand.

Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2009

Improving RDMA-based MPI eager protocol for frequently-used buffers.

Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

Evaluation of ConnectX Virtual Protocol Interconnect for Data Centers.

Pavan Balaji

Proceedings of the 15th IEEE International Conference on Parallel and Distributed Systems, 2009

2008

Efficient shared memory and RDMA based collectives on multi-rail QsNet II SMP clusters.

Clust. Comput., 2008

An Analysis of QoS Provisioning for Sockets Direct Protocol vs. IPoIB over Modern InfiniBand Networks.

Proceedings of the 37th International Conference on Parallel Processing, 2008

Improving Communication Progress and Overlap in MPI Rendezvous Protocol over RDMA-enabled Interconnects.

Proceedings of the 22nd Annual International Symposium on High Performance Computing Systems and Applications (HPCS 2008), 2008

2007

10-Gigabit iWARP Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G.

Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

A Comprehensive Analysis of OpenMP Applications on Dual-Core Intel Xeon SMPs.

Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

RDMA-based and SMP-aware Multi-port All-Gather on Multi-rail QsNet II SMP Clusters.

Proceedings of the 2007 International Conference on Parallel Processing (ICPP 2007), 2007

High Performance RDMA-based Multi-port All-gather on Multi-rail QsNet II.

Proceedings of the 21st Annual International Symposium on High Performance Computing Systems and Applications (HPCS 2007), 2007

Assessing the Ability of Computation/Communication Overlap and Communication Progress in Modern Interconnects.

Proceedings of the 15th Annual IEEE Symposium on High-Performance Interconnects, 2007

A feasibility analysis of power-awareness and energy minimization in modern interconnects for high-performance computing.

Proceedings of the 2007 IEEE International Conference on Cluster Computing, 2007

Improving system efficiency through scheduling and power management.

Proceedings of the 2007 IEEE International Conference on Cluster Computing, 2007

2006

Efficient RDMA-based multi-port collectives on multi-rail QsNet II clusters.

Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

Power-performance efficiency of asymmetric multiprocessors for multi-threaded scientific applications.

Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

2005

Communication Characteristics of Message-Passing Scientific and Engineering Applications.

Proceedings of the International Conference on Parallel and Distributed Computing Systems, 2005

2004

Myrinet Networks: A Performance Study.

Proceedings of the 3rd IEEE International Symposium on Network Computing and Applications (NCA 2004), 30 August, 2004

Performance Evaluation of the Sun Fire Link SMP Clusters.

Nathan R. Fredrickson

Proceedings of the 18th Annual Symposium on High Performance Computing Systems and Applications, 2004

2003

Performance characteristics of OpenMP constructs, and application benchmarks on a large symmetric multiprocessor.

Nathan R. Fredrickson

Proceedings of the 17th Annual International Conference on Supercomputing, 2003

2002

Analysis of a Latency Hiding Broadcasting Algorithm on a Reconfigurable Optical Interconnect.

Parallel Process. Lett., 2002

Efficient communication using message prediction for clusters of multiprocessors.

Concurr. Comput. Pract. Exp., 2002

Architectural Extensions to Support Efficient Communication Using Message Prediction.

Proceedings of the 16th Annual International Symposium on High Performance Computing Systems and Applications, 2002

2000

Efficient Communication Using Message Prediction for Cluster Multiprocessors.

Proceedings of the Network-Based Parallel Computing: Communication, 2000

1999

Hiding Communication Latency in Reconfigurable Message-Passing Environments.

Proceedings of the 13th International Parallel Processing Symposium / 10th Symposium on Parallel and Distributed Processing (IPPS / SPDP '99), 1999

1998

Communications Latency Hiding Techniques for a Reconfigurable Optical Interconnect: Benchmark Studies.

Proceedings of the Applied Parallel Computing, 1998

1997

Collective Communications on a Reconfigurable Optical Interconnect.
