Khaled Hamidouche

Orcid: 0000-0003-4836-5335

According to our database, Khaled Hamidouche authored at least 61 papers between 2011 and 2023.

Bibliography

2023
GPU-initiated Fine-grained Overlap of Collective Communication with Computation.
CoRR, 2023


2020
Hot Interconnects 26.
IEEE Micro, 2020

GPU initiated OpenSHMEM: correct and efficient intra-kernel networking for dGPUs.
Proceedings of the PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020

2018
ComP-net: command processor networking for efficient intra-kernel communications on GPUs.
Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, 2018

2017
GPU triggered networking for intra-kernel communications.
Proceedings of the International Conference for High Performance Computing, 2017

S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters.
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017

MPI-GDS: High Performance MPI Designs with GPUDirect-aSync for CPU-GPU Control Flow Decoupling.
Proceedings of the 46th International Conference on Parallel Processing, 2017

Kernel-Assisted Communication Engine for MPI on Emerging Manycore Processors.
Proceedings of the 24th IEEE International Conference on High Performance Computing, 2017

2016
CUDA-Aware OpenSHMEM: Extensions and Designs for High Performance OpenSHMEM on GPU Clusters.
Parallel Comput., 2016

INAM2: InfiniBand Network Analysis and Monitoring with MPI.
Proceedings of the High Performance Computing - 31st International Conference, 2016

Designing MPI library with on-demand paging (ODP) of infiniband: challenges and benefits.
Proceedings of the International Conference for High Performance Computing, 2016

OpenSHMEM Non-blocking Data Movement Operations with MVAPICH2-X: Early Experiences.
Proceedings of the 2016 PGAS Applications Workshop, 2016

Efficient Reliability Support for Hardware Multicast-Based Broadcast in GPU-enabled Streaming Applications.
Proceedings of the First International Workshop on Communication Optimizations in HPC, 2016

Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters.
Proceedings of the 28th International Symposium on Computer Architecture and High Performance Computing, 2016

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning.
Proceedings of the 23rd European MPI Users' Group Meeting, EuroMPI 2016, 2016

Designing high performance communication runtime for GPU managed memory: early experiences.
Proceedings of the 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit, 2016

Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-Enabled Systems.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

Enabling Performance Efficient Runtime Support for Hybrid MPI+UPC++ Programming Models.
Proceedings of the 18th IEEE International Conference on High Performance Computing and Communications; 14th IEEE International Conference on Smart City; 2nd IEEE International Conference on Data Science and Systems, 2016

Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA.
Proceedings of the 23rd IEEE International Conference on High Performance Computing, 2016

CUDA M3: Designing Efficient CUDA Managed Memory-Aware MPI by Exploiting GDR and IPC.
Proceedings of the 23rd IEEE International Conference on High Performance Computing, 2016

Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters.
Proceedings of the 2016 IEEE International Conference on Cloud Computing Technology and Science, 2016

CUDA Kernel Based Collective Reduction Operations on Large-scale GPU Clusters.
Proceedings of the IEEE/ACM 16th International Symposium on Cluster, 2016

2015
Porting scientific libraries to PGAS in XSEDE resources: practice and experience.
Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure, St. Louis, MO, USA, July 26, 2015

Designing Non-blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters.
Proceedings of the High Performance Computing - 30th International Conference, 2015

A case for application-oblivious energy-efficient MPI runtime.
Proceedings of the International Conference for High Performance Computing, 2015

GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks.
Proceedings of the 22nd European MPI Users' Group Meeting, 2015

Accelerating k-NN Algorithm with Hybrid MPI and OpenSHMEM.
Proceedings of the OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies, 2015

Scalable Out-of-core OpenSHMEM Library for HPC.
Proceedings of the OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies, 2015

A Case for Non-blocking Collectives in OpenSHMEM: Design, Implementation, and Performance Evaluation using MVAPICH2-X.
Proceedings of the OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies, 2015

High-Performance Coarray Fortran Support with MVAPICH2-X: Initial Experience and Evaluation.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015

Impact of InfiniBand DC Transport Protocol on Energy Consumption of All-to-All Collective Algorithms.
Proceedings of the 23rd IEEE Annual Symposium on High-Performance Interconnects, 2015

Offloaded GPU Collectives Using CORE-Direct and CUDA Capabilities on InfiniBand Clusters.
Proceedings of the 22nd IEEE International Conference on High Performance Computing, 2015

High Performance OpenSHMEM Strided Communication Support with InfiniBand UMR.
Proceedings of the 22nd IEEE International Conference on High Performance Computing, 2015

High-Performance and Scalable Design of MPI-3 RMA on Xeon Phi Clusters.
Proceedings of the Euro-Par 2015: Parallel Processing, 2015

High Performance MPI Datatype Support with User-Mode Memory Registration: Challenges, Designs, and Benefits.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

Power-Check: An Energy-Efficient Checkpointing Framework for HPC Clusters.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015

2014
Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences.
Proceedings of the Supercomputing - 29th International Conference, 2014

Understanding the Memory-Utilization of MPI Libraries: Challenges and Designs in Implementing the MPI_T Interface.
Proceedings of the 21st European MPI Users' Group Meeting, 2014

Initial study of multi-endpoint runtime for MPI+OpenMP hybrid programming model on multi-core systems.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014

Scalable MiniMD Design with Hybrid MPI and OpenSHMEM.
Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, 2014

Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models.
Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, 2014

High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

Optimizing Collective Communication in UPC.
Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2014

HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement Using MPI Datatypes on GPU Clusters.
Proceedings of the 43rd International Conference on Parallel Processing, 2014

MIC-Check: a distributed check pointing framework for the intel many integrated cores architecture.
Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, 2014

A high performance broadcast design with hardware multicast and GPUDirect RDMA for streaming applications on Infiniband clusters.
Proceedings of the 21st International Conference on High Performance Computing, 2014

Designing efficient small message transfer mechanism for inter-node MPI communication on InfiniBand GPU clusters.
Proceedings of the 21st International Conference on High Performance Computing, 2014

Scalable Graph500 design with MPI-3 RMA.
Proceedings of the 2014 IEEE International Conference on Cluster Computing, 2014

High performance OpenSHMEM for Xeon Phi clusters: Extensions, runtime designs and application co-design.
Proceedings of the 2014 IEEE International Conference on Cluster Computing, 2014

2013
Parallel Smith-Waterman Comparison on Multicore and Manycore Computing Platforms with BSP++.
Int. J. Parallel Program., 2013

MVAPICH-PRISM: a proxy-based communication framework using InfiniBand and SCIF for intel MIC clusters.
Proceedings of the International Conference for High Performance Computing, 2013

Efficient and truly passive MPI-3 RMA using InfiniBand atomics.
Proceedings of the 20th European MPI Users's Group Meeting, 2013

MIC-RO: enabling efficient remote offload on heterogeneous many integrated core (MIC) clusters with InfiniBand.
Proceedings of the International Conference on Supercomputing, 2013

Efficient Inter-node MPI Communication Using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs.
Proceedings of the 42nd International Conference on Parallel Processing, 2013

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters.
Proceedings of the IEEE 21st Annual Symposium on High-Performance Interconnects, 2013

A scalable and portable approach to accelerate hybrid HPL on heterogeneous CPU-GPU clusters.
Proceedings of the 2013 IEEE International Conference on Cluster Computing, 2013

2011
Programmation des architectures hiérarchiques et hétérogènes. (Programming hierarchical and heterogeneous machines).
PhD thesis, 2011

A framework for an automatic hybrid MPI+OpenMP code generation.
Proceedings of the 2011 Spring Simulation Multi-conference, 2011

Parallel Biological Sequence Comparison on Heterogeneous High Performance Computing Platforms with BSP++.
Proceedings of the 23rd International Symposium on Computer Architecture and High Performance Computing, 2011

