Ching-Hsiang Chu

Orcid: 0000-0002-6752-3135

According to our database, Ching-Hsiang Chu authored at least 36 papers between 2011 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2024
Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large-Scale Recommendation.
CoRR, 2024

2023
Better Together: Jointly Optimizing ML Collective Scheduling and Execution Planning using SYNDICATE.
Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation, 2023

2021
The MVAPICH project: Transforming research into high-performance MPI library for HPC community.
J. Comput. Sci., 2021

High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models.
CoRR, 2021

Designing a ROCm-Aware MPI Library for AMD GPUs: Early Experiences.
Proceedings of the High Performance Computing - 36th International Conference, 2021

Adaptive and Hierarchical Large Message All-to-all Communication Algorithms for Large-scale Dense GPU Systems.
Proceedings of the 21st IEEE/ACM International Symposium on Cluster, 2021

2020
Communication Profiling and Characterization of Deep-Learning Workloads on Clusters With High-Performance Interconnects.
IEEE Micro, 2020

FALCON-X: Zero-copy MPI derived datatype processing on modern CPU and GPU architectures.
J. Parallel Distributed Comput., 2020

NV-group: link-efficient reduction for distributed deep learning on modern dense GPU systems.
Proceedings of the ICS '20: 2020 International Conference on Supercomputing, 2020

Dynamic Kernel Fusion for Bulk Non-contiguous Data Transfer on GPU Clusters.
Proceedings of the IEEE International Conference on Cluster Computing, 2020

2019
Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast.
IEEE Trans. Parallel Distributed Syst., 2019

Optimized large-message broadcast for deep learning workloads: MPI, MPI+NCCL, or NCCL2?
Parallel Comput., 2019

Performance Evaluation of MPI Libraries on GPU-Enabled OpenPOWER Architectures: Early Experiences.
Proceedings of the High Performance Computing, 2019

OMB-UM: Design, Implementation, and Evaluation of CUDA Unified Memory Aware MPI Benchmarks.
Proceedings of the 2019 IEEE/ACM Performance Modeling, 2019

C-GDR: High-Performance Container-Aware GPUDirect MPI Communication Schemes on RDMA Networks.
Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019

Designing a Profiling and Visualization Tool for Scalable and In-depth Analysis of High-Performance GPU Clusters.
Proceedings of the 26th IEEE International Conference on High Performance Computing, 2019

High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems.
Proceedings of the 26th IEEE International Conference on High Performance Computing, 2019

Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation.
Proceedings of the 19th IEEE/ACM International Symposium on Cluster, 2019

Characterizing CUDA Unified Memory (UM)-Aware MPI Designs on Modern GPU Architectures.
Proceedings of the 12th Workshop on General Purpose Processing Using GPUs, 2019

2018
Distributed Topology Control for Energy-Efficient and Reliable Wireless Communications.
IEEE Syst. J., 2018

Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
Proceedings of the 25th European MPI Users' Group Meeting, 2018

Designing High-Performance In-Memory Key-Value Operations with Persistent GPU Kernels and OpenSHMEM.
Proceedings of the OpenSHMEM and Related Technologies. OpenSHMEM in the Era of Extreme Heterogeneity, 2018

OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training.
Proceedings of the 25th IEEE International Conference on High Performance Computing, 2018

2017
MPI-GDS: High Performance MPI Designs with GPUDirect-aSync for CPU-GPU Control Flow Decoupling.
Proceedings of the 46th International Conference on Parallel Processing, 2017

Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning.
Proceedings of the 46th International Conference on Parallel Processing, 2017

2016
CUDA-Aware OpenSHMEM: Extensions and Designs for High Performance OpenSHMEM on GPU Clusters.
Parallel Comput., 2016

Efficient Reliability Support for Hardware Multicast-Based Broadcast in GPU-enabled Streaming Applications.
Proceedings of the First International Workshop on Communication Optimizations in HPC, 2016

Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters.
Proceedings of the 28th International Symposium on Computer Architecture and High Performance Computing, 2016

Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-Enabled Systems.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

CUDA Kernel Based Collective Reduction Operations on Large-scale GPU Clusters.
Proceedings of the IEEE/ACM 16th International Symposium on Cluster, 2016

2015
A Case for Non-blocking Collectives in OpenSHMEM: Design, Implementation, and Performance Evaluation using MVAPICH2-X.
Proceedings of the OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies, 2015

Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

2014
Measurement of long-distance Wi-Fi connections: An empirical study.
Proceedings of the IEEE International Conference on Communications, 2014

2013
Channel condition self-clocked packet scheduling scheme for wireless networks.
EURASIP J. Wirel. Commun. Netw., 2013

2011
Improving SCTP Performance by Jitter-Based Congestion Control over Wired-Wireless Networks.
EURASIP J. Wirel. Commun. Netw., 2011
