Shigang Li

Yu Zhu

Johannes de Fine Licht

CoRR, 2023

ASDL: A Unified Interface for Gradient Preconditioning in PyTorch.

CoRR, 2023

AutoDDL: Automatic Distributed Deep Learning with Asymptotically Optimal Communication.

CoRR, 2023

Large-Scale Simulation of Structural Dynamics Computing on GPU Clusters.

Proceedings of the International Conference for High Performance Computing, 2023

Co-design Hardware and Algorithm for Vector Search.

Wenqi Jiang

Yu Zhu

Johannes de Fine Licht

Proceedings of the International Conference for High Performance Computing, 2023

A Scalable Hybrid Total FETI Method for Massively Parallel FEM Simulations.

Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2023

Asynch-SGBDT: Train Stochastic Gradient Boosting Decision Trees in an Asynchronous Parallel Manner.

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

2022

PipeFisher: Efficient Training of Large Language Models Using Pipelining and Fisher Information Matrices.

Kazuki Osawa

CoRR, 2022

Efficient Quantized Sparse Matrix Operations on Tensor Cores.

Kazuki Osawa

Proceedings of the SC22: International Conference for High Performance Computing, 2022

HammingMesh: A Network Topology for Large-Scale Deep Learning.

Tommaso Bonato

Daniele De Sensi

Proceedings of the SC22: International Conference for High Performance Computing, 2022

Near-optimal sparse allreduce for distributed deep learning.

Proceedings of the PPoPP '22: 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Seoul, Republic of Korea, April 2, 2022

A data-centric optimization framework for machine learning.

Proceedings of the ICS '22: 2022 International Conference on Supercomputing, Virtual Event, June 28, 2022

2021

Breaking (Global) Barriers in Parallel Stochastic Optimization With Wait-Avoiding Group Averaging.

Tal Ben-Nun

Giorgi Nadiradze

Nikoli Dryden

Dan Alistarh

IEEE Trans. Parallel Distributed Syst., 2021

Why Dataset Properties Bound the Scalability of Parallel Machine Learning Training Algorithms.

IEEE Trans. Parallel Distributed Syst., 2021

Flare: flexible in-network allreduce.

Daniele De Sensi

Saleh Ashkboos

Proceedings of the International Conference for High Performance Computing, 2021

Chimera: efficiently training large-scale neural networks with bidirectional pipelines.

Proceedings of the International Conference for High Performance Computing, 2021

Asynchronous Decentralized SGD with Quantized and Local Updates.

Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Data Movement Is All You Need: A Case Study on Optimizing Transformers.

Proceedings of Machine Learning and Systems 2021, 2021

2020

FastNBL: fast neighbor lists establishment for molecular dynamics simulation based on bitwise operations.

J. Supercomput., 2020

WP-SGD: Weighted parallel SGD for distributed unbalanced-workload training system.

J. Parallel Distributed Comput., 2020

The static parallel distribution algorithms for hybrid density-functional calculations in HONPAS package.

Int. J. High Perform. Comput. Appl., 2020

Deep Learning for Post-Processing Ensemble Weather Forecasts.

CoRR, 2020

Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging.

Tal Ben-Nun

Dan Alistarh

Nikoli Dryden

CoRR, 2020

Taming unbalanced training workloads in deep learning with partial collective operations.

Tal Ben-Nun

Dan Alistarh

Proceedings of the PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020

2019

Correction to: FastNBL: fast neighbor lists establishment for molecular dynamics simulation based on bitwise operations.

J. Supercomput., 2019

Efficient parallel optimizations of a high-performance SIFT on GPUs.

J. Parallel Distributed Comput., 2019

Predicting Weather Uncertainty with Deep Convnets.

CoRR, 2019

The Scalability for Parallel Machine Learning Training Algorithm: Dataset Matters.

CoRR, 2019

OpenKMC: a KMC design for hundred-billion-atom simulation using millions of cores on Sunway Taihulight.

Proceedings of the International Conference for High Performance Computing, 2019

swMD: Performance Optimizations for Molecular Dynamics Simulation on Sunway Taihulight.

Proceedings of the 2019 IEEE Intl Conf on Parallel & Distributed Processing with Applications, 2019

Using Gradient Based Multikernel Gaussian Process and Meta-Acquisition Function to Accelerate SMBO.

Proceedings of the 31st IEEE International Conference on Tools with Artificial Intelligence, 2019

2018

Cache-Oblivious MPI All-to-All Communications Based on Morton Order.

IEEE Trans. Parallel Distributed Syst., 2018

Using Known Information to Accelerate HyperParameters Optimization Based on SMBO.

CoRR, 2018

Asynchronous Parallel Sampling Gradient Boosting Decision Tree.

CoRR, 2018

Communication-Avoiding for Dynamical Core of Atmospheric General Circulation Model.

Proceedings of the 47th International Conference on Parallel Processing, 2018

Massively Scaling the Metal Microscopic Damage Simulation on Sunway TaihuLight Supercomputer.

Proceedings of the 47th International Conference on Parallel Processing, 2018

AGCM3D: A Highly Scalable Finite-Difference Dynamical Core of Atmospheric General Circulation Model Based on 3D Decomposition.

Proceedings of the 24th IEEE International Conference on Parallel and Distributed Systems, 2018

2017

Hybrid-optimization strategy for the communication of large-scale Kinetic Monte Carlo simulation.

Comput. Phys. Commun., 2017

Kernel optimization for short-range molecular dynamics.

Comput. Phys. Commun., 2017

Asynchronous COMID: the theoretic basis for transmitted data sparsification tricks on Parameter Server.

CoRR, 2017

Weighted parallel SGD for distributed unbalanced-workload training system.

CoRR, 2017

POSTER: Cache-Oblivious MPI All-to-All Communications on Many-Core Architectures.

Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017

2016

A Cross-Platform SpMV Framework on Many-Core Architectures.

ACM Trans. Archit. Code Optim., 2016

Parallel Processing Systems for Big Data: A Survey.

Athanasios V. Vasilakos

Proc. IEEE, 2016

2015

Automatic tuning of sparse matrix-vector multiplication on multicore clusters.

Sci. China Inf. Sci., 2015

Fast Convolution Operations on Many-Core Architectures.

Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications, 2015

Analyzing MPI-3.0 Process-Level Shared Memory: A Case Study with Stencil Computations.

Proceedings of the 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2015

2014

Improved MPI collectives for MPI processes in shared address spaces.

Clust. Comput., 2014

2013

Asynchronous Work Stealing on Distributed Memory Systems.

Proceedings of the 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, 2013

NUMA-aware shared-memory collective communication for MPI.
