Peng Sun

Orcid: 0000-0001-8456-0491

Affiliations:
  • Nanyang Technological University, Energy Research Institute, Interdisciplinary Graduate School, Singapore


According to our database, Peng Sun authored at least 29 papers between 2013 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2024
Deep Learning Workload Scheduling in GPU Datacenters: A Survey.
ACM Comput. Surv., June, 2024

InternEvo: Efficient Long-sequence Large Language Model Training via Hybrid Parallelism and Redundant Sharding.
CoRR, 2024

Characterization of Large Language Model Development in the Datacenter.
Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, 2024

2023
AMSP: Super-Scaling LLM Training via Advanced Model States Partitioning.
CoRR, 2023

Boosting Distributed Full-graph GNN Training with Asynchronous One-bit Communication.
CoRR, 2023

Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters.
Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation, 2023

Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs.
Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023

2022
Astraea: A Fair Deep Learning Scheduler for Multi-Tenant GPU Clusters.
IEEE Trans. Parallel Distributed Syst., 2022

GradientFlow: Optimizing Network Performance for Large-Scale Distributed DNN Training.
IEEE Trans. Big Data, 2022

Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision.
CoRR, 2022

A Simulation Platform for Multi-tenant Machine Learning Services on Thousands of GPUs.
CoRR, 2022

Primo: Practical Learning-Augmented Systems with Interpretable Models.
Proceedings of the 2022 USENIX Annual Technical Conference, 2022

Titan: a scheduler for foundation model fine-tuning workloads.
Proceedings of the 13th Symposium on Cloud Computing, SoCC 2022, 2022

2021
ModelCI-e: Enabling Continual Learning in Deep Learning Serving Systems.
CoRR, 2021

Characterization and prediction of deep learning workloads in large-scale GPU datacenters.
Proceedings of the International Conference for High Performance Computing, 2021

Chronus: A Novel Deadline-aware Scheduler for Deep Learning Training Jobs.
Proceedings of the SoCC '21: ACM Symposium on Cloud Computing, 2021

2020
GraphMP: I/O-Efficient Big Graph Analytics on a Single Commodity Machine.
IEEE Trans. Big Data, 2020

Elan: Towards Generic and Efficient Elastic Training for Deep Learning.
Proceedings of the 40th IEEE International Conference on Distributed Computing Systems, 2020

2019
Scalable Architectures for Big Data Analysis.
Encyclopedia of Big Data Technologies, 2019

Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes.
CoRR, 2019

2018
MetaFlow: A Scalable Metadata Lookup Service for Distributed File Systems in Data Centers.
IEEE Trans. Big Data, 2018

On Distributed Algorithms for Cost-Efficient Data Center Placement in Cloud Computing.
CoRR, 2018

Speeding-Up Age Estimation in Intelligent Demographics System via Network Optimization.
Proceedings of the 2018 IEEE International Conference on Communications, 2018

2017
Towards Distributed Machine Learning in Shared Clusters: A Dynamically-Partitioned Approach.
Proceedings of the 2017 IEEE International Conference on Smart Computing, 2017

GraphMP: An Efficient Semi-External-Memory Big Graph Processing System on a Single Machine.
Proceedings of the 23rd IEEE International Conference on Parallel and Distributed Systems, 2017

GraphH: High Performance Big Graph Analytics in Small Clusters.
Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

2016
Timed Dataflow: Reducing Communication Overhead for Distributed Machine Learning Systems.
Proceedings of the 22nd IEEE International Conference on Parallel and Distributed Systems, 2016

2014
CREATE: Correlation enhanced traffic matrix estimation in Data Center Networks.
Proceedings of the 2014 IFIP Networking Conference, Trondheim, 2014

2013
Cloud3DView: an interactive tool for cloud data center operations.
Proceedings of the ACM SIGCOMM 2013 Conference, 2013

