Kun Qian

Orcid: 0000-0001-9882-9279

Affiliations:
  • Alibaba Cloud, Hangzhou, China


According to our database, Kun Qian authored at least 17 papers between 2022 and 2025.

Bibliography

2025
PerfTracker: Online Performance Troubleshooting for Large-scale Model Training in Production.
CoRR, June, 2025

ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production.
CoRR, May, 2025

Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market.
Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, 2025

SyCCL: Exploiting Symmetry for Efficient Collective Communication Scheduling.
Proceedings of the ACM SIGCOMM 2025 Conference, 2025

SkeletonHunter: Diagnosing and Localizing Network Failures in Containerized Large Model Training.
Proceedings of the ACM SIGCOMM 2025 Conference, 2025

SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision.
Proceedings of the 22nd USENIX Symposium on Networked Systems Design and Implementation, 2025

Evolution of Aegis: Fault Diagnosis for AI Model Training Service in Production.
Proceedings of the 22nd USENIX Symposium on Networked Systems Design and Implementation, 2025

Mitigating Scalability Walls of RDMA-based Container Networks.
Proceedings of the 22nd USENIX Symposium on Networked Systems Design and Implementation, 2025

2024
Unicron: Economizing Self-Healing LLM Training at Scale.
CoRR, 2024

Alibaba HPN: A Data Center Network for Large Language Model Training.
Proceedings of the ACM SIGCOMM 2024 Conference, 2024

Crux: GPU-Efficient Communication Scheduling for Deep Learning Training.
Proceedings of the ACM SIGCOMM 2024 Conference, 2024

Burstable Cloud Block Storage with Data Processing Units.
Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation, 2024

Near-Lossless Gradient Compression for Data-Parallel Distributed DNN Training.
Proceedings of the 2024 ACM Symposium on Cloud Computing, 2024

2023
Dependable Virtualized Fabric on Programmable Data Plane.
IEEE/ACM Trans. Netw., August, 2023

XRON: A Hybrid Elastic Cloud Overlay Network for Video Conferencing at Planetary Scale.
Proceedings of the ACM SIGCOMM 2023 Conference, 2023

2022
From luna to solar: the evolutions of the compute-to-storage networks in Alibaba cloud.
Proceedings of the SIGCOMM '22: ACM SIGCOMM 2022 Conference, Amsterdam, The Netherlands, August 22, 2022

Predictable vFabric on informative data plane.
Proceedings of the SIGCOMM '22: ACM SIGCOMM 2022 Conference, Amsterdam, The Netherlands, August 22, 2022

