Zhiling Lan

Orcid: 0000-0002-1047-8724

According to our database, Zhiling Lan authored at least 103 papers between 2000 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2024
Union: An Automatic Workload Manager for Accelerating Network Simulation.
CoRR, 2024

MRSch: Multi-Resource Scheduling for HPC.
CoRR, 2024

2023
Performance and power modeling and prediction using MuMMI and 10 machine learning methods.
Concurr. Comput. Pract. Exp., 2023

Machine Learning for Interconnect Network Traffic Forecasting: Investigation and Exploitation.
Proceedings of the 2023 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, 2023

Workload Interference Prevention with Intelligent Routing and Flexible Job Placement on Dragonfly.
Proceedings of the 2023 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, 2023

Hybrid PDES Simulation of HPC Networks Using Zombie Packets.
Proceedings of the 2023 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, 2023

Interpretable Modeling of Deep Reinforcement Learning Driven Scheduling.
Proceedings of the 31st International Symposium on Modeling, 2023

2022
DRAS: Deep Reinforcement Learning for Cluster Scheduling in High Performance Computing.
IEEE Trans. Parallel Distributed Syst., 2022

Study of Workload Interference with Intelligent Routing on Dragonfly.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

Encoding for Reinforcement Learning Driven Scheduling.
Proceedings of the Job Scheduling Strategies for Parallel Processing, 2022

Hybrid Workload Scheduling on HPC Systems.
Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium, 2022

MRSch: Multi-Resource Scheduling for HPC.
Proceedings of the IEEE International Conference on Cluster Computing, 2022

2021
DRAS-CQSim: A reinforcement learning based framework for HPC cluster scheduling.
Softw. Impacts, 2021

Performance and Energy Improvement of ECP Proxy App SW4lite under Various Workloads.
Proceedings of the IEEE/ACM Workshop on Memory Centric High Performance Computing, 2021

Deep Reinforcement Agent for Scheduling in HPC.
Proceedings of the 35th IEEE International Parallel and Distributed Processing Symposium, 2021

Q-adaptive: A Multi-Agent Reinforcement Learning Based Routing on Dragonfly Network.
Proceedings of the HPDC '21: The 30th International Symposium on High-Performance Parallel and Distributed Computing, 2021

A Dynamic Power Capping Library for HPC Applications.
Proceedings of the IEEE International Conference on Cluster Computing, 2021

2020
Performance and Power Modeling and Prediction Using MuMMI and Ten Machine Learning Methods.
CoRR, 2020

Union: An Automatic Workload Manager for Accelerating Network Simulation.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

2019
Modeling and Analysis of Application Interference on Dragonfly+.
Proceedings of the 2019 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, 2019

Scheduling Beyond CPUs for HPC.
Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, 2019

2018
System-wide trade-off modeling of performance, power, and resilience on petascale systems.
J. Supercomput., 2018

Trade-Off Study of Localizing Communication and Balancing Network Traffic on a Dragonfly System.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018

Joint Effects of Application Communication Pattern, Job Placement and Network Routing on Fat-Tree Systems.
Proceedings of the 47th International Conference on Parallel Processing, 2018

2017
Toward General Software Level Silent Data Corruption Detection for Parallel Applications.
IEEE Trans. Parallel Distributed Syst., 2017

Topology mapping of irregular parallel applications on torus-connected supercomputers.
J. Supercomput., 2017

Experience and Practice of Batch Scheduling on Leadership Supercomputers at Argonne.
Proceedings of the Job Scheduling Strategies for Parallel Processing, 2017

A Preliminary Study of Intra-Application Interference on Dragonfly Network.
Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

Preliminary Interference Study About Job Placement and Routing Algorithms in the Fat-Tree Topology for HPC Applications.
Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

Trade-Off Between Prediction Accuracy and Underestimation Rate in Job Runtime Estimates.
Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

2016
Improving Batch Scheduling on Blue Gene/Q by Relaxing Network Allocation Constraints.
IEEE Trans. Parallel Distributed Syst., 2016

A Scalable, Non-Parametric Method for Detecting Performance Anomaly in Large Scale Computing.
IEEE Trans. Parallel Distributed Syst., 2016

I/O-aware bandwidth allocation for petascale computing systems.
Parallel Comput., 2016

Application power profiling on IBM Blue Gene/Q.
Parallel Comput., 2016

Watch out for the bully!: job interference study on dragonfly network.
Proceedings of the International Conference for High Performance Computing, 2016

A data driven scheduling approach for power management on HPC systems.
Proceedings of the International Conference for High Performance Computing, 2016

Study of Intra- and Interjob Interference on Torus Networks.
Proceedings of the 22nd IEEE International Conference on Parallel and Distributed Systems, 2016

Exploring Partial Replication to Improve Lightweight Silent Data Corruption Detection for HPC Applications.
Proceedings of the Euro-Par 2016: Parallel Processing, 2016

Exploring Plan-Based Scheduling for Large-Scale Computing Systems.
Proceedings of the 2016 IEEE International Conference on Cluster Computing, 2016

2015
Hierarchical task mapping for parallel applications on supercomputers.
J. Supercomput., 2015

Reliability-Aware Speedup Models for Parallel Applications with Coordinated Checkpointing/Restart.
IEEE Trans. Computers, 2015

Quantitative modeling of power performance tradeoffs on extreme scale systems.
J. Parallel Distributed Comput., 2015

Improving Batch Scheduling on Blue Gene/Q by Relaxing 5D Torus Network Allocation Constraints.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

Utility-Based Scheduling for Bulk Data Transfers between Distributed Computing Facilities.
Proceedings of the 44th International Conference on Parallel Processing Workshops, 2015

Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications.
Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, 2015

I/O-Aware Batch Scheduling for Petascale Computing Systems.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

Comparison of Vendor Supplied Environmental Data Collection Mechanisms.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

2014
Balancing job performance with system performance via locality-aware scheduling on torus-connected systems.
Proceedings of the 2014 IEEE International Conference on Cluster Computing, 2014

Exploring void search for fault detection on extreme scale systems.
Proceedings of the 2014 IEEE International Conference on Cluster Computing, 2014

2013
Multi-domain job coscheduling for leadership computing systems.
J. Supercomput., 2013

Toward balanced and sustainable job scheduling for production supercomputers.
Parallel Comput., 2013

Job scheduling with adjusted runtime estimates on production supercomputers.
J. Parallel Distributed Comput., 2013

Integrating dynamic pricing of electricity into energy aware scheduling for HPC systems.
Proceedings of the International Conference for High Performance Computing, 2013

Reducing Energy Costs for IBM Blue Gene/P via Power-Aware Job Scheduling.
Proceedings of the Job Scheduling Strategies for Parallel Processing, 2013

A Transparent Collective I/O Implementation.
Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

Measuring Power Consumption on IBM Blue Gene/Q.
Proceedings of the 2013 IEEE International Symposium on Parallel & Distributed Processing, 2013

Application power profiling on IBM Blue Gene/Q.
Proceedings of the 2013 IEEE International Conference on Cluster Computing, 2013

A scalable, non-parametric anomaly detection framework for Hadoop.
Proceedings of the ACM Cloud and Autonomic Computing Conference, 2013

2012
Hierarchical task mapping of cell-based AMR cosmology simulations.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

Improving Parallel IO Performance of Cell-based AMR Cosmology Applications.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

Adaptive Metric-Aware Job Scheduling for Production Supercomputers.
Proceedings of the 41st International Conference on Parallel Processing Workshops, 2012

3-Dimensional root cause diagnosis via co-analysis.
Proceedings of the 9th International Conference on Autonomic Computing, 2012

Filtering log data: Finding the needles in the Haystack.
Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks, 2012

2011
FREM: A Fast Restart Mechanism for General Checkpoint/Restart.
IEEE Trans. Computers, 2011

Co-analysis of RAS Log and Job Log on Blue Gene/P.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

Reducing Fragmentation on Torus-Connected Supercomputers.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

Improving Job Scheduling on Production Supercomputers.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

Job Coscheduling on Coupled High-End Computing Systems.
Proceedings of the 2011 International Conference on Parallel Processing Workshops, 2011

Practical online failure prediction for Blue Gene/P: Period-based vs event-driven.
Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W 2011), 2011

Evaluating Performance Impacts of Delayed Failure Repairing on Large-Scale Systems.
Proceedings of the 2011 IEEE International Conference on Cluster Computing (CLUSTER), 2011

Performance Emulation of Cell-Based AMR Cosmology Simulations.
Proceedings of the 2011 IEEE International Conference on Cluster Computing (CLUSTER), 2011

2010
Toward Automated Anomaly Identification in Large-Scale Systems.
IEEE Trans. Parallel Distributed Syst., 2010

A study of dynamic meta-learning for failure prediction in large-scale systems.
J. Parallel Distributed Comput., 2010

Automatic and coordinated job recovery for high performance computing.
Proceedings of the 3rd Workshop on Many-Task Computing on Grids and Supercomputers, 2010

Analyzing and adjusting user runtime estimates to improve job scheduling on the Blue Gene/P.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

A practical failure prediction with location and lead time for Blue Gene/P.
Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W 2010), Chicago, Illinois, USA, June 28, 2010

2009
Fault-Aware Runtime Strategies for High-Performance Computing.
IEEE Trans. Parallel Distributed Syst., 2009

System log pre-processing to improve failure prediction.
Proceedings of the 2009 IEEE/IFIP International Conference on Dependable Systems and Networks, 2009

Reliability-aware scalability models for high performance computing.
Proceedings of the 2009 IEEE International Conference on Cluster Computing, August 31, 2009

Fault-aware, utility-based job scheduling on Blue Gene/P systems.
Proceedings of the 2009 IEEE International Conference on Cluster Computing, August 31, 2009

Performance under Failures of DAG-based Parallel Computing.
Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, 2009

2008
Adaptive Fault Management of Parallel Applications for High-Performance Computing.
IEEE Trans. Computers, 2008

Analytical study of migration-enhanced fault tolerance for long-running applications in IFR systems.
Int. J. Parallel Emergent Distributed Syst., 2008

Enhancing application robustness through adaptive fault tolerance.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study.
Proceedings of the 2008 International Conference on Parallel Processing, 2008

A fast restart mechanism for checkpoint/recovery protocols in networked environments.
Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2008

2007
Fault-Driven Re-Scheduling For Improving System-level Fault Resilience.
Proceedings of the 2007 International Conference on Parallel Processing (ICPP 2007), 2007

A Meta-Learning Failure Predictor for Blue Gene/L Systems.
Proceedings of the 2007 International Conference on Parallel Processing (ICPP 2007), 2007

Anomaly localization in large-scale clusters.
Proceedings of the 2007 IEEE International Conference on Cluster Computing, 2007

2006
DistDLB: Improving cosmology SAMR simulations on distributed computing systems through hierarchical load balancing.
J. Parallel Distributed Comput., 2006

Poster reception - Improving fault resilience of high performance applications.
Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, 2006

Exploit Failure Prediction for Adaptive Fault-Tolerance in Cluster Computing.
Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2006), 2006

Evaluating Performance and Scalability of Advanced Accelerator Simulations.
Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2006), 2006

2005
A novel workload migration scheme for heterogeneous distributed computing.
Proceedings of the 5th International Symposium on Cluster Computing and the Grid (CCGrid 2005), 2005

2004
Performance analysis of a large-scale cosmology application on three cluster systems.
Int. J. High Perform. Comput. Netw., 2004

A Survey of Load Balancing in Grid Computing.
Proceedings of the Computational and Information Science, First International Symposium, 2004

2003
Exploring cosmology applications on distributed environments.
Future Gener. Comput. Syst., 2003

2002
Dynamic load balancing of SAMR applications on distributed systems.
Sci. Program., 2002

A novel dynamic load balancing scheme for parallel systems.
J. Parallel Distributed Comput., 2002

2001
Design and Development of the Prophesy Performance Database for Distributed Scientific Applications.
Proceedings of the Tenth SIAM Conference on Parallel Processing for Scientific Computing, 2001

Prophesy: Automating the Modeling Process.
Proceedings of the 3rd Annual International Workshop on Active Middleware Services (AMS 2001), 2001

Dynamic Load Balancing for Structured Adaptive Mesh Refinement Applications.
Proceedings of the 2001 International Conference on Parallel Processing, 2001

2000
Prophesy: An Infrastructure for Analyzing and Modeling the Performance of Parallel and Distributed Applications.
Proceedings of the Ninth IEEE International Symposium on High Performance Distributed Computing, 2000

