Kurt B. Ferreira

Parallel Comput., 2021

Characterizing Memory Failures Using Benford's Law.

Proceedings of the Euro-Par 2021: Parallel Processing Workshops, 2021

Understanding the Effects of DRAM Correctable Error Logging at Scale.

Proceedings of the IEEE International Conference on Cluster Computing, 2021

2020

The unexpected virtue of almost: Exploiting MPI collective operations to approximately coordinate checkpoints.

Patrick M. Widener

Concurr. Comput. Pract. Exp., 2020

Hardware MPI message matching: Insights into MPI matching behavior to inform design.

Ryan E. Grant

Michael J. Levenhagen

Taylor L. Groves

Concurr. Comput. Pract. Exp., 2020

ALAMO: Autonomous Lightweight Allocation, Management, and Optimization.

Proceedings of the Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI, 2020

Evaluating MPI Message Size Summary Statistics.

Proceedings of the EuroMPI/USA '20: 27th European MPI Users' Group Meeting, 2020

2019

Using simulation to examine the effect of MPI message matching costs on application performance.

Matthew G. F. Dosanjh

Parallel Comput., 2019

Checkpointing Strategies for Shared High-Performance Computing Platforms.

Int. J. Netw. Comput., 2019

Evaluating tradeoffs between MPI message matching offload hardware capacity and performance.

Proceedings of the 26th European MPI Users' Group Meeting, 2019

Space-Efficient Reed-Solomon Encoding to Detect and Correct Pointer Corruption.

Proceedings of the Euro-Par 2019: Parallel Processing Workshops, 2019

Sandia Line of LWKs.

Proceedings of the Operating Systems for Supercomputers and High Performance Computing, 2019

2018

Lessons learned from memory errors observed over the lifetime of Cielo.

Proceedings of the International Conference for High Performance Computing, 2018

Using Simulation to Examine the Effect of MPI Message Matching Costs on Application Performance.

Proceedings of the 25th European MPI Users' Group Meeting, 2018

Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms.

Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018

Open Science on Trinity's Knights Landing Partition: An Analysis of User Job Data.

Kevin T. Pedretti

Proceedings of the 47th International Conference on Parallel Processing, 2018

Physics-Informed Machine Learning for DRAM Error Modeling.

Proceedings of the 2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, 2018

2017

Characterizing MPI matching via trace-based simulation.

Proceedings of the 24th European MPI Users' Group Meeting, 2017

It's Not the Heat, It's the Humidity: Scheduling Resilience Activity at Scale.

Patrick M. Widener

Proceedings of the Euro-Par 2017: Parallel Processing Workshops, 2017

Automating DRAM Fault Mitigation By Learning From Experience.

Proceedings of the 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, 2017

Lifetime memory reliability data from the field.

Proceedings of the IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, 2017

Evaluating the Viability of Using Compression to Mitigate Silent Corruption of Read-Mostly Application Data.

Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

2016

On noise and the performance benefit of nonblocking collectives.

Int. J. High Perform. Comput. Appl., 2016

Understanding performance interference in next-generation HPC systems.

Proceedings of the International Conference for High Performance Computing, 2016

Improving application resilience to memory errors with lightweight compression.

Proceedings of the International Conference for High Performance Computing, 2016

How I Learned to Stop Worrying and Love In Situ Analytics: Leveraging Latent Synchronization in MPI Collective Algorithms.

Proceedings of the 23rd European MPI Users' Group Meeting, EuroMPI 2016, 2016

Mini-Ckpts: Surviving OS Failures in Persistent Memory.

Proceedings of the 2016 International Conference on Supercomputing, 2016

An Examination of the Impact of Failure Distribution on Coordinated Checkpoint/Restart.

Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale, 2016

Horseshoes and Hand Grenades: The Case for Approximate Coordination in Local Checkpointing Protocols.

Patrick M. Widener

Proceedings of the Euro-Par 2016: Parallel Processing Workshops, 2016

FlipSphere: A Software-Based DRAM Error Detection and Correction Library for HPC.

David Fiala

Frank Mueller

Proceedings of the 20th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications, 2016

Improving DRAM Fault Characterization through Machine Learning.

Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, 2016

Scheduling In-Situ Analytics in Next-Generation Applications.

Proceedings of the IEEE/ACM 16th International Symposium on Cluster, 2016

2015

A study of the viability of exploiting memory content similarity to improve resilience to memory errors.

Int. J. High Perform. Comput. Appl., 2015

A checkpoint compression study for high-performance computing systems.

Dewan Ibtesham

Dorian C. Arnold

Int. J. High Perform. Comput. Appl., 2015

Early experiences with node-level power capping on the Cray XC40 platform.

Proceedings of the 3rd International Workshop on Energy Efficient Supercomputing, 2015

What is a Lightweight Kernel?

Rolf Riesen

Arthur Barney Maccabe

Proceedings of the 5th International Workshop on Runtime and Operating Systems for Supercomputers, 2015

A Principled Approach to HPC Event Monitoring.

Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale, 2015

Canaries in a Coal Mine: Using Application-Level Checkpoints to Detect Memory Failures.

Proceedings of the Euro-Par 2015: Parallel Processing Workshops, 2015

Memory Errors in Modern Systems: The Good, The Bad, and The Ugly.

Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, 2015

2014

Accelerating incremental checkpointing for extreme-scale computing.

Future Gener. Comput. Syst., 2014

Understanding the Effects of Communication and Coordination on Checkpointing at Scale.

Proceedings of the International Conference for High Performance Computing, 2014

Exploring the effect of noise on the performance benefit of nonblocking allreduce.

Proceedings of the 21st European MPI Users' Group Meeting, 2014

Energy Consumption of Resilience Mechanisms in Large Scale Systems.

Proceedings of the 22nd Euromicro International Conference on Parallel, 2014

Characterizing the Impact of Rollback Avoidance at Extreme-Scale: A Modeling Approach.

Proceedings of the 43rd International Conference on Parallel Processing, 2014

Coarse-Grained Energy Modeling of Rollback/Recovery Mechanisms.

Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2014

2013

Evaluating energy savings for checkpoint/restart.

Proceedings of the 1st International Workshop on Energy Efficient Supercomputing, 2013

Using Simulation to Evaluate the Performance of Resilience Strategies at Scale.

Proceedings of the High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation, 2013

Evaluating the feasibility of using memory content similarity to improve system resilience.

Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers, 2013

Using unreliable virtual hardware to inject errors in extreme-scale systems.

Matthew G. F. Dosanjh

Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale, 2013

Asking the Right Questions: Benchmarking Fault-Tolerant Extreme-Scale Systems.

Proceedings of the Euro-Par 2013: Parallel Processing Workshops, 2013

Energy-Efficient High Performance Computing - Measurement and Tuning.

Springer Briefs in Computer Science, Springer, ISBN: 978-1-4471-4491-5, 2013

2012

Fault-tolerant linear solvers via selective reliability.

CoRR, 2012

Improvements to the structural simulation toolkit.

Proceedings of the International ICST Conference on Simulation Tools and Techniques, 2012

Alleviating scalability issues of checkpointing protocols.

Proceedings of the SC Conference on High Performance Computing Networking, 2012

Poster: Comparing GPU and Increment-Based Checkpoint Compression.

Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Abstract: Comparing GPU and Increment-Based Checkpoint Compression.

Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Detection and correction of silent data corruption for large-scale high-performance computing.

Proceedings of the SC Conference on High Performance Computing Networking, 2012

Evaluating operating system vulnerability to memory errors.

Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers, 2012

On the Viability of Compression for Reducing the Overheads of Checkpoint/Restart-Based Fault Tolerance.

Proceedings of the 41st International Conference on Parallel Processing, 2012

Combining Partial Redundancy and Checkpointing for HPC.

Proceedings of the 2012 IEEE 32nd International Conference on Distributed Computing Systems, 2012

The Viability of Using Compression to Decrease Message Log Sizes.

Proceedings of the Euro-Par 2012: Parallel Processing Workshops, 2012

Does partial replication pay off?

Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, 2012

2011

Poster: detection and correction of silent data corruption for large-scale high-performance computing.

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2011

Poster: a tunable, software-based DRAM error detection and correction library for HPC.

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2011

Evaluating the viability of process replication reliability for exascale systems.

Proceedings of the Conference on High Performance Computing Networking, 2011

libhashckpt: Hash-Based Incremental Checkpointing Using GPU's.

Proceedings of the Recent Advances in the Message Passing Interface, 2011

Cache injection for parallel applications.

Proceedings of the 20th ACM International Symposium on High Performance Distributed Computing, 2011

Simulating Application Resilience at Exascale.

Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

On the Viability of Checkpoint Compression for Extreme Scale Fault Tolerance.

Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

A Tunable, Software-Based DRAM Error Detection and Correction Library for HPC.

Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

Cooperative Application/OS DRAM Fault Recovery.

Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

2010

Transparent Redundant Computing with MPI.

Ron Brightwell

Rolf Riesen

Proceedings of the Recent Advances in the Message Passing Interface, 2010

See applications run and throughput jump: The case for redundant computing in HPC.

Rolf Riesen