Nathan DeBardeleben

Orcid: 0000-0002-5593-9205

Affiliations:
  • Los Alamos National Laboratory
  • Clemson University, Electrical and Computer Engineering department


According to our database, Nathan DeBardeleben authored at least 74 papers between 2000 and 2023.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of two.

Bibliography

2023
Fault Injection for TensorFlow Applications.
IEEE Trans. Dependable Secur. Comput., 2023

Incorporating Staggered Planned Maintenance Reservations to Improve Performance in Computational Clusters.
Proceedings of the IEEE International Conference on Cluster Computing, 2023

2022
Resiliency in numerical algorithm design for extreme scale simulations.
Int. J. High Perform. Comput. Appl., 2022

Exploring Data Reduction Techniques for Additive Manufacturing Analysis.
Proceedings of the 8th IEEE/ACM International Workshop on Data Analysis and Reduction for Big Scientific Data, 2022

Online Detection and Classification of State Transitions of Multivariate Shock and Vibration Data.
Proceedings of the IEEE High Performance Extreme Computing Conference, 2022

2021
Thermal neutrons: a possible threat for supercomputer reliability.
J. Supercomput., 2021

Quantifying Server Memory Frequency Margin and Using It to Improve Performance in HPC Systems.
Proceedings of the 48th ACM/IEEE Annual International Symposium on Computer Architecture, 2021

Exploring the Tradeoff Between Reliability and Performance in HPC Systems.
Proceedings of the 2021 IEEE High Performance Extreme Computing Conference, 2021

Statistical Framework for Two-Party Acceptance Testing of HPC Systems for Reliability.
Proceedings of the 11th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale, 2021

Understanding the Effects of DRAM Correctable Error Logging at Scale.
Proceedings of the IEEE International Conference on Cluster Computing, 2021

2020
TensorFI: A Flexible Fault Injection Framework for TensorFlow Applications.
Proceedings of the 31st IEEE International Symposium on Software Reliability Engineering, 2020

Thermal Neutrons: a Possible Threat for Supercomputers and Safety Critical Applications.
Proceedings of the IEEE European Test Symposium, 2020

An Overview of the Risk Posed by Thermal Neutrons to the Reliability of Computing Devices.
Proceedings of the 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2020

Extreme Protection Against Data Loss with Single-Overlap Declustered Parity.
Proceedings of the 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2020

Chaser: An Enhanced Fault Injection Tool for Tracing Soft Errors in MPI Applications.
Proceedings of the 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2020

2019
Failure Analysis and Quantification for Contemporary and Future Supercomputers.
CoRR, 2019

Soft Error Resilience and Failure Recovery for Continuum Dynamics Applications.
CoRR, 2019

BinFI: an efficient fault injector for safety-critical machine learning systems.
Proceedings of the International Conference for High Performance Computing, 2019

Quantifying Memory Underutilization in HPC Systems and Using it to Improve Performance via Architecture Support.
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019

Do Solar Proton Events Reduce the Number of Faults in Supercomputers?: A Comparative Analysis of Faults During and without Solar Proton Events.
Proceedings of the IEEE International Reliability Physics Symposium, 2019

TSM2: optimizing tall-and-skinny matrix-matrix multiplication on GPUs.
Proceedings of the ACM International Conference on Supercomputing, 2019

2018
The Atlas Cluster Trace Repository.
login Usenix Mag., 2018

Using virtualization to quantify power conservation via near-threshold voltage reduction for inherently resilient applications.
Parallel Comput., 2018

Characterization and Comparison of Application Resilience for Serial and Parallel Executions.
CoRR, 2018

On the diversity of cluster workloads and its impact on research results.
Proceedings of the 2018 USENIX Annual Technical Conference, 2018

Improving Application Resilience by Extending Error Correction with Contextual Information.
Proceedings of the IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale, 2018

Lessons learned from memory errors observed over the lifetime of Cielo.
Proceedings of the International Conference for High Performance Computing, 2018

SaNSA - The Supercomputer and Node State Architecture.
Proceedings of the IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale, 2018

TensorFI: A Configurable Fault Injector for TensorFlow Applications.
Proceedings of the 2018 IEEE International Symposium on Software Reliability Engineering Workshops, 2018

Enhancing HPC System Log Analysis by Identifying Message Origin in Source Code.
Proceedings of the 2018 IEEE International Symposium on Software Reliability Engineering Workshops, 2018

Modeling Application Resilience in Large-scale Parallel Execution.
Proceedings of the 47th International Conference on Parallel Processing, 2018

Physics-Informed Machine Learning for DRAM Error Modeling.
Proceedings of the 2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, 2018

2017
Addressing statistical significance of fault injection: empirical studies of the soft error susceptibility.
Int. J. High Perform. Comput. Netw., 2017

Experimental and analytical study of Xeon Phi reliability.
Proceedings of the International Conference for High Performance Computing, 2017

Silent Data Corruption Resilient Two-sided Matrix Factorizations.
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017

LetGo: A Lightweight Continuous Framework for HPC Applications Under Failures.
Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, 2017

Resilience Analysis of Top K Selection Algorithms.
Proceedings of the 13th European Dependable Computing Conference, 2017

RSVP: Soft Error Resilient Power Savings at Near-Threshold Voltage Using Register Vulnerability.
Proceedings of the 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, 2017

Automating DRAM Fault Mitigation By Learning From Experience.
Proceedings of the 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, 2017

Lifetime memory reliability data from the field.
Proceedings of the IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, 2017

2016
Design, Use and Evaluation of P-FSEFI: A Parallel Soft Error Fault Injection Framework for Emulating Soft Errors in Parallel Applications.
Proceedings of the 9th EAI International Conference on Simulation Tools and Techniques, 2016

Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra.
Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, 2016

On the Inherent Resilience of Integer Operations.
Proceedings of the Euro-Par 2016: Parallel Processing Workshops, 2016

SDC is in the Eye of the Beholder: A Survey and Preliminary Study.
Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, 2016

Improving DRAM Fault Characterization through Machine Learning.
Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, 2016

2015
Field, experimental, and analytical data on large-scale HPC systems and evaluation of the implications for exascale system design.
Proceedings of the 33rd IEEE VLSI Test Symposium, 2015

Differentiated Failure Remediation with Action Selection for Resilient Computing.
Proceedings of the 21st IEEE Pacific Rim International Symposium on Dependable Computing, 2015

Empirical Studies of the Soft Error Susceptibility of Sorting Algorithms to Statistical Fault Injection.
Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale, 2015

Understanding GPU errors on large-scale HPC systems and the implications for system design and operation.
Proceedings of the 21st IEEE International Symposium on High Performance Computer Architecture, 2015

On the Non-Suitability of Non-Volatility.
Proceedings of the 7th USENIX Workshop on Hot Topics in Storage and File Systems, 2015

Towards Building Resilient Scientific Applications: Resilience Analysis on the Impact of Soft Error and Transient Error Tolerance with the CLAMR Hydrodynamics Mini-App.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

Memory Errors in Modern Systems: The Good, The Bad, and The Ugly.
Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, 2015

2014
Addressing failures in exascale computing.
Int. J. High Perform. Comput. Appl., 2014

An investigation of the effects of hard and soft errors on graphics processing unit-accelerated molecular dynamics simulations.
Concurr. Comput. Pract. Exp., 2014

Fault Injection Experiments with the CLAMR Hydrodynamics Mini-App.
Proceedings of the 25th IEEE International Symposium on Software Reliability Engineering Workshops, 2014

F-SEFI: A Fine-Grained Soft Error Fault Injection Tool for Profiling Application Vulnerability.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

Harnessing Unreliable Cores in Heterogeneous Architecture: The PyDac Programming Model and Runtime.
Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2014

GPGPUs: How to combine high computational power with high reliability.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2014

2013
Feng shui of supercomputer memory: positional effects in DRAM and SRAM faults.
Proceedings of the International Conference for High Performance Computing, 2013

Analyzing Reliability of Memory Sub-systems with Double-Chipkill Detect/Correct.
Proceedings of the IEEE 19th Pacific Rim International Symposium on Dependable Computing, 2013

Exploring Time and Frequency Domains for Accurate and Automated Anomaly Detection in Cloud Computing Systems.
Proceedings of the IEEE 19th Pacific Rim International Symposium on Dependable Computing, 2013

PyDac: A Resilient Run-Time Framework for Divide-and-Conquer Applications on a Heterogeneous Many-Core Architecture.
Proceedings of the Euro-Par 2013: Parallel Processing Workshops, 2013

GPU Behavior on a Large HPC Cluster.
Proceedings of the Euro-Par 2013: Parallel Processing Workshops, 2013

2012
Application monitoring and checkpointing in HPC: looking towards exascale systems.
Proceedings of the 50th Annual Southeast Regional Conference, 2012

2011
Experimental Framework for Injecting Logic Errors in a Virtual Machine to Profile Applications for Soft Error Resilience.
Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

2010
Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters.
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, 2010

1st workshop on fault-tolerance for HPC at extreme scale FTXS 2010.
Proceedings of the 2010 IEEE/IFIP International Conference on Dependable Systems and Networks, 2010

2009
Building problem-solving environments with the Arches framework.
J. Syst. Softw., 2009

2008
Application Resilience: Making Progress in Spite of Failure.
Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008), 2008

2006
Developing Scientific Applications Using Eclipse.
Comput. Sci. Eng., 2006

A Model-Based Framework for the Integration of Parallel Tools.
Proceedings of the 2006 IEEE International Conference on Cluster Computing, 2006

2004
Arches: An Infrastructure for PSE Development.
Proceedings of the 9th International Workshop on High-Level Programming Models and Supportive Environments (HIPS 2004), 2004

2002
Coven - A Framework for High Performance Problem Solving Environments.
Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing (HPDC-11 2002), 2002

2000
Parallelization Techniques for Spatial-Temporal Occupancy Maps from Multiple Video Streams.
Proceedings of the Parallel and Distributed Processing, 2000

