Ignacio Laguna

J. Syst. Softw., 2026

2025

Synthesizing Sound and Precise Abstract Transformers for Nonlinear Hyperbolic PDE Solvers.

Jacob Laurel

Jan Hückelheim

Proc. ACM Program. Lang., 2025

GORC: A Graph Neural Network Based Static Data Race Checker for OpenMP.

Anh Tran

Proceedings of the ISC High Performance 2025 Research Paper Proceedings (40th International Conference), 2025

Accurate Differential Analysis using Record and Selective Replay.

Proceedings of the 37th International Conference on Scalable Scientific Data Management, 2025

Scabbard: LLVM Instrumentation-aided Race Checking in CPU/GPU Unified Memory for AMD GPUs.

Andrew Osterhout

Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, 2025

LM-Offload: Performance Model-Guided Generative Inference of Large Language Models with Parallelism Control.

Jianbo Wu

Jie Ren

Shuangyan Yang

Dong Li

Proceedings of the 2025 IEEE International Parallel and Distributed Processing Symposium, 2025

FloatGuard: Efficient Whole-Program Detection of Floating-Point Exceptions in AMD GPUs.

Proceedings of the 34th International Symposium on High-Performance Parallel and Distributed Computing, 2025

2024

An automated OpenMP mutation testing framework for performance optimization.

Parallel Comput., 2024

Testing GPU Numerics: Finding Numerical Differences Between NVIDIA and AMD GPUs.

Anwar Hossain Zahid

Wei Le

Proceedings of the SC24-W: Workshops of the International Conference for High Performance Computing, 2024

Testing the Unknown: A Framework for OpenMP Testing via Random Program Generation.

Patrick J. Chapman

Proceedings of the SC24-W: Workshops of the International Conference for High Performance Computing, 2024

MUPPET: Optimizing Performance in OpenMP via Mutation Testing.

Proceedings of the 15th International Workshop on Programming Models and Applications for Multicores and Manycores, 2024

Input Range Generation for Compiler-Induced Numerical Inconsistencies.

Proceedings of the 38th ACM International Conference on Supercomputing, 2024

FPBOXer: Efficient Input-Generation for Targeting Floating-Point Exceptions in GPU Programs.

Anh Tran

Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 2024

Distributed Order Recording Techniques for Efficient Record-and-Replay of Multi-Threaded Programs.

Proceedings of the IEEE International Conference on Cluster Computing, 2024

Understanding Mixed Precision GEMM with MPGemmFI: Insights into Fault Resilience.

Siva Kumar Sastry Hari

Timothy Tsai

Dingwen Tao

Prashant J. Nair

Kevin J. Barker

Proceedings of the IEEE International Conference on Cluster Computing, 2024

Enhancing Performance Through Control-Flow Unmerging and Loop Unrolling on GPUs.

Alnis Murtovi

Chunhua Liao

Bernhard Steffen

Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2024

Discovery of Floating-Point Differences Between NVIDIA and AMD GPUs.

Katarzyna Swirydowicz

Proceedings of the 24th IEEE International Symposium on Cluster, 2024

FTTN: Feature-Targeted Testing for Numerical Properties of NVIDIA & AMD Matrix Accelerators.

Katarzyna Swirydowicz

Proceedings of the 24th IEEE International Symposium on Cluster, 2024

2023

Finding inputs that trigger floating-point exceptions in heterogeneous computing via Bayesian optimization.

Anh Tran

Parallel Comput., September, 2023

Approximate High-Performance Computing: A Fast and Energy-Efficient Computing Paradigm in the Post-Moore Era.

Jackson Vanover

IT Prof., 2023

MPGemmFI: A Fault Injection Technique for Mixed Precision GEMM in ML Applications.

Siva Kumar Sastry Hari

Timothy Tsai

Dingwen Tao

Prashant J. Nair

Kevin J. Barker

CoRR, 2023

Expression Isolation of Compiler-Induced Numerical Inconsistencies in Heterogeneous Code.

Proceedings of the High Performance Computing - 38th International Conference, 2023

Understanding System Resilience for Converged Computing of Cloud, Edge, and HPC.

Proceedings of the High Performance Computing, 2023

Scalable Tuning of (OpenMP) GPU Applications via Kernel Record and Replay.

Proceedings of the International Conference for High Performance Computing, 2023

Design and Evaluation of GPU-FPX: A Low-Overhead tool for Floating-Point Exception Detection in NVIDIA GPUs.

Katarzyna Swirydowicz

Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, 2023

2022

Giving Research Software Engineers a Larger Stage Through the Better Scientific Software Fellowship.

Comput. Sci. Eng., 2022

Giving RSEs a Larger Stage through the Better Scientific Software Fellowship.

CoRR, 2022

Toward Increasing Trust in Exascale Simulations.

Proceedings of the 4th Annual Workshop on Extreme-scale Experiment-in-the-Loop Computing, 2022

Approximate Computing Through the Lens of Uncertainty Quantification.

Proceedings of the SC22: International Conference for High Performance Computing, 2022

Finding Inputs that Trigger Floating-Point Exceptions in GPUs via Bayesian Optimization.

Proceedings of the SC22: International Conference for High Performance Computing, 2022

BinFPE: accurate floating-point exception detection for GPU applications.

Proceedings of the SOAP '22: 11th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis, 2022

Piper: Pipelining OpenMP Offloading Execution Through Compiler Optimization For Performance.

Johannes Doerfert

Thomas R. W. Scogland

Proceedings of the IEEE/ACM International Workshop on Performance, 2022

FPChecker: Floating-Point Exception Detection Tool and Benchmark for Parallel and Distributed HPC.

Tanmay Tirpankar

Proceedings of the IEEE International Symposium on Workload Characterization, 2022

Towards Precision-Aware Fault Tolerance Approaches for Mixed-Precision Applications.

Siva Kumar Sastry Hari

Timothy Tsai

Kevin J. Barker

Proceedings of the 12th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale, 2022

2021

PredCom: A Predictive Approach to Collecting Approximated Communication Traces.

Shinobu Miwa

IEEE Trans. Parallel Distributed Syst., 2021

PARIS: Predicting application resilience using machine learning.

Dong Li

J. Parallel Distributed Comput., 2021

Reinit++: Evaluating the Performance of Global-Restart Recovery Methods For MPI Fault Tolerance.

CoRR, 2021

Report of the Workshop on Program Synthesis for Scientific Computing.

Hal Finkel

CoRR, 2021

Understanding the use of message passing interface in exascale proxy applications.

Nawrin Sultana

Martin Rüfenacht

Anthony Skjellum

Purushotham V. Bangalore

Kathryn M. Mohror

Concurr. Comput. Pract. Exp., 2021

Keeping science on keel when software moves.

Commun. ACM, 2021

HPAC: evaluating approximate computing techniques on HPC OpenMP applications.

Proceedings of the International Conference for High Performance Computing, 2021

Examining Failures and Repairs on Supercomputers with Multi-GPU Compute Nodes.

Proceedings of the 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2021

Guarding Numerics Amidst Rising Heterogeneity.

Proceedings of the 5th IEEE/ACM International Workshop on Software Correctness for HPC Applications, 2021

Co-Designing Multi-Level Checkpoint Restart for MPI Applications.

Leonardo Bautista-Gomez

Proceedings of the 21st IEEE/ACM International Symposium on Cluster, 2021

2020

EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications.

Concurr. Comput. Pract. Exp., 2020

Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance.

Proceedings of the High Performance Computing - 35th International Conference, 2020

OMPRacer: a scalable and precise static race detector for OpenMP programs.

Proceedings of the International Conference for High Performance Computing, 2020

Extending the MPI Stages Model of Fault Tolerance.

Proceedings of the Workshop on Exascale MPI, 2020

pLiner: isolating lines of floating-point code for compiler-induced variability.

Hui Guo

Proceedings of the International Conference for High Performance Computing, 2020

ArcherGear: data race equivalencing for expeditious HPC debugging.

Samuel Thayer

Proceedings of the PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020

Detecting and reproducing error-code propagation bugs in MPI implementations.

Proceedings of the PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020

FAROS: A Framework to Analyze OpenMP Compilation Through Benchmarking and Compiler Optimization Analysis.

Johannes Doerfert

Thomas R. W. Scogland

Proceedings of the OpenMP: Portable Multi-Level Parallelism on Modern Systems, 2020

Varity: Quantifying Floating-Point Variations in HPC Systems Through Randomized Testing.

Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

HPC-MixPBench: An HPC Benchmark Suite for Mixed-Precision Analysis.

Tristan Vanderbruggen

Proceedings of the IEEE International Symposium on Workload Characterization, 2020

MATCH: An MPI Fault Tolerance Benchmark Suite.

Dong Li

Proceedings of the IEEE International Symposium on Workload Characterization, 2020

2019

Failure recovery for bulk synchronous applications with MPI stages.

Parallel Comput., 2019

Pruners.

Christopher M. Chambreau

Simone Atzeni

Michael Bentley

Int. J. High Perform. Comput. Appl., 2019

GPUMixer: Performance-Driven Floating-Point Tuning for GPU Scientific Applications.

Proceedings of the High Performance Computing - 34th International Conference, 2019

A large-scale study of MPI usage in open-source HPC applications.

Proceedings of the International Conference for High Performance Computing, 2019

FPChecker: Detecting Floating-Point Exceptions in GPU Applications.

Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering, 2019

SAFIRE: Scalable and Accurate Fault Injection for Parallel Multithreaded Applications.

Dimitrios S. Nikolopoulos

Hans Vandierendonck

Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019

AMPT-GA: automatic mixed precision floating point tuning for GPU applications.

Proceedings of the ACM International Conference on Supercomputing, 2019

Multi-Level Analysis of Compiler-Induced Variability and Performance Tradeoffs.

Michael Bentley

Ian Briggs

Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, 2019

ExaMPI: A Modern Design and Implementation to Accelerate Message Passing Interface Innovation.

Proceedings of the High Performance Computing - 6th Latin American Conference, 2019

2018

FlipTracker: understanding natural error resilience in HPC applications.

Proceedings of the International Conference for High Performance Computing, 2018

MPI Stages: Checkpointing MPI State for Bulk Synchronous Applications.

Proceedings of the 25th European MPI Users' Group Meeting, 2018

SWORD: A Bounded Memory-Overhead Detector of OpenMP Data Races in Production Runs.

Simone Atzeni

Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018

2017

Exploring versioned distributed arrays for resilience in scientific applications.

Zachary A. Rubenstein

Int. J. High Perform. Comput. Appl., 2017

Report of the HPC Correctness Summit, Jan 25-26, 2017, Washington, DC.

Paul D. Hovland

Costin Iancu

Sriram Krishnamoorthy

CoRR, 2017

Snowpack: efficient parameter choice for GPU kernels via static analysis and statistical prediction.

Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2017

REFINE: realistic fault injection via compiler-based instrumentation for accuracy, portability and speed.

Dimitrios S. Nikolopoulos

Proceedings of the International Conference for High Performance Computing, 2017

Noise Injection Techniques to Expose Subtle and Unintended Message Races.

Christopher M. Chambreau

Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017

Apollo: Reusable Models for Fast, Dynamic Tuning of Input-Dependent Code.

Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

Understanding the Spatial Characteristics of DRAM Errors in HPC Clusters.

Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale, 2017

2016

Evaluating and extending user-level fault tolerance in MPI applications.

Kathryn M. Mohror

Howard Pritchard

Int. J. High Perform. Comput. Appl., 2016

Pinpointing scale-dependent integer overflow bugs in large-scale parallel applications.

Proceedings of the International Conference for High Performance Computing, 2016

Testing Infrastructure for OpenMP Debugging Interface Implementations.

Proceedings of the OpenMP: Memory, Devices, and Tasks, 2016

ARCHER: Effectively Spotting Data Races in Large OpenMP Applications.

Simone Atzeni

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

IPAS: intelligent protection against silent output corruption in scientific applications.

Proceedings of the 2016 International Symposium on Code Generation and Optimization, 2016

2015

Diagnosis of Performance Faults in Large-Scale MPI Applications via Probabilistic Progress-Dependence Inference.

Todd Gamblin

IEEE Trans. Parallel Distributed Syst., 2015

Debugging high-performance computing applications at massive scales.

Commun. ACM, 2015

Clock delta compression for scalable order-replay of non-deterministic parallel applications.

Proceedings of the International Conference for High Performance Computing, 2015

Lessons Learned from Implementing OMPD: A Debugging Interface for OpenMP.

Proceedings of the OpenMP: Heterogenous Execution and Data Movements, 2015

Versioned Distributed Arrays for Resilience in Scientific Applications: Global View Resilience.

Proceedings of the International Conference on Computational Science, 2015

2014

Towards providing low-overhead data race detection for large OpenMP applications.

Proceedings of the 2014 LLVM Compiler Infrastructure in HPC, 2014

Evaluating User-Level Fault Tolerance for MPI Applications.

Proceedings of the 21st European MPI Users' Group Meeting, 2014

Accurate application progress analysis for large-scale parallel debugging.

Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2014

2013

Automatic Problem Localization via Multi-dimensional Metric Profiling.

Nawanol Theera-Ampornpunt

Subrata Mitra

Fahad A. Arshad

Proceedings of the IEEE 32nd Symposium on Reliable Distributed Systems, 2013

A study of application-level recovery methods for transient network faults.

Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2013

Overcoming extreme-scale reproducibility challenges through a unified, targeted, and multilevel toolset.

Gregory L. Lee

Zvonimir Rakamaric

Proceedings of the 1st International Workshop on Software Engineering for High Performance Computing in Computational Science and Engineering, 2013

Performance Analysis Techniques for the Exascale Co-Design Process.

Proceedings of the Parallel Computing: Accelerating Computational Science and Engineering (CSE), 2013

2012

Automatic fault characterization via abnormality-enhanced classification.

Greg Bronevetsky

Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks, 2012

Probabilistic diagnosis of performance faults in large-scale parallel applications.

Todd Gamblin

Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2012

2011

Large scale debugging of parallel tasks with AutomaDeD.

Todd Gamblin

Proceedings of the Conference on High Performance Computing Networking, 2011

2010

AutomaDeD: Automata-based debugging for dissimilar parallel tasks.

Greg Bronevetsky

Proceedings of the 2010 IEEE/IFIP International Conference on Dependable Systems and Networks, 2010

2009

Scalable temporal order analysis for large scale debugging.

Proceedings of the ACM/IEEE Conference on High Performance Computing, 2009

How to Keep Your Head above Water While Detecting Errors.

Proceedings of the Middleware 2009, ACM/IFIP/USENIX, 10th International Middleware Conference, Urbana, IL, USA, November 30, 2009

Stateful error detection in high throughput applications.

Fahad A. Arshad