Saurabh Hukerikar

Atieh Lotfi

Yanxiang Huang

Jason Campbell

Nirmal R. Saxena

Proceedings of the 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2024

2022

Runtime Fault Diagnostics for GPU Tensor Cores.

Nirmal R. Saxena

Proceedings of the IEEE International Test Conference, 2022

2021

Characterizing and Mitigating Soft Errors in GPU DRAM.

Siva Kumar Sastry Hari

Stephen W. Keckler

Proceedings of MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021

2020

PLEXUS: A Pattern-Oriented Runtime System Architecture for Resilient Extreme-Scale High-Performance Computing Systems.

Proceedings of the 25th IEEE Pacific Rim International Symposium on Dependable Computing, 2020

2019

Resiliency of automotive object detection networks on GPU architectures.

Atieh Lotfi

Keshav Balasubramanian

Proceedings of the IEEE International Test Conference, 2019

2018

RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading.

Int. J. Parallel Program., 2018

Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing.

Rizwan A. Ashraf

Proceedings of the 2018 ACM/SPEC International Conference on Performance Engineering, 2018

Shrink or Substitute: Handling Process Failures in HPC Systems Using In-Situ Recovery.

Rizwan A. Ashraf

Proceedings of the 26th Euromicro International Conference on Parallel, 2018

2017

Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale.

Supercomput. Front. Innov., 2017

Towards New Metrics for High-Performance Computing Resilience.

Rizwan A. Ashraf

Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale, 2017

A Pattern Language for High-Performance Computing Resilience.

Proceedings of the 22nd European Conference on Pattern Languages of Programs, 2017

Pattern-Based Modeling of High-Performance Computing Resilience.

Proceedings of Euro-Par 2017: Parallel Processing Workshops, 2017

Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale.

Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

2016

Rolex: resilience-oriented language extensions for extreme-scale systems.

J. Supercomput., 2016

Language Support for Reliable Memory Regions.

Proceedings of Languages and Compilers for Parallel Computing, 2016

Havens: Explicit reliable memory regions for HPC applications.

Proceedings of the 2016 IEEE High Performance Extreme Computing Conference, 2016

2015

Enabling application resilience through programming model based fault amelioration.

Proceedings of the 2015 IEEE High Performance Extreme Computing Conference, 2015

2014

An evaluation of lazy fault detection based on Adaptive Redundant Multithreading.

Proceedings of the IEEE High Performance Extreme Computing Conference, 2014

Opportunistic application-level fault detection through adaptive redundant multithreading.

Proceedings of the International Conference on High Performance Computing & Simulation, 2014

2013

Robust graph traversal: Resiliency techniques for data intensive supercomputing.

Proceedings of the IEEE High Performance Extreme Computing Conference, 2013

A Case for Adaptive Redundancy for HPC Resilience.

Proceedings of Euro-Par 2013: Parallel Processing Workshops, 2013

2012

Poster: Programming Model Extensions for Resilience in Extreme Scale Computing.

Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Programming Model Extensions for Resilience in Extreme Scale Computing.

Proceedings of Euro-Par 2012: Parallel Processing Workshops, 2012

A programming model for resilience in extreme scale computing.
