Kurt B. Ferreira

Orcid: 0000-0001-5607-5691

According to our database, Kurt B. Ferreira authored at least 82 papers between 2007 and 2023.

Bibliography

2023
Using Benford's Law to Identify Unusual Failure Regions.
Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023

Evaluating the Viability of LogGP for Modeling MPI Performance with Non-contiguous Datatypes on Modern Architectures.
Proceedings of the 30th European MPI Users' Group Meeting, 2023

2022
Understanding Memory Failures on a Petascale Arm System.
Proceedings of the HPDC '22: The 31st International Symposium on High-Performance Parallel and Distributed Computing, Minneapolis, MN, USA, 27 June 2022, 2022

2021
Evaluating MPI resource usage summary statistics.
Parallel Comput., 2021

Characterizing Memory Failures Using Benford's Law.
Proceedings of the Euro-Par 2021: Parallel Processing Workshops, 2021

Understanding the Effects of DRAM Correctable Error Logging at Scale.
Proceedings of the IEEE International Conference on Cluster Computing, 2021

2020
The unexpected virtue of almost: Exploiting MPI collective operations to approximately coordinate checkpoints.
Concurr. Comput. Pract. Exp., 2020

Hardware MPI message matching: Insights into MPI matching behavior to inform design.
Concurr. Comput. Pract. Exp., 2020

ALAMO: Autonomous Lightweight Allocation, Management, and Optimization.
Proceedings of the Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI, 2020

Evaluating MPI Message Size Summary Statistics.
Proceedings of the EuroMPI/USA '20: 27th European MPI Users' Group Meeting, 2020

2019
Using simulation to examine the effect of MPI message matching costs on application performance.
Parallel Comput., 2019

Checkpointing Strategies for Shared High-Performance Computing Platforms.
Int. J. Netw. Comput., 2019

Evaluating tradeoffs between MPI message matching offload hardware capacity and performance.
Proceedings of the 26th European MPI Users' Group Meeting, 2019

Space-Efficient Reed-Solomon Encoding to Detect and Correct Pointer Corruption.
Proceedings of the Euro-Par 2019: Parallel Processing Workshops, 2019

Sandia Line of LWKs.
Proceedings of the Operating Systems for Supercomputers and High Performance Computing, 2019

2018
Characterizing MPI matching via trace-based simulation.
Parallel Comput., 2018

Lessons learned from memory errors observed over the lifetime of Cielo.
Proceedings of the International Conference for High Performance Computing, 2018

Using Simulation to Examine the Effect of MPI Message Matching Costs on Application Performance.
Proceedings of the 25th European MPI Users' Group Meeting, 2018

Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018

Open Science on Trinity's Knights Landing Partition: An Analysis of User Job Data.
Proceedings of the 47th International Conference on Parallel Processing, 2018

Physics-Informed Machine Learning for DRAM Error Modeling.
Proceedings of the 2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, 2018

2017
It's Not the Heat, It's the Humidity: Scheduling Resilience Activity at Scale.
Proceedings of the Euro-Par 2017: Parallel Processing Workshops, 2017

Automating DRAM Fault Mitigation By Learning From Experience.
Proceedings of the 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, 2017

Lifetime memory reliability data from the field.
Proceedings of the IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, 2017

Evaluating the Viability of Using Compression to Mitigate Silent Corruption of Read-Mostly Application Data.
Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

2016
On noise and the performance benefit of nonblocking collectives.
Int. J. High Perform. Comput. Appl., 2016

Understanding performance interference in next-generation HPC systems.
Proceedings of the International Conference for High Performance Computing, 2016

Improving application resilience to memory errors with lightweight compression.
Proceedings of the International Conference for High Performance Computing, 2016

How I Learned to Stop Worrying and Love In Situ Analytics: Leveraging Latent Synchronization in MPI Collective Algorithms.
Proceedings of the 23rd European MPI Users' Group Meeting, EuroMPI 2016, 2016

Mini-Ckpts: Surviving OS Failures in Persistent Memory.
Proceedings of the 2016 International Conference on Supercomputing, 2016

An Examination of the Impact of Failure Distribution on Coordinated Checkpoint/Restart.
Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale, 2016

Horseshoes and Hand Grenades: The Case for Approximate Coordination in Local Checkpointing Protocols.
Proceedings of the Euro-Par 2016: Parallel Processing Workshops, 2016

FlipSphere: A Software-Based DRAM Error Detection and Correction Library for HPC.
Proceedings of the 20th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications, 2016

Improving DRAM Fault Characterization through Machine Learning.
Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, 2016

Scheduling In-Situ Analytics in Next-Generation Applications.
Proceedings of the IEEE/ACM 16th International Symposium on Cluster, 2016

2015
A study of the viability of exploiting memory content similarity to improve resilience to memory errors.
Int. J. High Perform. Comput. Appl., 2015

A checkpoint compression study for high-performance computing systems.
Int. J. High Perform. Comput. Appl., 2015

Early experiences with node-level power capping on the Cray XC40 platform.
Proceedings of the 3rd International Workshop on Energy Efficient Supercomputing, 2015

What is a Lightweight Kernel?
Proceedings of the 5th International Workshop on Runtime and Operating Systems for Supercomputers, 2015

A Principled Approach to HPC Event Monitoring.
Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale, 2015

Canaries in a Coal Mine: Using Application-Level Checkpoints to Detect Memory Failures.
Proceedings of the Euro-Par 2015: Parallel Processing Workshops, 2015

Memory Errors in Modern Systems: The Good, The Bad, and The Ugly.
Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, 2015

2014
Accelerating incremental checkpointing for extreme-scale computing.
Future Gener. Comput. Syst., 2014

Understanding the Effects of Communication and Coordination on Checkpointing at Scale.
Proceedings of the International Conference for High Performance Computing, 2014

Exploring the effect of noise on the performance benefit of nonblocking allreduce.
Proceedings of the 21st European MPI Users' Group Meeting, 2014

Energy Consumption of Resilience Mechanisms in Large Scale Systems.
Proceedings of the 22nd Euromicro International Conference on Parallel, 2014

Characterizing the Impact of Rollback Avoidance at Extreme-Scale: A Modeling Approach.
Proceedings of the 43rd International Conference on Parallel Processing, 2014

Coarse-Grained Energy Modeling of Rollback/Recovery Mechanisms.
Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2014

2013
The impact of system design parameters on application noise sensitivity.
Clust. Comput., 2013

Evaluating energy savings for checkpoint/restart.
Proceedings of the 1st International Workshop on Energy Efficient Supercomputing, 2013

Using Simulation to Evaluate the Performance of Resilience Strategies at Scale.
Proceedings of the High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation, 2013

Evaluating the feasibility of using memory content similarity to improve system resilience.
Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers, 2013

Using unreliable virtual hardware to inject errors in extreme-scale systems.
Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale, 2013

Asking the Right Questions: Benchmarking Fault-Tolerant Extreme-Scale Systems.
Proceedings of the Euro-Par 2013: Parallel Processing Workshops, 2013

Energy-Efficient High Performance Computing - Measurement and Tuning.
Springer Briefs in Computer Science, Springer, ISBN: 978-1-4471-4491-5, 2013

2012
Fault-tolerant linear solvers via selective reliability.
CoRR, 2012

Improvements to the structural simulation toolkit.
Proceedings of the International ICST Conference on Simulation Tools and Techniques, 2012

Alleviating scalability issues of checkpointing protocols.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

Poster: Comparing GPU and Increment-Based Checkpoint Compression.
Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Abstract: Comparing GPU and Increment-Based Checkpoint Compression.
Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Detection and correction of silent data corruption for large-scale high-performance computing.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

Evaluating operating system vulnerability to memory errors.
Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers, 2012

On the Viability of Compression for Reducing the Overheads of Checkpoint/Restart-Based Fault Tolerance.
Proceedings of the 41st International Conference on Parallel Processing, 2012

Combining Partial Redundancy and Checkpointing for HPC.
Proceedings of the 2012 IEEE 32nd International Conference on Distributed Computing Systems, 2012

The Viability of Using Compression to Decrease Message Log Sizes.
Proceedings of the Euro-Par 2012: Parallel Processing Workshops, 2012

Does partial replication pay off?
Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, 2012

2011
Poster: detection and correction of silent data corruption for large-scale high-performance computing.
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2011

Poster: a tunable, software-based DRAM error detection and correction library for HPC.
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2011

Evaluating the viability of process replication reliability for exascale systems.
Proceedings of the Conference on High Performance Computing Networking, 2011

libhashckpt: Hash-Based Incremental Checkpointing Using GPU's.
Proceedings of the Recent Advances in the Message Passing Interface, 2011

Cache injection for parallel applications.
Proceedings of the 20th ACM International Symposium on High Performance Distributed Computing, 2011

Simulating Application Resilience at Exascale.
Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

On the Viability of Checkpoint Compression for Extreme Scale Fault Tolerance.
Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

A Tunable, Software-Based DRAM Error Detection and Correction Library for HPC.
Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

Cooperative Application/OS DRAM Fault Recovery.
Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

2010
Transparent Redundant Computing with MPI.
Proceedings of the Recent Advances in the Message Passing Interface, 2010

See applications run and throughput jump: The case for redundant computing in HPC.
Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W 2010), Chicago, Illinois, USA, June 28, 2010

2009
Designing and implementing lightweight kernels for capability computing.
Concurr. Comput. Pract. Exp., 2009

Topics on measuring real power usage on high performance computing platforms.
Proceedings of the 2009 IEEE International Conference on Cluster Computing, August 31, 2009

2008
Characterizing application sensitivity to OS interference using kernel-level noise injection.
Proceedings of the ACM/IEEE Conference on High Performance Computing, 2008

Instrumentation and Analysis of MPI Queue Times on the SeaStar High-Performance Network.
Proceedings of the 17th International Conference on Computer Communications and Networks, 2008

2007
Reducing the Impact of the Memory Wall for I/O Using Cache Injection.
Proceedings of the 15th Annual IEEE Symposium on High-Performance Interconnects, 2007

