Stephen L. Scott

Narasimha Raju Gottumukkala

Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

A Case for Virtual Machine Based Fault Injection in a High-Performance Computing Environment.

Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

2010

Reliability of a System of k Nodes for High Performance Computing Applications.

IEEE Trans. Reliab., 2010

Incremental Checkpoint Schemes for Weibull Failure Distribution.

Mihaela Paun

Int. J. Found. Comput. Sci., 2010

System-level virtualization research at Oak Ridge National Laboratory.

Future Gener. Comput. Syst., 2010

Benefits of Software Rejuvenation on HPC Systems.

Proceedings of the IEEE International Symposium on Parallel and Distributed Processing with Applications, 2010

Hybrid Checkpointing for MPI Jobs in HPC Environments.

Proceedings of the 16th IEEE International Conference on Parallel and Distributed Systems, 2010

Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments.

Swen Böhm

Proceedings of the 12th IEEE International Conference on High Performance Computing and Communications, 2010

Loadable Hypervisor Modules.

Proceedings of the 43rd Hawaii International Conference on System Sciences (HICSS-43 2010), 2010

2009

Symmetric active/active metadata service for high availability parallel file systems.

J. Parallel Distributed Comput., 2009

High Performance Computing Systems with Various Checkpointing Schemes.

Int. J. Comput. Commun. Control, 2009

A tunable holistic resiliency approach for high-performance computing systems.

Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009

Proactive Fault Tolerance Using Preemptive Migration.

Proceedings of the 17th Euromicro International Conference on Parallel, 2009

Refinement Proposal of the Goldberg's Theory.

Proceedings of the Algorithms and Architectures for Parallel Processing, 2009

Performance comparison of two virtual machine scenarios using an HPC application: a case study using molecular dynamics simulations.

Proceedings of the 3rd ACM Workshop on System-level Virtualization for High Performance Computing, 2009

An Extensible I/O Performance Analysis Framework for Distributed Environments.

Proceedings of the Euro-Par 2009 Parallel Processing, 2009

Blue Gene/L Log Analysis and Time to Interrupt Estimation.

Narate Taerat

Proceedings of the Fourth International Conference on Availability, 2009

2008

Virtual System Environments.

Proceedings of the Systems and Virtualization Management. Standards and New Technologies, 2008

Proactive process-level live migration in HPC environments.

Proceedings of the ACM/IEEE Conference on High Performance Computing, 2008

System-Level Virtualization for High Performance Computing.

Proceedings of the 16th Euromicro International Conference on Parallel, 2008

Virtualized Environments for the Harness High Performance Computing Workbench.

Proceedings of the 16th Euromicro International Conference on Parallel, 2008

Failure Prediction Models for Proactive Fault Tolerance Within Storage Environments.

Proceedings of the 16th International Symposium on Modeling, 2008

An optimal checkpoint/restart model for a large scale high performance computing system.

Yudan Liu

Raja Nassar

Mihaela Paun

Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

Proposal for Modifications to the OSCAR Architecture to Address Challenges in Distributed System Management.

Proceedings of the 22nd Annual International Symposium on High Performance Computing Systems and Applications (HPCS 2008), 2008

Effects of virtualization on a scientific application running a hyperspectral radiative transfer code on virtual machines.

Proceedings of the 2nd Workshop on System-Level Virtualization for High Performance Computing, 2008

An Analysis of HPC Benchmarks in Virtual Machine Environments.

Proceedings of the Euro-Par 2008 Workshops, 2008

Complementarity between Virtualization and Single System Image Technologies.

Proceedings of the Euro-Par 2008 Workshops, 2008

Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments.

Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008), 2008

Symmetric Active/Active High Availability for High-Performance Computing System Services: Accomplishments and Limitations.

Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008), 2008

A Framework for Proactive Fault Tolerance.

Kulathep Charoenpornwattana

Proceedings of the Third International Conference on Availability, 2008

Symmetric Active/Active Replication for Dependent Services.

Proceedings of the Third International Conference on Availability, 2008

2007

A unified multiple-level cache for high performance storage systems.

Int. J. High Perform. Comput. Netw., 2007

A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance.

Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

Proactive fault tolerance for HPC with Xen virtualization.

Proceedings of the 21st Annual International Conference on Supercomputing, 2007

A Fast Delivery Protocol for Total Order Broadcasting.

Proceedings of the 16th International Conference on Computer Communications and Networks, 2007

Middleware in Modern High Performance Computing System Architectures.

Hong Ong

Proceedings of the Computational Science - ICCS 2007, 7th International Conference, Beijing, China, May 27, 2007

Automatic Testing Tool for OSCAR Using System-level Virtualization.

Proceedings of the 21st Annual International Symposium on High Performance Computing Systems and Applications (HPCS 2007), 2007

Design and Implementation of a Menu Based OSCAR Command Line Interface.

Proceedings of the 21st Annual International Symposium on High Performance Computing Systems and Applications (HPCS 2007), 2007

Evaluation of fault-tolerant policies using simulation.

Proceedings of the 2007 IEEE International Conference on Cluster Computing, 2007

A reliability-aware approach for an optimal checkpoint/restart model in HPC environments.

Yudan Liu

Raja Nassar

Mihaela Paun

Narasimha Raju Gottumukkala

Proceedings of the 2007 IEEE International Conference on Cluster Computing, 2007

Reliability-aware resource allocation in HPC systems.

Proceedings of the 2007 IEEE International Conference on Cluster Computing, 2007

System management software for virtual environments.

Proceedings of the 4th Conference on Computing Frontiers, 2007

Transparent Symmetric Active/Active Replication for Service-Level High Availability.

Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2007), 2007

On Programming Models for Service-Level High Availability.

Proceedings of the Second International Conference on Availability, 2007

2006

Constructing collaborative desktop storage caches for large scientific datasets.

Sudharshan S. Vazhkudai

Xiaosong Ma

Vincent W. Freeh

Jonathan W. Strickland

Nandan Tammineedi

Tyler A. Simon

ACM Trans. Storage, 2006

MOLAR: adaptive runtime support for high-end computing operating and runtime systems.

Narasimha Raju Gottumukkala

David E. Bernholdt

ACM SIGOPS Oper. Syst. Rev., 2006

Symmetric Active/Active High Availability for High-Performance Computing System Services.

J. Comput., 2006

OSCAR - OSCAR community meeting.

Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, 2006

Xen-OSCAR for Cluster Virtualization.

Proceedings of the Frontiers of High Performance Computing and Networking, 2006

Scalable, fault tolerant membership for MPI tasks on HPC systems.

Proceedings of the 20th Annual International Conference on Supercomputing, 2006

Coupling prefix caching and collective downloads for remote dataset access.

Proceedings of the 20th Annual International Conference on Supercomputing, 2006

OSCAR Testing with Xen.

Proceedings of the 20th Annual International Symposium on High Performance Computing Systems and Applications (HPCS 2006), 2006

A Component-Based Approach to Improving the Modularity of OSCAR.

Proceedings of the 20th Annual International Symposium on High Performance Computing Systems and Applications (HPCS 2006), 2006

JOSHUA: Symmetric Active/Active Replication for Highly Available HPC Job and Resource Management.

Kai Uhlemann

Proceedings of the 2006 IEEE International Conference on Cluster Computing, 2006

IPMI-based Efficient Notification Framework for Large Scale Cluster Computing.

Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2006), 2006

Active/Active Replication for Highly Available HPC System Services.

Proceedings of the First International Conference on Availability, 2006

2005

Achieving high availability and performance computing with an HA-OSCAR cluster.

Future Gener. Comput. Syst., 2005

UML-based Beowulf Cluster Availability Modeling.

Proceedings of the International Conference on Software Engineering Research and Practice, 2005

FreeLoader: Scavenging Desktop Storage Resources for Scientific Data.

Sudharshan Vazhkudai

Xiaosong Ma

Vincent W. Freeh

Jonathan W. Strickland

Nandan Tammineedi

Proceedings of the ACM/IEEE SC2005 Conference on High Performance Networking and Computing, 2005

A Unified Multiple-Level Cache for High Performance Storage Systems.

Proceedings of the 13th International Symposium on Modeling, 2005

Model-Based Statistical Testing of a Cluster Utility.

W. Thomas Swain

Proceedings of the Computational Science, 2005

SSI-OSCAR: A Cluster Distribution for High Performance Computing Using a Single System Image.

Proceedings of the 19th Annual International Symposium on High Performance Computing Systems and Applications (HPCS 2005), 2005

OSCAR Meta-Package System.

John Mugler

Proceedings of the 19th Annual International Symposium on High Performance Computing Systems and Applications (HPCS 2005), 2005

Grid-Aware HA-OSCAR.

Proceedings of the 19th Annual International Symposium on High Performance Computing Systems and Applications (HPCS 2005), 2005

Reliability-aware resource management for computational grid/cluster environments.

Proceedings of the 6th IEEE/ACM International Conference on Grid Computing (GRID 2005), 2005

Reliability-aware Checkpoint/Restart Scheme: A Performability Trade-off.

Proceedings of the 2005 IEEE International Conference on Cluster Computing (CLUSTER 2005), September 26, 2005

Job-Site Level Fault Tolerance for Cluster and Grid environments.

Proceedings of the 2005 IEEE International Conference on Cluster Computing (CLUSTER 2005), September 26, 2005

2004

Highly Reliable Linux HPC Clusters: Self-Awareness Approach.

Proceedings of the Parallel and Distributed Processing and Applications, 2004

Online Remote Data Backup for iSCSI-Based Storage Systems.

Proceedings of the International Conference on Internet Computing, 2004

2003

ORNL-RSH Package and Windows '03 PVM 3.4.

Phil Pfeiffer

Hardik Shukla

Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 10th European PVM/MPI Users' Group Meeting, Venice, Italy, September 29, 2003

Dependability Prediction of High Availability OSCAR Cluster Server.

Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, 2003

Availability Prediction and Modeling of High Availability OSCAR Cluster.

Proceedings of the 2003 IEEE International Conference on Cluster Computing (CLUSTER 2003), 2003

2002

Distributed Peer-to-Peer Control in Harness.

George Al Geist II

Proceedings of the Computational Science - ICCS 2002, 2002

2001

Cluster Command and Control (C3) Tool Suite.

Parallel Distributed Comput. Pract., 2001

Systems Administration.

Anthony Skjellum

Rossen Dimitrov

Srihari Venkata Angaluri

Int. J. High Perform. Comput. Appl., 2001

VIA Communication Performance on a Gigabit Ethernet Cluster.

Proceedings of the Euro-Par 2001: Parallel Processing, 2001

OSCAR and the Beowulf Arms Race for the "Cluster Standard".

Proceedings of the 2001 IEEE International Conference on Cluster Computing (CLUSTER 2001), 2001

M3C: Managing and Monitoring Multiple Clusters.

Proceedings of the First IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2001), 2001

2000

GigaBit Performance under NT.

Proceedings of the Parallel and Distributed Processing, 2000

Tutorial A: Design and Analysis of High Performance Clusters.

Proceedings of the 2000 IEEE International Conference on Cluster Computing (CLUSTER 2000), November 28th, 2000

ORNL M3C tool.

Proceedings of the 2000 IEEE International Conference on Cluster Computing (CLUSTER 2000), November 28th, 2000

Enabling High Performance Data Transfer on Cluster Architecture.

Paul A. Farrell

Hong Ong

Proceedings of the 2000 IEEE International Conference on Cluster Computing (CLUSTER 2000), November 28th, 2000

1999

Harness: Adaptable Virtual Machine Environment for Heterogeneous Clusters.

George Al Geist II

James Arthur Kohl

Philip M. Papadopoulos

Parallel Process. Lett., 1999

1998

PVM on Windows and NT Clusters.

Markus Fischer

Al Geist

Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 1998

HARNESS: Heterogeneous Adaptable Reconfigurable NEtworked SystemS.

Philip M. Papadopoulos

Vaidy S. Sunderam

M. Magliardi

Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, 1998

1997

Work-based performance measurement and analysis of virtual heterogeneous machines.

Int. J. Syst. Sci., 1997

Beyond PVM 3.4: What We've Learned, What's New, and Why.

Al Geist

James Arthur Kohl

Philip M. Papadopoulos

Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 1997

1994

ASC: An Associative-Computing Paradigm.

Computer, 1994

A Task Graph Centroid.

Jerry L. Potter