Chokchai Leangsuksun

Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

Framework for Enabling System Understanding.

Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

Two-level checkpoint/restart modeling for GPGPU.

Supada Laosooksathit

Narasimha Raju Gottumukkala

Proceedings of the 9th IEEE/ACS International Conference on Computer Systems and Applications, 2011

2010

Reliability of a System of k Nodes for High Performance Computing Applications.

IEEE Trans. Reliab., 2010

Incremental Checkpoint Schemes for Weibull Failure Distribution.

Mihaela Paun

Int. J. Found. Comput. Sci., 2010

Proficiency Metrics for Failure Prediction in High Performance Computing.

Narate Taerat

Clayton Chandler

Proceedings of the IEEE International Symposium on Parallel and Distributed Processing with Applications, 2010

Benefits of Software Rejuvenation on HPC Systems.

Proceedings of the IEEE International Symposium on Parallel and Distributed Processing with Applications, 2010

2009

High Performance Computing Systems with Various Checkpointing Schemes.

Int. J. Comput. Commun. Control, 2009

A tunable holistic resiliency approach for high-performance computing systems.

Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009

HPC failure prediction proficiency metrics.

Narate Taerat

Proceedings of the 2009 IEEE International Conference on Cluster Computing, August 31, 2009

VCCP: A transparent, coordinated checkpointing system for virtualization-based cluster computing.

Proceedings of the 2009 IEEE International Conference on Cluster Computing, August 31, 2009

Blue Gene/L Log Analysis and Time to Interrupt Estimation.

Narate Taerat

Proceedings of the Fourth International Conference on Availability, 2009

2008

An optimal checkpoint/restart model for a large scale high performance computing system.

Yudan Liu

Mihaela Paun

Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments.

Kulathep Charoenpornwattana

Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008), 2008

Symmetric Active/Active High Availability for High-Performance Computing System Services: Accomplishments and Limitations.

Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008), 2008

A Framework for Proactive Fault Tolerance.

Geoffroy Vallée

Proceedings of the Third International Conference on Availability, 2008

Symmetric Active/Active Replication for Dependent Services.

Proceedings of the Third International Conference on Availability, 2008

2007

Evaluation of fault-tolerant policies using simulation.

Proceedings of the 2007 IEEE International Conference on Cluster Computing, 2007

A reliability-aware approach for an optimal checkpoint/restart model in HPC environments.

Yudan Liu

Mihaela Paun

Narasimha Raju Gottumukkala

Proceedings of the 2007 IEEE International Conference on Cluster Computing, 2007

Reliability-aware resource allocation in HPC systems.

Proceedings of the 2007 IEEE International Conference on Cluster Computing, 2007

Transparent Symmetric Active/Active Replication for Service-Level High Availability.

Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2007), 2007

On Programming Models for Service-Level High Availability.

Proceedings of the Second International Conference on Availability, 2007

2006

MOLAR: adaptive runtime support for high-end computing operating and runtime systems.

Christian Engelmann

Narasimha Raju Gottumukkala

David E. Bernholdt

ACM SIGOPS Oper. Syst. Rev., 2006

Symmetric Active/Active High Availability for High-Performance Computing System Services.

J. Comput., 2006

Availability Modeling and Evaluation on High Performance Cluster Computing Systems.

J. Res. Pract. Inf. Technol., 2006

Policy-Based Access Control Framework for Grid Computing.

Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2006), 2006

IPMI-based Efficient Notification Framework for Large Scale Cluster Computing.

Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2006), 2006

Work in Progress: RASS Framework for a Cluster-Aware SELinux.

Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2006), 2006

A Novel Computational Framework for Fast Distributed Computing and Knowledge Integration for Microarray Gene Expression Data Analysis.

Prerna Sethi

Proceedings of the 20th International Conference on Advanced Information Networking and Applications (AINA 2006), 2006

Availability Modeling and Analysis on High Performance Cluster Computing Systems.

Proceedings of the First International Conference on Availability, 2006

Active/Active Replication for Highly Available HPC System Services.

Proceedings of the First International Conference on Availability, 2006

2005

Achieving high availability and performance computing with an HA-OSCAR cluster.

Future Gener. Comput. Syst., 2005

Performance of an Operating High Energy Physics Data Grid: DØSAR-Grid.

CoRR, 2005

UML-based Beowulf Cluster Availability Modeling.

Proceedings of the International Conference on Software Engineering Research and Practice, 2005

OOMSE-An Object Oriented Markov Chain Specification and Evaluation Framework.

Proceedings of the 17th International Conference on Software Engineering and Knowledge Engineering (SEKE'2005), 2005

Grid-Aware HA-OSCAR.

Proceedings of the 19th Annual International Symposium on High Performance Computing Systems and Applications (HPCS 2005), 2005

Reliability-aware resource management for computational grid/cluster environments.

Proceedings of the 6th IEEE/ACM International Conference on Grid Computing (GRID 2005), 2005

Reliability-aware Checkpoint/Restart Scheme: A Performability Trade-off.

Proceedings of the 2005 IEEE International Conference on Cluster Computing (CLUSTER 2005), September 26, 2005

Job-Site Level Fault Tolerance for Cluster and Grid environments.

Proceedings of the 2005 IEEE International Conference on Cluster Computing (CLUSTER 2005), September 26, 2005

Feasibility study and early experimental results towards cluster survivability.

Proceedings of the 5th International Symposium on Cluster Computing and the Grid (CCGrid 2005), 2005

A light-weight solution for large sparse Markov processes.

Proceedings of the 43rd Annual Southeast Regional Conference, 2005

A framework for cluster availability specification and evaluation.

Proceedings of the 43rd Annual Southeast Regional Conference, 2005

2004

Highly Reliable Linux HPC Clusters: Self-Awareness Approach.

Proceedings of the Parallel and Distributed Processing and Applications, 2004

Building highly available HPC clusters with HA-OSCAR.

Ibrahim Haddad

Proceedings of the 2004 IEEE International Conference on Cluster Computing (CLUSTER 2004), 2004

2003

Reliability Modeling Using UML.

Lixin Shen

Proceedings of the International Conference on Software Engineering Research and Practice, 2003

Dependability Prediction of High Availability OSCAR Cluster Server.

Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, 2003

Availability Prediction and Modeling of High Availability OSCAR Cluster.

Proceedings of the 2003 IEEE International Conference on Cluster Computing (CLUSTER 2003), 2003

2000

The Enhanced Service Manager: A service management system for next-generation networks.

Anthony Y. Lui

Shanika A. Karunasekera

Bell Labs Tech. J., 2000

1994

ASC: An Associative-Computing Paradigm.

Computer, 1994

A Task Graph Centroid.

Jerry L. Potter