Zizhong Chen

Franck Cappello

Proceedings of the 37th IEEE International Conference on Data Engineering, 2021

Daps: A Dynamic Asynchronous Progress Stealing Model for MPI Communication.

Proceedings of the IEEE International Conference on Cluster Computing, 2021

Exploring Autoencoder-based Error-bounded Compression for Scientific Data.

Proceedings of the IEEE International Conference on Cluster Computing, 2021

Improving Lossy Compression for SZ by Exploring the Best-Fit Lossless Compression Techniques.

Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), 2021

2020

Weighted pseudometric approximation of 2-dimensional fuzzy numbers by fuzzy 2-cell prismoid numbers preserving the centroid.

Shexiang Hai

Zengtai Gong

Fuzzy Sets Syst., 2020

SDC Resilient Error-bounded Lossy Compressor.

CoRR, 2020

Ball k-means.

CoRR, 2020

Algorithm-Based Fault Tolerance for Convolutional Neural Networks.

CoRR, 2020

Normalization of Input-output Shared Embeddings in Text Generation Models.

Jinyang Liu

Yujia Zhai

CoRR, 2020

CAB-MPI: exploring interprocess work-stealing towards balanced MPI communication.

Proceedings of the International Conference for High Performance Computing, 2020

SAOU: safe adaptive overclocking and undervolting for energy-efficient GPU computing.

Proceedings of the ISLPED '20: ACM/IEEE International Symposium on Low Power Electronics and Design, 2020

Significantly Improving Lossy Compression for HPC Datasets with Second-Order Prediction and Parameter Optimization.

Proceedings of the HPDC '20: The 29th International Symposium on High-Performance Parallel and Distributed Computing, 2020

Towards End-to-end SDC Detection for HPC Applications Equipped with Lossy Compression.

Proceedings of the IEEE International Conference on Cluster Computing, 2020

SDRBench: Scientific Data Reduction Benchmark for Lossy Compressors.

Proceedings of the 2020 IEEE International Conference on Big Data (IEEE BigData 2020), 2020

Toward Feature-Preserving 2D and 3D Vector Field Compression.

Proceedings of the 2020 IEEE Pacific Visualization Symposium, 2020

2019

Optimizing Lossy Compression Rate-Distortion from Automatic Online Selection between SZ and ZFP.

IEEE Trans. Parallel Distributed Syst., 2019

Complete Random Forest Based Class Noise Filtering Learning for Improving the Generalizability of Classifiers.

IEEE Trans. Knowl. Data Eng., 2019

Z-checker: A framework for assessing lossy compression of scientific data.

Int. J. High Perform. Comput. Appl., 2019

Transferring Ensemble Representations Using Deep Convolutional Neural Networks for Small-Scale Image Classification.

IEEE Access, 2019

Significantly improving lossy compression quality based on an optimized hybrid prediction model.

Proceedings of the International Conference for High Performance Computing, 2019

FT-iSort: efficient fault tolerance for introsort.

Proceedings of the International Conference for High Performance Computing, 2019

GreenMM: energy efficient GPU matrix multiplication through undervolting.

Proceedings of the ACM International Conference on Supercomputing, 2019

TSM2: optimizing tall-and-skinny matrix-matrix multiplication on GPUs.

Proceedings of the ACM International Conference on Supercomputing, 2019

A Multi-granularity Genetic Algorithm.

Proceedings of the 2019 IEEE International Conference on Big Knowledge, 2019

Improving Performance of Data Dumping with Lossy Compression for Scientific Simulation.

Proceedings of the 2019 IEEE International Conference on Cluster Computing, 2019

Data Transfer between Scientific Facilities - Bottleneck Analysis, Insights and Optimizations.

Proceedings of the 19th IEEE/ACM International Symposium on Cluster, 2019

Efficient concolic testing of MPI applications.

Hongbo Li

Rajiv Gupta

Proceedings of the 28th International Conference on Compiler Construction, 2019

2018

Scaling Up Parallel Computation of Tiled QR Factorizations by a Distributed Scheduling Runtime System and Analytical Modeling.

Parallel Process. Lett., 2018

Fault tolerant one-sided matrix decompositions on heterogeneous systems with GPUs.

Proceedings of the International Conference for High Performance Computing, 2018

COMPI: Concolic Testing for MPI Applications.

Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018

Non-intrusively Avoiding Scaling Problems in and out of MPI Collectives.

Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018

BeeFlow: A Workflow Management System for In Situ Processing across HPC and Cloud Systems.

Proceedings of the 38th IEEE International Conference on Distributed Computing Systems, 2018

The k-Means Forest Classifier for High Dimensional Data.

Proceedings of the 2018 IEEE International Conference on Big Knowledge, 2018

Improving performance of iterative methods by lossy checkpointing.

Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, 2018

Performance analysis and optimization of in-situ integration of simulation with data analysis: zipping applications up.

Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, 2018

Fixed-PSNR Lossy Compression for Scientific Data.

Proceedings of the IEEE International Conference on Cluster Computing, 2018

An Efficient Transformation Scheme for Lossy Data Compression with Point-Wise Relative Error Bound.

Proceedings of the IEEE International Conference on Cluster Computing, 2018

Error-Controlled Lossy Compression Optimized for High Compression Ratios of Scientific Datasets.

Proceedings of the IEEE International Conference on Big Data (IEEE BigData 2018), 2018

Optimizing Lossy Compression with Adjacent Snapshots for N-body Simulation Data.

Proceedings of the IEEE International Conference on Big Data (IEEE BigData 2018), 2018

Build and Execution Environment (BEE): an Encapsulated Environment Enabling HPC Applications Running Everywhere.

Proceedings of the IEEE International Conference on Big Data (IEEE BigData 2018), 2018

2017

Docker-Enabled Build and Execution Environment (BEE): an Encapsulated Environment Enabling HPC Applications Running Everywhere.

CoRR, 2017

Exploration of Pattern-Matching Techniques for Lossy Compression on Cosmology Simulation Data Sets.

Proceedings of the High Performance Computing, 2017

Correcting soft errors online in fast fourier transform.

Proceedings of the International Conference for High Performance Computing, 2017

Parastack: efficient hang detection for MPI programs at large scale.

Hongbo Li

Rajiv Gupta

Proceedings of the International Conference for High Performance Computing, 2017

Silent Data Corruption Resilient Two-sided Matrix Factorizations.

Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017

Significantly Improving Lossy Compression for Scientific Data Sets Based on Multidimensional Prediction and Error-Controlled Quantization.

Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

HIPS Keynote.

Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

In-depth exploration of single-snapshot lossy compression techniques for N-body simulations.

Proceedings of the 2017 IEEE International Conference on Big Data (IEEE BigData 2017), 2017

2016

Scalable Energy Efficiency with Resilience for High Performance Computing Systems: A Quantitative Methodology.

Li Tan

Shuaiwen Leon Song

ACM Trans. Archit. Code Optim., 2016

GreenLA: green linear algebra software for GPU-accelerated heterogeneous computing.

Proceedings of the International Conference for High Performance Computing, 2016

GPU-ABFT: Optimizing Algorithm-Based Fault Tolerance for Heterogeneous Systems with GPUs.

Jieyang Chen

Sihuan Li

Proceedings of the IEEE International Conference on Networking, 2016

Online Algorithm-Based Fault Tolerance for Cholesky Decomposition on Heterogeneous Systems with GPUs.

Jieyang Chen

Xin Liang

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

suCAQR: A Simplified Communication-Avoiding QR Factorization Solver Using the TBLAS Framework.

Proceedings of the 22nd IEEE International Conference on Parallel and Distributed Systems, 2016

Algorithm-Directed Data Placement in Explicitly Managed Non-Volatile Memory.

Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, 2016

Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra.

Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, 2016

New-Sum: A Novel Online ABFT Scheme For General Iterative Methods.

Dingwen Tao

Shuaiwen Leon Song

Sriram Krishnamoorthy

Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, 2016

2015

Fail-Stop Failure Algorithm-Based Fault Tolerance for Cholesky Decomposition.

Doug Hakkarinen

Panruo Wu

IEEE Trans. Parallel Distributed Syst., 2015

Optimising MPI tree-based communication for NUMA architectures.

Int. J. Auton. Adapt. Commun. Syst., 2015

Slow Down or Halt: Saving the Optimal Energy for Scalable HPC Systems.

Li Tan

Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering, Austin, TX, USA, January 31, 2015

Investigating the Interplay between Energy Efficiency and Resilience in High Performance Computing.

Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

Simulated Annealing to Generate Numerically Stable Real Number Error Correction Codes.

Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications, 2015

Cholesky Factorization on Heterogeneous CPU and GPU Systems.

Jieyang Chen

Proceedings of the Ninth International Conference on Frontier of Computer Science and Technology, 2015

2014

A survey of power and energy efficient techniques for high performance numerical linear algebra operations.

Parallel Comput., 2014

TX: algorithmic energy saving for distributed dense matrix factorizations.

Li Tan

Proceedings of the 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2014

TOUGH2-PETSc: A Parallel Solver for TOUGH2.

Daniel Hathorn

Yushu Wu

Proceedings of the 15th International Conference on Parallel and Distributed Computing, 2014

Extending checksum-based ABFT to tolerate soft errors online in iterative methods.

Proceedings of the 20th IEEE International Conference on Parallel and Distributed Systems, 2014

HP-DAEMON: High Performance Distributed Adaptive Energy-efficient Matrix-multiplicatiON.

Proceedings of the International Conference on Computational Science, 2014

FT-ScaLAPACK: correcting soft errors on-line for ScaLAPACK cholesky, QR, and LU factorization routines.

Panruo Wu

Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, 2014

2013

Multilevel Diskless Checkpointing.

Douglas Hakkarinen

IEEE Trans. Computers, 2013

On-line soft error correction in matrix-matrix multiplication.

J. Comput. Sci., 2013

Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach.

Proceedings of the International Conference for High Performance Computing, 2013

Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods.

Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2013

A2E: Adaptively aggressive energy efficient DVFS scheduling for data intensive applications.

Proceedings of the IEEE 32nd International Performance Computing and Communications Conference, 2013

Correcting soft errors online in LU factorization.

Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing, 2013

Energy-Efficient Scheduling for Multicore Systems with Bounded Resources.

Proceedings of the 2013 IEEE International Conference on Green Computing and Communications (GreenCom) and IEEE Internet of Things (iThings) and IEEE Cyber, 2013

Power and energy characteristics of MapReduce data movements.

Proceedings of the International Green Computing Conference, 2013

Improving performance and energy efficiency of matrix multiplication via pipeline broadcast.

Proceedings of the 2013 IEEE International Conference on Cluster Computing, 2013

2012

Reduced Data Communication for Parallel CMA-ES for REACTS.

Proceedings of the 20th Euromicro International Conference on Parallel, 2012

eTune: A Power Analysis Framework for Data-Intensive Computing.

Proceedings of the 41st International Conference on Parallel Processing Workshops, 2012

Energy Efficient Parallel Matrix-Matrix Multiplication for DVFS-enabled Clusters.

Proceedings of the 41st International Conference on Parallel Processing Workshops, 2012

Runtime Optimization of Broadcast Communications Using Dynamic Network Topology Information from MPI.

Jeffrey Godwin

Proceedings of the 14th IEEE International Conference on High Performance Computing and Communication & 9th IEEE International Conference on Embedded Software and Systems, 2012

Energy consumption analysis of parallel sorting algorithms running on multicore systems.

Proceedings of the 2012 International Green Computing Conference, 2012

Optimizing Process-to-Core Mappings for Application Level Multi-dimensional MPI Communications.

Proceedings of the 2012 IEEE International Conference on Cluster Computing, 2012

2011

Fault tolerant matrix-matrix multiplication: correcting soft errors on-line.

Proceedings of the second workshop on Scalable algorithms for large-scale systems, 2011

Algorithm-based recovery for HPL.

Proceedings of the 16th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2011

Matrix Multiplication on GPUs with On-Line Fault Tolerance.

Proceedings of the IEEE International Symposium on Parallel and Distributed Processing with Applications, 2011

Algorithm-Based Recovery for Newton's Method without Checkpointing.

Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

High performance linpack benchmark: a fault tolerant implementation without checkpointing.

Proceedings of the 25th International Conference on Supercomputing, 2011, Tucson, AZ, USA, May 31, 2011

SRC: soft error detection and recovery for high performance linpack.

Proceedings of the 25th International Conference on Supercomputing, 2011, Tucson, AZ, USA, May 31, 2011

Optimizing Process-to-Core Mappings for Two Dimensional Broadcast/Reduce on Multicore Architectures.

Proceedings of the International Conference on Parallel Processing, 2011

Algorithm-based recovery for iterative methods without checkpointing.

Proceedings of the 20th ACM International Symposium on High Performance Distributed Computing, 2011

2010

Adaptive Checkpointing (Invited Paper).

J. Commun., 2010

Highly scalable checkpointing for exascale computing.

Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Algorithmic Cholesky factorization fault recovery.

Douglas Hakkarinen

Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Fault tolerant linear algebra: Recovering from fail-stop failures without checkpointing.

Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Constructing numerically stable real number codes using evolutionary computation.

Aaron Garrett

Daniel Eric Smith

Proceedings of the Genetic and Evolutionary Computation Conference, 2010

2009

Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing.

IEEE Trans. Computers, 2009

Pipelining parallel image compositing and delivery for efficient remote visualization.

J. Parallel Distributed Comput., 2009

Optimal real number codes for fault tolerant matrix operations.

Proceedings of the ACM/IEEE Conference on High Performance Computing, 2009

N-Level Diskless Checkpointing.

Douglas Hakkarinen

Proceedings of the 11th IEEE International Conference on High Performance Computing and Communications, 2009

2008

Algorithm-Based Fault Tolerance for Fail-Stop Failures.

IEEE Trans. Parallel Distributed Syst., 2008

Performance of MPI broadcast algorithms.

Daniel M. Wadsworth

Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

Extending algorithm-based fault tolerance to tolerate fail-stop failures in high performance distributed environments.

Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

A Scalable Checkpoint Encoding Algorithm for Diskless Checkpointing.

Proceedings of the 11th IEEE High Assurance Systems Engineering Symposium, 2008

2007

Recovery Patterns for Iterative Methods in a Parallel Unstable Environment.

SIAM J. Sci. Comput., 2007

An efficient packet loss recovery methodology for video-over-IP.

Ming Yang

Nikolaos G. Bourbakis

Guillermo A. Francia III

Proceedings of the Signal and Image Processing (SIP 2007), 2007

Self Adaptive Application Level Fault Tolerance for Parallel and Distributed Computing.

Ming Yang

Guillermo A. Francia III

Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

An Efficient Audio-Video Synchronization Methodology.

Ming Yang

Nikolaos G. Bourbakis

Monica Trifas

Proceedings of the 2007 IEEE International Conference on Multimedia and Expo, 2007

2006

Self-adapting numerical software (SANS) effort.

Jelena Pjesivac-Grbovic

Keith Seymour

Haihang You

Sathish S. Vadhiyar

IBM J. Res. Dev., 2006

Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources.

Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

2005

Condition Numbers of Gaussian Random Matrices.

SIAM J. Matrix Anal. Appl., 2005

Process Fault Tolerance: Semantics, Design and Applications for High Performance Computing.

Jelena Pjesivac-Grbovic

Int. J. High Perform. Comput. Appl., 2005

Fault tolerant high performance computing by a coding approach.

Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2005

Numerically Stable Real Number Codes Based on Random Matrices.
