Ahmad Abdelfattah

Massimiliano Fasi

CoRR, January, 2026

2025

Analysis of Floating-Point Matrix Multiplication Computed via Integer Arithmetic.

CoRR, June, 2025

Evolution of the SLATE linear algebra library.

Mark Gates

Kadir Akbudak

Int. J. High Perform. Comput. Appl., 2025

Accelerating Homotopy Continuation with GPUs: Application to Trifocal Pose Estimation.

Chiang-Heng Chien

Benjamin B. Kimia

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2025

2024

Batched sparse and mixed-precision linear algebra interface for efficient use of GPU hardware accelerators in scientific applications.

Future Gener. Comput. Syst., 2024

Interface for Sparse Linear Algebra Operations.

CoRR, 2024

Recovering SLAM Tracking Lost by Trifocal Pose Estimation using GPU-HC++.

Chiang-Heng Chien

Benjamin B. Kimia

Proceedings of the 35th British Machine Vision Conference, 2024

2023

libCEED: Efficient Extensible Discretization.

Dataset, November, 2023

GPU-based LU Factorization and Solve on Batches of Matrices with Band Structure.

Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023

PAQR: Pivoting Avoiding QR factorization.

Wissam M. Sid-Lakhdar

David B. Williams-Young

Timothy A. Davis

Hartwig Anzt

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

2022

libCEED: Efficient Extensible Discretization.

Dataset, December, 2022

Reproducibility Artifact for Running SLATE's GEMM and POTRF Operations on Summit and Crusher.

Dataset, August, 2022

Addressing Irregular Patterns of Matrix Computations on GPUs and Their Impact on Applications Powered by Sparse Direct Solvers.

Proceedings of the SC22: International Conference for High Performance Computing, 2022

Portable and Efficient Dense Linear Algebra in the Beginning of the Exascale Era.

Proceedings of the IEEE/ACM International Workshop on Performance, 2022

Batch QR Factorization on GPUs: Design, Optimization, and Tuning.

Stan Tomov

Proceedings of the Computational Science - ICCS 2022, 2022

GPU-Based Homotopy Continuation for Minimal Problems in Computer Vision.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021

libCEED: Efficient Extensible Discretization.

Dataset, July, 2021

libCEED: Efficient Extensible Discretization.

Dataset, July, 2021

CEED/libCEED: v0.9.0.

Dataset, July, 2021

A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines.

ACM Trans. Math. Softw., 2021

GPU algorithms for Efficient Exascale Discretizations.

Parallel Comput., 2021

libCEED: Fast algebra for high-order element-based discretizations.

J. Open Source Softw., 2021

Efficient exascale discretizations: High-order finite element methods.

Int. J. High Perform. Comput. Appl., 2021

A survey of numerical linear algebra methods utilizing mixed-precision arithmetic.

Int. J. High Perform. Comput. Appl., 2021

2020

Matrix multiplication on batches of small matrices in half and half-complex precisions.

J. Parallel Distributed Comput., 2020

MAGMA templates for scalable linear algebra on emerging architectures.

Int. J. High Perform. Comput. Appl., 2020

A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic.

CoRR, 2020

High-Order Finite Element Method using Standard and Device-Level Batch GEMM on GPUs.

Proceedings of the 11th IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2020

Evaluating the Performance of NVIDIA's A100 Ampere GPU for Sparse and Batched Computations.

Proceedings of the 2020 IEEE/ACM Performance Modeling, 2020

Investigating the Benefit of FP16-Enabled Mixed-Precision Solvers for Symmetric Positive Definite Matrices Using GPUs.

Stan Tomov

Proceedings of the Computational Science - ICCS 2020, 2020

Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs.

Proceedings of the 2020 IEEE High Performance Extreme Computing Conference, 2020

2019

Algorithms and optimization techniques for high-performance matrix-matrix multiplications of very small matrices.

Parallel Comput., 2019

Towards Half-Precision Computation for Complex Matrices: A Case Study for Mixed Precision Solvers on GPUs.

Proceedings of the 10th IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2019

Fast Batched Matrix Multiplication for Small Sizes Using Half-Precision Arithmetic on GPUs.

Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019

Massively Parallel Automated Software Tuning.

Proceedings of the 48th International Conference on Parallel Processing, 2019

Progressive Optimization of Batched LU Factorization on GPUs.

Proceedings of the 2019 IEEE High Performance Extreme Computing Conference, 2019

2018

A Guide for Achieving High Performance with Very Small Matrices on GPU: A Case Study of Batched LU and Cholesky Factorizations.

IEEE Trans. Parallel Distributed Syst., 2018

Analysis and Design Techniques towards High-Performance and Energy-Efficient Dense Linear Solvers on GPUs.

IEEE Trans. Parallel Distributed Syst., 2018

Batched one-sided factorizations of tiny matrices using GPUs: Challenges and countermeasures.

J. Comput. Sci., 2018

Performance of Hierarchical-matrix BiCGStab Solver on GPU Clusters.

Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018

The Design of Fast and Energy-Efficient Linear Solvers: On the Potential of Half-Precision Arithmetic and Iterative Refinement Techniques.

Proceedings of the Computational Science - ICCS 2018, 2018

Optimizing GPU Kernels for Irregular Batch Workloads: A Case Study for Cholesky Factorization.

Proceedings of the 2018 IEEE High Performance Extreme Computing Conference, 2018

2017

Fast Cholesky factorization on GPUs for batch and native modes in MAGMA.

J. Comput. Sci., 2017

With Extreme Computing, the Rules Have Changed.

Comput. Sci. Eng., 2017

High-performance Cholesky factorization for GPU-only execution.

Proceedings of the General Purpose GPUs, 2017

Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs.

Proceedings of the International Conference on Supercomputing, 2017

Factorization and Inversion of a Million Matrices using GPUs: Challenges and Countermeasures.

Proceedings of the International Conference on Computational Science, 2017

2016

KBLAS: An Optimized Library for Dense Matrix-Vector Multiplication on GPU Accelerators.

David E. Keyes

Hatem Ltaief

ACM Trans. Math. Softw., 2016

Performance optimization of Sparse Matrix-Vector Multiplication for multi-component PDE-based applications using GPUs.

Concurr. Comput. Pract. Exp., 2016

Linear algebra software for large-scale accelerated multicore computing.

Acta Numer., 2016

Performance, Design, and Autotuning of Batched GEMM for GPUs.

Proceedings of the High Performance Computing - 31st International Conference, 2016

On the Development of Variable Size Batched Computation for Heterogeneous Parallel Architectures.

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs.

Proceedings of the International Conference on Computational Science 2016, 2016

High-Performance Tensor Contractions for GPUs.

Proceedings of the International Conference on Computational Science 2016, 2016

High-Performance Matrix-Matrix Multiplications of Very Small Matrices.

Proceedings of the Euro-Par 2016: Parallel Processing, 2016

2015

Accelerating Scientific Applications using High Performance Dense and Sparse Linear Algebra Kernels on GPUs.

PhD thesis, 2015

Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems.

Supercomput. Front. Innov., 2015

High Performance Multi-GPU SpMV for Multi-component PDE-Based Applications.

Hatem Ltaief

David E. Keyes

Proceedings of the Euro-Par 2015: Parallel Processing, 2015

2014

Pipelining Computational Stages of the Tomographic Reconstructor for Multi-Object Adaptive Optics on a Multi-GPU System.

Proceedings of the International Conference for High Performance Computing, 2014

High Performance Pseudo-analytical Simulation of Multi-Object Adaptive Optics over Multi-GPU Systems.

Proceedings of the Euro-Par 2014 Parallel Processing, 2014

2012

Optimizing Memory-Bound SYMV Kernel on GPU Hardware Accelerators.

Proceedings of the High Performance Computing for Computational Science, 2012

Systematic Approach in Optimizing Numerical Memory-Bound Kernels on GPU.
