Stanimire Tomov

Sci. Program., 2015

HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi.

Sci. Program., 2015

Mixed-Precision Cholesky QR Factorization and Its Case Studies on Multicore CPU with Multiple GPUs.

SIAM J. Sci. Comput., 2015

Acceleration of GPU-based Krylov solvers via data transfer reduction.

Int. J. High Perform. Comput. Appl., 2015

On the Design, Development, and Analysis of Optimized Matrix-Vector Multiplication Routines for Coprocessors.

Proceedings of the High Performance Computing - 30th International Conference, 2015

A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations.

Proceedings of the High Performance Computing - 30th International Conference, 2015

Performance analysis and design of a Hessenberg reduction using stabilized blocked elementary transformations for new architectures.

Proceedings of the Symposium on High Performance Computing, 2015

Accelerating the LOBPCG method on GPUs using a blocked sparse matrix vector product.

Hartwig Anzt

Proceedings of the Symposium on High Performance Computing, 2015

Mixed-precision block Gram Schmidt orthogonalization.

Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2015

Efficient implementation of quantum materials simulations on distributed CPU-GPU systems.

Proceedings of the International Conference for High Performance Computing, 2015

Performance of random sampling for computing low-rank approximations of a dense matrix on GPUs.

Proceedings of the International Conference for High Performance Computing, 2015

Weighted dynamic scheduling with many parallelism grains for offloading of numerical workloads to multiple varied accelerators.

Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2015

Optimization for performance and energy for batched matrix computations on GPUs.

Proceedings of the 8th Workshop on General Purpose Processing using GPUs, 2015

Towards batched linear solvers on accelerated hardware platforms.

Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2015

Energy efficiency and performance frontiers for sparse computations on GPU supercomputers.

Hartwig Anzt

Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores, 2015

Dense Symmetric Indefinite Factorization on GPU Accelerated Architectures.

Proceedings of the Parallel Processing and Applied Mathematics, 2015

Performance Analysis and Optimisation of Two-sided Factorization Algorithms for Heterogeneous Platform.

Proceedings of the International Conference on Computational Science, 2015

MAGMA embedded: Towards a dense linear algebra library for energy efficient extreme computing.

Proceedings of the 2015 IEEE High Performance Extreme Computing Conference, 2015

Flexible Linear Algebra Development and Scheduling with Cholesky Factorization.

Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications, 2015

2014

Model-Driven One-Sided Factorizations on Multicore Accelerated Systems.

Supercomput. Front. Innov., 2014

A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine-grained memory aware tasks.

Int. J. High Perform. Comput. Appl., 2014

Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems.

Concurr. Comput. Pract. Exp., 2014

Mixed-Precision Orthogonalization Scheme and Adaptive Step Size for Improving the Stability and Performance of CA-GMRES on GPUs.

Proceedings of the High Performance Computing for Computational Science - VECPAR 2014 - 11th International Conference, Eugene, OR, USA, June 30, 2014

Heterogenous Acceleration for Linear Algebra in Multi-coprocessor Environments.

Proceedings of the High Performance Computing for Computational Science - VECPAR 2014 - 11th International Conference, Eugene, OR, USA, June 30, 2014

Self-adaptive Multiprecision Preconditioners on Multicore and Manycore Architectures.

Proceedings of the High Performance Computing for Computational Science - VECPAR 2014 - 11th International Conference, Eugene, OR, USA, June 30, 2014

Deflation strategies to improve the convergence of communication-avoiding GMRES.

Proceedings of the 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2014

Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster.

Sivasankaran Rajamanickam

Proceedings of the International Conference for High Performance Computing, 2014

Performance and portability with OpenCL for throughput-oriented HPC workloads across accelerators, coprocessors, and multicore processors.

Proceedings of the 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2014

clMAGMA: high performance dense linear algebra with OpenCL.

Proceedings of the International Workshop on OpenCL, 2014

Improving the Performance of CA-GMRES on Multicores with Multiple GPUs.

Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

Hybrid Multi-elimination ILU Preconditioners on GPUs.

Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2014

Unified Development for Mixed Multi-GPU and Multi-coprocessor Environments Using a Lightweight Runtime Environment.

Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

A Step towards Energy Efficient Computing: Redesigning a Hydrodynamic Application on CPU-GPU.

Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

Dynamically Balanced Synchronization-Avoiding LU Factorization with Multicore and GPUs.

Simplice Donfack

Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2014

Optimizing Krylov Subspace Solvers on Graphics Processing Units.

Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2014

A Fast Batched Cholesky Factorization on a GPU.

Proceedings of the 43rd International Conference on Parallel Processing, 2014

LU Factorization of Small Matrices: Accelerating Batched DGETRF on the GPU.

Proceedings of the 2014 IEEE International Conference on High Performance Computing and Communications, 2014

Access-averse framework for computing low-rank matrix approximations.

Proceedings of the 2014 IEEE International Conference on Big Data (IEEE BigData 2014), 2014

Accelerating Numerical Dense Linear Algebra Calculations with GPUs.

Proceedings of the Numerical Computations with GPUs, 2014

2013

Accelerating Linear System Solutions Using Randomization Techniques.

ACM Trans. Math. Softw., 2013

Leading Edge Hybrid Multi-GPU Algorithms for Generalized Eigenproblems in Electronic Structure Calculations.

Proceedings of the Supercomputing - 28th International Supercomputing Conference, 2013

Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi.

Proceedings of the Parallel Processing and Applied Mathematics, 2013

Tridiagonalization of a Symmetric Dense Matrix on a GPU Cluster.

Proceedings of the 2013 IEEE International Symposium on Parallel & Distributed Processing, 2013

Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communication.

Proceedings of the International Conference on Supercomputing, 2013

2012

Autotuning GEMM Kernels for the Fermi GPU.

Jakub Kurzak

IEEE Trans. Parallel Distributed Syst., 2012

Divide and Conquer on Hybrid GPU-Accelerated Multicore Systems.

Christof Vömel

SIAM J. Sci. Comput., 2012

One-sided Dense Matrix Factorizations on a Multicore with Multiple GPU Accelerators.

Proceedings of the International Conference on Computational Science, 2012

A Class of Communication-avoiding Algorithms for Solving General Dense Linear Systems on CPU/GPU Parallel Machines.

Proceedings of the International Conference on Computational Science, 2012

Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems.

Proceedings of the International Conference on Computational Science, 2012

From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming.

Parallel Comput., 2012

A hybrid Hermitian general eigenvalue solver.

CoRR, 2012

Poster: A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks.

Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Abstract: A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks.

Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Poster: Matrices over Runtime Systems at Exascale.

Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Abstract: Matrices Over Runtime Systems at Exascale.

Proceedings of the 2012 SC Companion: High Performance Computing, 2012

A Block-Asynchronous Relaxation Method for Graphics Processing Units.

Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012

Enabling and scaling matrix computations on heterogeneous multi-core and multi-GPU systems.

Fengguang Song

Proceedings of the International Conference on Supercomputing, 2012

Scalable Dense Linear Algebra on Heterogeneous Hardware.

Proceedings of the Transition of HPC Towards Exascale Computing, 2012

Weighted Block-Asynchronous Iteration on GPU-Accelerated Systems.

Proceedings of the Euro-Par 2012: Parallel Processing Workshops, 2012

Dense Linear Algebra on Accelerated Multicore Hardware.

Proceedings of the High-Performance Scientific Computing - Algorithms and Applications., 2012

2011

Fully Empirical Autotuned QR Factorization For Multicore Architectures.

CoRR, 2011

Optimizing symmetric dense matrix-vector multiplication on GPUs.

Proceedings of the Conference on High Performance Computing Networking, 2011

Soft error resilient QR factorization for hybrid system with GPGPU.

Proceedings of the second workshop on Scalable algorithms for large-scale systems, 2011

QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators.

Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs.

Proceedings of the International Conference on Parallel Processing, 2011

Introduction.

Proceedings of the Euro-Par 2011 Parallel Processing - 17th International Conference, 2011

A Fully Empirical Autotuned Dense QR Factorization for Multicore Architectures.

Proceedings of the Euro-Par 2011 Parallel Processing - 17th International Conference, 2011

Performance Portability of a GPU Enabled Factorization with the DAGuE Framework.

Narapat Ohm Saengpatsa

Proceedings of the 2011 IEEE International Conference on Cluster Computing (CLUSTER), 2011

LU factorization for accelerator-based systems.

Proceedings of the 9th IEEE/ACS International Conference on Computer Systems and Applications, 2011

2010

Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing.

Parallel Comput., 2010

Towards dense linear algebra for hybrid GPU accelerated manycore systems.

Marc Baboulin

Parallel Comput., 2010

An Improved Magma Gemm For Fermi Graphics Processing Units.

Int. J. High Perform. Comput. Appl., 2010

Accelerating GPU Kernels for Dense Linear Algebra.

Proceedings of the High Performance Computing for Computational Science - VECPAR 2010, 2010

A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators.

Proceedings of the High Performance Computing for Computational Science - VECPAR 2010, 2010

Dense linear algebra solvers for multicore with GPU accelerators.

Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Mixed-Tool Performance Analysis on Hybrid Multicore Architectures.

Proceedings of the 39th International Conference on Parallel Processing, 2010

Dense Linear Algebra for Hybrid GPU-Based Systems.

Proceedings of the Scientific Computing with Multicore and Accelerators., 2010

BLAS for GPUs.

Proceedings of the Scientific Computing with Multicore and Accelerators., 2010

2009

Accelerating scientific computations with mixed precision algorithms.

Comput. Phys. Commun., 2009

Bulk based preconditioning for quantum dot computations.

Christof Vömel

Osni Marques

Proceedings of the 2009 ACM Symposium on Applied Computing (SAC), 2009

A Note on Auto-tuning GEMM for GPUs.

Yinan Li

Proceedings of the Computational Science, 2009

2008

Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy.

ACM Trans. Math. Softw., 2008

State-of-the-art eigensolvers for electronic structure calculations of large scale nano-systems.

J. Comput. Phys., 2008

2007

Prospectus for a Dense Linear Algebra Software Library.

Proceedings of the Handbook of Parallel Computing - Models, Algorithms and Applications., 2007

The use of bulk states to accelerate the band edge state calculation of a semiconductor quantum dot.

J. Comput. Phys., 2007

2006

Conjugate-gradient eigenvalue solvers in computing electronic properties of nanostructure architectures.

Int. J. Comput. Sci. Eng., 2006

Prospectus for the Next LAPACK and ScaLAPACK Libraries.

Proceedings of the Applied Parallel Computing. State of the Art in Scientific Computing, 2006

The Impact of Multicore on Math Software.

Proceedings of the Applied Parallel Computing. State of the Art in Scientific Computing, 2006

Exploiting Mixed Precision Floating Point Hardware in Scientific Computations.

Proceedings of the High Performance Computing and Grids in Action, 2006

2005

Explicit and Averaging A Posteriori Error Estimates for Adaptive Finite Volume Methods.

Carsten Carstensen

Raytcho D. Lazarov

SIAM J. Numer. Anal., 2005

Benchmarking and implementation of probability-based simulations on programmable graphics cards.

Comput. Graph., 2005

Comparison of Nonlinear Conjugate-Gradient Methods for Computing the Electronic Properties of Nanostructure Architectures.

Proceedings of the Computational Science, 2005

2004

Interactive visualization of higher dimensional data in a multiview environment.
