Robert A. van de Geijn

Affiliations:
  • University of Texas at Austin, USA


According to our database, Robert A. van de Geijn authored at least 137 papers between 1990 and 2021.

Bibliography

2021
Supporting Mixed-domain Mixed-precision Matrix Multiplication within the BLIS Framework.
ACM Trans. Math. Softw., 2021

2020
Strassen's Algorithm Reloaded on GPUs.
ACM Trans. Math. Softw., 2020

2019
The MOMMS Family of Matrix Multiplication Algorithms.
CoRR, 2019

Supporting mixed-datatype matrix multiplication within the BLIS framework.
CoRR, 2019

A Case for Malleable Thread-Level Linear Algebra Libraries: The LU Factorization With Partial Pivoting.
IEEE Access, 2019

2018
Strassen's Algorithm for Tensor Contraction.
SIAM J. Sci. Comput., 2018

Implementing Strassen's Algorithm with CUTLASS on NVIDIA Volta GPUs.
CoRR, 2018

A Simple Methodology for Computing Families of Algorithms.
CoRR, 2018

Learning from Optimizing Matrix-Matrix Multiplication.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018

2017
Householder QR Factorization With Randomization for Column Pivoting (HQRRP).
SIAM J. Sci. Comput., 2017

Deriving Correct High-Performance Algorithms.
CoRR, 2017

Pushing the Bounds for Matrix-Matrix Multiplication.
CoRR, 2017

Generating Families of Practical Fast Matrix Multiplication Algorithms.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

2016
The BLIS Framework: Experiments in Portability.
ACM Trans. Math. Softw., 2016

Parallel Matrix Multiplication: A Systematic Journey.
SIAM J. Sci. Comput., 2016

Automating the Last-Mile for High Performance Dense Linear Algebra.
CoRR, 2016

Implementing Strassen's Algorithm with BLIS.
CoRR, 2016

BLISlab: A Sandbox for Optimizing GEMM.
CoRR, 2016

Strassen's algorithm reloaded.
Proceedings of the International Conference for High Performance Computing, 2016

2015
BLIS: A Framework for Rapidly Instantiating BLAS Functionality.
ACM Trans. Math. Softw., 2015

Householder QR Factorization: Adding Randomization for Column Pivoting. FLAME Working Note #78.
CoRR, 2015

2014
Restructuring the Tridiagonal and Bidiagonal QR Algorithms for Performance.
ACM Trans. Math. Softw., 2014

Algorithm, Architecture, and Floating-Point Unit Codesign of a Matrix Factorization Accelerator.
IEEE Trans. Computers, 2014

Exploiting Symmetry in Tensors for High Performance: Multiplication with Symmetric Tensors.
SIAM J. Sci. Comput., 2014

Understanding performance stairs: elucidating heuristics.
Proceedings of the ACM/IEEE International Conference on Automated Software Engineering, 2014

Anatomy of High-Performance Many-Threaded Matrix Multiplication.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

2013
Elemental: A New Framework for Distributed Memory Dense Matrix Computations.
ACM Trans. Math. Softw., 2013

A case study in mechanically deriving dense linear algebra code.
Int. J. High Perform. Comput. Appl., 2013

Deriving dense linear algebra libraries.
Formal Aspects Comput., 2013

Exploiting Symmetry in Tensors for High Performance.
CoRR, 2013

Scheduling algorithms-by-blocks on small clusters.
Concurr. Comput. Pract. Exp., 2013

Interfaces are key.
Proceedings of the 1st International Workshop on Software Engineering for High Performance Computing in Computational Science and Engineering, 2013

DSLs, DLA, DxT, and MDE in CSE.
Proceedings of the 5th International Workshop on Software Engineering for Computational Science and Engineering, 2013

Code Generation and Optimization of Distributed-Memory Dense Linear Algebra Kernels.
Proceedings of the International Conference on Computational Science, 2013

Floating Point Architecture Extensions for Optimized Matrix Factorization.
Proceedings of the 21st IEEE Symposium on Computer Arithmetic, 2013

2012
Families of Algorithms for Reducing a Matrix to Condensed Form.
ACM Trans. Math. Softw., 2012

A Runtime System for Programming Out-of-Core Matrix Algorithms-by-Tiles on Multithreaded Architectures.
ACM Trans. Math. Softw., 2012

Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures.
IEEE Trans. Computers, 2012

The FLAME approach: From dense linear algebra algorithms to high-performance multi-accelerator implementations.
J. Parallel Distributed Comput., 2012

Programming many-core architectures - a case study: dense matrix computations on the Intel single-chip cloud computer processor.
Concurr. Comput. Pract. Exp., 2012

Designing Linear Algebra Algorithms by Transformation: Mechanizing the Expert Developer.
Proceedings of the High Performance Computing for Computational Science, 2012

Unleashing the high-performance and low-power of multi-core DSPs for general-purpose HPC.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

On the Efficiency of Register File versus Broadcast Interconnect for Collective Communications in Data-Parallel Hardware Accelerators.
Proceedings of the IEEE 24th International Symposium on Computer Architecture and High Performance Computing, 2012

Level-3 BLAS on the TI C6678 Multi-core DSP.
Proceedings of the IEEE 24th International Symposium on Computer Architecture and High Performance Computing, 2012

Mechanizing the expert dense linear algebra developer.
Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2012

A Linear Algebra Core Design for Efficient Level-3 BLAS.
Proceedings of the 23rd IEEE International Conference on Application-Specific Systems, 2012

The Spike Factorization as Domain Decomposition Method; Equivalent and Variant Approaches.
Proceedings of the High-Performance Scientific Computing - Algorithms and Applications, 2012

2011
libflame.
Proceedings of the Encyclopedia of Parallel Computing, 2011

Broadcast.
Proceedings of the Encyclopedia of Parallel Computing, 2011

Allgather.
Proceedings of the Encyclopedia of Parallel Computing, 2011

All-to-All.
Proceedings of the Encyclopedia of Parallel Computing, 2011

Collective Communication.
Proceedings of the Encyclopedia of Parallel Computing, 2011

BLAS (Basic Linear Algebra Subprograms).
Proceedings of the Encyclopedia of Parallel Computing, 2011

High-performance up-and-downdating via Householder-like transformations.
ACM Trans. Math. Softw., 2011

Using desktop computers to solve large-scale dense linear algebra problems.
J. Supercomput., 2011

Goal-Oriented and Modular Stability Analysis.
SIAM J. Matrix Anal. Appl., 2011

Power-aware Dense Linear Algebra Implementations on Multi-core and Many-core Processors.
Proceedings of the 3rd Many-core Applications Research Community (MARC) Symposium, 2011

A high-performance, low-power linear algebra core.
Proceedings of the 22nd IEEE International Conference on Application-specific Systems, 2011

2010
Towards mechanical derivation of Krylov solver libraries.
Proceedings of the International Conference on Computational Science, 2010

Managing the complexity of lookahead for LU factorization with pivoting.
Proceedings of the 22nd Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2010), 2010

Transforming linear algebra libraries: From abstraction to parallelism.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Retargeting PLAPACK to clusters with hardware accelerators.
Proceedings of the 2010 International Conference on High Performance Computing & Simulation, 2010

2009
Programming matrix algorithms-by-blocks for thread-level parallelism.
ACM Trans. Math. Softw., 2009

Out-of-core solution of linear systems on graphics processors.
Int. J. Parallel Emergent Distributed Syst., 2009

The libflame Library for Dense Matrix Computations.
Comput. Sci. Eng., 2009

Solving dense linear systems on platforms with multiple hardware accelerators.
Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009

Using Graphics Processors to Accelerate the Solution of Out-of-Core Linear Systems.
Proceedings of the Eighth International Symposium on Parallel and Distributed Computing, 2009

Solving "large" dense matrix problems on multi-core processors.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

Out-of-Core Computation of the QR Factorization on Multi-core Processors.
Proceedings of the Euro-Par 2009 Parallel Processing, 2009

2008
Scalable parallelization of FLAME code via the workqueuing model.
ACM Trans. Math. Softw., 2008

Updating an LU Factorization with Pivoting.
ACM Trans. Math. Softw., 2008

High-performance implementation of the level-3 BLAS.
ACM Trans. Math. Softw., 2008

Anatomy of high-performance matrix multiplication.
ACM Trans. Math. Softw., 2008

Families of algorithms related to the inversion of a Symmetric Positive Definite matrix.
ACM Trans. Math. Softw., 2008

An Algorithm-by-Blocks for SuperMatrix Band Cholesky Factorization.
Proceedings of the High Performance Computing for Computational Science, 2008

High performance dense linear algebra on a spatially distributed processor.
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008

SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks.
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008

Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures.
Proceedings of the 16th Euromicro International Conference on Parallel, 2008

Design of scalable dense linear algebra libraries for multithreaded architectures: the LU factorization.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

2007
Collective communication: theory, practice, and experience.
Concurr. Comput. Pract. Exp., 2007

Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures.
Proceedings of the 19th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2007), 2007

Toward Scalable Matrix Multiply on Multithreaded Architectures.
Proceedings of the Euro-Par 2007, 2007

The science of programming dense linear algebra libraries.
Proceedings of the 2007 IEEE International Conference on Cluster Computing, 2007

Satisfying your dependencies with SuperMatrix.
Proceedings of the 2007 IEEE International Conference on Cluster Computing, 2007

2006
Improving the performance of reduction to Hessenberg form.
ACM Trans. Math. Softw., 2006

Accumulating Householder transformations, revisited.
ACM Trans. Math. Softw., 2006

Collective communication on architectures that support simultaneous communication over multiple links.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2006

2005
Parallel out-of-core computation and updating of the QR factorization.
ACM Trans. Math. Softw., 2005

Representing linear algebra algorithms in code: the FLAME application program interfaces.
ACM Trans. Math. Softw., 2005

The science of deriving dense linear algebra algorithms.
ACM Trans. Math. Softw., 2005

A Parallel Eigensolver for Dense Symmetric Matrices Based on Multiple Relatively Robust Representations.
SIAM J. Sci. Comput., 2005

Extracting SMP parallelism for dense linear algebra algorithms from high-level specifications.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2005

2004
Rapid Development of High-Performance Out-of-Core Solvers.
Proceedings of the Applied Parallel Computing, 2004

A Family of High-Performance Matrix Multiplication Algorithms.
Proceedings of the Applied Parallel Computing, 2004

Automatic Derivation of Linear Algebra Algorithms with Application to Control Theory.
Proceedings of the Applied Parallel Computing, 2004

Rapid Development of High-Performance Linear Algebra Libraries.
Proceedings of the Applied Parallel Computing, 2004

Attaining higher performance in collective communication.
Proceedings of the 2004 IEEE International Conference on Cluster Computing (CLUSTER 2004), 2004

On optimizing collective communication.
Proceedings of the 2004 IEEE International Conference on Cluster Computing (CLUSTER 2004), 2004

2003
Formal derivation of algorithms: The triangular Sylvester equation.
ACM Trans. Math. Softw., 2003

2002
Parallel Cholesky Factorization of a Block Tridiagonal Matrix.
Proceedings of the 31st International Conference on Parallel Processing Workshops (ICPP 2002 Workshops), 2002

2001
FLAME: Formal Linear Algebra Methods Environment.
ACM Trans. Math. Softw., 2001

A Note On Parallel Matrix Inversion.
SIAM J. Sci. Comput., 2001

Specialized Parallel Algorithms for Solving Lyapunov and Stein Equations.
J. Parallel Distributed Comput., 2001

Parallel Out-of-Core Cholesky and QR Factorization with POOCLAPACK.
Proceedings of the 15th International Parallel & Distributed Processing Symposium (IPDPS-01), 2001

A Family of High-Performance Matrix Multiplication Algorithms.
Proceedings of the Computational Science - ICCS 2001, 2001

Fault-Tolerant High-Performance Matrix Multiplication: Theory and Practice.
Proceedings of the 2001 International Conference on Dependable Systems and Networks (DSN 2001) (formerly: FTCS), 2001

2000
Formal Methods for High-Performance Linear Algebra Libraries.
Proceedings of the Architecture of Scientific Software, 2000

1999
Fast Parallel Kernels for Selected Problems in Control Theory.
Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing, 1999

Application Driven Fast Summation Methods.
Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing, 1999

1998
A Parallel Linear Algebra Server for Matlab-like Environments.
Proceedings of the ACM/IEEE Conference on Supercomputing, 1998

A Flexible Class of Parallel Matrix Multiplication Algorithms.
Proceedings of the 12th International Parallel Processing Symposium / 9th Symposium on Parallel and Distributed Processing (IPPS/SPDP '98), March 30, 1998

PLAPACK: High Performance through High-Level Abstraction.
Proceedings of the 1998 International Conference on Parallel Processing (ICPP '98), 1998

1997
A block Jacobi method on a mesh of processors.
Concurr. Pract. Exp., 1997

SUMMA: scalable universal matrix multiplication algorithm.
Concurr. Pract. Exp., 1997

Parallel implementation of BLAS: general techniques for Level 3 BLAS.
Concurr. Pract. Exp., 1997

PLAPACK Parallel Linear Algebra Package Design Overview.
Proceedings of the ACM/IEEE Conference on Supercomputing, 1997

PLAPACK: Parallel Linear Algebra Package.
Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing, 1997

Using PLAPACK - parallel linear algebra package.
MIT Press, ISBN: 978-0-262-72026-7, 1997

1996
Parallelizing the QR Algorithm for the Unsymmetric Algebraic Eigenvalue Problem: Myths and Reality.
SIAM J. Sci. Comput., 1996

A High Performance Parallel Strassen Implementation.
Parallel Process. Lett., 1996

Broadcasting on Meshes with Wormhole Routing.
J. Parallel Distributed Comput., 1996

Exploiting the Symmetry on the Jacobi Method on a Mesh of Processors.
Proceedings of the 4th Euromicro Workshop on Parallel and Distributed Processing (PDP '96), 1996

1995
A Pipelined Broadcast for Multidimensional Meshes.
Parallel Process. Lett., 1995

Global Combine Algorithms for 2-D Meshes with Wormhole Routing.
J. Parallel Distributed Comput., 1995

Anatomy of a Parallel Out-of-Core Dense Linear Solver.
Proceedings of the 1995 International Conference on Parallel Processing, 1995

1994
On Global Combine Operations.
J. Parallel Distributed Comput., 1994

Scalability Issues Affecting the Design of a Dense Linear Algebra Library.
J. Parallel Distributed Comput., 1994

Performance and Scalability of Finite Element Analysis for Distributed Parallel Computation.
J. Parallel Distributed Comput., 1994

Building a high-performance collective communication library.
Proceedings of Supercomputing '94, 1994

1993
Distributed memory matrix-vector multiplication and conjugate gradient algorithms.
Proceedings of Supercomputing '93, 1993

Two Dimensional Basic Linear Algebra Communication Subprograms.
Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, 1993

LAPACK for Distributed Memory Architectures: The Next Generation.
Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, 1993

Efficient Communication Primitives on Mesh Architectures with Hardware Routing.
Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, 1993

Global Combine on Mesh Architectures with Wormhole Routing.
Proceedings of the Seventh International Parallel Processing Symposium, 1993

1992
Reduction to condensed form for the eigenvalue problem on distributed memory architectures.
Parallel Comput., 1992

1991
LAPACK for Distributed Memory Architectures: Progress Report.
Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, 1991

1990
An asymptotically 100% efficient parallel implementation of the nonsymmetric QR algorithm.
Proceedings of the Second IEEE Symposium on Parallel and Distributed Processing, 1990

