Richard W. Vuduc

Affiliations:
  • Georgia Institute of Technology, Atlanta GA, USA


According to our database, Richard W. Vuduc authored at least 129 papers between 2000 and 2023.

Bibliography

2023
Calculon: a methodology and tool for high-level co-design of systems and large language models.
Proceedings of the International Conference for High Performance Computing, 2023

Distributed-Memory Parallel JointNMF.
Proceedings of the 37th International Conference on Supercomputing, 2023

2022
Critique of "MemXCT: Memory-Centric X-Ray CT Reconstruction With Massive Parallelization" by SCC Team From Georgia Tech.
IEEE Trans. Parallel Distributed Syst., 2022

Jack, The Autotuner.
Comput. Sci. Eng., 2022

Exaflops Biomedical Knowledge Graph Analytics.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

Nimble GNN Embedding with Tensor-Train Decomposition.
Proceedings of the KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14, 2022

"Smarter" NICs for faster molecular dynamics: a case study.
Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium, 2022

ParaGraph: An application-simulator interface and toolkit for hardware-software co-design.
Proceedings of the 51st International Conference on Parallel Processing, 2022

2021
ORCA: Outlier detection and Robust Clustering for Attributed graphs.
J. Glob. Optim., 2021

Communication-avoiding kernel ridge regression on parallel and distributed systems.
CCF Trans. High Perform. Comput., 2021

Is it Nemo or Dory? Fast and accurate object detection for IoT and edge devices.
Proceedings of the IoT '21: 11th International Conference on the Internet of Things, St. Gallen, Switzerland, November 8, 2021

An interface for multidimensional arrays in Arkouda.
Proceedings of the 2021 IEEE High Performance Extreme Computing Conference, 2021

Scalable All-pairs Shortest Paths for Huge Graphs on Multi-GPU Clusters.
Proceedings of the HPDC '21: The 30th International Symposium on High-Performance Parallel and Distributed Computing, 2021

Online model swapping for architectural simulation.
Proceedings of the CF '21: Computing Frontiers Conference, 2021

CUP: Cluster Pruning for Compressing Deep Neural Networks.
Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), 2021

2020
Automatic Generation of High-Performance FFT Kernels on Arm and X86 CPUs.
IEEE Trans. Parallel Distributed Syst., 2020

Programming Strategies for Irregular Algorithms on the Emu Chick.
ACM Trans. Parallel Comput., 2020

Scalable knowledge graph analytics at 136 petaflop/s.
Proceedings of the International Conference for High Performance Computing, 2020

Distributed-memory parallel symmetric nonnegative matrix factorization.
Proceedings of the International Conference for High Performance Computing, 2020

A supernodal all-pairs shortest path algorithm.
Proceedings of the PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020

Intrepydd: performance, productivity, and portability for data science application kernels.
Proceedings of the 2020 ACM SIGPLAN International Symposium on New Ideas, 2020

Evaluating Gather and Scatter Performance on CPUs and GPUs.
Proceedings of the MEMSYS 2020: The International Symposium on Memory Systems, 2020

Max orientation coverage: efficient path planning to avoid collisions in the CNC milling of 3D objects.
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2020

2019
A microbenchmark characterization of the Emu chick.
Parallel Comput., 2019

A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems.
J. Parallel Distributed Comput., 2019

Optimizing sparse tensor times matrix on GPUs.
J. Parallel Distributed Comput., 2019

Temporal phenotyping of medically complex children via PARAFAC2 tensor factorization.
J. Biomed. Informatics, 2019

CUP: Cluster Pruning for Compressing Deep Neural Networks.
CoRR, 2019

Self-stabilizing Connected Components.
Proceedings of the 9th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale, 2019

Adaptive Deep Path: Efficient Coverage of a Known Environment under Various Configurations.
Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2019

Load-Balanced Sparse MTTKRP on GPUs.
Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019

A communication-avoiding 3D sparse triangular solver.
Proceedings of the ACM International Conference on Supercomputing, 2019

Efficient and effective sparse tensor reordering.
Proceedings of the ACM International Conference on Supercomputing, 2019

Faster parallel collision detection at high resolution for CNC milling applications.
Proceedings of the 48th International Conference on Parallel Processing, 2019

2018
Autotuning in High-Performance Computing Applications.
Proc. IEEE, 2018

Spatter: A Benchmark Suite for Evaluating Sparse Access Patterns.
CoRR, 2018

A Simple Methodology for Computing Families of Algorithms.
CoRR, 2018

HiCOO: hierarchical storage of sparse tensors.
Proceedings of the International Conference for High Performance Computing, 2018

SUSTain: Scalable Unsupervised Scoring for Tensors and its Application to Phenotyping.
Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018

A Communication-Avoiding 3D LU Factorization Algorithm for Sparse Matrices.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018

An Energy-Efficient Single-Source Shortest Path Algorithm.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018

An Initial Characterization of the Emu Chick.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018

Accurate, Fast and Scalable Kernel Ridge Regression on Parallel and Distributed Systems.
Proceedings of the 32nd International Conference on Supercomputing, 2018

2017
Design and Implementation of a Communication-Optimal Classifier for Distributed Kernel Support Vector Machines.
IEEE Trans. Parallel Distributed Syst., 2017

Modeling the Power Variability of Core Speed Scaling on Homogeneous Multicore Systems.
Sci. Program., 2017

Polyadic Regression and its Application to Chemogenomics.
Proceedings of the 2017 SIAM International Conference on Data Mining, 2017

Efficient Communications in Training Large Scale Neural Networks.
Proceedings of the Thematic Workshops of ACM Multimedia 2017, Mountain View, CA, USA, October 23, 2017

SPARTan: Scalable PARAFAC2 for Large & Sparse Data.
Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13, 2017

HPPAC Workshop Introduction.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

Model-Driven Sparse CP Decomposition for Higher-Order Tensors.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

2016
Sparse Hierarchical Tucker Factorization and its Application to Healthcare.
CoRR, 2016

Wanted: Floating-Point Add Round-off Error instruction.
CoRR, 2016

Optimizing Sparse Tensor Times Matrix on Multi-core and Many-Core Architectures.
Proceedings of the 6th Workshop on Irregular Applications: Architecture and Algorithms, 2016

Hybrid Dynamic Trees for Extreme-Resolution 3D Sparse Data Modeling.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

Analyzing the Energy Efficiency of the Fast Multipole Method Using a DVFS-Aware Energy Model.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

A Self-Correcting Connected Components Algorithm.
Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale, 2016

2015
UNICORN: a unified approach for localizing non-deadlock concurrency bugs.
Softw. Test. Verification Reliab., 2015

Branch-Avoiding Graph Algorithms.
Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures, 2015

An input-adaptive and in-place approach to dense tensor-times-matrix multiply.
Proceedings of the International Conference for High Performance Computing, 2015

A GPU-parallel construction of volumetric tree.
Proceedings of the 5th Workshop on Irregular Applications - Architectures and Algorithms, 2015

CA-SVM: Communication-Avoiding Support Vector Machines on Distributed Systems.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

A Sparse Direct Solver for Distributed Memory Xeon Phi-Accelerated Systems.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

Sparse Hierarchical Tucker Factorization and Its Application to Healthcare.
Proceedings of the 2015 IEEE International Conference on Data Mining, 2015

2014
A distributed kernel summation framework for general-dimension machine learning.
Stat. Anal. Data Min., 2014

Improving the energy efficiency of Big Cores.
Proceedings of the ACM/IEEE 41st International Symposium on Computer Architecture, 2014

Algorithmic Time, Energy, and Power on Candidate HPC Compute Building Blocks.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

A Distributed CPU-GPU Sparse Direct Solver.
Proceedings of the Euro-Par 2014 Parallel Processing, 2014

A CPU-GPU Hybrid Implementation and Model-Driven Scheduling of the Fast Multipole Method.
Proceedings of the Seventh Workshop on General Purpose Processing Using GPUs, 2014

2013
Introduction for Special Issue on Autotuning.
Int. J. High Perform. Comput. Appl., 2013

How much (execution) time and energy does my algorithm cost?
XRDS, 2013

Sustainable Software Development for Next-Gen Sequencing (NGS) Bioinformatics on Emerging Platforms.
CoRR, 2013

Self-stabilizing iterative solvers.
Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2013

Methods for High-Throughput Computation of Elementary Functions.
Proceedings of the Parallel Processing and Applied Mathematics, 2013

Griffin: grouping suspicious memory-access patterns to improve understanding of concurrency bugs.
Proceedings of the International Symposium on Software Testing and Analysis, 2013

A Theoretical Framework for Algorithm-Architecture Co-design.
Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

A Roofline Model of Energy.
Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

2012
Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU).
Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, ISBN: 978-3-031-01737-7, 2012

When Prefetching Works, When It Doesn't, and Why.
ACM Trans. Archit. Code Optim., 2012

A massively parallel adaptive fast multipole method on heterogeneous architectures.
Commun. ACM, 2012

Toward a Theory of Algorithm-Architecture Co-design.
Proceedings of the High Performance Computing for Computational Science, 2012

Brief announcement: towards a communication optimal fast multipole method and its implications at exascale.
Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures, 2012

A Distributed Kernel Summation Framework for General-Dimension Machine Learning.
Proceedings of the Twelfth SIAM International Conference on Data Mining, 2012

Optimizing the computation of n-point correlations on large-scale astronomical data.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

Synthesizing Loops for Program Inversion.
Proceedings of the Reversible Computation, 4th International Workshop, 2012

A performance analysis framework for identifying potential benefits in GPGPU applications.
Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2012

A type theory for probability density functions.
Proceedings of the 39th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2012

Courses in High-performance Computing for Scientists and Engineers.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012

Modeling and Analysis for Performance and Power.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012

Communication-Optimal Parallel N-body Solvers.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012

A Unified Approach for Localizing Non-deadlock Concurrency Bugs.
Proceedings of the Fifth IEEE International Conference on Software Testing, 2012

On the communication complexity of 3D FFTs and its implications for Exascale.
Proceedings of the International Conference on Supercomputing, 2012

A New Method for Program Inversion.
Proceedings of the Compiler Construction - 21st International Conference, 2012

2011
Autotuning.
Proceedings of the Encyclopedia of Parallel Computing, 2011

The Sixth International Workshop on Automatic Performance Tuning (iWAPT2011).
Proceedings of the International Conference on Computational Science, 2011

What GPU Computing Means for High-End Systems.
IEEE Micro, 2011

The Backstroke framework for source level reverse computation applied to parallel discrete event simulation.
Proceedings of the Winter Simulation Conference 2011, 2011

Balance Principles for Algorithm-Architecture Co-Design.
Proceedings of the 3rd USENIX Workshop on Hot Topics in Parallelism, 2011

2010
Toward interactive statistical modeling.
Proceedings of the International Conference on Computational Science, 2010

Petascale Direct Numerical Simulation of Blood Flow on 200K Cores and Heterogeneous Architectures.
Proceedings of the Conference on High Performance Computing Networking, 2010

Diagnosis, Tuning, and Redesign for Multicore Performance: A Case Study of the Fast Multipole Method.
Proceedings of the Conference on High Performance Computing Networking, 2010

Model-driven autotuning of sparse matrix-vector multiply on GPUs.
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010

Applying the concurrent collections programming model to asynchronous parallel dense linear algebra.
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010

Many-Thread Aware Prefetching Mechanisms for GPGPU Applications.
Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010

Unconventional wisdom in multicore computing.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Performance evaluation of concurrent collections on high-performance multicore computing systems.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Falcon: fault localization in concurrent programs.
Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering, 2010

2009
Optimization of sparse matrix-vector multiplication on emerging multicore platforms.
Parallel Comput., 2009

Effective Source-to-Source Outlining to Support Whole Program Empirical Optimization.
Proceedings of the Languages and Compilers for Parallel Computing, 2009

Understanding the design trade-offs among current multicore systems for numerical computations.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems.
Proceedings of the 23rd international conference on Supercomputing, 2009

Direct N-body Kernels for Multicore Platforms.
Proceedings of the ICPP 2009, 2009

2007
When cache blocking of sparse matrix vector multiply works and why.
Appl. Algebra Eng. Commun. Comput., 2007

Techniques for specifying bug patterns.
Proceedings of the 5th Workshop on Parallel and Distributed Systems: Testing, 2007

POET: Parameterized Optimizations for Empirical Tuning.
Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

Communicating Software Architecture using a Unified Single-View Visualization.
Proceedings of the 12th International Conference on Engineering of Complex Computer Systems (ICECCS 2007), 2007

2006
Improving distributed memory applications testing by message perturbation.
Proceedings of the 4th Workshop on Parallel and Distributed Systems: Testing, 2006

Annotating user-defined abstractions for optimization.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

2005
Self-Adapting Linear Algebra Algorithms and Software.
Proc. IEEE, 2005

An Extensible Open-Source Compiler Infrastructure for Testing.
Proceedings of the Hardware and Software Verification and Testing, 2005

Fast Sparse Matrix-Vector Multiplication by Exploiting Variable Block Structure.
Proceedings of the High Performance Computing and Communications, 2005

2004
Statistical Models for Empirical Search-Based Performance Tuning.
Int. J. High Perform. Comput. Appl., 2004

Sparsity: Optimization Framework for Sparse Matrix Kernels.
Int. J. High Perform. Comput. Appl., 2004

Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply.
Proceedings of the 33rd International Conference on Parallel Processing (ICPP 2004), 2004

2003
Memory Hierarchy Optimizations and Performance Bounds for Sparse A^T Ax.
Proceedings of the Computational Science - ICCS 2003, 2003

2002
Performance optimizations and bounds for sparse matrix-vector multiply.
Proceedings of the 2002 ACM/IEEE conference on Supercomputing, 2002

2001
Statistical Models for Automatic Performance Tuning.
Proceedings of the Computational Science - ICCS 2001, 2001

2000
SWAMI: a framework for collaborative filtering algorithm development and evaluation.
Proceedings of the SIGIR 2000: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000

Code Generators for Automatic Tuning of Numerical Kernels: Experiences with FFTW.
Proceedings of the Semantics, 2000

