Stephen W. Keckler

ORCID: 0000-0001-6701-6099

Affiliations:
  • NVIDIA
  • University of Texas at Austin, USA


According to our database, Stephen W. Keckler authored at least 157 papers between 1992 and 2024.

Awards

ACM Fellow

ACM Fellow 2011, "For contributions to computer architectures and technology modeling".

IEEE Fellow

IEEE Fellow 2011, "For contributions to computer architectures and memory systems".

Bibliography

2024
Abstracting Sparse DNN Acceleration via Structured Sparse Tensor Decomposition.
CoRR, 2024

WASP: Exploiting GPU Pipeline Parallelism with Hardware-Accelerated Automatic Warp Specialization.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2024

2023
Symphony: Orchestrating Sparse and Dense Tensors with Hierarchical Heterogeneous Processing.
ACM Trans. Comput. Syst., 2023

cuCatch: A Debugging Tool for Efficiently Catching Memory Safety Violations in CUDA Applications.
Proc. ACM Program. Lang., 2023

Augmenting Legacy Networks for Flexible Inference.
Proceedings of the IEEE Intelligent Vehicles Symposium, 2023

Community-based Matrix Reordering for Sparse Linear Algebra Optimization.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2023

Implicit Memory Tagging: No-Overhead Memory Safety Using Alias-Free Tagged ECC.
Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023

VaPr: Variable-Precision Tensors to Accelerate Robot Motion Planning.
IROS, 2023

2022
Making Convolutions Resilient Via Algorithm-Based Error Detection Techniques.
IEEE Trans. Dependable Secur. Comput., 2022

GPU Domain Specialization via Composable On-Package Architecture.
ACM Trans. Archit. Code Optim., 2022

Characterizing and Mitigating Soft Errors in GPU DRAM.
IEEE Micro, 2022

Enabling and Accelerating Dynamic Vision Transformer Inference for Real-Time Applications.
CoRR, 2022

Accelerators.
Computer, 2022

Saving PAM4 Bus Energy with SMOREs: Sparse Multi-level Opportunistic Restricted Encodings.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2022

GPU Subwarp Interleaving.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2022

Exploiting Temporal Data Diversity for Detecting Safety-critical Faults in AV Compute Systems.
Proceedings of the 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2022

Zhuyi: perception processing rate estimation for safety in autonomous vehicles.
Proceedings of the DAC '22: 59th ACM/IEEE Design Automation Conference, San Francisco, California, USA, July 10, 2022

2021
Evolution of the Graphics Processing Unit (GPU).
IEEE Micro, 2021

SNAP: An Efficient Sparse Neural Acceleration Processor for Unstructured Sparse Deep Neural Network Inference.
IEEE J. Solid State Circuits, 2021

Cooperative Profile Guided Optimizations.
Comput. Graph. Forum, 2021

Simba: scaling deep-learning inference with chiplet-based architecture.
Commun. ACM, 2021

Generating and Characterizing Scenarios for Safety Testing of Autonomous Vehicles.
Proceedings of the IEEE Intelligent Vehicles Symposium, 2021

Suraksha: A Framework to Analyze the Safety Implications of Perception Design Choices in AVs.
Proceedings of the 32nd IEEE International Symposium on Software Reliability Engineering, 2021

Optimizing Selective Protection for CNN Resilience.
Proceedings of the 32nd IEEE International Symposium on Software Reliability Engineering, 2021

Suraksha: A Quantitative AV Safety Evaluation Framework to Analyze Safety Implications of Perception Design Choices.
Proceedings of the 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, 2021

NVBitFI: Dynamic Fault Injection for GPUs.
Proceedings of the 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2021

2020
A 0.32-128 TOPS, Scalable Multi-Chip-Module-Based Deep Neural Network Inference Accelerator With Ground-Referenced Signaling in 16 nm.
IEEE J. Solid State Circuits, 2020

Estimating Silent Data Corruption Rates Using a Two-Level Model.
CoRR, 2020

HarDNN: Feature Map Vulnerability Evaluation in CNNs.
CoRR, 2020

Buddy Compression: Enabling Larger Memory for Deep Learning and HPC Workloads on GPUs.
Proceedings of the 47th ACM/IEEE Annual International Symposium on Computer Architecture, 2020

Speculative reconvergence for improved SIMT efficiency.
Proceedings of the CGO '20: 18th ACM/IEEE International Symposium on Code Generation and Optimization, 2020

2019
Exposing Memory Access Patterns to Improve Instruction and Memory Efficiency in GPUs.
ACM Trans. Archit. Code Optim., 2019

Kayotee: A Fault Injection-based System to Assess the Safety and Reliability of Autonomous Vehicles to Faults and Errors.
CoRR, 2019

A 0.11 pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-based Deep Neural Network Accelerator with Ground-Reference Signaling in 16nm.
Proceedings of the 2019 Symposium on VLSI Circuits, Kyoto, Japan, June 9-14, 2019

SNAP: A 1.67 - 21.55TOPS/W Sparse Neural Acceleration Processor for Unstructured Sparse Deep Neural Network Inference in 16nm CMOS.
Proceedings of the 2019 Symposium on VLSI Circuits, Kyoto, Japan, June 9-14, 2019

NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs.
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019

Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture.
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019

Timeloop: A Systematic Approach to DNN Accelerator Evaluation.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2019

GPU snapshot: checkpoint offloading for GPU-dense systems.
Proceedings of the ACM International Conference on Supercomputing, 2019

MAGNet: A Modular Accelerator Generator for Neural Networks.
Proceedings of the International Conference on Computer-Aided Design, 2019

A 0.11 pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-Based Deep Neural Network Accelerator Designed with a High-Productivity VLSI Methodology.
Proceedings of the 2019 IEEE Hot Chips 31 Symposium (HCS), 2019

On the Trend of Resilience for GPU-Dense Systems.
Proceedings of the 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2019

ML-Based Fault Injection for Autonomous Vehicles: A Case for Bayesian Fault Injection.
Proceedings of the 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2019

Buffets: An Efficient and Composable Storage Idiom for Explicit Decoupled Data Orchestration.
Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019

2018
Software-Directed Techniques for Improved GPU Register File Utilization.
ACM Trans. Archit. Code Optim., 2018

Structurally Sparsified Backward Propagation for Faster Long Short-Term Memory Training.
CoRR, 2018

Optimizing software-directed instruction replication for GPU error detection.
Proceedings of the International Conference for High Performance Computing, 2018

SwapCodes: Error Codes for Hardware-Software Cooperative GPU Pipeline Error Detection.
Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, 2018

Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2018

2017
Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks.
CoRR, 2017

Understanding error propagation in deep learning neural network (DNN) accelerators and applications.
Proceedings of the International Conference for High Performance Computing, 2017

Fine-grained DRAM: energy-efficient DRAM for extreme bandwidth systems.
Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 2017

SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation.
Proceedings of the 2017 IEEE International Symposium on Performance Analysis of Systems and Software, 2017

SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks.
Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017

Architecting an Energy-Efficient DRAM System for GPUs.
Proceedings of the 2017 IEEE International Symposium on High Performance Computer Architecture, 2017

2016
Virtualizing Deep Neural Networks for Memory-Efficient Neural Network Design.
CoRR, 2016

vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design.
Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016

A patch memory system for image processing and computer vision.
Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016

CLARA: Circular Linked-List Auto and Self Refresh Architecture.
Proceedings of the Second International Symposium on Memory Systems, 2016

Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems.
Proceedings of the 43rd ACM/IEEE Annual International Symposium on Computer Architecture, 2016

Towards high performance paged memory for GPUs.
Proceedings of the 2016 IEEE International Symposium on High Performance Computer Architecture, 2016

A case for toggle-aware compression for GPU systems.
Proceedings of the 2016 IEEE International Symposium on High Performance Computer Architecture, 2016

Selective GPU caches to eliminate CPU-GPU HW cache coherence.
Proceedings of the 2016 IEEE International Symposium on High Performance Computer Architecture, 2016

A real-time energy-efficient superpixel hardware accelerator for mobile computer vision applications.
Proceedings of the 53rd Annual Design Automation Conference, 2016

2015
Designing Efficient Heterogeneous Memory Architectures.
IEEE Micro, 2015

Increasing Interconnection Network Throughput with Virtual Channels.
Computer, 2015

Toggle-Aware Compression for GPUs.
IEEE Comput. Archit. Lett., 2015

Anatomy of GPU Memory System for Multi-Application Execution.
Proceedings of the 2015 International Symposium on Memory Systems, 2015

Flexible software profiling of GPU architectures.
Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015

A variable warp size architecture.
Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015

GPU Computing Pipeline Inefficiencies and Optimization Opportunities in Heterogeneous CPU-GPU Processors.
Proceedings of the 2015 IEEE International Symposium on Workload Characterization, 2015

Unlocking bandwidth for GPUs in CC-NUMA systems.
Proceedings of the 21st IEEE International Symposium on High Performance Computer Architecture, 2015

Page Placement Strategies for GPUs within Heterogeneous Memory Systems.
Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, 2015

2014
Scaling Power and Performance via Processor Composability.
IEEE Trans. Computers, 2014

2014 International Symposium on Computer Architecture Influential Paper Award; 2014 Maurice Wilkes Award Given to Ravi Rajwar.
IEEE Micro, 2014

Rethinking caches for throughput processors: technical perspective.
Commun. ACM, 2014

Scaling the Power Wall: A Path to Exascale.
Proceedings of the International Conference for High Performance Computing, 2014

Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures.
Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014

Arbitrary Modulus Indexing.
Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014

A comparative analysis of microarchitecture effects on CPU and GPU memory system behavior.
Proceedings of the 2014 IEEE International Symposium on Workload Characterization, 2014

Author retrospective for a NUCA substrate for flexible CMP cache sharing.
Proceedings of the ACM International Conference on Supercomputing 25th Anniversary Volume, 2014

Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications.
Proceedings of the Seventh Workshop on General Purpose Processing Using GPUs, 2014

2013
How to implement effective prediction and forwarding for fusable dynamic multicore architectures.
Proceedings of the 19th IEEE International Symposium on High Performance Computer Architecture, 2013

21st century digital design tools.
Proceedings of the 50th Annual Design Automation Conference 2013, 2013

Convergence and scalarization for data-parallel architectures.
Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization, 2013

2012
A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors.
ACM Trans. Comput. Syst., 2012

A QoS-Enabled On-Die Interconnect Fabric for Kilo-Node Chips.
IEEE Micro, 2012

Charles R. (Chuck) Moore (1961 - 2012).
IEEE Micro, 2012

Massively Multithreaded Computing Systems.
Computer, 2012

Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor.
Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012

2011
GPUs and the Future of Parallel Computing.
IEEE Micro, 2011

A compile-time managed multi-level register file hierarchy.
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011

Evaluation and optimization of multicore performance bottlenecks in supercomputing applications.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2011

Kilo-NOC: a heterogeneous network-on-chip architecture for scalability and service guarantees.
Proceedings of the 38th International Symposium on Computer Architecture (ISCA 2011), 2011

Energy-efficient mechanisms for managing thread context in throughput processors.
Proceedings of the 38th International Symposium on Computer Architecture (ISCA 2011), 2011

Exploiting criticality to reduce bottlenecks in distributed uniprocessors.
Proceedings of the 17th International Conference on High-Performance Computer Architecture (HPCA-17 2011), 2011

2010
Netrace: dependency-driven trace-based network-on-chip simulation.
Proceedings of the Third International Workshop on Network on Chip Architectures, 2010

Topology-Aware Quality-of-Service Support in Highly Integrated Chip Multiprocessors.
Proceedings of the Computer Architecture, 2010

2009
On-Chip Networks for Multicore Systems.
Proceedings of the Multicore Processors and Systems, 2009

Composable Multicore Chips.
Proceedings of the Multicore Processors and Systems, 2009

Segment gating for static energy reduction in Networks-on-Chip.
Proceedings of the Second International Workshop on Network on Chip Architectures, 2009

Preemptive virtual clock: a flexible, efficient, and cost-effective QOS scheme for networks-on-chip.
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42 2009), 2009

Analysis of the TRIPS prototype block predictor.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2009

End-to-end validation of architectural power models.
Proceedings of the 2009 International Symposium on Low Power Electronics and Design, 2009

Express Cube Topologies for on-Chip Interconnects.
Proceedings of the 15th International Conference on High-Performance Computer Architecture (HPCA-15 2009), 2009

An evaluation of the TRIPS computer system.
Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, 2009

2008
Multitasking workload scheduling on flexible core chip multiprocessors.
SIGARCH Comput. Archit. News, 2008

High performance dense linear algebra on a spatially distributed processor.
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008

Counting Dependence Predictors.
Proceedings of the 35th International Symposium on Computer Architecture (ISCA 2008), 2008

Regional congestion awareness for load balance in networks-on-chip.
Proceedings of the 14th International Conference on High-Performance Computer Architecture (HPCA-14 2008), 2008

2007
A NUCA Substrate for Flexible CMP Cache Sharing.
IEEE Trans. Parallel Distributed Syst., 2007

Research Challenges for On-Chip Interconnection Networks.
IEEE Micro, 2007

On-Chip Interconnection Networks of the TRIPS Chip.
IEEE Micro, 2007

Reconciling performance and programmability in networking systems.
Proceedings of the ACM SIGCOMM 2007 Conference on Applications, 2007

Implementation and Evaluation of a Dynamically Routed Processor Operand Network.
Proceedings of the First International Symposium on Networks-on-Chips, 2007

Composable Lightweight Processors.
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-40 2007), 2007

Thermal response to DVFS: analysis with an Intel Pentium M.
Proceedings of the 2007 International Symposium on Low Power Electronics and Design, 2007

Late-binding: enabling unordered load-store queues.
Proceedings of the 34th International Symposium on Computer Architecture (ISCA 2007), 2007

Power, Performance, and Thermal Management for High-Performance Systems.
Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

The future of multi-core technologies.
Proceedings of the 2007 IEEE International Conference on Cluster Computing, 2007

2006
Dataflow Predication.
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-39 2006), 2006

Distributed Microarchitectural Protocols in the TRIPS Prototype Processor.
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-39 2006), 2006

Decomposing memory performance: data structures and phases.
Proceedings of the 5th International Symposium on Memory Management, 2006

Critical path analysis of the TRIPS architecture.
Proceedings of the 2006 IEEE International Symposium on Performance Analysis of Systems and Software, 2006

Design and Implementation of the TRIPS Primary Memory System.
Proceedings of the 24th International Conference on Computer Design (ICCD 2006), 2006

Implementation and Evaluation of On-Chip Network Architectures.
Proceedings of the 24th International Conference on Computer Design (ICCD 2006), 2006

2004
TRIPS: A polymorphous architecture for exploiting ILP, TLP, and DLP.
ACM Trans. Archit. Code Optim., 2004

Recent extensions to the SimpleScalar tool suite.
SIGMETRICS Perform. Evaluation Rev., 2004

Scalable Hardware Memory Disambiguation for High-ILP Processors.
IEEE Micro, 2004

Scaling to the End of Silicon with EDGE Architectures.
Computer, 2004

Scalable selective re-execution for EDGE architectures.
Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems, 2004

Static Placement, Dynamic Issue (SPDI) Scheduling for EDGE Architectures.
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT 2004), 29 September, 2004

2003
Static energy reduction techniques for microprocessor caches.
IEEE Trans. Very Large Scale Integr. Syst., 2003

Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture.
IEEE Micro, 2003

Nonuniform Cache Architectures for Wire-Delay Dominated On-Chip Caches.
IEEE Micro, 2003

Universal Mechanisms for Data-Parallel Architectures.
Proceedings of the 36th Annual International Symposium on Microarchitecture, 2003

Microprocessor pipeline energy analysis.
Proceedings of the 2003 International Symposium on Low Power Electronics and Design, 2003

Exploiting Microarchitectural Redundancy For Defect Tolerance.
Proceedings of the 21st International Conference on Computer Design (ICCD 2003), 2003

Routed Inter-ALU Networks for ILP Scalability and Performance.
Proceedings of the 21st International Conference on Computer Design (ICCD 2003), 2003

2002
Errata on "Measuring Experimental Error in Microprocessor Simulation".
SIGARCH Comput. Archit. News, 2002

The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays.
Proceedings of the 29th International Symposium on Computer Architecture (ISCA 2002), 2002

Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic.
Proceedings of the 2002 International Conference on Dependable Systems and Networks (DSN 2002), 2002

An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches.
Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-X), 2002

2001
A design space evaluation of grid processor architectures.
Proceedings of the 34th Annual International Symposium on Microarchitecture, 2001

Measuring Experimental Error in Microprocessor Simulation.
Proceedings of the 28th Annual International Symposium on Computer Architecture, 2001

Exploring the Design Space of Future CMPs.
Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques (PACT 2001), 2001

2000
The impact of delay on the design of branch predictors.
Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture, 2000

Processor Mechanisms for Software Shared Memory.
Proceedings of the High Performance Computing, Third International Symposium, 2000

Clock rate versus IPC: the end of the road for conventional microarchitectures.
Proceedings of the 27th International Symposium on Computer Architecture (ISCA 2000), 2000

1999
Concurrent Event Handling through Multithreading.
IEEE Trans. Computers, 1999

1998
Fast thread communication and synchronization mechanisms for a scalable single chip multiprocessor.
PhD thesis, 1998

An Efficient, Protected Message Interface.
Computer, 1998

Exploiting Fine-grain Thread Level Parallelism on the MIT Multi-ALU Processor.
Proceedings of the 25th Annual International Symposium on Computer Architecture, 1998

The effects of explicitly parallel mechanisms on the multi-ALU processor cluster pipeline.
Proceedings of the International Conference on Computer Design: VLSI in Computers and Processors, 1998

1997
The M-machine multicomputer.
Int. J. Parallel Program., 1997

1994
Hardware Support for Fast Capability-based Addressing.
Proceedings of ASPLOS-VI, 1994

1992
Processor Coupling: Integrating Compile Time and Runtime Scheduling for Parallelism.
Proceedings of the 19th Annual International Symposium on Computer Architecture. Gold Coast, 1992

