Norman P. Jouppi

Orcid: 0000-0003-1765-1929

  • Google, Mountain View, CA, USA

According to our database, Norman P. Jouppi authored at least 139 papers between 1982 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.


ACM Fellow

ACM Fellow 2006, "For contributions to the design and analysis of high-performance processors and memory systems."

IEEE Fellow

IEEE Fellow 2003, "For contributions to the design and analysis of high performance processors and memory systems."




Reconfigurable Lightwave Fabrics for ML Supercomputers.
Proceedings of the Optical Fiber Communications Conference and Exhibition, 2024

FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search.
CoRR, 2023

RETROSPECTIVE: Corona: System Implications of Emerging Nanophotonic Technology.
CoRR, 2023

Lightwave Fabrics: At-Scale Optical Circuit Switching for Datacenter and Machine Learning Systems.
Proceedings of the ACM SIGCOMM 2023 Conference, 2023

TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings.
Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023

A Machine Learning Supercomputer with an Optically Reconfigurable Interconnect and Embeddings Support.
Proceedings of the 35th IEEE Hot Chips Symposium, 2023

Hyperscale Hardware Optimized Neural Architecture Search.
Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023

The Design Process for Google's Training Chips: TPUv2 and TPUv3.
IEEE Micro, 2021

Ten Lessons From Three Generations Shaped Google's TPUv4i: Industrial Product.
Proceedings of the 48th ACM/IEEE Annual International Symposium on Computer Architecture, 2021

NeuroMeter: An Integrated Power, Area, and Timing Modeling Framework for Machine Learning Accelerators (Industry Track Paper).
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2021

Searching for Fast Model Families on Datacenter Accelerators.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

Highly Available Data Parallel ML training on Mesh Networks.
CoRR, 2020

A domain-specific supercomputer for training deep neural networks.
Commun. ACM, 2020

Google's Training Chips Revealed: TPUv2 and TPUv3.
Proceedings of the IEEE Hot Chips 32 Symposium, 2020

Motivation for and Evaluation of the First Tensor Processing Unit.
IEEE Micro, 2018

A domain-specific architecture for deep neural networks.
Commun. ACM, 2018

In-Datacenter Performance Analysis of a Tensor Processing Unit.
CoRR, 2017

Common Bonds: MIPS, HPS, Two-Level Branch Prediction, and Compressed Code RISC Processor.
IEEE Micro, 2016

CACTI-IO: CACTI With OFF-Chip Power-Area-Timing Models.
IEEE Trans. Very Large Scale Integr. Syst., 2015

History-Assisted Adaptive-Granularity Caches (HAAG$) for High Performance 3D DRAM Architectures.
Proceedings of the 29th ACM on International Conference on Supercomputing, 2015

Efficient Data Mapping and Buffering Techniques for Multilevel Cell Phase-Change Memories.
ACM Trans. Archit. Code Optim., 2014

Endurance-aware cache line management for non-volatile caches.
ACM Trans. Archit. Code Optim., 2014

The McPAT Framework for Multicore and Manycore Architectures: Simultaneously Modeling Power, Area, and Timing.
ACM Trans. Archit. Code Optim., 2013

A circuit-architecture co-optimization framework for exploring nonvolatile memory hierarchies.
ACM Trans. Archit. Code Optim., 2013

Practical nonvolatile multilevel-cell phase change memory.
Proceedings of the International Conference for High Performance Computing, 2013

Kiln: closing the performance gap between systems with and without persistence support.
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013

A circuit-architecture co-optimization framework for evaluating emerging memory hierarchies.
Proceedings of the 2013 IEEE International Symposium on Performance Analysis of Systems & Software, 2013

McSimA+: A manycore simulator with application-level+ simulation and detailed microarchitecture modeling.
Proceedings of the 2013 IEEE International Symposium on Performance Analysis of Systems & Software, 2013

Design of cross-point metal-oxide ReRAM emphasizing reliability and cost.
Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, 2013

i²WAP: Improving non-volatile cache lifetime by reducing inter- and intra-set write variations.
Proceedings of the 19th IEEE International Symposium on High Performance Computer Architecture, 2013

Understanding the trade-offs in multi-level cell ReRAM memory design.
Proceedings of the 50th Annual Design Automation Conference 2013, 2013

NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2012

Improving System Energy Efficiency with Memory Rank Subsetting.
ACM Trans. Archit. Code Optim., 2012

Free-p: A Practical End-to-End Nonvolatile Memory Protection Mechanism.
IEEE Micro, 2012

Optical High Radix Switch Design.
IEEE Micro, 2012

MAGE: adaptive granularity and ECC for resilient and power efficient memory systems.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

Design trade-offs for high density cross-point resistive memory.
Proceedings of the International Symposium on Low Power Electronics and Design, 2012

LOT-ECC: Localized and tiered reliability mechanisms for commodity memory systems.
Proceedings of the 39th International Symposium on Computer Architecture (ISCA 2012), 2012

Staged Reads: Mitigating the impact of DRAM writes on DRAM reads.
Proceedings of the 18th IEEE International Symposium on High Performance Computer Architecture, 2012

CACTI-3DD: Architecture-level modeling for 3D die-stacked DRAM main memory.
Proceedings of the 2012 Design, Automation & Test in Europe Conference & Exhibition, 2012

Multi-Core Cache Hierarchies
Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, ISBN: 978-3-031-01734-6, 2011

Hybrid checkpointing using emerging nonvolatile memories for future exascale systems.
ACM Trans. Archit. Code Optim., 2011

DRAM errors in the wild: technical perspective.
Commun. ACM, 2011

System implications of memory reliability in exascale computing.
Proceedings of the Conference on High Performance Computing Networking, 2011

System-level integrated server architectures for scale-out datacenters.
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011

Combining memory and a controller with photonics through 3D-stacking to enable scalable and energy-efficient systems.
Proceedings of the 38th International Symposium on Computer Architecture (ISCA 2011), 2011

The role of optics in future high radix switch design.
Proceedings of the 38th International Symposium on Computer Architecture (ISCA 2011), 2011

CACTI-P: Architecture-level modeling for SRAM-based structures with advanced leakage reduction techniques.
Proceedings of the 2011 IEEE/ACM International Conference on Computer-Aided Design, 2011

FREE-p: Protecting non-volatile memory against both hard and soft errors.
Proceedings of the 17th International Conference on High-Performance Computer Architecture (HPCA-17 2011), 2011

Design implications of memristor-based RRAM cross-point structures.
Proceedings of the Design, Automation and Test in Europe, 2011

CMOS Nanophotonics: Technology, System Implications, and a CMP Case Study.
Proceedings of the Low Power Networks-on-Chip, 2011

Simple but Effective Heterogeneous Main Memory with On-Chip Memory Controller Support.
Proceedings of the Conference on High Performance Computing Networking, 2010

Rethinking DRAM design and organization for energy-constrained multi-cores.
Proceedings of the 37th International Symposium on Computer Architecture (ISCA 2010), 2010

Introduction to the special issue on the 2008 workshop on design, analysis, and simulation of chip multiprocessors (dasCMP'08).
SIGARCH Comput. Archit. News, 2009

A High-Speed Optical Multidrop Bus for Computer Interconnections.
IEEE Micro, 2009

Multicore DIMM: an Energy Efficient Memory Module with Independently Controlled DRAMs.
IEEE Comput. Archit. Lett., 2009

Technical perspective - Software and hardware support for deterministic replay of parallel programs.
Commun. ACM, 2009

Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems.
Proceedings of the ACM/IEEE Conference on High Performance Computing, 2009

Future scaling of processor-memory interfaces.
Proceedings of the ACM/IEEE Conference on High Performance Computing, 2009

McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures.
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42 2009), 2009

Emerging technologies and their impact on system design.
Proceedings of the 2009 International Symposium on Low Power Electronics and Design, 2009

PCRAMsim: System-level performance, energy, and area modeling for Phase-Change RAM.
Proceedings of the 2009 International Conference on Computer-Aided Design, 2009

Resilience Challenges for Exascale Systems.
Proceedings of the 24th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, 2009

Introduction to the special issue on the 2007 workshop on design, analysis, and simulation of chip multiprocessors (dasCMP'07).
SIGARCH Comput. Archit. News, 2008

Architecting Efficient Interconnects for Large Caches with CACTI 6.0.
IEEE Micro, 2008

Implementing high availability memory with a duplication cache.
Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-41 2008), 2008

System implications of integrated photonics.
Proceedings of the 2008 International Symposium on Low Power Electronics and Design, 2008

Corona: System Implications of Emerging Nanophotonic Technology.
Proceedings of the 35th International Symposium on Computer Architecture (ISCA 2008), 2008

A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies.
Proceedings of the 35th International Symposium on Computer Architecture (ISCA 2008), 2008

A High-Speed Optical Multi-Drop Bus for Computer Interconnections.
Proceedings of the 16th Annual IEEE Symposium on High Performance Interconnects (HOTI 2008), 2008

A Nanophotonic Interconnect for High-Performance Many-Core Computation.
Proceedings of the 16th Annual IEEE Symposium on High Performance Interconnects (HOTI 2008), 2008

Introduction to the special issue on the 2006 workshop on design, analysis, and simulation of chip multiprocessors (dasCMP'06).
SIGARCH Comput. Archit. News, 2007

Isolation in Commodity Multicore Processors.
Computer, 2007

High-performance ethernet-based communications for future multi-core processors.
Proceedings of the ACM/IEEE Conference on High Performance Networking and Computing, 2007

Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0.
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-40 2007), 2007

Configurable isolation: building high availability systems with commodity multi-core processors.
Proceedings of the 34th International Symposium on Computer Architecture (ISCA 2007), 2007

Microprocessors in the era of terascale integration.
Proceedings of the 2007 Design, Automation and Test in Europe Conference and Exposition, 2007

Architecture - The potential energy efficiency of vector acceleration.
Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, 2006

Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers.
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-39 2006), 2006

Improving the performance and power efficiency of shared helpers in CMPs.
Proceedings of the 2006 International Conference on Compilers, 2006

Core architecture optimization for heterogeneous chip multiprocessors.
Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques (PACT 2006), 2006

Dynamically configurable shared CMP helper engines for improved performance.
SIGARCH Comput. Archit. News, 2005

Fast synchronization for chip multiprocessors.
SIGARCH Comput. Archit. News, 2005

Introduction to the special issue on the 2005 workshop on design, analysis, and simulation of chip multiprocessors (dasCMP'05).
SIGARCH Comput. Archit. News, 2005

Heterogeneous Chip Multiprocessors.
Computer, 2005

System-wide performance monitors and their application to the optimization of coherent memory accesses.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2005

Telepresence Systems With Automatic Preservation of User Head Height, Local Rotation, and Remote Translation.
Proceedings of the 2005 IEEE International Conference on Robotics and Automation, 2005

Enterprise IT Trends and Implications for Architecture Research.
Proceedings of the 11th International Conference on High-Performance Computer Architecture (HPCA-11 2005), 2005

BiReality: mutually-immersive telepresence.
Proceedings of the 12th ACM International Conference on Multimedia, 2004

Conjoined-Core Chip Multiprocessing.
Proceedings of the 37th Annual International Symposium on Microarchitecture (MICRO-37 2004), 2004

Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance.
Proceedings of the 31st International Symposium on Computer Architecture (ISCA 2004), 2004

A First Generation Mutually-Immersive Mobile Telepresence Surrogate with Automatic Backtracking.
Proceedings of the 2004 IEEE International Conference on Robotics and Automation, 2004

Region of interest editing of MPEG-2 video streams in the compressed domain.
Proceedings of the 2004 IEEE International Conference on Multimedia and Expo, 2004

The Future Evolution of High-Performance Microprocessors.
Proceedings of the High Performance Computing, 2004

Processor Power Reduction Via Single-ISA Heterogeneous Multi-Core Architectures.
IEEE Comput. Archit. Lett., 2003

Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction.
Proceedings of the 36th Annual International Symposium on Microarchitecture, 2003

The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays.
Proceedings of the 29th International Symposium on Computer Architecture (ISCA 2002), 2002

First steps towards mutually-immersive mobile telepresence.
Proceedings of the 2002 ACM on Computer supported cooperative work video program, 2002

First steps towards mutually-immersive mobile telepresence.
Proceedings of the CSCW 2002, 2002

Reconfigurable caches and their application to media processing.
Proceedings of the 27th International Symposium on Computer Architecture (ISCA 2000), 2000

Prefiltered Antialiased Lines Using Half-Plane Distance Functions.
Proceedings of the 2000 ACM SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, 2000

Implementing Neon: a 256-bit graphics accelerator.
IEEE Micro, 1999

Real products, real technology [Guest Editor's Introduction].
IEEE Micro, 1999

The Multicluster Architecture: Reducing Processor Cycle Time Through Partitioning.
Int. J. Parallel Program., 1999

Feline: Fast Elliptical Lines for Anisotropic Texture Mapping.
Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, 1999

Performance of Image and Video Processing with General-Purpose Processors and Media ISA Extensions.
Proceedings of the 26th Annual International Symposium on Computer Architecture, 1999

Z3: An Economical Hardware Technique for High-Quality Antialiasing and Transparency.
Proceedings of the 1999 ACM SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, 1999

Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers.
Proceedings of the 25 Years of the International Symposia on Computer Architecture (Selected Papers), 1998

Retrospective: Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers.
Proceedings of the 25 Years of the International Symposia on Computer Architecture (Selected Papers), 1998

Neon: A Single-Chip 3D Workstation Graphics Accelerator.
Proceedings of the 1998 ACM SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, Lisbon, Portugal, August 31, 1998

The Multicluster Architecture: Reducing Cycle Time Through Partitioning.
Proceedings of the Thirtieth Annual IEEE/ACM International Symposium on Microarchitecture, 1997

Complexity-Effective Superscalar Processors.
Proceedings of the 24th International Symposium on Computer Architecture, 1997

Memory-System Design Considerations for Dynamically-Scheduled Processors.
Proceedings of the 24th International Symposium on Computer Architecture, 1997

CACTI: an enhanced cache access and cycle time model.
IEEE J. Solid State Circuits, 1996

A speed, power, and supply noise evaluation of ECL driver circuits.
IEEE J. Solid State Circuits, 1996

Register File Design Considerations in Dynamically Scheduled Processors.
Proceedings of the Second International Symposium on High-Performance Computer Architecture, 1996

How Useful Are Non-Blocking Loads, Stream Buffers and Speculative Execution in Multiple Issue Processors?
Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture (HPCA 1995), 1995

Designing, packaging, and testing a 300-MHz, 115 W ECL microprocessor.
IEEE Micro, 1994

Tradeoffs in Two-Level On-Chip Caching.
Proceedings of the 21st Annual International Symposium on Computer Architecture. Chicago, 1994

Complexity/Performance Tradeoffs with Non-Blocking Loads.
Proceedings of the 21st Annual International Symposium on Computer Architecture. Chicago, 1994

A 300-MHz 115-W 32-b bipolar ECL microprocessor.
IEEE J. Solid State Circuits, November, 1993

Cache Write Policies and Performance.
Proceedings of the 20th Annual International Symposium on Computer Architecture, 1993

A Simulation Based Study of TLB Performance.
Proceedings of the 19th Annual International Symposium on Computer Architecture. Gold Coast, 1992

Computer Technology and Architecture: An Evolving Interaction.
Computer, 1991

Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers.
Proceedings of the 17th Annual International Symposium on Computer Architecture, 1990

A 20-MIPS sustained 32-bit CMOS microprocessor with high ratio of sustained to peak performance.
IEEE J. Solid State Circuits, October, 1989

The Nonuniform Distribution of Instruction-Level and Machine Parallelism and Its Effect on Performance.
IEEE Trans. Computers, 1989

Architectural and Organizational Tradeoffs in the Design of the MultiTitan CPU.
Proceedings of the 16th Annual International Symposium on Computer Architecture. Jerusalem, 1989

Integration and packaging plateaus of processor performance.
Proceedings of the Computer Design: VLSI in Computers and Processors, 1989

Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines.
Proceedings of ASPLOS-III, 1989

A Unified Vector/Scalar Floating-Point Architecture.
Proceedings of ASPLOS-III, 1989

Superscalar vs. superpipelined machines.
SIGARCH Comput. Archit. News, 1988

Timing Analysis and Performance Improvement of MOS VLSI Designs.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 1987

Derivation of Signal Flow Direction in MOS VLSI.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 1987

Timing analysis for nMOS VLSI.
Proceedings of the 20th Design Automation Conference, 1983

MIPS: A microprocessor architecture.
Proceedings of the 15th annual workshop on Microprogramming, 1982

Hardware/Software Tradeoffs for Increased Performance.
Proceedings of the Symposium on Architectural Support for Programming Languages and Operating Systems, 1982
