Tushar Krishna

ORCID: 0000-0001-5738-6942

According to our database, Tushar Krishna authored at least 154 papers between 2008 and 2024.

Bibliography

2024
Abstracting Sparse DNN Acceleration via Structured Sparse Tensor Decomposition.
CoRR, 2024

GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM.
CoRR, 2024

Algorithm-Hardware Co-Design of Distribution-Aware Logarithmic-Posit Encodings for Efficient DNN Inference.
CoRR, 2024

Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers.
CoRR, 2024

Towards Cognitive AI Systems: a Survey and Prospective on Neuro-Symbolic AI.
CoRR, 2024

2023
SPOCK: Reverse Packet Traversal for Deadlock Recovery.
IEEE Des. Test, December, 2023

On Continuing DNN Accelerator Architecture Scaling Using Tightly Coupled Compute-on-Memory 3-D ICs.
IEEE Trans. Very Large Scale Integr. Syst., October, 2023

STIFT: A Spatio-Temporal Integrated Folding Tree for Efficient Reductions in Flexible DNN Accelerators.
ACM J. Emerg. Technol. Comput. Syst., October, 2023

Introduction to the Special Issue on Next-Generation On-Chip and Off-Chip Communication Architectures for Edge, Cloud and HPC.
ACM J. Emerg. Technol. Comput. Syst., October, 2023

TNT: A Modular Approach to Traversing Physically Heterogeneous NOCs at Bare-wire Latency.
ACM Trans. Archit. Code Optim., September, 2023

Hardware-Software Co-Design for Real-Time Latency-Accuracy Navigation in Tiny Machine Learning Applications.
IEEE Micro, 2023

Subgraph Stationary Hardware-Software Inference Co-Design.
CoRR, 2023

Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces.
CoRR, 2023

TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Training.
CoRR, 2023

Perspectives on AI Architectures and Co-design for Earth System Predictability.
CoRR, 2023

Exploiting Inter-Operation Data Reuse in Scientific Applications using GOGETA.
CoRR, 2023

Mitigating Timing-Based NoC Side-Channel Attacks With LLC Remapping.
IEEE Comput. Archit. Lett., 2023

ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2023

Characterization of Data Compression in Datacenters.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2023

SNATCH: Stealing Neural Network Architecture from ML Accelerator in Intelligent Sensors.
Proceedings of the 2023 IEEE SENSORS, Vienna, Austria, October 29 - Nov. 1, 2023, 2023

VEGETA: Vertically-Integrated Extensions for Sparse/Dense GEMM Tile Acceleration on CPUs.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2023

AIrchitect: Automating Hardware Architecture and Mapping Optimization.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2023

Proteus: HLS-based NoC Generator and Simulator.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2023

Flexagon: A Multi-dataflow Sparse-Sparse Matrix Multiplication Accelerator for Efficient DNN Processing.
Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023

FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks.
Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023

Efficient Distributed Inference of Deep Neural Networks via Restructuring and Pruning.
Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022
Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication.
IEEE Trans. Parallel Distributed Syst., 2022

Guest Editorial: IEEE TC Special Issue: Hardware Acceleration of Machine Learning.
IEEE Trans. Computers, 2022

Marvel: A Data-Centric Approach for Mapping Deep Learning Operators on Spatial Accelerators.
ACM Trans. Archit. Code Optim., 2022

A Formalism of DNN Accelerator Flexibility.
Proc. ACM Meas. Anal. Comput. Syst., 2022

COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training.
CoRR, 2022

XRBench: An Extended Reality (XR) Machine Learning Benchmark Suite for the Metaverse.
CoRR, 2022

Training Recipe for N:M Structured Sparsity with Decaying Pruning Mask.
CoRR, 2022

DNNFuser: Generative Pre-Trained Transformer as a Generalized Mapper for Layer Fusion in DNN Accelerators.
CoRR, 2022

Enabling Flexibility for Sparse Tensor Acceleration via Heterogeneity.
CoRR, 2022

MicroEdge: a multi-tenant edge cluster system architecture for scalable camera processing.
Proceedings of the Middleware '22: 23rd International Middleware Conference, Quebec, QC, Canada, November 7, 2022

Understanding Data Compression in Warehouse-Scale Datacenter Services.
Proceedings of the International IEEE Symposium on Performance Analysis of Systems and Software, 2022

Themis: a network bandwidth-aware collective scheduling policy for distributed training of DL models.
Proceedings of the ISCA '22: The 49th Annual International Symposium on Computer Architecture, New York, New York, USA, June 18, 2022

Understanding the Design-Space of Sparse/Dense Multiphase GNN dataflows on Spatial Accelerators.
Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium, 2022

Demystifying Map Space Exploration for NPUs.
Proceedings of the IEEE International Symposium on Workload Characterization, 2022

MAGMA: An Optimization Framework for Mapping Multiple DNNs on Multiple Accelerator Cores.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2022

Stay in your Lane: A NoC with Low-overhead Multi-packet Bypassing.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2022

Impact of RoCE Congestion Control Policies on Distributed Training of DNNs.
Proceedings of the IEEE Symposium on High-Performance Interconnects, 2022

DiGamma: Domain-aware Genetic Algorithm for HW-Mapping Co-optimization for DNN Accelerators.
Proceedings of the 2022 Design, Automation & Test in Europe Conference & Exhibition, 2022

Self adaptive reconfigurable arrays (SARA): learning flexible GEMM accelerator configuration and mapping-space using ML.
Proceedings of the DAC '22: 59th ACM/IEEE Design Automation Conference, San Francisco, California, USA, July 10, 2022

2021
Clock Delivery Network Design and Analysis for Interposer-Based 2.5-D Heterogeneous Systems.
IEEE Trans. Very Large Scale Integr. Syst., 2021

Efficiently Solving Partial Differential Equations in a Partially Reconfigurable Specialized Hardware.
IEEE Trans. Computers, 2021

Exploring Multi-dimensional Hierarchical Network Topologies for Efficient Distributed Training of Trillion Parameter DL Models.
CoRR, 2021

AIRCHITECT: Learning Custom Architecture Design and Mapping Space.
CoRR, 2021

ATTACC the Quadratic Bottleneck of Attention Layers.
CoRR, 2021

Domain-specific Genetic Algorithm for Multi-tenant DNN Accelerator Scheduling.
CoRR, 2021

A Taxonomy for Classification and Comparison of Dataflows for GNN Accelerators.
CoRR, 2021

Self-Adaptive Reconfigurable Arrays (SARA): Using ML to Assist Scaling GEMM Acceleration.
CoRR, 2021

STONNE: Enabling Cycle-Level Microarchitectural Simulation for DNN Inference Accelerators.
IEEE Comput. Archit. Lett., 2021

Flexion: A Quantitative Metric for Flexibility in DNN Accelerators.
IEEE Comput. Archit. Lett., 2021

SEEC: stochastic escape express channel.
Proceedings of the International Conference for High Performance Computing, 2021

A novel network fabric for efficient spatio-temporal reduction in flexible DNN accelerators.
Proceedings of the NOCS '21: International Symposium on Networks-on-Chip, 2021

DUB: dynamic underclocking and bypassing in nocs for heterogeneous GPU workloads.
Proceedings of the NOCS '21: International Symposium on Networks-on-Chip, 2021

Technology-aware Router Architectures for On-Chip-Networks in Heterogeneous Technologies.
Proceedings of the NANOCOM '21: The Eighth Annual ACM International Conference on Nanoscale Computing and Communication, Virtual Event, Italy, September 7, 2021

Architecture, Dataflow and Physical Design Implications of 3D-ICs for DNN-Accelerators.
Proceedings of the 22nd International Symposium on Quality Electronic Design, 2021

E3: A HW/SW Co-design Neuroevolution Platform for Autonomous Learning in Edge Device.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2021

Enabling Compute-Communication Overlap in Distributed Deep Learning Training Platforms.
Proceedings of the 48th ACM/IEEE Annual International Symposium on Computer Architecture, 2021

Extending Sparse Tensor Accelerators to Support Multiple Compression Formats.
Proceedings of the 35th IEEE International Parallel and Distributed Processing Symposium, 2021

Heterogeneous Dataflow Accelerators for Multi-DNN Workloads.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2021

Pitstop: Enabling a Virtual Network Free Network-on-Chip.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2021

RASA: Efficient Register-Aware Systolic Array Matrix Engine for CPU.
Proceedings of the 58th ACM/IEEE Design Automation Conference, 2021

Bridging the Frequency Gap in Heterogeneous 3D SoCs through Technology-Specific NoC Router Architectures.
Proceedings of the ASPDAC '21: 26th Asia and South Pacific Design Automation Conference, 2021

Dataflow-Architecture Co-Design for 2.5D DNN Accelerators using Wireless Network-on-Package.
Proceedings of the ASPDAC '21: 26th Asia and South Pacific Design Automation Conference, 2021

Union: A Unified HW-SW Co-Design Ecosystem in MLIR for Evaluating Tensor Operations on Spatial Accelerators.
Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques, 2021

2020
Data Orchestration in Deep Learning Accelerators
Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, ISBN: 978-3-031-01767-4, 2020

Architecture, Chip, and Package Codesign Flow for Interposer-Based 2.5-D Chiplet Integration Enabling Heterogeneous IP Reuse.
IEEE Trans. Very Large Scale Integr. Syst., 2020

ECOTLB: Eventually Consistent TLBs.
ACM Trans. Archit. Code Optim., 2020

MAESTRO: A Data-Centric Approach to Understand Reuse, Performance, and Hardware Cost of DNN Mappings.
IEEE Micro, 2020

Restructuring, Pruning, and Adjustment of Deep Models for Parallel Distributed Inference.
CoRR, 2020

The gem5 Simulator: Version 20.0+.
CoRR, 2020

Efficient Communication Acceleration for Next-Gen Scale-up Deep Learning Training Platforms.
CoRR, 2020

STONNE: A Detailed Architectural Simulator for Flexible Neural Network Accelerators.
CoRR, 2020

Conditional Neural Architecture Search.
CoRR, 2020

Generative Design of Hardware-aware DNNs.
CoRR, 2020

MARVEL: A Decoupled Model-driven Approach for Efficiently Mapping Convolutions on Spatial DNN Accelerators.
CoRR, 2020

Statistical Array Allocation and Partitioning for Compute In-Memory Fabrics.
Proceedings of the VLSI-SoC: Design Trends, 2020

Breaking Barriers: Maximizing Array Utilization for Compute in-Memory Fabrics.
Proceedings of the 28th IFIP/IEEE International Conference on Very Large Scale Integration, 2020

ConfuciuX: Autonomous Hardware Resource Assignment for DNN Accelerators using Reinforcement Learning.
Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture, 2020

A Systematic Methodology for Characterizing Scalability of DNN Accelerators using SCALE-Sim.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2020

ASTRA-SIM: Enabling SW/HW Co-Design Exploration for Distributed DL Training Platforms.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2020

CLAN: Continuous Learning using Asynchronous Neuroevolution on Commodity Edge Devices.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2020

GAMMA: Automating the HW Mapping of DNN Models on Accelerators via Genetic Algorithm.
Proceedings of the IEEE/ACM International Conference On Computer Aided Design, 2020

SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2020

DRAIN: Deadlock Removal for Arbitrary Irregular Networks.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2020

ALRESCHA: A Lightweight Reconfigurable Sparse-Computation Accelerator.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2020

Scalable Distributed Training of Recommendation Models: An ASTRA-SIM + NS3 case-study with TCP/IP transport.
Proceedings of the IEEE Symposium on High-Performance Interconnects, 2020

Co-Exploration of Neural Architectures and Heterogeneous ASIC Accelerator Designs Targeting Multiple Tasks.
Proceedings of the 57th ACM/IEEE Design Automation Conference, 2020

Kite: A Family of Heterogeneous Interposer Topologies Enabled via Accurate Interconnect Modeling.
Proceedings of the 57th ACM/IEEE Design Automation Conference, 2020

2019
Synchronized Progress in Interconnection Networks (SPIN): A New Theory for Deadlock Freedom.
IEEE Micro, 2019

HERALD: Optimizing Heterogeneous DNN Accelerators for Edge Devices.
CoRR, 2019

BINDU: deadlock-freedom with one bubble in the network.
Proceedings of the 13th IEEE/ACM International Symposium on Networks-on-Chip, 2019

Reinforcement learning based interconnection routing for adaptive traffic optimization.
Proceedings of the 13th IEEE/ACM International Symposium on Networks-on-Chip, 2019

SWAP: Synchronized Weaving of Adjacent Packets for Network Deadlock Resolution.
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019

Understanding Reuse, Performance, and Hardware Cost of DNN Dataflow: A Data-Centric Approach.
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019

A communication-centric approach for designing flexible DNN accelerators.
Proceedings of the 12th International Workshop on Network on Chip Architectures, 2019

mRNA: Enabling Efficient Mapping Space Exploration for a Reconfigurable Neural Accelerator.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2019

Characterizing the Deployment of Deep Neural Networks on Commercial Edge Devices.
Proceedings of the IEEE International Symposium on Workload Characterization, 2019

Understanding the Impact of On-chip Communication on DNN Accelerator Performance.
Proceedings of the 26th IEEE International Conference on Electronics, Circuits and Systems, 2019

Scaling the Cascades: Interconnect-Aware FPGA Implementation of Machine Learning Problems.
Proceedings of the 29th International Conference on Field Programmable Logic and Applications, 2019

Architecture, Chip, and Package Co-design Flow for 2.5D IC Design Enabling Heterogeneous IP Reuse.
Proceedings of the 56th Annual Design Automation Conference 2019, 2019

2018
A Communication-Centric Approach for Designing Flexible DNN Accelerators.
IEEE Micro, 2018

SCALE-Sim: Systolic CNN Accelerator.
CoRR, 2018

MAESTRO: An Open-source Infrastructure for Modeling Dataflows within Deep Learning Accelerators.
CoRR, 2018

Brownian Bubble Router: Enabling Deadlock Freedom via Guaranteed Forward Progress.
Proceedings of the Twelfth IEEE/ACM International Symposium on Networks-on-Chip, 2018

Architecting a Secure Wireless Network-on-Chip.
Proceedings of the Twelfth IEEE/ACM International Symposium on Networks-on-Chip, 2018

GeneSys: Enabling Continuous Learning through Neural Network Evolution in Hardware.
Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, 2018

Scalable Distributed Last-Level TLBs Using Low-Latency Interconnects.
Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, 2018

Performance Implications of NoCs on 3D-Stacked Memories: Insights from the Hybrid Memory Cube.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2018

SEESAW: Using Superpages to Improve VIPT Caches.
Proceedings of the 45th ACM/IEEE Annual International Symposium on Computer Architecture, 2018

FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-Cost High-Performance Soft NoCs.
Proceedings of the 45th ACM/IEEE Annual International Symposium on Computer Architecture, 2018

Merge Network for a Non-Von Neumann Accumulate Accelerator in a 3D Chip.
Proceedings of the 2018 IEEE International Conference on Rebooting Computing, 2018

Spoofing Prevention via RF Power Profiling in Wireless Network-on-Chip.
Proceedings of the 3rd International Workshop on Advanced Interconnect Solutions and Technologies for Emerging Computing Systems, 2018

FastTrack: Exploiting Fast FPGA Wiring for Implementing NoC Shortcuts (Abstract Only).
Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2018

Optimizing the data placement and transformation for multi-bank CGRA computing system.
Proceedings of the 2018 Design, Automation & Test in Europe Conference & Exhibition, 2018

MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects.
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, 2018

LATR: Lazy Translation Coherence.
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, 2018

2017
On-Chip Networks, Second Edition
Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, ISBN: 978-3-031-01755-1, 2017

Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks.
IEEE J. Solid State Circuits, 2017

FASHION: Fault-Aware Self-Healing Intelligent On-chip Network.
CoRR, 2017

VESPA: VIPT Enhancements for Superpage Accesses.
CoRR, 2017

Rethinking NoCs for Spatial Neural Network Accelerators.
Proceedings of the Eleventh IEEE/ACM International Symposium on Networks-on-Chip, 2017

Adaptive Manycore Architectures for Big Data Computing.
Proceedings of the Eleventh IEEE/ACM International Symposium on Networks-on-Chip, 2017

Lightweight Emulation of Virtual Channels using Swaps.
Proceedings of the 10th International Workshop on Network on Chip Architectures, 2017

OpenSMART: Single-cycle multi-hop NoC generator in BSV and Chisel.
Proceedings of the 2017 IEEE International Symposium on Performance Analysis of Systems and Software, 2017

A case for low frequency single cycle multi hop NoCs for energy efficiency and high performance.
Proceedings of the 2017 IEEE/ACM International Conference on Computer-Aided Design, 2017

Static Bubble: A Framework for Deadlock-Free Irregular On-chip Topologies.
Proceedings of the 2017 IEEE International Symposium on High Performance Computer Architecture, 2017

Automatic place-and-route of emerging LED-driven wires within a monolithically-integrated CMOS-III-V process.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2017

2016
14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks.
Proceedings of the 2016 IEEE International Solid-State Circuits Conference, 2016

2015
Efficient Control and Communication Paradigms for Coarse-Grained Spatial Architectures.
ACM Trans. Comput. Syst., 2015

2014
Enabling dedicated single-cycle connections over a shared network-on-chip.
PhD thesis, 2014

SMART: Single-Cycle Multihop Traversals over a Shared Network on Chip.
IEEE Micro, 2014

Single-cycle collective communication over a shared network fabric.
Proceedings of the Eighth IEEE/ACM International Symposium on Networks-on-Chip, 2014

SCORPIO: A 36-core research chip demonstrating snoopy coherence on a scalable mesh NoC with in-network ordering.
Proceedings of the ACM/IEEE 41st International Symposium on Computer Architecture, 2014

SCORPIO: 36-core shared memory processor demonstrating snoopy coherence on a mesh interconnect.
Proceedings of the 2014 IEEE Hot Chips 26 Symposium (HCS), 2014

Locality-oblivious cache organization leveraging single-cycle multi-hop NoCs.
Proceedings of the Architectural Support for Programming Languages and Operating Systems, 2014

2013
SWIFT: A Low-Power Network-On-Chip Implementing the Token Flow Control Router Architecture With Swing-Reduced Interconnects.
IEEE Trans. Very Large Scale Integr. Syst., 2013

Single-Cycle Multihop Asynchronous Repeated Traversal: A SMART Future for Reconfigurable On-Chip Networks.
Computer, 2013

Breaking the on-chip latency barrier using SMART.
Proceedings of the 19th IEEE International Symposium on High Performance Computer Architecture, 2013

SMART: a single-cycle reconfigurable NoC for SoC applications.
Proceedings of the Design, Automation and Test in Europe, 2013

2012
Approaching the theoretical limits of a mesh NoC with a 16-node chip prototype in 45nm SOI.
Proceedings of the 49th Annual Design Automation Conference 2012, 2012

2011
The gem5 simulator.
SIGARCH Comput. Archit. News, 2011

Towards the ideal on-chip fabric for 1-to-many and many-to-1 communication.
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011

A low-swing crossbar and link generator for low-power networks-on-chip.
Proceedings of the 2011 IEEE/ACM International Conference on Computer-Aided Design, 2011

2010
Physical vs. Virtual Express Topologies with Low-Swing Links for Future Many-Core NoCs.
Proceedings of the NOCS 2010, 2010

SWIFT: A SWing-reduced interconnect for a Token-based Network-on-Chip in 90nm CMOS.
Proceedings of the 28th International Conference on Computer Design, 2010

2009
Express Virtual Channels with Capacitively Driven Global Links.
IEEE Micro, 2009

GARNET: A detailed on-chip network model inside a full-system simulator.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2009

2008
Texture filter memory: a power-efficient and scalable texture memory architecture for mobile graphics processors.
Proceedings of the 2008 International Conference on Computer-Aided Design, 2008

NoC with Near-Ideal Express Virtual Channels Using Global-Line Communication.
Proceedings of the 16th Annual IEEE Symposium on High Performance Interconnects (HOTI 2008), 2008
