Scott A. Mahlke

Orcid: 0000-0002-0438-0616

Affiliations:
  • University of Michigan, Ann Arbor, MI, USA


According to our database, Scott A. Mahlke authored at least 233 papers between 1991 and 2023.

Awards

ACM Fellow

ACM Fellow 2020, "For contributions in compiler code generation for instruction level parallelism, and customized microprocessor architectures".

IEEE Fellow

IEEE Fellow 2015, "For contributions to compiler code generation and automatic processor customization".

Bibliography

2023
BitSET: Bit-Serial Early Termination for Computation Reduction in Convolutional Neural Networks.
ACM Trans. Embed. Comput. Syst., October, 2023

Vector-Processing for Mobile Devices: Benchmark and Analysis.
Proceedings of the IEEE International Symposium on Workload Characterization, 2023

2022
Multi-Layer In-Memory Processing.
Proceedings of the 55th IEEE/ACM International Symposium on Microarchitecture, 2022

AVMaestro: A Centralized Policy Enforcement Framework for Safe Autonomous-driving Environments.
Proceedings of the 2022 IEEE Intelligent Vehicles Symposium, 2022

SoftFusion: A Low-Cost Approach to Enhance Reliability of Object Detection Applications.
Proceedings of the IEEE 40th International Conference on Computer Design, 2022

SRTuner: Effective Compiler Optimization Customization by Exposing Synergistic Relations.
Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2022

Loner: utilizing the CPU vector datapath to process scalar integer data.
Proceedings of the CC '22: 31st ACM SIGPLAN International Conference on Compiler Construction, Seoul, South Korea, April 2, 2022

2021
A Systematic Framework to Identify Violations of Scenario-dependent Driving Rules in Autonomous Vehicle Software.
Proc. ACM Meas. Anal. Comput. Syst., 2021

Prodigy: Improving the Memory Latency of Data-Indirect Irregular Workloads Using Hardware-Software Co-Design.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2021

2020
Path Sensitive Signatures for Control Flow Error Detection.
Proceedings of the 21st ACM SIGPLAN/SIGBED International Conference on Languages, 2020

AVGuardian: Detecting and Mitigating Publish-Subscribe Overprivilege for Autonomous Vehicle Systems.
Proceedings of the IEEE European Symposium on Security and Privacy, 2020

PolygraphMR: Enhancing the Reliability and Dependability of CNNs.
Proceedings of the 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2020

SIEVE: Speculative Inference on the Edge with Versatile Exportation.
Proceedings of the 57th ACM/IEEE Design Automation Conference, 2020

Low-cost prediction-based fault protection strategy.
Proceedings of the CGO '20: 18th ACM/IEEE International Symposium on Code Generation and Optimization, 2020

2019
TF-Net: Deploying Sub-Byte Deep Neural Networks on Microcontrollers.
ACM Trans. Embed. Comput. Syst., 2019

Multi-objective Exploration for Practical Optimization Decisions in Binary Translation.
ACM Trans. Embed. Comput. Syst., 2019

Characterization of Unnecessary Computations in Web Applications.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2019

Duality cache for data parallel acceleration.
Proceedings of the 46th International Symposium on Computer Architecture, 2019

POSTER: Pairing Up CNNs for High Throughput Deep Learning.
Proceedings of the 28th International Conference on Parallel Architectures and Compilation Techniques, 2019

2018
Scratch That (But Cache This): A Hybrid Register Cache/Scratchpad for GPUs.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2018

Iterative Modulo Scheduling.
IEEE Micro, 2018

Rethinking Numerical Representations for Deep Neural Networks.
CoRR, 2018

Sculptor: Flexible Approximation with Selective Dynamic Loop Perforation.
Proceedings of the 32nd International Conference on Supercomputing, 2018

Low Cost Transient Fault Protection Using Loop Output Prediction.
Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, 2018

In-Memory Data Parallel Processor.
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, 2018

2017
Mirage cores: the illusion of many out-of-order cores using in-order hardware.
Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 2017

Regless: just-in-time operand staging for GPUs.
Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 2017

DeftNN: addressing bottlenecks for DNN execution on GPUs via synapse vector elimination and near-compute data fission.
Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 2017

Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism.
Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017

Dynamic Resource Management for Efficient Utilization of Multitasking GPUs.
Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, 2017

In-memory Data Flow Processor.
Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques, 2017

2016
Exploring Fine-Grained Heterogeneity with Composite Cores.
IEEE Trans. Computers, 2016

Quality Control for Approximate Accelerators by Error Prediction.
IEEE Des. Test, 2016

A bypass first policy for energy-efficient last level caches.
Proceedings of the International Conference on Embedded Computer Systems: Architectures, 2016

Input responsiveness: using canary inputs to dynamically steer approximation.
Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2016

Concise loads and stores: The case for an asymmetric compute-memory architecture for approximation.
Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016

BugMD: automatic mismatch diagnosis for bug triaging.
Proceedings of the 35th International Conference on Computer-Aided Design, 2016

2015
Using Graphics Processing Units in an LTE Base Station.
J. Signal Process. Syst., 2015

SKMD: Single Kernel on Multiple Devices for Transparent CPU-GPU Collaboration.
ACM Trans. Comput. Syst., 2015

Tango: Accelerating Mobile Applications through Flip-Flop Replication.
GetMobile Mob. Comput. Commun., 2015

ELF: maximizing memory-level parallelism for GPUs with coordinated warp and fetch scheduling.
Proceedings of the International Conference for High Performance Computing, 2015

Accelerating Mobile Applications through Flip-Flop Replication.
Proceedings of the 13th Annual International Conference on Mobile Systems, 2015

DynaMOS: dynamic schedule migration for heterogeneous cores.
Proceedings of the 48th International Symposium on Microarchitecture, 2015

WarpPool: sharing requests with inter-warp coalescing for throughput processors.
Proceedings of the 48th International Symposium on Microarchitecture, 2015

Rumba: an online quality management system for approximate computing.
Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015

Accelerating asynchronous programs through event sneak peek.
Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015

Mascar: Speeding up GPU warps by reducing memory pitstops.
Proceedings of the 21st IEEE International Symposium on High Performance Computer Architecture, 2015

Chimera: Collaborative Preemption for Multitasking on a Shared GPU.
Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, 2015

Fine Grain Cache Partitioning Using Per-Instruction Working Blocks.
Proceedings of the 2015 International Conference on Parallel Architectures and Compilation, 2015

Orchestrating Multiple Data-Parallel Kernels on Multiple Devices.
Proceedings of the 2015 International Conference on Parallel Architectures and Compilation, 2015

2014
Scaling Performance via Self-Tuning Approximation for Graphics Engines.
ACM Trans. Comput. Syst., 2014

Leveraging GPUs using cooperative loop speculation.
ACM Trans. Archit. Code Optim., 2014

Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution.
Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014

Harnessing Soft Computations for Low-Budget Fault Tolerance.
Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014

Embracing heterogeneity with dynamic core boosting.
Proceedings of the Computing Frontiers Conference, CF'14, 2014

Paraprox: pattern-based approximation for data parallel applications.
Proceedings of the Architectural Support for Programming Languages and Operating Systems, 2014

Heterogeneous microarchitectures trump voltage scaling for low-power cores.
Proceedings of the International Conference on Parallel Architectures and Compilation, 2014

VAST: the illusion of a large memory space for GPUs.
Proceedings of the International Conference on Parallel Architectures and Compilation, 2014

D²MA: accelerating coarse-grained data transfer for GPUs.
Proceedings of the International Conference on Parallel Architectures and Compilation, 2014

EFetch: optimizing instruction fetch for event-driven web applications.
Proceedings of the International Conference on Parallel Architectures and Compilation, 2014

2013
Eliminating Concurrency Bugs in Multithreaded Software: A New Approach Based on Discrete-Event Control.
IEEE Trans. Control. Syst. Technol., 2013

Optimal Liveness-Enforcing Control for a Class of Petri Nets Arising in Multithreaded Software.
IEEE Trans. Autom. Control., 2013

Concurrency bugs in multithreaded software: modeling and analysis using Petri nets.
Discret. Event Dyn. Syst., 2013

Architecting an LTE base station with graphics processing units.
Proceedings of the IEEE Workshop on Signal Processing Systems, 2013

SAGE: self-tuning approximation for graphics engines.
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013

Trace based phase prediction for tightly-coupled heterogeneous cores.
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013

Low cost control flow protection using abstract control signatures.
Proceedings of the SIGPLAN/SIGBED Conference on Languages, 2013

Parallelization techniques for implementing trellis algorithms on graphics processors.
Proceedings of the 2013 IEEE International Symposium on Circuits and Systems (ISCAS2013), 2013

WiBench: An open source kernel suite for benchmarking wireless systems.
Proceedings of the IEEE International Symposium on Workload Characterization, 2013

Illusionist: Transforming lightweight cores into aggressive cores on demand.
Proceedings of the 19th IEEE International Symposium on High Performance Computer Architecture, 2013

Efficient execution of augmented reality applications on mobile programmable accelerators.
Proceedings of the 2013 International Conference on Field-Programmable Technology, 2013

Instant profiling: Instrumentation sampling for profiling datacenter applications.
Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization, 2013

Practical lock/unlock pairing for concurrent programs.
Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization, 2013

APOGEE: Adaptive prefetching on GPUs for energy efficiency.
Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, 2013

Transparent CPU-GPU collaboration for data-parallel kernels on heterogeneous systems.
Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, 2013

2012
A Customized Processor for Energy Efficient Scientific Computing.
IEEE Trans. Computers, 2012

Adaptive input-aware compilation for graphics engines.
Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2012

COMET: Code Offload by Migrating Execution Transparently.
Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation, 2012

Libra: Tailoring SIMD Execution Using Heterogeneous Hardware and Dynamic Configurability.
Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012

Composite Cores: Pushing Heterogeneity Into a Core.
Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012

Dynamic acceleration of multithreaded program critical paths in near-threshold systems.
Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012

Efficient soft error protection for commodity embedded microprocessors using profile information.
Proceedings of the SIGPLAN/SIGBED Conference on Languages, 2012

Efficient performance scaling of future CGRAs for mobile applications.
Proceedings of the 2012 International Conference on Field-Programmable Technology, 2012

Process variation in near-threshold wide SIMD architectures.
Proceedings of the 49th Annual Design Automation Conference, 2012

Runtime asynchronous fault tolerance via speculation.
Proceedings of the 10th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2012

Automatic speculative DOALL for clusters.
Proceedings of the 10th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2012

When less is more (LIMO): controlled parallelism for improved efficiency.
Proceedings of the 15th International Conference on Compilers, 2012

Paragon: collaborative speculative loop execution on GPU and CPU.
Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units, 2012

SIMD defragmenter: efficient ILP realization on data-parallel architectures.
Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, 2012

2011
Analyzing the Next Generation Software Defined Radio for Future Architectures.
J. Signal Process. Syst., 2011

StageNet: A Reconfigurable Fabric for Constructing Dependable CMPs.
IEEE Trans. Computers, 2011

Maximizing Spare Utilization by Virtually Reorganizing Faulty Cache Lines.
IEEE Trans. Computers, 2011

Bundled execution of recurring traces for energy-efficient general purpose processing.
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011

Encore: low-cost, fine-grained transient fault recovery.
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011

Dynamic parallelization of JavaScript applications using an ultra-lightweight speculation mechanism.
Proceedings of the 17th International Conference on High-Performance Computer Architecture (HPCA-17 2011), 2011

Archipelago: A polymorphic cache design for enabling robust near-threshold operation.
Proceedings of the 17th International Conference on High-Performance Computer Architecture (HPCA-17 2011), 2011

Dynamically accelerating client-side web applications through decoupled execution.
Proceedings of the CGO 2011, 2011

Deadlock-avoidance control of multithreaded software: An efficient siphon-based algorithm for Gadara petri nets.
Proceedings of the 50th IEEE Conference on Decision and Control and European Control Conference, 2011

Sponge: portable stream programming on graphics engines.
Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems, 2011

PEPSC: A Power-Efficient Processor for Scientific Computing.
Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, 2011

2010
AnySP: Anytime Anywhere Anyway Signal Processing.
IEEE Micro, 2010

Putting Faulty Cores to Work.
IEEE Micro, 2010

Mobile Supercomputers for the Next-Generation Cell Phone.
Computer, 2010

Supervisory control of software execution for failure avoidance: Experience from the Gadara project.
Proceedings of the 10th International Workshop on Discrete Event Systems, 2010

Erasing Core Boundaries for Robust and Configurable Performance.
Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010

Diet SODA: a power-efficient processor for digital cameras.
Proceedings of the 2010 International Symposium on Low Power Electronics and Design, 2010

Necromancer: enhancing system throughput by animating dead cores.
Proceedings of the 37th International Symposium on Computer Architecture (ISCA 2010), 2010

Maestro: Orchestrating Lifetime Reliability in Chip Multiprocessors.
Proceedings of the High Performance Embedded Architectures and Compilers, 2010

StageWeb: Interweaving pipeline stages into a wearout and variation tolerant CMP fabric.
Proceedings of the 2010 IEEE/IFIP International Conference on Dependable Systems and Networks, 2010

Compilation techniques for CGRAs: exploring all parallelization approaches.
Proceedings of the 8th International Conference on Hardware/Software Codesign and System Synthesis, 2010

Synthesis of maximally-permissive liveness-enforcing control policies for Gadara petri nets.
Proceedings of the 49th IEEE Conference on Decision and Control, 2010

Resource recycling: putting idle resources to work on a composable accelerator.
Proceedings of the 2010 International Conference on Compilers, 2010

Mighty-morphing power-SIMD.
Proceedings of the 2010 International Conference on Compilers, 2010

MacroSS: macro-SIMDization of streaming applications.
Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, 2010

Shoestring: probabilistic soft error reliability on the cheap.
Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, 2010

CoreGenesis: erasing core boundaries for robust and configurable performance.
Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, 2010

MEDICS: ultra-portable processing for medical image reconstruction.
Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, 2010

2009
Multicore compilation strategies and challenges.
IEEE Signal Process. Mag., 2009

Eliminating Concurrency Bugs with Control Engineering.
Computer, 2009

A dataflow-centric approach to design low power control paths in CGRAs.
Proceedings of the IEEE 7th Symposium on Application Specific Processors, 2009

Power-efficient medical image processing using PUMA.
Proceedings of the IEEE 7th Symposium on Application Specific Processors, 2009

Parade: A versatile parallel architecture for accelerating pulse train clustering.
Proceedings of the IEEE 7th Symposium on Application Specific Processors, 2009

Customizing wide-SIMD architectures for H.264.
Proceedings of the 2009 International Conference on Embedded Computer Systems: Architectures, 2009

The theory of deadlock avoidance via discrete control.
Proceedings of the 36th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2009

Parallelizing sequential applications on commodity hardware using a low-cost software transactional memory.
Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation, 2009

Polymorphic pipeline array: a flexible multicore accelerator with virtualized execution for mobile multimedia applications.
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42 2009), 2009

ZerehCache: armoring cache architectures in high defect density technologies.
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42 2009), 2009

Recurrence cycle aware modulo scheduling for coarse-grained reconfigurable architectures.
Proceedings of the 2009 ACM SIGPLAN/SIGBED conference on Languages, 2009

Enabling ultra low voltage system operation by tolerating on-chip cache failures.
Proceedings of the 2009 International Symposium on Low Power Electronics and Design, 2009

Adaptive online testing for efficient hard fault detection.
Proceedings of the 27th International Conference on Computer Design, 2009

Bridging the computation gap between programmable processors and hardwired accelerators.
Proceedings of the 15th International Conference on High-Performance Computer Architecture (HPCA-15 2009), 2009

Stream Compilation for Real-Time Embedded Multicore Systems.
Proceedings of the CGO 2009, 2009

Gadara nets: Modeling and analyzing lock allocation for deadlock avoidance in multithreaded software.
Proceedings of the 48th IEEE Conference on Decision and Control, 2009

CGRA express: accelerating execution using dynamic operation fusion.
Proceedings of the 2009 International Conference on Compilers, 2009

Maximally permissive deadlock avoidance for multithreaded computer programs (Extended abstract).
Proceedings of the IEEE Conference on Automation Science and Engineering, 2009

Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures.
Proceedings of the PACT 2009, 2009

2008
Reliable Systems on Unreliable Fabrics.
IEEE Des. Test Comput., 2008

A parameterized dataflow language extension for embedded streaming systems.
Proceedings of the 2008 International Conference on Embedded Computer Systems: Architectures, 2008

Orchestrating the execution of stream programs on multicore platforms.
Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, 2008

Gadara: Dynamic Deadlock Avoidance for Multithreaded Programs.
Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation, 2008

From SODA to scotch: The evolution of a wireless baseband processor.
Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-41 2008), 2008

The StageNet fabric for constructing resilient multicore systems.
Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-41 2008), 2008

VEAL: Virtualized Execution Accelerator for Loops.
Proceedings of the 35th International Symposium on Computer Architecture (ISCA 2008), 2008

Analyzing the scalability of SIMD for the next generation software defined radio.
Proceedings of the IEEE International Conference on Acoustics, 2008

Uncovering hidden loop level parallelism in sequential applications.
Proceedings of the 14th International Conference on High-Performance Computer Architecture (HPCA-14 2008), 2008

DVFS in loop accelerators using BLADES.
Proceedings of the 45th Design Automation Conference, 2008

Modulo scheduling for highly customized datapaths to increase hardware reusability.
Proceedings of the Sixth International Symposium on Code Generation and Optimization (CGO 2008), 2008

Optimus: efficient realization of streaming applications on FPGAs.
Proceedings of the 2008 International Conference on Compilers, 2008

StageNetSlice: a reconfigurable microarchitecture building block for resilient CMP systems.
Proceedings of the 2008 International Conference on Compilers, 2008

Edge-centric modulo scheduling for coarse-grained reconfigurable architectures.
Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, 2008

2007
Architecting a reliable CMP switch architecture.
ACM Trans. Archit. Code Optim., 2007

SODA: A High-Performance DSP Architecture for Software-Defined Radio.
IEEE Micro, 2007

Reliability: Fallacy or Reality?
IEEE Micro, 2007

The Next Generation Challenge for Software Defined Radio.
Proceedings of the Embedded Computer Systems: Architectures, 2007

Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures.
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-40 2007), 2007

Self-calibrating Online Wearout Detection.
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-40 2007), 2007

Compiler-managed partitioned data caches for low power.
Proceedings of the 2007 ACM SIGPLAN/SIGBED Conference on Languages, 2007

Code and data partitioning for fine-grain parallelism.
Proceedings of the 2007 ACM SIGPLAN/SIGBED Conference on Languages, 2007

Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-thread Applications.
Proceedings of the 13th International Conference on High-Performance Computer Architecture (HPCA-13 2007), 2007

Liquid SIMD: Abstracting SIMD Hardware using Lightweight Dynamic Mapping.
Proceedings of the 13th International Conference on High-Performance Computer Architecture (HPCA-13 2007), 2007

Exploiting Narrow Accelerators with Data-Centric Subgraph Mapping.
Proceedings of the Fifth International Symposium on Code Generation and Optimization (CGO 2007), 2007

Hierarchical coarse-grained stream compilation for software defined radio.
Proceedings of the 2007 International Conference on Compilers, 2007

2006
Design and Implementation of Turbo Decoders for Software Defined Radio.
Proceedings of the IEEE Workshop on Signal Processing Systems, 2006

SODA: A Low-power Architecture For Software Radio.
Proceedings of the 33rd International Symposium on Computer Architecture (ISCA 2006), 2006

BulletProof: a defect-tolerant CMP switch architecture.
Proceedings of the 12th International Symposium on High-Performance Computer Architecture, 2006

Streamroller: automatic synthesis of prescribed throughput accelerator pipelines.
Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis, 2006

Increasing hardware efficiency with multifunction loop accelerators.
Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis, 2006

Compiler-directed Data Partitioning for Multicluster Processors.
Proceedings of the Fourth IEEE/ACM International Symposium on Code Generation and Optimization (CGO 2006), 2006

Modulo graph embedding: mapping applications onto coarse-grained reconfigurable architectures.
Proceedings of the 2006 International Conference on Compilers, 2006

Scalable subgraph mapping for acyclic computation accelerators.
Proceedings of the 2006 International Conference on Compilers, 2006

Cost-efficient soft error protection for embedded microprocessors.
Proceedings of the 2006 International Conference on Compilers, 2006

2005
Partitioning Variables across Register Windows to Reduce Spill Code in a Low-Power Processor.
IEEE Trans. Computers, 2005

Automated Custom Instruction Generation for Domain-Specific Processor Acceleration.
IEEE Trans. Computers, 2005

Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System.
Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-38 2005), 2005

An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors.
Proceedings of the 32nd International Symposium on Computer Architecture (ISCA 2005), 2005

Software Defined Radio - A High Performance Embedded Challenge.
Proceedings of the High Performance Embedded Architectures and Compilers, 2005

Compiler Managed Dynamic Instruction Placement in a Low-Power Code Cache.
Proceedings of the 3rd IEEE / ACM International Symposium on Code Generation and Optimization (CGO 2005), 2005

Exploring the design space of LUT-based transparent accelerators.
Proceedings of the 2005 International Conference on Compilers, 2005

A Distributed Control Path Architecture for VLIW Processors.
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT 2005), 2005

2004
Cost-Sensitive Partitioning in an Architecture Synthesis System for Multicluster Processors.
IEEE Micro, 2004

Mobile Supercomputers.
Computer, 2004

Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization.
Proceedings of the 37th Annual International Symposium on Microarchitecture (MICRO-37 2004), 2004

Trimaran: An Infrastructure for Research in Instruction-Level Parallelism.
Proceedings of the Languages and Compilers for High Performance Computing, 2004

Memory system design space exploration for low-power, real-time speech recognition.
Proceedings of the 2nd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, 2004

Probabilistic Predicate-Aware Modulo Scheduling.
Proceedings of the 2nd IEEE / ACM International Symposium on Code Generation and Optimization (CGO 2004), 2004

FLASH: Foresighted Latency-Aware Scheduling Heuristic for Processors with Customized Datapaths.
Proceedings of the 2nd IEEE / ACM International Symposium on Code Generation and Optimization (CGO 2004), 2004

Automatic Synthesis of Customized Local Memories for Multicluster Application Accelerators.
Proceedings of the 15th IEEE International Conference on Application-Specific Systems, 2004

2003
Automatic Design of Application Specific Instruction Set Extensions Through Dataflow Graph Exploration.
Int. J. Parallel Program., 2003

Region-based hierarchical operation partitioning for multicluster processors.
Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, 2003

Processor Acceleration Through Automated Instruction Set Customization.
Proceedings of the 36th Annual International Symposium on Microarchitecture, 2003

Predicate-Aware Scheduling: A Technique for Reducing Resource Constraints.
Proceedings of the 1st IEEE / ACM International Symposium on Code Generation and Optimization (CGO 2003), 2003

Increasing the number of effective registers in a low-power processor using a windowed register file.
Proceedings of the International Conference on Compilers, 2003

Architectural optimizations for low-power, real-time speech recognition.
Proceedings of the International Conference on Compilers, 2003

Systematic Register Bypass Customization for Application-Specific Processors.
Proceedings of the 14th IEEE International Conference on Application-Specific Systems, 2003

2002
PICO-NPA: High-Level Synthesis of Nonprogrammable Hardware Accelerators.
J. VLSI Signal Process., 2002

2001
Bitwidth cognizant architecture synthesis of custom hardware accelerators.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2001

2000
Code size minimization and retargetable assembly for custom EPIC and VLIW instruction formats.
ACM Trans. Design Autom. Electr. Syst., 2000

High-Level Synthesis of Nonprogrammable Hardware Accelerators.
Proceedings of the 12th IEEE International Conference on Application-Specific Systems, 2000

1999
The Partial Reverse If-Conversion Framework for Balancing Control Flow and Predication.
Int. J. Parallel Program., 1999

Control CPR: A Branch Height Reduction Optimization for EPIC Architectures.
Proceedings of the 1999 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 1999

Automatic and Efficient Evaluation of Memory Hierarchies for Embedded Systems.
Proceedings of the 32nd Annual IEEE/ACM International Symposium on Microarchitecture, 1999

The Program Decision Logic Approach to Predicated Execution.
Proceedings of the 26th Annual International Symposium on Computer Architecture, 1999

1998
IMPACT: An Architectural Framework for Multiple-Instruction-Issue Processors.
Proceedings of the 25 Years of the International Symposia on Computer Architecture (Selected Papers), 1998

Integrated Predicated and Speculative Execution in the IMPACT EPIC Architecture.
Proceedings of the 25th Annual International Symposium on Computer Architecture, 1998

1997
Exploiting Instruction Level Parallelism in the Presence of Conditional Branches.
PhD thesis, 1997

A Framework for Balancing Control Flow and Predication.
Proceedings of the Thirtieth Annual IEEE/ACM International Symposium on Microarchitecture, 1997

1996
Compiler Synthesized Dynamic Branch Prediction.
Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture, 1996

1995
Three Architectural Models for Compiler-Controlled Speculative Execution.
IEEE Trans. Computers, 1995

The Importance of Prepass Code Scheduling for Superscalar and Superpipelined Processors.
IEEE Trans. Computers, 1995

Compiler technology for future microprocessors.
Proc. IEEE, 1995

A Comparison of Full and Partial Predicated Execution Support for ILP Processors.
Proceedings of the 22nd Annual International Symposium on Computer Architecture, 1995

A study of the effects of compiler-controlled speculation on instruction and data caches.
Proceedings of the 28th Annual Hawaii International Conference on System Sciences (HICSS-28), 1995

1994
Profile-assisted instruction scheduling.
Int. J. Parallel Program., 1994

Characterizing the impact of predicated execution on branch prediction.
Proceedings of the 27th Annual International Symposium on Microarchitecture, San Jose, California, USA, November 30, 1994

Dynamic Memory Disambiguation Using the Memory Conflict Buffer.
Proceedings of ASPLOS-VI, 1994

1993
Sentinel Scheduling for VLIW and Superscalar Processors.
ACM Trans. Comput. Syst., 1993

The superblock: An effective technique for VLIW and superscalar compilation.
J. Supercomput., 1993

Reverse If-Conversion.
Proceedings of the ACM SIGPLAN'93 Conference on Programming Language Design and Implementation (PLDI), 1993

Superblock formation using static program analysis.
Proceedings of the 26th Annual International Symposium on Microarchitecture, 1993

Speculative execution exception recovery using write-back suppression.
Proceedings of the 26th Annual International Symposium on Microarchitecture, 1993

Register Connection: A New Approach to Adding Registers into Instruction Set Architectures.
Proceedings of the 20th Annual International Symposium on Computer Architecture, 1993

1992
Profile-guided Automatic Inline Expansion for C Programs.
Softw. Pract. Exp., 1992

Compiler Code Transformations for Superscalar-Based High Performance Systems.
Proceedings of Supercomputing '92, 1992

Effective compiler support for predicated execution using the hyperblock.
Proceedings of the 25th Annual International Symposium on Microarchitecture, 1992

An efficient architecture for loop based data preloading.
Proceedings of the 25th Annual International Symposium on Microarchitecture, 1992

Using Profile Information to Assist Advanced Compiler Optimization and Scheduling.
Proceedings of the Languages and Compilers for Parallel Computing, 1992

Tolerating data access latency with register preloading.
Proceedings of the 6th international conference on Supercomputing, 1992

Tolerating First Level Memory Access Latency in High-Performance Systems.
Proceedings of the 1992 International Conference on Parallel Processing, 1992

Sentinel Scheduling for VLIW and Superscalar Processors.
Proceedings of ASPLOS-V, 1992

1991
Using Profile Information to Assist Classic Code Optimizations.
Softw. Pract. Exp., 1991

Data Access Microarchitectures for Superscalar Processors with Compiler-Assisted Data Prefetching.
Proceedings of the 24th Annual IEEE/ACM International Symposium on Microarchitecture, 1991

Comparing Static and Dynamic Code Scheduling for Multiple-Instruction-Issue Processors.
Proceedings of the 24th Annual IEEE/ACM International Symposium on Microarchitecture, 1991

The Effect of Compiler Optimizations on Available Parallelism in Scalar Programs.
Proceedings of the International Conference on Parallel Processing, 1991

