Wen-Mei W. Hwu

According to our database, Wen-Mei W. Hwu
  • authored at least 234 papers between 1985 and 2018.
  • has a "Dijkstra number" of four.

Bibliography

2018
Iterative Modulo Scheduling.
IEEE Micro, 2018

High-throughput Ant Colony Optimization on graphics processing units.
J. Parallel Distrib. Comput., 2018

Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts.
CoRR, 2018

2017
SAVI objects: sharing and virtuality incorporated.
PACMPL, 2017

Heterogeneous Computing Meets Near-Memory Acceleration and High-Level Synthesis in the Post-Moore Era.
IEEE Micro, 2017

Collaborative Computing for Heterogeneous Integrated Systems.
Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering, 2017

Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts.
Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 2017

Chai: Collaborative heterogeneous applications for integrated-architectures.
Proceedings of the 2017 IEEE International Symposium on Performance Analysis of Systems and Software, 2017

Keynote: Architecture and software for emerging low-power systems.
Proceedings of the 2017 IEEE/ACM International Symposium on Low Power Electronics and Design, 2017

RAI: A Scalable Project Submission System for Parallel Programming Courses.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

Rebooting the Data Access Hierarchy of Computing Systems.
Proceedings of the IEEE International Conference on Rebooting Computing, 2017

Generalize or Die: Operating Systems Support for Memristor-Based Accelerators.
Proceedings of the IEEE International Conference on Rebooting Computing, 2017

Collaborative (CPU + GPU) algorithms for triangle counting and truss decomposition on the Minsky architecture: Static graph challenge: Subgraph isomorphism.
Proceedings of the 2017 IEEE High Performance Extreme Computing Conference, 2017

Revisiting Online Autotuning for Sparse-Matrix Vector Multiplication Kernels on Next-Generation Architectures.
Proceedings of the 19th IEEE International Conference on High Performance Computing and Communications; 15th IEEE International Conference on Smart City; 3rd IEEE International Conference on Data Science and Systems, 2017

Hardware Acceleration of the Pair-HMM Algorithm for DNA Variant Calling.
Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017

2016
In-Place Matrix Transposition on GPUs.
IEEE Trans. Parallel Distrib. Syst., 2016

FCUDA-HB: Hierarchical and Scalable Bus Architecture Generation on FPGAs With the FCUDA Flow.
IEEE Trans. on CAD of Integrated Circuits and Systems, 2016

Common Bonds: MIPS, HPS, Two-Level Branch Prediction, and Compressed Code RISC Processor.
IEEE Micro, 2016

Platform choices and design demands for IoT platforms: cost, power, and performance tradeoffs.
IET Cyber-Phys. Syst.: Theory & Appl., 2016

BLESS 2: accurate, memory-efficient and fast error correction method.
Bioinformatics, 2016

Design of a power-efficient ARM processor with a timing-error detection and correction mechanism.
Proceedings of the 29th IEEE International System-on-Chip Conference, 2016

A programming system for future proofing performance critical libraries.
Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2016

KLAP: Kernel launch aggregation and promotion for optimizing dynamic parallelism.
Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016

Efficient kernel synthesis for performance portable programming.
Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016

AsHES 2016 Keynote.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

WebGPU: A Scalable Online Development Platform for GPU Programming Courses.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

Efficient and Scalable Workflows for Genomic Analyses.
Proceedings of the ACM International Workshop on Data-Intensive Distributed Computing, 2016

Acceleration of the Pair-HMM Algorithm for DNA Variant Calling.
Proceedings of the 24th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2016

SpaceJMP: Programming with Multiple Virtual Address Spaces.
Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, 2016

DySel: Lightweight Dynamic Selection for Kernel-based Data-parallel Programming Model.
Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, 2016

2015
Runtime and Architecture Support for Efficient Data Exchange in Multi-Accelerator Applications.
IEEE Trans. Parallel Distrib. Syst., 2015

Optimized Data Transfers Based on the OpenCL Event Management Mechanism.
Scientific Programming, 2015

Enhancing the Usability and Utilization of Accelerated Architectures via Docker.
Proceedings of the 8th IEEE/ACM International Conference on Utility and Cloud Computing, 2015

GPU-SM: shared memory multi-GPU programming.
Proceedings of the 8th Workshop on General Purpose Processing using GPUs, 2015

Automatic Parallelization of Kernels in Shared-Memory Multi-GPU Nodes.
Proceedings of the 29th ACM on International Conference on Supercomputing, 2015

In-Place Data Sliding Algorithms for Many-Core Architectures.
Proceedings of the 44th International Conference on Parallel Processing, 2015

FPGA accelerated DNA error correction.
Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition, 2015

Locality-centric thread scheduling for bulk-synchronous programming models on CPU architectures.
Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2015

2014
What is ahead for parallel computing.
J. Parallel Distrib. Comput., 2014

BLESS: Bloom filter-based error correction solution for high-throughput sequencing reads.
Bioinformatics, 2014

SPEC ACCEL: A Standard Application Suite for Measuring Hardware Accelerator Performance.
Proceedings of the High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, 2014

In-place transposition of rectangular matrices on accelerators.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014

Triolet: a programming system that unifies algorithmic skeleton interfaces for high-performance cluster computing.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014

Adaptive Cache Management for Energy-Efficient GPU Computing.
Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014

Adaptive Cache Bypass and Insertion for Many-core Accelerators.
Proceedings of the 2nd International Workshop on Many-core Embedded Systems, 2014

Automatic execution of single-GPU computations across multiple GPUs.
Proceedings of the International Conference on Parallel Architectures and Compilation, 2014

2013
Scalable SIMD-parallel memory allocation for many-core machines.
The Journal of Supercomputing, 2013

Efficient compilation of CUDA kernels for high-performance computing on FPGAs.
ACM Trans. Embedded Comput. Syst., 2013

More IMPATIENT: A gridding-accelerated Toeplitz-based strategy for non-Cartesian high-resolution 3D MRI on GPUs.
J. Parallel Distrib. Comput., 2013

Rapid computation of sodium bioscales using GPU-accelerated image reconstruction.
Int. J. Imaging Systems and Technology, 2013

Rethinking computer architecture for throughput computing.
Proceedings of the 2013 International Conference on Embedded Computer Systems: Architectures, 2013

clMPI: An OpenCL Extension for Interoperation with the Message Passing Interface.
Proceedings of the 2013 IEEE International Symposium on Parallel & Distributed Processing, 2013

Throughput-oriented kernel porting onto FPGAs.
Proceedings of the 50th Annual Design Automation Conference 2013, 2013

Comparison based sorting for systems with multiple GPUs.
Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, 2013

2012
Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU).
Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, 2012

Data Layout Transformation Exploiting Memory-Level Parallelism in Structured Grid Many-Core Applications.
International Journal of Parallel Programming, 2012

Algorithm and Data Optimization Techniques for Scaling to Massively Threaded Systems.
IEEE Computer, 2012

TIGER: tiled iterative genome assembler.
BMC Bioinformatics, 2012

A scalable, numerically stable, high-performance tridiagonal solver using GPUs.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

Efficient performance evaluation of memory hierarchy for highly multithreaded graphics processors.
Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2012

Efficient Pattern-Based Time Series Classification on GPU.
Proceedings of the 12th IEEE International Conference on Data Mining, 2012

Design evaluation of OpenCL compiler framework for Coarse-Grained Reconfigurable Arrays.
Proceedings of the 2012 International Conference on Field-Programmable Technology, 2012

2011
Superscalar Processors.
Proceedings of the Encyclopedia of Parallel Computing, 2011

EcoG: A Power-Efficient GPU Cluster Architecture for Scientific Computing.
Computing in Science and Engineering, 2011

Advanced MRI reconstruction toolbox with accelerating on GPU.
Proceedings of the Conference on Parallel Processing for Imaging Applications 2011, 2011

Impatient MRI: Illinois Massively Parallel Acceleration Toolkit for image reconstruction with enhanced throughput in MRI.
Proceedings of the 8th IEEE International Symposium on Biomedical Imaging: From Nano to Macro, 2011

Panel Statement.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

A Scalable Tridiagonal Solver for GPUs.
Proceedings of the International Conference on Parallel Processing, 2011

Parallel implementation of Multi-dimensional Ensemble Empirical Mode Decomposition.
Proceedings of the IEEE International Conference on Acoustics, 2011

Multilevel Granularity Parallelism Synthesis on FPGAs.
Proceedings of the IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines, 2011

2010
High-Performance Computing with Accelerators.
Computing in Science and Engineering, 2010

An adaptive performance modeling tool for GPU architectures.
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010

Implementing a GPU Programming Model on a Non-GPU Accelerator Architecture.
Proceedings of the Computer Architecture, 2010

Accelerating iterative field-compensated MR image reconstruction on GPUs.
Proceedings of the 2010 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, 2010

An effective GPU implementation of breadth-first search.
Proceedings of the 47th Design Automation Conference, 2010

Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs.
Proceedings of the CGO 2010, 2010

An asymmetric distributed shared memory model for heterogeneous parallel systems.
Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, 2010

Data layout transformation exploiting memory-level parallelism in structured grid many-core applications.
Proceedings of the 19th International Conference on Parallel Architecture and Compilation Techniques, 2010

Raising the level of many-core programming with compiler technology: meeting a grand challenge.
Proceedings of the 19th International Conference on Parallel Architecture and Compilation Techniques, 2010

Exploiting More Parallelism from Applications Having Generalized Reductions on GPU Architectures.
Proceedings of the 10th IEEE International Conference on Computer and Information Technology, 2010

XMalloc: A Scalable Lock-free Dynamic Memory Allocator for Many-core Machines.
Proceedings of the 10th IEEE International Conference on Computer and Information Technology, 2010

Programming Massively Parallel Processors - A Hands-on Approach.
Morgan Kaufmann, ISBN: 978-0-12-381472-2, 2010

2009
Hardware-compiler co-design for adjustable data power savings.
Microprocessors and Microsystems - Embedded Hardware Design, 2009

Compute Unified Device Architecture Application Suitability.
Computing in Science and Engineering, 2009

FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs.
Proceedings of the IEEE 7th Symposium on Application Specific Processors, 2009

Accelerating MR Image Reconstruction on GPUs.
Proceedings of the 2009 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, Boston, MA, USA, June 28, 2009

Long time-scale simulations of in vivo diffusion using GPU hardware.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

Many-core parallel computing - Can compilers and tools do the heavy lifting?
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

High-performance CUDA kernel execution on FPGAs.
Proceedings of the 23rd international conference on Supercomputing, 2009

GPU clusters for high-performance computing.
Proceedings of the 2009 IEEE International Conference on Cluster Computing, August 31, 2009

High performance computation and interactive display of molecular orbitals on GPUs and multi-core CPUs.
Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, 2009

Optimization of tele-immersion codes.
Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, 2009

2008
Guest Editors' Introduction: Accelerator Architectures.
IEEE Micro, 2008

Accelerating advanced MRI reconstructions on GPUs.
J. Parallel Distrib. Comput., 2008

Program optimization carving for GPU computing.
J. Parallel Distrib. Comput., 2008

Thousand-Core Chips [Roundtable].
IEEE Design & Test of Computers, 2008

The Concurrency Challenge.
IEEE Design & Test of Computers, 2008

Application Acceleration with the Explicitly Parallel Operations System - the EPOS Processor.
Proceedings of the IEEE Symposium on Application Specific Processors, 2008

Optimization principles and application performance evaluation of a multithreaded GPU using CUDA.
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008

CUDA-Lite: Reducing GPU Programming Complexity.
Proceedings of the Languages and Compilers for Parallel Computing, 2008

MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs.
Proceedings of the Languages and Compilers for Parallel Computing, 2008

CUBA: an architecture for efficient CPU/co-processor data communication.
Proceedings of the 22nd Annual International Conference on Supercomputing, 2008

Visualization and Analysis of GPU Summer School Applicants and Participants.
Proceedings of the Fourth International Conference on e-Science, 2008

Program optimization space pruning for a multithreaded GPU.
Proceedings of the Sixth International Symposium on Code Generation and Optimization (CGO 2008), 2008

Accelerating advanced MRI reconstructions on GPUs.
Proceedings of the 5th Conference on Computing Frontiers, 2008

GPU acceleration of cutoff pair potentials for molecular modeling applications.
Proceedings of the 5th Conference on Computing Frontiers, 2008

2007
Automatic Discovery of Coarse-Grained Parallelism in Media Applications.
Trans. HiPEAC, 2007

Toward Application-Aware Security and Reliability.
IEEE Security & Privacy, 2007

Iteration Disambiguation for Parallelism Identification in Time-Sliced Applications.
Proceedings of the Languages and Compilers for Parallel Computing, 2007

Corezilla: Build and Tame the Multicore Beast?
Proceedings of the 44th Design Automation Conference, 2007

Implicitly Parallel Programming Models for Thousand-Core Microprocessors.
Proceedings of the 44th Design Automation Conference, 2007

CIGAR: Application Partitioning for a CPU/Coprocessor Architecture.
Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007), 2007

2006
Beating In-Order Stalls with "Flea-Flicker" Two-Pass Pipelining.
IEEE Trans. Computers, 2006

Tolerating Cache-Miss Latency with Multipass Pipelines.
IEEE Micro, 2006

2005
Guest Editors' Introduction.
IEEE Trans. Computers, 2005

"Flea-flicker" Multipass Pipelining: An Alternative to the High-Power Out-of-Order Offense.
Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-38 2005), 2005

The Future of Computer Architecture Research: An Industrial Perspective.
Proceedings of the 11th International Conference on High-Performance Computer Architecture (HPCA-11 2005), 2005

2004
Bottom-Up and Top-Down Context-Sensitive Summary-Based Pointer Analysis.
Proceedings of the Static Analysis, 11th International Symposium, 2004

Importance of heap specialization in pointer analysis.
Proceedings of the 2004 ACM SIGPLAN-SIGSOFT Workshop on Program Analysis For Software Tools and Engineering, 2004

Trimaran: An Infrastructure for Research in Instruction-Level Parallelism.
Proceedings of the Languages and Compilers for High Performance Computing, 2004

Field-testing IMPACT EPIC research results in Itanium 2.
Proceedings of the 31st International Symposium on Computer Architecture (ISCA 2004), 2004

2003
Energy saving and capacity improvement potential of power control in multi-hop wireless networks.
Computer Networks, 2003

Beating in-order stalls with "flea-flicker" two-pass pipelining.
Proceedings of the 36th Annual International Symposium on Microarchitecture, 2003

2002
Vacuum packing: extracting hardware-detected program phases for post-link optimization.
Proceedings of the 35th Annual International Symposium on Microarchitecture, 2002

Code coverage and input variability: effects on architecture and compiler research.
Proceedings of the International Conference on Compilers, 2002

2001
An Architectural Framework for Runtime Optimization.
IEEE Trans. Computers, 2001

Enhancing loop buffering of media and telecommunications applications using low-overhead predication.
Proceedings of the 34th Annual International Symposium on Microarchitecture, 2001

Modulo schedule buffers.
Proceedings of the 34th Annual International Symposium on Microarchitecture, 2001

A Study of the Energy Saving and Capacity Improvement Potential of Power Control in Multi-Hop Wireless Networks.
Proceedings of the 26th Annual IEEE Conference on Local Computer Networks (LCN 2001), 2001

A Power Controlled Multiple Access Protocol for Wireless Packet Networks.
Proceedings of IEEE INFOCOM 2001, 2001

Code Reordering and Speculation Support for Dynamic Optimization System.
Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques (PACT 2001), 2001

2000
Modular interprocedural pointer analysis using access paths: design, implementation, and evaluation.
Proceedings of the 2000 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2000

Accurate and efficient predicate analysis with binary decision diagrams.
Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture, 2000

Transmission Power Control for Multiple Access Wireless Packet Networks.
Proceedings of the 27th Conference on Local Computer Networks, 2000

A hardware mechanism for dynamic extraction and relayout of program hot spots.
Proceedings of the 27th International Symposium on Computer Architecture (ISCA 2000), 2000

Hardware Support for Dynamic Management of Compiler-Directed Computation Reuse.
Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX), 2000

1999
Architecture.
Proceedings of the VLSI Handbook., 1999

Run-Time Cache Bypassing.
IEEE Trans. Computers, 1999

Editors' Introduction.
International Journal of Parallel Programming, 1999

Editor's Introduction.
International Journal of Parallel Programming, 1999

The Partial Reverse If-Conversion Framework for Balancing Control Flow and Predication.
International Journal of Parallel Programming, 1999

A New Framework for Debugging Globally Optimized Code.
Proceedings of the 1999 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 1999

Compiler-Directed Dynamic Computation Reuse: Rationale and Initial Results.
Proceedings of the 32nd Annual IEEE/ACM International Symposium on Microarchitecture, 1999

An Empirical Study of Function Pointers Using SPEC Benchmarks.
Proceedings of the Languages and Compilers for Parallel Computing, 1999

A Hardware-Driven Profiling Scheme for Identifying Program Hot Spots to Support Runtime Optimization.
Proceedings of the 26th Annual International Symposium on Computer Architecture, 1999

The Program Decision Logic Approach to Predicated Execution.
Proceedings of the 26th Annual International Symposium on Computer Architecture, 1999

An Architecture Framework for Introducing Predicated Execution into Embedded Microprocessors.
Proceedings of the Euro-Par '99 Parallel Processing, 5th International Euro-Par Conference, Toulouse, France, August 31, 1999

1998
Combining Trace Sampling with Single Pass Methods for Efficient Cache Simulation.
IEEE Trans. Computers, 1998

Optimization of Machine Descriptions for Efficient Use.
International Journal of Parallel Programming, 1998

Foreword to the Special Issue.
International Journal of Parallel Programming, 1998

Introduction to Predicated Execution.
IEEE Computer, 1998

Compiler-Directed Early Load-Address Generation.
Proceedings of the 31st Annual IEEE/ACM International Symposium on Microarchitecture, 1998

HPSm, a High Performance Restricted Data Flow Architecture Having Minimal Functionality.
Proceedings of the 25 Years of the International Symposia on Computer Architecture (Selected Papers)., 1998

Retrospective: HPSm, a High Performance Restricted Data Flow Architecture Having Minimal Functionality.
Proceedings of the 25 Years of the International Symposia on Computer Architecture (Selected Papers)., 1998

Retrospective: IMPACT: An Architectural Framework for Multiple-Instruction Issue.
Proceedings of the 25 Years of the International Symposia on Computer Architecture (Selected Papers)., 1998

IMPACT: An Architectural Framework for Multiple-Instruction-Issue Processors.
Proceedings of the 25 Years of the International Symposia on Computer Architecture (Selected Papers)., 1998

Integrated Predicated and Speculative Execution in the IMPACT EPIC Architecture.
Proceedings of the 25th Annual International Symposium on Computer Architecture, 1998

Run-Time Adaptive Cache Management.
Proceedings of the Thirty-First Annual Hawaii International Conference on System Sciences, 1998

Improving Static Branch Prediction in a Compiler.
Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques, 1998

1997
Region-based compilation: Introduction, motivation, and initial experience.
International Journal of Parallel Programming, 1997

Optimizing NET Compilers for Improved Java Performance.
IEEE Computer, 1997

Run-Time Spatial Locality Detection and Optimization.
Proceedings of the Thirtieth Annual IEEE/ACM International Symposium on Microarchitecture, 1997

A Framework for Balancing Control Flow and Predication.
Proceedings of the Thirtieth Annual IEEE/ACM International Symposium on Microarchitecture, 1997

Run-Time Adaptive Cache Hierarchy Management via Reference Analysis.
Proceedings of the 24th International Symposium on Computer Architecture, 1997

Architectural Support for Compiler-Synthesized Dynamic Branch Prediction Strategies: Rationale and Initial Results.
Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture (HPCA '97), 1997

A study of the cache and branch performance issues with running Java on current hardware platforms.
Proceedings of IEEE COMPCON 97, 1997

1996
Guest Editors' Introduction.
International Journal of Parallel Programming, 1996

Modulo Scheduling of Loops in Control-intensive Non-numeric Programs.
Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture, 1996

Java Bytecode to Native Code Translation: The Caffeine Prototype and Preliminary Results.
Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture, 1996

Optimization of Machine Descriptions for Efficient Use.
Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture, 1996

Speculative Hedge: Regulating Compile-time Speculation Against Profile Variations.
Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture, 1996

1995
Compiler-Based Multiple Instruction Retry.
IEEE Trans. Computers, 1995

Three Architectural Models for Compiler-Controlled Speculative Execution.
IEEE Trans. Computers, 1995

The Importance of Prepass Code Scheduling for Superscalar and Superpipelined Processors.
IEEE Trans. Computers, 1995

Compiler-Assisted Multiple Instruction Rollback Recovery Using a Read Buffer.
IEEE Trans. Computers, 1995

Advances in Benchmarking Techniques: New Standards and Quantitative Metrics.
Advances in Computers, 1995

Unrolling-based optimizations for modulo scheduling.
Proceedings of the 28th Annual International Symposium on Microarchitecture, Ann Arbor, Michigan, USA, November 29, 1995

Region-based compilation: an introduction and motivation.
Proceedings of the 28th Annual International Symposium on Microarchitecture, Ann Arbor, Michigan, USA, November 29, 1995

A Comparison of Full and Partial Predicated Execution Support for ILP Processors.
Proceedings of the 22nd Annual International Symposium on Computer Architecture, 1995

A study of the effects of compiler-controlled speculation on instruction and data caches.
Proceedings of the 28th Annual Hawaii International Conference on System Sciences (HICSS-28), 1995

1994
The Susceptibility of Programs to Context Switching.
IEEE Trans. Computers, 1994

Incremental Compiler Transformations for Multiple Instruction Retry.
Softw., Pract. Exper., 1994

Performance Implications of Synchronization Support for Parallel Fortran Programs.
J. Parallel Distrib. Comput., 1994

From the guest editors.
International Journal of Parallel Programming, 1994

Profile-assisted instruction scheduling.
International Journal of Parallel Programming, 1994

Data relocation and prefetching for programs with large data sets.
Proceedings of the 27th Annual International Symposium on Microarchitecture, San Jose, California, USA, November 30, 1994

Characterizing the impact of predicated execution on branch prediction.
Proceedings of the 27th Annual International Symposium on Microarchitecture, San Jose, California, USA, November 30, 1994

An Analytical Approach to Scheduling Code for Superscalar and VLIW Architectures.
Proceedings of the 1994 International Conference on Parallel Processing, 1994

Dynamic Memory Disambiguation Using the Memory Conflict Buffer.
Proceedings of ASPLOS-VI, 1994

1993
Sentinel Scheduling for VLIW and Superscalar Processors.
ACM Trans. Comput. Syst., 1993

The superblock: An effective technique for VLIW and superscalar compilation.
The Journal of Supercomputing, 1993

The Effect of Code Expanding Optimizations on Instruction Cache Design.
IEEE Trans. Computers, 1993

An execution Profiler for Window-oriented Applications.
Softw., Pract. Exper., 1993

Reverse If-Conversion.
Proceedings of the ACM SIGPLAN'93 Conference on Programming Language Design and Implementation (PLDI), 1993

Superblock formation using static program analysis.
Proceedings of the 26th Annual International Symposium on Microarchitecture, 1993

Speculative execution exception recovery using write-back suppression.
Proceedings of the 26th Annual International Symposium on Microarchitecture, 1993

Register Connection: A New Approach to Adding Registers into Instruction Set Architectures.
Proceedings of the 20th Annual International Symposium on Computer Architecture. San Diego, 1993

Application of Compiler-Assisted Rollback Recovery to Speculative Execution Repair.
Proceedings of the Hardware and Software Architectures for Fault Tolerance, 1993

1992
Efficient Instruction Sequencing with Inline Target Insertion.
IEEE Trans. Computers, 1992

Profile-guided Automatic Inline Expansion for C Programs.
Softw., Pract. Exper., 1992

Xprof: Profiling the Execution of X Window Programs.
Proceedings of the 1992 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems, 1992

Compiler Code Transformations for Superscalar-Based High Performance Systems.
Proceedings of Supercomputing '92, 1992

Systematic prototyping of superscalar computer architectures.
Proceedings of the Third International Workshop on Rapid System Prototyping, 1992

Using Profile Information to Assist Advanced Compiler Optimization and Scheduling.
Proceedings of the Languages and Compilers for Parallel Computing, 1992

Tolerating data access latency with register preloading.
Proceedings of the 6th international conference on Supercomputing, 1992

Tolerating First Level Memory Access Latency in High-Performance Systems.
Proceedings of the 1992 International Conference on Parallel Processing, 1992

Executing Nested Parallel Loops on Shared-Memory Multiprocessors.
Proceedings of the 1992 International Conference on Parallel Processing, 1992

Branch Recovery with Compiler-Assisted Multiple Instruction Retry.
Proceedings of the Digest of Papers: FTCS-22, 1992

Sentinel Scheduling for VLIW and Superscalar Processors.
Proceedings of ASPLOS-V, 1992

1991
Using Profile Information to Assist Classic Code Optimizations.
Softw., Pract. Exper., 1991

A brief survey of benchmark usage in the architecture community.
SIGARCH Computer Architecture News, 1991

Benchmark Characterization.
IEEE Computer, 1991

Data Access Microarchitectures for Superscalar Processors with Compiler-Assisted Data Prefetching.
Proceedings of the 24th Annual IEEE/ACM International Symposium on Microarchitecture, 1991

Comparing Static and Dynamic Code Scheduling for Multiple-Instruction-Issue Processors.
Proceedings of the 24th Annual IEEE/ACM International Symposium on Microarchitecture, 1991

IMPACT: An Architectural Framework for Multiple-Instruction-Issue Processors.
Proceedings of the 18th Annual International Symposium on Computer Architecture. Toronto, 1991

The Effect of Compiler Optimizations on Available Parallelism in Scalar Programs.
Proceedings of the International Conference on Parallel Processing, 1991

1990
A software based approach to achieving optimal performance for signature control flow checking.
Proceedings of the 20th International Symposium on Fault-Tolerant Computing, 1990

1989
A Simulation Study of Simultaneous Vector Prefetch Performance in Multiprocessor Memory Subsystems (Extended Abstract).
Proceedings of the 1989 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, 1989

Inline Function Expansion for Compiling C Programs.
Proceedings of the ACM SIGPLAN'89 Conference on Programming Language Design and Implementation (PLDI), 1989

Forward semantic: a compiler-assisted instruction fetch method for heavily pipelined processors.
Proceedings of the 22nd Annual Workshop and Symposium on Microprogramming and Microarchitecture, 1989

Comparing Software and Hardware Schemes For Reducing the Cost of Branches.
Proceedings of the 16th Annual International Symposium on Computer Architecture. Jerusalem, 1989

Achieving High Instruction Cache Performance with an Optimizing Compiler.
Proceedings of the 16th Annual International Symposium on Computer Architecture. Jerusalem, 1989

Control flow optimization for supercomputer scalar processing.
Proceedings of the 3rd international conference on Supercomputing, 1989

1988
Trace selection for compiling large C application programs to microcode.
Proceedings of the 21st Annual Workshop and Symposium on Microprogramming and Microarchitecture, 1988, San Diego, California, USA, November 28, 1988

Exploiting Parallel Microprocessor Microarchitectures With a Compiler Code Generator.
Proceedings of the 15th Annual International Symposium on Computer Architecture. Honolulu, 1988

1987
Checkpoint Repair for High-Performance Out-of-Order Execution Machines.
IEEE Trans. Computers, 1987

On tuning the microarchitecture of an HPS implementation of the VAX.
Proceedings of the 20th Annual Workshop and Symposium on Microprogramming and Microarchitecture, 1987

Exploiting horizontal and vertical concurrency via the HPSm microprocessor.
Proceedings of the 20th Annual Workshop and Symposium on Microprogramming and Microarchitecture, 1987

Checkpoint Repair for Out-of-order Execution Machines.
Proceedings of the 14th Annual International Symposium on Computer Architecture. Pittsburgh, 1987

1986
Run-time generation of HPS microinstructions from a VAX instruction stream.
Proceedings of the 19th annual workshop on Microprogramming, 1986

HPSm, a High Performance Restricted Data Flow Architecture Having Minimal Functionality.
Proceedings of the 13th Annual Symposium on Computer Architecture, Tokyo, Japan, June 1986, 1986

Experiments with HPS, a Restricted Data Flow Microarchitecture for High Performance Computers.
Proceedings of the Spring COMPCON'86, 1986

1985
Critical issues regarding HPS, a high performance microarchitecture.
Proceedings of the 18th annual workshop on Microprogramming, 1985

HPS, a new microarchitecture: rationale and introduction.
Proceedings of the 18th annual workshop on Microprogramming, 1985

