Guang R. Gao

Orcid: 0000-0002-5265-7528

Affiliations:
  • University of Delaware, Newark, USA


According to our database, Guang R. Gao authored at least 297 papers between 1983 and 2023.

Awards

ACM Fellow 2007, "For contributions to multiprocessor computers and compiler optimization techniques."

Bibliography

2023
Codelet Pipe: Realization of Dataflow Software Pipelining for Extended Codelet Model.
Proceedings of the 52nd International Conference on Parallel Processing Workshops, 2023

2022
Extending an asynchronous runtime system for high throughput applications: A case study.
J. Parallel Distributed Comput., 2022

A Profile-Based AI-Assisted Dynamic Scheduling Approach for Heterogeneous Architectures.
Int. J. Parallel Program., 2022

The SuperCodelet architecture.
Proceedings of ExHET@PPoPP 2022: 1st International Workshop on Extreme Heterogeneity Solutions, 2022

2021
swFLOW: A large-scale distributed framework for deep learning on Sunway TaihuLight supercomputer.
Inf. Sci., 2021

Guest Editorial: Special issue on Network and Parallel Computing for Emerging Architectures and Applications.
Int. J. Parallel Program., 2021

The Promise of Dataflow Architectures in the Design of Processing Systems for Autonomous Machines.
CoRR, 2021

E.T.: re-thinking self-attention for transformer models on GPUs.
Proceedings of the International Conference for High Performance Computing, 2021

2020
DEMAC: A Modular Platform for HW-SW Co-Design.
Proceedings of the Fourth IEEE/ACM Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, 2020

CODIR: Towards an MLIR Codelet Model Dialect.
Proceedings of the Fourth IEEE/ACM Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, 2020

PDAWL: Profile-Based Iterative Dynamic Adaptive WorkLoad Balance on Heterogeneous Architectures.
Proceedings of the Job Scheduling Strategies for Parallel Processing, 2020

On the Marriage of Asynchronous Many Task Runtimes and Big Data: A Glance.
Proceedings of the 27th IEEE International Conference on High Performance Computing, 2020

2019
Editorial for the special issue on innovations in supercomputing techniques.
CCF Trans. High Perform. Comput., 2019

swFLOW: A Dataflow Deep Learning Framework on Sunway TaihuLight Supercomputer.
Proceedings of the 21st IEEE International Conference on High Performance Computing and Communications; 17th IEEE International Conference on Smart City; 5th IEEE International Conference on Data Science and Systems, 2019

2018
DCF: A Dataflow-Based Collaborative Filtering Training Algorithm.
Int. J. Parallel Program., 2018

2017
Generating Fine-Grain Multithreaded Applications Using a Multigrain Approach.
ACM Trans. Archit. Code Optim., 2017

Parallel Turing Machine, a Proposal.
J. Comput. Sci. Technol., 2017

HAMR: A dataflow-based real-time in-memory cluster computing engine.
Int. J. High Perform. Comput. Appl., 2017

Verification of the Extended Roofline Model for Asynchronous Many Task Runtimes.
Proceedings of the Third International Workshop on Extreme Scale Programming Models and Middleware, 2017

Multigrain Parallelism: Bridging Coarse-Grain Parallel Programs and Fine-Grain Event-Driven Multithreading.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

Leveraging access port positions to accelerate page table walk in DWM-based main memory.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2017

Leveraging Compiler Optimizations to Reduce Runtime Fault Recovery Overhead.
Proceedings of the 54th Annual Design Automation Conference, 2017

Designing Scalable Distributed Memory Models: A Case Study.
Proceedings of the Computing Frontiers Conference, 2017

2016
The Design and Implementation of TIDeFlow: A Dataflow-Inspired Execution Model for Parallel Loops and Task Pipelining.
Int. J. Parallel Program., 2016

Toward a Parallel Turing Machine Model.
Proceedings of the Network and Parallel Computing, 2016

Energy Avoiding Matrix Multiply.
Proceedings of the Languages and Compilers for Parallel Computing, 2016

The Importance of Efficient Fine-Grain Synchronization for Many-Core Systems.
Proceedings of the Languages and Compilers for Parallel Computing, 2016

Asynchronous Runtimes in Action: An Introspective Framework for a Next Gen Runtime.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

Extending the Roofline Model for Asynchronous Many-Task Runtimes.
Proceedings of the 2016 IEEE International Conference on Cluster Computing, 2016

Application characterization at scale: lessons learned from developing a distributed open community runtime system for high performance computing.
Proceedings of the ACM International Conference on Computing Frontiers, CF'16, 2016

2015
Author Rebuttal to Rocha et al. "Comments on Minimizing Buffer Requirements under Rate-Optimal Schedule in Regular Dataflow Networks".
J. Signal Process. Syst., 2015

Design and evaluation of a novel dataflow based bigdata solution.
Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores, 2015

Landing Containment Domains on SWARM: Toward a Robust Resiliency Solution on a Dynamic Adaptive Runtime Machine.
Proceedings of the Parallel Computing: On the Road to Exascale, 2015

FreshBreeze: A Data Flow Approach for Meeting DDDAS Challenges.
Proceedings of the International Conference on Computational Science, 2015

Gregarious Data Re-structuring in a Many Core Architecture.
Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications, 2015

Energy efficient multi-level tiling for dense matrix multiplication on many-core architecture.
Proceedings of the Sixth International Green and Sustainable Computing Conference, 2015

Locality aware concurrent start for stencil applications.
Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2015

2014
TERAFLUX: Harnessing dataflow in next generation teradevices.
Microprocess. Microsystems, 2014

Jagged Tiling for Intra-tile Parallelism and Fine-Grain Multithreading.
Proceedings of the Languages and Compilers for Parallel Computing, 2014

Position Paper: Locality-Driven Scheduling of Tasks for Data-Dependent Multithreading.
Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2014

ACDT: Architected Composite Data Types trading-in unfettered data access for improved execution.
Proceedings of the 20th IEEE International Conference on Parallel and Distributed Systems, 2014

A Dataflow Programming Language and its Compiler for Streaming Systems.
Proceedings of the International Conference on Computational Science, 2014

2013
StreamTMC: Stream compilation for tiled multi-core architectures.
J. Parallel Distributed Comput., 2013

Automatic Locality Exploitation in the Codelet Model.
Proceedings of the 12th IEEE International Conference on Trust, 2013

Optimizing the LU Factorization for Energy Efficiency on a Many-Core Architecture.
Proceedings of the Languages and Compilers for Parallel Computing, 2013

Towards Memory-Load Balanced Fast Fourier Transformations in Fine-Grain Execution Models.
Proceedings of the 2013 IEEE International Symposium on Parallel & Distributed Processing, 2013

A dynamic schema to increase performance in many-core architectures through percolation operations.
Proceedings of the 20th Annual International Conference on High Performance Computing, 2013

An Implementation of the Codelet Model.
Proceedings of the Euro-Par 2013 Parallel Processing, 2013

Toward a Self-aware System for Exascale Architectures.
Proceedings of the Euro-Par 2013: Parallel Processing Workshops, 2013


Strategies for improving performance and energy efficiency on a many-core.
Proceedings of the Computing Frontiers Conference, 2013

2012
Software Pipelining for Stream Programs on Resource Constrained Multicore Architectures.
IEEE Trans. Parallel Distributed Syst., 2012

Toward high-throughput algorithms on many-core architectures.
ACM Trans. Archit. Code Optim., 2012

Massively parallel breadth first search using a tree-structured memory model.
Proceedings of the 2012 PPOPP International Workshop on Programming Models and Applications for Multicores and Manycores, 2012

Demystifying Performance Predictions of Distributed FFT3D Implementations.
Proceedings of the Network and Parallel Computing, 9th IFIP International Conference, 2012

A Discussion in Favor of Dynamic Scheduling for Regular Applications in Many-core Architectures.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012

The Role of Non-strict Fine-grain Synchronization.
Proceedings of the Transition of HPC Towards Exascale Computing, 2012

Dynamic percolation: a case of study on the shortcomings of traditional optimization in many-core architectures.
Proceedings of the Computing Frontiers Conference, CF'12, 2012

2011
Analysis and performance results of computing betweenness centrality on IBM Cyclops64.
J. Supercomput., 2011

Experiments with the Fresh Breeze tree-based memory model.
Comput. Sci. Res. Dev., 2011

The Fresh Breeze Program Execution Model.
Proceedings of the Applications, Tools and Techniques on the Road to Exascale Computing, Proceedings of the conference ParCo 2011, 31 August, 2011

Polytasks: A Compressed Task Representation for HPC Runtimes.
Proceedings of the Languages and Compilers for Parallel Computing, 2011

OPELL and PM: A Case Study on Porting Shared Memory Programming Models to Accelerators Architectures.
Proceedings of the Languages and Compilers for Parallel Computing, 2011

The elephant and the mice: the role of non-strict fine-grain synchronization for modern many-core architectures.
Proceedings of the 25th International Conference on Supercomputing, 2011, Tucson, AZ, USA, May 31, 2011

Source Code Partitioning in Program Optimization.
Proceedings of the 17th IEEE International Conference on Parallel and Distributed Systems, 2011

DEEP: an iterative fpga-based many-core emulation system for chip verification and architecture research.
Proceedings of the ACM/SIGDA 19th International Symposium on Field Programmable Gate Arrays, 2011

Hardware and Software Tradeoffs for Task Synchronization on Manycore Architectures.
Proceedings of the Euro-Par 2011 Parallel Processing - 17th International Conference, 2011

Exploring Fine-Grained Task-Based Execution on Multi-GPU Systems.
Proceedings of the 2011 IEEE International Conference on Cluster Computing (CLUSTER), 2011

2010
Performance analysis of Cooley-Tukey FFT algorithms for a many-core architecture.
Proceedings of the 2010 Spring Simulation Multiconference, 2010

Locality Optimization of Stencil Applications Using Data Dependency Graphs.
Proceedings of the Languages and Compilers for Parallel Computing, 2010

TiNy threads on BlueGene/P: Exploring many-core parallelisms beyond The traditional OS.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Dynamic load balancing on single- and multi-GPU systems.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Optimized Dense Matrix Multiplication on a Many-Core Architecture.
Proceedings of the Euro-Par 2010 - Parallel Processing, 16th International Euro-Par Conference, Ischia, Italy, August 31, 2010

A Study of a Software Cache Implementation of the OpenMP Memory Model for Multicore and Manycore Architectures.
Proceedings of the Euro-Par 2010 - Parallel Processing, 16th International Euro-Par Conference, Ischia, Italy, August 31, 2010

Minimizing communication in rate-optimal software pipelining for stream programs.
Proceedings of the CGO 2010, 2010

2009
Improving Performance of Dynamic Programming via Parallelism and Locality on Multicore Architectures.
IEEE Trans. Parallel Distributed Syst., 2009

Tile Reduction: The First Step towards Tile Aware Parallelization in OpenMP.
Proceedings of the Evolving OpenMP in an Age of Extreme Parallelism, 2009

Iterative layer-based raytracing on CUDA.
Proceedings of the 28th International Performance Computing and Communications Conference, 2009

Mapping the FDTD Application to Many-Core Chip Architectures.
Proceedings of the ICPP 2009, 2009

Tile Percolation: An OpenMP Tile Aware Parallelization Technique for the Cyclops-64 Multicore Processor.
Proceedings of the Euro-Par 2009 Parallel Processing, 2009

Mapping the LU decomposition on a many-core architecture: challenges and solutions.
Proceedings of the 6th Conference on Computing Frontiers, 2009

2008
Register allocation for software pipelined multidimensional loops.
ACM Trans. Program. Lang. Syst., 2008

Engenius - Environmental genome Informational Utility System.
J. Bioinform. Comput. Biol., 2008

Guest Editors Introduction: Special Issue on OpenMP.
Int. J. Parallel Program., 2008

Experience on optimizing irregular computation for memory hierarchy in manycore architecture.
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008

Minimum Lock Assignment: A Method for Exploiting Concurrency among Critical Sections.
Proceedings of the Languages and Compilers for Parallel Computing, 2008

Just-In-Time Locality and Percolation for Optimizing Irregular Applications on a Manycore Architecture.
Proceedings of the Languages and Compilers for Parallel Computing, 2008

Open64 compiler infrastructure for emerging multicore/manycore architecture All Symposium Tutorial.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

2007
Single-dimension software pipelining for multidimensional loops.
ACM Trans. Archit. Code Optim., 2007

Performance portability on EARTH: a case study across several parallel architectures.
Clust. Comput., 2007

A parallel dynamic programming algorithm on a multi-core architecture.
Proceedings of SPAA 2007: 19th Annual ACM Symposium on Parallelism in Algorithms and Architectures, 2007

Implementation of the Smith-Waterman algorithm on a reconfigurable supercomputing platform.
Proceedings of the 1st international workshop on High-performance reconfigurable computing technology and applications, 2007

Optimized lock assignment and allocation: a method for exploiting concurrency among critical sections.
Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2007

On Parallel Models of Computation.
Proceedings of the Network and Parallel Computing, IFIP International Conference, 2007

Concurrency Analysis for Shared Memory Programs with Textually Unaligned Barriers.
Proceedings of the Languages and Compilers for Parallel Computing, 2007

Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures.
Proceedings of the 34th International Symposium on Computer Architecture (ISCA 2007), 2007

On the Role of Deterministic Fine-Grain Data Synchronization for Scientific Applications: A Revisit in the Emerging Many-Core Era.
Proceedings of the 21th International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

Automatic Program Segment Similarity Detection in Targeted Program Performance Improvement.
Proceedings of the 21th International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

Experience of Optimizing FFT on Intel Architectures.
Proceedings of the 21th International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

ParalleX: A Study of A New Parallel Computation Model.
Proceedings of the 21th International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

Exploring a Multithreaded Methodology to Implement a Network Communication Protocol on the Cyclops-64 Multithreaded Architecture.
Proceedings of the 21th International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

Optimizing the Fast Fourier Transform on a Multi-core Architecture.
Proceedings of the 21th International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

Software-Pipelining on Multi-Core Architectures.
Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques (PACT 2007), 2007

2006
User-Friendly Methodology for Automatic Exploration of Compiler Options: A Case Study on the Intel XScale Microarchitecture.
Proceedings of the International Conference on Software Engineering Research and Practice & Conference on Programming Languages and Compilers, 2006

A User-Friendly Methodology for Automatic Exploration of Compiler Options.
Proceedings of the International Conference on Software Engineering Research and Practice & Conference on Programming Languages and Compilers, 2006

Performance Characteristics of OpenMP Language Constructs on a Many-core-on-a-chip Architecture.
Proceedings of the OpenMP Shared Memory Parallel Programming - International Workshops, 2006

Exploring Financial Applications on Many-Core-on-a-Chip Architecture: A First Experiment.
Proceedings of the Frontiers of High Performance Computing and Networking, 2006

A study of the on-chip interconnection network for the IBM Cyclops64 multi-core architecture.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

Hierarchical multithreading: programming model and system software.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

Toward a Software Infrastructure for the Cyclops-64 Cellular Architecture.
Proceedings of the 20th Annual International Symposium on High Performance Computing Systems and Applications (HPCS 2006), 2006

Optimization of Dense Matrix Multiplication on IBM Cyclops-64: Challenges and Experiences.
Proceedings of the Euro-Par 2006, Parallel Processing, 12th International Euro-Par Conference, Dresden, Germany, August 28, 2006

Multi-dimensional Kernel Generation for Loop Nest Software Pipelining.
Proceedings of the Euro-Par 2006, Parallel Processing, 12th International Euro-Par Conference, Dresden, Germany, August 28, 2006

Landing openMP on cyclops-64: an efficient mapping of openMP to a many-core system-on-a-chip.
Proceedings of the Third Conference on Computing Frontiers, 2006

The Era of Multi-core Chips - A Fresh Look on Software Challenges.
Proceedings of the Advances in Computer Systems Architecture, 11th Asia-Pacific Conference, 2006

2005
Improving power efficiency with compiler-assisted cache replacement.
J. Embed. Comput., 2005

Madd Operation Aware Redundancy Elimination.
Int. J. Softw. Eng. Knowl. Eng., 2005

Quasi-consensus-based comparison of profile hidden Markov models for protein sequences.
Bioinform., 2005

An improved hidden Markov model for transmembrane protein detection and topology prediction and its applications to complete genomes.
Bioinform., 2005

Register allocation for software pipelined multi-dimensional loops.
Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation, 2005

Sequential Consistency Revisit: The Sufficient Condition and Method to Reason the Consistency Model of a Multiprocessor-on-a-Chip Architecture.
Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks, 2005

Performance Modelling and Optimization of Memory Access on Cellular Computer Architecture Cyclops64.
Proceedings of the Network and Parallel Computing, IFIP International Conference, 2005

Register Pressure in Software-Pipelined Loop Nests: Fast Computation and Impact on Architecture Design.
Proceedings of the Languages and Compilers for Parallel Computing, 2005

An energy efficient TLB design methodology.
Proceedings of the 2005 International Symposium on Low Power Electronics and Design, 2005

Sustained Petaflop and Beyond: Can Parallel Computing Systems Meet The Challenges?
Proceedings of the 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), 2005

TiNy Threads: A Thread Virtual Machine for the Cyclops64 Cellular Architecture.
Proceedings of the 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), 2005

Discriminating transmembrane proteins from signal peptides using SVM-Fisher approach.
Proceedings of the Fourth International Conference on Machine Learning and Applications, 2005

Identifying Multiply-Add Operations in Kylin Compiler.
Proceedings of The 2005 International Conference on Embedded Systems and Applications, 2005

2004
A fine-grain load-adaptive algorithm of the 2D discrete wavelet transform for multithreaded architectures.
J. Parallel Distributed Comput., 2004

A cluster-based solution for high performance hmmpfam using EARTH execution model.
Int. J. High Perform. Comput. Netw., 2004

An Improved Hidden Markov Model for Transmembrane Topology Prediction.
Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2004), 2004

If-Conversion in SSA Form.
Proceedings of the Euro-Par 2004 Parallel Processing, 2004

Implementing parallel conjugate gradient on the EARTH multithreaded architecture.
Proceedings of the 2004 IEEE International Conference on Cluster Computing (CLUSTER 2004), 2004

Single-Dimension Software Pipelining for Multi-Dimensional Loops.
Proceedings of the 2nd IEEE / ACM International Symposium on Code Generation and Optimization (CGO 2004), 2004

Code Generation for Single-Dimension Software Pipelining of Multi-Dimensional Loops.
Proceedings of the 2nd IEEE / ACM International Symposium on Code Generation and Optimization (CGO 2004), 2004

2003
Special issue on compilers, architecture, and synthesis for embedded systems.
ACM Trans. Embed. Comput. Syst., 2003

Minimum Register Instruction Sequencing to Reduce Register Spills in Out-of-Order Issue Superscalar Architectures.
IEEE Trans. Computers, 2003

Evaluation and Choice of Various Branch Predictors for Low-Power Embedded Processor.
J. Comput. Sci. Technol., 2003

Implementation of the EARTH programming model on SMP clusters: a multi-threaded language and runtime system.
Concurr. Comput. Pract. Exp., 2003

Compiler-Assisted Cache Replacement: Problem Formulation and Performance Evaluation.
Proceedings of the Languages and Compilers for Parallel Computing, 2003

CARE: Overview of an Adaptive Multithreaded Architecture.
Proceedings of the High Performance Computing, 5th International Symposium, 2003

Performance Study of a Whole Genome Comparison Tool on a Hyper-Threading Multiprocessor.
Proceedings of the High Performance Computing, 5th International Symposium, 2003

An Executable Analytical Performance Evaluation Approach for Early Performance Prediction.
Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), 2003

Programming Models and System Software for Future High-End Computing Systems: Work-in-Progress.
Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), 2003

Inter-procedural stacked register allocation for itanium® like architecture.
Proceedings of the 17th Annual International Conference on Supercomputing, 2003

DIMES: an iterative emulation platform for Multiprocessor-System-On-Chip designs.
Proceedings of the 2003 IEEE International Conference on Field-Programmable Technology, 2003

Implementing Parallel Hmm-pfam on the EARTH Multithreaded Architecture.
Proceedings of the 2nd IEEE Computer Society Bioinformatics Conference, 2003

2002
Minimizing Buffer Requirements under Rate-Optimal Schedule in Regular Dataflow Networks.
J. VLSI Signal Process., 2002

Efficient Multithreaded Algorithms for the Fast Fourier Transform.
Scalable Comput. Pract. Exp., 2002

A Theory for Co-Scheduling Hardware and Software Pipelines in ASIPs and Embedded Processors.
Des. Autom. Embed. Syst., 2002

Implementation and evaluation of a communication intensive application on the EARTH multithreaded system.
Concurr. Comput. Pract. Exp., 2002

CASA: a server for the critical assessment of protein sequence alignment accuracy.
Bioinform., 2002

TROLL-Tandem Repeat Occurrence Locator.
Bioinform., 2002

Fine-Grain Stacked Register Allocation for the Itanium Architecture.
Proceedings of the Languages and Compilers for Parallel Computing, 15th Workshop, 2002

Compiling Several Classes of Communication Patterns on a Multithreaded Architecture.
Proceedings of the 16th International Parallel and Distributed Processing Symposium (IPDPS 2002), 2002

Next Generation System Software for Future High-End Computing Systems.
Proceedings of the 16th International Parallel and Distributed Processing Symposium (IPDPS 2002), 2002

Visualizing Biosequence Data Using Texture Mapping.
Proceedings of the 2002 IEEE Symposium on Information Visualization (InfoVis 2002), 27 October, 2002

Power-Performance Trade-Offs for Energy-Efficient Architectures: A Quantitative Study.
Proceedings of the 20th International Conference on Computer Design (ICCD 2002), 2002

An Adaptive Meta-Clustering Approach: Combining the Information from Different Clustering Results.
Proceedings of the 1st IEEE Computer Society Bioinformatics Conference, 2002

A Bayesian Modeling Framework for Genetic Regulation.
Proceedings of the 1st IEEE Computer Society Bioinformatics Conference, 2002

On achieving balanced power consumption in software pipelined loops.
Proceedings of the International Conference on Compilers, 2002

2001
Dynamic Load Balancers for a Multithreaded Multiprocessor System.
Parallel Process. Lett., 2001

Exploiting Locality in Single Assignment Data Structures Updated Through Split-Phase Transactions.
Clust. Comput., 2001

A Multithreaded Parallel Implementation of a Dynamic Programming Algorithm for Sequence Comparison.
Proceedings of the 6th Pacific Symposium on Biocomputing, 2001

New Design Paradigms: What Needs to be Standardized?
Proceedings of the 14th International Symposium on Systems Synthesis, 2001

Bridging the gap between ISA compilers and silicon compilers a challenge for future SoC design.
Proceedings of the 14th International Symposium on Systems Synthesis, 2001

Multithreaded Algorithms for Pricing a Class of Complex Options.
Proceedings of the 15th International Parallel & Distributed Processing Symposium (IPDPS-01), 2001

Minimum Register Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs.
Proceedings of the 15th International Parallel & Distributed Processing Symposium (IPDPS-01), 2001

Topic 08+13: Instruction-Level Parallelism and Computer Architecture.
Proceedings of the Euro-Par 2001: Parallel Processing, 2001

Speculative Prefetching of Induction Pointers.
Proceedings of the Compiler Construction, 10th International Conference, 2001

2000
Location Consistency-A New Memory Model and Cache Consistency Protocol.
IEEE Trans. Computers, 2000

Enhanced Co-Scheduling: A Software Pipelining Method Using Modulo-Scheduled Pipeline Theory.
Int. J. Parallel Program., 2000

Self-Avoiding Walks over Adaptive Unstructured Grids.
Concurr. Pract. Exp., 2000

Multithreaded algorithms for the fast Fourier transform.
Proceedings of the Twelfth annual ACM Symposium on Parallel Algorithms and Architectures, 2000

Landing CG on EARTH: A Case Study of Fine-Grained Multithreading on an Evolutionary Path.
Proceedings Supercomputing 2000, 2000

Recursive and Iterative Multithreaded Algorithms for Pricing American Securities.
Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, 2000

Design and Implementation of an Efficient Thread Partitioning Algorithm.
Proceedings of the High Performance Computing, Third International Symposium, 2000

Caching Single-Assignment Structures to Build a Robust Fine-Grain Multi-Threading System.
Proceedings of the 14th International Parallel & Distributed Processing Symposium (IPDPS'00), 2000

Parallel FEM Simulation of Crack Propagation - Challenges, Status, and Perspectives.
Proceedings of the Parallel and Distributed Processing, 2000

Automatic compiler techniques for thread coarsening for multithreaded architectures.
Proceedings of the 14th international conference on Supercomputing, 2000

Developing a Communication Intensive Application on the EARTH Multithreaded Architecture (Distinguished Paper).
Proceedings of the Euro-Par 2000, Parallel Processing, 6th International Euro-Par Conference, Munich, Germany, August 29, 2000

A Theory for Software-Hardware Co-Scheduling for ASIPs and Embedded Processors.
Proceedings of the 12th IEEE International Conference on Application-Specific Systems, 2000

1999
Advances in the dataflow computational model.
Parallel Comput., 1999

Automatically Partitioning Threads for Multithreaded Architectures.
J. Parallel Distributed Comput., 1999

Self-Avoiding Walks Over Adaptive Triangular Grids.
Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing, 1999

Minimum Register Instruction Scheduling: A New Approach for Dynamic Instruction Issue Processors.
Proceedings of the Languages and Compilers for Parallel Computing, 1999

Coping with very High Latencies in Petaflop Computer Systems.
Proceedings of the High Performance Computing, Second International Symposium, 1999

Implementing a Non-Strict Functional Programming Language on a Threaded Architecture.
Proceedings of the Parallel and Distributed Processing, 1999

Load Adaptive Algorithms and Implementations for the 2D Discrete Wavelet Transform on Fine-Grain Multithreaded Architectures.
Proceedings of the 13th International Parallel Processing Symposium / 10th Symposium on Parallel and Distributed Processing (IPPS / SPDP '99), 1999

A New Approach to Parallel Dynamic Partitioning for Adaptive Unstructured Meshes.
Proceedings of the 13th International Parallel Processing Symposium / 10th Symposium on Parallel and Distributed Processing (IPPS / SPDP '99), 1999

From EARTH to HTMT: An Evolution of a Multithreaded Architecture Model (Abstract).
Proceedings of the Parallel and Distributed Processing, 1999

Multithreaded Execution Architecture and Compilation.
Proceedings of the Fifth International Symposium on High-Performance Computer Architecture, 1999

Efficient State-Diagram Construction Methods for Software Pipelining.
Proceedings of the Compiler Construction, 8th International Conference, 1999

1998
A New Framework for Elimination-Based Data Flow Analysis Using DJ Graphs.
ACM Trans. Program. Lang. Syst., 1998

A Unified Framework for Instruction Scheduling and Mapping for Function Units with Structural Hazards.
J. Parallel Distributed Comput., 1998

Optimal Modulo Scheduling Through Enumeration.
Int. J. Parallel Program., 1998

How "Hard" is Thread Partitioning and How "Bad" is a List Scheduling Based Partitioning Algorithm?
Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, 1998

Using Multithreading for the Automatic Load Balancing of Adaptive Finite Element Meshes.
Proceedings of the Solving Irregularly Structured Problems in Parallel, 1998

An Enhanced Co-Scheduling Method Using Reduced MS-State Diagrams.
Proceedings of the 12th International Parallel Processing Symposium / 9th Symposium on Parallel and Distributed Processing (IPPS/SPDP '98), March 30, 1998

Automatically Partitioning Threads Based on Remote Paths.
Proceedings of the International Conference on Parallel and Distributed Systems, 1998

Partial Sampling with Reverse State Reconstruction: A New Technique for Branch Predictor Performance Estimation.
Proceedings of the Fourth International Symposium on High-Performance Computer Architecture, Las Vegas, Nevada, USA, January 31, 1998

A New Fast Algorithm for Optimal Register Allocation in Modulo Scheduled Loops.
Proceedings of the Compiler Construction, 7th International Conference, 1998

1997
Incremental Computation of Dominator Trees.
ACM Trans. Program. Lang. Syst., 1997

Compiling C for the EARTH multithreaded architecture.
Int. J. Parallel Program., 1997

Thread Partitioning and Scheduling Based on Cost Model.
Proceedings of the 9th Annual ACM Symposium on Parallel Algorithms and Architectures, 1997

Experiences with Non-numeric Applications on Multithreaded Architectures.
Proceedings of the Sixth ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPOPP), 1997

On the Importance of an End-To-End View of Memory Consistency in Future Computer Systems.
Proceedings of the High Performance Computing, International Symposium, 1997

Latency Tolerance: A Metric for Performance Analysis of Multithreaded Architectures.
Proceedings of the 11th International Parallel Processing Symposium (IPPS '97), 1997

Elastic History Buffer: A Low-Cost Method to Improve Branch Prediction Accuracy.
Proceedings of the 1997 International Conference on Computer Design: VLSI in Computers & Processors, 1997

Heap Analysis and Optimizations for Threaded Programs.
Proceedings of the 1997 Conference on Parallel Architectures and Compilation Techniques (PACT '97), 1997

A Register Pressure Sensitive Instruction Scheduler for Dynamic Issue Processors.
Proceedings of the 1997 Conference on Parallel Architectures and Compilation Techniques (PACT '97), 1997

1996
A Framework for Resource-Constrained Rate-Optimal Software Pipelining.
IEEE Trans. Parallel Distributed Syst., 1996

Identifying Loops Using DJ Graphs.
ACM Trans. Program. Lang. Syst., 1996

A Study of the EARTH-MANNA Multithreaded System.
Int. J. Parallel Program., 1996

A New Framework for Exhaustive and Incremental Data Flow Analysis Using DJ Graphs.
Proceedings of the ACM SIGPLAN'96 Conference on Programming Language Design and Implementation (PLDI), 1996

Software Pipelining Showdown: Optimal vs. Heuristic Methods in a Production Compiler.
Proceedings of the ACM SIGPLAN'96 Conference on Programming Language Design and Implementation (PLDI), 1996

Measurement and Modeling of EARTH-MANNA Multithreaded Architecture.
Proceedings of the MASCOTS '96, 1996

Locality Analysis for Distributed Shared-Memory Multiprocessors.
Proceedings of the Languages and Compilers for Parallel Computing, 1996

Polling Watchdog: Combining Polling and Interrupts for Efficient Message Handling.
Proceedings of the 23rd Annual International Symposium on Computer Architecture, 1996

Co-Scheduling Hardware and Software Pipelines.
Proceedings of the Second International Symposium on High-Performance Computer Architecture, 1996

Quantitative studies of data-locality sensitivity on the EARTH multithreaded architecture: preliminary results.
Proceedings of the 3rd International Conference on High Performance Computing, 1996

Multithreading implementation of a distributed shortest path algorithm on EARTH multiprocessor.
Proceedings of the 3rd International Conference on High Performance Computing, 1996

Optimal Software Pipelining Through Enumeration of Schedules.
Proceedings of the Euro-Par '96 Parallel Processing, 1996

Pipelining-Dovetailing: A Transformation to Enhance Software Pipelining for Nested Loops.
Proceedings of the Compiler Construction, 6th International Conference, 1996

Data locality sensitivity of multithreaded computations on a distributed-memory multiprocessor.
Proceedings of the 1996 conference of the Centre for Advanced Studies on Collaborative Research, 1996

Compiling C for the EARTH multithreaded architecture.
Proceedings of the Fifth International Conference on Parallel Architectures and Compilation Techniques, 1996

1995
Rate-optimal schedule for multi-rate DSP computations.
J. VLSI Signal Process., 1995

Automatic Data and Computation Decomposition for Distributed-Memory Machines.
Parallel Process. Lett., 1995

Computing phi-nodes in linear time using DJ graphs.
J. Program. Lang., 1995

ABC++: Concurrency by Inheritance in C++.
IBM Syst. J., 1995

On memory models and cache management for shared-memory multiprocessors.
Proceedings of the Seventh IEEE Symposium on Parallel and Distributed Processing, 1995

A Linear Time Algorithm for Placing phi-nodes.
Proceedings of the Conference Record of POPL'95: 22nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 1995

Scheduling and Mapping: Software Pipelining in the Presence of Structural Hazards.
Proceedings of the ACM SIGPLAN'95 Conference on Programming Language Design and Implementation (PLDI), 1995

Exploiting short-lived variables in superscalar processors.
Proceedings of the 28th Annual International Symposium on Microarchitecture, Ann Arbor, Michigan, USA, November 29, 1995

An Experimental Study of an ILP-based Exact Solution Method for Software Pipelining.
Proceedings of the Languages and Compilers for Parallel Computing, 1995

The Threaded Communication Library: Preliminary Experiences on a Multiprocessor with Dual-Processor Nodes.
Proceedings of the 9th international conference on Supercomputing, 1995

Location Consistency: Stepping Beyond the Memory Coherence Barrier.
Proceedings of the 1995 International Conference on Parallel Processing, 1995

A Design Frame for Hybrid Access Caches.
Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture (HPCA 1995), 1995

Automatic data and computation decomposition for distributed memory machines.
Proceedings of the 28th Annual Hawaii International Conference on System Sciences (HICSS-28), 1995

Costs and Benefits of Multithreading with Off-the-Shelf RISC Processors.
Proceedings of the Euro-Par '95 Parallel Processing, 1995

A design study of the EARTH multiprocessor.
Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques, 1995

Advanced topics in dataflow computing and multithreading.
IEEE, ISBN: 978-0-8186-6542-4, 1995

1994
Performance of Interconnection Network in Multithreaded Architectures.
Proceedings of the PARLE '94: Parallel Architectures and Languages Europe, 1994

Minimizing register requirements under resource-constrained rate-optimal software pipelining.
Proceedings of the 27th Annual International Symposium on Microarchitecture, San Jose, California, USA, November 30, 1994

Building Multithreaded Architectures with Off-the-Shelf Microprocessors.
Proceedings of the 8th International Symposium on Parallel Processing, 1994

A Comparative Study of Multiprocessor List Scheduling Heuristics.
Proceedings of the 27th Annual Hawaii International Conference on System Sciences (HICSS-27), 1994

Automatic decomposition in EPPP compiler.
Proceedings of the 1994 Conference of the Centre for Advanced Studies on Collaborative Research, October 31, 1994

FTL: a multithreaded environment for parallel computation.
Proceedings of the 1994 Conference of the Centre for Advanced Studies on Collaborative Research, October 31, 1994

EPPP - an integrated environment for portable parallel programming.
Proceedings of the 1994 Conference of the Centre for Advanced Studies on Collaborative Research, October 31, 1994

Data parallelism with high performance C.
Proceedings of the 1994 Conference of the Centre for Advanced Studies on Collaborative Research, October 31, 1994

Minimizing memory requirements in rate-optimal schedules.
Proceedings of the International Conference on Application Specific Array Processors, 1994

Concurrent Execution of Heterogeneous Threads in the Super-Actor Machine.
Proceedings of the Multithreaded Computer Architecture, 1994

Multithreaded Architectures: Principles, Projects, and Issues.
Proceedings of the Multithreaded Computer Architecture, 1994

1993
Special Issue on DataFlow and Multithreaded Architectures - Guest Editors' Introduction.
J. Parallel Distributed Comput., 1993

An Efficient Hybrid Dataflow Architecture Model.
J. Parallel Distributed Comput., 1993

Designing Programming Languages for the Analyzability of Pointer Data Structures.
Comput. Lang., 1993

Analysis of Multithreaded Multiprocessors with Distributed Shared Memory.
Proceedings of the Fifth IEEE Symposium on Parallel and Distributed Processing, 1993

A Novel Framework of Register Allocation for Software Pipelining.
Proceedings of the Conference Record of the Twentieth Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 1993

A Kahn Principle for Networks of Nonmonotonic Real-time Processes.
Proceedings of the PARLE '93, 1993

Extending Software Pipelining Techniques for Scheduling Nested Loops.
Proceedings of the Languages and Compilers for Parallel Computing, 1993

Speculative Execution and Branch Prediction on Parallel Machines.
Proceedings of the 7th international conference on Supercomputing, 1993

A Novel Methodology Using Genetic Algorithms for the Design of Caches and Cache Replacement Policy.
Proceedings of the 5th International Conference on Genetic Algorithms, 1993

A novel framework for multi-rate scheduling in DSP applications.
Proceedings of the International Conference on Application-Specific Array Processors, 1993

1992
Optimal loop storage allocation for argument-fetching dataflow machines.
Int. J. Parallel Program., 1992

A high-speed memory organization for hybrid dataflow / von Neumann computing.
Future Gener. Comput. Syst., 1992

Minimizing Loop Storage Allocation for An Argument-Fetching Dataflow Architecture Model.
Proceedings of the PARLE '92: Parallel Architectures and Languages Europe, 1992

On the limits of program parallelism and its smoothability.
Proceedings of the 25th Annual International Symposium on Microarchitecture, 1992

Designing the McCAT Compiler Based on a Family of Structured Intermediate Representations.
Proceedings of the Languages and Compilers for Parallel Computing, 1992

Collective Loop Fusion for Array Contraction.
Proceedings of the Languages and Compilers for Parallel Computing, 1992

Efficient Interprocessor Synchronization/Communication on a Dataflow Multiprocessor Architecture.
Proceedings of the 1992 International Conference on Parallel Processing, 1992

Designing programming languages for analyzability: a fresh look at pointer data structures.
Proceedings of the ICCL'92, 1992

Performance Evaluation of Latency Tolerant Architectures.
Proceedings of the Computing and Information, 1992

Well-behaved dataflow programs for DSP computation.
Proceedings of the 1992 IEEE International Conference on Acoustics, 1992

A Polynomial Time Method for Optimal Software Pipelining.
Proceedings of the Parallel Processing: CONPAR 92, 1992

A Register Allocation Framework Based on Hierarchical Cyclic Interval Graphs.
Proceedings of the Compiler Construction, 1992

1991
Efficient support of concurrent threads in a hybrid dataflow/von Neumann architecture.
Proceedings of the Third IEEE Symposium on Parallel and Distributed Processing, 1991

An efficient parallel algorithm for all pairs examination.
Proceedings of Supercomputing '91, 1991

A Timed Petri-Net Model for Fine-Grain Loop Scheduling.
Proceedings of the ACM SIGPLAN'91 Conference on Programming Language Design and Implementation (PLDI), 1991

A Novel High-Speed Memory Organization for Fine-Grain Multi-Thread Computing.
Proceedings of the PARLE '91: Parallel Architectures and Languages Europe, 1991

Towards an Efficient Hybrid Dataflow Architecture Model.
Proceedings of the PARLE '91: Parallel Architectures and Languages Europe, 1991

Loop Storage Optimization for Dataflow Machines.
Proceedings of the Languages and Compilers for Parallel Computing, 1991

Optimization of array accesses by collective loop transformations.
Proceedings of the 5th international conference on Supercomputing, 1991

A code mapping scheme for dataflow software pipelining.
The Kluwer international series in engineering and computer science 125, Kluwer, ISBN: 978-0-7923-9130-2, 1991

1990
Exploiting fine-grain parallelism on dataflow architectures.
Parallel Comput., 1990

A strict monolithic array constructor.
Proceedings of the Second IEEE Symposium on Parallel and Distributed Processing, 1990

Towards efficient fine-grain software pipelining.
Proceedings of the 4th international conference on Supercomputing, 1990

An Efficient Scheme for Fine-Grain Software Pipelining.
Proceedings of the CONPAR 90, 1990

1989
Algorithmic Aspects of Balancing Techniques for Pipelined Data Flow Code Generation.
J. Parallel Distributed Comput., 1989

1988
Summary of the workshop on frontiers in functional programming and dataflow architecture.
SIGARCH Comput. Archit. News, 1988

An efficient pipelined dataflow processor architecture.
Proceedings of Supercomputing '88, Orlando, FL, USA, November 12-17, 1988

Design of an Efficient Dataflow Architecture without Data Flow.
Proceedings of the International Conference on Fifth Generation Computer Systems, 1988

1987
A stability classification method and its application to pipelined solution of linear recurrences.
Parallel Comput., 1987

1986
A pipelined code mapping scheme for static data flow computers.
PhD thesis, 1986

A Maximally Pipelined Tridiagonal Linear Equation Solver.
J. Parallel Distributed Comput., 1986

Maximum pipelining linear recurrence on static data flow computers.
Int. J. Parallel Program., 1986

A Pipelined Solution Method of Tridiagonal Linear Equation Systems.
Proceedings of the International Conference on Parallel Processing, 1986

1984
Modeling the Weather with a Data Flow Supercomputer.
IEEE Trans. Computers, 1984

1983
Maximum Pipelining of Array Operations on Static Data Flow Machine.
Proceedings of the International Conference on Parallel Processing, 1983

