Tarek S. Abdelrahman

CoRR, 2022

Reuse-Aware Partitioning of Dataflow Graphs on a Tightly-Coupled CGRA.

Nikhil Sambhus

Proceedings of the IEEE Intl Conf on Parallel & Distributed Processing with Applications, 2022

Optimization of Compiler-Generated OpenCL CNN Kernels and Runtime for FPGAs.

Seung-Hun Chung

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2022

2021

Pipelined Training with Stale Weights in Deep Convolutional Neural Networks.

Lifu Zhang

Appl. Comput. Intell. Soft Comput., 2021

A Streaming Accelerator for Heterogeneous CPU-FPGA Processing of Graph Applications.

Francis O'Brien

Matthew Agostini

Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops, 2021

2020

Cooperative Software-hardware Acceleration of K-means on a Tightly Coupled CPU-FPGA System.

ACM Trans. Archit. Code Optim., 2020

Optimizing OpenCL Kernels and Runtime for DNN Inference on FPGAs.

Seung-Hun Chung

Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops, 2020

Balancing Graph Processing Workloads Using Work Stealing on Heterogeneous CPU-FPGA Systems.

Matthew Agostini

Francis O'Brien

Proceedings of the ICPP 2020: 49th International Conference on Parallel Processing, 2020

2019

Retraining-free methods for fast on-the-fly pruning of convolutional neural networks.

Amir H. Ashouri

Alwyn Dos Remedios

Neurocomputing, 2019

2018

A Strategy for Automatic Performance Tuning of Stencil Computations on GPUs.

Joseph D. Garvey

Sci. Program., 2018

Fast On-the-fly Retraining-free Sparsification of Convolutional Neural Networks.

Amir H. Ashouri

Alwyn Dos Remedios

CoRR, 2018

User-Transparent Translation of Machine Instructions to Programmable Hardware.

Leslie Barron

Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018

2017

A Language and Preprocessor for User-Controlled Generation of Synthetic Programs.

Alton Chiu

Joseph Garvey

Sci. Program., 2017

Launch-Time Optimization of OpenCL GPU Kernels.

Andrew S. D. Lee

Proceedings of the General Purpose GPUs, 2017

Use of Synthetic Benchmarks for Machine-Learning-Based Performance Auto-Tuning.

Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

A Sampling Based Strategy to Automatic Performance Tuning of GPU Programs.

Wilson Feng

Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

2016

Accelerating K-means clustering on a tightly-coupled processor-FPGA heterogeneous system.

Proceedings of the 27th IEEE International Conference on Application-specific Systems, 2016

2015

Clean: a race detector with cleaner semantics.

Cedomir Segulja

Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015

Automatic Performance Tuning of Stencil Computations on GPUs.

Joseph D. Garvey

Proceedings of the 44th International Conference on Parallel Processing, 2015

Genesis: a language for generating synthetic training programs for machine learning.

Alton Chiu

Joseph Garvey

Proceedings of the 12th ACM International Conference on Computing Frontiers, 2015

2014

Automatic Tuning of Local Memory Use on GPGPUs.

CoRR, 2014

Tile-based bottom-up compilation of custom mesh-of-functional-units FPGA overlays.

Proceedings of the 24th International Conference on Field Programmable Logic and Applications, 2014

What is the cost of weak determinism?

Cedomir Segulja

Proceedings of the International Conference on Parallel Architectures and Compilation, 2014

2013

Microarchitecture of a Coarse-Grain Out-of-Order Superscalar Processor.

IEEE Trans. Parallel Distributed Syst., 2013

Parallel Radix Sort on the AMD Fusion Accelerated Processing Unit.

Michael C. Delorme

Chengyan Zhao

Proceedings of the 42nd International Conference on Parallel Processing, 2013

A high-performance overlay architecture for pipelined execution of data flow graphs.

Proceedings of the 23rd International Conference on Field programmable Logic and Applications, 2013

Reducing divergence in GPGPU programs with loop merging.

Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, 2013

2012

Relaxed Concurrency Control in Software Transactional Memory.

Utku Aydonat

IEEE Trans. Parallel Distributed Syst., 2012

Inlining with traces in Java programs.

Comput. Syst. Sci. Eng., 2012

Architectural support for synchronization-free deterministic parallel programming.

Cedomir Segulja

Proceedings of the 18th IEEE International Symposium on High Performance Computer Architecture, 2012

Efficient bottom-up heap analysis for symbolic path-based data access summaries.

Ivan Matosevic

Proceedings of the 10th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2012

2011

hiCUDA: High-Level GPGPU Programming.

IEEE Trans. Parallel Distributed Syst., 2011

Parallelization of multimedia applications on the multi-level computing architecture.

Utku Aydonat

J. Embed. Comput., 2011

Towards Synthesis-Free JIT Compilation to Commodity FPGAs.

Proceedings of the IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines, 2011

Reducing branch divergence in GPU programs.

Proceedings of 4th Workshop on General Purpose Processing on Graphics Processing Units, 2011

2010

Hardware Support for Relaxed Concurrency Control in Transactional Memory.

Utku Aydonat

Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010

2009

A study of potential parallelism among traces in Java programs.

Sci. Comput. Program., 2009

The use of hardware transactional memory for the trace-based parallelization of recursive Java programs.

Proceedings of the 7th International Conference on Principles and Practice of Programming in Java, 2009

hiCUDA: a high-level directive-based language for GPU programming.

Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, 2009

2007

The potential of trace-level parallelism in Java programs.

Proceedings of the 5th International Symposium on Principles and Practice of Programming in Java, 2007

Automatic Trace-Based Parallelization of Java Programs.

Proceedings of the 2007 International Conference on Parallel Processing (ICPP 2007), 2007

2006

Locality management using multiple SPMs on the Multi-Level Computing Architecture.

Ahmed Abdelkhalek

Proceedings of the 2006 4th Workshop on Embedded Systems for Real-Time Multimedia, 2006

2005

Power Optimization for the MLCA Using Dynamic Voltage Scaling.

Proceedings of the 9th International Workshop on Software and Compilers for Embedded Systems, Dallas, Texas, USA, September 29, 2005

A Characterization of Traces in Java Programs.

Proceedings of The 2005 International Conference on Programming Languages and Compilers, 2005

2004

Run-Time Support for the Automatic Parallelization of Java Programs.

Bryan Chan

J. Supercomput., 2004

The design and implementation of a modular and extensible Java Virtual Machine.

Patrick Doyle

Carlos Cavanna

Softw. Pract. Exp., 2004

A Multilevel Computing Architecture for Embedded Multimedia Applications.

IEEE Micro, 2004

Improving the structure of loop nests in scientific programs.

Robert Sawaya

Comput. Syst. Sci. Eng., 2004

The Use of Traces for Inlining in Java Programs.

Proceedings of the Languages and Compilers for High Performance Computing, 2004

Catenation and specialization for Tcl virtual machine performance.

Benjamin Vitale

Proceedings of the 2004 Workshop on Interpreters, Virtual Machines and Emulators, 2004

2002

A Modular and Extensible JVM Infrastructure.

Patrick Doyle

Proceedings of the 2nd Java Virtual Machine Research and Technology Symposium, 2002

2001

Exploiting Wavefront Parallelism on Large-Scale Shared-Memory Multiprocessors.

Naraig Manjikian

IEEE Trans. Parallel Distributed Syst., 2001

A Compiler Infrastructure for High-Performance Java.

Neil V. Brewster

Proceedings of the High-Performance Computing and Networking, 9th International Conference, 2001

2000

The NUMAchine Multiprocessor.

Proceedings of the 2000 International Conference on Parallel Processing, 2000

1999

Overlap of Computation and Communication on Shared-Memory.

Gary Liu

Scalable Comput. Pract. Exp., 1999

1998

Compiler Support for Array Distribution on NUMA Shared Memory Multiprocessors.

Thomas N. Wong

J. Supercomput., 1998

Locality Enhancement for Large-Scale Shared-Memory Multiprocessors.

Proceedings of the Languages, 1998

1997

Fusion of Loops for Parallelism and Locality.

Naraig Manjikian

IEEE Trans. Parallel Distributed Syst., 1997

Tuning Shared Network Cache Size vs. Second-Level Cache Size in Clusters-Based Multiprocessors.

Proceedings of the Parallel Computing Technologies, 1997

Automatic Partitioning of Data and Computations on Scalable Shared Memory Multiprocessors.

Sudarsan Tandri

Proceedings of the 1997 International Conference on Parallel Processing (ICPP '97), 1997

1996

Latency hiding on COMA multiprocessors.

J. Supercomput., 1996

Evaluation of Dynamic Data Distributions on NUMA Shared Memory Multiprocessors.

Kenneth L. Ma

Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, 1996

Exploiting Task-Level Parallelism Using pTask.

Sum Huynh

Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, 1996

Automatic Data and Computation Partitioning on Scalable Shared Memory Multiprocessors.

Sudarsan Tandri

Proceedings of the Languages and Compilers for Parallel Computing, 1996

Scheduling of Wavefront Parallelism on Scalable Shared-memory Multiprocessors.

Naraig Manjikian

Proceedings of the 1996 International Conference on Parallel Processing, 1996

1995

Computation and Data Partitioning on Scalable Shared Memory Multiprocessors.

Sudarsan Tandri