Tarek S. Abdelrahman

ORCID: 0000-0002-2985-4873

Affiliations:
  • University of Toronto, Canada


According to our database, Tarek S. Abdelrahman authored at least 67 papers between 1995 and 2023.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2023
Efficient Data Streaming for a Tightly-Coupled Coarse-Grained Reconfigurable Array.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

2022
A Compilation Flow for the Generation of CNN Inference Accelerators on FPGAs.
CoRR, 2022

Reuse-Aware Partitioning of Dataflow Graphs on a Tightly-Coupled CGRA.
Proceedings of the IEEE Intl Conf on Parallel & Distributed Processing with Applications, 2022

Optimization of Compiler-Generated OpenCL CNN Kernels and Runtime for FPGAs.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2022

2021
Pipelined Training with Stale Weights in Deep Convolutional Neural Networks.
Appl. Comput. Intell. Soft Comput., 2021

A Streaming Accelerator for Heterogeneous CPU-FPGA Processing of Graph Applications.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops, 2021

2020
Cooperative Software-hardware Acceleration of K-means on a Tightly Coupled CPU-FPGA System.
ACM Trans. Archit. Code Optim., 2020

Optimizing OpenCL Kernels and Runtime for DNN Inference on FPGAs.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops, 2020

Balancing Graph Processing Workloads Using Work Stealing on Heterogeneous CPU-FPGA Systems.
Proceedings of the ICPP 2020: 49th International Conference on Parallel Processing, 2020

2019
Retraining-free methods for fast on-the-fly pruning of convolutional neural networks.
Neurocomputing, 2019

2018
A Strategy for Automatic Performance Tuning of Stencil Computations on GPUs.
Sci. Program., 2018

Fast On-the-fly Retraining-free Sparsification of Convolutional Neural Networks.
CoRR, 2018

User-Transparent Translation of Machine Instructions to Programmable Hardware.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018

2017
A Language and Preprocessor for User-Controlled Generation of Synthetic Programs.
Sci. Program., 2017

Launch-Time Optimization of OpenCL GPU Kernels.
Proceedings of the General Purpose GPUs, 2017

Use of Synthetic Benchmarks for Machine-Learning-Based Performance Auto-Tuning.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

A Sampling Based Strategy to Automatic Performance Tuning of GPU Programs.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

2016
Accelerating K-means clustering on a tightly-coupled processor-FPGA heterogeneous system.
Proceedings of the 27th IEEE International Conference on Application-specific Systems, 2016

2015
Clean: a race detector with cleaner semantics.
Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015

Automatic Performance Tuning of Stencil Computations on GPUs.
Proceedings of the 44th International Conference on Parallel Processing, 2015

Genesis: a language for generating synthetic training programs for machine learning.
Proceedings of the 12th ACM International Conference on Computing Frontiers, 2015

2014
Automatic Tuning of Local Memory Use on GPGPUs.
CoRR, 2014

Tile-based bottom-up compilation of custom mesh-of-functional-units FPGA overlays.
Proceedings of the 24th International Conference on Field Programmable Logic and Applications, 2014

What is the cost of weak determinism?
Proceedings of the International Conference on Parallel Architectures and Compilation, 2014

2013
Microarchitecture of a Coarse-Grain Out-of-Order Superscalar Processor.
IEEE Trans. Parallel Distributed Syst., 2013

Parallel Radix Sort on the AMD Fusion Accelerated Processing Unit.
Proceedings of the 42nd International Conference on Parallel Processing, 2013

A high-performance overlay architecture for pipelined execution of data flow graphs.
Proceedings of the 23rd International Conference on Field programmable Logic and Applications, 2013

Reducing divergence in GPGPU programs with loop merging.
Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, 2013

2012
Relaxed Concurrency Control in Software Transactional Memory.
IEEE Trans. Parallel Distributed Syst., 2012

Inlining with traces in Java programs.
Comput. Syst. Sci. Eng., 2012

Architectural support for synchronization-free deterministic parallel programming.
Proceedings of the 18th IEEE International Symposium on High Performance Computer Architecture, 2012

Efficient bottom-up heap analysis for symbolic path-based data access summaries.
Proceedings of the 10th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2012

2011
hiCUDA: High-Level GPGPU Programming.
IEEE Trans. Parallel Distributed Syst., 2011

Parallelization of multimedia applications on the multi-level computing architecture.
J. Embed. Comput., 2011

Towards Synthesis-Free JIT Compilation to Commodity FPGAs.
Proceedings of the IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines, 2011

Reducing branch divergence in GPU programs.
Proceedings of 4th Workshop on General Purpose Processing on Graphics Processing Units, 2011

2010
Hardware Support for Relaxed Concurrency Control in Transactional Memory.
Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010

2009
A study of potential parallelism among traces in Java programs.
Sci. Comput. Program., 2009

The use of hardware transactional memory for the trace-based parallelization of recursive Java programs.
Proceedings of the 7th International Conference on Principles and Practice of Programming in Java, 2009

hiCUDA: a high-level directive-based language for GPU programming.
Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, 2009

2007
The potential of trace-level parallelism in Java programs.
Proceedings of the 5th International Symposium on Principles and Practice of Programming in Java, 2007

Automatic Trace-Based Parallelization of Java Programs.
Proceedings of the 2007 International Conference on Parallel Processing (ICPP 2007), 2007

2006
Locality management using multiple SPMs on the Multi-Level Computing Architecture.
Proceedings of the 2006 4th Workshop on Embedded Systems for Real-Time Multimedia, 2006

2005
Power Optimization for the MLCA Using Dynamic Voltage Scaling.
Proceedings of the 9th International Workshop on Software and Compilers for Embedded Systems, Dallas, Texas, USA, September 29, 2005

A Characterization of Traces in Java Programs.
Proceedings of The 2005 International Conference on Programming Languages and Compilers, 2005

2004
Run-Time Support for the Automatic Parallelization of Java Programs.
J. Supercomput., 2004

The design and implementation of a modular and extensible Java Virtual Machine.
Softw. Pract. Exp., 2004

A Multilevel Computing Architecture for Embedded Multimedia Applications.
IEEE Micro, 2004

Improving the structure of loop nests in scientific programs.
Comput. Syst. Sci. Eng., 2004

The Use of Traces for Inlining in Java Programs.
Proceedings of the Languages and Compilers for High Performance Computing, 2004

Catenation and specialization for Tcl virtual machine performance.
Proceedings of the 2004 Workshop on Interpreters, Virtual Machines and Emulators, 2004

2002
A Modular and Extensible JVM Infrastructure.
Proceedings of the 2nd Java Virtual Machine Research and Technology Symposium, 2002

2001
Exploiting Wavefront Parallelism on Large-Scale Shared-Memory Multiprocessors.
IEEE Trans. Parallel Distributed Syst., 2001

A Compiler Infrastructure for High-Performance Java.
Proceedings of the High-Performance Computing and Networking, 9th International Conference, 2001

2000

1999
Overlap of Computation and Communication on Shared-Memory.
Scalable Comput. Pract. Exp., 1999

1998
Compiler Support for Array Distribution on NUMA Shared Memory Multiprocessors.
J. Supercomput., 1998

Locality Enhancement for Large-Scale Shared-Memory Multiprocessors.
Proceedings of the Languages, 1998

1997
Fusion of Loops for Parallelism and Locality.
IEEE Trans. Parallel Distributed Syst., 1997

Tuning Shared Network Cache Size vs. Second-Level Cache Size in Clusters-Based Multiprocessors.
Proceedings of the Parallel Computing Technologies, 1997

Automatic Partitioning of Data and Computations on Scalable Shared Memory Multiprocessors.
Proceedings of the 1997 International Conference on Parallel Processing (ICPP '97), 1997

1996
Latency hiding on COMA multiprocessors.
J. Supercomput., 1996

Evaluation of Dynamic Data Distributions on NUMA Shared Memory Multiprocessors.
Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, 1996

Exploiting Task-Level Parallelism Using pTask.
Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, 1996

Automatic Data and Computation Partitioning on Scalable Shared Memory Multiprocessors.
Proceedings of the Languages and Compilers for Parallel Computing, 1996

Scheduling of Wavefront Parallelism on Scalable Shared-memory Multiprocessors.
Proceedings of the 1996 International Conference on Parallel Processing, 1996

1995
Computation and Data Partitioning on Scalable Shared Memory Multiprocessors.
Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, 1995

