Michela Becchi

Proceedings of the IEEE International Symposium on Workload Characterization, 2025

Exploring Lossy Compression of Activation Data for Emerging AI Accelerators: A Case Study on the Graphcore IPU.

Proceedings of the IEEE International Symposium on Workload Characterization, 2025

DIMPLES: Distributed Influence Maximization for Pandemic pLanning on Exascale Systems.

Parantapa Bhattacharya

Proceedings of the 39th ACM International Conference on Supercomputing, 2025

2024

Picasso: Memory-Efficient Graph Coloring Using Palettes With Applications in Quantum Computing.

CoRR, 2024

Picasso: Memory-Efficient Graph Coloring Using Palettes With Applications in Quantum Computing.

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

FuseIM: Fusing Probabilistic Traversals for Influence Maximization on Exascale Systems.

Reece Neff

Marco Minutoli

Antonino Tumeo

Ananth Kalyanaraman

Proceedings of the 38th ACM International Conference on Supercomputing, 2024

Significantly Improving Fixed-Ratio Compression Framework for Resource-limited Applications.

Proceedings of the 53rd International Conference on Parallel Processing, 2024

A Portable, Fast, DCT-based Compressor for AI Accelerators.

Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 2024

A Transducers-based Programming Framework for Efficient Data Transformation.

Tri Nguyen

Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques, 2024

2023

Fused Breadth-First Probabilistic Traversals on Distributed GPU Systems.

Reece Neff

Marco Minutoli

Antonino Tumeo

Ananth Kalyanaraman

CoRR, 2023

GPU-Accelerated Error-Bounded Compression Framework for Quantum Circuit Simulations.

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

Evaluating Asynchronous Parallel I/O on HPC Systems.

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

Lightweight Huffman Coding for Efficient GPU Compression.

Proceedings of the 37th International Conference on Supercomputing, 2023

A Code Transformation to Improve the Efficiency of OpenCL Code on FPGA through Pipes.

Proceedings of the 20th ACM International Conference on Computing Frontiers, 2023

High-Level Synthesis of Irregular Applications: A Case Study on Influence Maximization.

Proceedings of the 20th ACM International Conference on Computing Frontiers, 2023

Runway: In-transit Data Compression on Heterogeneous HPC Systems.

John Ravi

Suren Byna

Proceedings of the 23rd IEEE/ACM International Symposium on Cluster, 2023

2022

Enabling The Feed-Forward Design Model in OpenCL Using Pipes.

CoRR, 2022

Accelerating Random Forest Classification on GPU and FPGA.

Proceedings of the 51st International Conference on Parallel Processing, 2022

A GPU-accelerated Data Transformation Framework Rooted in Pushdown Transducers.

Tri Nguyen

Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022

Data Transformation Acceleration using Deterministic Finite-State Transducers.

Proceedings of the IEEE International Conference on Big Data, 2022

2021

Exploring Thread Coarsening on FPGA.

Reece Neff

Proceedings of the 28th IEEE International Conference on High Performance Computing, 2021

PILOT: a Runtime System to Manage Multi-tenant GPU Unified Memory Footprint.

Proceedings of the 28th IEEE International Conference on High Performance Computing, 2021

2020

A Loop-Aware Autotuner for High-Precision Floating-Point Applications.

Paul Beata

Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2020

GPU-Based Static Data-Flow Analysis for Fast and Scalable Android App Vetting.

Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

Evaluating Thread Coarsening and Low-cost Synchronization on Intel Xeon Phi.

Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

Optimizing Complex OpenCL Code for FPGA: A Case Study on Finite Automata Traversal.

Marziyeh Nourian

Proceedings of the 26th IEEE International Conference on Parallel and Distributed Systems, 2020

GPU-FPtuner: Mixed-precision Auto-tuning for Floating-point Applications on GPU.

Proceedings of the 27th IEEE International Conference on High Performance Computing, 2020

A Flexible and Scalable NTT Hardware: Applications from Homomorphically Encrypted Deep Learning to Post-Quantum Cryptography.

Proceedings of the 2020 Design, Automation & Test in Europe Conference & Exhibition, 2020

2019

Editorial: Special Issue on Computing Frontiers.

Francesca Palumbo

J. Signal Process. Syst., 2019

Evaluating High Performance Pattern Matching on the Automata Processor.

IEEE Trans. Computers, 2019

Characterizing the Performance/Accuracy Tradeoff of High-Precision Applications via Auto-tuning.

Paul Beata

Proceedings of the IEEE International Symposium on Workload Characterization, 2019

A comparative study of parallel programming frameworks for distributed GPU applications.

Proceedings of the 16th ACM International Conference on Computing Frontiers, 2019

2018

A Compiler Framework for Fixed-Topology Non-Deterministic Finite Automata on SIMD Platforms.

Marziyeh Nourian

Proceedings of the 24th IEEE International Conference on Parallel and Distributed Systems, 2018

Compiling SIMT Programs on Multi- and Many-Core Processors with Wide Vector Units: A Case Study with CUDA.

John Ravi

Proceedings of the 25th IEEE International Conference on High Performance Computing, 2018

2017

A Principled Approach to Secure Multi-core Processor Design with ReWire.

ACM Trans. Embed. Comput. Syst., 2017

Fast Integral Histogram Computations on GPU for Real-Time Video Analytics.

Mahdieh Poostchi

Kannappan Palaniappan

CoRR, 2017

Understanding the performance-accuracy tradeoffs of floating-point arithmetic on GPUs.

Proceedings of the 2017 IEEE International Symposium on Workload Characterization, 2017

Demystifying automata processing: GPUs, FPGAs or Micron's AP?

Proceedings of the International Conference on Supercomputing, 2017

An Analytical Study of Recursive Tree Traversal Patterns on Multi- and Many-Core Platforms.

Proceedings of the 23rd IEEE International Conference on Parallel and Distributed Systems, 2017

A Memory-Efficient GPU Method for Hamming and Levenshtein Distance Similarity.

Andrew Todd

Marziyeh Nourian

Proceedings of the 24th IEEE International Conference on High Performance Computing, 2017

2016

A programming model for reconfigurable computing based in functional concurrency.

Proceedings of the 11th International Symposium on Reconfigurable Communication-centric Systems-on-Chip, 2016

Compiler-Assisted Workload Consolidation for Efficient Dynamic Parallelism on GPU.

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

High Performance Pattern Matching Using the Automata Processor.

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

Parallel Gene Upstream Comparison via Multi-Level Hash Tables on GPU.

Proceedings of the 22nd IEEE International Conference on Parallel and Distributed Systems, 2016

IVM: a task-based shared memory programming model and runtime system to enable uniform access to CPU-GPU clusters.

Proceedings of the ACM International Conference on Computing Frontiers, CF'16, 2016

Evaluating the Energy Efficiency of Deep Convolutional Neural Networks on CPUs and GPUs.

Proceedings of the 2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), 2016

O3FA: A Scalable Finite Automata-based Pattern-Matching Engine for Out-of-Order Deep Packet Inspection.

Proceedings of the 2016 Symposium on Architectures for Networking and Communications Systems, 2016

2015

Fast support for unstructured data processing: the unified automata processor.

Proceedings of the 48th International Symposium on Microarchitecture, 2015

Semantics Driven Hardware Design, Implementation, and Verification with ReWire.

Proceedings of the 16th ACM SIGPLAN/SIGBED Conference on Languages, 2015

Accelerating regular expression matching over compressed HTTP.

Proceedings of the 2015 IEEE Conference on Computer Communications, 2015

Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations.

Proceedings of the 44th International Conference on Parallel Processing, 2015

Improving Application Concurrency on GPUs by Managing Implicit and Explicit Synchronizations.

Michael Butler

Proceedings of the 21st IEEE International Conference on Parallel and Distributed Systems, 2015

Exploiting Dynamic Parallelism to Efficiently Support Irregular Nested Loops on GPUs.

Proceedings of the 2015 International Workshop on Code Optimisation for Multi and Many Cores, 2015

Hardware Synthesis from Functional Embedded Domain-Specific Languages: A Case Study in Regular Expression Compilation.

Proceedings of the Applied Reconfigurable Computing - 11th International Symposium, 2015

2014

Large-Scale Pairwise Alignments on GPU Clusters: Exploring the Implementation Space.

J. Signal Process. Syst., 2014

Revisiting State Blow-Up: Automatically Building Augmented-FA While Preserving Functional Equivalence.

Xiaodong Yu

Bill Lin

IEEE J. Sel. Areas Commun., 2014

GRapid: A compilation and runtime framework for rapid prototyping of graph applications on many-core processors.

Srimat T. Chakradhar

Proceedings of the 20th IEEE International Conference on Parallel and Distributed Systems, 2014

A flexible scheduling framework for heterogeneous CPU-GPU clusters.

Tejaswi Agarwal

Proceedings of the 21st International Conference on High Performance Computing, 2014

Design of a hybrid MPI-CUDA benchmark suite for CPU-GPU clusters.

Tejaswi Agarwal

Proceedings of the International Conference on Parallel Architectures and Compilation, 2014

2013

A-DFA: A Time- and Space-Efficient DFA Compression Algorithm for Fast Regular Expression Evaluation.

ACM Trans. Archit. Code Optim., 2013

Exploring different automata representations for efficient regular expression matching on GPUs.

Xiaodong Yu

Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2013

Deploying Graph Algorithms on GPUs: An Adaptive Solution.

Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

A preemption-based runtime to efficiently schedule multi-process applications on heterogeneous clusters with GPUs.

Xiang Wang

Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing, 2013

Semantics-directed machine architecture in ReWire.

Proceedings of the 2013 International Conference on Field-Programmable Technology, 2013

GPU acceleration of regular expression matching for large datasets: exploring the implementation space.

Xiaodong Yu

Proceedings of the Computing Frontiers Conference, 2013

A distributed CPU-GPU framework for pairwise alignments on large-scale sequence datasets.

Proceedings of the 24th International Conference on Application-Specific Systems, 2013

Picking pesky parameters: Optimizing regular expression matching in practice.

Proceedings of the Symposium on Architecture for Networking and Communications Systems, 2013

2012

A Massively Parallel, Energy Efficient Programmable Accelerator for Learning and Classification.

ACM Trans. Archit. Code Optim., 2012

Formal Semantics of Heterogeneous CUDA-C: A Modular Approach with Applications.

Proceedings of the Seventh Conference on Systems Software Verification, 2012

ValuePack: value-based scheduling framework for CPU-GPU clusters.

Proceedings of the SC Conference on High Performance Computing Networking, 2012

Poster: Multiple Pairwise Sequence Alignments with the Needleman-Wunsch Algorithm on GPU.

Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Abstract: Multiple Pairwise Sequence Alignments with the Needleman-Wunsch Algorithm on GPU.

Proceedings of the 2012 SC Companion: High Performance Computing, 2012

A virtual memory based runtime to support multi-tenancy in clusters with GPUs.

Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing, 2012

Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes.

Proceedings of the 12th IEEE/ACM International Symposium on Cluster, 2012

Efficient GPU Implementation of the Integral Histogram.

Mahdieh Poostchi

Kannappan Palaniappan

Filiz Bunyak

Guna Seetharaman

Proceedings of the Computer Vision - ACCV 2012 Workshops, 2012

2011

Supporting GPU sharing in cloud environments with a transparent runtime consolidation framework.

Proceedings of the 20th ACM International Symposium on High Performance Distributed Computing, 2011

2010

Data-aware scheduling of legacy kernels on heterogeneous platforms with distributed memory.

Proceedings of the 22nd Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2010), 2010

A programmable parallel accelerator for learning and classification.

Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, 2010

2009

Evaluating regular expression matching engines on network and general purpose processors.

Charlie Wiseman

Proceedings of the 2009 ACM/IEEE Symposium on Architecture for Networking and Communications Systems, 2009

2008

A workload for evaluating deep packet inspection architectures.

Mark A. Franklin

Proceedings of the 4th International Symposium on Workload Characterization (IISWC 2008), 2008

Extending finite automata to efficiently match Perl-compatible regular expressions.

Proceedings of the 2008 ACM Conference on Emerging Network Experiment and Technology, 2008

A remotely accessible network processor-based router for network experimentation.

Proceedings of the 2008 ACM/IEEE Symposium on Architecture for Networking and Communications Systems, 2008

Efficient regular expression evaluation: theory to practice.

Proceedings of the 2008 ACM/IEEE Symposium on Architecture for Networking and Communications Systems, 2008

2007

Memory-Efficient Regular Expression Search Using State Merging.

Srihari Cadambi

Proceedings of the 26th IEEE International Conference on Computer Communications (INFOCOM 2007), 2007

A hybrid finite automaton for practical deep packet inspection.

Proceedings of the 2007 ACM Conference on Emerging Network Experiment and Technology, 2007

Performance/area efficiency in chip multiprocessors with micro-caches.

Mark A. Franklin

Proceedings of the 4th Conference on Computing Frontiers, 2007

An improved algorithm to accelerate regular expression evaluation.

Proceedings of the 2007 ACM/IEEE Symposium on Architecture for Networking and Communications Systems, 2007

2006

Dynamic thread assignment on heterogeneous multiprocessor architectures.
