Manuel E. Acacio

J. Supercomput., June, 2025

WoperTM: Got Nacks? Use Them!

IEEE Comput. Archit. Lett., 2025

QuFi: Adaptive Tiled Gustavson Output Reuse for Edge Sparse DNN Accelerators.

Proceedings of the 43rd IEEE International Conference on Computer Design, 2025

No Rush in Executing Atomic Instructions.

Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2025

ACTA: Automatic Configuration of the Tensor Memory Accelerator for High-End GPUs.

Proceedings of the 17th Workshop on General Purpose Processing Using GPU, 2025

2024

On the interactions between ILP and TLP with hardware transactional memory.

Microprocess. Microsystems, 2024

Chaining Transactions for Effective Concurrency Management in Hardware Transactional Memory.

Proceedings of the 57th IEEE/ACM International Symposium on Microarchitecture, 2024

2023

STIFT: A Spatio-Temporal Integrated Folding Tree for Efficient Reductions in Flexible DNN Accelerators.

ACM J. Emerg. Technol. Comput. Syst., October, 2023

Speculative inter-thread store-to-load forwarding in SMT architectures.

J. Parallel Distributed Comput., March, 2023

Flexagon: A Multi-dataflow Sparse-Sparse Matrix Multiplication Accelerator for Efficient DNN Processing.

Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023

CELLO: Compiler-Assisted Efficient Load-Load Ordering in Data-Race-Free Regions.

Proceedings of the 32nd International Conference on Parallel Architectures and Compilation Techniques, 2023

2022

DeTraS: Delaying Stores for Friendly-Fire Mitigation in Hardware Transactional Memory.

IEEE Trans. Parallel Distributed Syst., 2022

Analysing software prefetching opportunities in hardware transactional memory.

Marina Shimchenko

J. Supercomput., 2022

Analysis of the Interactions Between ILP and TLP With Hardware Transactional Memory.

Proceedings of the 30th Euromicro International Conference on Parallel, 2022

Understanding the Design-Space of Sparse/Dense Multiphase GNN dataflows on Spatial Accelerators.

Raveesh Garg

Eric Qin

Sivasankaran Rajamanickam

Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium, 2022

2021

A Taxonomy for Classification and Comparison of Dataflows for GNN Accelerators.

Raveesh Garg

Eric Qin

Sivasankaran Rajamanickam

CoRR, 2021

A novel network fabric for efficient spatio-temporal reduction in flexible DNN accelerators.

Proceedings of the NOCS '21: International Symposium on Networks-on-Chip, 2021

ITSLF: Inter-Thread Store-to-Load Forwarding in Simultaneous Multithreading.

Proceedings of the MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021

STONNE: Enabling Cycle-Level Microarchitectural Simulation for DNN Inference Accelerators.

Proceedings of the IEEE International Symposium on Workload Characterization, 2021

2020

Concurrent Irrevocability in Best-Effort Hardware Transactional Memory.

IEEE Trans. Parallel Distributed Syst., 2020

Ant Colony Optimization-Based Streaming Feature Selection: An Application to the Medical Image Diagnosis.

Manuel E. Acacio Sanchez

Sci. Program., 2020

A Machine Learning Gateway for Scientific Workflow Design.

Manuel E. Acacio Sanchez

Sci. Program., 2020

PfTouch: Concurrent page-fault handling for Intel restricted transactional memory.

J. Parallel Distributed Comput., 2020

STONNE: A Detailed Architectural Simulator for Flexible Neural Network Accelerators.

CoRR, 2020

2019

Way Combination for an Adaptive and Scalable Coherence Directory.

IEEE Trans. Parallel Distributed Syst., 2019

InsideNet: A tool for characterizing convolutional neural networks.

Future Gener. Comput. Syst., 2019

Foreword to the Special Issue on Processors, Interconnects, Storage, and Caches for Exascale Systems.

Julio Sahuquillo

Concurr. Comput. Pract. Exp., 2019

CNN-SIM: A Detailed Arquitectural Simulator of CNN Accelerators.

Proceedings of the Euro-Par 2019: Parallel Processing Workshops, 2019

2018

Parallel implementations of the 3D fast wavelet transform on a Raspberry Pi 2 cluster.

Raúl Hernández

J. Supercomput., 2018

On the Parallelization of Stream Compaction on a Low-Cost SDC Cluster.

Sci. Program., 2018

Photonic-based express coherence notifications for many-core CMPs.

J. Parallel Distributed Comput., 2018

SAWS: Simple and Adaptive Warp Scheduling for Improved Performance in Throughput Processors.

Proceedings of the 26th Euromicro International Conference on Parallel, 2018

2017

To be silent or not: on the impact of evictions of clean data in cache-coherent multicores.

J. Supercomput., 2017

A dedicated private-shared cache design for scalable multiprocessors.

Juan M. Cebrian

Alexandra Jimborean

Concurr. Comput. Pract. Exp., 2017

Way-combining directory: an adaptive and scalable low-cost coherence directory.

Proceedings of the International Conference on Supercomputing, 2017

2016

Are distributed sharing codes a solution to the scalability problem of coherence directories in manycores? An evaluation study.

J. Supercomput., 2016

Optimization of a Linked Cache Coherence Protocol for Scalable Manycore Coherence.

Proceedings of the Architecture of Computing Systems - ARCS 2016, 2016

2015

Hardware Approaches to Transactional Memory in Chip Multiprocessors.

Proceedings of the Handbook on Data Centers, 2015

Efficient Hardware-Supported Synchronization Mechanisms for Manycores.

Proceedings of the Handbook on Data Centers, 2015

DASC-DIR: a low-overhead coherence directory for many-core processors.

J. Supercomput., 2015

Fast and efficient commits for Lazy-Lazy hardware transactional memory.

J. Supercomput., 2015

Adaptive Selection of Cache Indexing Bits for Removing Conflict Misses.

Polychronis Xekalakis

Marcelo Cintra

IEEE Trans. Computers, 2015

Early Experiences with Separate Caches for Private and Shared Data.

Juan M. Cebrian

Proceedings of the 11th IEEE International Conference on e-Science, 2015

2014

ZEBRA: Data-Centric Contention Management in Hardware Transactional Memory.

IEEE Trans. Parallel Distributed Syst., 2014

Selective dynamic serialization for reducing energy consumption in hardware transactional memory systems.

J. Supercomput., 2014

Characterization of a List-Based Directory Cache Coherence Protocol for Manycore CMPs.

Proceedings of the Euro-Par 2014: Parallel Processing Workshops, 2014

2013

Eager Beats Lazy: Improving Store Management in Eager Hardware Transactional Memory.

IEEE Trans. Parallel Distributed Syst., 2013

Efficient Eager Management of Conflicts for Scalable Hardware Transactional Memory.

IEEE Trans. Parallel Distributed Syst., 2013

Design of an efficient communication infrastructure for highly contended locks in many-core CMPs.

J. Parallel Distributed Comput., 2013

On the design of energy-efficient hardware transactional memory systems.

Concurr. Comput. Pract. Exp., 2013

ECONO: Express coherence notifications for efficient cache coherency in many-core CMPs.

Juan Fernández Peinador

Proceedings of the 2013 International Conference on Embedded Computer Systems: Architectures, 2013

Efficient Dir0B Cache Coherency for Many-Core CMPs.

Proceedings of the International Conference on Computational Science, 2013

Towards Efficient Dynamic LLC Home Bank Mapping with NoC-Level Support.

Mario Lodde

José Flich

Proceedings of the Euro-Par 2013 Parallel Processing, 2013

Deploying Hardware Locks to Improve Performance and Energy Efficiency of Hardware Transactional Memory.

Proceedings of the Architecture of Computing Systems - ARCS 2013, 2013

2012

Efficient Hardware Barrier Synchronization in Many-Core CMPs.

IEEE Trans. Parallel Distributed Syst., 2012

Stencil computations on heterogeneous platforms for the Jacobi method: GPUs versus Cell BE.

J. Supercomput., 2012

Extending Magny-Cours Cache Coherence.

Blas Cuesta Saez

IEEE Trans. Computers, 2012

Hardware transactional memory with software-defined conflicts.

ACM Trans. Archit. Code Optim., 2012

Using Heterogeneous Networks to Improve Energy Efficiency in Direct Coherence Protocols for Many-Core CMPs.

Proceedings of the IEEE 24th International Symposium on Computer Architecture and High Performance Computing, 2012

Dynamic Serialization: Improving Energy Consumption in Eager-Eager Hardware Transactional Memory Systems.

Proceedings of the 20th Euromicro International Conference on Parallel, 2012

Heterogeneous NoC Design for Efficient Broadcast-based Coherence Protocol Support.

Mario Lodde

José Flich

Proceedings of the 2012 Sixth IEEE/ACM International Symposium on Networks-on-Chip (NoCS), 2012

ASCIB: adaptive selection of cache indexing bits for removing conflict misses.

Polychronis Xekalakis

Marcelo Cintra

Proceedings of the International Symposium on Low Power Electronics and Design, 2012

An Experience of Early Initiation to Parallelism in the Computing Engineering Degree at the University of Murcia, Spain.

Javier Cuenca

Lorenzo Fernández Maimó

Juan Alejandro Palomino Benito

Joaquín Cervera

Domingo Giménez

M. Carmen Garrido

Juan A. Sánchez-Laguna

José Guillén

María-Eugenia Requena

Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012

π-TM: Pessimistic invalidation for scalable lazy hardware transactional memory.

Proceedings of the 18th IEEE International Symposium on High Performance Computer Architecture, 2012

Dynamic Last-Level Cache Allocation to Reduce Area and Power Overhead in Directory Coherence Protocols.

Mario Lodde

José Flich

Proceedings of the Euro-Par 2012 Parallel Processing - 18th International Conference, 2012

Design of a collective communication infrastructure for barrier synchronization in cluster-based nanoscale MPSoCs.

Juan Fernández Peinador

Proceedings of the 2012 Design, Automation & Test in Europe Conference & Exhibition, 2012

2011

The Impact of Non-coherent Buffers on Lazy Hardware Transactional Memory Systems.

Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

GLocks: Efficient Support for Highly-Contended Locks in Many-Core CMPs.

Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

ZEBRA: a data-centric, hybrid-policy hardware transactional memory design.

Proceedings of the 25th International Conference on Supercomputing, 2011, Tucson, AZ, USA, May 31, 2011

Eager Meets Lazy: The Impact of Write-Buffering on Hardware Transactional Memory.

Proceedings of the International Conference on Parallel Processing, 2011

Pi-TM: Pessimistic Invalidation for Scalable Lazy Hardware Transactional Memory.

Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, 2011

2010

A Direct Coherence Protocol for Many-Core Chip Multiprocessors.

IEEE Trans. Parallel Distributed Syst., 2010

Dealing with Transient Faults in the Interconnection Network of CMPs at the Cache Coherence Level.

IEEE Trans. Parallel Distributed Syst., 2010

Characterizing the basic synchronization and communication operations in Dual Cell-based Blades through CellStats.

J. Supercomput., 2010

Heterogeneous Interconnects for Energy-Efficient Message Management in CMPs.

IEEE Trans. Computers, 2010

A scalable organization for distributed directories.

J. Syst. Archit., 2010

Exploiting address compression and heterogeneous interconnects for efficient message management in tiled CMPs.

J. Syst. Archit., 2010

Characterizing Energy Consumption in Hardware Transactional Memory Systems.

Proceedings of the 22nd International Symposium on Computer Architecture and High Performance Computing, 2010

Energy-Efficient Hardware Prefetching for CMPs Using Heterogeneous Interconnects.

Proceedings of the 18th Euromicro Conference on Parallel, 2010

A G-Line-Based Network for Fast and Efficient Barrier Synchronization in Many-Core CMPs.

Proceedings of the 39th International Conference on Parallel Processing, 2010

EMC²: Extending Magny-Cours coherence for large-scale servers.

Blas Cuesta

Proceedings of the 2010 International Conference on High Performance Computing, 2010

Evaluation of Low-Overhead Organizations for the Directory in Future Many-Core CMPs.

Proceedings of the Euro-Par 2010 Parallel Processing Workshops, 2010

Efficient and scalable barrier synchronization for many-core CMPs.

Proceedings of the 7th Conference on Computing Frontiers, 2010

2009

A Parallel Implementation of the 2D Wavelet Transform Using CUDA.

Proceedings of the 17th Euromicro International Conference on Parallel, 2009

Speculation-based conflict resolution in hardware transactional memory.

José Manuel García Carrasco

Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

Distance-aware round-robin mapping for large NUCA caches.

Proceedings of the 16th International Conference on High Performance Computing, 2009

Fast and Efficient Synchronization and Communication Collective Primitives for Dual Cell-Based Blades.

Proceedings of the Euro-Par 2009 Parallel Processing, 2009

Dealing with Traffic-Area Trade-Off in Direct Coherence Protocols for Many-Core CMPs.

Proceedings of the Advanced Parallel Processing Technologies, 8th International Symposium, 2009

2008

Extending the TokenCMP Cache Coherence Protocol for Low Overhead Fault Tolerance in CMP Architectures.

IEEE Trans. Parallel Distributed Syst., 2008

An energy consumption characterization of on-chip interconnection networks for tiled CMP architectures.

J. Supercomput., 2008

Two proposals for the inclusion of directory information in the last-level private caches of glueless shared-memory multiprocessors.

J. Parallel Distributed Comput., 2008

Characterization of Conflicts in Log-Based Transactional Memory (LogTM).

José Manuel García Carrasco

Proceedings of the 16th Euromicro International Conference on Parallel, 2008

CellStats: A Tool to Evaluate the Basic Synchronization and Communication Operations of the Cell BE.

Proceedings of the 16th Euromicro International Conference on Parallel, 2008

DiCo-CMP: Efficient cache coherency in tiled CMP architectures.

Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

Address Compression and Heterogeneous Interconnects for Energy-Efficient High-Performance in Tiled CMPs.

Proceedings of the 2008 International Conference on Parallel Processing, 2008

Characterizing the Basic Synchronization and Communication Operations in Dual Cell-Based Blades.

Proceedings of the Computational Science, 2008

Directory-Based Conflict Detection in Hardware Transactional Memory.

Proceedings of the High Performance Computing, 2008

Fault-Tolerant Cache Coherence Protocols for CMPs: Evaluation and Trade-Offs.

Proceedings of the High Performance Computing, 2008

A fault-tolerant directory-based cache coherence protocol for CMP architectures.

Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2008

Multicore Platforms for Scientific Computing: Cell BE and NVIDIA Tesla.

Proceedings of the 2008 International Conference on Scientific Computing, 2008

Scalable Directory Organization for Tiled CMP Architectures.

Proceedings of the 2008 International Conference on Computer Design, 2008

2007

An efficient implementation of a 3D wavelet transform based encoder on hyper-threading technology.

José González

Parallel Comput., 2007

A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures.

Proceedings of the 13th International Conference on High-Performance Computer Architecture (HPCA-13 2007), 2007

Direct Coherence: Bringing Together Performance and Scalability in Shared-Memory Multiprocessors.

Proceedings of the High Performance Computing, 2007

Efficient Message Management in Tiled CMP Architectures Using a Heterogeneous Interconnection Network.

Proceedings of the High Performance Computing, 2007

Sim-PowerCMP: A Detailed Simulator for Energy Consumption Analysis in Future Embedded CMP Architectures.

Proceedings of the 21st International Conference on Advanced Information Networking and Applications (AINA 2007), 2007

2006

On the Evaluation of Dense Chip-Multiprocessor Architectures.

Proceedings of 2006 International Conference on Embedded Computer Systems: Architectures, 2006

An efficient cache design for scalable glueless shared-memory multiprocessors.

Proceedings of the Third Conference on Computing Frontiers, 2006

2005

A Two-Level Directory Architecture for Highly Scalable cc-NUMA Multiprocessors.

IEEE Trans. Parallel Distributed Syst., 2005

Evaluating IA-32 web servers through simics: a practical experience.

J. Syst. Archit., 2005

Optimizing a 3D-FWT Video Encoder for SMPs and HyperThreading Architectures.

Proceedings of the 13th Euromicro Workshop on Parallel, 2005

Memory Subsystem Characterization in a 16-Core Snoop-Based Chip-Multiprocessor Architecture.

Proceedings of the High Performance Computing and Communications, 2005

A Novel Lightweight Directory Architecture for Scalable Shared-Memory Multiprocessors.

Proceedings of the Euro-Par 2005, Parallel Processing, 11th International Euro-Par Conference, Lisbon, Portugal, August 30, 2005

2004

An Architecture for High-Performance Scalable Shared-Memory Multiprocessors Exploiting On-Chip Integration.

IEEE Trans. Parallel Distributed Syst., 2004

On the Evaluation of x86 Web Servers Using Simics: Limitations and Trade-Offs.

Proceedings of the Computational Science, 2004

2002

MPI-Delphi: an MPI implementation for visual programming environments and heterogeneous computing.

Future Gener. Comput. Syst., 2002

Owner prediction for accelerating cache-to-cache transfer misses in a cc-NUMA architecture.

Proceedings of the 2002 ACM/IEEE conference on Supercomputing, 2002

Reducing the Latency of L2 Misses in Shared-Memory Multiprocessors through On-Chip Directory Integration.

Proceedings of the 10th Euromicro Workshop on Parallel, 2002

A Novel Approach to Reduce L2 Miss Latency in Shared-Memory Multiprocessors.

Proceedings of the 16th International Parallel and Distributed Processing Symposium (IPDPS 2002), 2002

The Use of Prediction for Accelerating Upgrade Misses in cc-NUMA Multiprocessors.

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques (PACT 2002), 2002

2001

A New Scalable Directory Architecture for Large-Scale Multiprocessors.

Proceedings of the Seventh International Symposium on High-Performance Computer Architecture (HPCA'01), 2001

1999

The Parallel EM Algorithm and its Applications in Computer Vision.

Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, 1999

The MPI-Delphi Interface: A Visual Programming Environment for Clusters of Workstations.

Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, 1999

A Performance Evaluation of P-EDR in Different Parallel Environments.

Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, 1999

P-EDR: An Algorithm for Parallel Implementation of Parzen Density Estimation from Uncertain Observations.

Proceedings of the 13th International Parallel Processing Symposium / 10th Symposium on Parallel and Distributed Processing (IPPS / SPDP '99), 1999

An Evaluation of Parallel Computing in PC Clusters with Fast Ethernet.
