Mateo Valero

Orcid: 0000-0003-2917-2482

Affiliations:
  • Polytechnic University of Catalonia, Barcelona, Spain
  • Barcelona Supercomputing Center, Spain


According to our database, Mateo Valero authored at least 473 papers between 1982 and 2023.

Awards

ACM Fellow

ACM Fellow 2002, "For contributions to the design of vector, superscalar, and VLIW architectures, and technical leadership."

IEEE Fellow

IEEE Fellow 2001, "For contributions to the design of vector architectures and superscalar processors."

Bibliography

2023
Vitruvius+: An Area-Efficient RISC-V Decoupled Vector Coprocessor for High Performance Computing Applications.
ACM Trans. Archit. Code Optim., June, 2023

VAQUERO: A Scratchpad-based Vector Accelerator for Query Processing.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2023


2022
Adaptable Register File Organization for Vector Processors.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2022


2021
When Sally Met Harry or When AI Met HPC.
Supercomput. Front. Innov., 2021

The Ultimate DataFlow for Ultimate SuperComputers-on-a-Chip, for Scientific Computing, Geo Physics, Complex Mathematics, and Information Processing.
Proceedings of the 10th Mediterranean Conference on Embedded Computing, 2021

VIA: A Smart Scratchpad for Vector Units with Application to Sparse Matrix Computations.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2021

PrioRAT: Criticality-Driven Prioritization Inside the On-Chip Memory Hierarchy.
Proceedings of the Euro-Par 2021: Parallel Processing, 2021

2020
Efficiency analysis of modern vector architectures: vector ALU sizes, core counts and clock frequencies.
J. Supercomput., 2020

Using Arm's scalable vector extension on stencil codes.
J. Supercomput., 2020

Advances in the Hierarchical Emergent Behaviors (HEB) Approach to Autonomous Vehicles.
IEEE Intell. Transp. Syst. Mag., 2020

Semi-automatic validation of cycle-accurate simulation infrastructures: The case for gem5-x86.
Future Gener. Comput. Syst., 2020

The Ultimate DataFlow for Ultimate SuperComputers-on-a-Chips.
CoRR, 2020

Runtime-guided ECC protection using online estimation of memory vulnerability.
Proceedings of the International Conference for High Performance Computing, 2020

RICH: implementing reductions in the cache hierarchy.
Proceedings of the ICS '20: 2020 International Conference on Supercomputing, 2020

Improving Accuracy and Speeding Up Document Image Classification Through Parallel Systems.
Proceedings of the Computational Science - ICCS 2020, 2020

Improving Predication Efficiency through Compaction/Restoration of SIMD Instructions.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2020


2019
A Hardware Runtime for Task-Based Programming Models.
IEEE Trans. Parallel Distributed Syst., 2019

On the maturity of parallel applications for asymmetric multi-core processors.
J. Parallel Distributed Comput., 2019

Guest Editorial: Special Issue on Network and Parallel Computing for Emerging Architectures and Applications.
Int. J. Parallel Program., 2019

The international race towards Exascale in Europe.
CCF Trans. High Perform. Comput., 2019

Optimizing computation-communication overlap in asynchronous task-based programs: poster.
Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2019

A Vulnerability Factor for ECC-protected Memory.
Proceedings of the 25th IEEE International Symposium on On-Line Testing and Robust System Design, 2019

Power efficient job scheduling by predicting the impact of processor manufacturing variability.
Proceedings of the ACM International Conference on Supercomputing, 2019

Optimizing computation-communication overlap in asynchronous task-based programs.
Proceedings of the ACM International Conference on Supercomputing, 2019

POSTER: An Optimized Predication Execution for SIMD Extensions.
Proceedings of the 28th International Conference on Parallel Architectures and Compilation Techniques, 2019

2018
Vector Processing-Aware Advanced Clock-Gating Techniques for Low-Power Fused Multiply-Add.
IEEE Trans. Very Large Scale Integr. Syst., 2018

Asynchronous and Exact Forward Recovery for Detected Errors in Iterative Solvers.
IEEE Trans. Parallel Distributed Syst., 2018

Reducing Cache Coherence Traffic with a NUMA-Aware Runtime Approach.
IEEE Trans. Parallel Distributed Syst., 2018

Performance and energy effects on task-based parallelized applications - User-directed versus manual vectorization.
J. Supercomput., 2018

A General Guide to Applying Machine Learning to Computer Architecture.
Supercomput. Front. Innov., 2018

Memory Vulnerability: A Case for Delaying Error Reporting.
CoRR, 2018

Runtime-assisted cache coherence deactivation in task parallel programs.
Proceedings of the International Conference for High Performance Computing, 2018

Graph partitioning applied to DAG scheduling to reduce NUMA effects.
Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2018

Runtime Aware Architectures.
Proceedings of the 2018 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, 2018

Reducing Data Movement on Large Shared Memory Systems by Exploiting Computation Dependencies.
Proceedings of the 32nd International Conference on Supercomputing, 2018

Runtime-Guided Management of Stacked DRAM Memories in Task Parallel Programs.
Proceedings of the 32nd International Conference on Supercomputing, 2018

Architectural Support for Task Dependence Management with Flexible Software Scheduling.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2018

Stencil codes on a vector length agnostic architecture.
Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, 2018

2017
Task Scheduling Techniques for Asymmetric Multi-Core Systems.
IEEE Trans. Parallel Distributed Syst., 2017

An Integrated Vector-Scalar Design on an In-Order ARM Core.
ACM Trans. Archit. Code Optim., 2017

Determinism at Standard-Library Level in TM-Based Applications.
Int. J. Parallel Program., 2017

A scalable synthetic traffic model of Graph500 for computer networks analysis.
Concurr. Comput. Pract. Exp., 2017

SEDEA: A Sensible Approach to Account DRAM Energy in Multicore Systems.
Proceedings of the 29th International Symposium on Computer Architecture and High Performance Computing, 2017

iQ: An Efficient and Flexible Queue-Based Simulation Framework.
Proceedings of the 25th IEEE International Symposium on Modeling, 2017

General Purpose Task-Dependence Management Hardware for Task-Based Dataflow Programming Models.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

FlexVC: Flexible Virtual Channel Management in Low-Diameter Networks.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

ATM: Approximate Task Memoization in the Runtime System.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

Picos, A Hardware Task-Dependence Manager for Task-Based Dataflow Programming Models.
Proceedings of the 2017 International Conference on High Performance Computing & Simulation, 2017

Fog Function Virtualization: A flexible solution for IoT applications.
Proceedings of the Second International Conference on Fog and Mobile Edge Computing, 2017

Direct Inter-Process Communication (dIPC): Repurposing the CODOMs Architecture to Accelerate IPC.
Proceedings of the Twelfth European Conference on Computer Systems, 2017

To Distribute or Not to Distribute: The Question of Load Balancing for Performance or Energy.
Proceedings of the Euro-Par 2017: Parallel Processing - 23rd International Conference on Parallel and Distributed Computing, Santiago de Compostela, Spain, August 28, 2017

Runtime-Assisted Shared Cache Insertion Policies Based on Re-reference Intervals.
Proceedings of the Euro-Par 2017: Parallel Processing - 23rd International Conference on Parallel and Distributed Computing, Santiago de Compostela, Spain, August 28, 2017

A Deep Learning Mapper (DLM) for Scheduling on Heterogeneous Systems.
Proceedings of the High Performance Computing - 4th Latin American Conference, 2017

2016
DReAM: An Approach to Estimate per-Task DRAM Energy in Multicore Systems.
ACM Trans. Design Autom. Electr. Syst., 2016

Network unfairness in dragonfly topologies.
J. Supercomput., 2016

Thread Assignment in Multicore/Multithreaded Processors: A Statistical Approach.
IEEE Trans. Computers, 2016

Sensible Energy Accounting with Abstract Metering for Multicore Systems.
ACM Trans. Archit. Code Optim., 2016

PARSECSs: Evaluating the Impact of Task Parallelism in the PARSEC Benchmark Suite.
ACM Trans. Archit. Code Optim., 2016

Emergent Behaviors in the Internet of Things: The Ultimate Ultra-Large-Scale System.
IEEE Micro, 2016

Alya: Multiphysics engineering simulation toward exascale.
J. Comput. Sci., 2016

Interconnection Networks in Petascale Computer Systems: A Survey.
ACM Comput. Surv., 2016


MUSA: a multi-level simulation approach for next-generation HPC machines.
Proceedings of the International Conference for High Performance Computing, 2016

Performance analysis of a hardware accelerator of dependence management for task-based dataflow programming models.
Proceedings of the 2016 IEEE International Symposium on Performance Analysis of Systems and Software, 2016

A Fully Parameterizable Low Power Design of Vector Fused Multiply-Add Using Active Clock-Gating Techniques.
Proceedings of the 2016 International Symposium on Low Power Electronics and Design, 2016

Future Vector Microprocessor Extensions for Data Aggregations.
Proceedings of the 43rd ACM/IEEE Annual International Symposium on Computer Architecture, 2016

CATA: Criticality Aware Task Acceleration for Multicore Processors.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

Runtime-Guided Mitigation of Manufacturing Variability in Power-Constrained Multi-Socket NUMA Nodes.
Proceedings of the 2016 International Conference on Supercomputing, 2016

POSTER: An Integrated Vector-Scalar Design on an In-order ARM Core.
Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, 2016

POSTER: Exploiting Asymmetric Multi-Core Processors with Flexible System Sofware.
Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, 2016

Reducing Cache Coherence Traffic with Hierarchical Directory Cache and NUMA-Aware Runtime Scheduling.
Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, 2016

2015
On-the-fly adaptive routing for dragonfly interconnection networks.
J. Supercomput., 2015

Reimagining Heterogeneous Computing: A Functional Instruction-Set Architecture Computing Model.
IEEE Micro, 2015

Kernel-to-User-Mode Transition-Aware Hardware Scheduling.
IEEE Micro, 2015

New Benchmarking Methodology and Programming Model for Big Data Processing.
Int. J. Distributed Sens. Networks, 2015

Picos: A hardware runtime architecture support for OmpSs.
Future Gener. Comput. Syst., 2015

Adaptive and application dependent runtime guided hardware prefetcher reconfiguration on the IBM POWER7.
CoRR, 2015

Thread Lock Section-Aware Scheduling on Asymmetric Single-ISA Multi-Core.
IEEE Comput. Archit. Lett., 2015

Exploiting asynchrony from exact forward recovery for DUE in iterative solvers.
Proceedings of the International Conference for High Performance Computing, 2015

Performance and Energy Efficient Hardware-Based Scheduler for Symmetric/Asymmetric CMPs.
Proceedings of the 27th International Symposium on Computer Architecture and High Performance Computing, 2015

Imposing coarse-grained reconfiguration to general purpose processors.
Proceedings of the 2015 International Conference on Embedded Computer Systems: Architectures, 2015

Evaluating the Impact of OpenMP 4.0 Extensions on Relevant Parallel Workloads.
Proceedings of the OpenMP: Heterogenous Execution and Data Movements, 2015

Joint Circuit-System Design Space Exploration of Multiplier Unit Structure for Energy-Efficient Vector Processors.
Proceedings of the 2015 IEEE Computer Society Annual Symposium on VLSI, 2015

Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures.
Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015

Contention-Based Nonminimal Adaptive Routing in High-Radix Networks.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

Criticality-Aware Dynamic Task Scheduling for Heterogeneous Architectures.
Proceedings of the 29th ACM on International Conference on Supercomputing, 2015

Increasing multicore system efficiency through intelligent bandwidth shifting.
Proceedings of the 21st IEEE International Symposium on High Performance Computer Architecture, 2015

VSR sort: A novel vectorised sorting algorithm & architecture extensions for future microprocessors.
Proceedings of the 21st IEEE International Symposium on High Performance Computer Architecture, 2015

Hardware Round-Robin Scheduler for Single-ISA Asymmetric Multi-core.
Proceedings of the Euro-Par 2015: Parallel Processing, 2015


Throughput Unfairness in Dragonfly Networks under Realistic Traffic Patterns.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

Spark deployment and performance evaluation on the MareNostrum supercomputer.
Proceedings of the 2015 IEEE International Conference on Big Data (IEEE BigData 2015), Santa Clara, CA, USA, October 29, 2015

Runtime-Guided Management of Scratchpad Memories in Multicore Architectures.
Proceedings of the 2015 International Conference on Parallel Architectures and Compilation, 2015

2014
Analyzing the Efficiency of L1 Caches for Reliable Hybrid-Voltage Operation Using EDC Codes.
IEEE Trans. Very Large Scale Integr. Syst., 2014

Hybrid Cache Designs for Reliable Hybrid High and Ultra-Low Voltage Operation.
ACM Trans. Design Autom. Electr. Syst., 2014

Runtime-Aware Architectures: A First Approach.
Supercomput. Front. Innov., 2014

TERAFLUX: Harnessing dataflow in next generation teradevices.
Microprocess. Microsystems, 2014

Using Dynamic Runtime Testing for Rapid Development of Architectural Simulators.
Int. J. Parallel Program., 2014

Editorial.
Computación y Sistemas, 2014

Per-task Energy Accounting in Computing Systems.
IEEE Comput. Archit. Lett., 2014

Automatic Exploration of Potential Parallelism in Sequential Applications.
Proceedings of the Supercomputing - 29th International Conference, 2014

DeTrans: Deterministic and Parallel execution of Transactions.
Proceedings of the 26th IEEE International Symposium on Computer Architecture and High Performance Computing, 2014

Dynamic-vector execution on a general purpose EDGE chip multiprocessor.
Proceedings of the XIVth International Conference on Embedded Computer Systems: Architectures, 2014

PAMS: Pattern Aware Memory System for embedded systems.
Proceedings of the 2014 International Conference on ReConFigurable Computing and FPGAs, 2014

Physical vs. Physically-Aware Estimation Flow: Case Study of Design Space Exploration of Adders.
Proceedings of the IEEE Computer Society Annual Symposium on VLSI, 2014

CODOMs: Protecting software with Code-centric memory Domains.
Proceedings of the ACM/IEEE 41st International Symposium on Computer Architecture, 2014

Enabling preemptive multiprogramming on GPUs.
Proceedings of the ACM/IEEE 41st International Symposium on Computer Architecture, 2014

Scaling Irregular Applications through Data Aggregation and Software Multithreading.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

Big Data Processing: Data Flow vs Control Flow (New Benchmarking Methodology).
Proceedings of the International Conference on Identification, 2014

Author retrospective for software trace cache.
Proceedings of the ACM International Conference on Supercomputing 25th Anniversary Volume, 2014

Evaluation of vectorization potential of Graph500 on Intel's Xeon Phi.
Proceedings of the International Conference on High Performance Computing & Simulation, 2014

Advanced Pattern based Memory Controller for FPGA based HPC applications.
Proceedings of the International Conference on High Performance Computing & Simulation, 2014

AMMC: Advanced Multi-Core Memory Controller.
Proceedings of the 2014 International Conference on Field-Programmable Technology, 2014

MAPC: Memory access pattern based controller.
Proceedings of the 24th International Conference on Field Programmable Logic and Applications, 2014

APMC: advanced pattern based memory controller (abstract only).
Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2014

DReAM: Per-Task DRAM Energy Metering in Multicore Systems.
Proceedings of the Euro-Par 2014 Parallel Processing, 2014

EVX: Vector execution on low power EDGE cores.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2014

Dynamic transaction coalescing.
Proceedings of the Computing Frontiers Conference, CF'14, 2014

Characterizing the Communication Demands of the Graph500 Benchmark on a Commodity Cluster.
Proceedings of the 1st IEEE/ACM International Symposium on Big Data Computing, 2014

PVMC: Programmable Vector Memory Controller.
Proceedings of the IEEE 25th International Conference on Application-Specific Systems, 2014

Stand-Alone Memory Controller for Graphics System.
Proceedings of the Reconfigurable Computing: Architectures, Tools, and Applications, 2014

2013
Thread Assignment of Multithreaded Network Applications in Multicore/Multithreaded Processors.
IEEE Trans. Parallel Distributed Syst., 2013

SMT Malleability in IBM POWER5 and POWER6 Processors.
IEEE Trans. Computers, 2013

Profile-guided transaction coalescing - lowering transactional overheads by merging transactions.
ACM Trans. Archit. Code Optim., 2013

Fair CPU time accounting in CMP+SMT processors.
ACM Trans. Archit. Code Optim., 2013

Hardware support for accurate per-task energy metering in multicore systems.
ACM Trans. Archit. Code Optim., 2013

Programmability and portability for exascale: Top down programming methodology and tools with StarSs.
J. Comput. Sci., 2013

Moving from petaflops to petadata.
Commun. ACM, 2013

Supercomputing with commodity CPUs: are mobile SoCs ready for HPC?
Proceedings of the International Conference for High Performance Computing, 2013

Identifying Critical Code Sections in Dataflow Programming Models.
Proceedings of the 21st Euromicro International Conference on Parallel, 2013

On the selection of adder unit in energy efficient vector processing.
Proceedings of the International Symposium on Quality Electronic Design, 2013

Trace filtering of multithreaded applications for CMP memory simulation.
Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems & Software, 2013

TM-dietlibc: A TM-aware Real-World System Library.
Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

HPC System Software for Regular and Irregular Parallel Applications.
Proceedings of the 2013 IEEE International Symposium on Parallel & Distributed Processing, 2013

HPCS 2013 panel: The era of exascale sciences: Challenges, needs and requirements.
Proceedings of the International Conference on High Performance Computing & Simulation, 2013

Efficient Routing Mechanisms for Dragonfly Networks.
Proceedings of the 42nd International Conference on Parallel Processing, 2013

EcoTM: Conflict-Aware Economical Unbounded Hardware Transactional Memory.
Proceedings of the International Conference on Computational Science, 2013

OFAR-CM: Efficient Dragonfly Networks with Simple Congestion Management.
Proceedings of the IEEE 21st Annual Symposium on High-Performance Interconnects, 2013

Global misrouting policies in two-level hierarchical networks.
Proceedings of the 2013 Interconnection Network Architecture: On-Chip, Multi-Chip, 2013


Efficient cache architectures for reliable hybrid voltage operation using EDC codes.
Proceedings of the Design, Automation and Test in Europe, 2013

APPLE: adaptive performance-predictable low-energy caches for reliable hybrid voltage operation.
Proceedings of the 50th Annual Design Automation Conference 2013, 2013

Killer-mobiles: The way towards energy efficient high performance computers?
Proceedings of the 13th International Conference on Application of Concurrency to System Design, 2013

2012
CPU Accounting for Multicore Processors.
IEEE Trans. Computers, 2012

Dynamic Tolerance Region Computing for Multimedia.
IEEE Trans. Computers, 2012

On the simulation of large-scale architectures using multiple application abstraction levels.
ACM Trans. Archit. Code Optim., 2012

Hardware transactional memory with software-defined conflicts.
ACM Trans. Archit. Code Optim., 2012

Parallel job scheduling for power constrained HPC systems.
Parallel Comput., 2012

The Problem of Evaluating CPU-GPU Systems with 3D Visualization Applications.
IEEE Micro, 2012

Resource-bounded multicore emulation using Beefarm.
Microprocess. Microsystems, 2012

Understanding the future of energy-performance trade-off via DVFS in HPC environments.
J. Parallel Distributed Comput., 2012

Circuit design of a dual-versioning L1 data cache.
Integr., 2012

Profiling and Optimizing Transactional Memory Applications.
Int. J. Parallel Program., 2012

The Network Adapter: The Missing Link between MPI Applications and Network Performance.
Proceedings of the IEEE 24th International Symposium on Computer Architecture and High Performance Computing, 2012

Efficient Sorting on the Tilera Manycore Architecture.
Proceedings of the IEEE 24th International Symposium on Computer Architecture and High Performance Computing, 2012

Novel SRAM bias control circuits for a low power L1 data cache.
Proceedings of the NORCHIP 2012, Copenhagen, Denmark, November 12-13, 2012

Vector Extensions for Decision Support DBMS Acceleration.
Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012

Improving Cache Management Policies Using Dynamic Reuse Distances.
Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012

Evaluating the Impact of TLB Misses on Future HPC Systems.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

Enhancing the performance of assisted execution runtime systems through hardware/software techniques.
Proceedings of the International Conference on Supercomputing, 2012

On-the-Fly Adaptive Routing in High-Radix Hierarchical Networks.
Proceedings of the 41st International Conference on Parallel Processing, 2012

ADAM: an efficient data management mechanism for hybrid high and ultra-low voltage operation caches.
Proceedings of the Great Lakes Symposium on VLSI 2012, 2012

TagTM - accelerating STMs with hardware tags for fast meta-data access.
Proceedings of the 2012 Design, Automation & Test in Europe Conference & Exhibition, 2012

Optimal task assignment in multithreaded processors: a statistical approach.
Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, 2012

2011
Assessing Accelerator-Based HPC Reverse Time Migration.
IEEE Trans. Parallel Distributed Syst., 2011

Dynamic Cache Partitioning Based on the MLP of Cache Misses.
Trans. High Perform. Embed. Archit. Compil., 2011

A Highly Scalable Parallel Implementation of H.264.
Trans. High Perform. Embed. Archit. Compil., 2011

RMS-TM: a comprehensive benchmark suite for transactional memory systems (abstracts only).
SIGMETRICS Perform. Evaluation Rev., 2011

Exploiting intra-task slack time of load operations for DVFS in hard real-time multi-core systems.
SIGBED Rev., 2011

Energy-Aware Accounting and Billing in Large-Scale Computing Facilities.
IEEE Micro, 2011

Simulating Whole Supercomputer Applications.
IEEE Micro, 2011

Hybrid Transactional Memory with Pessimistic Concurrency Control.
Int. J. Parallel Program., 2011

The International Exascale Software Project roadmap.
Int. J. High Perform. Comput. Appl., 2011

Characterizing Power and Temperature Behavior of POWER6-Based System.
IEEE J. Emerg. Sel. Topics Circuits Syst., 2011

Scalable multicore architectures for long DNA sequence comparison.
Concurr. Comput. Pract. Exp., 2011

RMS-TM: a comprehensive benchmark suite for transactional memory systems.
Proceedings of the ICPE'11, 2011

Rapid Development of Error-Free Architectural Simulators Using Dynamic Runtime Testing.
Proceedings of the 23rd International Symposium on Computer Architecture and High Performance Computing, 2011

Breaking the bandwidth wall in chip multiprocessors.
Proceedings of the 2011 International Conference on Embedded Computer Systems: Architectures, 2011

IA^3: An Interference Aware Allocation Algorithm for Multicore Hard Real-Time Systems.
Proceedings of the 17th IEEE Real-Time and Embedded Technology and Applications Symposium, 2011

The Impact of Application's Micro-Imbalance on the Communication-Computation Overlap.
Proceedings of the 19th International Euromicro Conference on Parallel, 2011

Hybrid Parallel Programming with MPI/StarSs.
Proceedings of the Applications, Tools and Techniques on the Road to Exascale Computing, Proceedings of the conference ParCo 2011, 31 August, 2011

An Abstraction Methodology for the Evaluation of Multi-core Multi-threaded Architectures.
Proceedings of the MASCOTS 2011, 2011

Trace-driven simulation of multithreaded applications.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2011

A Quantitative Analysis of OS Noise.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

RVC-based time-predictable faulty caches for safety-critical systems.
Proceedings of the 17th IEEE International On-Line Testing Symposium (IOLTS 2011), 2011

Linear programming based parallel job scheduling for power constrained systems.
Proceedings of the 2011 International Conference on High Performance Computing & Simulation, 2011

FIMSIM: A fault injection infrastructure for microarchitectural simulators.
Proceedings of the IEEE 29th International Conference on Computer Design, 2011

RVC: a mechanism for time-analyzable real-time processors with faulty caches.
Proceedings of the High Performance Embedded Architectures and Compilers, 2011

Circuit design of a dual-versioning L1 data cache for optimistic concurrency.
Proceedings of the 21st ACM Great Lakes Symposium on VLSI 2010, 2011

TMbox: A Flexible and Reconfigurable 16-Core Hybrid Transactional Memory System.
Proceedings of the IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines, 2011

Quantifying the Potential Task-Based Dataflow Parallelism in MPI Applications.
Proceedings of the Euro-Par 2011 Parallel Processing - 17th International Conference, 2011

Hybrid high-performance low-power and ultra-low energy reliable caches.
Proceedings of the 8th Conference on Computing Frontiers, 2011

From Plasma to BeeFarm: Design Experience of an FPGA-Based Multicore Prototype.
Proceedings of the Reconfigurable Computing: Architectures, Tools and Applications, 2011

SymptomTM: Symptom-Based Error Detection and Recovery Using Hardware Transactional Memory.
Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, 2011

STM2: A Parallel STM for High Performance Simultaneous Multithreading Systems.
Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, 2011

Using a Reconfigurable L1 Data Cache for Efficient Version Management in Hardware Transactional Memory.
Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, 2011

2010
On the Problem of Evaluating the Performance of Multiprogrammed Workloads.
IEEE Trans. Computers, 2010

Multicore: The View from Europe.
IEEE Micro, 2010

Utilization driven power-aware parallel job scheduling.
Comput. Sci. Res. Dev., 2010

Trends and techniques for energy efficient architectures.
Proceedings of the 18th IEEE/IFIP VLSI-SoC 2010, 2010

Debugging programs that use atomic blocks and transactional memory.
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010

Thread to strand binding of parallel network applications in massive multi-threaded systems.
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010

Effective communication and computation overlap with hybrid MPI/SMPSs.
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010

Architectural Support for Fair Reader-Writer Locking.
Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010

Task Superscalar: An Out-of-Order Task Pipeline.
Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010

Simulation environment for studying overlap of communication and computation.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2010

Adapting cache partitioning algorithms to pseudo-LRU replacement policies.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

BSLD threshold driven power management policy for HPC centers.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Comparing last-level cache designs for CMP architectures.
Proceedings of the Second International Forum on Next-Generation Multicore/Manycore Technologies, 2010

Power and performance aware reconfigurable cache for CMPs.
Proceedings of the Second International Forum on Next-Generation Multicore/Manycore Technologies, 2010

Overlapping communication and computation by using a hybrid MPI/SMPSs approach.
Proceedings of the 24th International Conference on Supercomputing, 2010

Optimizing job performance under a given power constraint in HPC centers.
Proceedings of the International Green Computing Conference 2010, 2010

Long DNA Sequence Comparison on Multicore Architectures.
Proceedings of the Euro-Par 2010 - Parallel Processing, 16th International Euro-Par Conference, Ischia, Italy, August 31, 2010

A Simulation Framework to Automatically Analyze the Communication-Computation Overlap in Scientific Applications.
Proceedings of the 2010 IEEE International Conference on Cluster Computing, 2010

Designing OS for HPC Applications: Scheduling.
Proceedings of the 2010 IEEE International Conference on Cluster Computing, 2010

Scalability Analysis of Progressive Alignment on a Multicore.
Proceedings of the CISIS 2010, 2010

Load balancing using dynamic cache allocation.
Proceedings of the 7th Conference on Computing Frontiers, 2010

Exploiting Inactive Rename Slots for Detecting Soft Errors.
Proceedings of the Architecture of Computing Systems, 2010

Discovering and understanding performance bottlenecks in transactional applications.
Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, 2010

Efficient runahead threads.
Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, 2010

Power and thermal characterization of POWER6 system.
Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, 2010

2009
DIA: A Complexity-Effective Decoding Architecture.
IEEE Trans. Computers, 2009

Available task-level parallelism on the Cell BE.
Sci. Program., 2009

FlexDCP: a QoS framework for CMP architectures.
ACM SIGOPS Oper. Syst. Rev., 2009

Evaluación del rendimiento paralelo en el nivel macro bloque del decodificador H.264 en una arquitectura multiprocesador cc-NUMA.
Rev. Avances en Sistemas Informática, 2009

BSC Vision Towards Exascale.
Int. J. High Perform. Comput. Appl., 2009

The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community.
Int. J. High Perform. Comput. Appl., 2009

An Analyzable Memory Controller for Hard Real-Time CMPs.
IEEE Embed. Syst. Lett., 2009

CPU Accounting in CMP Processors.
IEEE Comput. Archit. Lett., 2009

Thread to Core Assignment in SMT On-Chip Multiprocessors.
Proceedings of the 21st International Symposium on Computer Architecture and High Performance Computing, 2009

Atomic quake: using transactional memory in an interactive multiplayer game server.
Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009

Turbocharging boosted transactions or: how I learnt to stop worrying and love longer transactions.
Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009

EazyHTM: eager-lazy hardware transactional memory.
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42 2009), 2009

Characterizing the resource-sharing levels in the UltraSPARC T2 processor.
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42 2009), 2009

Hardware support for WCET analysis of hard real-time multicore systems.
Proceedings of the 36th International Symposium on Computer Architecture (ISCA 2009), 2009

Taking the heat off transactions: Dynamic selection of pessimistic concurrency control.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

Clock gate on abort: Towards energy-efficient hardware Transactional Memory.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

Power-aware load balancing of large scale MPI applications.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

A European perspective on supercomputing.
Proceedings of the 23rd international conference on Supercomputing, 2009

Exploring pattern-aware routing in generalized fat tree networks.
Proceedings of the 23rd international conference on Supercomputing, 2009

QuakeTM: parallelizing a complex sequential application using transactional memory.
Proceedings of the 23rd international conference on Supercomputing, 2009

Code Semantic-Aware Runahead Threads.
Proceedings of the ICPP 2009, 2009

Scalability of Macroblock-level Parallelism for H.264 Decoding.
Proceedings of the 15th IEEE International Conference on Parallel and Distributed Systems, 2009

Dynamically Filtering Thread-Local Variables in Lazy-Lazy Hardware Transactional Memory.
Proceedings of the 11th IEEE International Conference on High Performance Computing and Communications, 2009

Oblivious routing schemes in extended generalized Fat Tree networks.
Proceedings of the 2009 IEEE International Conference on Cluster Computing, August 31, 2009

Quantitative analysis of sequence alignment applications on multiprocessor architectures.
Proceedings of the 6th Conference on Computing Frontiers, 2009

ITCA: Inter-task Conflict-Aware CPU Accounting for CMPs.
Proceedings of the PACT 2009, 2009

2008
Multicore Resource Management.
IEEE Micro, 2008

Nebelung: Execution Environment for Transactional OpenMP.
Int. J. Parallel Program., 2008

Power-efficient VLIW design using clustering and widening.
Int. J. Embed. Syst., 2008

Vectorized AES Core for High-throughput Secure Environments.
Proceedings of the High Performance Computing for Computational Science, 2008

A dynamic scheduler for balancing HPC applications.
Proceedings of the ACM/IEEE Conference on High Performance Computing, 2008

Measuring Operating System Overhead on CMT Processors.
Proceedings of the 20th International Symposium on Computer Architecture and High Performance Computing, 2008

Selection of the Register File Size and the Resource Allocation Policy on SMT Processors.
Proceedings of the 20th International Symposium on Computer Architecture and High Performance Computing, 2008

Preliminary Analysis of the Cell BE Processor Limitations for Sequence Alignment Applications.
Proceedings of the Embedded Computer Systems: Architectures, 2008

A distributed processor state management architecture for large-window processors.
Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-41 2008), 2008

WormBench: a configurable workload for evaluating transactional memory systems.
Proceedings of the 9th workshop on MEmory performance, 2008

A Two-Level Load/Store Queue Based on Execution Locality.
Proceedings of the 35th International Symposium on Computer Architecture (ISCA 2008), 2008

Software-Controlled Priority Characterization of POWER5 Processor.
Proceedings of the 35th International Symposium on Computer Architecture (ISCA 2008), 2008

Balancing HPC applications through smart allocation of resources in MT processors.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

MFLUSH: Handling Long-Latency Loads in SMT On-Chip Multiprocessors.
Proceedings of the 2008 International Conference on Parallel Processing, 2008

Runahead Threads to improve SMT performance.
Proceedings of the 14th International Conference on High-Performance Computer Architecture (HPCA-14 2008), 2008

Supercomputing for the Future, Supercomputing from the Past (Keynote).
Proceedings of the High Performance Embedded Architectures and Compilers, 2008

LPA: A First Approach to the Loop Processor Architecture.
Proceedings of the High Performance Embedded Architectures and Compilers, 2008

Architecture Performance Prediction Using Evolutionary Artificial Neural Networks.
Proceedings of the Applications of Evolutionary Computing, 2008

The limits of software transactional memory (STM): dissecting Haskell STM applications on a many-core environment.
Proceedings of the 5th Conference on Computing Frontiers, 2008

Evolutionary system for prediction and optimization of hardware architecture performance.
Proceedings of the IEEE Congress on Evolutionary Computation, 2008

Soft Real-Time Scheduling on SMT Processors with Explicit Resource Allocation.
Proceedings of the Architecture of Computing Systems, 2008

MultiLayer processing - an execution model for parallel stateful packet processing.
Proceedings of the 2008 ACM/IEEE Symposium on Architecture for Networking and Communications Systems, 2008

2007
Enlarging Instruction Streams.
IEEE Trans. Computers, 2007

Energy saving through a simple load control mechanism.
SIGARCH Comput. Archit. News, 2007

Transactional Memory: An Overview.
IEEE Micro, 2007

Explaining Dynamic Cache Partitioning Speed Ups.
IEEE Comput. Archit. Lett., 2007

unreadTVar: Extending Haskell Software Transactional Memory for Performance.
Proceedings of the Eighth Symposium on Trends in Functional Programming, 2007

Online Prediction of Applications Cache Utility.
Proceedings of the 2007 International Conference on Embedded Computer Systems: Architectures, 2007

On the Problem of Minimizing Workload Execution Time in SMT Processors.
Proceedings of the 2007 International Conference on Embedded Computer Systems: Architectures, 2007

Multithreaded software transactional memory and OpenMP.
Proceedings of the 2007 workshop on MEmory performance, 2007

Transactional Memory and OpenMP.
Proceedings of the A Practical Programming Model for the Multi-Core Era, 2007

Performance Impact of Unaligned Memory Operations in SIMD Extensions for Video Codec Applications.
Proceedings of the 2007 IEEE International Symposium on Performance Analysis of Systems and Software, 2007

Microarchitectural Support for Speculative Register Renaming.
Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

HD-VideoBench. A Benchmark for Evaluating High Definition Digital Video Applications.
Proceedings of the IEEE 10th International Symposium on Workload Characterization, 2007

Hardware Transactional Memory with Operating System Support, HTMOS.
Proceedings of the Euro-Par 2007 Workshops: Parallel Processing, 2007

Implicit Transactional Memory in Kilo-Instruction Multiprocessors.
Proceedings of the Advances in Computer Systems Architecture, 2007

FAME: FAirly MEasuring Multithreaded Architectures.
Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques (PACT 2007), 2007

Runahead Threads: Reducing Resource Contention in SMT Processors.
Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques (PACT 2007), 2007

A Flexible Heterogeneous Multi-Core Architecture.
Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques (PACT 2007), 2007

MLP-Aware Dynamic Cache Partitioning.
Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques (PACT 2007), 2007

2006
A DRAM/SRAM Memory Scheme for Fast Packet Buffers.
IEEE Trans. Computers, 2006

Predictable Performance in SMT Processors: Synergy between the OS and SMTs.
IEEE Trans. Computers, 2006

Performance, power efficiency and scalability of asymmetric cluster chip multiprocessors.
IEEE Comput. Archit. Lett., 2006

A simple speculative load control mechanism for energy saving.
Proceedings of the 2006 workshop on MEmory performance, 2006

Performance Analysis of Sequence Alignment Applications.
Proceedings of the 2006 IEEE International Symposium on Workload Characterization, 2006

A decoupled KILO-instruction processor.
Proceedings of the 12th International Symposium on High-Performance Computer Architecture, 2006

Kilo-instruction processors, runahead and prefetching.
Proceedings of the Third Conference on Computing Frontiers, 2006

Speculative early register release.
Proceedings of the Third Conference on Computing Frontiers, 2006

Branch predictor guided instruction decoding.
Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques (PACT 2006), 2006

2005
Software Trace Cache.
IEEE Trans. Computers, 2005

Fuzzy Memoization for Floating-Point Multimedia Applications.
IEEE Trans. Computers, 2005

Dynamic memory interval test vs. interprocedural pointer analysis in multimedia applications.
ACM Trans. Archit. Code Optim., 2005

The impact of traffic aggregation on the memory performance of networking applications.
SIGARCH Comput. Archit. News, 2005

Speculative execution for hiding memory latency.
SIGARCH Comput. Archit. News, 2005

Better Branch Prediction Through Prophet/Critic Hybrids.
IEEE Micro, 2005

Kilo-Instruction Processors: Overcoming the Memory Wall.
IEEE Micro, 2005

Hardware support for early register release.
Int. J. High Perform. Comput. Netw., 2005

On the Scalability of 1- and 2-Dimensional SIMD Extensions for Multimedia Applications.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2005

Performance Analysis of a New Packet Trace Compressor based on TCP Flow Clustering.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2005

Workload Characterization of Stateful Networking Applications.
Proceedings of the High-Performance Computing - 6th International Symposium, 2005

Multiple Stream Prediction.
Proceedings of the High-Performance Computing - 6th International Symposium, 2005

Decoupled State-Execute Architecture.
Proceedings of the High-Performance Computing - 6th International Symposium, 2005

Exploiting Execution Locality with a Decoupled Kilo-Instruction Processor.
Proceedings of the High-Performance Computing - 6th International Symposium, 2005

Control-Flow Independence Reuse via Dynamic Vectorization.
Proceedings of the 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), 2005

Effective Instruction Prefetching via Fetch Prestaging.
Proceedings of the 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), 2005

An asymmetric clustered processor based on value content.
Proceedings of the 19th Annual International Conference on Supercomputing, 2005

Implementing Kilo-Instruction Multiprocessors.
Proceedings of the International Conference on Pervasive Services 2005, 2005

A Vector-µSIMD-VLIW Architecture for Multimedia Applications.
Proceedings of the 34th International Conference on Parallel Processing (ICPP 2005), 2005

A Complexity-Effective Simultaneous Multithreading Architecture.
Proceedings of the 34th International Conference on Parallel Processing (ICPP 2005), 2005

A New Pointer-based Instruction Queue Design and Its Power-Performance Evaluation.
Proceedings of the 23rd International Conference on Computer Design (ICCD 2005), 2005

Architectural support for real-time task scheduling in SMT processors.
Proceedings of the 2005 International Conference on Compilers, 2005

Architectural impact of stateful networking applications.
Proceedings of the 2005 ACM/IEEE Symposium on Architecture for Networking and Communications Systems, 2005

2004
Register Constrained Modulo Scheduling.
IEEE Trans. Parallel Distributed Syst., 2004

Late Allocation and Early Release of Physical Registers.
IEEE Trans. Computers, 2004

A low-complexity fetch architecture for high-performance superscalar processors.
ACM Trans. Archit. Code Optim., 2004

Toward kilo-instruction processors.
ACM Trans. Archit. Code Optim., 2004

A case for resource-conscious out-of-order processors: towards kilo-instruction in-flight processors.
SIGARCH Comput. Archit. News, 2004

QoS for High-Performance SMT Processors in Embedded Systems.
IEEE Micro, 2004

Software and Hardware Techniques to Optimize Register File Utilization in VLIW Architectures.
Int. J. Parallel Program., 2004

Dynamic Memory Instruction Bypassing.
Int. J. Parallel Program., 2004

A partitioned instruction queue to reduce instruction wakeup energy.
Int. J. High Perform. Comput. Netw., 2004

High-performance and low-power VLIW cores for numerical computations.
Int. J. High Perform. Comput. Netw., 2004

A latency-conscious SMT branch prediction architecture.
Int. J. High Perform. Comput. Netw., 2004

Future ILP processors.
Int. J. High Perform. Comput. Netw., 2004

Optimising long-latency-load-aware fetch policies for SMT processors.
Int. J. High Perform. Comput. Netw., 2004

Evaluating kilo-instruction multiprocessors.
Proceedings of the 3rd Workshop on Memory Performance Issues, 2004

Initial Evaluation of Multimedia Extensions on VLIW Architectures.
Proceedings of the Computer Systems: Architectures, 2004

Performance and Power Evaluation of Clustered VLIW Processors with Wide Functional Units.
Proceedings of the Computer Systems: Architectures, 2004

An Optimized Front-End Physical Register File with Banking and Writeback Filtering.
Proceedings of the Power-Aware Computer Systems, 4th International Workshop, 2004

Dynamically Controlled Resource Allocation in SMT Processors.
Proceedings of the 37th Annual International Symposium on Microarchitecture (MICRO-37 2004), 2004

The impact of traffic aggregation on the memory performance of networking applications.
Proceedings of the 2004 workshop on MEmory performance, 2004

A Content Aware Integer Register File Organization.
Proceedings of the 31st International Symposium on Computer Architecture (ISCA 2004), 2004

Prophet/Critic Hybrid Branch Prediction.
Proceedings of the 31st International Symposium on Computer Architecture (ISCA 2004), 2004

DCache Warn: An I-Fetch Policy to Increase SMT Efficiency.
Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS 2004), 2004

A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors.
Proceedings of the 10th International Conference on High-Performance Computer Architecture (HPCA-10 2004), 2004

Out-of-Order Commit Processors.
Proceedings of the 10th International Conference on High-Performance Computer Architecture (HPCA-10 2004), 2004

Enabling SMT for real-time embedded systems.
Proceedings of the 2004 12th European Signal Processing Conference, 2004

Maintaining Thousands of In-flight Instructions.
Proceedings of the Euro-Par 2004 Parallel Processing, 2004

Feasibility of QoS for SMT.
Proceedings of the Euro-Par 2004 Parallel Processing, 2004

Implicit vs. Explicit Resource Allocation in SMT Processors.
Proceedings of the 2004 Euromicro Symposium on Digital Systems Design (DSD 2004), Architectures, Methods and Tools, 31 August, 2004

A first glance at Kilo-instruction based multiprocessors.
Proceedings of the First Conference on Computing Frontiers, 2004

Predictable performance in SMT processors.
Proceedings of the First Conference on Computing Frontiers, 2004

Reducing Fetch Architecture Complexity Using Procedure Inlining.
Proceedings of the 8th Annual Workshop on Interaction between Compilers and Computer Architecture (INTERACT-8 2004), 2004

2003
A Cost-Effective Architecture for Vectorizable Numerical and Multimedia Applications.
Theory Comput. Syst., 2003

A Case for Resource-conscious Out-of-order Processors.
IEEE Comput. Archit. Lett., 2003

Design and Implementation of High-Performance Memory Systems for Future Packet Buffers.
Proceedings of the 36th Annual International Symposium on Microarchitecture, 2003

An MPEG-4 performance study for non-SIMD, general purpose architectures.
Proceedings of the 2003 IEEE International Symposium on Performance Analysis of Systems and Software, 2003

A Simple Low-Energy Instruction Wakeup Mechanism.
Proceedings of the High Performance Computing, 5th International Symposium, 2003

Power-Performance Trade-Offs in Wide and Clustered VLIW Cores for Numerical Codes.
Proceedings of the High Performance Computing, 5th International Symposium, 2003

Tolerating Branch Predictor Latency on SMT.
Proceedings of the High Performance Computing, 5th International Symposium, 2003

Kilo-instruction Processors.
Proceedings of the High Performance Computing, 5th International Symposium, 2003

Improving Memory Latency Aware Fetch Policies for SMT Processors.
Proceedings of the High Performance Computing, 5th International Symposium, 2003

Hierarchical Clustered Register File Organization for VLIW Processors.
Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), 2003

A conflict-free memory banking architecture for fast VOQ packet buffers.
Proceedings of the Global Telecommunications Conference, 2003

2002
Errata on "Measuring Experimental Error in Microprocessor Simulation".
SIGARCH Comput. Archit. News, 2002

Software Trace Cache for Commercial Applications.
Int. J. Parallel Program., 2002

Initial Results on Fuzzy Floating Point Computation for Multimedia Processors.
IEEE Comput. Archit. Lett., 2002

Fetching instruction streams.
Proceedings of the 35th Annual International Symposium on Microarchitecture, 2002

Three-dimensional memory vectorization for high bandwidth media memory systems.
Proceedings of the 35th Annual International Symposium on Microarchitecture, 2002

A Comprehensive Analysis of Indirect Branch Prediction.
Proceedings of the High Performance Computing, 4th International Symposium, 2002

Studying New Ways for Improving Adaptive History Length Branch Predictors.
Proceedings of the High Performance Computing, 4th International Symposium, 2002

Speculative Dynamic Vectorization.
Proceedings of the 29th International Symposium on Computer Architecture (ISCA 2002), 2002

Hardware Schemes for Early Register Release.
Proceedings of the 31st International Conference on Parallel Processing (ICPP 2002), 2002

A Comparative Study of Redundancy in Trace Caches (Research Note).
Proceedings of the Euro-Par 2002, 2002

Cost effective memory disambiguation for multimedia codes.
Proceedings of the International Conference on Compilers, 2002

Cost-Effective Compiler Directed Memory Prefetching and Bypassing.
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques (PACT 2002), 2002

2001
Cost-Conscious Strategies to Increase Performance of Numerical Programs on Aggressive VLIW Architectures.
IEEE Trans. Computers, 2001

Lifetime-Sensitive Modulo Scheduling in a Production Environment.
IEEE Trans. Computers, 2001

Parallel architecture and compilation techniques: selection of workshop papers, guests' editors introduction.
SIGARCH Comput. Archit. News, 2001

Instruction fetch architectures and code layout optimizations.
Proc. IEEE, 2001

Early 21st Century Processors - Guest Editors' Introduction.
Computer, 2001

Modulo scheduling with integrated register spilling for clustered VLIW architectures.
Proceedings of the 34th Annual International Symposium on Microarchitecture, 2001

MIRS: Modulo Scheduling with Integrated Register Spilling.
Proceedings of the Languages and Compilers for Parallel Computing, 2001

Code layout optimizations for transaction processing workloads.
Proceedings of the 28th Annual International Symposium on Computer Architecture, 2001

A novel renaming mechanism that boosts software prefetching.
Proceedings of the 15th international conference on Supercomputing, 2001

On the potential of tolerant region reuse for multimedia applications.
Proceedings of the 15th international conference on Supercomputing, 2001

DLP + TLP Processors for the Next Generation of Media Workloads.
Proceedings of the Seventh International Symposium on High-Performance Computer Architecture (HPCA'01), 2001

Topic 15+20: Multimedia and Embedded Systems.
Proceedings of the Euro-Par 2001: Parallel Processing, 2001

Branch Prediction Using Profile Data.
Proceedings of the Euro-Par 2001: Parallel Processing, 2001

On the Efficiency of Reductions in µ-SIMD Media Extensions.
Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques (PACT 2001), 2001

2000
Dynamic Register Renaming Through Virtual-Physical Registers.
J. Instr. Level Parallelism, 2000

Improved spill code generation for software pipelined loops.
Proceedings of the 2000 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2000

Two-level hierarchical register file organization for VLIW processors.
Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture, 2000

Architectures for One Billion of Transistors.
Proceedings of the 13th International Symposium on System Synthesis, 2000

Multiple-banked register file architectures.
Proceedings of the 27th International Symposium on Computer Architecture (ISCA 2000), 2000

Trace Cache Redundancy: Red & Blue Traces.
Proceedings of the Sixth International Symposium on High-Performance Computer Architecture, 2000

On the Performance of Fetch Engines Running DSS Workloads.
Proceedings of the Euro-Par 2000, Parallel Processing, 6th International Euro-Par Conference, Munich, Germany, August 29, 2000

Parallel Computer Architecture.
Proceedings of the Euro-Par 2000, Parallel Processing, 6th International Euro-Par Conference, Munich, Germany, August 29, 2000

The Effect of Code Reordering on Branch Prediction.
Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques (PACT'00), 2000

1999
A Simulation Study of Decoupled Vector Architectures.
J. Supercomput., 1999

Enhancing and Exploiting the Locality.
IEEE Trans. Computers, 1999

MOM: a Matrix SIMD Instruction Set Architecture for Multimedia Applications.
Proceedings of the ACM/IEEE Conference on Supercomputing, 1999

Delaying Physical Register Allocation through Virtual-Physical Registers.
Proceedings of the 32nd Annual IEEE/ACM International Symposium on Microarchitecture, 1999

Exploiting a New Level of DLP in Multimedia Applications.
Proceedings of the 32nd Annual IEEE/ACM International Symposium on Microarchitecture, 1999

Software trace cache.
Proceedings of the 13th international conference on Supercomputing, 1999

Adding a vector unit to a superscalar processor.
Proceedings of the 13th international conference on Supercomputing, 1999

Increasing effective IPC by exploiting distant parallelism.
Proceedings of the 13th international conference on Supercomputing, 1999

Optimization of Instruction Fetch for Decision Support Workloads.
Proceedings of the International Conference on Parallel Processing 1999, 1999

Impact on Performance of Fused Multiply-Add Units in Aggressive VLIW Architectures.
Proceedings of the International Conference on Parallel Processing 1999, 1999

Instruction-Level Parallelism and Uniprocessor Architecture - Introduction.
Proceedings of the Euro-Par '99 Parallel Processing, 5th International Euro-Par Conference, Toulouse, France, August 31, 1999

Quantifying the Benefits of SPECint Distant Parallelism in Simultaneous Multi-Threading Architectures.
Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques, 1999

1998
Modulo Scheduling with Reduced Register Pressure.
IEEE Trans. Computers, 1998

Quantitative Evaluation of Register Pressure on Software Pipelined Loops.
Int. J. Parallel Program., 1998

Registers Size Influence on Vector Architectures.
Proceedings of the Vector and Parallel Processing, 1998

An ISA Comparison Between Superscalar and Vector Processors.
Proceedings of the Vector and Parallel Processing, 1998

Effective usage of vector registers in decoupled vector architectures.
Proceedings of the Sixth Euromicro Workshop on Parallel and Distributed Processing, 1998

A case for merging the ILP and DLP paradigms.
Proceedings of the Sixth Euromicro Workshop on Parallel and Distributed Processing, 1998

Widening Resources: A Cost-effective Technique for Aggressive ILP Architectures.
Proceedings of the 31st Annual IEEE/ACM International Symposium on Microarchitecture, 1998

A Performance Study of Out-of-order Vector Architectures and Short Registers.
Proceedings of the 12th international conference on Supercomputing, 1998

Resource Widening Versus Replication: Limits and Performance-cost Trade-off.
Proceedings of the 12th international conference on Supercomputing, 1998

Vector Architectures: Past, Present and Future.
Proceedings of the 12th international conference on Supercomputing, 1998

Virtual-Physical Registers.
Proceedings of the Fourth International Symposium on High-Performance Computer Architecture, Las Vegas, Nevada, USA, January 31, 1998

Command Vector Memory Systems: High Performance at Low Cost.
Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques, 1998

1997
Exploiting instruction- and data-level parallelism.
IEEE Micro, 1997

Out-of-Order Vector Architectures.
Proceedings of the Thirtieth Annual IEEE/ACM International Symposium on Microarchitecture, 1997

Increasing Memory Bandwidth with Wide Buses: Compiler, Hardware and Performance Trade-Offs.
Proceedings of the 11th international conference on Supercomputing, 1997

Eliminating Cache Conflict Misses through XOR-Based Placement Functions.
Proceedings of the 11th international conference on Supercomputing, 1997

A Victim Cache for Vector Registers.
Proceedings of the 11th international conference on Supercomputing, 1997

Multithreaded Vector Architectures.
Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture (HPCA '97), 1997

Virtual registers.
Proceedings of the Fourth International Conference on High-Performance Computing, 1997

Simultaneous multithreaded vector architecture: merging ILP and DLP for high performance.
Proceedings of the Fourth International Conference on High-Performance Computing, 1997

Effective Usage of Vector Registers in Advanced Vector Architectures.
Proceedings of the 1997 Conference on Parallel Architectures and Compilation Techniques (PACT '97), 1997

Static Locality Analysis for Cache Management.
Proceedings of the 1997 Conference on Parallel Architectures and Compilation Techniques (PACT '97), 1997

1996
Loop Parallelization: Revisiting Framework of Unimodular Transformations.
Proceedings of the 4th Euromicro Workshop on Parallel and Distributed Processing (PDP '96), 1996

Heuristics for Register-Constrained Software Pipelining.
Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture, 1996

Decoupled Vector Architectures.
Proceedings of the Second International Symposium on High-Performance Computer Architecture, 1996

Swing modulo scheduling: a lifetime-sensitive approach.
Proceedings of the Fifth International Conference on Parallel Architectures and Compilation Techniques, 1996

1995
Conflict-Free Access for Streams in Multimodule Memories.
IEEE Trans. Computers, 1995

Analyzing reference patterns in automatic data distribution tools.
Int. J. Parallel Program., 1995

Quantitative analysis of vector code.
Proceedings of the 3rd Euromicro Workshop on Parallel and Distributed Processing (PDP '95), 1995

Hypernode reduction modulo scheduling.
Proceedings of the 28th Annual International Symposium on Microarchitecture, Ann Arbor, Michigan, USA, November 29, 1995

Vector Multiprocessors with Arbitrated Memory Access.
Proceedings of the 22nd Annual International Symposium on Computer Architecture, 1995

A Data Cache with Multiple Caching Strategies Tuned to Different Types of Locality.
Proceedings of the 9th international conference on Supercomputing, 1995

Non-Consistent Dual Register Files to Reduce Register Pressure.
Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture (HPCA 1995), 1995

Automatic generation of loop scheduling for VLIW.
Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques, 1995

1994
Network Synchronization and Out-of-Order Access to Vectors.
Parallel Process. Lett., 1994

Access To Vectors In Multi-module Memories.
Proceedings of the Second Euromicro Workshop on Parallel and Distributed Processing, 1994

Detecting and Using Affinity in an Automatic Data Distribution Tool.
Proceedings of the Languages and Compilers for Parallel Computing, 1994

Synchronized access to streams in SIMD vector multiprocessors.
Proceedings of the 8th international conference on Supercomputing, 1994

Memory Access Synchronization in Vector Multiprocessors.
Proceedings of the Parallel Processing: CONPAR 94, 1994

Using Sacks to Organize Registers in VLIW Machines.
Proceedings of the Parallel Processing: CONPAR 94, 1994

1993
Chairmen's introduction.
Microprocess. Microprogramming, 1993

Conflict-free access to streams in multiprocessor systems.
Microprocess. Microprogramming, 1993

Access to streams in multiprocessor systems.
Proceedings of the 1993 Euromicro Workshop on Parallel and Distributed Processing, 1993

Align and Distribute-based Linear Loop Transformations.
Proceedings of the Languages and Compilers for Parallel Computing, 1993

1992
A method for implementation of one-dimensional systolic algorithms with data contraflow using pipelined functional units.
J. VLSI Signal Process., 1992

Increasing the Number of Strides for Conflict-Free Vector Access.
Proceedings of the 19th Annual International Symposium on Computer Architecture. Gold Coast, 1992

Conflict-free access of vectors with power-of-two strides.
Proceedings of the 6th international conference on Supercomputing, 1992

1991
Conflict-Free Strides for Vectors in Matched Memories.
Parallel Process. Lett., 1991

Balanced Loop Partitioning Using GTS.
Proceedings of the Languages and Compilers for Parallel Computing, 1991

On Automatic Loop Data-Mapping for Distributed-Memory Multiprocessors.
Proceedings of the Distributed Memory Computing, 2nd European Conference, 1991

Mapping QR decomposition of a banded matrix on a 1D systolic array with data contraflow and pipelined functional units.
Proceedings of the Algorithms and Parallel VLSI Architectures II, 1991

1990
Implementation of systolic algorithms using pipelined functional units.
Proceedings of the Application Specific Array Processors, 1990

1989
A block algorithm and optimal fixed-size systolic array processor for the algebraic path problem.
J. VLSI Signal Process., 1989

Systematic Hardware Adaptation of Systolic Algorithms.
Proceedings of the 16th Annual International Symposium on Computer Architecture. Jerusalem, 1989

1987
A Discrete Optimization Problem in Local Networks and Data Alignment.
IEEE Trans. Computers, 1987

Partitioning: An Essential Step in Mapping Algorithms Into Systolic Array Processors.
Computer, 1987

1986
Computing Size-Independent Matrix Problems on Systolic Array Processors.
Proceedings of the 13th Annual Symposium on Computer Architecture, Tokyo, Japan, June 1986, 1986

Solving Matrix Problems with No Size Restriction on a Systolic Array Processor.
Proceedings of the International Conference on Parallel Processing, 1986

1985
Analysis and Simulation of Multiplexed Single-Bus Networks With and Without Buffering.
Proceedings of the 12th Annual Symposium on Computer Architecture, 1985

1983
Reduction of Connections for Multibus Organization.
IEEE Trans. Computers, 1983

A performance evaluation of the multiple bus network for multiprocessor systems.
Proceedings of the International Conference on Measurements and Modeling of Computer Systems, 1983

1982
Bandwidth of Crossbar and Multiple-Bus Connections for Multiprocessors.
IEEE Trans. Computers, 1982

