Nuno Roma

Orcid: 0000-0003-2491-4977

According to our database, Nuno Roma authored at least 117 papers between 1999 and 2024.

Bibliography

2024
gem5-accel: A Pre-RTL Simulation Toolchain for Accelerator Architecture Validation.
IEEE Comput. Archit. Lett., 2024

NDPmulator: Enabling Full-System Simulation for Near-Data Accelerators From Caches to DRAM.
IEEE Access, 2024

2023
SBCCI 2022.
IEEE Des. Test, October, 2023

Trading Performance, Power, and Area on Low-Precision Posit MAC Units for CNN Training.
Proceedings of the 35th IEEE International Symposium on Computer Architecture and High Performance Computing, 2023

GPU Acceleration of MIP Intra Prediction in VVC.
Proceedings of the 31st European Signal Processing Conference, 2023

Neural Network Predictor for Fast Channel Change on DVB Set-Top-Boxes.
Proceedings of the Design and Architecture for Signal and Image Processing, 2023

Supporting RISC-V Performance Counters Through Linux Performance Analysis Tools.
Proceedings of the 34th IEEE International Conference on Application-specific Systems, 2023

2022
Unified Posit/IEEE-754 Vector MAC Unit for Transprecision Computing.
IEEE Trans. Circuits Syst. II Express Briefs, 2022

Compiling for Vector Extensions With Stream-Based Specialization.
IEEE Micro, 2022

Decoupling GPGPU voltage-frequency scaling for deep-learning applications.
J. Parallel Distributed Comput., 2022

gem5-ndp: Near-Data Processing Architecture Simulation From Low Level Caches to DRAM.
Proceedings of the 2022 IEEE 34th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2022

Early prototyping and testing of CERN LHC CMS high-granularity calorimeter slow-control system.
Proceedings of the IEEE International Workshop on Rapid System Prototyping, 2022

Mode-Adaptive Subsampling of SAD/SSE Operations for Intra Prediction Cost Reduction.
Proceedings of the IEEE International Symposium on Circuits and Systems, 2022

2021
A Compute Cache System for Signal Processing Applications.
J. Signal Process. Syst., 2021

A Reconfigurable Posit Tensor Unit with Variable-Precision Arithmetic and Automatic Data Streaming.
J. Signal Process. Syst., 2021

Compiler-Assisted Data Streaming for Regular Code Structures.
IEEE Trans. Computers, 2021

Fast and energy-efficient approximate motion estimation architecture for real-time 4K UHD processing.
J. Real Time Image Process., 2021

Unlimited Vector Extension with Data Streaming Support.
Proceedings of the 48th ACM/IEEE Annual International Symposium on Computer Architecture, 2021

PositNN: Training Deep Neural Networks with Mixed Low-Precision Posit.
Proceedings of the IEEE International Conference on Acoustics, 2021

2020
UHD 8K energy-quality scalable HEVC intra-prediction SAD unit hardware using optimized and configurable imprecise adders.
J. Real Time Image Process., 2020

Dynamic Fused Multiply-Accumulate Posit Unit with Variable Exponent Size for Low-Precision DSP Applications.
Proceedings of the IEEE Workshop on Signal Processing Systems, 2020

2PSA: An Optimized and Flexible Power-Precision Scalable Adder.
Proceedings of the 33rd Symposium on Integrated Circuits and Systems Design, 2020

Exploiting Non-conventional DVFS on GPUs: Application to Deep Learning.
Proceedings of the 32nd IEEE International Symposium on Computer Architecture and High Performance Computing, 2020

Processing Convolutional Neural Networks on Cache.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Reconfigurable Stream-based Tensor Unit with Variable-Precision Posit Arithmetic.
Proceedings of the 31st IEEE International Conference on Application-specific Systems, 2020

2019
Modeling and Decoupling the GPU Power Consumption for Cross-Domain DVFS.
IEEE Trans. Parallel Distributed Syst., 2019

DVFS-aware application classification to improve GPGPUs energy efficiency.
Parallel Comput., 2019

Flying tourist problem: Flight time and cost minimization in complex routes.
Expert Syst. Appl., 2019

GPU Static Modeling Using PTX and Deep Structured Learning.
IEEE Access, 2019

Power-Efficient Approximate SAD Architecture with LOA Imprecise Adders.
Proceedings of the 10th IEEE Latin American Symposium on Circuits & Systems, 2019

Heart Disease Detection Architecture for Lead I Off-the-Person ECG Monitoring Devices.
Proceedings of the 27th European Signal Processing Conference, 2019

2018
Stream data prefetcher for the GPU memory interface.
J. Supercomput., 2018

Highly parallel HEVC decoding for heterogeneous systems with CPU and GPU.
Signal Process. Image Commun., 2018

Exploiting Compute Caches for Memory Bound Vector Operations.
Proceedings of the 30th International Symposium on Computer Architecture and High Performance Computing, 2018

GPGPU Power Modeling for Multi-domain Voltage-Frequency Scaling.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2018

2017
Adaptive In-Cache Streaming for Efficient Data Management.
IEEE Trans. Very Large Scale Integr. Syst., 2017

GHEVC: An Efficient HEVC Decoder for Graphics Processing Units.
IEEE Trans. Multim., 2017

Special issue on real-time energy-aware circuits and systems for HEVC and for its 3D and SVC extensions.
J. Real Time Image Process., 2017

GPU Parallelization of HEVC In-Loop Filters.
Int. J. Parallel Program., 2017

Efficient parallelization of perturbative Monte Carlo QM/MM simulations in heterogeneous platforms.
Int. J. High Perform. Comput. Appl., 2017

Energy-efficient motion estimation with approximate arithmetic.
Proceedings of the 19th IEEE International Workshop on Multimedia Signal Processing, 2017

2016
Adaptive Scheduling Framework for Real-Time Video Encoding on Heterogeneous Systems.
IEEE Trans. Circuits Syst. Video Technol., 2016

BowMapCL: Burrows-Wheeler Mapping on Multiple Heterogeneous Accelerators.
IEEE ACM Trans. Comput. Biol. Bioinform., 2016

GPU-assisted HEVC intra decoder.
J. Real Time Image Process., 2016

Exploiting task and data parallelism for advanced video coding on hybrid CPU + GPU platforms.
J. Real Time Image Process., 2016

Multi-objective kernel mapping and scheduling for morphable many-core architectures.
Expert Syst. Appl., 2016

Editorial to special issue on energy efficient architectures for embedded systems.
EURASIP J. Embed. Syst., 2016

A Cross-Core Performance Model for Heterogeneous Many-Core Architectures.
Proceedings of the High Performance Computing for Computational Science - VECPAR 2016, 2016

Efficient HEVC decoder for heterogeneous CPU with GPU systems.
Proceedings of the 18th IEEE International Workshop on Multimedia Signal Processing, 2016

Unsupervised variable-grained online phase clustering for heterogeneous/morphable processors.
Proceedings of the International Conference on High Performance Computing & Simulation, 2016

In-Cache Streaming: Morphable Infrastructure for Many-Core Processing Systems.
Proceedings of the Euro-Par 2016: Parallel Processing Workshops, 2016

Performance and Power-Aware Classification for Frequency Scaling of GPGPU Applications.
Proceedings of the Euro-Par 2016: Parallel Processing Workshops, 2016

2015
Multicore SIMD ASIP for Next-Generation Sequencing and Alignment Biochip Platforms.
IEEE Trans. Very Large Scale Integr. Syst., 2015

Morphable hundred-core heterogeneous architecture for energy-aware computation.
IET Comput. Digit. Tech., 2015

Acceleration of stochastic seismic inversion in OpenCL-based heterogeneous platforms.
Comput. Geosci., 2015

Implementation and performance analysis of efficient index structures for DNA search algorithms in parallel platforms.
Concurr. Comput. Pract. Exp., 2015

HEVC in-loop filters GPU parallelization in embedded systems.
Proceedings of the 2015 International Conference on Embedded Computer Systems: Architectures, 2015

Multi-kernel Auto-Tuning on GPUs: Performance and Energy-Aware Optimization.
Proceedings of the 23rd Euromicro International Conference on Parallel, 2015

Energy-Efficient Architecture for DP Local Sequence Alignment: Exploiting ILP and DLP.
Proceedings of the Bioinformatics and Biomedical Engineering, 2015

Run-Time Machine Learning for HEVC/H.265 Fast Partitioning Decision.
Proceedings of the 2015 IEEE International Symposium on Multimedia, 2015

High performance IP core for HEVC quantization.
Proceedings of the 2015 IEEE International Symposium on Circuits and Systems, 2015

Towards GPU HEVC intra decoding: Seizing fine-grain parallelism.
Proceedings of the 2015 IEEE International Conference on Multimedia and Expo, 2015

GPU acceleration of the HEVC decoder inter prediction module.
Proceedings of the 2015 IEEE Global Conference on Signal and Information Processing, 2015

Efficient data-stream management for shared-memory many-core systems.
Proceedings of the 25th International Conference on Field Programmable Logic and Applications, 2015

Fast and Scalable Thread Migration for Multi-core Architectures.
Proceedings of the 13th IEEE International Conference on Embedded and Ubiquitous Computing, 2015

2014
Dynamic Load Balancing for Real-Time Video Encoding on Heterogeneous CPU+GPU Systems.
IEEE Trans. Multim., 2014

Unified transform architecture for AVC, AVS, VC-1 and HEVC high-performance codecs.
EURASIP J. Adv. Signal Process., 2014

Cache-Oblivious parallel SIMD Viterbi decoding for sequence search in HMMER.
BMC Bioinform., 2014

Stream Oriented Modular Architecture with Polymorphic Processing Engines.
Proceedings of the 26th IEEE International Symposium on Computer Architecture and High Performance Computing Workshop, 2014

FEVES: Framework for Efficient Parallel Video Encoding on Heterogeneous Systems.
Proceedings of the 43rd International Conference on Parallel Processing, 2014

Collaborative inter-prediction on CPU+GPU systems.
Proceedings of the 2014 IEEE International Conference on Image Processing, 2014

Cooperative CPU+GPU deblocking filter parallelization for high performance HEVC video codecs.
Proceedings of the IEEE International Conference on Acoustics, 2014

Optimized ASIP architecture for compressed BWT-indexed search in bioinformatics applications.
Proceedings of the International Conference on High Performance Computing & Simulation, 2014

Burrows-Wheeler Transform based indexed exact search on a multi-GPU OpenCL platform.
Proceedings of the International Conference on High Performance Computing & Simulation, 2014

Low-power vectorial VLIW architecture for maximum parallelism exploitation of dynamic programming algorithms.
Proceedings of the International Conference on High Performance Computing & Simulation, 2014

OpenCL parallelization of the HEVC de-quantization and inverse transform for heterogeneous platforms.
Proceedings of the 22nd European Signal Processing Conference, 2014

GPU Accelerated Stochastic Inversion of Deep Water Seismic Data.
Proceedings of the Euro-Par 2014: Parallel Processing Workshops, 2014

2013
Scalable Unified Transform Architecture for Advanced Video Coding Embedded Systems.
Int. J. Parallel Program., 2013

Configurable and scalable class of high performance hardware accelerators for simultaneous DNA sequence alignment.
Concurr. Comput. Pract. Exp., 2013

HotStream: Efficient Data Streaming of Complex Patterns to Multiple Accelerating Kernels.
Proceedings of the 25th International Symposium on Computer Architecture and High Performance Computing, 2013

Transparent Application Acceleration by Intelligent Scheduling of Shared Library Calls on Heterogeneous Systems.
Proceedings of the Parallel Processing and Applied Mathematics, 2013

A flexible shared library profiler for early estimation of performance gains in heterogeneous systems.
Proceedings of the International Conference on High Performance Computing & Simulation, 2013

Scalable and high throughput biosensing platform.
Proceedings of the 23rd International Conference on Field programmable Logic and Applications, 2013

High performance multi-standard architecture for DCT computation in H.264/AVC High Profile and HEVC codecs.
Proceedings of the 2013 Conference on Design and Architectures for Signal and Image Processing, 2013

BioBlaze: Multi-core SIMD ASIP for DNA sequence alignment.
Proceedings of the 24th International Conference on Application-Specific Systems, 2013

2012
Integrated Hardware Architecture for Efficient Computation of the $n$-Best Bio-Sequence Local Alignments in Embedded Platforms.
IEEE Trans. Very Large Scale Integr. Syst., 2012

Hardware accelerator architecture for simultaneous short-read DNA sequences alignment with enhanced traceback phase.
Microprocess. Microsystems, 2012

System-level prototyping framework for heterogeneous multi-core architecture applied to biological sequence analysis.
Proceedings of the 23rd IEEE International Symposium on Rapid System Prototyping, 2012

Multi-level Parallelization of Advanced Video Coding on Hybrid CPU+GPU Platforms.
Proceedings of the Euro-Par 2012: Parallel Processing Workshops, 2012

High Performance Unified Architecture for Forward and Inverse Quantization in H.264/AVC.
Proceedings of the 15th Euromicro Conference on Digital System Design, 2012

2011
A flexible architecture for the computation of direct and inverse transforms in H.264/AVC video codecs.
IEEE Trans. Consumer Electron., 2011

A tutorial overview on the properties of the discrete cosine transform for encoded image and video processing.
Signal Process., 2011

High throughput and scalable architecture for unified transform coding in embedded H.264/AVC video coding systems.
Proceedings of the 2011 International Conference on Embedded Computer Systems: Architectures, 2011

Advantages and GPU implementation of high-performance indexed DNA search based on suffix arrays.
Proceedings of the 2011 International Conference on High Performance Computing & Simulation, 2011

2010
p264: open platform for designing parallel H.264/AVC video encoders on multi-core systems.
Proceedings of the Network and Operating System Support for Digital Audio and Video, 2010

H.264/AVC framework for multi-core embedded video encoders.
Proceedings of the 2010 International Symposium on System on Chip, SoC 2010, Tampere, 2010

Integrated accelerator architecture for DNA sequences alignment with enhanced traceback phase.
Proceedings of the 2010 International Conference on High Performance Computing & Simulation, 2010

Hardware/software co-design of H.264/AVC encoders for multi-core embedded systems.
Proceedings of the 2010 Conference on Design & Architectures for Signal & Image Processing, 2010

A Parallel Programming Framework for Multi-core DNA Sequence Alignment.
Proceedings of the CISIS 2010, 2010

2009
Distributed Software Platform for Automation and Control of General Anaesthesia.
Proceedings of the Eighth International Symposium on Parallel and Distributed Computing, 2009

2008
Application Specific Programmable IP Core for Motion Estimation: Technology Comparison Targeting Efficient Embedded Co-Processing Units.
Proceedings of the 11th Euromicro Conference on Digital System Design: Architectures, 2008

2007
Reconfigurable architectures and processors for real-time video motion estimation.
J. Real Time Image Process., 2007

Adaptive Motion Estimation Processor for Autonomous Video Devices.
EURASIP J. Embed. Syst., 2007

Efficient Hybrid DCT-Domain Algorithm for Video Spatial Downscaling.
EURASIP J. Adv. Signal Process., 2007

Adaptive Motion Estimation Algorithm for H.264/AVC.
Proceedings of the 15th International Conference on Digital Signal Processing, 2007

2006
Low Power Distance Measurement Unit for Real-Time Hardware Motion Estimators.
Proceedings of the Integrated Circuit and System Design. Power and Timing Modeling, 2006

Application Specific Instruction Set Processor for Adaptive Video Motion Estimation.
Proceedings of the Ninth Euromicro Conference on Digital System Design: Architectures, Methods and Tools (DSD 2006), 30 August, 2006

2005
Efficient VLSI Architecture for Real-Time Motion Estimation in Advanced Video Coding.
Proceedings of the 2005 IEEE International SOC Conference, 2005

Least squares motion estimation algorithm in the compressed DCT domain for H.26x/MPEG-x video sequences.
Proceedings of the Advanced Video and Signal Based Surveillance, 2005

2003
Automatic Synthesis of Motion Estimation Processors Based on a New Class of Hardware Architectures.
J. VLSI Signal Process., 2003

Fast transcoding architectures for insertion of non-regular shaped objects in the compressed DCT-domain.
Signal Process. Image Commun., 2003

Customisable Core-Based Architectures for Real-Time Motion Estimation on FPGAs.
Proceedings of the Field Programmable Logic and Application, 13th International Conference, 2003

2002
Efficient and configurable full-search block-matching processors.
IEEE Trans. Circuits Syst. Video Technol., 2002

Insertion of irregular-shaped logos in the compressed DCT domain.
Proceedings of the 14th International Conference on Digital Signal Processing, 2002

2001
A New Efficient VLSI Architecture for Full Search Block Matching Motion Estimation.
Proceedings of the SOC Design Methodologies, 2001

2000
On the Development and Evaluation of Specialized Processors for Computing High-Order 2-D Image Moments in Real-Time.
Proceedings of the Fifth International Workshop on Computer Architectures for Machine Perception (CAMP 2000), 2000

1999
Low-power array architectures for motion estimation.
Proceedings of the Third IEEE Workshop on Multimedia Signal Processing, 1999

