Ben H. H. Juurlink

Affiliations:
  • TU Berlin, Department of Computer Engineering and Microelectronics


According to our database, Ben H. H. Juurlink authored at least 164 papers between 1993 and 2022.

Bibliography

2022
A Quantitative Study of Locality in GPU Caches for Memory-Divergent Workloads.
Int. J. Parallel Program., 2022

Memory Access Granularity Aware Lossless Compression for GPUs.
Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium, 2022

FLEXDP: flexible frequency scaling for energy-delay product optimization of GPU applications.
Proceedings of the CF '22: 19th ACM International Conference on Computing Frontiers, Turin, Italy, May 17, 2022

Effects of Approximate Computing on Workload Characteristics.
Proceedings of the Architecture of Computing Systems - 35th International Conference, 2022

2021
Easy and efficient agent-based simulations with the OpenABL language and compiler.
Future Gener. Comput. Syst., 2021

Lightweight Dual Modular Redundancy through Approximate Computing.
Proceedings of the XI Brazilian Symposium on Computing Systems Engineering, 2021

Model-Based Loop Perforation.
Proceedings of the Euro-Par 2021: Parallel Processing Workshops, 2021

ALONA: Automatic Loop Nest Approximation with Reconstruction and Space Pruning.
Proceedings of the Euro-Par 2021: Parallel Processing, 2021

QSLC: Quantization-Based, Low-Error Selective Approximation for GPUs.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2021

2020
Vectorization cost modeling for NEON, AVX and SVE.
Perform. Evaluation, 2020

Accurate Energy and Performance Prediction for Frequency-Scaled GPU Kernels.
Comput., 2020

A Quantitative Study of Locality in GPU Caches.
Proceedings of the Embedded Computer Systems: Architectures, Modeling, and Simulation, 2020

Efficient Wavefront Parallel Processing for HEVC CABAC Decoding.
Proceedings of the 28th Euromicro International Conference on Parallel, 2020

Accelerating the VVC Decoder for Vector Length Agnostic SIMD Architectures.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2020

DenseDisp: Resource-Aware Disparity Map Estimation by Compressing Siamese Neural Architecture.
Proceedings of the IEEE Congress on Evolutionary Computation, 2020

2019
VComputeLib: Enabling Cross-Platform GPGPU on Mobile and Embedded GPUs.
Proceedings of the MoMM 2019: The 17th International Conference on Advances in Mobile Computing & Multimedia, 2019

Portable Cost Modeling for Auto-Vectorizers.
Proceedings of the 27th IEEE International Symposium on Modeling, 2019

A Bin-Based Bitstream Partitioning Approach for Parallel CABAC Decoding in Next Generation Video Coding.
Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019

An Efficient Lightweight Framework for Porting Vision Algorithms on Embedded SoCs.
Proceedings of the Analysis, Estimations, and Applications of Embedded Systems, 2019

A Performance Analysis of Vector Length Agnostic Code.
Proceedings of the 17th International Conference on High Performance Computing & Simulation, 2019

Performance Counters based Power Modeling of Mobile GPUs using Deep Learning.
Proceedings of the 17th International Conference on High Performance Computing & Simulation, 2019

Approximating Memory-bound Applications on Mobile GPUs.
Proceedings of the 17th International Conference on High Performance Computing & Simulation, 2019

Evaluating the Memory Architecture of Next-Generation FPGA-SoCs for HPC.
Proceedings of the 17th International Conference on High Performance Computing & Simulation, 2019

Predictable GPUs Frequency Scaling for Energy and Performance.
Proceedings of the 48th International Conference on Parallel Processing, 2019

SLC: Memory Access Granularity Aware Selective Lossy Compression for GPUs.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2019

MEMPower: Data-Aware GPU Memory Power Model.
Proceedings of the Architecture of Computing Systems - ARCS 2019, 2019

2018
Highly parallel HEVC decoding for heterogeneous systems with CPU and GPU.
Signal Process. Image Commun., 2018

Performance evaluation of implicit and explicit SIMDization.
Microprocess. Microsystems, 2018

Control Flow Vectorization for ARM NEON.
Proceedings of the 21st International Workshop on Software and Compilers for Embedded Systems, 2018

Accelerating the RICH Particle Detector Algorithm on Intel Xeon Phi.
Proceedings of the 26th Euromicro International Conference on Parallel, 2018

An Application-Specific Memory Management Unit for FPGA-SoCs.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018

VComputeBench: A Vulkan Benchmark Suite for GPGPU on Mobile and Embedded GPUs.
Proceedings of the 2018 IEEE International Symposium on Workload Characterization, 2018

OpenABL: A Domain-Specific Language for Parallel and Distributed Agent-Based Simulations.
Proceedings of the Euro-Par 2018: Parallel Processing, 2018

Optimal DC/AC data bus inversion coding.
Proceedings of the 2018 Design, Automation & Test in Europe Conference & Exhibition, 2018

Cost Modelling for Vectorization on ARM.
Proceedings of the IEEE International Conference on Cluster Computing, 2018

Local memory-aware kernel perforation.
Proceedings of the 2018 International Symposium on Code Generation and Optimization, 2018

2017
GPU Parallelization of HEVC In-Loop Filters.
Int. J. Parallel Program., 2017

Application-Specific Cache and Prefetching for HEVC CABAC Decoding.
IEEE Multim., 2017

The LPGPU2 Project: Low-Power Parallel Computing on GPUs: Extended Abstract.
Proceedings of the 20th International Workshop on Software and Compilers for Embedded Systems, 2017

Stencil Autotuning with Ordinal Regression: Extended Abstract.
Proceedings of the 20th International Workshop on Software and Compilers for Embedded Systems, 2017

E^2MC: Entropy Encoding Based Memory Compression for GPUs.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

Autotuning Stencil Computations with Structural Ordinal Regression Learning.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

Syntax Element Partitioning for high-throughput HEVC CABAC decoding.
Proceedings of the 2017 IEEE International Conference on Acoustics, 2017

A Methodology for Predicting Application-Specific Achievable Memory Bandwidth for HW/SW-Codesign.
Proceedings of the Euromicro Conference on Digital System Design, 2017

Enabling GPU software developers to optimize their applications - The LPGPU2 approach.
Proceedings of the 2017 Conference on Design and Architectures for Signal and Image Processing, 2017

Static optimization in PHP 7.
Proceedings of the 26th International Conference on Compiler Construction, 2017

A Quantitative Analysis of the Memory Architecture of FPGA-SoCs.
Proceedings of the Applied Reconfigurable Computing - 13th International Symposium, 2017

2016
An evaluation of current SIMD programming models for C++.
Proceedings of the 3rd Workshop on Programming Models for SIMD/Vector Processing, 2016

Efficient HEVC decoder for heterogeneous CPU with GPU systems.
Proceedings of the 18th IEEE International Workshop on Multimedia Signal Processing, 2016

ALUPower: Data Dependent Power Consumption in GPUs.
Proceedings of the 24th IEEE International Symposium on Modeling, 2016

FPGA based hardware accelerator for KAZE feature extraction algorithm.
Proceedings of the 2016 International Conference on Field-Programmable Technology, 2016

The neuro vector engine: Flexibility to improve convolutional net efficiency for wearable vision.
Proceedings of the 2016 Design, Automation & Test in Europe Conference & Exhibition, 2016

2015
Parallel H.264/AVC Motion Compensation for GPUs Using OpenCL.
IEEE Trans. Circuits Syst. Video Technol., 2015

SIMD Acceleration for HEVC Decoding.
IEEE Trans. Circuits Syst. Video Technol., 2015

Spatiotemporal SIMT and Scalarization for Improving GPU Efficiency.
ACM Trans. Archit. Code Optim., 2015

Reducing HEVC encoding complexity using two-stage motion estimation.
Proceedings of the 2015 Visual Communications and Image Processing, 2015

On latency in GPU throughput microarchitectures.
Proceedings of the 2015 IEEE International Symposium on Performance Analysis of Systems and Software, 2015

Optimizing HEVC CABAC Decoding with a Context Model Cache and Application-Specific Prefetching.
Proceedings of the 2015 IEEE International Symposium on Multimedia, 2015

Nexus#: A Distributed Hardware Task Manager for Task-Based Programming Models.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

An Efficient and Flexible FPGA Implementation of a Face Detection System (Abstract Only).
Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2015

High Performance Memory Accesses on FPGA-SoCs: A Quantitative Analysis.
Proceedings of the 23rd IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2015

Multi/many-core programming: where are we standing?
Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition, 2015

An Efficient and Flexible FPGA Implementation of a Face Detection System.
Proceedings of the Applied Reconfigurable Computing - 11th International Symposium, 2015

2014
Low-Power High-Efficiency Video Decoding using General-Purpose Processors.
ACM Trans. Archit. Code Optim., 2014

TACO: A scheduling scheme for parallel applications on multicore architectures.
Sci. Program., 2014

GPGPU workload characteristics and performance analysis.
Proceedings of the XIVth International Conference on Embedded Computer Systems: Architectures, 2014

An Integrated Hardware-Software Approach to Task Graph Management.
Proceedings of the 2014 IEEE International Conference on High Performance Computing and Communications, 2014

A generic implementation of a quantified predictor on FPGAs.
Proceedings of the Great Lakes Symposium on VLSI 2014, GLSVLSI '14, Houston, TX, USA - May 21, 2014

2013
Parallel HEVC Decoding on Multi- and Many-core Architectures - A Power and Performance Analysis.
J. Signal Process. Syst., 2013

How a single chip causes massive power bills GPUSimPow: A GPGPU power simulator.
Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems & Software, 2013

HEVC performance and complexity for 4K video.
Proceedings of the IEEE Third International Conference on Consumer Electronics, 2013

2012
Scalable Parallel Programming Applied to H.264/AVC Decoding.
Springer Briefs in Computer Science, Springer, ISBN: 978-1-4614-2230-3, 2012

Parallel Scalability and Efficiency of HEVC Parallelization Approaches.
IEEE Trans. Circuits Syst. Video Technol., 2012

Amdahl's law for predicting the future of multicores considered harmful.
SIGARCH Comput. Archit. News, 2012

Using OpenMP superscalar for parallelization of embedded and consumer applications.
Proceedings of the 2012 International Conference on Embedded Computer Systems: Architectures, 2012

Programming parallel embedded and consumer applications in OpenMP superscalar.
Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2012

SynZEN: A hybrid TTA/VLIW architecture with a distributed register file.
Proceedings of the NORCHIP 2012, Copenhagen, Denmark, November 12-13, 2012

Hardware-Based Task Dependency Resolution for the StarSs Programming Model.
Proceedings of the 41st International Conference on Parallel Processing Workshops, 2012

Improving the parallelization efficiency of HEVC decoding.
Proceedings of the 19th IEEE International Conference on Image Processing, 2012

Parallel video decoding in the emerging HEVC standard.
Proceedings of the 2012 IEEE International Conference on Acoustics, 2012

An Optimized Parallel IDCT on Graphics Processing Units.
Proceedings of the Euro-Par 2012: Parallel Processing Workshops, 2012

A Predictor-Based Power-Saving Policy for DRAM Memories.
Proceedings of the 15th Euromicro Conference on Digital System Design, 2012

2011
A Highly Scalable Parallel Implementation of H.264.
Trans. High Perform. Embed. Archit. Compil., 2011

Multi-Core - the Future of Embedded Systems.
Proceedings of the Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen (MBMV), 2011

Poster: implications of merging phases on scalability of multi-core architectures.
Proceedings of the 25th International Conference on Supercomputing, 2011, Tucson, AZ, USA, May 31, 2011

A QHD-capable parallel H.264 decoder.
Proceedings of the 25th International Conference on Supercomputing, 2011, Tucson, AZ, USA, May 31, 2011

Implications of Merging Phases on Scalability of Multi-core Architectures.
Proceedings of the International Conference on Parallel Processing, 2011

Nexus: Hardware Support for Task-Based Programming.
Proceedings of the 14th Euromicro Conference on Digital System Design, 2011

Composable local memory organisation for streaming applications on embedded MPSoCs.
Proceedings of the 8th Conference on Computing Frontiers, 2011

An Instruction to Accelerate Software Caches.
Proceedings of the Architecture of Computing Systems - ARCS 2011, 2011

2010
The SARC Architecture.
IEEE Micro, 2010

A Multidimensional Software Cache for Scratchpad-Based Systems.
Int. J. Embed. Real Time Commun. Syst., 2010

Evaluation of parallel H.264 decoding strategies for the Cell Broadband Engine.
Proceedings of the 24th International Conference on Supercomputing, 2010

Extending the Cell SPE with Energy Efficient Branch Prediction.
Proceedings of the Euro-Par 2010 - Parallel Processing, 16th International Euro-Par Conference, Ischia, Italy, August 31, 2010

A Case for Hardware Task Management Support for the StarSS Programming Model.
Proceedings of the 13th Euromicro Conference on Digital System Design, 2010

Instruction precomputation with memoization for fault detection.
Proceedings of the Design, Automation and Test in Europe, 2010

Protective redundancy overhead reduction using instruction vulnerability factor.
Proceedings of the 7th Conference on Computing Frontiers, 2010

2009
Parallel Scalability of Video Decoders.
J. Signal Process. Syst., 2009

Leakage-Aware Multiprocessor Scheduling.
J. Signal Process. Syst., 2009

Instruction-Level Fault Tolerance Configurability.
J. Signal Process. Syst., 2009

An efficient software cache for H.264 motion compensation.
Proceedings of the 2008 IEEE International Symposium on System-on-Chip, 2009

Scalability of Macroblock-level Parallelism for H.264 Decoding.
Proceedings of the 15th IEEE International Conference on Parallel and Distributed Systems, 2009

Intra-vector SIMD instructions for core specialization.
Proceedings of the 27th International Conference on Computer Design, 2009

Parallel H.264 Decoding on an Embedded Multicore Processor.
Proceedings of the High Performance Embedded Architectures and Compilers, 2009

Introduction.
Proceedings of the Euro-Par 2009 Parallel Processing, 2009

The 3TU embedded systems master in the Netherlands.
Proceedings of the 2009 Workshop on Embedded Systems Education, 2009

SIMD Architectural Enhancements to Improve the Performance of the 2D Discrete Wavelet Transform.
Proceedings of the 12th Euromicro Conference on Digital System Design, 2009

Instruction Precomputation for Fault Detection.
Proceedings of the 12th Euromicro Conference on Digital System Design, 2009

Limiting the number of dirty cache lines.
Proceedings of the Design, Automation and Test in Europe, 2009

Specialization of the Cell SPE for Media Applications.
Proceedings of the 20th IEEE International Conference on Application-Specific Systems, 2009

Scalar Processing Overhead on SIMD-Only Architectures.
Proceedings of the 20th IEEE International Conference on Application-Specific Systems, 2009

Performance Improvement of Multimedia Kernels by Alleviating Overhead Instructions on SIMD Devices.
Proceedings of the Advanced Parallel Processing Technologies, 8th International Symposium, 2009

2008
Implementing the 2-D Wavelet Transform on SIMD-Enhanced General-Purpose Processors.
IEEE Trans. Multim., 2008

Versatility of extended subwords and the matrix register file.
ACM Trans. Archit. Code Optim., 2008

GRAAL: A Framework for Low-Power 3D Graphics Accelerators.
IEEE Computer Graphics and Applications, 2008

Optimization of Content-Based Image Retrieval Functions.
Proceedings of the Tenth IEEE International Symposium on Multimedia (ISM2008), 2008

Analysis of video filtering on the cell processor.
Proceedings of the International Symposium on Circuits and Systems (ISCAS 2008), 2008

(When) Will CMPs Hit the Power Wall?
Proceedings of the Euro-Par 2008 Workshops, 2008

Analyzing Scalability of Deblocking Filter of H.264 via TLP Exploitation in a New Many-Core Architecture.
Proceedings of the 11th Euromicro Conference on Digital System Design: Architectures, 2008

A Low-Cost Cache Coherence Verification Method for Snooping Systems.
Proceedings of the 11th Euromicro Conference on Digital System Design: Architectures, 2008

Memory copies in multi-level memory systems.
Proceedings of the 19th IEEE International Conference on Application-Specific Systems, 2008

2007
Trade-Offs Between Voltage Scaling and Processor Shutdown for Low-Energy Embedded Multiprocessors.
Proceedings of the Embedded Computer Systems: Architectures, 2007

Instruction-Level Fault Tolerance Configurability.
Proceedings of the 2007 International Conference on Embedded Computer Systems: Architectures, 2007

Optimizing Cache Performance of the Discrete Wavelet Transform Using a Visualization Tool.
Proceedings of the Ninth IEEE International Symposium on Multimedia, 2007

SIMD Vectorization of Histogram Functions.
Proceedings of the IEEE International Conference on Application-Specific Systems, 2007

2006
Avoiding Conversion and Rearrangement Overhead in SIMD Architectures.
Int. J. Parallel Program., 2006

Accelerating Color Space Conversion Using Extended Subwords and the Matrix Register File.
Proceedings of the Eighth IEEE International Symposium on Multimedia (ISM 2006), 2006

Leakage-aware multiprocessor scheduling for low power.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

Improving the memory behavior of vertical filtering in the discrete wavelet transform.
Proceedings of the Third Conference on Computing Frontiers, 2006

Limitations of special-purpose instructions for similarity measurements in media SIMD extensions.
Proceedings of the 2006 International Conference on Compilers, 2006

2005
The CSI multimedia architecture.
IEEE Trans. Very Large Scale Integr. Syst., 2005

Avoiding data conversions in embedded media processors.
Proceedings of the 2005 ACM Symposium on Applied Computing (SAC), 2005

Implementing Hardware Multithreading in a VLIW Architecture.
Proceedings of the International Conference on Parallel and Distributed Computing Systems, 2005

Matrix register file and extended subwords: two techniques for embedded media processors.
Proceedings of the Second Conference on Computing Frontiers, 2005

Performance Comparison of SIMD Implementations of the Discrete Wavelet Transform.
Proceedings of the 16th IEEE International Conference on Application-Specific Systems, 2005

2004
Memory Bandwidth Requirements of Tile-Based Rendering.
Proceedings of the Computer Systems: Architectures, 2004

GraalBench: a 3D graphics benchmark suite for mobile phones.
Proceedings of the 2004 ACM SIGPLAN/SIGBED Conference on Languages, 2004

Efficient tile-aware bounding-box overlap test for tile-based rendering.
Proceedings of the 2004 International Symposium on System-on-Chip, 2004

Sparse Matrix Transpose Unit.
Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS 2004), 2004

Scene Management Models and Overlap Tests for Tile-Based Rendering.
Proceedings of the 2004 Euromicro Symposium on Digital Systems Design (DSD 2004), Architectures, Methods and Tools, 31 August, 2004

Reducing traffic generated by conflict misses in caches.
Proceedings of the First Conference on Computing Frontiers, 2004

Dynamic techniques to reduce memory traffic in embedded systems.
Proceedings of the First Conference on Computing Frontiers, 2004

Approximating the optimal replacement algorithm.
Proceedings of the First Conference on Computing Frontiers, 2004

Accelerating the secure remote password protocol using reconfigurable hardware.
Proceedings of the First Conference on Computing Frontiers, 2004

2003
The Paderborn University BSP (PUB) library.
Parallel Comput., 2003

Implementation of a streaming execution unit.
J. Syst. Archit., 2003

Optimal broadcast on parallel locality models.
J. Discrete Algorithms, 2003

Unified Dual Data Caches.
Proceedings of the 2003 Euromicro Symposium on Digital Systems Design (DSD 2003), 2003

2002
Architectural Support for 3D Graphics in the Complex Streamed Instruction Set.
Proceedings of the International Conference on Parallel and Distributed Computing Systems, 2002

Performance Scalability of Multimedia Instruction Set Extensions.
Proceedings of the Euro-Par 2002, 2002

2001
Performance of the Complex Streamed Instruction Set on Image Processing Kernels.
Proceedings of the Euro-Par 2001: Parallel Processing, 2001

Implementation and Evaluation of the Complex Streamed Instruction Set.
Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques (PACT 2001), 2001

2000
Complex Streamed Instructions: Introduction and Initial Evaluation.
Proceedings of the 26th EUROMICRO 2000 Conference, 2000

Counter Based Superscalar Instruction Issuing.
Proceedings of the 26th EUROMICRO 2000 Conference, 2000

1999
The Paderborn University BSP (PUB) Library - Design, Implementation and Performance.
Proceedings of the 13th International Parallel Processing Symposium / 10th Symposium on Parallel and Distributed Processing (IPPS / SPDP '99), 1999

1998
Gossiping on Meshes and Tori.
IEEE Trans. Parallel Distributed Syst., 1998

A Quantitative Comparison of Parallel Computation Models.
ACM Trans. Comput. Syst., 1998

Communication-Optimal Parallel Minimum Spanning Tree Algorithms (Extended Abstract).
Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, 1998

Experimental Validation of Parallel Computation Models on the Intel Paragon.
Proceedings of the 12th International Parallel Processing Symposium / 9th Symposium on Parallel and Distributed Processing (IPPS/SPDP '98), March 30, 1998

1996
Communication Primitives for BSP Computers.
Inf. Process. Lett., 1996

The E-BSP Model: Incorporating General Locality and Unbalanced Communication into the BSP Model.
Proceedings of the Euro-Par '96 Parallel Processing, 1996

Worm-Hole Gossiping on Meshes.
Proceedings of the Euro-Par '96 Parallel Processing, 1996

1994
The Parallel Hierarchical Memory Model.
Proceedings of the Algorithm Theory, 1994

1993
Experiences with a Model for Parallel Computation.
Proceedings of the Twelfth Annual ACM Symposium on Principles of Distributed Computing, 1993

