Juan Gómez-Luna

According to our database, Juan Gómez-Luna authored at least 44 papers between 2008 and 2019.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.



Processing data where it makes sense: Enabling in-memory computation.
Microprocessors and Microsystems - Embedded Hardware Design, 2019

Processing-in-memory: A workload-driven perspective.
IBM Journal of Research and Development, 2019

SMASH: Co-designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations.
CoRR, 2019

SneakySnake: A Fast and Accurate Universal Genome Pre-Alignment Filter for CPUs, GPUs, and FPGAs.
CoRR, 2019

A Workload and Programming Ease Driven Perspective of Processing-in-Memory.
CoRR, 2019

Enabling Practical Processing in and near Memory for Data-Intensive Computing.
CoRR, 2019

Dataplant: In-DRAM Security Mechanisms for Low-Cost Devices.
CoRR, 2019

Analysis and Modeling of Collaborative Execution Strategies for Heterogeneous CPU-FPGA Architectures.
Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering, 2019

NAPEL: Near-Memory Computing Application Performance Prediction via Ensemble Learning.
Proceedings of the 56th Annual Design Automation Conference 2019, 2019

Automatic Generation of Warp-Level Primitives and Atomic Instructions for Fast and Portable Parallel Reduction on GPUs.
Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2019

High-throughput Ant Colony Optimization on graphics processing units.
J. Parallel Distrib. Comput., 2018

High-Performance Computation of Bézier Surfaces on Parallel and Heterogeneous Platforms.
International Journal of Parallel Programming, 2018

Enabling Efficient RDMA-based Synchronous Mirroring of Persistent Memory Transactions.
CoRR, 2018

Improving tasks throughput on accelerators using OpenCL command concurrency.
CoRR, 2018

FLIN: Enabling Fairness and Enhancing Performance in Modern NVMe Solid State Drives.
Proceedings of the 45th ACM/IEEE Annual International Symposium on Computer Architecture, 2018

MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices.
Proceedings of the 16th USENIX Conference on File and Storage Technologies, 2018

A tasks reordering model to reduce transfers overhead on GPUs.
J. Parallel Distrib. Comput., 2017

Collaborative Computing for Heterogeneous Integrated Systems.
Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering, 2017

Chai: Collaborative heterogeneous applications for integrated-architectures.
Proceedings of the 2017 IEEE International Symposium on Performance Analysis of Systems and Software, 2017

Efficient OpenCL-based concurrent tasks offloading on accelerators.
Proceedings of the International Conference on Computational Science, 2017

In-Place Matrix Transposition on GPUs.
IEEE Trans. Parallel Distrib. Syst., 2016

Configurable XOR Hash Functions for Banked Scratchpad Memories in GPUs.
IEEE Trans. Computers, 2016

A programming system for future proofing performance critical libraries.
Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2016

KLAP: Kernel launch aggregation and promotion for optimizing dynamic parallelism.
Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016

Efficient kernel synthesis for performance portable programming.
Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016

Evaluating the effect of last-level cache sharing on integrated GPU-CPU systems with heterogeneous applications.
Proceedings of the 2016 IEEE International Symposium on Workload Characterization, 2016

Calculation of dense trajectory descriptors on a heterogeneous embedded architecture.
Journal of Systems Architecture - Embedded Systems Design, 2015

In-Place Data Sliding Algorithms for Many-Core Architectures.
Proceedings of the 44th International Conference on Parallel Processing, 2015

In-place transposition of rectangular matrices on accelerators.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014

Low-textured regions detection for improving stereoscopy algorithms.
Proceedings of the International Conference on High Performance Computing & Simulation, 2014

CUVLE: Variable-length encoding on CUDA.
Proceedings of the 2014 Conference on Design and Architectures for Signal and Image Processing, 2014

Performance Modeling of Atomic Additions on GPU Scratchpad Memory.
IEEE Trans. Parallel Distrib. Syst., 2013

An optimized approach to histogram computation on GPU.
Mach. Vis. Appl., 2013

A robust and low resource FPGA-based stereoscopic vision algorithm.
Proceedings of the 2012 International Conference on Reconfigurable Computing and FPGAs, 2013

Simulation and architecture improvements of atomic operations on GPU scratchpad memory.
Proceedings of the 2013 IEEE 31st International Conference on Computer Design, 2013

Performance models for asynchronous data transfers on consumer Graphics Processing Units.
J. Parallel Distrib. Comput., 2012

Load Balancing versus Occupancy Maximization on Graphics Processing Units: The Generalized Hough Transform as a Case Study.
IJHPCA, 2011

simARQ, An Automatic Repeat Request Simulator for Teaching Purposes.
Proceedings of the IT Revolutions, 2011

Egomotion compensation and moving objects detection algorithm on GPU.
Applications, Tools and Techniques on the Road to Exascale Computing, Proceedings of the conference ParCo 2011, 31 August, 2011

Parallelizing and Optimizing LIP-Canny Using NVIDIA CUDA.
Proceedings of the Trends in Applied Intelligent Systems, 2010

MESI Cache Coherence Simulator for Teaching Purposes.
CLEI Electron. J., 2009

FPGA Implementation of the Generalized Hough Transform.
Proceedings of the ReConFig'09: 2009 International Conference on Reconfigurable Computing and FPGAs, 2009

Parallelization of a Video Segmentation Algorithm on CUDA-Enabled Graphics Processing Units.
Proceedings of the Euro-Par 2009 Parallel Processing, 2009

Biprocessor SoC in an FPGA for Teaching Purposes.
Proceedings of the 8th IEEE International Conference on Advanced Learning Technologies, 2008