Enrique S. Quintana-Ortí

Affiliations:
  • Jaume I University, Castellón de la Plana, Spain


According to our database, Enrique S. Quintana-Ortí authored at least 349 papers between 1993 and 2022.

Bibliography

2022
Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing.
ACM Trans. Math. Softw., 2022

High performance and energy efficient inference for deep learning on multicore ARM processors using general optimization techniques and BLIS.
J. Syst. Archit., 2022

Enabling Dynamic and Intelligent Workflows for HPC, Data Analytics, and AI Convergence.
CoRR, 2022

Towards Portable Realizations of Winograd-based Convolution with Vector Intrinsics and OpenMP.
Proceedings of the 30th Euromicro International Conference on Parallel, 2022

Anatomy of the BLIS Family of Algorithms for Matrix Multiplication.
Proceedings of the 30th Euromicro International Conference on Parallel, 2022

2021
Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software.
ACM Trans. Math. Softw., 2021

Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors.
J. Supercomput., 2021

Factorized solution of generalized stable Sylvester equations using many-core GPU accelerators.
J. Supercomput., 2021

On the performance of a GPU-based SoC in a distributed spatial audio system.
J. Supercomput., 2021

DMRlib: Easy-Coding and Efficient Resource Management for Job Malleability.
IEEE Trans. Computers, 2021

Machine learning for optimal selection of sparse triangular system solvers on GPUs.
J. Parallel Distributed Comput., 2021

Selecting optimal SpMV realizations for GPUs via machine learning.
Int. J. High Perform. Comput. Appl., 2021

Introduction to the Special Issue related to the Power-Aware Computing Workshop 2019 - PACO 2019.
Int. J. High Perform. Comput. Appl., 2021

Efficient update of determinants for many-electron wave function overlaps.
Comput. Phys. Commun., 2021

High performance and energy efficient inference for deep learning on ARM processors.
CoRR, 2021

Accelerating distributed deep neural network training with pipelined MPI allreduce.
Clust. Comput., 2021

A New Generation of Task-Parallel Algorithms for Matrix Inversion in Many-Threaded CPUs.
Proceedings of the PMAM@PPoPP 2021: Proceedings of the Twelfth International Workshop on Programming Models and Applications for Multicores and Manycores, 2021

High Performance and Energy Efficient Integer Matrix Multiplication for Deep Learning.
Proceedings of the 29th Euromicro International Conference on Parallel, 2021

Evaluation of MPI Allreduce for Distributed Training of Convolutional Neural Networks.
Proceedings of the 29th Euromicro International Conference on Parallel, 2021

Performance Modeling for Distributed Training of Convolutional Neural Networks.
Proceedings of the 29th Euromicro International Conference on Parallel, 2021

Scalable Hybrid Loop- and Task-Parallel Matrix Inversion for Multicore Processors.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops, 2021

2020
Acceleration of PageRank with Customized Precision Based on Mantissa Segmentation.
ACM Trans. Parallel Comput., 2020

Tall-and-skinny QR factorization with approximate Householder reflectors on graphics processors.
J. Supercomput., 2020

Integration and exploitation of intra-routine malleability in BLIS.
J. Supercomput., 2020

Performance modeling of the sparse matrix-vector product via convolutional neural networks.
J. Supercomput., 2020

Analysis of Threading Libraries for High Performance Computing.
IEEE Trans. Computers, 2020

Reproducibility strategies for parallel Preconditioned Conjugate Gradient.
J. Comput. Appl. Math., 2020

Reproducibility of parallel preconditioned conjugate gradient in hybrid programming environments.
Int. J. High Perform. Comput. Appl., 2020

Resiliency in Numerical Algorithm Design for Extreme Scale Simulations.
CoRR, 2020

Compressed Basis GMRES on High Performance GPUs.
CoRR, 2020

Reproducibility of Parallel Preconditioned Conjugate Gradient in Hybrid Programming Environments.
CoRR, 2020

High Performance and Portable Convolution Operators for ARM-based Multicore Processors.
CoRR, 2020

Programming parallel dense matrix factorizations with look-ahead and OpenMP.
Clust. Comput., 2020

High Performance and Portable Convolution Operators for Multicore Processors.
Proceedings of the 32nd IEEE International Symposium on Computer Architecture and High Performance Computing, 2020

Tiled Algorithms for Efficient Task-Parallel ℌ-Matrix Solvers.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops, 2020

Multiprecision Block-Jacobi for Iterative Triangular Solves.
Proceedings of the Euro-Par 2020: Parallel Processing, 2020

Balanced and Compressed Coordinate Layout for the Sparse Matrix-Vector Product on GPUs.
Proceedings of the Euro-Par 2020: Parallel Processing Workshops, 2020

2019
FloatX: A C++ Library for Customized Floating-Point Arithmetic.
ACM Trans. Math. Softw., 2019

Noise estimation for hyperspectral subspace identification on FPGAs.
J. Supercomput., 2019

Accelerating the SRP-PHAT algorithm on multi- and many-core platforms using OpenCL.
J. Supercomput., 2019

Fast block QR update in digital signal processing.
J. Supercomput., 2019

An efficient GPU version of the preconditioned GMRES method.
J. Supercomput., 2019

Dynamic look-ahead in the reduction to band form for the singular value decomposition.
Parallel Comput., 2019

Variable-size batched Gauss-Jordan elimination for block-Jacobi preconditioning on graphics processors.
Parallel Comput., 2019

Accelerating the task/data-parallel version of ILUPACK's BiCG in multi-CPU/GPU configurations.
Parallel Comput., 2019

Look-ahead in the two-sided reduction to compact band forms for symmetric eigenvalue problems and the SVD.
Numer. Algorithms, 2019

Erratum to "Exploiting nested task-parallelism in the H-LU factorization" [J. Comput. Sci. 33 (2019) 20-33].
J. Comput. Sci., 2019

Exploiting nested task-parallelism in the H-LU factorization.
J. Comput. Sci., 2019

Fine-grained bit-flip protection for relaxation methods.
J. Comput. Sci., 2019

Hierarchical approach for deriving a reproducible unblocked LU factorization.
Int. J. High Perform. Comput. Appl., 2019

Toward a modular precision ecosystem for high-performance computing.
Int. J. High Perform. Comput. Appl., 2019

Power-aware computing.
Concurr. Comput. Pract. Exp., 2019

Adaptive precision in block-Jacobi preconditioning for iterative sparse linear system solvers.
Concurr. Comput. Pract. Exp., 2019

A Case for Malleable Thread-Level Linear Algebra Libraries: The LU Factorization With Partial Pivoting.
IEEE Access, 2019

Automatic Selection of Sparse Triangular Linear System Solvers on GPUs through Machine Learning Techniques.
Proceedings of the 31st International Symposium on Computer Architecture and High Performance Computing, 2019

Analysis of model parallelism for distributed neural networks.
Proceedings of the 26th European MPI Users' Group Meeting, 2019

Structure-Aware Calculation of Many-Electron Wave Function Overlaps on Multicore Processors.
Proceedings of the Parallel Processing and Applied Mathematics, 2019

Towards Continuous Benchmarking: An Automated Performance Evaluation Framework for High Performance Software.
Proceedings of the Platform for Advanced Scientific Computing Conference, 2019

Cholesky and Gram-Schmidt Orthogonalization for Tall-and-Skinny QR Factorizations on Graphics Processors.
Proceedings of the Euro-Par 2019: Parallel Processing, 2019

Theoretical Scalability Analysis of Distributed Deep Convolutional Neural Networks.
Proceedings of the 19th IEEE/ACM International Symposium on Cluster, 2019

2018
Exploring the interoperability of remote GPGPU virtualization using rCUDA and directive-based programming models.
J. Supercomput., 2018

Optimized Fundamental Signal Processing Operations For Energy Minimization on Heterogeneous Mobile Devices.
IEEE Trans. Circuits Syst. I Regul. Pap., 2018

DMR API: Improving cluster productivity by turning applications into malleable.
Parallel Comput., 2018

Static scheduling of the LU factorization with look-ahead on asymmetric multicore processors.
Parallel Comput., 2018

Energy balance between voltage-frequency scaling and resilience for linear algebra routines on low-power multicore architectures.
Parallel Comput., 2018

Parallel programming for resilience and energy efficiency.
Parallel Comput., 2018

Two-sided orthogonal reductions to condensed forms on asymmetric multicore processors.
Parallel Comput., 2018

Multi-threaded dense linear algebra libraries for low-power asymmetric multicore processors.
J. Comput. Sci., 2018

FaST-LMM for Two-Way Epistasis Tests on High-Performance Clusters.
J. Comput. Biol., 2018

A framework for genomic sequencing on clusters of multicore and manycore processors.
Int. J. High Perform. Comput. Appl., 2018

Residual Replacement in Mixed-Precision Iterative Refinement for Sparse Linear Systems.
Proceedings of the High Performance Computing, 2018

High-Performance GPU Implementation of PageRank with Reduced Precision Based on Mantissa Segmentation.
Proceedings of the 8th IEEE/ACM Workshop on Irregular Applications: Architectures and Algorithms, 2018

Reduction to Band Form for the Singular Value Decomposition on Graphics Accelerators.
Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores, 2018

Extending ILUPACK with a Task-Parallel Version of BiCG for Dual-GPU Servers.
Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores, 2018

Fast Blocking of Householder Reflectors on Graphics Processors.
Proceedings of the 26th Euromicro International Conference on Parallel, 2018

Extending ILUPACK with a GPU Version of the BiCGStab Method.
Proceedings of the XLIV Latin American Computer Conference, 2018

2017
Modeling power consumption of 3D MPDATA and the CG method on ARM and Intel multicore architectures.
J. Supercomput., 2017

Time and energy modeling of a high-performance multi-threaded Cholesky factorization.
J. Supercomput., 2017

Solving Weighted Least Squares (WLS) problems on ARM-based architectures.
J. Supercomput., 2017

Accelerating multi-channel filtering of audio signal on ARM processors.
J. Supercomput., 2017

Adapting concurrency throttling and voltage-frequency scaling for dense eigensolvers.
J. Supercomput., 2017

GPU-Based Dynamic Wave Field Synthesis Using Fractional Delay Filters and Room Compensation.
IEEE ACM Trans. Audio Speech Lang. Process., 2017

Revisiting conventional task schedulers to exploit asymmetry in multi-core architectures for dense linear algebra operations.
Parallel Comput., 2017

Architecture-aware optimization of an HEVC decoder on asymmetric multicore processors.
J. Real Time Image Process., 2017

Two-Sided Reduction to Compact Band Forms with Look-Ahead.
CoRR, 2017

Extending the Gauss-Huard method for the solution of Lyapunov matrix equations and matrix inversion.
Concurr. Comput. Pract. Exp., 2017

Communication in task-parallel ILU-preconditioned CG solvers using MPI + OmpSs.
Concurr. Comput. Pract. Exp., 2017

Flexible batched sparse matrix-vector product on GPUs.
Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2017

Overcoming Memory-Capacity Constraints in the Use of ILUPACK on Graphics Processors.
Proceedings of the 29th International Symposium on Computer Architecture and High Performance Computing, 2017

Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioner Generation on GPUs.
Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores, 2017

Reduction to Tridiagonal Form for Symmetric Eigenproblems on Asymmetric Multicore Processors.
Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores, 2017

Towards Reproducible Blocked LU Factorization.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

Static Versus Dynamic Task Scheduling of the LU Factorization on ARM big.LITTLE Architectures.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

Task-Parallel LU Factorization of Hierarchical Matrices Using OmpSs.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

Efficient Scalable Computing through Flexible Applications and Adaptive Workloads.
Proceedings of the 46th International Conference on Parallel Processing Workshops, 2017

GLTO: On the Adequacy of Lightweight Thread Approaches for OpenMP Implementations.
Proceedings of the 46th International Conference on Parallel Processing, 2017

Variable-Size Batched LU for Small Matrices and Its Integration into Block-Jacobi Preconditioning.
Proceedings of the 46th International Conference on Parallel Processing, 2017

Solving Sparse Differential Riccati Equations on Hybrid CPU-GPU Platforms.
Proceedings of the Computational Science and Its Applications - ICCSA 2017, 2017

Solution of Few-Body Coulomb Problems with Latent Matrices on Multicore Processors.
Proceedings of the International Conference on Computational Science, 2017

On the Use of a GPU-Accelerated Mobile Device Processor for Sound Source Localization.
Proceedings of the International Conference on Computational Science, 2017

Variable-Size Batched Gauss-Huard for Block-Jacobi Preconditioning.
Proceedings of the International Conference on Computational Science, 2017

Accelerating FaST-LMM for Epistasis Tests.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2017

Balanced CSR Sparse Matrix-Vector Product on Graphics Processors.
Proceedings of the Euro-Par 2017: Parallel Processing - 23rd International Conference on Parallel and Distributed Computing, Santiago de Compostela, Spain, August 28, 2017

GLT: A Unified API for Lightweight Thread Libraries.
Proceedings of the Euro-Par 2017: Parallel Processing - 23rd International Conference on Parallel and Distributed Computing, Santiago de Compostela, Spain, August 28, 2017

Evaluating the NVIDIA Tegra Processor as a Low-Power Alternative for Sparse GPU Computations.
Proceedings of the High Performance Computing - 4th Latin American Conference, 2017

Tiles- and WPP-based HEVC Decoding on Asymmetric Multi-core Processors.
Proceedings of the Third IEEE International Conference on Multimedia Big Data, 2017

2016
Analytical Modeling Is Enough for High-Performance BLIS.
ACM Trans. Math. Softw., 2016

Exploiting task and data parallelism in ILUPACK's preconditioned CG solver on NUMA architectures and many-core accelerators.
Parallel Comput., 2016

A fast band-Krylov eigensolver for macromolecular functional motion simulation on multicore architectures and graphics processors.
J. Comput. Phys., 2016

Characterizing the efficiency of multicore and manycore processors for the solution of sparse linear systems.
Comput. Sci. Res. Dev., 2016

Evaluating fault tolerance on asymmetric multicore systems-on-chip using iso-metrics.
IET Comput. Digit. Tech., 2016

Architecture-aware configuration and scheduling of matrix multiplication on asymmetric multicore processors.
Clust. Comput., 2016

Balancing Energy and Performance in Dense Linear System Solvers for Hybrid ARM+GPU platforms.
CLEI Electron. J., 2016

Refactoring Conventional Task Schedulers to Exploit Asymmetric ARM big.LITTLE Architectures in Dense Linear Algebra.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

The Impact of Panel Factorization on the Gauss-Huard Algorithm for the Solution of Linear Systems on Modern Architectures.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2016

Tuning the Blocksize for Dense Linear Algebra Factorization Routines with the Roofline Model.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2016

The Impact of Voltage-Frequency Scaling for the Matrix-Vector Product on the IBM POWER8.
Proceedings of the Euro-Par 2016: Parallel Processing, 2016

A Data-Parallel ILUPACK for Sparse General and Symmetric Indefinite Linear Systems.
Proceedings of the Euro-Par 2016: Parallel Processing Workshops, 2016

Exploiting Task-Parallelism in Message-Passing Sparse Linear System Solvers Using OmpSs.
Proceedings of the Euro-Par 2016: Parallel Processing, 2016

A Review of Lightweight Thread Approaches for High Performance Computing.
Proceedings of the 2016 IEEE International Conference on Cluster Computing, 2016

Enabling GPU Virtualization in Cloud Environments.
Proceedings of the CLOSER 2016, 2016

Design of a Task-Parallel Version of ILUPACK for Graphics Processors.
Proceedings of the High Performance Computing - Third Latin American Conference, 2016

2015
Exploring the performance-power-energy balance of low-power multicore and manycore architectures for anomaly detection in remote sensing.
J. Supercomput., 2015

Extending Lyapack for the solution of band Lyapunov equations on hybrid CPU-GPU platforms.
J. Supercomput., 2015

Concurrent and Accurate Short Read Mapping on Multicore Processors.
IEEE ACM Trans. Comput. Biol. Bioinform., 2015

Systematic derivation of time and power models for linear algebra kernels on multicore architectures.
Sustain. Comput. Informatics Syst., 2015

Time and energy modeling of high-performance Level-3 BLAS on x86 architectures.
Simul. Model. Pract. Theory, 2015

Fast and Reliable Noise Estimation for Hyperspectral Subspace Identification.
IEEE Geosci. Remote. Sens. Lett., 2015

Reducing the cost of power monitoring with DC wattmeters.
Comput. Sci. Res. Dev., 2015

Are our dense linear algebra libraries energy-friendly?
Comput. Sci. Res. Dev., 2015

Revisiting Conventional Task Schedulers to Exploit Asymmetry in ARM big.LITTLE Architectures for Dense Linear Algebra.
CoRR, 2015

Evaluating Asymmetric Multicore Systems-on-Chip using Iso-Metrics.
CoRR, 2015

Performance and Energy Optimization of Matrix Multiplication on Asymmetric big.LITTLE Processors.
CoRR, 2015

Multi-Threaded Dense Linear Algebra Libraries for Low-Power Asymmetric Multicore Processors.
CoRR, 2015

Improving the user experience of the rCUDA remote GPU virtualization framework.
Concurr. Comput. Pract. Exp., 2015

Parallel computing on graphics processing units and heterogeneous platforms.
Concurr. Comput. Pract. Exp., 2015

Out-of-core macromolecular simulations on multithreaded architectures.
Concurr. Comput. Pract. Exp., 2015

Unveiling the performance-energy trade-off in iterative linear system solvers for multithreaded processors.
Concurr. Comput. Pract. Exp., 2015

Unleashing GPU acceleration for symmetric band linear algebra kernels and model reduction.
Clust. Comput., 2015

Balancing task- and data-level parallelism to improve performance and energy consumption of matrix computations on the Intel Xeon Phi.
Comput. Electr. Eng., 2015

Time and energy modeling of an INTRA-ONLY HEVC encoder.
Proceedings of the 2015 Visual Communications and Image Processing, 2015

Scalable RNA Sequencing on Clusters of Multicore Processors.
Proceedings of the 2015 IEEE TrustCom/BigDataSE/ISPA, 2015

Exploiting Task-Parallelism on GPU Clusters via OmpSs and rCUDA Virtualization.
Proceedings of the 2015 IEEE TrustCom/BigDataSE/ISPA, 2015

Tuning stationary iterative solvers for fault resilience.
Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2015

Adaptive precision solvers for sparse linear systems.
Proceedings of the 3rd International Workshop on Energy Efficient Supercomputing, 2015

Revisiting the Gauss-Huard Algorithm for the Solution of Linear Systems on Graphics Accelerators.
Proceedings of the Parallel Processing and Applied Mathematics, 2015

A Parallel Multi-threaded Solver for Symmetric Positive Definite Bordered-Band Linear Systems.
Proceedings of the Parallel Processing and Applied Mathematics, 2015

Exploring the Offload Execution Model in the Intel Xeon Phi via Matrix Inversion.
Proceedings of the Parallel Computing: On the Road to Exascale, 2015

Harnessing CUDA Dynamic Parallelism for the Solution of Sparse Linear Systems.
Proceedings of the Parallel Computing: On the Road to Exascale, 2015

Performance and Fault Tolerance of Preconditioned Iterative Solvers on Low-Power ARM Architectures.
Proceedings of the Parallel Computing: On the Road to Exascale, 2015

Evaluating the Potential of Low Power Systems for Headphone-based Spatial Audio Applications.
Proceedings of the International Conference on Computational Science, 2015

Real-time Sound Source Localization on an Embedded GPU Using a Spherical Microphone Array.
Proceedings of the International Conference on Computational Science, 2015

Vectorization of binaural sound virtualization on the ARM Cortex-A15 architecture.
Proceedings of the 23rd European Signal Processing Conference, 2015

Systematic Fusion of CUDA Kernels for Iterative Sparse Linear System Solvers.
Proceedings of the Euro-Par 2015: Parallel Processing, 2015

Exploring the Suitability of Remote GPGPU Virtualization for the OpenACC Programming Model Using rCUDA.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

Solving dense linear systems with hybrid ARM+GPU platforms.
Proceedings of the 2015 Latin American Computing Conference, 2015

Solving Linear Systems on the Intel Xeon-Phi Accelerator via the Gauss-Huard Algorithm.
Proceedings of the High Performance Computing - Second Latin American Conference, 2015

2014
Assessing Power Monitoring Approaches for Energy and Power Analysis of Computers.
Sustain. Comput. Informatics Syst., 2014

Assessing the Performance-Energy Balance of Graphics Processors for Spectral Unmixing.
IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens., 2014

Efficient Implementation of Hyperspectral Anomaly Detection Techniques on GPUs and Multicore Processors.
IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens., 2014

Hyperspectral Unmixing on Multicore DSPs: Trading Off Performance for Energy.
IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens., 2014

Improved Accuracy and Parallelism for MRRR-Based Eigensolvers - A Mixed Precision Approach.
SIAM J. Sci. Comput., 2014

A complete and efficient CUDA-sharing solution for HPC clusters.
Parallel Comput., 2014

Leveraging task-parallelism in message-passing dense matrix factorizations using SMPSs.
Parallel Comput., 2014

iMODS: internal coordinates normal mode analysis server.
Nucleic Acids Res., 2014

A factored variant of the Newton iteration for the solution of algebraic Riccati equations via the matrix sign function.
Numer. Algorithms, 2014

Automatic detection of power bottlenecks in parallel scientific applications.
Comput. Sci. Res. Dev., 2014

Modeling power and energy of the task-parallel Cholesky factorization on multicore processors.
Comput. Sci. Res. Dev., 2014

Modeling power and energy consumption of dense matrix factorizations on multicore processors.
Concurr. Comput. Pract. Exp., 2014

Enhancing performance and energy consumption of runtime schedulers for dense linear algebra.
Concurr. Comput. Pract. Exp., 2014

Assessing the impact of the CPU power-saving modes on the task-parallel solution of sparse linear systems.
Clust. Comput., 2014

Trading Off Performance for Energy in Linear Algebra Operations with Applications in Control Theory.
CLEI Electron. J., 2014

Adaptive Downtime for Live Migration of Virtual Machines.
Proceedings of the 7th IEEE/ACM International Conference on Utility and Cloud Computing, 2014

SLURM Support for Remote GPU Virtualization: Implementation and Performance Study.
Proceedings of the 26th IEEE International Symposium on Computer Architecture and High Performance Computing, 2014

Leveraging Task-Parallelism with OmpSs in ILUPACK's Preconditioned CG Method.
Proceedings of the 26th IEEE International Symposium on Computer Architecture and High Performance Computing, 2014

Leveraging Data-Parallelism in ILUPACK using Graphics Processors.
Proceedings of the IEEE 13th International Symposium on Parallel and Distributed Computing, 2014

Analyzing the Energy Efficiency of the Memory Subsystem in Multicore Processors.
Proceedings of the IEEE International Symposium on Parallel and Distributed Processing with Applications, 2014

Performance and Energy-Aware Characterization of the Sparse Matrix-Vector Multiplication on Multithreaded Architectures.
Proceedings of the 43rd International Conference on Parallel Processing Workshops, 2014

Accelerating Band Linear Algebra Operations on GPUs with Application in Model Reduction.
Proceedings of the Computational Science and Its Applications - ICCSA 2014 - 14th International Conference, Guimarães, Portugal, June 30, 2014

Parallel performance and energy efficiency of modern video encoders on multithreaded architectures.
Proceedings of the 22nd European Signal Processing Conference, 2014

Boosting the performance of remote GPU virtualization using InfiniBand connect-IB and PCIe 3.0.
Proceedings of the 2014 IEEE International Conference on Cluster Computing, 2014

Evaluating the Impact of Virtualization on Performance and Power Dissipation.
Proceedings of the CLOSER 2014, 2014

Accelerating the general band matrix multiplication using graphics processors.
Proceedings of the XL Latin American Computing Conference, 2014

Efficient Symmetric Band Matrix-Matrix Multiplication on GPUs.
Proceedings of the High Performance Computing - First HPCLATAM, 2014

2013
Accelerating the Lyapack library using GPUs.
J. Supercomput., 2013

Exploring large macromolecular functional motions on clusters of multicore processors.
J. Comput. Phys., 2013

Deriving dense linear algebra libraries.
Formal Aspects Comput., 2013

Performance versus energy consumption of hyperspectral unmixing algorithms on multi-core platforms.
EURASIP J. Adv. Signal Process., 2013

Concurrent and Accurate RNA Sequencing on Multicore Platforms.
CoRR, 2013

Matrix inversion on CPU-GPU platforms with applications in control theory.
Concurr. Comput. Pract. Exp., 2013

Graphics processing unit computing and exploitation of hardware accelerators.
Concurr. Comput. Pract. Exp., 2013

Energy-efficient execution of dense linear algebra algorithms on multi-core processors.
Clust. Comput., 2013

Solving Matrix Equations on Multi-Core and Many-Core Architectures.
Algorithms, 2013

A dynamic pipeline for RNA sequencing on multicore processors.
Proceedings of the 20th European MPI Users' Group Meeting, 2013

Out-of-Core Solution of Eigenproblems for Macromolecular Simulations.
Proceedings of the Parallel Processing and Applied Mathematics, 2013

Performance and Energy Analysis of the Iterative Solution of Sparse Linear Systems on Multicore and Manycore Architectures.
Proceedings of the Parallel Processing and Applied Mathematics, 2013

Exploiting Data- and Task-Parallelism in the Solution of Riccati Equations on Multicore Servers and GPUs.
Proceedings of the Parallel Computing: Accelerating Computational Science and Engineering (CSE), 2013

Reformulated Conjugate Gradient for the Energy-Aware Solution of Linear Systems on GPUs.
Proceedings of the 42nd International Conference on Parallel Processing, 2013

On the Impact of Optimization on the Time-Power-Energy Balance of Dense Linear Algebra Factorizations.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2013

Solving Some Mysteries in Power Monitoring of Servers: Take Care of Your Wattmeters!
Proceedings of the Energy Efficiency in Large Scale Distributed Systems, 2013

Runtime Scheduling of the LU Factorization: Performance and Energy.
Proceedings of the Energy Efficiency in Large Scale Distributed Systems, 2013

Influence of InfiniBand FDR on the performance of remote GPU virtualization.
Proceedings of the 2013 IEEE International Conference on Cluster Computing, 2013

2012
A Runtime System for Programming Out-of-Core Matrix Algorithms-by-Tiles on Multithreaded Architectures.
ACM Trans. Math. Softw., 2012

A simulator to assess energy saving strategies and policies in HPC workloads.
ACM SIGOPS Oper. Syst. Rev., 2012

Parallel Computation of 3-D Soil-Structure Interaction in Time Domain with a Coupled FEM/SBFEM Approach.
J. Sci. Comput., 2012

The FLAME approach: From dense linear algebra algorithms to high-performance multi-accelerator implementations.
J. Parallel Distributed Comput., 2012

Optimization of power consumption in the iterative solution of sparse linear systems on graphics processors.
Comput. Sci. Res. Dev., 2012

DVFS-control techniques for dense linear algebra operations on multi-core processors.
Comput. Sci. Res. Dev., 2012

Solving dense generalized eigenproblems on multi-threaded architectures.
Appl. Math. Comput., 2012

Applying OOC Techniques in the Reduction to Condensed Form for Very Large Symmetric Eigenproblems on GPUs.
Proceedings of the 20th Euromicro International Conference on Parallel, 2012

Analysis of Strategies to Save Energy for Message-Passing Dense Linear Algebra Kernels.
Proceedings of the 20th Euromicro International Conference on Parallel, 2012

Saving Energy in the LU Factorization with Partial Pivoting on Multi-core Processors.
Proceedings of the 20th Euromicro International Conference on Parallel, 2012

High Performance Implementations of the BST Method on Hybrid CPU-GPU Platforms.
Proceedings of the 10th IEEE International Symposium on Parallel and Distributed Processing with Applications, 2012

Binding Performance and Power of Dense Linear Algebra Operations.
Proceedings of the 10th IEEE International Symposium on Parallel and Distributed Processing with Applications, 2012

Reducing Energy Consumption of Dense Linear Algebra Operations on Hybrid CPU-GPU Platforms.
Proceedings of the 10th IEEE International Symposium on Parallel and Distributed Processing with Applications, 2012

Leveraging Task-Parallelism in Energy-Efficient ILU Preconditioners.
Proceedings of the ICT as Key Technology against Global Warming, 2012

Tools for Power-Energy Modelling and Analysis of Parallel Scientific Applications.
Proceedings of the 41st International Conference on Parallel Processing, 2012

CU2rCU: Towards the complete rCUDA remote GPU virtualization and sharing solution.
Proceedings of the 19th International Conference on High Performance Computing, 2012

Unleashing CPU-GPU Acceleration for Control Theory Applications.
Proceedings of the Euro-Par 2012: Parallel Processing Workshops, 2012

2011
ILUPACK.
Proceedings of the Encyclopedia of Parallel Computing, 2011

High performance computing tools in science and engineering.
J. Supercomput., 2011

High performance computing tools in science and engineering II.
J. Supercomput., 2011

Using desktop computers to solve large-scale dense linear algebra problems.
J. Supercomput., 2011

Using graphics processors to accelerate the computation of the matrix inverse.
J. Supercomput., 2011

A mixed-precision algorithm for the solution of Lyapunov equations on hybrid CPU-GPU platforms.
Parallel Comput., 2011

Exploiting thread-level parallelism in the iterative solution of sparse linear systems.
Parallel Comput., 2011

Real-Time Endmember Extraction on Multicore Processors.
IEEE Geosci. Remote. Sens. Lett., 2011

A parallel solver for huge dense linear systems.
Comput. Phys. Commun., 2011

Large-scale linear system solver using secondary storage: Self-energy in hybrid nanostructures.
Comput. Phys. Commun., 2011

Special Issue: GPU computing.
Concurr. Comput. Pract. Exp., 2011

Condensed forms for the symmetric eigenvalue problem on multi-threaded architectures.
Concurr. Comput. Pract. Exp., 2011

Increasing data locality and introducing Level-3 BLAS in the Neville elimination.
Appl. Math. Comput., 2011

Accelerating BST Methods for Model Reduction with Graphics Processors.
Proceedings of the Parallel Processing and Applied Mathematics, 2011

High Performance Matrix Inversion on a Multi-core Platform with Several GPUs.
Proceedings of the 19th International Euromicro Conference on Parallel, 2011

Symmetric Rank-k Update on Clusters of Multicore Processors with SMPSs.
Proceedings of the Applications, Tools and Techniques on the Road to Exascale Computing, Proceedings of the conference ParCo 2011, 31 August, 2011

Power-aware Dense Linear Algebra Implementations on Multi-core and Many-core Processors.
Proceedings of the 3rd Many-core Applications Research Community (MARC) Symposium, 2011

Evaluation of the Energy Performance of Dense Linear Algebra Kernels on Multi-core and Many-Core Processors.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

Power Consumption of Mixed Precision in the Iterative Solution of Sparse Linear Systems.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

High performance matrix inversion of SPD matrices on graphics processors.
Proceedings of the 2011 International Conference on High Performance Computing & Simulation, 2011

Improving power efficiency of dense linear algebra algorithms on multi-core processors via slack control.
Proceedings of the 2011 International Conference on High Performance Computing & Simulation, 2011

Performance of CUDA Virtualized Remote GPUs in High Performance Clusters.
Proceedings of the International Conference on Parallel Processing, 2011

Efficient Model Order Reduction of Large-Scale Systems on Multi-core Platforms.
Proceedings of the Computational Science and Its Applications - ICCSA 2011, 2011

Enabling CUDA acceleration within virtual machines using rCUDA.
Proceedings of the 18th International Conference on High Performance Computing, 2011

Analysis and optimization of power consumption in the iterative solution of sparse linear systems on multi-core and many-core platforms.
Proceedings of the 2011 International Green Computing Conference and Workshops, 2011

2010
Accelerating Model Reduction of Large Linear Systems with Graphics Processors.
Proceedings of the Applied Parallel and Scientific Computing, 2010

Parallelization of Multilevel ILU Preconditioners on Distributed-Memory Multiprocessors.
Proceedings of the Applied Parallel and Scientific Computing, 2010

Message from the PDSEC-10 workshop chairs.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Retargeting PLAPACK to clusters with hardware accelerators.
Proceedings of the 2010 International Conference on High Performance Computing & Simulation, 2010

rCUDA: Reducing the number of GPU-based accelerators in high performance clusters.
Proceedings of the 2010 International Conference on High Performance Computing & Simulation, 2010

Parallel Numerical Algorithms.
Proceedings of the Euro-Par 2010 - Parallel Processing, 16th International Euro-Par Conference, Ischia, Italy, August 31, 2010

EnergySaving Cluster Roll: Power Saving System for Clusters.
Proceedings of the Architecture of Computing Systems, 2010

2009
Programming matrix algorithms-by-blocks for thread-level parallelism.
ACM Trans. Math. Softw., 2009

Toward the parallelization of GSL.
J. Supercomput., 2009

Out-of-core solution of linear systems on graphics processors.
Int. J. Parallel Emergent Distributed Syst., 2009

Parallel solution of large-scale algebraic Bernoulli equations with the matrix sign function method.
Int. J. Comput. Sci. Eng., 2009

The libflame Library for Dense Matrix Computations.
Comput. Sci. Eng., 2009

Exploiting the capabilities of modern GPUs for dense matrix computations.
Concurr. Comput. Pract. Exp., 2009

Parallelizing dense and banded linear algebra libraries using SMPSs.
Concurr. Comput. Pract. Exp., 2009

Solving dense linear systems on platforms with multiple hardware accelerators.
Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009

Reduction to Condensed Forms for Symmetric Eigenvalue Problems on Multi-core Architectures.
Proceedings of the Parallel Processing and Applied Mathematics, 2009

Evaluation of Parallel Sparse Matrix Partitioning Software for Parallel Multilevel ILU Preconditioning on Shared-Memory Multiprocessors.
Proceedings of the Parallel Computing: From Multicores and GPU's to Petascale, 2009

A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures.
Proceedings of the Evolving OpenMP in an Age of Extreme Parallelism, 2009

Using Graphics Processors to Accelerate the Solution of Out-of-Core Linear Systems.
Proceedings of the Eighth International Symposium on Parallel and Distributed Computing, 2009

Fast development of dense linear algebra codes on graphics processors.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

Solving "large" dense matrix problems on multi-core processors.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

Out-of-Core Computation of the QR Factorization on Multi-core Processors.
Proceedings of the Euro-Par 2009 Parallel Processing, 2009

An Efficient Implementation of GPU Virtualization in High Performance Clusters.
Proceedings of the Euro-Par 2009, 2009

Using Hybrid CPU-GPU Platforms to Accelerate the Computation of the Matrix Sign Function.
Proceedings of the Euro-Par 2009, 2009

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs.
Proceedings of the Euro-Par 2009 Parallel Processing, 2009

2008
Updating an LU Factorization with Pivoting.
ACM Trans. Math. Softw., 2008

Solving linear-quadratic optimal control problems on parallel computers.
Optim. Methods Softw., 2008

An Algorithm-by-Blocks for SuperMatrix Band Cholesky Factorization.
Proceedings of the High Performance Computing for Computational Science, 2008

Attaining High Performance in General-Purpose Computations on Current Graphics Processors.
Proceedings of the High Performance Computing for Computational Science, 2008

Design, Tuning and Evaluation of Parallel Multilevel ILU Preconditioners.
Proceedings of the High Performance Computing for Computational Science, 2008

SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks.
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008

Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures.
Proceedings of the 16th Euromicro International Conference on Parallel, 2008

Design of scalable dense linear algebra libraries for multithreaded architectures: the LU factorization.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

Evaluation and tuning of the Level 3 CUBLAS for graphics processors.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

Solving Dense Linear Systems on Graphics Processors.
Proceedings of the Euro-Par 2008, 2008

2007
Efficient algorithms for generalized algebraic Bernoulli equations based on the matrix sign function.
Numer. Algorithms, 2007

Stabilizing large-scale generalized systems on parallel computers using multithreading and message-passing.
Concurr. Comput. Pract. Exp., 2007

SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures.
Proceedings of the SPAA 2007: Proceedings of the 19th Annual ACM Symposium on Parallelism in Algorithms and Architectures, 2007

Parallelizing Dense Linear Algebra Operations with Task Queues in.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 14th European PVM/MPI User's Group Meeting, Paris, France, September 30, 2007

Parallel Solution of Band Linear Systems in Model Reduction.
Proceedings of the Parallel Processing and Applied Mathematics, 2007

The Implementation of BLAS for Band Matrices.
Proceedings of the Parallel Processing and Applied Mathematics, 2007

Strategies for Parallelizing the Solution of Rational Matrix Equations.
Proceedings of the Parallel Computing: Architectures, 2007

Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors.
Proceedings of the Parallel Computing: Architectures, 2007

Parallel Implementation of LQG Balanced Truncation for Large-Scale Systems.
Proceedings of the Large-Scale Scientific Computing, 6th International Conference, 2007

Satisfying your dependencies with SuperMatrix.
Proceedings of the 2007 IEEE International Conference on Cluster Computing, 2007

2006
Accumulating Householder transformations, revisited.
ACM Trans. Math. Softw., 2006

Solving Stable Sylvester Equations via Rational Iterative Schemes.
J. Sci. Comput., 2006

Parallelization of GSL: The Web Service Interface.
Proceedings of the 14th Euromicro International Conference on Parallel, 2006

Cholesky Factorization of Band Matrices Using Multithreaded BLAS.
Proceedings of the Applied Parallel Computing. State of the Art in Scientific Computing, 2006

Specialized Spectral Division Algorithms for Generalized Eigenproblems Via the Inverse-Free Iteration.
Proceedings of the Applied Parallel Computing. State of the Art in Scientific Computing, 2006

An Open Source Web Service Based Platform for Heterogeneous Clusters.
Proceedings of the Parallel and Distributed Processing and Applications, 2006

Parallel LU Factorization of Band Matrices on SMP Systems.
Proceedings of the High Performance Computing and Communications, 2006

Parallel Solution of Large-Scale and Sparse Generalized Algebraic Riccati Equations.
Proceedings of the Euro-Par 2006, Parallel Processing, 12th International Euro-Par Conference, Dresden, Germany, August 28, 2006

2005
Representing linear algebra algorithms in code: the FLAME application program interfaces.
ACM Trans. Math. Softw., 2005

The science of deriving dense linear algebra algorithms.
ACM Trans. Math. Softw., 2005

Partial stabilisation of large-scale discrete-time linear control systems.
Int. J. Comput. Sci. Eng., 2005

Implementing OpenMP for Clusters on Top of MPI.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2005

Parallelization of GSL on Clusters of Symmetric Multiprocessors.
Proceedings of the Parallel Computing: Current & Future Issues of High-End Computing, 2005

Parallel Order Reduction via Balanced Truncation for Optimal Cooling of Steel Profiles.
Proceedings of the Euro-Par 2005, Parallel Processing, 11th International Euro-Par Conference, Lisbon, Portugal, August 30, 2005

Parallel Algorithms for Balanced Truncation of Large-Scale Unstable Systems.
Proceedings of the 44th IEEE Conference on Decision and Control and 8th European Control Conference, 2005

2004
Spectral division methods for block generalized Schur decompositions.
Math. Comput., 2004

Parallel Model Reduction of Large Linear Descriptor Systems via Balanced Truncation.
Proceedings of the High Performance Computing for Computational Science, 2004

Parallelization of GSL: Architecture, Interfaces, and Programming Models.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2004

Computing Passive Reduced-Order Models for Circuit Simulation.
Proceedings of the 2004 International Conference on Parallel Computing in Electrical Engineering (PARELEC 2004), 2004

Rapid Development of High-Performance Out-of-Core Solvers.
Proceedings of the Applied Parallel Computing, 2004

Rapid Development of High-Performance Linear Algebra Libraries.
Proceedings of the Applied Parallel Computing, 2004

Parallel Algorithms for Balanced Truncation Model Reduction of Sparse Systems.
Proceedings of the Applied Parallel Computing, 2004

Parallelization of the GNU Scientific Library on Heterogeneous Systems.
Proceedings of the 3rd International Symposium on Parallel and Distributed Computing (ISPDC 2004), 2004

Computing optimal Hankel norm approximations of large-scale systems.
Proceedings of the 43rd IEEE Conference on Decision and Control, 2004

2003
Formal derivation of algorithms: The triangular Sylvester equation.
ACM Trans. Math. Softw., 2003

State-space truncation methods for parallel model reduction of large-scale systems.
Parallel Comput., 2003

Parallel algorithms for model reduction of discrete-time systems.
Int. J. Syst. Sci., 2003

Parallel Model Reduction of Large-Scale Unstable Systems.
Proceedings of the Parallel Computing: Software Technology, 2003

Remote Model Reduction of Very Large Linear Systems.
Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), 2003

2002
The Generalized Newton Iteration for the Matrix Sign Function.
SIAM J. Sci. Comput., 2002

Numerical solution of discrete stable linear matrix equations on multicomputers.
Parallel Algorithms Appl., 2002

Parallel Algorithms for LQ Optimal Control of Discrete-Time Periodic Linear Systems.
J. Parallel Distributed Comput., 2002

Remote Parallel Model Reduction of Linear Time-Invariant Systems Made Easy.
Proceedings of the High Performance Computing for Computational Science, 2002

Enhanced Services for Remote Model Reduction of Large-Scale Dense Linear Systems.
Proceedings of the Applied Parallel Computing Advanced Scientific Computing, 2002

Solving Large Sparse Lyapunov Equations on Parallel Computers (Research Note).
Proceedings of the Euro-Par 2002, 2002

2001
Efficient Algorithms for the Block Hessenberg Form.
J. Supercomput., 2001

A Note On Parallel Matrix Inversion.
SIAM J. Sci. Comput., 2001

Specialized Parallel Algorithms for Solving Lyapunov and Stein Equations.
J. Parallel Distributed Comput., 2001

Parallel solvers for discrete-time algebraic Riccati equations.
Concurr. Comput. Pract. Exp., 2001

Partial Stabilization of Large-Scale Discrete-Time Linear Control Systems.
Proceedings of the 30th International Workshops on Parallel Processing (ICPP 2001 Workshops), 2001

Fault-Tolerant High-Performance Matrix Multiplication: Theory and Practice.
Proceedings of the 2001 International Conference on Dependable Systems and Networks (DSN 2001) (formerly: FTCS), 2001

2000
Parallel Partial Stabilizing Algorithms for Large Linear Control Systems.
J. Supercomput., 2000

Solving algebraic Riccati equations on parallel computers using Newton's method with exact line search.
Parallel Comput., 2000

Parallel Spectral Division Using the Matrix Sign Function for the Generalized Eigenproblem.
Int. J. High Speed Comput., 2000

Parallel Pole Assignment of Single-Input Systems.
Proceedings of the Vector and Parallel Processing, 2000

Solving Discrete-Time Periodic Riccati Equations on a Cluster (Research Note).
Proceedings of the Euro-Par 2000, Parallel Processing, 6th International Euro-Par Conference, Munich, Germany, August 29, 2000

1999
Parallel Distributed Solvers for Large Stable Generalized Lyapunov Equations.
Parallel Process. Lett., 1999

Solving stable generalized Lyapunov equations with the matrix sign function.
Numer. Algorithms, 1999

Fast Parallel Kernels for Selected Problems in Control Theory.
Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing, 1999

Parallel Cyclic Wavefront Algorithms for Solving Semidefinite Lyapunov Equations.
Proceedings of the Euro-Par '99 Parallel Processing, 5th International Euro-Par Conference, Toulouse, France, August 31, 1999

Solving Stable Stein Equations on Distributed Memory Computers.
Proceedings of the Euro-Par '99 Parallel Processing, 5th International Euro-Par Conference, Toulouse, France, August 31, 1999

1998
Efficient Solution Of The Rank-Deficient Linear Least Squares Problem.
SIAM J. Sci. Comput., 1998

Parallel solution of Riccati matrix equations with the matrix sign function.
Autom., 1998

A Portable Subroutine Library for Solving Linear Control Problems on Distributed Memory Computers.
Proceedings of the Workshop on Wide Area Networks and High Performance Computing, 1998

1996
Solving Discrete-Time Lyapunov Equations for the Cholesky Factor on a Shared Memory Multiprocessor.
Parallel Process. Lett., 1996

Stabilizing Large Control Linear Systems on Multicomputers.
Proceedings of the Vector and Parallel Processing, 1996

1995
A Parallel Triangular Sylvester Equation Solver Based on the Hessenberg-Schur Method.
Parallel Algorithms Appl., 1995

An Efficient Parallel Sylvester Equation Solver Based on the Hessenberg-Schur Method.
Parallel Algorithms Appl., 1995

1993
A tool-kit for the design and simulation of systolic algorithms.
Proceedings of the 1993 Euromicro Workshop on Parallel and Distributed Processing, 1993

