Marc Casas

ORCID: 0000-0003-4564-2093

According to our database, Marc Casas authored at least 109 papers between 2007 and 2024.

Bibliography

2024
A Two Level Neural Approach Combining Off-Chip Prediction with Adaptive Prefetch Filtering.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2024

2023
HPCG on long-vector architectures: Evaluation and optimization on NEC SX-Aurora and RISC-V.
Future Gener. Comput. Syst., June, 2023

Compressed Real Numbers for AI: a case-study using a RISC-V CPU.
CoRR, 2023

Open-Source GEMM Hardware Kernels Generator: Toward Numerically-Tailored Computations.
CoRR, 2023

Characterizing the impact of last-level cache replacement policies on big-data workloads.
CoRR, 2023

Optimization of SpGEMM with Risc-V vector instructions.
CoRR, 2023

Efficient Direct Convolution Using Long SIMD Instructions.
Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2023

Efficient Execution of SpGEMM on Long Vector Architectures.
Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, 2023

An Open-Source Framework for Efficient Numerically-Tailored Computations.
Proceedings of the 33rd International Conference on Field-Programmable Logic and Applications, 2023

2022
Compiler-Assisted Compaction/Restoration of SIMD Instructions.
IEEE Trans. Parallel Distributed Syst., 2022

A BF16 FMA is All You Need for DNN Training.
IEEE Trans. Emerg. Top. Comput., 2022

Optimization of the Sparse Multi-Threaded Cholesky Factorization for A64FX.
CoRR, 2022

TD-NUCA: Runtime Driven Management of NUCA Caches in Task Dataflow Programming Models.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

FASE: A Fast, Accurate and Seamless Emulator for Custom Numerical Formats.
Proceedings of the Machine Learning and Knowledge Discovery in Databases, 2022

Page Size Aware Cache Prefetching.
Proceedings of the 55th IEEE/ACM International Symposium on Microarchitecture, 2022

Task-based Acceleration of Bidirectional Recurrent Neural Networks on Multi-core Architectures.
Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium, 2022

Communication-aware Sparse Patterns for the Factorized Approximate Inverse Preconditioner.
Proceedings of the HPDC '22: The 31st International Symposium on High-Performance Parallel and Distributed Computing, Minneapolis, MN, USA, 27 June 2022, 2022

A Generator of Numerically-Tailored and High-Throughput Accelerators for Batched GEMMs.
Proceedings of the 30th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2022

A Selective Nesting Approach for the Sparse Multi-threaded Cholesky Factorization.
Proceedings of the 7th IEEE/ACM International Workshop on Extreme Scale Programming Models and Middleware, 2022

2021
Intelligent Adaptation of Hardware Knobs for Improving Performance and Power Consumption.
IEEE Trans. Computers, 2021

Efficiently running SpMV on long vector architectures.
Proceedings of the PPoPP '21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021


Morrigan: A Composite Instruction TLB Prefetcher.
Proceedings of the MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021

Exploiting Page Table Locality for Agile TLB Prefetching.
Proceedings of the 48th ACM/IEEE Annual International Symposium on Computer Architecture, 2021

Dynamically Adapting Floating-Point Precision to Accelerate Deep Neural Network Training.
Proceedings of the 20th IEEE International Conference on Machine Learning and Applications, 2021

Cache-aware Sparse Patterns for the Factorized Sparse Approximate Inverse Preconditioner.
Proceedings of the HPDC '21: The 30th International Symposium on High-Performance Parallel and Distributed Computing, 2021

PrioRAT: Criticality-Driven Prioritization Inside the On-Chip Memory Hierarchy.
Proceedings of the Euro-Par 2021: Parallel Processing, 2021

2020
Efficiency analysis of modern vector architectures: vector ALU sizes, core counts and clock frequencies.
J. Supercomput., 2020

Iteration-fusing conjugate gradient for sparse linear systems with MPI + OmpSs.
J. Supercomput., 2020

Using Arm's scalable vector extension on stencil codes.
J. Supercomput., 2020

Semi-automatic validation of cycle-accurate simulation infrastructures: The case for gem5-x86.
Future Gener. Comput. Syst., 2020

Generating Efficient DNN-Ensembles with Evolutionary Computation.
CoRR, 2020

Reducing Data Motion to Accelerate the Training of Deep Neural Networks.
CoRR, 2020

Runtime-guided ECC protection using online estimation of memory vulnerability.
Proceedings of the International Conference for High Performance Computing, 2020

Cost-aware prediction of uncorrected DRAM errors in the field.
Proceedings of the International Conference for High Performance Computing, 2020

Characterizing the impact of last-level cache replacement policies on big-data workloads.
Proceedings of the IEEE International Symposium on Workload Characterization, 2020

Wavefront parallelization of recurrent neural networks on multi-core architectures.
Proceedings of the ICS '20: 2020 International Conference on Supercomputing, 2020

RICH: implementing reductions in the cache hierarchy.
Proceedings of the ICS '20: 2020 International Conference on Supercomputing, 2020

Modeling and optimizing NUMA effects and prefetching with machine learning.
Proceedings of the ICS '20: 2020 International Conference on Supercomputing, 2020

Evaluating Mixed-Precision Arithmetic for 3D Generative Adversarial Networks to Simulate High Energy Physics Detectors.
Proceedings of the 19th IEEE International Conference on Machine Learning and Applications, 2020

Improving Predication Efficiency through Compaction/Restoration of SIMD Instructions.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2020

2019
Design trade-offs for emerging HPC processors based on mobile market technology.
J. Supercomput., 2019

Sampled Simulation of Task-Based Programs.
IEEE Trans. Computers, 2019

Special issue on the message passing interface.
Parallel Comput., 2019

On the maturity of parallel applications for asymmetric multi-core processors.
J. Parallel Distributed Comput., 2019

Resilient gossip-inspired all-reduce algorithms for high-performance computing: Potential, limitations, and open questions.
Int. J. High Perform. Comput. Appl., 2019

Optimizing computation-communication overlap in asynchronous task-based programs: poster.
Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2019

On the Benefits of Tasking with OpenMP.
Proceedings of the OpenMP: Conquering the Full Hardware Spectrum, 2019

Design Space Exploration of Next-Generation HPC Machines.
Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019

A Vulnerability Factor for ECC-protected Memory.
Proceedings of the 25th IEEE International Symposium on On-Line Testing and Robust System Design, 2019

Open-Source Shared Memory implementation of the HPCG benchmark: analysis, improvements and evaluation on Cavium ThunderX2.
Proceedings of the 17th International Conference on High Performance Computing & Simulation, 2019

Power efficient job scheduling by predicting the impact of processor manufacturing variability.
Proceedings of the ACM International Conference on Supercomputing, 2019

Optimizing computation-communication overlap in asynchronous task-based programs.
Proceedings of the ACM International Conference on Supercomputing, 2019

Convolutional Neural Network Training with Dynamic Epoch Ordering.
Proceedings of the Artificial Intelligence Research and Development, 2019

POSTER: An Optimized Predication Execution for SIMD Extensions.
Proceedings of the 28th International Conference on Parallel Architectures and Compilation Techniques, 2019

2018
Asynchronous and Exact Forward Recovery for Detected Errors in Iterative Solvers.
IEEE Trans. Parallel Distributed Syst., 2018

Reducing Cache Coherence Traffic with a NUMA-Aware Runtime Approach.
IEEE Trans. Parallel Distributed Syst., 2018

Performance and energy effects on task-based parallelized applications - User-directed versus manual vectorization.
J. Supercomput., 2018

Memory Vulnerability: A Case for Delaying Error Reporting.
CoRR, 2018

Low-Precision Floating-Point Schemes for Neural Network Training.
CoRR, 2018

TaskGenX: A Hardware-Software Proposal for Accelerating Task Parallelism.
Proceedings of the High Performance Computing - 33rd International Conference, 2018

Approximating a Multi-Grid Solver.
Proceedings of the 2018 IEEE/ACM Performance Modeling, 2018

Runtime-assisted cache coherence deactivation in task parallel programs.
Proceedings of the International Conference for High Performance Computing, 2018

Graph partitioning applied to DAG scheduling to reduce NUMA effects.
Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2018

Data Prefetching on In-order Processors.
Proceedings of the 2018 International Conference on High Performance Computing & Simulation, 2018

Reducing Data Movement on Large Shared Memory Systems by Exploiting Computation Dependencies.
Proceedings of the 32nd International Conference on Supercomputing, 2018

Runtime-Guided Management of Stacked DRAM Memories in Task Parallel Programs.
Proceedings of the 32nd International Conference on Supercomputing, 2018

Architectural Support for Task Dependence Management with Flexible Software Scheduling.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2018

Stencil codes on a vector length agnostic architecture.
Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, 2018

2017
Task Scheduling Techniques for Asymmetric Multi-Core Systems.
IEEE Trans. Parallel Distributed Syst., 2017

Prediction of the impact of network switch utilization on application performance via active measurement.
Parallel Comput., 2017

iQ: An Efficient and Flexible Queue-Based Simulation Framework.
Proceedings of the 25th IEEE International Symposium on Modeling, 2017

ATM: Approximate Task Memoization in the Runtime System.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

Iteration-fusing conjugate gradient.
Proceedings of the International Conference on Supercomputing, 2017

libPRISM: an intelligent adaptation of prefetch and SMT levels.
Proceedings of the International Conference on Supercomputing, 2017

Evaluating Scientific Workflow Execution on an Asymmetric Multicore Processor.
Proceedings of the Euro-Par 2017: Parallel Processing Workshops, 2017

Runtime-Assisted Shared Cache Insertion Policies Based on Re-reference Intervals.
Proceedings of the Euro-Par 2017: Parallel Processing - 23rd International Conference on Parallel and Distributed Computing, Santiago de Compostela, Spain, August 28, 2017

2016
Evaluation of HPC Applications' Memory Resource Consumption via Active Measurement.
IEEE Trans. Parallel Distributed Syst., 2016

PARSECSs: Evaluating the Impact of Task Parallelism in the PARSEC Benchmark Suite.
ACM Trans. Archit. Code Optim., 2016

MUSA: a multi-level simulation approach for next-generation HPC machines.
Proceedings of the International Conference for High Performance Computing, 2016

TaskPoint: Sampled simulation of task-based programs.
Proceedings of the 2016 IEEE International Symposium on Performance Analysis of Systems and Software, 2016

CATA: Criticality Aware Task Acceleration for Multicore Processors.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

Runtime-Guided Mitigation of Manufacturing Variability in Power-Constrained Multi-Socket NUMA Nodes.
Proceedings of the 2016 International Conference on Supercomputing, 2016

POSTER: Exploiting Asymmetric Multi-Core Processors with Flexible System Software.
Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, 2016

Reducing Cache Coherence Traffic with Hierarchical Directory Cache and NUMA-Aware Runtime Scheduling.
Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, 2016

2015
A framework for evaluating comprehensive fault resilience mechanisms in numerical programs.
J. Supercomput., 2015

Adaptive and application dependent runtime guided hardware prefetcher reconfiguration on the IBM POWER7.
CoRR, 2015

Exploiting asynchrony from exact forward recovery for DUE in iterative solvers.
Proceedings of the International Conference for High Performance Computing, 2015

Evaluating the Impact of OpenMP 4.0 Extensions on Relevant Parallel Workloads.
Proceedings of the OpenMP: Heterogenous Execution and Data Movements, 2015

Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures.
Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015


Runtime-Guided Management of Scratchpad Memories in Multicore Architectures.
Proceedings of the 2015 International Conference on Parallel Architectures and Compilation, 2015

2014
Runtime-Aware Architectures: A First Approach.
Supercomput. Front. Innov., 2014

Active Measurement of Memory Resource Consumption.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

Active Measurement of the Impact of Network Switch Utilization on Application Performance.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

Evaluating Execution Time Predictability of Task-Based Programs on Multi-Core Processors.
Proceedings of the Euro-Par 2014: Parallel Processing Workshops, 2014

2013
Performance Analysis Techniques for the Exascale Co-Design Process.
Proceedings of the Parallel Computing: Accelerating Computational Science and Engineering (CSE), 2013

2012
Poster: Autonomic Modeling of Data-Driven Application Behavior.
Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Abstract: Autonomic Modeling of Data-Driven Application Behavior.
Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Fault resilience of the algebraic multi-grid solver.
Proceedings of the International Conference on Supercomputing, 2012

2011
Simulating Whole Supercomputer Applications.
IEEE Micro, 2011

Extracting the optimal sampling frequency of applications using spectral analysis.
Concurr. Comput. Pract. Exp., 2011

Trace Spectral Analysis toward Dynamic Levels of Detail.
Proceedings of the 17th IEEE International Conference on Parallel and Distributed Systems, 2011

2010
Spectral analysis of executions of computer programs and its applications on performance analysis.
PhD thesis, 2010

Automatic Phase Detection and Structure Extraction of MPI Applications.
Int. J. High Perform. Comput. Appl., 2010

2008
Automatic analysis of speedup of MPI applications.
Proceedings of the 22nd Annual International Conference on Supercomputing, 2008

Prediction of behavior of MPI applications.
Proceedings of the 2008 IEEE International Conference on Cluster Computing, 29 September, 2008

2007
Automatic Phase Detection of MPI Applications.
Proceedings of the Parallel Computing: Architectures, 2007

Automatic Structure Extraction from MPI Applications Tracefiles.
Proceedings of the Euro-Par 2007, 2007

