Gerhard Wellein

Orcid: 0000-0001-7371-3026

Affiliations:
  • University of Erlangen-Nuremberg, Germany


According to our database, Gerhard Wellein authored at least 129 papers between 2002 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2024
Alya towards Exascale: Optimal OpenACC Performance of the Navier-Stokes Finite Element Assembly on GPUs.
CoRR, 2024

2023
MD-Bench: A performance-focused prototyping harness for state-of-the-art short-range molecular dynamics algorithms.
Future Gener. Comput. Syst., December, 2023

Making applications faster by asynchronous execution: Slowing down processes or relaxing MPI collectives.
Future Gener. Comput. Syst., November, 2023

Analytical performance estimation during code generation on modern GPUs.
J. Parallel Distributed Comput., March, 2023

Level-Based Blocking for Sparse Matrices: Sparse Matrix-Power-Vector Multiplication.
IEEE Trans. Parallel Distributed Syst., February, 2023

The Role of Idle Waves, Desynchronization, and Bottleneck Evasion in the Performance of Parallel Programs.
IEEE Trans. Parallel Distributed Syst., February, 2023

CloverLeaf on Intel Multi-Core CPUs: A Case Study in Write-Allocate Evasion.
CoRR, 2023

Algebraic Temporal Blocking for Sparse Iterative Solvers on Multi-Core CPUs.
CoRR, 2023

MD-Bench: Engineering the in-core performance of short-range molecular dynamics kernels from state-of-the-art simulation packages.
CoRR, 2023

SPEChpc 2021 Benchmarks on Ice Lake and Sapphire Rapids Infiniband Clusters: A Performance and Energy Case Study.
Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023

Physical Oscillator Model for Supercomputing.
Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023

2022
Multiway p-spectral graph cuts on Grassmann manifolds.
Mach. Learn., 2022

Execution-Cache-Memory modeling and performance tuning of sparse matrix-vector multiplication and Lattice quantum chromodynamics on A64FX.
Concurr. Comput. Pract. Exp., 2022

Analytic performance model for parallel overlapping memory-bound kernels.
Concurr. Comput. Pract. Exp., 2022

MD-Bench: A Generic Proxy-App Toolbox for State-of-the-Art Molecular Dynamics Algorithms.
Proceedings of the Parallel Processing and Applied Mathematics, 2022

Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications.
Proceedings of the Parallel Processing and Applied Mathematics, 2022

Addressing White-box Modeling and Simulation Challenges in Parallel Computing.
Proceedings of the SIGSIM-PADS '22: SIGSIM Conference on Principles of Advanced Discrete Simulation, Atlanta, GA, USA, June 8, 2022

2021
Energy efficiency of nonlinear domain decomposition methods.
Int. J. High Perform. Comput. Appl., 2021

Performance engineering for real and complex tall & skinny matrix multiplication kernels on GPUs.
Int. J. High Perform. Comput. Appl., 2021

ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX.
CoRR, 2021

Analytic Modeling of Idle Waves in Parallel Programs: Communication, Cluster Topology, and Noise Impact.
Proceedings of the High Performance Computing - 36th International Conference, 2021

Opening the Black Box: Performance Estimation during Code Generation for GPUs.
Proceedings of the 33rd IEEE International Symposium on Computer Architecture and High Performance Computing, 2021

YaskSite: Stencil Optimization Techniques Applied to Explicit ODE Methods on Modern Architectures.
Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2021

2020
EXASTEEL: Towards a Virtual Laboratory for the Multiscale Simulation of Dual-Phase Steel Using High-Performance Computing.
Proceedings of the Software for Exascale Computing - SPPEXA 2016-2019, 2020


A Recursive Algebraic Coloring Technique for Hardware-efficient Symmetric Sparse Matrix-vector Multiplication.
ACM Trans. Parallel Comput., 2020

PHIST: A Pipelined, Hybrid-Parallel Iterative Solver Toolkit.
ACM Trans. Math. Softw., 2020

Bridging the Architecture Gap: Abstracting Performance-Relevant Properties of Modern Server Processors.
Supercomput. Front. Innov., 2020

Analytic performance modeling and analysis of detailed neuron simulations.
Int. J. High Perform. Comput. Appl., 2020

An analytic performance model for overlapping execution of memory-bound loop kernels on multicore CPUs.
CoRR, 2020

K-way p-spectral clustering on Grassmann manifolds.
CoRR, 2020

Understanding HPC Benchmark Performance on Intel Broadwell and Cascade Lake Processors.
Proceedings of the High Performance Computing - 35th International Conference, 2020

Desynchronization and Wave Pattern Formation in MPI-Parallel and Hybrid Memory-Bound Programs.
Proceedings of the High Performance Computing - 35th International Conference, 2020

Performance Modeling of Streaming Kernels and Sparse Matrix-Vector Multiplication on A64FX.
Proceedings of the 2020 IEEE/ACM Performance Modeling, 2020

2019
CRAFT: A Library for Easier Application-Level Checkpoint/Restart and Automatic Fault Tolerance.
IEEE Trans. Parallel Distributed Syst., 2019

Collecting and Presenting Reproducible Intranode Stencil Performance: INSPECT.
Supercomput. Front. Innov., 2019

Delay Propagation and Overlapping Mechanisms on Clusters: A Case Study of Idle Periods based on Workload, Communication, and Delay Granularity.
CoRR, 2019

Performance Engineering for a Tall & Skinny Matrix Multiplication Kernel on GPUs.
CoRR, 2019

Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels.
Proceedings of the 2019 IEEE/ACM Performance Modeling, 2019

Code generation for massively parallel phase-field simulations.
Proceedings of the International Conference for High Performance Computing, 2019

Performance Engineering for a Tall & Skinny Matrix Multiplication Kernels on GPUs.
Proceedings of the Parallel Processing and Applied Mathematics, 2019

ClusterCockpit - A web application for job-specific performance monitoring.
Proceedings of the 2019 IEEE International Conference on Cluster Computing, 2019

Propagation and Decay of Injected One-Off Delays on Clusters: A Case Study.
Proceedings of the 2019 IEEE International Conference on Cluster Computing, 2019

2018
Performance Engineering.
Inform. Spektrum, 2018

Building and utilizing fault tolerance support tools for the GASPI applications.
Int. J. High Perform. Comput. Appl., 2018

Optimization and performance evaluation of the IDR iterative Krylov solver on GPUs.
Int. J. High Perform. Comput. Appl., 2018

Chebyshev Filter Diagonalization on Modern Manycore Processors and GPGPUs.
CoRR, 2018

Chebyshev Filter Diagonalization on Modern Manycore Processors and GPGPUs.
Proceedings of the High Performance Computing - 33rd International Conference, 2018

Automated Instruction Stream Throughput Prediction for Intel and AMD Microarchitectures.
Proceedings of the 2018 IEEE/ACM Performance Modeling, 2018

Multicore Performance Engineering of Sparse Triangular Solves Using a Modified Roofline Model.
Proceedings of the 30th International Symposium on Computer Architecture and High Performance Computing, 2018

2017
Preconditioned Krylov solvers on GPUs.
Parallel Comput., 2017

GHOST: Building Blocks for High Performance Sparse Linear Algebra on Heterogeneous Systems.
Int. J. Parallel Program., 2017

Lattice Boltzmann Benchmark Kernels as a Testbed for Performance Analysis.
CoRR, 2017

Validation of hardware events for successful performance pattern identification in High Performance Computing.
CoRR, 2017

Kerncraft: A Tool for Analytic Performance Modeling of Loop Kernels.
CoRR, 2017

Performance analysis of the Kahan-enhanced scalar product on current multi-core and many-core processors.
Concurr. Comput. Pract. Exp., 2017

An Analysis of Core- and Chip-Level Architectural Features in Four Generations of Intel Server Processors.
Proceedings of the High Performance Computing - 32nd International Conference, 2017

LIKWID Monitoring Stack: A Flexible Framework Enabling Job Specific Performance monitoring for the masses.
Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

2016
Towards an Exascale Enabled Sparse Solver Repository.
Proceedings of the Software for Exascale Computing - SPPEXA 2013-2015, 2016

Performance Engineering and Energy Efficiency of Building Blocks for Large, Sparse Eigenvalue Computations on Heterogeneous Supercomputers.
Proceedings of the Software for Exascale Computing - SPPEXA 2013-2015, 2016

Hybrid Parallel Multigrid Methods for Geodynamical Simulations.
Proceedings of the Software for Exascale Computing - SPPEXA 2013-2015, 2016

High-performance implementation of Chebyshev filter diagonalization for interior eigenvalue computations.
J. Comput. Phys., 2016

Performance analysis of the Kahan-enhanced scalar product on current multi- and manycore processors.
CoRR, 2016

Chip-level and multi-node analysis of energy-optimized lattice Boltzmann CFD simulations.
Concurr. Comput. Pract. Exp., 2016

Exploring performance and power properties of modern multi-core chips via simple machine models.
Concurr. Comput. Pract. Exp., 2016

Performance and power for highly parallel systems.
Concurr. Comput. Pract. Exp., 2016

Efficiency of General Krylov Methods on GPUs - An Experimental Study.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

Analysis of Intel's Haswell Microarchitecture Using the ECM Model and Microbenchmarks.
Proceedings of the Architecture of Computing Systems - ARCS 2016, 2016

2015
Increasing the Performance of the Jacobi-Davidson Method by Blocking.
SIAM J. Sci. Comput., 2015

Multicore-Optimized Wavefront Diamond Blocking for Optimizing Stencil Updates.
SIAM J. Sci. Comput., 2015

Short Note on Costs of Floating Point Operations on current x86-64 Architectures: Denormals, Overflow, Underflow, and Division by Zero.
CoRR, 2015

Performance analysis of the Kahan-enhanced scalar product on current multicore processors.
CoRR, 2015

Automatic loop kernel analysis and performance modeling with Kerncraft.
Proceedings of the 6th International Workshop on Performance Modeling, 2015

Performance Analysis of the Kahan-Enhanced Scalar Product on Current Multicore Processors.
Proceedings of the Parallel Processing and Applied Mathematics, 2015


Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

Quantifying Performance Bottlenecks of Stencil Computations Using the Execution-Cache-Memory Model.
Proceedings of the 29th ACM on International Conference on Supercomputing, 2015

Building a Fault Tolerant Application Using the GASPI Communication Layer.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

2014
A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units.
SIAM J. Sci. Comput., 2014

Modeling and analyzing performance for highly optimized propagation steps of the lattice Boltzmann method on sparse lattices.
CoRR, 2014

Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems.
CoRR, 2014

Comparing the performance of different x86 SIMD instruction sets for a medical imaging application on modern multi- and manycore chips.
Proceedings of the 2014 Workshop on Programming models for SIMD/Vector processing, 2014

Overhead Analysis of Performance Counter Measurements.
Proceedings of the 43rd International Conference on Parallel Processing Workshops, 2014

ESSEX: Equipping Sparse Solvers for Exascale.
Proceedings of the Euro-Par 2014: Parallel Processing Workshops, 2014

Performance Engineering for a Medical Imaging Application on the Intel Xeon Phi Accelerator.
Proceedings of the ARCS 2014, 2014

2013
A Survey of Checkpoint/Restart Techniques on Distributed Memory Systems.
Parallel Process. Lett., 2013

Pushing the limits for medical image reconstruction on recent standard multicore processors.
Int. J. High Perform. Comput. Appl., 2013

An analysis of energy-optimized lattice-Boltzmann CFD simulations from the chip to the highly parallel level.
CoRR, 2013

Optimization of FASTEST-3D for Modern Multicore Systems.
CoRR, 2013

Asynchronous MPI for the Masses.
CoRR, 2013

A unified sparse matrix data format for modern processors with wide SIMD units.
CoRR, 2013

Comparison of different propagation steps for lattice Boltzmann methods.
Comput. Math. Appl., 2013

An Evaluation of Different I/O Techniques for Checkpoint/Restart.
Proceedings of the 2013 IEEE International Symposium on Parallel & Distributed Processing, 2013

2012
Exploring performance and power properties of modern multicore chips via simple machine models.
CoRR, 2012

Best practices for HPM-assisted performance engineering on modern multicore processors.
CoRR, 2012

Asynchronous Checkpointing by Dedicated Checkpoint Threads.
Proceedings of the Recent Advances in the Message Passing Interface, 2012

Sparse Matrix-vector Multiplication on GPGPU Clusters: A New Storage Format and a Scalable Implementation.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012

Performance Patterns and Hardware Metrics on Modern Multicore Processors: Best Practices for Performance Engineering.
Proceedings of the Euro-Par 2012: Parallel Processing Workshops, 2012

2011
Hybrid-Parallel Sparse Matrix-Vector Multiplication with Explicit Communication Overlap on Current Multicore-Based Systems.
Parallel Process. Lett., 2011

A flexible Patch-based lattice Boltzmann parallelization approach for heterogeneous GPU-CPU clusters.
Parallel Comput., 2011

Efficient multicore-aware parallelization strategies for iterative stencil computations.
J. Comput. Sci., 2011

Simulation software for supercomputers.
J. Comput. Sci., 2011

Performance engineering for the Lattice Boltzmann method on GPGPUs: Architectural requirements and performance results.
CoRR, 2011

Domain decomposition and locality optimization for large-scale lattice Boltzmann simulations.
CoRR, 2011

Comparison of different Propagation Steps for the Lattice Boltzmann Method.
CoRR, 2011

Performance analysis and optimization strategies for a D3Q19 lattice Boltzmann kernel on nVIDIA GPUs using CUDA.
Adv. Eng. Softw., 2011

Poster: LIKWID: lightweight performance tools.
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2011

likwid-bench: An Extensible Microbenchmarking Platform for x86 Multicore Compute Nodes.
Proceedings of the Tools for High Performance Computing 2011, 2011

Parallel Sparse Matrix-Vector Multiplication as a Test Case for Hybrid MPI+OpenMP Programming.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

Introduction to High Performance Computing for Scientists and Engineers.
Chapman and Hall / CRC computational science series, CRC Press, ISBN: 978-1-439-81192-4, 2011

2010
Leveraging Shared Caches for Parallel Temporal Blocking of Stencil Codes on Multicore Processors and Clusters.
Parallel Process. Lett., 2010

Multicore-aware parallel temporal blocking of stencil codes for shared and distributed memory.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

LIKWID: A Lightweight Performance-Oriented Tool Suite for x86 Multicore Environments.
Proceedings of the 39th International Conference on Parallel Processing, 2010

LIKWID: Lightweight Performance Tools.
Proceedings of the Competence in High Performance Computing 2010, 2010

2009
Benchmark Analysis and Application Results for Lattice Boltzmann Simulations on NEC SX Vector and Intel Nehalem Systems.
Parallel Process. Lett., 2009

Multi-core architectures: Complexities of performance prediction and the impact of cache topology.
CoRR, 2009

The world's fastest CPU and SMP node: Some performance results from the NEC SX-9.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

Efficient Temporal Blocking for Stencil Computations by Multicore-Aware Wavefront Parallelization.
Proceedings of the 33rd Annual IEEE International Computer Software and Applications Conference, 2009

2008
Data Access Characteristics and Optimizations for Sun UltraSPARC T2 and T2+ Systems.
Parallel Process. Lett., 2008

Performance comparison of different parallel lattice Boltzmann implementations on multi-core multi-socket systems.
Int. J. Comput. Sci. Eng., 2008

Data access optimizations for highly threaded multi-core CPUs with multiple memory controllers.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

Vector Computers in a World of Commodity Clusters, Massively Parallel Systems and Many-Core Many-Threaded CPUs: Recent Experience Based on an Advanced Lattice Boltzmann Flow Solver.
Proceedings of the High Performance Computing in Science and Engineering '08, 2008

2007
Hierarchical hybrid grids: achieving TERAFLOP performance on large scale finite element simulations.
Int. J. Parallel Emergent Distributed Syst., 2007

RZBENCH: Performance evaluation of current HPC architectures using low-level and application benchmarks.
CoRR, 2007

2004
Performance Evaluation of Parallel Large-Scale Lattice Boltzmann Applications on Three Supercomputing Architectures.
Proceedings of the ACM/IEEE SC2004 Conference on High Performance Networking and Computing, 2004

2003
Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures.
Int. J. High Perform. Comput. Appl., 2003

Comparison of Parallel Programming Models on Clusters of SMP Nodes.
Proceedings of the Modeling, 2003

Exact Numerical Treatment of Finite Quantum Systems Using Leading-Edge Supercomputers.
Proceedings of the Modeling, 2003

2002
Fast Sparse Matrix-Vector Multiplication for TeraFlop/s Computers.
Proceedings of the High Performance Computing for Computational Science, 2002

