Alexander Heinecke

ORCID: 0009-0007-0947-5394

According to our database, Alexander Heinecke authored at least 79 papers between 2007 and 2026.

Collaborative distances:

Dijkstra number of three.
Erdős number of three.

Bibliography

2026

Xe-Forge: Multi-Stage LLM-Powered Kernel Optimization for Intel GPU.

[DOI]

Marcin Spoczynski

,

Daniel Fleischer

,

Moshe Berchansky

,

Gabriela Ben-Melech Stan

,

,

,

Adam Siemieniuk

,

Alexander Heinecke

CoRR, May, 2026

Space Filling Curves is All You Need: Communication-Avoiding Matrix Multiplication Made Simple.

[DOI]

Evangelos Georganas

,

Alexander Heinecke

,

CoRR, January, 2026

Tensor Algebra Processing Primitives (TAPP): Towards a Standard for Tensor Operations.

[DOI]

,

Niklas Hörnblad

,

Edward F. Valeev

,

Alexander Heinecke

,

Jeff R. Hammond

,

,

Paolo Bientinesi

CoRR, January, 2026

2025

Library Liberation: Competitive Performance Matmul Through Compiler-composed Nanokernels.

[DOI]

Arun Thangamani

,

Md Asghar Ahmad Shahid

,

Adam Siemieniuk

,

,

,

Alexander Heinecke

CoRR, November, 2025

Pushing the Envelope of LLM Inference on AI-PC.

[DOI]

Evangelos Georganas

,

Dhiraj D. Kalamkar

,

Alexander Heinecke

CoRR, August, 2025

ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts.

[DOI]

Evangelos Georganas

,

Dhiraj D. Kalamkar

,

Alexander Kozlov

,

Alexander Heinecke

CoRR, March, 2025

Einsum Trees: An Abstraction for Optimizing the Execution of Tensor Expressions.

[DOI]

Alexander Breuer

,

,

,

,

Alexander Heinecke

,

,

Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2025

2024

Towards a high-performance AI compiler with upstream MLIR.

[DOI]

,

Lorenzo Chelini

,

Adam Siemieniuk

,

,

Niranjan Hasabnis

,

,

Evangelos Georganas

,

Alexander Heinecke

CoRR, 2024

Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures.

[DOI]

Evangelos Georganas

,

Dhiraj D. Kalamkar

,

,

,

,

,

Alexander Breuer

,

Alexander Heinecke

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

2023

Microscaling Data Formats for Deep Learning.

[DOI]

CoRR, 2023

Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures.

[DOI]

Evangelos Georganas

,

Dhiraj D. Kalamkar

,

,

,

,

Alexander Breuer

,

Alexander Heinecke

CoRR, 2023

2022

Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning and HPC Workloads.

[DOI]

Evangelos Georganas

,

Dhiraj D. Kalamkar

,

Sasikanth Avancha

,

Menachem Adelman

,

Deepti Aggarwal

,

Cristina Anderson

,

Alexander Breuer

,

Jeremy Bruestle

,

Narendra Chaudhary

,

,

,

,

,

,

Ramanarayan Mohanty

,

,

,

,

Alexander Heinecke

Frontiers Appl. Math. Stat., 2022

FP8 Formats for Deep Learning.

[DOI]

Paulius Micikevicius

,

,

,

,

,

Richard Grisenthwaite

,

,

Alexander Heinecke

,

,

,

Naveen Mellempudi

,

Stuart F. Oberman

,

Mohammad Shoeybi

,

,

CoRR, 2022

FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems.

[DOI]

,

Evangelos Georganas

,

Alexander Heinecke

,

,

Eriko Nurvitadhi

CoRR, 2022

FPGA-Based AI Smart NICs for Scalable Distributed AI Training Systems.

[DOI]

,

Evangelos Georganas

,

Alexander Heinecke

,

,

,

Eriko Nurvitadhi

IEEE Comput. Archit. Lett., 2022

Accelerating Deep Learning based Identification of Chromatin Accessibility from noisy ATAC-seq Data.

[DOI]

Narendra Chaudhary

,

,

Dhiraj D. Kalamkar

,

Alexander Heinecke

,

Evangelos Georganas

,

,

Menachem Adelman

,

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2022

Next-Generation Local Time Stepping for the ADER-DG Finite Element Method.

[DOI]

Alexander Breuer

,

Alexander Heinecke

Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium, 2022

2021

PolyDL: Polyhedral Optimizations for Creation of High-performance DL Primitives.

[DOI]

Sanket Tavarageri

,

Alexander Heinecke

,

Sasikanth Avancha

,

,

Gagandeep Goyal

,

Ramakrishna Upadrasta

ACM Trans. Archit. Code Optim., 2021

Efficient and Generic 1D Dilated Convolution Layer for Deep Learning.

[DOI]

Narendra Chaudhary

,

,

Dhiraj D. Kalamkar

,

Alexander Heinecke

,

Evangelos Georganas

,

,

Menachem Adelman

,

CoRR, 2021

Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning Workloads.

[DOI]

Evangelos Georganas

,

Dhiraj D. Kalamkar

,

Sasikanth Avancha

,

Menachem Adelman

,

Cristina Anderson

,

Alexander Breuer

,

Narendra Chaudhary

,

,

,

,

Ramanarayan Mohanty

,

,

,

Alexander Heinecke

CoRR, 2021

DistGNN: scalable distributed training for large-scale graph neural networks.

[DOI]

,

,

,

Ramanarayan Mohanty

,

Evangelos Georganas

,

Alexander Heinecke

,

Dhiraj D. Kalamkar

,

Nesreen K. Ahmed

,

Sasikanth Avancha

Proceedings of the International Conference for High Performance Computing, 2021

Tensor processing primitives: a programming abstraction for efficiency and portability in deep learning workloads.

[DOI]

Evangelos Georganas

,

Dhiraj D. Kalamkar

,

Sasikanth Avancha

,

Menachem Adelman

,

Cristina Anderson

,

Alexander Breuer

,

Jeremy Bruestle

,

Narendra Chaudhary

,

,

,

,

,

,

Ramanarayan Mohanty

,

,

,

Alexander Heinecke

Proceedings of the International Conference for High Performance Computing, 2021

2020

PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives.

[DOI]

Sanket Tavarageri

,

Alexander Heinecke

,

Sasikanth Avancha

,

Gagandeep Goyal

,

Ramakrishna Upadrasta

,

CoRR, 2020

Performance study of sustained petascale direct numerical simulation on Cray XC40 systems.

[DOI]

,

,

Maxwell Hutchinson

,

Alexander Heinecke

,

Lisandro Dalcín

,

Concurr. Comput. Pract. Exp., 2020

Optimizing deep learning recommender systems training on CPU cluster architectures.

[DOI]

Dhiraj D. Kalamkar

,

Evangelos Georganas

,

Sudarshan Srinivasan

,

,

Mikhail Shiryaev

,

Alexander Heinecke

Proceedings of the International Conference for High Performance Computing, 2020

Harnessing Deep Learning via a Single Building Block.

[DOI]

Evangelos Georganas

,

,

Dhiraj D. Kalamkar

,

Sasikanth Avancha

,

,

Michael J. Anderson

,

,

,

Alexander Heinecke

Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

2019

Optimizing Deep Learning RNN Topologies on Intel Architecture.

[DOI]

,

Evangelos Georganas

,

Dhiraj D. Kalamkar

,

,

,

Cristina Anderson

,

Alexander Heinecke

Supercomput. Front. Innov., 2019

Tensor-optimized hardware accelerates fused discontinuous Galerkin simulations.

[DOI]

Alexander Heinecke

,

Alexander Breuer

,

Parallel Comput., 2019

Training Neural Machine Translation (NMT) Models using Tensor Train Decomposition on TensorFlow (T3F).

[DOI]

,

Alexander Heinecke

CoRR, 2019

High-Performance Deep Learning via a Single Building Block.

[DOI]

Evangelos Georganas

,

,

Dhiraj D. Kalamkar

,

Sasikanth Avancha

,

,

Michael J. Anderson

,

,

,

Alexander Heinecke

CoRR, 2019

A Study of BFLOAT16 for Deep Learning Training.

[DOI]

Dhiraj D. Kalamkar

,

Dheevatsa Mudigere

,

Naveen Mellempudi

,

,

,

Sasikanth Avancha

,

Dharma Teja Vooturi

,

Nataraj Jammalamadaka

,

,

,

,

,

Alexander Heinecke

,

Evangelos Georganas

,

Sudarshan Srinivasan

,

,

Misha Smelyanskiy

,

,

CoRR, 2019

Petaflop Seismic Simulations in the Public Cloud.

[DOI]

Alexander Breuer

,

,

Alexander Heinecke

Proceedings of the High Performance Computing - 34th International Conference, 2019

Training Google Neural Machine Translation on an Intel CPU Cluster.

[DOI]

Dhiraj D. Kalamkar

,

,

Sudarshan Srinivasan

,

Srinivas Sridharan

,

Evangelos Georganas

,

Mikhail E. Smorkalov

,

,

Alexander Heinecke

Proceedings of the 2019 IEEE International Conference on Cluster Computing, 2019

ISA mapper: a compute and hardware agnostic deep learning compiler.

[DOI]

Matthew Sotoudeh

,

,

Michael J. Anderson

,

Evangelos Georganas

,

Alexander Heinecke

,

Proceedings of the 16th ACM International Conference on Computing Frontiers, 2019

Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations.

[DOI]

,

Ping Tak Peter Tang

,

Alexander Heinecke

Proceedings of the 26th IEEE Symposium on Computer Arithmetic, 2019

2018

Anatomy of high-performance deep learning convolutions on SIMD architectures.

[DOI]

Evangelos Georganas

,

Sasikanth Avancha

,

,

Dhiraj D. Kalamkar

,

,

,

Alexander Heinecke

Proceedings of the International Conference for High Performance Computing, 2018

Mixed Precision Training of Convolutional Neural Networks using Integer Operations.

[DOI]

,

Naveen Mellempudi

,

Dheevatsa Mudigere

,

Dhiraj D. Kalamkar

,

Sasikanth Avancha

,

,

Srinivas Sridharan

,

Karthik Vaidyanathan

,

,

Evangelos Georganas

,

Alexander Heinecke

,

,

,

Nikita Shustrov

,

,

Evarist Fomenko

,

Vadim O. Pirogov

Proceedings of the 6th International Conference on Learning Representations, 2018

2017

Accelerating Seismic Simulations Using the Intel Xeon Phi Knights Landing Processor.

[DOI]

,

Alexander Breuer

,

Alexander Heinecke

,

,

Proceedings of the High Performance Computing - 32nd International Conference, 2017

EDGE: Extreme Scale Fused Seismic Simulations with the Discontinuous Galerkin Method.

[DOI]

Alexander Breuer

,

Alexander Heinecke

,

Proceedings of the High Performance Computing - 32nd International Conference, 2017

2016

Optimizations in a high-performance conjugate gradient benchmark for IA-based multi- and many-core processors.

[DOI]

,

Mikhail Smelyanskiy

,

Karthikeyan Vaidyanathan

,

Alexander Heinecke

,

Dhiraj D. Kalamkar

,

Md. Mostofa Ali Patwary

,

Vadim O. Pirogov

,

,

,

,

,

Christopher S. Daley

Int. J. High Perform. Comput. Appl., 2016

Data mining on vast data sets as a cluster system benchmark.

[DOI]

Alexander Heinecke

,

Roman Karlstetter

,

,

Hans-Joachim Bungartz

Concurr. Comput. Pract. Exp., 2016

Efficiency of High Order Spectral Element Methods on Petascale Architectures.

[DOI]

Maxwell Hutchinson

,

Alexander Heinecke

,

,

,

,

Proceedings of the High Performance Computing - 31st International Conference, 2016

High Order Seismic Simulations on the Intel Xeon Phi Processor (Knights Landing).

[DOI]

Alexander Heinecke

,

Alexander Breuer

,

,

Proceedings of the High Performance Computing - 31st International Conference, 2016

LIBXSMM: accelerating small matrix multiplications by runtime code generation.

[DOI]

Alexander Heinecke

,

,

Maxwell Hutchinson

,

Proceedings of the International Conference for High Performance Computing, 2016

Petascale Local Time Stepping for the ADER-DG Finite Element Method.

[DOI]

Alexander Breuer

,

Alexander Heinecke

,

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

2015

Supercomputing for Molecular Dynamics Simulations - Handling Multi-Trillion Particles in Nanofluidics

[DOI]

Alexander Heinecke

,

Wolfgang Eckhardt

,

,

Hans-Joachim Bungartz

Springer Briefs in Computer Science, Springer, ISBN: 978-3-319-17148-7, 2015

Beacon: Deployment and Application of Intel Xeon Phi Coprocessors for Scientific Computing.

[DOI]

,

Alexander Heinecke

,

Anthony B. Costa

,

,

Vincent C. Betro

,

,

,

Comput. Sci. Eng., 2015

Cache-oblivious matrix algorithms in the age of multicores and many cores.

[DOI]

Alexander Heinecke

,

Carsten Trinitis

Concurr. Comput. Pract. Exp., 2015

High-Order ADER-DG Minimizes Energy- and Time-to-Solution of SeisSol.

[DOI]

Alexander Breuer

,

Alexander Heinecke

,

Leonhard Rannabauer

,

Proceedings of the High Performance Computing - 30th International Conference, 2015

Full correlation matrix analysis of fMRI data on Intel® Xeon Phi™ coprocessors.

[DOI]

,

Michael J. Anderson

,

Jonathan D. Cohen

,

Alexander Heinecke

,

,

Nadathur Satish

,

Narayanan Sundaram

,

Nicholas B. Turk-Browne

,

Theodore L. Willke

Proceedings of the International Conference for High Performance Computing, 2015

Exploring Shared-Memory Optimizations for an Unstructured Mesh CFD Application on Modern Parallel Systems.

[DOI]

Dheevatsa Mudigere

,

Srinivas Sridharan

,

Anand M. Deshpande

,

,

Alexander Heinecke

,

Mikhail Smelyanskiy

,

,

,

Dinesh K. Kaushik

,

Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

Optimized Force Calculation in Molecular Dynamics Simulations for the Intel Xeon Phi.

[DOI]

,

,

,

Wolfgang Eckhardt

,

Alexander Heinecke

,

Hans-Joachim Bungartz

,

Philipp Neumann

Proceedings of the Euro-Par 2015: Parallel Processing Workshops, 2015

2014

Boosting Scientific Computing Applications through Leveraging Data Parallel Architectures.

[DOI]

Alexander Heinecke

PhD thesis, 2014

ls1 mardyn: The massively parallel molecular dynamics code for large systems.

[DOI]

Christoph Niethammer

,

,

Martin Bernreuther

,

Martin Buchholz

,

Wolfgang Eckhardt

,

Alexander Heinecke

,

,

Hans-Joachim Bungartz

,

,

,

,

CoRR, 2014

Sustained Petascale Performance of Seismic Simulations with SeisSol on SuperMUC.

[DOI]

Alexander Breuer

,

Alexander Heinecke

,

Sebastian Rettenberger

,

,

Alice-Agnes Gabriel

,

Christian Pelties

Proceedings of the Supercomputing - 29th International Conference, 2014

Efficient Shared-Memory Implementation of High-Performance Conjugate Gradient Benchmark and its Application to Unstructured Matrices.

[DOI]

,

Mikhail Smelyanskiy

,

Karthikeyan Vaidyanathan

,

Alexander Heinecke

,

Dhiraj D. Kalamkar

,

,

Md. Mostofa Ali Patwary

,

,

Proceedings of the International Conference for High Performance Computing, 2014

Petascale High Order Dynamic Rupture Earthquake Simulations on Heterogeneous Supercomputers.

[DOI]

Alexander Heinecke

,

Alexander Breuer

,

Sebastian Rettenberger

,

,

Alice-Agnes Gabriel

,

Christian Pelties

,

,

,

,

Karthikeyan Vaidyanathan

,

Mikhail Smelyanskiy

,

Proceedings of the International Conference for High Performance Computing, 2014

Improving Communication Performance and Scalability of Native Applications on Intel Xeon Phi Coprocessor Clusters.

[DOI]

Karthikeyan Vaidyanathan

,

,

Dhiraj D. Kalamkar

,

Alexander Heinecke

,

Mikhail Smelyanskiy

,

,

,

Aniruddha G. Shet

,

,

,

Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

2013

Emerging Architectures Enable to Boost Massively Parallel Data Mining Using Adaptive Sparse Grids.

[DOI]

Alexander Heinecke

,

Int. J. Parallel Program., 2013

591 TFLOPS Multi-trillion Particles Simulation on SuperMUC.

[DOI]

Wolfgang Eckhardt

,

Alexander Heinecke

,

,

,

,

,

Hans-Georg Kleinhenz

,

,

,

,

Martin Bernreuther

,

,

Christoph Niethammer

,

,

Hans-Joachim Bungartz

Proceedings of the Supercomputing - 28th International Supercomputing Conference, 2013

Many-core architectures boost the pricing of basket options on adaptive sparse grids.

[DOI]

Alexander Heinecke

,

,

Hans-Joachim Bungartz

Proceedings of WHPCF'13: 6th Workshop on High Performance Computational Finance, 2013

Accelerating SeisSol by Generating Vectorized Code for Sparse Matrix Operators.

[DOI]

Alexander Breuer

,

Alexander Heinecke

,

,

Christian Pelties

Proceedings of the Parallel Computing: Accelerating Computational Science and Engineering (CSE), 2013

Design and Implementation of the Linpack Benchmark for Single and Multi-node Systems Based on Intel® Xeon Phi Coprocessor.

[DOI]

Alexander Heinecke

,

Karthikeyan Vaidyanathan

,

Mikhail Smelyanskiy

,

Alexander Kobotov

,

,

,

Aniruddha G. Shet

,

,

Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

Accelerators in scientific computing is it worth the effort?

[DOI]

Alexander Heinecke

Proceedings of the International Conference on High Performance Computing & Simulation, 2013

2012

Option pricing with a direct adaptive sparse grid approach.

[DOI]

Hans-Joachim Bungartz

,

Alexander Heinecke

,

,

Stefanie Schraufstetter

J. Comput. Appl. Math., 2012

A highly parallel Black-Scholes solver based on adaptive sparse grids.

[DOI]

Alexander Heinecke

,

Stefanie Schraufstetter

,

Hans-Joachim Bungartz

Int. J. Comput. Math., 2012

From GPGPU to Many-Core: Nvidia Fermi and Intel Many Integrated Core Architecture.

[DOI]

Alexander Heinecke

,

,

Hans-Joachim Bungartz

Comput. Sci. Eng., 2012

Exploiting State-of-the-Art x86 Architectures in Scientific Computing.

[DOI]

Alexander Heinecke

,

Thomas Auckenthaler

,

Carsten Trinitis

Proceedings of the 11th International Symposium on Parallel and Distributed Computing, 2012

HPCS 2012 panels: Panel I: Energy efficient systems in next generation high performance data and compute centers.

[DOI]

Laurent Lefèvre

,

,

Miguel A. Ordonez

,

Johnatan E. Pecero

,

Jean-Marc Pierson

,

Jesús Carretero

,

,

David R. C. Hill

,

,

Reinhard Schneider

,

James C. Sexton

,

,

Gorka Esnal Lopez

,

,

,

Giovanni Aloisio

,

Carsten Trinitis

,

Alexander Heinecke

,

Proceedings of the 2012 International Conference on High Performance Computing & Simulation, 2012

Sparse grid classifiers as base learners for AdaBoost.

[DOI]

Alexander Heinecke

,

Benjamin Peherstorfer

,

,

Proceedings of the 2012 International Conference on High Performance Computing & Simulation, 2012

Solving High-Dimensional Problems on Processors with Integrated GPU.

[DOI]

Alexander Heinecke

Proceedings of the Facing the Multicore-Challenge, 2012

An efficient vectorization of linked-cell particle simulations.

[DOI]

Wolfgang Eckhardt

,

Alexander Heinecke

Proceedings of the Computing Frontiers Conference, CF'12, 2012

2011

Making TifaMMy fit for tomorrow: Towards future shared memory systems and beyond.

[DOI]

Alexander Heinecke

,

Carsten Trinitis

Proceedings of the 2011 International Conference on High Performance Computing & Simulation, 2011

Towards High-Performance Implementations of a Custom HPC Kernel Using Intel® Array Building Blocks.

[DOI]

Alexander Heinecke

,

,

,

Proceedings of the Facing the Multicore - Challenge II, 2011

Extending a Highly Parallel Data Mining Algorithm to the Intel® Many Integrated Core Architecture.

[DOI]

Alexander Heinecke

,

,

,

,

Hans-Joachim Bungartz

Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

Multi- and many-core data mining with adaptive sparse grids.

[DOI]

Alexander Heinecke

,

Proceedings of the 8th Conference on Computing Frontiers, 2011

2010

Parallelizing a Black-Scholes solver based on finite elements and sparse grids.

[DOI]

Hans-Joachim Bungartz

,

Alexander Heinecke

,

,

Stefanie Schraufstetter

Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Porting existing cache-oblivious linear algebra HPC modules to larrabee architecture.

[DOI]

Alexander Heinecke

,

Carsten Trinitis

,

Josef Weidendorfer

Proceedings of the 7th Conference on Computing Frontiers, 2010

2007

Hardware-Oriented Implementation of Cache Oblivious Matrix Operations Based on Space-Filling Curves.

[DOI]

,

,

Stephan Günther

,

Alexander Heinecke

Proceedings of the Parallel Processing and Applied Mathematics, 2007
