Dhiraj D. Kalamkar

According to our database¹, Dhiraj D. Kalamkar authored at least 35 papers between 2007 and 2026.

Collaborative distances:

Dijkstra number² of four.
Erdős number³ of three.

Bibliography

2026

Scalable Pretraining of Large Mixture of Experts Language Models on Aurora Super Computer.

CoRR, April, 2026

2025

Pushing the Envelope of LLM Inference on AI-PC.

Evangelos Georganas

Dhiraj D. Kalamkar

Alexander Heinecke

CoRR, August, 2025

ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts.

CoRR, March, 2025

2024

Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures.

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

LOSM: Leveraging OpenMP and Shared Memory for Accelerating Blocking MPI Allreduce.

Pranjal Walia

Ishan Shanware

Karthikeyan Vaidyanathan

Dhiraj D. Kalamkar

Uma M. Natarajan

Proceedings of the 31st IEEE International Conference on High Performance Computing, Data and Analytics, HiPC 2024, 2024

2023

Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures.

CoRR, 2023

2022

Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning and HPC Workloads.

Frontiers Appl. Math. Stat., 2022

Accelerating Deep Learning based Identification of Chromatin Accessibility from noisy ATAC-seq Data.

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2022

2021

Efficient and Generic 1D Dilated Convolution Layer for Deep Learning.

CoRR, 2021

Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning Workloads.

CoRR, 2021

DistGNN: scalable distributed training for large-scale graph neural networks.

Proceedings of the International Conference for High Performance Computing, 2021

Tensor processing primitives: a programming abstraction for efficiency and portability in deep learning workloads.

Proceedings of the International Conference for High Performance Computing, 2021

2020

Optimizing deep learning recommender systems training on CPU cluster architectures.

Proceedings of the International Conference for High Performance Computing, 2020

Harnessing Deep Learning via a Single Building Block.

Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

2019

Optimizing Deep Learning RNN Topologies on Intel Architecture.

Supercomput. Front. Innov., 2019

K-TanH: Hardware Efficient Activations For Deep Learning.

CoRR, 2019

High-Performance Deep Learning via a Single Building Block.

CoRR, 2019

A Study of BFLOAT16 for Deep Learning Training.

Nataraj Jammalamadaka

CoRR, 2019

Training Google Neural Machine Translation on an Intel CPU Cluster.

Proceedings of the 2019 IEEE International Conference on Cluster Computing, 2019

2018

On Scale-out Deep Learning Training for Cloud and HPC.

Srinivas Sridharan

Karthikeyan Vaidyanathan

CoRR, 2018

Anatomy of high-performance deep learning convolutions on SIMD architectures.

Proceedings of the International Conference for High Performance Computing, 2018

Mixed Precision Training of Convolutional Neural Networks using Integer Operations.

Proceedings of the 6th International Conference on Learning Representations, 2018

2016

Optimizations in a high-performance conjugate gradient benchmark for IA-based multi- and many-core processors.

Jongsoo Park

Mikhail Smelyanskiy

Karthikeyan Vaidyanathan

Alexander Heinecke

Dhiraj D. Kalamkar

Md. Mostofa Ali Patwary

Int. J. High Perform. Comput. Appl., 2016

Distributed Deep Learning Using Synchronous Stochastic Gradient Descent.

Dipankar Das

Sasikanth Avancha

Dheevatsa Mudigere

Karthikeyan Vaidyanathan

CoRR, 2016

Optimizing Wilson-Dirac Operator and Linear Solvers for Intel® KNL.

Bálint Joó

Dhiraj D. Kalamkar

Thorsten Kurth

Karthikeyan Vaidyanathan

Aaron Walden

Proceedings of the High Performance Computing, 2016

2015

Improving concurrency and asynchrony in multithreaded MPI applications using software offloading.

Karthikeyan Vaidyanathan

Proceedings of the International Conference for High Performance Computing, 2015

2014

Enabling Efficient Multithreaded MPI Communication through a Library-Based Implementation of MPI Endpoints.

Srinivas Sridharan

James Dinan

Dhiraj D. Kalamkar

Proceedings of the International Conference for High Performance Computing, 2014

Efficient Shared-Memory Implementation of High-Performance Conjugate Gradient Benchmark and its Application to Unstructured Matrices.

Jongsoo Park

Mikhail Smelyanskiy

Karthikeyan Vaidyanathan

Alexander Heinecke

Dhiraj D. Kalamkar

Xing Liu

Md. Mostofa Ali Patwary

Yutong Lu

Pradeep Dubey

Proceedings of the International Conference for High Performance Computing, 2014

Lattice QCD with Domain Decomposition on Intel® Xeon Phi Co-Processors.

Karthikeyan Vaidyanathan

Tilo Wettig

Pradeep Dubey

Proceedings of the International Conference for High Performance Computing, 2014

Improving Communication Performance and Scalability of Native Applications on Intel Xeon Phi Coprocessor Clusters.

Karthikeyan Vaidyanathan

Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

2013

Lattice QCD on Intel® Xeon Phi™ Coprocessors.

Bálint Joó

Dhiraj D. Kalamkar

Karthikeyan Vaidyanathan

William A. Watson III

Proceedings of the Supercomputing - 28th International Supercomputing Conference, 2013

2012

Optimization of geometric multigrid for emerging multi- and manycore processors.

Proceedings of the SC Conference on High Performance Computing Networking, 2012

Analysis and Optimization of Financial Analytics Benchmark on Modern Multi- and Many-core IA-Based Architectures.

Proceedings of the 2012 SC Companion: High Performance Computing, 2012

High Performance Non-uniform FFT on Modern X86-based Multi-core Systems.

Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

2007

Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads.

Dhiraj D. Kalamkar

Mainak Chaudhuri

Mark A. Heinrich

Proceedings of the 2007 IEEE International Symposium on Performance Analysis of Systems and Software, 2007
