Dhiraj D. Kalamkar

According to our database, Dhiraj D. Kalamkar authored at least 30 papers between 2007 and 2023.

Bibliography

2023
Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures.
CoRR, 2023

2022
Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning and HPC Workloads.
Frontiers Appl. Math. Stat., 2022

Accelerating Deep Learning based Identification of Chromatin Accessibility from noisy ATAC-seq Data.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2022

2021
Efficient and Generic 1D Dilated Convolution Layer for Deep Learning.
CoRR, 2021

Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning Workloads.
CoRR, 2021

DistGNN: scalable distributed training for large-scale graph neural networks.
Proceedings of the International Conference for High Performance Computing, 2021

Tensor processing primitives: a programming abstraction for efficiency and portability in deep learning workloads.
Proceedings of the International Conference for High Performance Computing, 2021

2020
Optimizing deep learning recommender systems training on CPU cluster architectures.
Proceedings of the International Conference for High Performance Computing, 2020

Harnessing Deep Learning via a Single Building Block.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

2019
Optimizing Deep Learning RNN Topologies on Intel Architecture.
Supercomput. Front. Innov., 2019

K-TanH: Hardware Efficient Activations For Deep Learning.
CoRR, 2019

High-Performance Deep Learning via a Single Building Block.
CoRR, 2019

A Study of BFLOAT16 for Deep Learning Training.
CoRR, 2019

Training Google Neural Machine Translation on an Intel CPU Cluster.
Proceedings of the 2019 IEEE International Conference on Cluster Computing, 2019

2018
On Scale-out Deep Learning Training for Cloud and HPC.
CoRR, 2018

Anatomy of high-performance deep learning convolutions on SIMD architectures.
Proceedings of the International Conference for High Performance Computing, 2018

Mixed Precision Training of Convolutional Neural Networks using Integer Operations.
Proceedings of the 6th International Conference on Learning Representations, 2018

2016
Optimizations in a high-performance conjugate gradient benchmark for IA-based multi- and many-core processors.
Int. J. High Perform. Comput. Appl., 2016

Distributed Deep Learning Using Synchronous Stochastic Gradient Descent.
CoRR, 2016

Optimizing Wilson-Dirac Operator and Linear Solvers for Intel® KNL.
Proceedings of the High Performance Computing, 2016

2015
Improving concurrency and asynchrony in multithreaded MPI applications using software offloading.
Proceedings of the International Conference for High Performance Computing, 2015

2014
Enabling Efficient Multithreaded MPI Communication through a Library-Based Implementation of MPI Endpoints.
Proceedings of the International Conference for High Performance Computing, 2014

Efficient Shared-Memory Implementation of High-Performance Conjugate Gradient Benchmark and its Application to Unstructured Matrices.
Proceedings of the International Conference for High Performance Computing, 2014

Lattice QCD with Domain Decomposition on Intel® Xeon Phi Co-Processors.
Proceedings of the International Conference for High Performance Computing, 2014

Improving Communication Performance and Scalability of Native Applications on Intel Xeon Phi Coprocessor Clusters.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

2013
Lattice QCD on Intel® Xeon Phi™ Coprocessors.
Proceedings of the Supercomputing - 28th International Supercomputing Conference, 2013

2012
Optimization of geometric multigrid for emerging multi- and manycore processors.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

Analysis and Optimization of Financial Analytics Benchmark on Modern Multi- and Many-core IA-Based Architectures.
Proceedings of the 2012 SC Companion: High Performance Computing, 2012

High Performance Non-uniform FFT on Modern X86-based Multi-core Systems.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

2007
Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads.
Proceedings of the 2007 IEEE International Symposium on Performance Analysis of Systems and Software, 2007

