Pradeep Dubey

According to our database, Pradeep Dubey authored at least 116 papers between 1979 and 2023.

ACM Fellow

ACM Fellow 2023, "For contributions to emerging compute- and data-intensive applications and parallel processing computer architectures".

IEEE Fellow

IEEE Fellow 2001, "For contributions to computer architecture supporting multimedia processing."







Microscaling Data Formats for Deep Learning.
CoRR, 2023

AUTOSPARSE: Towards Automated Sparse Training of Deep Neural Networks.
CoRR, 2023

HamLib: A Library of Hamiltonians for Benchmarking Quantum Algorithms and Hardware.
Proceedings of the IEEE International Conference on Quantum Computing and Engineering, 2023

FP8 Formats for Deep Learning.
CoRR, 2022

Systolic Computing on GPUs for Productive Performance.
CoRR, 2020

MISIM: An End-to-End Neural Code Similarity System.
CoRR, 2020

Context-Aware Parse Trees.
CoRR, 2020

SuSy: A Programming Model for Productive Construction of High-Performance Systolic Arrays on FPGAs.
Proceedings of the IEEE/ACM International Conference On Computer Aided Design, 2020

Parallelizing Word2Vec in Shared and Distributed Memory.
IEEE Trans. Parallel Distributed Syst., 2019

K-TanH: Hardware Efficient Activations For Deep Learning.
CoRR, 2019

A Study of BFLOAT16 for Deep Learning Training.
CoRR, 2019

SysML: The New Frontier of Machine Learning Systems.
CoRR, 2019

T2S-Tensor: Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations.
Proceedings of the 27th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2019

Graphical exchange mechanisms.
Games Econ. Behav., 2018

Money as minimal complexity.
Games Econ. Behav., 2018

On Scale-out Deep Learning Training for Cloud and HPC.
CoRR, 2018

Mixed Precision Training of Convolutional Neural Networks using Integer Operations.
Proceedings of the 6th International Conference on Learning Representations, 2018

Ternary Neural Networks with Fine-Grained Quantization.
CoRR, 2017

Ternary Residual Networks.
CoRR, 2017

Deep learning at 15PF: supervised and semi-supervised classification for scientific data.
Proceedings of the International Conference for High Performance Computing, 2017

Galactos: computing the anisotropic 3-point correlation function for 2 billion galaxies.
Proceedings of the International Conference for High Performance Computing, 2017

The Quest for The Ultimate Learning Machine.
Proceedings of the 2017 ACM on International Symposium on Physical Design, 2017

ScaleDeep: A Scalable Compute Architecture for Learning and Evaluating Deep Networks.
Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017

Faster CNNs with Direct Sparse Convolutions and Guided Pruning.
Proceedings of the 5th International Conference on Learning Representations, 2017

Full-Stack Architecting to Achieve a Billion-Requests-Per-Second Throughput on a Single Key-Value Store Server Platform.
ACM Trans. Comput. Syst., 2016

Efficient Approximation Algorithms for Weighted b-Matching.
SIAM J. Sci. Comput., 2016

Achieving One Billion Key-Value Requests per Second on a Single Server.
IEEE Micro, 2016

Optimizations in a high-performance conjugate gradient benchmark for IA-based multi- and many-core processors.
Int. J. High Perform. Comput. Appl., 2016

Scaling up Hartree-Fock calculations on Tianhe-2.
Int. J. High Perform. Comput. Appl., 2016

Eliciting performance: deterministic versus proportional prizes.
Int. J. Game Theory, 2016

Holistic SparseCNN: Forging the Trident of Accuracy, Speed, and Size.
CoRR, 2016

BlackOut: Speeding up Recurrent Neural Network Language Models With Very Large Vocabularies.
Proceedings of the 4th International Conference on Learning Representations, 2016

Parallelizing Word2Vec in Multi-Core and Many-Core Architectures.
CoRR, 2016

Distributed Deep Learning Using Synchronous Stochastic Gradient Descent.
CoRR, 2016

High Order Seismic Simulations on the Intel Xeon Phi Processor (Knights Landing).
Proceedings of the High Performance Computing - 31st International Conference, 2016

Designing scalable b-Matching algorithms on distributed memory multiprocessors by approximation.
Proceedings of the International Conference for High Performance Computing, 2016

PANDA: Extreme Scale Parallel K-Nearest Neighbor on Distributed Architectures.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

GraphPad: Optimized Graph Primitives for Parallel and Distributed Platforms.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

GraphMat: High performance graph analytics made productive.
Proc. VLDB Endow., 2015

Beacon: Deployment and Application of Intel Xeon Phi Coprocessors for Scientific Computing.
Comput. Sci. Eng., 2015

GraphMat: High performance graph analytics made productive.
CoRR, 2015

Decentralization of a Machine: Some Definitions.
CoRR, 2015

Can traditional programming bridge the ninja performance gap for parallel computing applications?
Commun. ACM, 2015

Parallel Efficient Sparse Matrix-Matrix Multiplication on Multicore Platforms.
Proceedings of the High Performance Computing - 30th International Conference, 2015

BD-CATS: big data clustering at trillion particle scale.
Proceedings of the International Conference for High Performance Computing, 2015

High-performance algebraic multigrid solver optimized for multi-core based distributed parallel systems.
Proceedings of the International Conference for High Performance Computing, 2015

Improving graph partitioning for modern graphs and architectures.
Proceedings of the 5th Workshop on Irregular Applications - Architectures and Algorithms, 2015

Architecting to achieve a billion requests per second throughput on a single key-value store server platform.
Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015

Exploring Shared-Memory Optimizations for an Unstructured Mesh CFD Application on Modern Parallel Systems.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

Sparsifying Synchronization for High-Performance Shared-Memory Sparse Triangular Solver.
Proceedings of the Supercomputing - 29th International Conference, 2014

Navigating the maze of graph analytics frameworks using massive graph datasets.
Proceedings of the International Conference on Management of Data, 2014

Pardicle: Parallel Approximate Density-Based Clustering.
Proceedings of the International Conference for High Performance Computing, 2014

Efficient Shared-Memory Implementation of High-Performance Conjugate Gradient Benchmark and its Application to Unstructured Matrices.
Proceedings of the International Conference for High Performance Computing, 2014

Lattice QCD with Domain Decomposition on Intel® Xeon Phi Co-Processors.
Proceedings of the International Conference for High Performance Computing, 2014

Petascale High Order Dynamic Rupture Earthquake Simulations on Heterogeneous Supercomputers.
Proceedings of the International Conference for High Performance Computing, 2014

Improving the energy efficiency of Big Cores.
Proceedings of the ACM/IEEE 41st International Symposium on Computer Architecture, 2014

Improving Communication Performance and Scalability of Native Applications on Intel Xeon Phi Coprocessor Clusters.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

Intel "big data" science and technology center vision and execution plan.
SIGMOD Rec., 2013

Streaming Similarity Search over one Billion Tweets using Parallel Locality-Sensitive Hashing.
Proc. VLDB Endow., 2013

Lattice QCD on Intel® Xeon Phi™ Coprocessors.
Proceedings of the Supercomputing - 28th International Supercomputing Conference, 2013

Tera-scale 1D FFT with low-communication algorithm and Intel® Xeon Phi™ coprocessors.
Proceedings of the International Conference for High Performance Computing, 2013

Design and Implementation of the Linpack Benchmark for Single and Multi-node Systems Based on Intel® Xeon Phi Coprocessor.
Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

Efficient sparse matrix-vector multiplication on x86-based many-core processors.
Proceedings of the International Conference on Supercomputing, 2013

Large-scale fluid simulation using velocity-vorticity domain decomposition.
ACM Trans. Graph., 2012

CloudRAMSort: fast and efficient large-scale distributed RAM sort on shared-nothing cluster.
Proceedings of the ACM SIGMOD International Conference on Management of Data, 2012

Optimization of geometric multigrid for emerging multi- and manycore processors.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

Analysis and Optimization of Financial Analytics Benchmark on Modern Multi- and Many-core IA-Based Architectures.
Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Large-scale energy-efficient graph traversal: a path to efficient data-intensive supercomputing.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

Billion-particle SIMD-friendly two-point correlation on large-scale HPC cluster systems.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

GPP-Grep: High-Speed Regular Expression Processing Engine on General Purpose Processors.
Proceedings of the Research in Attacks, Intrusions, and Defenses, 2012

High Performance Non-uniform FFT on Modern X86-based Multi-core Systems.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

Fast and Efficient Graph Traversal Algorithm for CPUs: Maximizing Single-Node Efficiency.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

Emerging Applications.
Fundamentals of Multicore Software Development, 2012

Designing fast architecture-sensitive tree search on modern multicore/many-core processors.
ACM Trans. Database Syst., 2011

PALM: Parallel Architecture-Friendly Latch-Free Modifications to B+ Trees on Many-Core Processors.
Proc. VLDB Endow., 2011

Fast Updates on Read-Optimized Databases Using Multi-Core CPUs.
Proc. VLDB Endow., 2011

High-Performance 3D Compressive Sensing MRI Reconstruction Using Many-Core Architectures.
Int. J. Biomed. Imaging, 2011

Designing and dynamically load balancing hybrid LU for multi/many-core.
Comput. Sci. Res. Dev., 2011

Interactive hybrid simulation of large-scale traffic.
Proceedings of the International Conference on Computer Graphics and Interactive Techniques, 2011

PhysBAM: physically based simulation.
Proceedings of the International Conference on Computer Graphics and Interactive Techniques, 2011

High-performance lattice QCD for multi-core based parallel systems using a cache-friendly hybrid threaded-MPI approach.
Proceedings of the Conference on High Performance Computing Networking, 2011

Credit cards and inflation.
Games Econ. Behav., 2010

A celebration of Robert Aumann's achievements on the occasion of his 80th birthday.
Games Econ. Behav., 2010

Grading exams: 100, 99, 98, ... or A, B, C?
Games Econ. Behav., 2010

Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort.
Proceedings of the ACM SIGMOD International Conference on Management of Data, 2010

FAST: fast architecture sensitive tree search on modern CPUs and GPUs.
Proceedings of the ACM SIGMOD International Conference on Management of Data, 2010

PLEdestrians: A Least-Effort Approach to Crowd Simulation.
Proceedings of the 2010 Eurographics/ACM SIGGRAPH Symposium on Computer Animation, 2010

3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs.
Proceedings of the Conference on High Performance Computing Networking, 2010

Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU.
Proceedings of the 37th International Symposium on Computer Architecture (ISCA 2010), 2010

Mapping High-Fidelity Volume Rendering for Medical Imaging to CPU, GPU and Many-Core Architectures.
IEEE Trans. Vis. Comput. Graph., 2009

Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs.
Proc. VLDB Endow., 2009

Larrabee: A Many-Core x86 Architecture for Visual Computing.
IEEE Micro, 2009

Perfect competition in an oligopoly (including bilateral monopoly).
Games Econ. Behav., 2009

ClearPath: highly parallel collision avoidance for multi-agent simulation.
Proceedings of the 2009 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2009

Interactive Modeling, Simulation and Control of Large-Scale Crowds and Traffic.
Proceedings of the Motion in Games, Second International Workshop, 2009

Efficient implementation of sorting on multi-core SIMD CPU architecture.
Proc. VLDB Endow., 2008

Convergence of Recognition, Mining, and Synthesis Workloads and Its Implications.
Proc. IEEE, 2008

Second Life and the New Generation of Virtual Worlds.
Computer, 2008

Cache-conscious frequent pattern mining on modern and emerging processors.
VLDB J., 2007

Scaling performance of interior-point method on large-scale chip multiprocessor system.
Proceedings of the ACM/IEEE Conference on High Performance Networking and Computing, 2007

Strategic complements and substitutes, and potential games.
Games Econ. Behav., 2006

Competing for Customers in a Social Network: The Quasi-linear Case.
Proceedings of the Internet and Network Economics, Second International Workshop, 2006

Games of Connectivity.
Proceedings of the Internet and Network Economics, Second International Workshop, 2006

Compound voting and the Banzhaf index.
Games Econ. Behav., 2005

Cache-conscious Frequent Pattern Mining on a Modern Processor.
Proceedings of the 31st International Conference on Very Large Data Bases, Trondheim, Norway, August 30, 2005

A Characterization of Data Mining Workloads on a Modern Processor.
Proceedings of the Workshop on Data Management on New Hardware, 2005

Learning with perfect information.
Games Econ. Behav., 2004

Optimal scrutiny in multi-period promotion tournaments.
Games Econ. Behav., 2003

Compression Tolerant Watermarking for Image Verification.
Proceedings of the 2000 International Conference on Image Processing, 2000

Characterizing vulnerability of parallelism to resource constraints.
Proceedings of the Fourth International Conference on High-Performance Computing, 1997

Inefficiency of Nash Equilibria.
Math. Oper. Res., 1986

Totally balanced games arising from controlled programming problems.
Math. Program., 1984

Information Conditions, Communication and General Equilibrium.
Math. Oper. Res., 1981

Value Theory Without Efficiency.
Math. Oper. Res., 1981

Asymptotic Semivalues and a Short Proof of Kannai's Theorem.
Math. Oper. Res., 1980

Mathematical Properties of the Banzhaf Power Index.
Math. Oper. Res., 1979
