Mikhail Smelyanskiy

ORCID: 0000-0002-2433-6110

According to our database, Mikhail Smelyanskiy authored at least 63 papers between 2000 and 2022.

Bibliography

2022
Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization.
Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, 2022

Check-N-Run: a Checkpointing System for Training Deep Learning Recommendation Models.
Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, 2022


Supporting Massive DLRM Inference through Software Defined Memory.
Proceedings of the 42nd IEEE International Conference on Distributed Computing Systems, 2022

2021
Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale.
IEEE Micro, 2021

Differentiable NAS Framework and Application to Ads CTR Prediction.
CoRR, 2021

Supporting Massive DLRM Inference Through Software Defined Memory.
CoRR, 2021

High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models.
CoRR, 2021

FBGEMM: Enabling High-Performance Low-Precision Deep Learning Inference.
CoRR, 2021

2020
Check-N-Run: A Checkpointing System for Training Recommendation Models.
CoRR, 2020

Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems.
CoRR, 2020

RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing.
Proceedings of the 47th ACM/IEEE Annual International Symposium on Computer Architecture, 2020

The Architectural Implications of Facebook's DNN-Based Personalized Recommendation.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2020

2019
The Architectural Implications of Facebook's DNN-based Personalized Recommendation.
CoRR, 2019

Deep Learning Recommendation Model for Personalization and Recommendation Systems.
CoRR, 2019

A Study of BFLOAT16 for Deep Learning Training.
CoRR, 2019

Bandana: Using Non-Volatile Memory for Storing Deep Learning Models.
Proceedings of Machine Learning and Systems 2019, 2019

Zion: Facebook Next-Generation Large Memory Training Platform.
Proceedings of the 2019 IEEE Hot Chips 31 Symposium (HCS), 2019

2018
Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications.
CoRR, 2018

Glow: Graph Lowering Compiler Techniques for Neural Networks.
CoRR, 2018

Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2018

2017
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.
Proceedings of the 5th International Conference on Learning Representations, 2017

Distributed Hessian-Free Optimization for Deep Neural Network.
Proceedings of the Workshops of the Thirty-First AAAI Conference on Artificial Intelligence, 2017

2016
The BLIS Framework: Experiments in Portability.
ACM Trans. Math. Softw., 2016

Optimizations in a high-performance conjugate gradient benchmark for IA-based multi- and many-core processors.
Int. J. High Perform. Comput. Appl., 2016

Scaling up Hartree-Fock calculations on Tianhe-2.
Int. J. High Perform. Comput. Appl., 2016

qHiPSTER: The Quantum High Performance Software Testing Environment.
CoRR, 2016

Large Scale Distributed Hessian-Free Optimization for Deep Neural Network.
CoRR, 2016

High performance emulation of quantum circuits.
Proceedings of the International Conference for High Performance Computing, 2016

High Performance Parallel Stochastic Gradient Descent in Shared Memory.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

Sparso: Context-driven Optimizations of Sparse Linear Algebra.
Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, 2016

2015
Can traditional programming bridge the ninja performance gap for parallel computing applications?
Commun. ACM, 2015

High-performance algebraic multigrid solver optimized for multi-core based distributed parallel systems.
Proceedings of the International Conference for High Performance Computing, 2015

Exploring Shared-Memory Optimizations for an Unstructured Mesh CFD Application on Modern Parallel Systems.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

2014
Sparsifying Synchronization for High-Performance Shared-Memory Sparse Triangular Solver.
Proceedings of the Supercomputing - 29th International Conference, 2014

Efficient Shared-Memory Implementation of High-Performance Conjugate Gradient Benchmark and its Application to Unstructured Matrices.
Proceedings of the International Conference for High Performance Computing, 2014

Lattice QCD with Domain Decomposition on Intel® Xeon Phi Co-Processors.
Proceedings of the International Conference for High Performance Computing, 2014

Petascale High Order Dynamic Rupture Earthquake Simulations on Heterogeneous Supercomputers.
Proceedings of the International Conference for High Performance Computing, 2014

Improving Communication Performance and Scalability of Native Applications on Intel Xeon Phi Coprocessor Clusters.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

Anatomy of High-Performance Many-Threaded Matrix Multiplication.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

2013
Efficient backprojection-based synthetic aperture radar computation with many-core processors.
Sci. Program., 2013

Lattice QCD on Intel® Xeon Phi™ Coprocessors.
Proceedings of the Supercomputing - 28th International Supercomputing Conference, 2013

Exploring SIMD for Molecular Dynamics, Using Intel® Xeon® Processors and Intel® Xeon Phi Coprocessors.
Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

Design and Implementation of the Linpack Benchmark for Single and Multi-node Systems Based on Intel® Xeon Phi Coprocessor.
Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

Efficient sparse matrix-vector multiplication on x86-based many-core processors.
Proceedings of the International Conference on Supercomputing, 2013

2012
Optimization of geometric multigrid for emerging multi- and manycore processors.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

Analysis and Optimization of Financial Analytics Benchmark on Modern Multi- and Many-core IA-Based Architectures.
Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Improving the Performance of Dynamical Simulations Via Multiple Right-Hand Sides.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

High Performance Non-uniform FFT on Modern X86-based Multi-core Systems.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

2011
High-Performance 3D Compressive Sensing MRI Reconstruction Using Many-Core Architectures.
Int. J. Biomed. Imaging, 2011

Designing and dynamically load balancing hybrid LU for multi/many-core.
Comput. Sci. Res. Dev., 2011

High-performance lattice QCD for multi-core based parallel systems using a cache-friendly hybrid threaded-MPI approach.
Proceedings of the Conference on High Performance Computing Networking, 2011

2010
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU.
Proceedings of the 37th International Symposium on Computer Architecture (ISCA 2010), 2010

2009
Mapping High-Fidelity Volume Rendering for Medical Imaging to CPU, GPU and Many-Core Architectures.
IEEE Trans. Vis. Comput. Graph., 2009

2008
Convergence of Recognition, Mining, and Synthesis Workloads and Its Implications.
Proc. IEEE, 2008

An algorithm for the fast solution of symmetric linear complementarity problems.
Numerische Mathematik, 2008

Atomic Vector Operations on Chip Multiprocessors.
Proceedings of the 35th International Symposium on Computer Architecture (ISCA 2008), 2008

2007
Scaling performance of interior-point method on large-scale chip multiprocessor system.
Proceedings of the ACM/IEEE Conference on High Performance Networking and Computing, 2007

2004
Probabilistic Predicate-Aware Modulo Scheduling.
Proceedings of the 2nd IEEE / ACM International Symposium on Code Generation and Optimization (CGO 2004), 2004

2003
Predicate-Aware Scheduling: A Technique for Reducing Resource Constraints.
Proceedings of the 1st IEEE / ACM International Symposium on Code Generation and Optimization (CGO 2003), 2003

Systematic Register Bypass Customization for Application-Specific Processors.
Proceedings of the 14th IEEE International Conference on Application-Specific Systems, 2003

2001
Stack Value File: Custom Microarchitecture for the Stack.
Proceedings of the Seventh International Symposium on High-Performance Computer Architecture (HPCA'01), 2001

2000
Register Queues: A New Hardware/Software Approach to Efficient Software Pipelining.
Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques (PACT'00), 2000

