Alexey Tumanov

ORCID: 0009-0005-7862-1477

According to our database, Alexey Tumanov authored at least 64 papers between 2007 and 2025.

Bibliography

2025
Toward Weight Sharing Paradigm for Efficient AI: Training and Inference Serving.
ACM SIGOPS Oper. Syst. Rev., July, 2025

EMPIRIC: Exploring Missing Pieces in KV Cache Compression for Reducing Computation, Storage, and Latency in Long-Context LLM Inference.
ACM SIGOPS Oper. Syst. Rev., July, 2025

Efficient LLM Inference via Chunked Prefills.
ACM SIGOPS Oper. Syst. Rev., July, 2025

On Evaluating Performance of LLM Inference Serving Systems.
CoRR, July, 2025

Maya: Optimizing Deep Learning Training Workloads using Emulated Virtual Accelerators.
CoRR, March, 2025

RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression.
CoRR, February, 2025

∇QDARTS: Quantization as an Elastic Dimension to Differentiable NAS.
Trans. Mach. Learn. Res., 2025

SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads.
Proceedings of the 22nd USENIX Symposium on Networked Systems Design and Implementation, 2025

Client Availability in Federated Learning: It Matters!
Proceedings of the 5th Workshop on Machine Learning and Systems, 2025

2024
PLUM: Improving Inference Efficiency By Leveraging Repetition-Sparsity Trade-Off.
Trans. Mach. Learn. Res., 2024

Initialization using Update Approximation is a Silver Bullet for Extremely Efficient Low-Rank Fine-Tuning.
CoRR, 2024

Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations.
CoRR, 2024

Metron: Holistic Performance Evaluation Framework for LLM Inference Systems.
CoRR, 2024

DεpS: Delayed ε-Shrinking for Faster Once-for-All Training.
CoRR, 2024

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve.
Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation, 2024

VIDUR: A Large-Scale Simulation Framework for LLM Inference.
Proceedings of the Seventh Annual Conference on Machine Learning and Systems, 2024

Harmonica: Hybrid Accelerator to Overcome Imperfections of Mixed-signal DNN Accelerators.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

SuperFedNAS: Cost-Efficient Federated Neural Architecture Search for On-device Inference.
Proceedings of the Computer Vision - ECCV 2024, 2024

DεpS: Delayed ε-Shrinking for Faster Once-for-All Training.
Proceedings of the Computer Vision - ECCV 2024, 2024

Inshrinkerator: Compressing Deep Learning Training Checkpoints via Dynamic Quantization.
Proceedings of the 2024 ACM Symposium on Cloud Computing, 2024

2023
Hardware-Software Co-Design for Real-Time Latency-Accuracy Navigation in Tiny Machine Learning Applications.
IEEE Micro, 2023

Signed Binarization: Unlocking Efficiency Through Repetition-Sparsity Trade-Off.
CoRR, 2023

ABKD: Graph Neural Network Compression with Attention-Based Knowledge Distillation.
CoRR, 2023

Pareto-Secure Machine Learning (PSML): Fingerprinting and Securing Inference Serving Systems.
CoRR, 2023

DynaQuant: Compressing Deep Learning Training Checkpoints via Dynamic Quantization.
CoRR, 2023

SuperFed: Weight Shared Federated Learning.
CoRR, 2023

Subgraph Stationary Hardware-Software Inference Co-Design.
Proceedings of the Sixth Conference on Machine Learning and Systems, 2023

TransEHR: Self-Supervised Transformer for Clinical Time Series Data.
Proceedings of the Machine Learning for Health, 2023

2022
UnfoldML: Cost-Aware and Uncertainty-Based Dynamic 2D Prediction for Multi-Stage Classification.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

CoDG-ReRAM: An Algorithm-Hardware Co-design to Accelerate Semi-Structured GNNs on ReRAM.
Proceedings of the IEEE 40th International Conference on Computer Design, 2022

Automatic Parallelization of Python Programs for Distributed Heterogeneous Computing.
Proceedings of the Euro-Par 2022: Parallel Processing, 2022

ESCHER: expressive scheduling with ephemeral resources.
Proceedings of the 13th Symposium on Cloud Computing, SoCC 2022, 2022

2021
CompOFA - Compound Once-For-All Networks for Faster Multi-Platform Deployment.
Proceedings of the 9th International Conference on Learning Representations, 2021

RubberBand: cloud-based hyperparameter tuning.
Proceedings of the EuroSys '21: Sixteenth European Conference on Computer Systems, 2021

2020
Cloudburst: Stateful Functions-as-a-Service.
Proc. VLDB Endow., 2020

Cloudburst: Stateful Functions-as-a-Service.
CoRR, 2020

HOLMES: Health OnLine Model Ensemble Serving for Deep Learning Models in Intensive Care Units.
Proceedings of the KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2020

InferLine: latency-aware provisioning and scaling for prediction serving pipelines.
Proceedings of the SoCC '20: ACM Symposium on Cloud Computing, 2020

2019
The OoO VLIW JIT Compiler for GPU Inference.
CoRR, 2019

Dynamic Space-Time Scheduling for GPU Inference.
CoRR, 2019

Lineage stash: fault tolerance off the critical path.
Proceedings of the 27th ACM Symposium on Operating Systems Principles, 2019

HyperSched: Dynamic Resource Reallocation for Model Development on a Deadline.
Proceedings of the ACM Symposium on Cloud Computing, SoCC 2019, 2019

Cirrus: a Serverless Framework for End-to-end ML Workflows.
Proceedings of the ACM Symposium on Cloud Computing, SoCC 2019, 2019

Serverless Computing: One Step Forward, Two Steps Back.
Proceedings of the 9th Biennial Conference on Innovative Data Systems Research, 2019

2018
InferLine: ML Inference Pipeline Composition Framework.
CoRR, 2018

Tributary: spot-dancing for elastic services with latency SLOs.
Proceedings of the 2018 USENIX Annual Technical Conference, 2018

IDK Cascades: Fast Deep Learning by Learning not to Overthink.
Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, 2018

Ray: A Distributed Framework for Emerging AI Applications.
Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation, 2018

3Sigma: distribution-based cluster scheduling for runtime uncertainty.
Proceedings of the Thirteenth EuroSys Conference, 2018

2017
Ray: A Distributed Framework for Emerging AI Applications.
CoRR, 2017

IDK Cascades: Fast Deep Learning by Learning not to Overthink.
CoRR, 2017

Real-Time Machine Learning: The Missing Pieces.
Proceedings of the 16th Workshop on Hot Topics in Operating Systems, 2017

Proteus: agile ML elasticity through tiered reliability in dynamic resource markets.
Proceedings of the Twelfth European Conference on Computer Systems, 2017

2016
Scheduling with Space-Time Soft Constraints In Heterogeneous Cloud Datacenters.
PhD thesis, 2016

Morpheus: Towards Automated SLOs for Enterprise Clusters.
Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, 2016

TetriSched: global rescheduling with adaptive plan-ahead in dynamic heterogeneous clusters.
Proceedings of the Eleventh European Conference on Computer Systems, 2016

2014
Agility and Performance in Elastic Distributed Storage.
ACM Trans. Storage, 2014

SpringFS: bridging agility and performance in elastic distributed storage.
Proceedings of the 12th USENIX conference on File and Storage Technologies, 2014

PriorityMeister: Tail Latency QoS for Shared Networked Storage.
Proceedings of the ACM Symposium on Cloud Computing, 2014

Exploiting iterative-ness for parallel ML computations.
Proceedings of the ACM Symposium on Cloud Computing, 2014

2012
alsched: algebraic scheduling of mixed workloads in heterogeneous clouds.
Proceedings of the ACM Symposium on Cloud Computing, SOCC '12, 2012

Heterogeneity and dynamicity of clouds at scale: Google trace analysis.
Proceedings of the ACM Symposium on Cloud Computing, SOCC '12, 2012

2011
Kaleidoscope: cloud micro-elasticity via VM state coloring.
Proceedings of the European Conference on Computer Systems, 2011

2007
Variability-Aware Latency Amelioration in Distributed Environments.
Proceedings of the IEEE Virtual Reality Conference, 2007
