Esha Choukse

Orcid: 0000-0003-0371-5522

Affiliations:
  • Microsoft, Redmond, WA, USA


According to our database, Esha Choukse authored at least 44 papers between 2016 and 2026.

Bibliography

2026
Natural Language Query to Configuration for Retrieval Agents.
CoRR, May, 2026

StreamWise: Serving Multi-Modal Generation in Real-Time at Scale.
CoRR, March, 2026

DroidSpeak: KV Cache Sharing Across Fine-tuned Model Variants.
Proceedings of the 23rd USENIX Symposium on Networked Systems Design and Implementation, 2026

Harvesting Spare CPU Resources in Container Systems.
Proceedings of the 23rd USENIX Symposium on Networked Systems Design and Implementation, 2026

2025
Sherlock: Reliable and Efficient Agentic Workflow Execution.
CoRR, November, 2025

Energy Use of AI Inference: Efficiency Pathways and Test-Time Compute.
CoRR, September, 2025

Murakkab: Resource-Efficient Agentic Workflow Orchestration in Cloud Platforms.
CoRR, August, 2025

Power Stabilization for AI Training Datacenters.
CoRR, August, 2025

EcoServe: Designing Carbon-Aware AI Inference Systems.
CoRR, February, 2025

Towards Efficient Large Multimodal Model Serving.
CoRR, February, 2025

Enabling Sustainable Cloud Computing With Low-Carbon Server Design.
IEEE Micro, 2025

Splitwise: Efficient Generative LLM Inference Using Phase Splitting.
IEEE Micro, 2025

DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2025

Towards Resource-Efficient Compound AI Systems.
Proceedings of the 2025 Workshop on Hot Topics in Operating Systems, 2025

Performance Aware LLM Load Balancer for Mixed Workloads.
Proceedings of the 5th Workshop on Machine Learning and Systems, 2025

Slim-SC: Thought Pruning for Efficient Scaling with Self-Consistency.
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving.
Proceedings of the 2025 ACM Symposium on Cloud Computing, 2025

TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms.
Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2025

2024
DroidSpeak: Enhancing Cross-LLM Communication.
CoRR, 2024

Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations.
CoRR, 2024

Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling.
CoRR, 2024

Towards Greener LLMs: Bringing Energy-Efficiency to the Forefront of LLM Inference.
CoRR, 2024

Junctiond: Extending FaaS Runtimes with Kernel-Bypass.
CoRR, 2024

Input-Dependent Power Usage in GPUs.
Proceedings of the SC24-W: Workshops of the International Conference for High Performance Computing, 2024

Making Kernel Bypass Practical for the Cloud with Junction.
Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, 2024

Mosaic: Harnessing the Micro-Architectural Resources of Servers in Serverless Environments.
Proceedings of the 57th IEEE/ACM International Symposium on Microarchitecture, 2024

Memory Allocation Under Hardware Compression.
Proceedings of the 57th IEEE/ACM International Symposium on Microarchitecture, 2024

Designing Cloud Servers for Lower Carbon.
Proceedings of the 51st ACM/IEEE Annual International Symposium on Computer Architecture, 2024

SmartOClock: Workload- and Risk-Aware Overclocking in the Cloud.
Proceedings of the 51st ACM/IEEE Annual International Symposium on Computer Architecture, 2024

Splitwise: Efficient Generative LLM Inference Using Phase Splitting.
Proceedings of the 51st ACM/IEEE Annual International Symposium on Computer Architecture, 2024

DyLeCT: Achieving Huge-page-like Translation Performance for Hardware-compressed Memory.
Proceedings of the 51st ACM/IEEE Annual International Symposium on Computer Architecture, 2024

Characterizing Power Management Opportunities for LLMs in the Cloud.
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024

2023
POLCA: Power Oversubscription in LLM Cloud Providers.
CoRR, 2023

Towards Improved Power Management in Cloud GPUs.
IEEE Comput. Archit. Lett., 2023

Myths and Misconceptions Around Reducing Carbon Embedded in Cloud Platforms.
Proceedings of the 2nd Workshop on Sustainable Computer Systems, 2023

2022
Overclocking in Immersion-Cooled Datacenters.
IEEE Micro, 2022

Translation-optimized Memory Compression for Capacity.
Proceedings of the 55th IEEE/ACM International Symposium on Microarchitecture, 2022

2020
Buddy Compression: Enabling Larger Memory for Deep Learning and HPC Workloads on GPUs.
Proceedings of the 47th ACM/IEEE Annual International Symposium on Computer Architecture, 2020

2019
PruneTrain: Gradual Structured Pruning from Scratch for Faster Neural Network Training.
CoRR, 2019

PruneTrain: fast neural network training by dynamic sparse model reconfiguration.
Proceedings of the International Conference for High Performance Computing, 2019

2018
CompressPoints: An Evaluation Methodology for Compressed Memory Systems.
IEEE Comput. Archit. Lett., 2018

Compresso: Pragmatic Main Memory Compression.
Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, 2018

2016
Bit-Plane Compression: Transforming Data for Better Compression in Many-Core Architectures.
Proceedings of the 43rd ACM/IEEE Annual International Symposium on Computer Architecture, 2016
