Sehoon Kim

ORCID: 0000-0002-9339-5480

Affiliations:
  • University of California, Berkeley, CA, USA (PhD 2024)


According to our database, Sehoon Kim authored at least 35 papers between 2021 and 2025.

Bibliography

2025
Multipole Attention for Efficient Long Context Reasoning.
CoRR, June, 2025

ETS: Efficient Tree Search for Inference-Time Scaling.
CoRR, February, 2025

QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache.
Proceedings of the Forty-second International Conference on Machine Learning, 2025

Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks.
Proceedings of the Forty-second International Conference on Machine Learning, 2025

Squeezed Attention: Accelerating Long Context Length LLM Inference.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

2024
Full Stack Approach for Efficient Deep Learning Inference.
PhD thesis, 2024

AI and Memory Wall.
IEEE Micro, 2024

Corrigendum: Applications and techniques for fast machine learning in science.
Frontiers Big Data, 2024

Squeezed Attention: Accelerating Long Context Length LLM Inference.
CoRR, 2024

Efficient and Scalable Estimation of Tool Representations in Vector Space.
CoRR, 2024

Characterizing Prompt Compression Methods for Long Context Inference.
CoRR, 2024

Learned Best-Effort LLM Serving.
CoRR, 2024

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

An LLM Compiler for Parallel Function Calling.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

SqueezeLLM: Dense-and-Sparse Quantization.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

TinyAgent: Function Calling at the Edge.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: EMNLP 2024, 2024

LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement.
Proceedings of the Findings of the Association for Computational Linguistics, 2024

2023
SPEED: Speculative Pipelined Execution for Efficient Decoding.
CoRR, 2023

Full Stack Optimization of Transformer Inference: a Survey.
CoRR, 2023

Big Little Transformer Decoder.
CoRR, 2023

Speculative Decoding with Big Little Decoder.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

2022
Applications and Techniques for Fast Machine Learning in Science.
Frontiers Big Data, 2022

Hessian-Aware Pruning and Optimal Neural Implant.
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022

A Fast Post-Training Pruning Framework for Transformers.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Squeezeformer: An Efficient Transformer for Automatic Speech Recognition.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Learned Token Pruning for Transformers.
Proceedings of the KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14, 2022

Integer-Only Zero-Shot Quantization for Efficient Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2022

2021
WindTunnel: Towards Differentiable ML Pipelines Beyond a Single Model.
Proc. VLDB Endow., 2021

Applications and Techniques for Fast Machine Learning in Science.
CoRR, 2021

Learned Token Pruning for Transformers.
CoRR, 2021

Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech Recognition.
CoRR, 2021

A Survey of Quantization Methods for Efficient Neural Network Inference.
CoRR, 2021

Terra: Imperative-Symbolic Co-Execution of Imperative Deep Learning Programs.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Memory-Efficient Hardware Performance Counters with Approximate-Counting Algorithms.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2021

I-BERT: Integer-only BERT Quantization.
Proceedings of the 38th International Conference on Machine Learning, 2021

