Sehoon Kim

Orcid: 0000-0002-9339-5480

Affiliations:
  • University of California, Berkeley, CA, USA (PhD 2024)


According to our database, Sehoon Kim authored at least 35 papers between 2021 and 2025.

Bibliography

2025
Multipole Attention for Efficient Long Context Reasoning.
CoRR, June, 2025

Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks.
CoRR, March, 2025

ETS: Efficient Tree Search for Inference-Time Scaling.
CoRR, February, 2025

QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache.
CoRR, February, 2025

Squeezed Attention: Accelerating Long Context Length LLM Inference.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

2024
Full Stack Approach for Efficient Deep Learning Inference.
PhD thesis, 2024

AI and Memory Wall.
IEEE Micro, 2024

Corrigendum: Applications and techniques for fast machine learning in science.
Frontiers Big Data, 2024

Squeezed Attention: Accelerating Long Context Length LLM Inference.
CoRR, 2024

Efficient and Scalable Estimation of Tool Representations in Vector Space.
CoRR, 2024

TinyAgent: Function Calling at the Edge.
CoRR, 2024

Characterizing Prompt Compression Methods for Long Context Inference.
CoRR, 2024

Learned Best-Effort LLM Serving.
CoRR, 2024

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

An LLM Compiler for Parallel Function Calling.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

SqueezeLLM: Dense-and-Sparse Quantization.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement.
Proceedings of the Findings of the Association for Computational Linguistics, 2024

2023
SPEED: Speculative Pipelined Execution for Efficient Decoding.
CoRR, 2023

Full Stack Optimization of Transformer Inference: a Survey.
CoRR, 2023

Big Little Transformer Decoder.
CoRR, 2023

Speculative Decoding with Big Little Decoder.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

2022
Applications and Techniques for Fast Machine Learning in Science.
Frontiers Big Data, 2022

Hessian-Aware Pruning and Optimal Neural Implant.
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022

A Fast Post-Training Pruning Framework for Transformers.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Squeezeformer: An Efficient Transformer for Automatic Speech Recognition.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Learned Token Pruning for Transformers.
Proceedings of the KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14, 2022

Integer-Only Zero-Shot Quantization for Efficient Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2022

2021
WindTunnel: Towards Differentiable ML Pipelines Beyond a Single Model.
Proc. VLDB Endow., 2021

Applications and Techniques for Fast Machine Learning in Science.
CoRR, 2021

Learned Token Pruning for Transformers.
CoRR, 2021

Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech Recognition.
CoRR, 2021

A Survey of Quantization Methods for Efficient Neural Network Inference.
CoRR, 2021

Terra: Imperative-Symbolic Co-Execution of Imperative Deep Learning Programs.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Memory-Efficient Hardware Performance Counters with Approximate-Counting Algorithms.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2021

I-BERT: Integer-only BERT Quantization.
Proceedings of the 38th International Conference on Machine Learning, 2021

