Yue Guan

ORCID: 0009-0005-7433-2627

Affiliations:
  • Shanghai Jiao Tong University, Department of Computer Science and Engineering, China


According to our database, Yue Guan authored at least 27 papers between 2020 and 2026.

Bibliography

2026
TLX: Hardware-Native, Evolvable MIMW GPU Compiler for Large-scale Production Environments.
CoRR, May, 2026

FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration.
CoRR, May, 2026

JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training.
CoRR, April, 2026

Pancake: Hierarchical Memory System for Multi-Agent LLM Serving.
CoRR, February, 2026

ScaleSim: Serving Large-Scale Multi-Agent Simulation with Invocation Distance-Based Memory Management.
CoRR, January, 2026

AutoOverlap: Enabling Fine-Grained Overlap of Computation and Communication with Chunk-Based Scheduling.
CoRR, January, 2026

Proton: Towards Multi-level, Adaptive Profiling for Triton.
Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2026

2025
Orders in Chaos: Enhancing Large-Scale MoE LLM Serving with Data Movement Forecasting.
CoRR, October, 2025

KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads.
CoRR, May, 2025

Mercury: Unlocking Multi-GPU Operator Optimization for LLMs via Remote Memory Scheduling.
Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, 2025

KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads.
Proceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation, 2025

WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training.
Proceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation, 2025

KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

Yggdrasil: Bridging Dynamic Speculation and Static Runtime for Latency-Optimal Tree-Based LLM Decoding.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference.
Proceedings of the Eighth Conference on Machine Learning and Systems, 2025

An Efficient Private GPT Never Autoregressively Decodes.
Proceedings of the Forty-second International Conference on Machine Learning, 2025

M-ANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2025

VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2025

2024
Accelerating Sparse DNNs Based on Tiled GEMM.
IEEE Trans. Computers, May, 2024

Fractal: Joint Multi-Level Sparse Pattern Tuning of Accuracy and Performance for DNN Pruning.
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024

Amanda: Unified Instrumentation Framework for Deep Neural Networks.
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024

2022
PAME: precision-aware multi-exit DNN serving for reducing latencies of batched inferences.
Proceedings of the ICS '22: 2022 International Conference on Supercomputing, Virtual Event, June 28, 2022

Transkimmer: Transformer Learns to Layer-wise Skim.
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022

Block-Skim: Efficient Question Answering for Transformer.
Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022

2020
Co-Design of Binary Processing in Memory ReRAM Array and DNN Model Optimization Algorithm.
IEICE Trans. Electron., 2020

Accelerating sparse DNN models without hardware-support via tile-wise sparsity.
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2020

How Far Does BERT Look At: Distance-based Clustering and Analysis of BERT's Attention.
Proceedings of the 28th International Conference on Computational Linguistics, 2020

