Xianzhi Yu

ORCID: 0000-0002-1497-5525

According to our database, Xianzhi Yu authored at least 24 papers between 2020 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
EAQuant: Enhancing Post-Training Quantization for MoE Models via Expert-Aware Optimization.
CoRR, June, 2025

A Simple Linear Patch Revives Layer-Pruned Large Language Models.
CoRR, May, 2025

Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity.
CoRR, May, 2025

MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE.
CoRR, May, 2025

Faster and Better LLMs via Latency-Aware Test-Time Scaling.
CoRR, May, 2025

PreMoe: Lightening MoEs on Constrained Memory by Expert Pruning and Retrieval.
CoRR, May, 2025

L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models.
CoRR, May, 2025

TrimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time Scaling.
CoRR, May, 2025

Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs.
CoRR, May, 2025

Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models.
CoRR, April, 2025

SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention.
CoRR, February, 2025

CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference.
CoRR, February, 2025

AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference.
CoRR, February, 2025

2024
LO-SpMM: Low-cost Search for High-performance SpMM Kernels on GPUs.
ACM Trans. Archit. Code Optim., December, 2024

FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers.
CoRR, 2024

FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs.
CoRR, 2024

FlatQuant: Flatness Matters for LLM Quantization.
CoRR, 2024

2023
EC-SpMM: Efficient Compilation of SpMM Kernel on GPUs.
Proceedings of the 52nd International Conference on Parallel Processing, 2023

2022
An Application-oblivious Memory Scheduling System for DNN Accelerators.
ACM Trans. Archit. Code Optim., 2022

HW-TSC's Submission for the WMT22 Efficiency Task.
Proceedings of the Seventh Conference on Machine Translation, 2022

Accelerating Sparse Convolution with Column Vector-Wise Sparsity.
Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022

2021
Optimizing the LINPACK Algorithm for Large-Scale PCIe-Based CPU-GPU Heterogeneous Systems.
IEEE Trans. Parallel Distributed Syst., 2021

Pinpointing the Memory Behaviors of DNN Training.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2021

2020
Revisiting LINPACK Algorithm on Large-Scale CPU-GPU Heterogeneous Systems.
Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '20), 2020

