Xianzhi Yu

ORCID: 0000-0002-1497-5525

According to our database, Xianzhi Yu authored at least 24 papers between 2020 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
EAQuant: Enhancing Post-Training Quantization for MoE Models via Expert-Aware Optimization.
CoRR, June, 2025

A Simple Linear Patch Revives Layer-Pruned Large Language Models.
CoRR, May, 2025

Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity.
CoRR, May, 2025

MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE.
CoRR, May, 2025

Faster and Better LLMs via Latency-Aware Test-Time Scaling.
CoRR, May, 2025

PreMoe: Lightening MoEs on Constrained Memory by Expert Pruning and Retrieval.
CoRR, May, 2025

L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models.
CoRR, May, 2025

TrimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time Scaling.
CoRR, May, 2025

Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs.
CoRR, May, 2025

Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models.
CoRR, April, 2025

SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention.
CoRR, February, 2025

CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference.
CoRR, February, 2025

AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference.
CoRR, February, 2025

2024
LO-SpMM: Low-cost Search for High-performance SpMM Kernels on GPUs.
ACM Trans. Archit. Code Optim., December, 2024

FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers.
CoRR, 2024

FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs.
CoRR, 2024

FlatQuant: Flatness Matters for LLM Quantization.
CoRR, 2024

2023
EC-SpMM: Efficient Compilation of SpMM Kernel on GPUs.
Proceedings of the 52nd International Conference on Parallel Processing, 2023

2022
An Application-oblivious Memory Scheduling System for DNN Accelerators.
ACM Trans. Archit. Code Optim., 2022

HW-TSC's Submission for the WMT22 Efficiency Task.
Proceedings of the Seventh Conference on Machine Translation, 2022

Accelerating Sparse Convolution with Column Vector-Wise Sparsity.
Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022

2021
Optimizing the LINPACK Algorithm for Large-Scale PCIe-Based CPU-GPU Heterogeneous Systems.
IEEE Trans. Parallel Distributed Syst., 2021

Pinpointing the Memory Behaviors of DNN Training.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2021

2020
Revisiting LINPACK Algorithm on Large-Scale CPU-GPU Heterogeneous Systems.
Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '20), 2020

