Xianzhi Yu

Orcid: 0000-0002-1497-5525

According to our database, Xianzhi Yu authored at least 46 papers between 2020 and 2026.

Collaborative distances:

Dijkstra number of four.
Erdős number of four.

Bibliography

2026

PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding.

CoRR, May, 2026

FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning.

CoRR, May, 2026

QuantClaw: Precision Where It Matters for OpenClaw.

CoRR, April, 2026

HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models.

CoRR, April, 2026

BATQuant: Outlier-resilient MXFP4 Quantization via Learnable Block-wise Optimization.

CoRR, March, 2026

FreeAct: Freeing Activations for LLM Quantization.

CoRR, March, 2026

Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats.

CoRR, February, 2026

DLLM Agent: See Farther, Run Faster.

CoRR, February, 2026

What Makes Low-Bit Quantization-Aware Training Work for Reasoning LLMs? A Systematic Study.

CoRR, January, 2026

SwiftMem: Fast Agentic Memory via Query-aware Indexing.

CoRR, January, 2026

Revisiting Judge Decoding from First Principles via Training-Free Distributional Divergence.

CoRR, January, 2026

What Matters For Safety Alignment?

CoRR, January, 2026

EPLoN: Exploiting Efficient Parallelism with Selective Rematerialization for Lightning Attention on Ascend NPU.

Anastasiya Bistrigova

Proceedings of the 40th ACM International Conference on Supercomputing, 2026

Benchmarking Post-Training Quantization of Large Language Models under Microscaling Floating Point Formats.

Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2026

Analytical FFN-to-MoE Restructuring via Activation Pattern Analysis.

Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2026

2025

Towards Efficient Agents: A Co-Design of Inference Architecture and System.

CoRR, December, 2025

E³-Pruner: Towards Efficient, Economical, and Effective Layer Pruning for Large Language Models.

CoRR, November, 2025

Behavioral Fingerprinting of Large Language Models.

CoRR, September, 2025

Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling.

CoRR, September, 2025

EAQuant: Enhancing Post-Training Quantization for MoE Models via Expert-Aware Optimization.

CoRR, June, 2025

Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity.

CoRR, May, 2025

PreMoe: Lightening MoEs on Constrained Memory by Expert Pruning and Retrieval.

CoRR, May, 2025

TrimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time Scaling.

CoRR, May, 2025

Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs.

CoRR, May, 2025

Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models.

CoRR, April, 2025

SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention.

CoRR, February, 2025

CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference.

CoRR, February, 2025

AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference.

CoRR, February, 2025

AttentionPredictor: Temporal Patterns Matter for KV Cache Compression.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

A Simple Linear Patch Revives Layer-Pruned Large Language Models.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

HAP: Hybrid Adaptive Parallelism for Efficient Mixture-of-Experts Inference.

Proceedings of the 31st IEEE International Conference on Parallel and Distributed Systems, 2025

FlatQuant: Flatness Matters for LLM Quantization.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

Faster and Better LLMs via Latency-Aware Test-Time Scaling.

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2025, 2025

LLMShare: Optimizing LLM Inference Serving with Hardware Architecture Exploration.

Proceedings of the 62nd ACM/IEEE Design Automation Conference, 2025

2024

LO-SpMM: Low-cost Search for High-performance SpMM Kernels on GPUs.

ACM Trans. Archit. Code Optim., December, 2024

FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers.

CoRR, 2024

FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs.

CoRR, 2024

2023

EC-SpMM: Efficient Compilation of SpMM Kernel on GPUs.

Proceedings of the 52nd International Conference on Parallel Processing, 2023

2022

An Application-oblivious Memory Scheduling System for DNN Accelerators.

ACM Trans. Archit. Code Optim., 2022

HW-TSC's Submission for the WMT22 Efficiency Task.

Proceedings of the Seventh Conference on Machine Translation, 2022

Accelerating Sparse Convolution with Column Vector-Wise Sparsity.

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

2021

Optimizing the LINPACK Algorithm for Large-Scale PCIe-Based CPU-GPU Heterogeneous Systems.

IEEE Trans. Parallel Distributed Syst., 2021

Pinpointing the Memory Behaviors of DNN Training.

Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2021

2020

Revisiting LINPACK Algorithm on Large-Scale CPU-GPU Heterogeneous Systems.

Proceedings of the PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020
