Xianzhi Yu

Orcid: 0000-0002-1497-5525

According to our database, Xianzhi Yu authored at least 41 papers between 2020 and 2026.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2026
QuantClaw: Precision Where It Matters for OpenClaw.
CoRR, April, 2026

HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models.
CoRR, April, 2026

BATQuant: Outlier-resilient MXFP4 Quantization via Learnable Block-wise Optimization.
CoRR, March, 2026

FreeAct: Freeing Activations for LLM Quantization.
CoRR, March, 2026

Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats.
CoRR, February, 2026

DLLM Agent: See Farther, Run Faster.
CoRR, February, 2026

What Makes Low-Bit Quantization-Aware Training Work for Reasoning LLMs? A Systematic Study.
CoRR, January, 2026

Benchmarking Post-Training Quantization of Large Language Models under Microscaling Floating Point Formats.
CoRR, January, 2026

SwiftMem: Fast Agentic Memory via Query-aware Indexing.
CoRR, January, 2026

Revisiting Judge Decoding from First Principles via Training-Free Distributional Divergence.
CoRR, January, 2026

What Matters For Safety Alignment?
CoRR, January, 2026

2025
Towards Efficient Agents: A Co-Design of Inference Architecture and System.
CoRR, December, 2025

E³-Pruner: Towards Efficient, Economical, and Effective Layer Pruning for Large Language Models.
CoRR, November, 2025

Behavioral Fingerprinting of Large Language Models.
CoRR, September, 2025

Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling.
CoRR, September, 2025

EAQuant: Enhancing Post-Training Quantization for MoE Models via Expert-Aware Optimization.
CoRR, June, 2025

A Simple Linear Patch Revives Layer-Pruned Large Language Models.
CoRR, May, 2025

Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity.
CoRR, May, 2025

MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE.
CoRR, May, 2025

PreMoe: Lightening MoEs on Constrained Memory by Expert Pruning and Retrieval.
CoRR, May, 2025

L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models.
CoRR, May, 2025

TrimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time Scaling.
CoRR, May, 2025

Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs.
CoRR, May, 2025

Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models.
CoRR, April, 2025

SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention.
CoRR, February, 2025

CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference.
CoRR, February, 2025

AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference.
CoRR, February, 2025

HAP: Hybrid Adaptive Parallelism for Efficient Mixture-of-Experts Inference.
Proceedings of the 31st IEEE International Conference on Parallel and Distributed Systems, 2025

FlatQuant: Flatness Matters for LLM Quantization.
Proceedings of the Forty-second International Conference on Machine Learning, 2025

Faster and Better LLMs via Latency-Aware Test-Time Scaling.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2025, 2025

LLMShare: Optimizing LLM Inference Serving with Hardware Architecture Exploration.
Proceedings of the 62nd ACM/IEEE Design Automation Conference, 2025

2024
LO-SpMM: Low-cost Search for High-performance SpMM Kernels on GPUs.
ACM Trans. Archit. Code Optim., December, 2024

FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers.
CoRR, 2024

FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs.
CoRR, 2024

2023
EC-SpMM: Efficient Compilation of SpMM Kernel on GPUs.
Proceedings of the 52nd International Conference on Parallel Processing, 2023

2022
An Application-oblivious Memory Scheduling System for DNN Accelerators.
ACM Trans. Archit. Code Optim., 2022

HW-TSC's Submission for the WMT22 Efficiency Task.
Proceedings of the Seventh Conference on Machine Translation, 2022

Accelerating Sparse Convolution with Column Vector-Wise Sparsity.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

2021
Optimizing the LINPACK Algorithm for Large-Scale PCIe-Based CPU-GPU Heterogeneous Systems.
IEEE Trans. Parallel Distributed Syst., 2021

Pinpointing the Memory Behaviors of DNN Training.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2021

2020
Revisiting linpack algorithm on large-scale CPU-GPU heterogeneous systems.
Proceedings of the PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020

