Size Zheng

Orcid: 0000-0002-9471-1780

Affiliations:
  • ByteDance, China
  • Peking University, Beijing, China (former)


According to our database1, Size Zheng authored at least 27 papers between 2020 and 2025.

Collaborative distances:
  • Dijkstra number2 of four.
  • Erdős number3 of four.

Timeline

Legend:

Book 
In proceedings 
Article 
PhD thesis 
Dataset
Other 

Links

Online presence:

On csauthors.net:

Bibliography

2025
SwiftSpec: Ultra-Low Latency LLM Decoding by Scaling Asynchronous Speculative Decoding.
CoRR, June, 2025

MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production.
CoRR, May, 2025

MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design.
CoRR, May, 2025

Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler.
CoRR, April, 2025

MoQa: Rethinking MoE Quantization with Multi-stage Data-model Distribution Awareness.
CoRR, March, 2025

TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives.
CoRR, March, 2025

Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts.
CoRR, February, 2025

Qtenon: Towards Low-Latency Architecture Integration for Accelerating Hybrid Quantum-Classical Computing.
Proceedings of the 52nd Annual International Symposium on Computer Architecture, 2025

2024
Rubick: A Unified Infrastructure for Analyzing, Exploring, and Implementing Spatial Architectures via Dataflow Decomposition.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., April, 2024

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference.
CoRR, 2024

ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving.
Proceedings of the Seventh Annual Conference on Machine Learning and Systems, 2024

vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs.
Proceedings of the Seventh Annual Conference on Machine Learning and Systems, 2024

SpREM: Exploiting Hamming Sparsity for Fast Quantum Readout Error Mitigation.
Proceedings of the 61st ACM/IEEE Design Automation Conference, 2024

MoteNN: Memory Optimization via Fine-grained Scheduling for Deep Neural Networks on Tiny Devices.
Proceedings of the 61st ACM/IEEE Design Automation Conference, 2024

SpecPIM: Accelerating Speculative Inference on PIM-Enabled System via Architecture-Dataflow Co-Exploration.
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024

MAGIS: Memory Optimization via Coordinated Graph Transformation and Scheduling for DNN.
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024

2023
TileFlow: A Framework for Modeling Fusion Dataflow via Tree-based Analysis.
Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023

ARES: A Mapping Framework of DNNs Towards Diverse PIMs with General Abstractions.
Proceedings of the IEEE/ACM International Conference on Computer Aided Design, 2023

Chimera: An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2023

Memory and Computation Coordinated Mapping of DNNs onto Complex Heterogeneous SoC.
Proceedings of the 60th ACM/IEEE Design Automation Conference, 2023

Rubick: A Synthesis Framework for Spatial Architectures via Dataflow Decomposition.
Proceedings of the 60th ACM/IEEE Design Automation Conference, 2023

2022
NeoFlow: A Flexible Framework for Enabling Efficient Compilation for High Performance DNN Training.
IEEE Trans. Parallel Distributed Syst., 2022

AMOS: enabling automatic mapping for tensor computations on spatial accelerators with hardware abstraction.
Proceedings of the ISCA '22: The 49th Annual International Symposium on Computer Architecture, New York, New York, USA, June 18, 2022

2021
HASCO: Towards Agile HArdware and Software CO-design for Tensor Computation.
Proceedings of the 48th ACM/IEEE Annual International Symposium on Computer Architecture, 2021

2020
SuSy: A Programming Model for Productive Construction of High-Performance Systolic Arrays on FPGAs.
Proceedings of the IEEE/ACM International Conference On Computer Aided Design, 2020

FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System.
Proceedings of the ASPLOS '20: Architectural Support for Programming Languages and Operating Systems, 2020


  Loading...