Di Zhang

Orcid: 0009-0006-5475-2728

Affiliations:
  • Kuaishou Technology, Beijing, China


According to our database, Di Zhang authored at least 116 papers between 2021 and 2025.

Bibliography

2025
A-SDM: Accelerating Stable Diffusion Through Model Assembly and Feature Inheritance Strategies.
IEEE Trans. Neural Networks Learn. Syst., October, 2025

Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS.
CoRR, August, 2025

Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery.
CoRR, August, 2025

Score Augmentation for Diffusion Models.
CoRR, August, 2025

AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation.
CoRR, August, 2025

Imbalance in Balance: Online Concept Balancing in Generation Models.
CoRR, July, 2025

VMoBA: Mixture-of-Block Attention for Video Diffusion Models.
CoRR, June, 2025

Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation.
CoRR, June, 2025

PlanMoGPT: Flow-Enhanced Progressive Planning for Text to Motion Synthesis.
CoRR, June, 2025

UNIC: Unified In-Context Video Editing.
CoRR, June, 2025

FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers.
CoRR, June, 2025

Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval.
CoRR, June, 2025

CamCloneMaster: Enabling Reference-based Camera Control for Video Generation.
CoRR, June, 2025

Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control.
CoRR, June, 2025

Control-R: Towards Controllable Test-Time Scaling.
CoRR, June, 2025

OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers.
CoRR, May, 2025

MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback.
CoRR, May, 2025

Scaling Image and Video Generation via Test-Time Evolutionary Search.
CoRR, May, 2025

Flow-GRPO: Training Flow Matching Models via Online RL.
CoRR, May, 2025

Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding.
IEEE Trans. Pattern Anal. Mach. Intell., April, 2025

A Survey of Interactive Generative Video.
CoRR, April, 2025

VLM as Policy: Common-Law Content Moderation Framework for Short Video Platform.
CoRR, April, 2025

AlignRAG: An Adaptable Framework for Resolving Misalignments in Retrieval-Aware Reasoning of RAG.
CoRR, April, 2025

InstructEngine: Instruction-driven Text-to-Image Alignment.
CoRR, April, 2025

Mavors: Multi-granularity Video Representation for Multimodal Large Language Model.
CoRR, April, 2025

Decoupling Contrastive Decoding: Robust Hallucination Mitigation in Multimodal Large Language Models.
CoRR, April, 2025

Leanabell-Prover: Posttraining Scaling in Formal Reasoning.
CoRR, April, 2025

Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation.
CoRR, March, 2025

HumanAesExpert: Advancing a Multi-Modality Foundation Model for Human Image Aesthetic Assessment.
CoRR, March, 2025

SARGes: Semantically Aligned Reliable Gesture Generation via Intent Chain.
CoRR, March, 2025

FullDiT: Multi-Task Video Generative Foundation Model with Full Attention.
CoRR, March, 2025

Boosting Resolution Generalization of Diffusion Transformers with Randomized Positional Encodings.
CoRR, March, 2025

Position: Interactive Generative Video as Next-Generation Game Engine.
CoRR, March, 2025

DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers.
CoRR, March, 2025

ReCamMaster: Camera-Controlled Generative Rendering from A Single Video.
CoRR, March, 2025

TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs.
CoRR, March, 2025

Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding.
CoRR, March, 2025

ExGes: Expressive Human Motion Retrieval and Modulation for Audio-Driven Gesture Synthesis.
CoRR, March, 2025

RectifiedHR: Enable Efficient High-Resolution Image Generation via Energy Rectification.
CoRR, March, 2025

SPPD: Self-training with Process Preference Learning Using Dynamic Value Margin.
CoRR, February, 2025

FlexDuo: A Pluggable System for Enabling Full-Duplex Capabilities in Speech Dialogue Systems.
CoRR, February, 2025

Finedeep: Mitigating Sparse Activation in Dense LLMs via Multi-Layer Fine-Grained Experts.
CoRR, February, 2025

DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs.
CoRR, February, 2025

iMOVE: Instance-Motion-Aware Video Understanding.
CoRR, February, 2025

MM-RLHF: The Next Step Forward in Multimodal LLM Alignment.
CoRR, February, 2025

TaskGalaxy: Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types.
CoRR, February, 2025

CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation.
CoRR, February, 2025

Improving Video Generation with Human Feedback.
CoRR, January, 2025

GameFactory: Creating New Games with Generative Interactive Videos.
CoRR, January, 2025

ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning.
CoRR, January, 2025

LLaMA-Berry: Pairwise Optimization for Olympiad-level Mathematical Reasoning via O1-like Monte Carlo Tree Search.
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025

3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Stable Segment Anything Model.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Cafe-Talk: Generating 3D Talking Face Animation with Multimodal Coarse- and Fine-grained Control.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

TaskGalaxy: Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Towards Precise Scaling Laws for Video Diffusion Transformers.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

StyleMaster: Stylize Your Video with Artistic Generation and Translation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Unleashing the Potential of Multi-modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene Simulation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

SketchVideo: Sketch-based Video Generation and Editing.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

GPAvatar: High-fidelity Head Avatars by Learning Efficient Gaussian Projections.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

PatchVSR: Breaking Video Diffusion Resolution Limits with Patch-wise Video Super-Resolution.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Breaking the Stage Barrier: A Novel Single-Stage Approach to Long Context Extension for Large Language Models.
Proceedings of the 31st International Conference on Computational Linguistics, 2025

HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

iMOVE: Instance-Motion-Aware Video Understanding.
Proceedings of the Findings of the Association for Computational Linguistics, 2025

VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation.
Proceedings of the Findings of the Association for Computational Linguistics, 2025

ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area.
Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

2024
Biology Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models.
CoRR, 2024

Owl-1: Omni World Model for Consistent Long Video Generation.
CoRR, 2024

SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints.
CoRR, 2024

Video-Text Dataset Construction from Multi-AI Feedback: Promoting Weak-to-Strong Preference Learning for Video Large Language Models.
CoRR, 2024

VIVID-10M: A Dataset and Baseline for Versatile and Interactive Video Local Editing.
CoRR, 2024

MolReFlect: Towards In-Context Fine-grained Alignments between Molecules and Texts.
CoRR, 2024

DMQR-RAG: Diverse Multi-Query Rewriting for RAG.
CoRR, 2024

Kwai-STaR: Transform LLMs into State-Transition Reasoners.
CoRR, 2024

LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning.
CoRR, 2024

ERABAL: Enhancing Role-Playing Agents through Boundary-Aware Learning.
CoRR, 2024

SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs.
CoRR, 2024

ViMo: Generating Motions from Casual Videos.
CoRR, 2024

EVLM: An Efficient Vision-Language Model for Visual Understanding.
CoRR, 2024

LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control.
CoRR, 2024

Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model.
CoRR, 2024

Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B.
CoRR, 2024

VideoTetris: Towards Compositional Text-to-Video Generation.
CoRR, 2024

RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance.
CoRR, 2024

UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark.
CoRR, 2024

Motion Inversion for Video Customization.
CoRR, 2024

ChemLLM: A Chemical Large Language Model.
CoRR, 2024

Towards Unified 3D Hair Reconstruction from Single-View Portraits.
Proceedings of the SIGGRAPH Asia 2024 Conference Papers, 2024

Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion.
Proceedings of the ACM SIGGRAPH 2024 Conference Papers, 2024

I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models.
Proceedings of the ACM SIGGRAPH 2024 Conference Papers, 2024

VideoTetris: Towards Compositional Text-to-Video Generation.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

DialogBench: Evaluating LLMs as Human-like Dialogue Systems.
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024

PlacidDreamer: Advancing Harmony in Text-to-3D Generation.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Inductive-Deductive Strategy Reuse for Multi-Turn Instructional Dialogues.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Small Agent Can Also Rock! Empowering Small Language Models as Hallucination Detector.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Learning Multi-Dimensional Human Preference for Text-to-Image Generation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs.
Proceedings of the 2024 Joint International Conference on Computational Linguistics, 2024

Parrot: Enhancing Multi-Turn Instruction Following for Large Language Models.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

Just Ask One More Time! Self-Agreement Improves Reasoning of Language Models in (Almost) All Scenarios.
Proceedings of the Findings of the Association for Computational Linguistics, 2024

Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint.
Proceedings of the Findings of the Association for Computational Linguistics, 2024

2023
I2V-Adapter: A General Image-to-Video Adapter for Video Diffusion Models.
CoRR, 2023

Ask One More Time: Self-Agreement Improves Reasoning of Language Models in (Almost) All Scenarios.
CoRR, 2023

DialogBench: Evaluating LLMs as Human-like Dialogue Systems.
CoRR, 2023

KwaiYiiMath: Technical Report.
CoRR, 2023

Parrot: Enhancing Multi-Turn Chat Models by Learning to Ask Questions.
CoRR, 2023

Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization.
CoRR, 2023

Resource Constrained Model Compression via Minimax Optimization for Spiking Neural Networks.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

2022
PICASSO: Unleashing the Potential of GPU-centric Training for Wide-and-deep Recommender Systems.
Proceedings of the 38th IEEE International Conference on Data Engineering, 2022

AMCAD: Adaptive Mixed-Curvature Representation based Advertisement Retrieval System.
Proceedings of the 38th IEEE International Conference on Data Engineering, 2022

2021
Exploring Sparse Expert Models and Beyond.
CoRR, 2021

SMAD: Scalable Multi-view Ad Retrieval System for E-Commerce Sponsored Search.
Proceedings of the CIKM '21: The 30th ACM International Conference on Information and Knowledge Management, Virtual Event, Queensland, Australia, November 1, 2021

