Di Zhang
ORCID: 0009-0006-5475-2728
Affiliations:
- Kuaishou Technology, Beijing, China
According to our database, Di Zhang authored at least 116 papers between 2021 and 2025.
Links
Online presence:
- on orcid.org
Bibliography
2025
A-SDM: Accelerating Stable Diffusion Through Model Assembly and Feature Inheritance Strategies.
IEEE Trans. Neural Networks Learn. Syst., October, 2025
Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS.
CoRR, August, 2025
CoRR, August, 2025
AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation.
CoRR, August, 2025
CoRR, July, 2025
Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation.
CoRR, June, 2025
CoRR, June, 2025
CoRR, June, 2025
Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval.
CoRR, June, 2025
CoRR, June, 2025
Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control.
CoRR, June, 2025
CoRR, May, 2025
MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback.
CoRR, May, 2025
CoRR, May, 2025
Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding.
IEEE Trans. Pattern Anal. Mach. Intell., April, 2025
CoRR, April, 2025
AlignRAG: An Adaptable Framework for Resolving Misalignments in Retrieval-Aware Reasoning of RAG.
CoRR, April, 2025
CoRR, April, 2025
Decoupling Contrastive Decoding: Robust Hallucination Mitigation in Multimodal Large Language Models.
CoRR, April, 2025
CoRR, March, 2025
HumanAesExpert: Advancing a Multi-Modality Foundation Model for Human Image Aesthetic Assessment.
CoRR, March, 2025
CoRR, March, 2025
CoRR, March, 2025
Boosting Resolution Generalization of Diffusion Transformers with Randomized Positional Encodings.
CoRR, March, 2025
CoRR, March, 2025
CoRR, March, 2025
CoRR, March, 2025
TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs.
CoRR, March, 2025
CoRR, March, 2025
ExGes: Expressive Human Motion Retrieval and Modulation for Audio-Driven Gesture Synthesis.
CoRR, March, 2025
RectifiedHR: Enable Efficient High-Resolution Image Generation via Energy Rectification.
CoRR, March, 2025
CoRR, February, 2025
FlexDuo: A Pluggable System for Enabling Full-Duplex Capabilities in Speech Dialogue Systems.
CoRR, February, 2025
Finedeep: Mitigating Sparse Activation in Dense LLMs via Multi-Layer Fine-Grained Experts.
CoRR, February, 2025
DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs.
CoRR, February, 2025
TaskGalaxy: Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types.
CoRR, February, 2025
CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation.
CoRR, February, 2025
CoRR, January, 2025
ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning.
CoRR, January, 2025
LLaMA-Berry: Pairwise Optimization for Olympiad-level Mathematical Reasoning via O1-like Monte Carlo Tree Search.
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
Cafe-Talk: Generating 3D Talking Face Animation with Multimodal Coarse- and Fine-grained Control.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
TaskGalaxy: Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
Unleashing the Potential of Multi-modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene Simulation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
PatchVSR: Breaking Video Diffusion Resolution Limits with Patch-wise Video Super-Resolution.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
Breaking the Stage Barrier: A Novel Single-Stage Approach to Long Context Extension for Large Language Models.
Proceedings of the 31st International Conference on Computational Linguistics, 2025
HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025
Proceedings of the Findings of the Association for Computational Linguistics, 2025
VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation.
Proceedings of the Findings of the Association for Computational Linguistics, 2025
Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025
2024
Biology Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models.
CoRR, 2024
CoRR, 2024
Video-Text Dataset Construction from Multi-AI Feedback: Promoting Weak-to-Strong Preference Learning for Video Large Language Models.
CoRR, 2024
CoRR, 2024
CoRR, 2024
LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning.
CoRR, 2024
SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs.
CoRR, 2024
CoRR, 2024
Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model.
CoRR, 2024
Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B.
CoRR, 2024
CoRR, 2024
Proceedings of the SIGGRAPH Asia 2024 Conference Papers, 2024
Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion.
Proceedings of the ACM SIGGRAPH 2024 Conference Papers, 2024
Proceedings of the ACM SIGGRAPH 2024 Conference Papers, 2024
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization.
Proceedings of the Forty-first International Conference on Machine Learning, 2024
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization.
Proceedings of the Twelfth International Conference on Learning Representations, 2024
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024
Small Agent Can Also Rock! Empowering Small Language Models as Hallucination Detector.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs.
Proceedings of the 2024 Joint International Conference on Computational Linguistics, 2024
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024
Just Ask One More Time! Self-Agreement Improves Reasoning of Language Models in (Almost) All Scenarios.
Proceedings of the Findings of the Association for Computational Linguistics, 2024
Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint.
Proceedings of the Findings of the Association for Computational Linguistics, 2024
2023
Ask One More Time: Self-Agreement Improves Reasoning of Language Models in (Almost) All Scenarios.
CoRR, 2023
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization.
CoRR, 2023
Resource Constrained Model Compression via Minimax Optimization for Spiking Neural Networks.
Proceedings of the 31st ACM International Conference on Multimedia, 2023
2022
PICASSO: Unleashing the Potential of GPU-centric Training for Wide-and-deep Recommender Systems.
Proceedings of the 38th IEEE International Conference on Data Engineering, 2022
Proceedings of the 38th IEEE International Conference on Data Engineering, 2022
2021
Proceedings of the CIKM '21: The 30th ACM International Conference on Information and Knowledge Management, Virtual Event, Queensland, Australia, November 1, 2021