Yuhang Zang

Orcid: 0000-0003-1110-5062

According to our database¹, Yuhang Zang authored at least 76 papers between 2019 and 2025.

Collaborative distances:

Dijkstra number² of four.
Erdős number³ of four.

Bibliography

2025

Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning.

CoRR, October, 2025

STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence.

CoRR, October, 2025

UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation.

CoRR, October, 2025

LSVOS 2025 Challenge Report: Recent Advances in Complex Video Object Segmentation.

CoRR, October, 2025

G²RPO: Granular GRPO for Precise Reward in Flow Models.

CoRR, October, 2025

2nd Place Report of MOSEv2 Challenge 2025: Concept Guided Video Object Segmentation via SeC.

CoRR, September, 2025

CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning.

CoRR, September, 2025

SPARK: Synergistic Policy And Reward Co-Evolving Framework.

CoRR, September, 2025

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing.

CoRR, September, 2025

SIM-CoT: Supervised Implicit Chain-of-Thought.

CoRR, September, 2025

Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning.

CoRR, August, 2025

CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning.

CoRR, August, 2025

DiCache: Let Diffusion Model Determine Its Own Cache.

CoRR, August, 2025

SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience.

CoRR, August, 2025

Beyond Fixed: Training-Free Variable-Length Denoising for Diffusion Large Language Models.

CoRR, August, 2025

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction.

CoRR, July, 2025

Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation.

CoRR, July, 2025

ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing.

CoRR, June, 2025

Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning.

CoRR, May, 2025

Visual Agentic Reinforcement Fine-Tuning.

CoRR, May, 2025

Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning.

CoRR, May, 2025

MM-IFEngine: Towards Multimodal Instruction Following.

CoRR, April, 2025

HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance.

CoRR, April, 2025

Unified Reward Model for Multimodal Understanding and Generation.

CoRR, March, 2025

Visual-RFT: Visual Reinforcement Fine-Tuning.

CoRR, March, 2025

Contextual Object Detection with Multimodal Large Language Models.

Int. J. Comput. Vis., February, 2025

SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation.

CoRR, February, 2025

Light-A-Video: Training-free Video Relighting via Progressive Light Fusion.

CoRR, February, 2025

VideoRoPE: What Makes for Good Video Rotary Position Embedding?

CoRR, February, 2025

BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning.

CoRR, January, 2025

MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

MotionClone: Training-Free Motion Cloning for Controllable Video Generation.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Conical Visual Concentration for Efficient Large Vision-Language Models.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

WildAvatar: Learning In-the-wild 3D Avatars from the Web.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model.

Proceedings of the Findings of the Association for Computational Linguistics, 2025

Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings.

Proceedings of the Findings of the Association for Computational Linguistics, 2025

2024

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions.

CoRR, 2024

X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models.

CoRR, 2024

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction.

CoRR, 2024

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree.

CoRR, 2024

Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate.

CoRR, 2024

BroadWay: Boost Your Text-to-Video Generation Model in a Training-free Way.

CoRR, 2024

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output.

CoRR, 2024

WildAvatar: Web-scale In-the-wild Video Dataset for 3D Avatar Creation.

CoRR, 2024

V3Det Challenge 2024 on Vast Vocabulary and Open Vocabulary Object Detection: Methods and Results.

CoRR, 2024

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions.

CoRR, 2024

Bootstrap3D: Improving 3D Content Creation with Synthetic Data.

CoRR, 2024

Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo.

CoRR, 2024

Unified Scene Representation and Reconstruction for 3D Large Language Models.

CoRR, 2024

Are We on the Right Way for Evaluating Large Vision-Language Models?

CoRR, 2024

RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition.

CoRR, 2024

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model.

CoRR, 2024

Streaming Long Video Understanding with Large Language Models.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

MMLONGBENCH-DOC: Benchmarking Long-context Document Understanding with Visualizations.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Are We on the Right Way for Evaluating Large Vision-Language Models?

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

VLMEvalKit: An Open-Source ToolKit for Evaluating Large Multi-Modality Models.

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Long-CLIP: Unlocking the Long-Text Capability of CLIP.

Proceedings of the Computer Vision - ECCV 2024, 2024

MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo.

Proceedings of the Computer Vision - ECCV 2024, 2024

Alpha-CLIP: A CLIP Model Focusing on Wherever you Want.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023

Semi-Supervised and Long-Tailed Object Detection with CascadeMatch.

Int. J. Comput. Vis., April, 2023

2022

Unified Vision and Language Prompt Learning.

CoRR, 2022

On-Device Domain Generalization.

CoRR, 2022

Open-Vocabulary DETR with Conditional Matching.

Proceedings of the Computer Vision - ECCV 2022, 2022

2021

FASA: Feature Augmentation and Sampling Adaptation for Long-Tailed Instance Segmentation.

Yuhang Zang, Chen Huang, Chen Change Loy

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Seesaw Loss for Long-Tailed Instance Segmentation.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020

1st Place Solutions for OpenImage2019 - Object Detection and Instance Segmentation.

CoRR, 2020

KPNet: Towards Minimal Face Detector.

Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020

2019

Efficient and Accurate Arbitrary-Shaped Text Detection With Pixel Aggregation Network.

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

Scene Text Detection with Supervised Pyramid Context Network.

Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019
