Yuhang Zang

Orcid: 0000-0003-1110-5062

According to our database, Yuhang Zang authored at least 63 papers between 2019 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience.
CoRR, August, 2025

Beyond Fixed: Training-Free Variable-Length Denoising for Diffusion Large Language Models.
CoRR, August, 2025

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction.
CoRR, July, 2025

Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation.
CoRR, July, 2025

ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing.
CoRR, June, 2025

Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning.
CoRR, May, 2025

Visual Agentic Reinforcement Fine-Tuning.
CoRR, May, 2025

Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning.
CoRR, May, 2025

MM-IFEngine: Towards Multimodal Instruction Following.
CoRR, April, 2025

HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance.
CoRR, April, 2025

Unified Reward Model for Multimodal Understanding and Generation.
CoRR, March, 2025

Visual-RFT: Visual Reinforcement Fine-Tuning.
CoRR, March, 2025

Contextual Object Detection with Multimodal Large Language Models.
Int. J. Comput. Vis., February, 2025

SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation.
CoRR, February, 2025

Light-A-Video: Training-free Video Relighting via Progressive Light Fusion.
CoRR, February, 2025

VideoRoPE: What Makes for Good Video Rotary Position Embedding?
CoRR, February, 2025

BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning.
CoRR, January, 2025

MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

MotionClone: Training-Free Motion Cloning for Controllable Video Generation.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Conical Visual Concentration for Efficient Large Vision-Language Models.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

WildAvatar: Learning In-the-wild 3D Avatars from the Web.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model.
Proceedings of the Findings of the Association for Computational Linguistics, 2025

Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings.
Proceedings of the Findings of the Association for Computational Linguistics, 2025

2024
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions.
CoRR, 2024

X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models.
CoRR, 2024

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction.
CoRR, 2024

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree.
CoRR, 2024

Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate.
CoRR, 2024

BroadWay: Boost Your Text-to-Video Generation Model in a Training-free Way.
CoRR, 2024

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output.
CoRR, 2024

WildAvatar: Web-scale In-the-wild Video Dataset for 3D Avatar Creation.
CoRR, 2024

V3Det Challenge 2024 on Vast Vocabulary and Open Vocabulary Object Detection: Methods and Results.
CoRR, 2024

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions.
CoRR, 2024

Bootstrap3D: Improving 3D Content Creation with Synthetic Data.
CoRR, 2024

Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo.
CoRR, 2024

Unified Scene Representation and Reconstruction for 3D Large Language Models.
CoRR, 2024

Are We on the Right Way for Evaluating Large Vision-Language Models?
CoRR, 2024

RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition.
CoRR, 2024

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model.
CoRR, 2024

Streaming Long Video Understanding with Large Language Models.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

MMLONGBENCH-DOC: Benchmarking Long-context Document Understanding with Visualizations.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Are We on the Right Way for Evaluating Large Vision-Language Models?
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

VLMEvalKit: An Open-Source ToolKit for Evaluating Large Multi-Modality Models.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Long-CLIP: Unlocking the Long-Text Capability of CLIP.
Proceedings of the Computer Vision - ECCV 2024, 2024

MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo.
Proceedings of the Computer Vision - ECCV 2024, 2024

Alpha-CLIP: A CLIP Model Focusing on Wherever you Want.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023
Semi-Supervised and Long-Tailed Object Detection with CascadeMatch.
Int. J. Comput. Vis., April, 2023

2022
Unified Vision and Language Prompt Learning.
CoRR, 2022

On-Device Domain Generalization.
CoRR, 2022

Open-Vocabulary DETR with Conditional Matching.
Proceedings of the Computer Vision - ECCV 2022, 2022

2021
FASA: Feature Augmentation and Sampling Adaptation for Long-Tailed Instance Segmentation.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Seesaw Loss for Long-Tailed Instance Segmentation.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020
1st Place Solutions for OpenImage2019 - Object Detection and Instance Segmentation.
CoRR, 2020

KPNet: Towards Minimal Face Detector.
Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020

2019
Efficient and Accurate Arbitrary-Shaped Text Detection With Pixel Aggregation Network.
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

Scene Text Detection with Supervised Pyramid Context Network.
Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019

