Yuhang Cao

Orcid: 0009-0008-3627-590X

According to our database, Yuhang Cao authored at least 57 papers between 2017 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.



Bibliography

2025
COFFA: A Co-Design Framework for Fused-Grained Reconfigurable Architecture Towards Efficient Irregular Loop Handling.
IEEE Trans. Computers, September, 2025

Intern-S1: A Scientific Multimodal Foundation Model.
CoRR, August, 2025

SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience.
CoRR, August, 2025

Beyond Fixed: Training-Free Variable-Length Denoising for Diffusion Large Language Models.
CoRR, August, 2025

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction.
CoRR, July, 2025

ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing.
CoRR, June, 2025

Visual Agentic Reinforcement Fine-Tuning.
CoRR, May, 2025

MM-IFEngine: Towards Multimodal Instruction Following.
CoRR, April, 2025

HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance.
CoRR, April, 2025

Visual-RFT: Visual Reinforcement Fine-Tuning.
CoRR, March, 2025

SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation.
CoRR, February, 2025

Light-A-Video: Training-free Video Relighting via Progressive Light Fusion.
CoRR, February, 2025

VideoRoPE: What Makes for Good Video Rotary Position Embedding?
CoRR, February, 2025

BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning.
CoRR, January, 2025

Detecting and Reducing the Factual Hallucinations of Large Language Models with Metamorphic Testing.
Proc. ACM Softw. Eng., 2025

MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Conical Visual Concentration for Efficient Large Vision-Language Models.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model.
Proceedings of the Findings of the Association for Computational Linguistics, 2025

Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings.
Proceedings of the Findings of the Association for Computational Linguistics, 2025

2024
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions.
CoRR, 2024

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction.
CoRR, 2024

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree.
CoRR, 2024

Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate.
CoRR, 2024

BroadWay: Boost Your Text-to-Video Generation Model in a Training-free Way.
CoRR, 2024

SCA: Highly Efficient Semantic-Consistent Unrestricted Adversarial Attack.
CoRR, 2024

A General-Purpose Device for Interaction with LLMs.
CoRR, 2024

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output.
CoRR, 2024

V3Det Challenge 2024 on Vast Vocabulary and Open Vocabulary Object Detection: Methods and Results.
CoRR, 2024

DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models.
CoRR, 2024

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model.
CoRR, 2024

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Ximalaya ASDR System for ICASSP 2024 in-Car Multi-Channel (ICMC) ASR Challenge.
Proceedings of the IEEE International Conference on Acoustics, 2024

Diacorrect: Error Correction Back-End for Speaker Diarization.
Proceedings of the IEEE International Conference on Acoustics, 2024

MDCRA: A Reconfigurable Accelerator Framework for Multiple Dataflow Lanes.
Proceedings of the 35th IEEE International Conference on Application-specific Systems, 2024

2023
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition.
CoRR, 2023

Exploring the Power of Cross-Contextual Large Language Model in Mimic Emotion Prediction.
Proceedings of the 4th on Multimodal Sentiment Analysis Challenge and Workshop: Mimicked Emotions, 2023

Multimodal Cross-Lingual Features and Weight Fusion for Cross-Cultural Humor Detection.
Proceedings of the 4th on Multimodal Sentiment Analysis Challenge and Workshop: Mimicked Emotions, 2023

V3Det: Vast Vocabulary Visual Detection Dataset.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

A Dynamic Partial Reconfigurable CGRA Framework for Multi-Kernel Applications.
Proceedings of the International Conference on Field Programmable Technology, 2023

E²-ACE: An Energy-Efficient Reconfigurable Crypto-Accelerator with Agile End-to-End Toolchain.
Proceedings of the International Conference on Field Programmable Technology, 2023

PP-MET: A Real-World Personalized Prompt Based Meeting Transcription System.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

2022
MINI: Mining Implicit Novel Instances for Few-Shot Object Detection.
CoRR, 2022

The USTC-Ximalaya System for the ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription (M2met) Challenge.
Proceedings of the IEEE International Conference on Acoustics, 2022

TRAM: An Open-Source Template-based Reconfigurable Architecture Modeling Framework.
Proceedings of the 32nd International Conference on Field-Programmable Logic and Applications, 2022

2021
WSSOD: A New Pipeline for Weakly- and Semi-Supervised Object Detection.
CoRR, 2021

Few-Shot Object Detection via Association and DIscrimination.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Seesaw Loss for Long-Tailed Instance Segmentation.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020
Feature Pyramid Grids.
CoRR, 2020

Side-Aware Boundary Localization for More Precise Object Detection.
Proceedings of the Computer Vision - ECCV 2020, 2020

Prime Sample Attention in Object Detection.
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

2019
Speaker Direction-of-Arrival Estimation Based on Orthogonal Dipoles.
Circuits Syst. Signal Process., 2019

MMDetection: Open MMLab Detection Toolbox and Benchmark.
CoRR, 2019

Investigation of Cost Function for Supervised Monaural Speech Separation.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

2017
Speaker Direction-of-Arrival Estimation Based on Frequency-Independent Beampattern.
Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

