Fengyun Rao

Orcid: 0000-0002-2868-2088

According to our database, Fengyun Rao authored at least 46 papers between 2019 and 2026.

Collaborative distances:

Dijkstra number of four.
Erdős number of four.

Bibliography

2026

REVERSE: Reinforcing Evidence Verification and Search for Agentic Image geo-localization.

CoRR, May, 2026

Stage-adaptive Token Selection for Efficient Omni-modal LLMs.

CoRR, May, 2026

Semantic-Enriched Latent Visual Reasoning.

CoRR, May, 2026

OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding.

CoRR, May, 2026

Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings.

CoRR, April, 2026

AdaMem: Adaptive User-Centric Memory for Long-Horizon Dialogue Agents.

CoRR, March, 2026

Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction.

CoRR, February, 2026

D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning.

CoRR, February, 2026

SAIL: Self-Amplified Iterative Learning for Diffusion Model Alignment with Minimal Human Feedback.

CoRR, February, 2026

ObjEmbed: Towards Universal Multimodal Object Embeddings.

CoRR, February, 2026

MMhops-R1: Multimodal Multi-hop Reasoning.

Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

2025

WeDetect: Fast Open-Vocabulary Object Detection as Retrieval.

CoRR, December, 2025

WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens.

CoRR, December, 2025

WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM.

CoRR, September, 2025

TempFlow-GRPO: When Timing Matters for GRPO in Flow Models.

CoRR, August, 2025

WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning.

CoRR, June, 2025

Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs.

CoRR, March, 2025

FlexSelect: Flexible Token Selection for Efficient Long Video Understanding.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

R1-Onevision: Advancing Generalized Multimodal Reasoning Through Cross-Modal Formalization.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-Reward Alignment.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Instruction-augmented Multimodal Alignment for Image-Text and Element Matching.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025

MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Number it: Temporal Grounding Videos like Flipping Manga.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2024

Advancing Video Quality Assessment for AIGC.

CoRR, 2024

Revisiting Video Quality Assessment from the Perspective of Generalization.

CoRR, 2024

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model.

CoRR, 2024

Multi-Modal Generative Embedding Model.

CoRR, 2024

Visual Perception by Large Language Model's Weights.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

ReGenNet: Towards Human Action-Reaction Synthesis.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Inter-X: Towards Versatile Human-Human Interaction Analysis.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Task Navigator: Decomposing Complex Tasks for Multimodal Large Language Models.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Spatial-Semantic Collaborative Cropping for User Generated Content.

Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

Image Captioning with Multi-Context Synthetic Data.

Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023

Text-Only Image Captioning with Multi-Context Data Generation.

CoRR, 2023

A Similarity Alignment Model for Video Copy Segment Matching.

CoRR, 2023

A Dual-level Detection Method for Video Copy Detection.

CoRR, 2023

2022

CA-SSL: Class-Agnostic Semi-Supervised Learning for Detection and Segmentation.

Proceedings of the Computer Vision - ECCV 2022, 2022

Tencent-MVSE: A Large-Scale Benchmark Dataset for Multi-Modal Video Similarity Evaluation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021

CaSP: Class-agnostic Semi-Supervised Pretraining for Detection and Segmentation.

CoRR, 2021

CLIP4Caption ++: Multi-CLIP for Video Caption.

CoRR, 2021

CLIP4Caption: CLIP for Video Caption.

Proceedings of the MM '21: ACM Multimedia Conference, Virtual Event, China, October 20, 2021

2019

Multi-Task Multi-Head Attention Memory Network for Fine-Grained Sentiment Analysis.

Proceedings of the Natural Language Processing and Chinese Computing, 2019
