Ming Yan

Orcid: 0000-0003-4959-8878

Affiliations:
  • Alibaba Group, DAMO Academy, Institute of Intelligent Computing, Hangzhou, China
  • Chinese Academy of Sciences, Institute of Automation, National Laboratory of Pattern Recognition, Beijing, China (PhD 2016)


According to our database, Ming Yan authored at least 138 papers between 2013 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
Mobile-Agent-v3: Fundamental Agents for GUI Automation.
CoRR, August, 2025

L-CLIPScore: a Lightweight Embedding-based Captioning Metric for Evaluating and Training.
CoRR, July, 2025

Perception-Aware Policy Optimization for Multimodal Reasoning.
CoRR, July, 2025

WebSailor: Navigating Super-human Reasoning for Web Agent.
CoRR, July, 2025

Writing-RL: Advancing Long-form Writing via Adaptive Curriculum Reinforcement Learning.
CoRR, June, 2025

Scaling External Knowledge Input Beyond Context Windows of LLMs via Multi-Agent Collaboration.
CoRR, May, 2025

MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding.
CoRR, May, 2025

QwenLong-CPRS: Towards ∞-LLMs with Dynamic Context Optimization.
CoRR, May, 2025

QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning.
CoRR, May, 2025

VLM-R³: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought.
CoRR, May, 2025

SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization.
CoRR, May, 2025

Towards Efficient Online Tuning of VLM Agents via Counterfactual Soft Reinforcement Learning.
CoRR, May, 2025

WritingBench: A Comprehensive Benchmark for Generative Writing.
CoRR, March, 2025

MM-StoryAgent: Immersive Narrated Storybook Video Generation with a Multi-Agent Paradigm across Text, Image and Audio.
CoRR, March, 2025

Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration.
CoRR, February, 2025

Enhancing Language Multi-Agent Learning with Multi-Agent Credit Re-Assignment for Interactive Environment Generalization.
CoRR, February, 2025

PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC.
CoRR, February, 2025

Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks.
CoRR, January, 2025

Endowing Visual Reprogramming with Adversarial Robustness.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Mutual-Taught for Co-adapting Policy and Reward Models.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

A Training-free LLM-based Approach to General Chinese Character Error Correction.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

2024
UniQRNet: Unifying Referring Expression Grounding and Segmentation with QRNet.
ACM Trans. Multim. Comput. Commun. Appl., August, 2024

CLIP-VG: Self-Paced Curriculum Adapting of CLIP for Visual Grounding.
IEEE Trans. Multim., 2024

SimInversion: A Simple Framework for Inversion-Based Text-to-Image Editing.
CoRR, 2024

ProFuser: Progressive Fusion of Large Language Models.
CoRR, 2024

TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning.
CoRR, 2024

ReAct Meets ActRe: When Language Agents Enjoy Training Data Autonomy.
CoRR, 2024

RoleInteract: Evaluating the Social Interaction of Role-Playing Agents.
CoRR, 2024

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding.
CoRR, 2024

Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection.
CoRR, 2024

Meta Ranking: Less Capable Language Models are Capable for Single Response Judgement.
CoRR, 2024

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception.
CoRR, 2024

Knowledge Distillation for Closed-Source Language Models.
CoRR, 2024

Modeling Comparative Logical Relation with Contrastive Learning for Text Generation.
Proceedings of the Natural Language Processing and Chinese Computing, 2024

MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Part-Aware Prompt Tuning for Weakly Supervised Referring Expression Grounding.
Proceedings of the MultiMedia Modeling - 30th International Conference, 2024

Revisiting Unsupervised Temporal Action Localization: The Primacy of High-Quality Actionness and Pseudolabels.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Enhancing Zero-shot Audio Classification using Sound Attribute Knowledge from Large Language Models.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

Breaking Barriers of System Heterogeneity: Straggler-Tolerant Multimodal Federated Learning via Knowledge Distillation.
Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024

VG-Annotator: Vision-Language Models as Query Annotators for Unsupervised Visual Grounding.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2024

Two-Stage Information Bottleneck For Temporal Language Grounding.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2024

TinyChart: Efficient Chart Understanding with Program-of-Thoughts Learning and Visual Token Merging.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Small LLMs Are Weak Tool Learners: A Multi-LLM Agent.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

MIBench: Evaluating Multimodal Large Language Models over Multiple Images.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Hallucination Augmented Contrastive Learning for Multimodal Large Language Model.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval.
Proceedings of the 2024 Joint International Conference on Computational Linguistics, 2024

Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training.
Proceedings of the 2024 Joint International Conference on Computational Linguistics, 2024

SiTunes: A Situational Music Recommendation Dataset with Physiological and Psychological Signals.
Proceedings of the 2024 ACM SIGIR Conference on Human Information Interaction and Retrieval, 2024

Budget-Constrained Tool Learning with Planning.
Proceedings of the Findings of the Association for Computational Linguistics, 2024

Browse and Concentrate: Comprehending Multimodal Content via Prior-LLM Context Fusion.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

PANDA: Preference Adaptation for Enhancing Domain-Specific Abilities of LLMs.
Proceedings of the Findings of the Association for Computational Linguistics, 2024

SocialBench: Sociality Evaluation of Role-Playing Conversational Agents.
Proceedings of the Findings of the Association for Computational Linguistics, 2024

Model Composition for Multimodal Large Language Models.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-training.
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023
Attribute-Guided Collaborative Learning for Partial Person Re-Identification.
IEEE Trans. Pattern Anal. Mach. Intell., December, 2023

Multi-modal multi-hop interaction network for dialogue response generation.
Expert Syst. Appl., October, 2023

Achieving Human Parity on Visual Question Answering.
ACM Trans. Inf. Syst., 2023

An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation.
CoRR, 2023

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration.
CoRR, 2023

ModelScope-Agent: Building Your Customizable Agent System with Open-source Large Language Models.
CoRR, 2023

Evaluation and Analysis of Hallucination in Large Vision-Language Models.
CoRR, 2023

CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility.
CoRR, 2023

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding.
CoRR, 2023

Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks.
CoRR, 2023

CLIP-VG: Self-paced Curriculum Adapting of CLIP via Exploiting Pseudo-Language Labels for Visual Grounding.
CoRR, 2023

AMTSS: An Adaptive Multi-Teacher Single-Student Knowledge Distillation Framework For Multilingual Language Inference.
CoRR, 2023

Vision Langauge Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation.
CoRR, 2023

Transforming Visual Scene Graphs to Image Captions.
CoRR, 2023

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality.
CoRR, 2023

ChatPLUG: Open-Domain Generative Dialogue System with Internet-Augmented Instruction Tuning for Digital Human.
CoRR, 2023

mPLUG-Octopus: The Versatile Assistant Empowered by A Modularized End-to-End Multimodal LLM.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

Learning Semantics-Grounded Vocabulary Representation for Video-Text Retrieval.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

COPA: Efficient Vision-Language Pre-training through Collaborative Object- and Patch-Text Alignment.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping.
Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video.
Proceedings of the International Conference on Machine Learning, 2023

Construction and Applications of Billion-Scale Pre-Trained Multimodal Business Knowledge Graph.
Proceedings of the 39th IEEE International Conference on Data Engineering, 2023

HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Learning Trajectory-Word Alignments for Video-Language Tasks.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Improved Visual Fine-tuning with Natural Language Supervision.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, 2023

ModelScope-Agent: Building Your Customizable Agent System with Open-source Large Language Models.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

MCC-KD: Multi-CoT Consistent Knowledge Distillation.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, 2023

Distinguish Before Answer: Generating Contrastive Explanation as Knowledge for Commonsense Question Answering.
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023

2022
Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment.
CoRR, 2022

mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections.
CoRR, 2022

Learning Deep Semantic Model for Code Search using CodeSearchNet Corpus.
CoRR, 2022

Comprehensive Relationship Reasoning for Composed Query Based Image Retrieval.
Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022

X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval.
Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022

Attribute-guided Dynamic Routing Graph Network for Transductive Few-shot Learning.
Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022

DictBERT: Dictionary Description Knowledge Enhanced Language Model Pre-training via Contrastive Learning.
Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, 2022

CAT-MNER: Multimodal Named Entity Recognition with Knowledge-Refined Cross-Modal Attention.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2022

mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections.
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

TRIPS: Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection.
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

PromptMNER: Prompt-Based Entity-Related Visual Clue Extraction and Integration for Multimodal Named Entity Recognition.
Proceedings of the Database Systems for Advanced Applications, 2022

Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

WikiDiverse: A Multimodal Entity Linking Dataset with Diversified Contextual Topics and Entity Types.
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022

2021
Achieving Human Parity on Visual Question Answering.
CoRR, 2021

Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training.
CoRR, 2021

SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels.
CoRR, 2021

MinD at SemEval-2021 Task 6: Propaganda Detection using Transfer Learning and Multimodal Fusion.
Proceedings of the 15th International Workshop on Semantic Evaluation, 2021

E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning.
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021

Addressing Semantic Drift in Generative Question Answering with Auxiliary Extraction.
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021

StructuralLM: Structural Pre-training for Form Understanding.
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021

A Unified Pretraining Framework for Passage Ranking and Expansion.
Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021

2020
Aliababa DAMO Academy at TREC Precision Medicine 2020: State-of-the-art Evidence Retriever for Precision Medicine with Expert-in-the-loop Active Learning.
Proceedings of the Twenty-Ninth Text REtrieval Conference, 2020

StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding.
Proceedings of the 8th International Conference on Learning Representations, 2020

PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation.
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020

Generating Well-Formed Answers by Machine Reading with Stochastic Selector Networks.
Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020

2019
Symmetric Regularization based BERT for Pair-wise Semantic Reasoning.
CoRR, 2019

StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding.
CoRR, 2019

IDST at TREC 2019 Deep Learning Track: Deep Cascade Ranking with Generation-based Document Expansion and Pre-trained Language Modeling.
Proceedings of the Twenty-Eighth Text REtrieval Conference, 2019

Incorporating External Knowledge into Machine Reading for Generative Question Answering.
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019

Incorporating Relation Knowledge into Commonsense Reading Comprehension with Multi-task Learning.
Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019

A Deep Cascade Model for Multi-Document Reading Comprehension.
Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019

2018
Understanding Dynamic Cross-OSN Associations for Cold-Start Recommendation.
IEEE Trans. Multim., 2018

Multi-Granularity Hierarchical Attention Fusion Networks for Reading Comprehension and Question Answering.
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018

2017
Ensemble Methods for Personalized E-Commerce Search Challenge at CIKM Cup 2016.
CoRR, 2017

Session-aware Information Embedding for E-commerce Product Recommendation.
CoRR, 2017

Session-aware Information Embedding for E-commerce Product Recommendation.
Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2017

2016
A Unified Video Recommendation by Cross-Network User Modeling.
ACM Trans. Multim. Comput. Commun. Appl., 2016

Association Rules Mining Based Cross-network Knowledge Association and Collaborative Applications (in Chinese).
Computer Science (计算机科学), 2016

2015
YouTube Video Promotion by Cross-Network Association: @Britney to Advertise Gangnam Style.
IEEE Trans. Multim., 2015

Unified YouTube Video Recommendation via Cross-network Collaboration.
Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, 2015

2014
Twitter is Faster: Personalized Time-Aware Video Recommendation from Twitter to YouTube.
ACM Trans. Multim. Comput. Commun. Appl., 2014

Mining Cross-network Association for YouTube Video Promotion.
Proceedings of the ACM International Conference on Multimedia, MM '14, Orlando, FL, USA, November 03, 2014

2013
Friend transfer: Cold-start friend recommendation with cross-platform transfer learning of social knowledge.
Proceedings of the 2013 IEEE International Conference on Multimedia and Expo, 2013

User-Oriented Social Analysis across Social Media Sites.
Proceedings of the New Trends in Image Analysis and Processing - ICIAP 2013, 2013

