Haiyang Xu

ORCID: 0000-0001-9442-5912

Affiliations:
  • Alibaba Group


According to our database, Haiyang Xu authored at least 75 papers between 2015 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
Mobile-Agent-v3: Fundamental Agents for GUI Automation.
CoRR, August, 2025

L-CLIPScore: a Lightweight Embedding-based Captioning Metric for Evaluating and Training.
CoRR, July, 2025

Perception-Aware Policy Optimization for Multimodal Reasoning.
CoRR, July, 2025

Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation.
CoRR, June, 2025

VLM-R³: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought.
CoRR, May, 2025

Towards Efficient Online Tuning of VLM Agents via Counterfactual Soft Reinforcement Learning.
CoRR, May, 2025

Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration.
CoRR, February, 2025

PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC.
CoRR, February, 2025

Qwen2.5-VL Technical Report.
CoRR, February, 2025

Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks.
CoRR, January, 2025

Endowing Visual Reprogramming with Adversarial Robustness.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

2024
UniQRNet: Unifying Referring Expression Grounding and Segmentation with QRNet.
ACM Trans. Multim. Comput. Commun. Appl., August, 2024

SimInversion: A Simple Framework for Inversion-Based Text-to-Image Editing.
CoRR, 2024

TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning.
CoRR, 2024

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding.
CoRR, 2024

Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection.
CoRR, 2024

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception.
CoRR, 2024

MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

TinyChart: Efficient Chart Understanding with Program-of-Thoughts Learning and Visual Token Merging.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

MIBench: Evaluating Multimodal Large Language Models over Multiple Images.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Hallucination Augmented Contrastive Learning for Multimodal Large Language Model.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval.
Proceedings of the 2024 Joint International Conference on Computational Linguistics, 2024

Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training.
Proceedings of the 2024 Joint International Conference on Computational Linguistics, 2024

TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-training.
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023
Learning Video-Text Aligned Representations for Video Captioning.
ACM Trans. Multim. Comput. Commun. Appl., 2023

Achieving Human Parity on Visual Question Answering.
ACM Trans. Inf. Syst., 2023

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration.
CoRR, 2023

ModelScope-Agent: Building Your Customizable Agent System with Open-source Large Language Models.
CoRR, 2023

Evaluation and Analysis of Hallucination in Large Vision-Language Models.
CoRR, 2023

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding.
CoRR, 2023

Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks.
CoRR, 2023

Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation.
CoRR, 2023

Transforming Visual Scene Graphs to Image Captions.
CoRR, 2023

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality.
CoRR, 2023

ChatPLUG: Open-Domain Generative Dialogue System with Internet-Augmented Instruction Tuning for Digital Human.
CoRR, 2023

Adaptively Clustering Neighbor Elements for Image Captioning.
CoRR, 2023

mPLUG-Octopus: The Versatile Assistant Empowered by A Modularized End-to-End Multimodal LLM.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

Learning Semantics-Grounded Vocabulary Representation for Video-Text Retrieval.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

COPA: Efficient Vision-Language Pre-training through Collaborative Object- and Patch-Text Alignment.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video.
Proceedings of the International Conference on Machine Learning, 2023

HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Learning Trajectory-Word Alignments for Video-Language Tasks.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, 2023

ModelScope-Agent: Building Your Customizable Agent System with Open-source Large Language Models.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

Towards Adaptive Prefix Tuning for Parameter-Efficient Language Model Fine-tuning.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2023

Transforming Visual Scene Graphs to Image Captions.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

2022
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections.
CoRR, 2022

Image Captioning In the Transformer Age.
CoRR, 2022

mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections.
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

TRIPS: Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection.
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021
Achieving Human Parity on Visual Question Answering.
CoRR, 2021

Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training.
CoRR, 2021

SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels.
CoRR, 2021

E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning.
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021

2020
Storyline extraction from news articles with dynamic dependency.
Intell. Data Anal., 2020

Adversarial Multi-Binary Neural Network for Multi-class Classification.
CoRR, 2020

Selective Attention Encoders by Syntactic Graph Convolutional Networks for Document Summarization.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Neural Topic Modeling with Bidirectional Adversarial Training.
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020

2019
DELTA: A DEep learning based Language Technology plAtform.
CoRR, 2019

Learning Alignment for Multimodal Emotion Recognition from Speech.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Learning Syntactic and Dynamic Selective Encoding for Document Summarization.
Proceedings of the International Joint Conference on Neural Networks, 2019

NVSRN: A Neural Variational Scaling Reasoning Network for Initiative Response Generation.
Proceedings of the 2019 IEEE International Conference on Data Mining, 2019

2016
Unsupervised Storyline Extraction from News Articles.
Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 2016

2015
An Unsupervised Bayesian Modelling Approach for Storyline Detection on News Articles.
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015

