Haoyu Cao

Orcid: 0000-0002-3789-9705

Affiliations:

Tencent YouTu Lab, Hefei, China

According to our database, Haoyu Cao authored at least 20 papers between 2022 and 2025.

Collaborative distances:

Dijkstra number of four.
Erdős number of four.

Bibliography

2025

VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting.

CoRR, October, 2025

Input Domain Aware MoE: Decoupling Routing Decisions from Task Optimization in Mixture of Experts.

CoRR, October, 2025

VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation.

CoRR, October, 2025

DREAM: Document Reconstruction via End-to-end Autoregressive Model.

CoRR, July, 2025

TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs.

CoRR, May, 2025

VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model.

CoRR, May, 2025

Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy.

CoRR, February, 2025

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction.

CoRR, January, 2025

2024

Turning a CLIP Model Into a Scene Text Spotter.

IEEE Trans. Pattern Anal. Mach. Intell., September, 2024

Communication-efficient clustered federated learning via model distance.

Mach. Learn., June, 2024

Break the Visual Perception: Adversarial Attacks Targeting Encoded Visual Tokens of Large Vision-Language Models.

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

HRVDA: High-Resolution Visual Document Assistant.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Few-shot Temporal Pruning Accelerates Diffusion Models for Text Generation.

Proceedings of the 2024 Joint International Conference on Computational Linguistics, 2024

Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction.

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

2023

ICDAR 2023 Competition on Structured Text Extraction from Visually-Rich Document Images.

Proceedings of the Document Analysis and Recognition - ICDAR 2023, 2023

Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

2022

GMN: Generative Multi-modal Network for Practical Document Information Extraction.

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022

Relational Representation Learning in Visually-Rich Documents.

Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022

Query-driven Generative Network for Document Information Extraction in the Wild.

Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022
