Yaya Shi

Orcid: 0000-0003-0465-6712

According to our database, Yaya Shi authored at least 19 papers between 2018 and 2025.

Collaborative distances:

Dijkstra number of four.
Erdős number of four.

Bibliography

2025

iMOVE: Instance-Motion-Aware Video Understanding.

CoRR, February, 2025

TaskGalaxy: Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types.

CoRR, February, 2025

TaskGalaxy: Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

iMOVE: Instance-Motion-Aware Video Understanding.

Proceedings of the Findings of the Association for Computational Linguistics, 2025

2024

UniQRNet: Unifying Referring Expression Grounding and Segmentation with QRNet.

ACM Trans. Multim. Comput. Commun. Appl., August, 2024

mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model.

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

MIBench: Evaluating Multimodal Large Language Models over Multiple Images.

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval.

Proceedings of the 2024 Joint International Conference on Computational Linguistics, 2024

Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training.

Proceedings of the 2024 Joint International Conference on Computational Linguistics, 2024

2023

Learning Video-Text Aligned Representations for Video Captioning.

ACM Trans. Multim. Comput. Commun. Appl., 2023

Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks.

CoRR, 2023

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality.

CoRR, 2023

Learning Semantics-Grounded Vocabulary Representation for Video-Text Retrieval.

Proceedings of the 31st ACM International Conference on Multimedia, 2023

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video.

Proceedings of the International Conference on Machine Learning, 2023

2022

A Simple and Strong Baseline for Universal Targeted Attacks on Siamese Visual Tracking.

IEEE Trans. Circuits Syst. Video Technol., 2022

EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2020

Object Relational Graph With Teacher-Recommended Learning for Video Captioning.

Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

2019

VATEX Captioning Challenge 2019: Multi-modal Information Fusion and Multi-stage Training Strategy for Video Captioning.

CoRR, 2019

2018

Permafrost Presence/Absence Mapping of the Qinghai-Tibet Plateau Based on Multi-Source Remote Sensing Data.

Remote. Sens., 2018
