Yaya Shi

Orcid: 0000-0003-0465-6712

According to our database, Yaya Shi authored at least 19 papers between 2018 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.


Bibliography

2025
iMOVE: Instance-Motion-Aware Video Understanding.
CoRR, February, 2025

TaskGalaxy: Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types.
CoRR, February, 2025

TaskGalaxy: Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

iMOVE: Instance-Motion-Aware Video Understanding.
Proceedings of the Findings of the Association for Computational Linguistics, 2025

2024
UniQRNet: Unifying Referring Expression Grounding and Segmentation with QRNet.
ACM Trans. Multim. Comput. Commun. Appl., August, 2024

mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

MIBench: Evaluating Multimodal Large Language Models over Multiple Images.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval.
Proceedings of the 2024 Joint International Conference on Computational Linguistics, 2024

Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training.
Proceedings of the 2024 Joint International Conference on Computational Linguistics, 2024

2023
Learning Video-Text Aligned Representations for Video Captioning.
ACM Trans. Multim. Comput. Commun. Appl., 2023

Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks.
CoRR, 2023

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality.
CoRR, 2023

Learning Semantics-Grounded Vocabulary Representation for Video-Text Retrieval.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video.
Proceedings of the International Conference on Machine Learning, 2023

2022
A Simple and Strong Baseline for Universal Targeted Attacks on Siamese Visual Tracking.
IEEE Trans. Circuits Syst. Video Technol., 2022

EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2020
Object Relational Graph With Teacher-Recommended Learning for Video Captioning.
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

2019
VATEX Captioning Challenge 2019: Multi-modal Information Fusion and Multi-stage Training Strategy for Video Captioning.
CoRR, 2019

2018
Permafrost Presence/Absence Mapping of the Qinghai-Tibet Plateau Based on Multi-Source Remote Sensing Data.
Remote. Sens., 2018

