Yali Wang

ORCID: 0000-0002-2999-7428

Affiliations:

Chinese Academy of Sciences, Shenzhen Institute of Advanced Technology, Guangdong-Hong Kong-Macao Joint Laboratory of Human-Machine Intelligence-Synergy Systems, China

According to our database, Yali Wang authored at least 128 papers between 2016 and 2026.

Collaborative distances:

Dijkstra number of four.
Erdős number of three.

Bibliography

2026

WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens.

CoRR, May, 2026

Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners.

CoRR, May, 2026

Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning.

CoRR, January, 2026

A renaissance of explicit motion information mining from transformers for action recognition.

Pattern Recognit., 2026

Super encoding network: Recursive association of multi-modal encoders for video understanding.

Pattern Recognit., 2026

VideoTG-R1: Boosting Video Temporal Grounding via Curriculum Reinforcement Learning on Reflected Boundary Annotations.

Proceedings of the 2026 International Conference on Multimedia Retrieval, 2026

VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning.

Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

G-UBS: Towards Robust Understanding of Implicit Feedback via Group-Aware User Behavior Simulation.

Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

When Top-ranked Recommendations Fail: Modeling Multi-Granular Negative Feedback for Explainable and Robust Video Recommendation.

Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

VRAgent-R1: Boosting Video Recommendation with MLLM-based Agents via Reinforcement Learning.

Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

2025

InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision.

CoRR, December, 2025

LvBench: A Benchmark for Long-form Video Understanding with Versatile Multi-modal Question Answering.

Int. J. Comput. Vis., November, 2025

VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning.

CoRR, November, 2025

When Top-ranked Recommendations Fail: Modeling Multi-Granular Negative Feedback for Explainable and Robust Video Recommendation.

CoRR, November, 2025

Weakly Supervised Temporal Sentence Grounding via Positive Sample Mining.

IEEE Trans. Circuits Syst. Video Technol., October, 2025

Guiding Audio-Visual Question Answering with Collective Question Reasoning.

Int. J. Comput. Vis., October, 2025

VideoTG-R1: Boosting Video Temporal Grounding via Curriculum Reinforcement Learning on Reflected Boundary Annotations.

CoRR, October, 2025

UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation.

CoRR, October, 2025

Vinci: A Real-time Smart Assistant Based on Egocentric Vision-language Model for Portable Devices.

Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., September, 2025

G-UBS: Towards Robust Understanding of Implicit Feedback via Group-Aware User Behavior Simulation.

CoRR, August, 2025

WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction.

CoRR, August, 2025

Video-GPT via Next Clip Diffusion.

CoRR, May, 2025

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning.

CoRR, April, 2025

LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents.

CoRR, March, 2025

Get In Video: Add Anything You Want to the Video.

CoRR, March, 2025

An Egocentric Vision-Language Model based Portable Real-time Smart Assistant.

CoRR, March, 2025

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling.

CoRR, January, 2025

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling.

CoRR, January, 2025

Percept, Chat, Adapt: Knowledge transfer of foundation models for open-world video recognition.

Pattern Recognit., 2025

VideoChat: chat-centric video understanding.

Sci. China Inf. Sci., 2025

VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

TimeStep Master: Asymmetrical Mixture of Timestep LoRA Experts for Versatile and Efficient Diffusion Models in Vision.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

WeGen: A Unified Model for Interactive Multimodal Generation as We Chat.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Muses: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration.

Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, 2025

H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving.

Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, 2025

2024

F2S-Net: learning frame-to-segment prediction for online action detection.

Yi Liu

Yu Qiao

Yali Wang

J. Real Time Image Process., May, 2024

CP-Net: Contour-Perturbed Reconstruction Network for Self-Supervised Point Cloud Learning.

IEEE Trans. Multim., 2024

Dual Masked Modeling for Weakly-Supervised Temporal Boundary Discovery.

IEEE Trans. Multim., 2024

Attentive Snippet Prompting for Video Retrieval.

IEEE Trans. Multim., 2024

Progressive Frame-Proposal Mining for Weakly Supervised Video Object Detection.

IEEE Trans. Image Process., 2024

Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model.

CoRR, 2024

EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation.

CoRR, 2024

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text.

CoRR, 2024

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding.

CoRR, 2024

Percept, Chat, and then Adapt: Multimodal Knowledge Transfer of Foundation Models for Open-World Video Recognition.

CoRR, 2024

From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities.

CoRR, 2024

TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

InternVideo2: Scaling Foundation Models for Multimodal Video Understanding.

Proceedings of the Computer Vision - ECCV 2024, 2024

VideoMamba: State Space Model for Efficient Video Understanding.

Proceedings of the Computer Vision - ECCV 2024, 2024

Vlogger: Make Your Dream A Vlog.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

M-BEV: Masked BEV Perception for Robust Autonomous Driving.

Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023

UniFormer: Unifying Convolution and Self-Attention for Visual Recognition.

IEEE Trans. Pattern Anal. Mach. Intell., October, 2023

CP3: Unifying Point Cloud Completion by Pretrain-Prompt-Predict Paradigm.

IEEE Trans. Pattern Anal. Mach. Intell., August, 2023

Hybrid token transformer for deep face recognition.

Pattern Recognit., July, 2023

Towards robustness and generalization of point cloud representation: A geometry coding method and a large-scale object-level dataset.

Comput. Vis. Media, February, 2023

MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding.

CoRR, 2023

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark.

CoRR, 2023

Harvest Video Foundation Models via Efficient Post-Pretraining.

CoRR, 2023

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation.

CoRR, 2023

VideoLLM: Modeling Video Sequence with Large Language Models.

CoRR, 2023

InternGPT: Solving Vision-Centric Tasks by Interacting with Chatbots Beyond Language.

CoRR, 2023

Parameter is Not All You Need: Starting from Non-Parametric Networks for 3D Point Cloud Analysis.

CoRR, 2023

Learning Discriminative Feature Representation for Open Set Action Recognition.

Proceedings of the 31st ACM International Conference on Multimedia, 2023

UniFormerV2: Unlocking the Potential of Image ViTs for Video Understanding.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Unmasked Teacher: Towards Training-Efficient Video Foundation Models.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

HTML: Hybrid Temporal-scale Multimodal Learning Framework for Referring Video Object Segmentation.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Starting from Non-Parametric Networks for 3D Point Cloud Analysis.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling with Informative-Preserved Reconstruction and Self-Distilled Consistency.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

2022

Action Recognition With Motion Diversification and Dynamic Selection.

IEEE Trans. Image Process., 2022

FineAction: A Fine-Grained Video Dataset for Temporal Action Localization.

IEEE Trans. Image Process., 2022

InternVideo: General Video Foundation Models via Generative and Discriminative Learning.

CoRR, 2022

UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer.

CoRR, 2022

InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges.

CoRR, 2022

Low-Resolution Action Recognition for Tiny Actions Challenge.

Boyu Chen

Yu Qiao

Yali Wang

CoRR, 2022

CP3: Unifying Point Cloud Completion by Pretrain-Prompt-Predict Paradigm.

CoRR, 2022

UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning.

CoRR, 2022

Visual Knowledge Graph for Human Action Reasoning in Videos.

Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022

VideoPipe 2022 Challenge: Real-World Video Understanding for Urban Pipe Inspection.

Proceedings of the 26th International Conference on Pattern Recognition, 2022

UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning.

Proceedings of the Tenth International Conference on Learning Representations, 2022

Self-slimmed Vision Transformer.

Proceedings of the Computer Vision - ECCV 2022, 2022

MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning.

Proceedings of the Computer Vision - ECCV 2022, 2022

Target-Relevant Knowledge Preservation for Multi-Source Domain Adaptive Object Detection.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Cross Domain Object Detection by Target-Perceived Dual Branch Distillation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021

Wildfish++: A Comprehensive Fish Benchmark for Multimedia Research.

Peiqin Zhuang

Yali Wang

Yu Qiao

IEEE Trans. Multim., 2021

Learning Dynamical Human-Joint Affinity for 3D Pose Estimation in Videos.

IEEE Trans. Image Process., 2021

Multi-View Partial (MVP) Point Cloud Challenge 2021 on Completion and Registration: Methods and Results.

Francisco Gómez Fernández

Qinlong Wang

Yang Yang

CoRR, 2021

MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video.

CoRR, 2021

FineAction: A Fined Video Dataset for Temporal Action Localization.

CoRR, 2021

CT-Net: Channel Tensorization Network for Video Classification.

Proceedings of the 9th International Conference on Learning Representations, 2021

Digging into Uncertainty in Self-supervised Multi-view Stereo.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

PC-HMR: Pose Calibration for 3D Human Mesh Recovery from 2D Images/Videos.

Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021

2020

Progressive Object Transfer Detection.

IEEE Trans. Image Process., 2020

DID: Disentangling-Imprinting-Distilling for Continuous Low-Shot Detection.

IEEE Trans. Image Process., 2020

Finding hard faces with better proposals and classifier.

Mach. Vis. Appl., 2020

Mining Inter-Video Proposal Relations for Video Object Detection.

Proceedings of the Computer Vision - ECCV 2020, 2020

SmallBigNet: Integrating Core and Contextual Views for Video Classification.

Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

Learning Attentive Pairwise Interaction for Fine-Grained Classification.

Peiqin Zhuang

Yali Wang

Yu Qiao

Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020

Context-Transformer: Tackling Object Confusion for Few-Shot Detection.

Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020

2019

Dual-supervised attention network for deep cross-modal hashing.

Pattern Recognit. Lett., 2019

MetaCleaner: Learning to Hallucinate Clean Representations for Noisy-Labeled Visual Recognition.

Weihe Zhang

Yali Wang

Yu Qiao

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

PA3D: Pose-Action 3D Machine for Video Recognition.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

Adaptive Pyramid Context Network for Semantic Segmentation.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

2018

Recurrent Spatial-Temporal Attention Network for Action Recognition in Videos.

Wenbin Du

Yali Wang

Yu Qiao

IEEE Trans. Image Process., 2018

WildFish: A Large Benchmark for Fish Recognition in the Wild.

Peiqin Zhuang

Yali Wang

Yu Qiao

Proceedings of the 2018 ACM Multimedia Conference on Multimedia Conference, 2018

Temporal Hallucinating for Action Recognition With Few Still Images.

Yali Wang

Lei Zhou

Yu Qiao

Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018

LSTD: A Low-Shot Transfer Detector for Object Detection.

Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018

2017

Weakly Supervised PatchNets: Describing and Aggregating Local Patches for Scene Recognition.

IEEE Trans. Image Process., 2017

Bayesian inference for time-varying applications: Particle-based Gaussian process approaches.

Yali Wang

Brahim Chaib-draa

Neurocomputing, 2017

An online Bayesian filtering framework for Gaussian process regression: Application to global surface temperature analysis.

Yali Wang

Brahim Chaib-draa

Expert Syst. Appl., 2017

RPAN: An End-to-End Recurrent Pose-Attention Network for Action Recognition in Videos.

Wenbin Du

Yali Wang

Yu Qiao

Proceedings of the IEEE International Conference on Computer Vision, 2017

Sparse Deep Transfer Learning for Convolutional Neural Network.

Jiaming Liu

Yali Wang

Yu Qiao

Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017

2016

KNN-based Kalman filter: An efficient and non-stationary method for Gaussian process regression.

Yali Wang

Brahim Chaib-draa

Knowl. Based Syst., 2016

Codebook enhancement of VLAD representation for visual recognition.

Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

Human action recognition with DeepAction Kernel Gaussian Process.

Yali Wang

Lin Li

Yu Qiao

Proceedings of the 2016 International Conference on Advanced Robotics and Mechatronics, 2016
