Rui Shao

Orcid: 0000-0003-0090-9604

Affiliations:

Harbin Institute of Technology (Shenzhen), School of Computer Science and Technology, China
Nanyang Technological University, Singapore (2021 - 2023)
Hong Kong Baptist University, Department of Computer Science, Hong Kong (PhD 2021)

According to our database¹, Rui Shao authored at least 62 papers between 2017 and 2026.

Collaborative distances:

Dijkstra number² of four.
Erdős number³ of four.

Bibliography

2026

From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model.

CoRR, May, 2026

ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation.

CoRR, May, 2026

HATS: Hardness-Aware Trajectory Synthesis for GUI Agents.

CoRR, March, 2026

ΔVLA: Prior-Guided Vision-Language-Action Models via World Knowledge Variation.

CoRR, March, 2026

Multimodal Large Language Models-Enabled UAV Swarm: Towards Efficient and Intelligent Autonomous Aerial Systems.

IEEE Wirel. Commun., February, 2026

Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation.

CoRR, February, 2026

Learning to Accelerate Vision-Language-Action Models through Adaptive Visual Token Caching.

CoRR, February, 2026

Inject Once Survive Later: Backdooring Vision-Language-Action Models to Persist Through Downstream Fine-tuning.

CoRR, February, 2026

PersonalAlign: Hierarchical Implicit Intent Alignment for Personalized GUI Agent with Long-Term User-Centric Records.

CoRR, January, 2026

UniEmo: Unifying Emotional Understanding and Generation With Learnable Expert Queries.

IEEE Trans. Image Process., 2026

H-GAR: A Hierarchical Interaction Framework via Goal-Driven Observation-Action Refinement for Robotic Manipulation.

Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation.

Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

2025

HiconAgent: History Context-aware Policy Optimization for GUI Agents.

CoRR, December, 2025

CAT+: Investigating and Enhancing Audio-Visual Understanding in Large Language Models.

IEEE Trans. Pattern Anal. Mach. Intell., October, 2025

CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification.

CoRR, August, 2025

Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey.

CoRR, August, 2025

DeepFake-Adapter: Dual-Level Adapter for DeepFake Detection.

Int. J. Comput. Vis., June, 2025

Robust Sequential DeepFake Detection.

Rui Shao

Tianxing Wu

Ziwei Liu

Int. J. Comput. Vis., June, 2025

Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills.

CoRR, June, 2025

Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts.

CoRR, June, 2025

DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer.

CoRR, April, 2025

TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs.

CoRR, March, 2025

FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers.

CoRR, January, 2025

CogVLA: Cognition-Aligned Vision-Language-Action Models via Instruction-Driven Routing & Sparsification.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

EmoSym: A Symbiotic Framework for Unified Emotional Understanding and Generation via Latent Reasoning.

Proceedings of the 33rd ACM International Conference on Multimedia, 2025

PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning.

Proceedings of the 33rd ACM International Conference on Multimedia, 2025

STAR: Learning Diverse Robot Skill Abstractions through Rotation-Augmented Vector Quantization.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

Spa-Bench: A Comprehensive Benchmark for Smartphone Agent Evaluation.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

FALCON: Resolving Visual Redundancy and Fragmentation in High-Resolution Multimodal Large Language Models via Visual Registers.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Less is More: Empowering GUI Agent with Context-Aware Simplification.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Spatial-Temporal Graph Diffusion Policy with Kinematic Modeling for Bimanual Robotic Manipulation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

GUI-explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

2024

Detecting and Grounding Multi-Modal Media Manipulation and Beyond.

IEEE Trans. Pattern Anal. Mach. Intell., August, 2024

Federated Generalized Face Presentation Attack Detection.

IEEE Trans. Neural Networks Learn. Syst., January, 2024

Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding.

CoRR, 2024

RoboMP²: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models.

CoRR, 2024

Enhancing the Emotional Generation Capability of Large Language Models via Emotional Chain-of-Thought.

CoRR, 2024

MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

RoboMP2: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios.

Proceedings of the Computer Vision - ECCV 2024, 2024

LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023

Tackling Model Mismatch with Mixup Regulated Test-Time Training.

Proceedings of the 10th IEEE International Conference on Data Science and Advanced Analytics, 2023

Detecting and Grounding Multi-Modal Media Manipulation.

Rui Shao

Tianxing Wu

Ziwei Liu

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

2022

Open-Set Adversarial Defense with Clean-Adversarial Mutual Learning.

Int. J. Comput. Vis., 2022

Mixup for Test-Time Training.

CoRR, 2022

Detecting and Recovering Sequential DeepFake Manipulation.

Rui Shao

Tianxing Wu

Ziwei Liu

Proceedings of the Computer Vision - ECCV 2022, 2022

2021

Focusing on Clinically Interpretable Features: Selective Attention Regularization for Liver Biopsy Image Classification.

Proceedings of the Medical Image Computing and Computer Assisted Intervention - MICCAI 2021 - 24th International Conference, Strasbourg, France, September 27, 2021

Federated Test-Time Adaptive Face Presentation Attack Detection with Dual-Phase Privacy Preservation.

Proceedings of the 16th IEEE International Conference on Automatic Face and Gesture Recognition, 2021

2020

Federated Face Anti-spoofing.
