Shuai Bai

ORCID: 0000-0002-6896-8590

According to our database, Shuai Bai authored at least 51 papers between 2017 and 2026.

Collaborative distances:

Dijkstra number of four.
Erdős number of four.

Bibliography

2026

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments.

CoRR, May, 2026

FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies.

CoRR, May, 2026

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents.

CoRR, May, 2026

MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing.

CoRR, May, 2026

Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding.

CoRR, May, 2026

CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing.

CoRR, May, 2026

GenMask: Adapting DiT for Segmentation via Direct Mask Generation.

CoRR, March, 2026

Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos.

CoRR, March, 2026

Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents.

CoRR, February, 2026

Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking.

CoRR, January, 2026

VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models.

CoRR, January, 2026

CloudClaw: A Recoverable Execution Substrate for Multi-Tenant LLM Agents.

Proceedings of the 24th Annual International Conference on Mobile Systems, 2026

2025

VLCache: Computing 2% Vision Tokens and Reusing 98% for Vision-Language Inference.

CoRR, December, 2025

Soft Adaptive Policy Optimization.

CoRR, November, 2025

GD-NeRF: Generative Detail Compensation for One-shot Generalizable Neural Radiance Fields.

ACM Trans. Multim. Comput. Commun. Appl., October, 2025

Revisiting Multimodal Positional Encoding in Vision-Language Models.

CoRR, October, 2025

FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark.

CoRR, September, 2025

Qwen-Image Technical Report.

CoRR, August, 2025

Qwen2.5-Omni Technical Report.

CoRR, March, 2025

Qwen2.5-VL Technical Report.

CoRR, February, 2025

CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control is Easier than You Think.

Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV 2025, 2025

2024

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey.

CoRR, 2024

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution.

CoRR, 2024

Qwen2 Technical Report.

CoRR, 2024

GD^2-NeRF: Generative Detail Compensation via GAN and Diffusion for One-shot Generalizable Neural Radiance Fields.

CoRR, 2024

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models.

Proceedings of the Computer Vision - ECCV 2024, 2024

2023

Qwen Technical Report.

CoRR, 2023

TouchStone: Evaluating Vision-Language Models by Language Models.

CoRR, 2023

Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities.

CoRR, 2023

ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities.

CoRR, 2023

2022

OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models.

CoRR, 2022

Pretrained Diffusion Models for Unified Human Motion Synthesis.

Jianxin Ma

Shuai Bai

Chang Zhou

CoRR, 2022

M6-Fashion: High-Fidelity Multi-modal Image Generation and Editing.

CoRR, 2022

Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework.

CoRR, 2022

OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework.

Proceedings of the International Conference on Machine Learning, 2022

Single Stage Virtual Try-On Via Deformable Attention Flows.

Proceedings of the Computer Vision - ECCV 2022, 2022

2021

Dense Relation Distillation With Context-Aware Aggregation for Few-Shot Object Detection.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

Connecting Language and Vision for Natural Language-Based Vehicle Retrieval.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2021

2020

Multi-Hierarchical Independent Correlation Filters For Visual Tracking.

Proceedings of the IEEE International Conference on Multimedia and Expo, 2020

Class-Wise Dynamic Graph Convolution for Semantic Segmentation.

Proceedings of the Computer Vision - ECCV 2020, 2020

Adaptive Dilated Network With Self-Correction Supervision for Counting.

Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

2019

Toward Robust Online Adaptive Visual Tracking via Pyramidal Features Extraction.

Proceedings of the IEEE International Conference on Multimedia & Expo Workshops, 2019

The Seventh Visual Object Tracking VOT2019 Challenge Results.

Abdelrahman Eldesokey

Rama Krishna Sai Subrahmanyam Gorthi

Alireza Memarmoghadam

Ardhendu Shekhar Tripathi

Arnold W. M. Smeulders

Joni-Kristian Kämäräinen

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshops, 2019

Multi-Camera Vehicle Tracking with Powerful Visual Features and Spatial-Temporal Cue.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019

Traffic Anomaly Detection via Perspective Map based on Spatial-temporal Information Matrix.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019

2018

Multi-hierarchical Independent Correlation Filters for Visual Tracking.

CoRR, 2018

An integrated approach for the energy-efficient driving strategy optimization of multiple trains by considering regenerative braking.

Comput. Ind. Eng., 2018

The Sixth Visual Object Tracking VOT2018 Challenge Results.

Abdelrahman Eldesokey

Gustavo Fernández

Álvaro García-Martín

Álvaro Iglesias-Arias

A. Aydin Alatan

Abel González-García

Alfredo Petrosino

Alireza Memarmoghadam

Andrea Vedaldi

Andrej Muhic

Anfeng He

Arnold W. M. Smeulders

Gorthi R. K. Sai Subrahmanyam

Guilherme Sousa Bastos

Haibin Ling

Hamed Kiani Galoogahi

Jorge Rodríguez Herranz

Mario Edoardo Maresca

Martin Danelljan

Ming-Hsuan Yang

Mohamed H. Abdelpakey

Pablo Vicente-Moñivar

Rama Krishna Sai Subrahmanyam Gorthi

Proceedings of the Computer Vision - ECCV 2018 Workshops, 2018

An Improved GMM-Based Moving Object Detection Method Under Sudden Illumination Change.

Proceedings of the Bio-inspired Computing: Theories and Applications, 2018

2017

A Deep Learning Method to Detect Web Attacks Using a Specially Designed CNN.

Proceedings of the Neural Information Processing - 24th International Conference, 2017
