Shuai Bai

Orcid: 0000-0002-6896-8590

According to our database, Shuai Bai authored at least 51 papers between 2017 and 2026.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2026
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments.
CoRR, May, 2026

FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies.
CoRR, May, 2026

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents.
CoRR, May, 2026

MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing.
CoRR, May, 2026

Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding.
CoRR, May, 2026

CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing.
CoRR, May, 2026

GenMask: Adapting DiT for Segmentation via Direct Mask Generation.
CoRR, March, 2026

Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos.
CoRR, March, 2026

Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents.
CoRR, February, 2026

Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking.
CoRR, January, 2026

VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models.
CoRR, January, 2026

CloudClaw: A Recoverable Execution Substrate for Multi-Tenant LLM Agents.
Proceedings of the 24th Annual International Conference on Mobile Systems, 2026

2025
VLCache: Computing 2% Vision Tokens and Reusing 98% for Vision-Language Inference.
CoRR, December, 2025

Soft Adaptive Policy Optimization.
CoRR, November, 2025

GD-NeRF: Generative Detail Compensation for One-shot Generalizable Neural Radiance Fields.
ACM Trans. Multim. Comput. Commun. Appl., October, 2025

Revisiting Multimodal Positional Encoding in Vision-Language Models.
CoRR, October, 2025

FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark.
CoRR, September, 2025

Qwen-Image Technical Report.
CoRR, August, 2025

Qwen2.5-Omni Technical Report.
CoRR, March, 2025

Qwen2.5-VL Technical Report.
CoRR, February, 2025

CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control is Easier than You Think.
Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV 2025, 2025

2024
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey.
CoRR, 2024

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution.
CoRR, 2024

Qwen2 Technical Report.
CoRR, 2024

GD^2-NeRF: Generative Detail Compensation via GAN and Diffusion for One-shot Generalizable Neural Radiance Fields.
CoRR, 2024

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models.
Proceedings of the Computer Vision - ECCV 2024, 2024

2023
Qwen Technical Report.
CoRR, 2023

TouchStone: Evaluating Vision-Language Models by Language Models.
CoRR, 2023

Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities.
CoRR, 2023

ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities.
CoRR, 2023

2022
OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models.
CoRR, 2022

Pretrained Diffusion Models for Unified Human Motion Synthesis.
CoRR, 2022

M6-Fashion: High-Fidelity Multi-modal Image Generation and Editing.
CoRR, 2022

Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework.
CoRR, 2022

OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework.
Proceedings of the International Conference on Machine Learning, 2022

Single Stage Virtual Try-On Via Deformable Attention Flows.
Proceedings of the Computer Vision - ECCV 2022, 2022

2021
Dense Relation Distillation With Context-Aware Aggregation for Few-Shot Object Detection.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

Connecting Language and Vision for Natural Language-Based Vehicle Retrieval.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2021

2020
Multi-Hierarchical Independent Correlation Filters For Visual Tracking.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2020

Class-Wise Dynamic Graph Convolution for Semantic Segmentation.
Proceedings of the Computer Vision - ECCV 2020, 2020

Adaptive Dilated Network With Self-Correction Supervision for Counting.
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

2019
Toward Robust Online Adaptive Visual Tracking via Pyramidal Features Extraction.
Proceedings of the IEEE International Conference on Multimedia & Expo Workshops, 2019

The Seventh Visual Object Tracking VOT2019 Challenge Results.
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshops, 2019

Multi-Camera Vehicle Tracking with Powerful Visual Features and Spatial-Temporal Cue.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019

Traffic Anomaly Detection via Perspective Map based on Spatial-temporal Information Matrix.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019

2018
Multi-hierarchical Independent Correlation Filters for Visual Tracking.
CoRR, 2018

An integrated approach for the energy-efficient driving strategy optimization of multiple trains by considering regenerative braking.
Comput. Ind. Eng., 2018

The Sixth Visual Object Tracking VOT2018 Challenge Results.
Proceedings of the Computer Vision - ECCV 2018 Workshops, 2018

An Improved GMM-Based Moving Object Detection Method Under Sudden Illumination Change.
Proceedings of the Bio-inspired Computing: Theories and Applications, 2018

2017
A Deep Learning Method to Detect Web Attacks Using a Specially Designed CNN.
Proceedings of the Neural Information Processing - 24th International Conference, 2017

