Xiaojian Ma

ORCID: 0000-0001-5609-3822

Affiliations:
  • State Key Laboratory of General Artificial Intelligence, BIGAI, China


According to our database, Xiaojian Ma authored at least 61 papers between 2018 and 2025.

Bibliography

2025
Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation.
CoRR, July, 2025

LEO-VL: Towards 3D Vision-Language Generalists via Data Scaling with Efficient Representation.
CoRR, June, 2025

From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes.
CoRR, June, 2025

FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation.
CoRR, May, 2025

Iterative Trajectory Exploration for Multimodal Agents.
CoRR, April, 2025

TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials.
CoRR, April, 2025

Building LLM Agents by Incorporating Insights from Computer Systems.
CoRR, April, 2025

JARVIS-1: Open-World Multi-Task Agents With Memory-Augmented Multimodal Language Models.
IEEE Trans. Pattern Anal. Mach. Intell., March, 2025

Fast Visuomotor Policies via Partial Denoising.
CoRR, March, 2025

LongViTU: Instruction Tuning for Long-Form Video Understanding.
CoRR, January, 2025

Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding.
CoRR, January, 2025

Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

GROOT-2: Weakly Supervised Multimodal Instruction Following Agents.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse.
Proceedings of the Findings of the Association for Computational Linguistics, 2025

2024
GROOT-2: Weakly Supervised Multi-Modal Instruction Following Agents.
CoRR, 2024

ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting.
CoRR, 2024

Task-oriented Sequential Grounding in 3D Scenes.
CoRR, 2024

Latent Energy-Based Odyssey: Black-Box Optimization via Expanded Exploration in the Energy-Based Latent Space.
CoRR, 2024

Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting.
CoRR, 2024

VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding.
CoRR, 2024

RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation.
CoRR, 2024

UltraEdit: Instruction-based Fine-Grained Image Editing at Scale.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Multi-modal Situated Reasoning in 3D Scenes.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

MindAgent: Emergent Gaming Interaction.
Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, 2024

An Embodied Generalist Agent in 3D World.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

GROOT: Learning to Follow Instructions by Watching Gameplay Videos.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Unifying 3D Vision-Language Understanding via Promptable Queries.
Proceedings of the Computer Vision - ECCV 2024, 2024

VideoAgent: A Memory-Augmented Multimodal Agent for Video Understanding.
Proceedings of the Computer Vision - ECCV 2024, 2024

CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023
MindAgent: Emergent Gaming Interaction.
CoRR, 2023

Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents.
CoRR, 2023

Learning Energy-Based Prior Model with Diffusion-Amortized MCMC.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Describe, Explain, Plan and Select: Interactive Planning with LLMs Enables Open-World Multi-Task Agents.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

SQA3D: Situated Question Answering in 3D Scenes.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

2022
Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation.
CoRR, 2022

Latent Diffusion Energy-Based Model for Interpretable Text Modeling.
CoRR, 2022

Latent Diffusion Energy-Based Model for Interpretable Text Modelling.
Proceedings of the International Conference on Machine Learning, 2022

RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning.
Proceedings of the Tenth International Conference on Learning Representations, 2022

Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021
HALMA: Humanlike Abstraction Learning Meets Affordance in Rapid Problem Solving.
CoRR, 2021

Unsupervised Foreground Extraction via Deep Region Competition.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Adversarial Option-Aware Hierarchical Imitation Learning.
Proceedings of the 38th International Conference on Machine Learning, 2021

2020
Robust Robotic Pouring using Audition and Haptics.
CoRR, 2020

Robust Robotic Pouring using Audition and Haptics.
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2020

A Mobile Robot Hand-Arm Teleoperation System by Vision and IMU.
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2020

Reinforcement Learning from Imperfect Demonstrations under Soft Expert Guidance.
Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020

Theory-Based Causal Transfer: Integrating Instance-Level Induction and Abstract-Level Structure Learning.
Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020

2019
Making Sense of Audio Vibration for Liquid Height Estimation in Robotic Pouring.
CoRR, 2019

Imitation Learning from Observations by Minimizing Inverse Dynamics Disagreement.
Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, 2019

Making Sense of Audio Vibration for Liquid Height Estimation in Robotic Pouring.
Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2019

PointNetGPD: Detecting Grasp Configurations from Point Sets.
Proceedings of the International Conference on Robotics and Automation, 2019

Vision-based Teleoperation of Shadow Dexterous Hand using End-to-End Deep Neural Network.
Proceedings of the International Conference on Robotics and Automation, 2019

Task Transfer by Preference-Based Cost Learning.
Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019

2018
Learning and Inference Movement with Deep Generative Model.
CoRR, 2018

Adversarial Task Transfer from Preference.
CoRR, 2018

