Sipeng Zheng

Orcid: 0000-0001-5331-6314

According to our database, Sipeng Zheng authored at least 41 papers between 2019 and 2026.

Collaborative distances:
  • Dijkstra number of five.
  • Erdős number of four.

Bibliography

2026
Being-H0.7: A Latent World-Action Model from Egocentric Videos.
CoRR, May, 2026

Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models.
CoRR, April, 2026

OpenT2M: No-frill Motion Generation with Open-source, Large-scale, High-quality Data.
CoRR, March, 2026

Conservative Offline Robot Policy Learning via Posterior-Transition Reweighting.
CoRR, March, 2026

Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild.
CoRR, February, 2026

Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization.
CoRR, February, 2026

Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization.
CoRR, January, 2026

2025
Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos.
CoRR, December, 2025

Robust Motion Generation using Part-level Reliable Data from Videos.
CoRR, December, 2025

DiG-Flow: Discrepancy-Guided Flow Matching for Robust VLA Models.
CoRR, December, 2025

Being-M0.5: A Real-Time Controllable Vision-Language-Motion Model.
CoRR, August, 2025

Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos.
CoRR, July, 2025

RL from Physical Feedback: Aligning Large Motion Models with Humanoid Control.
CoRR, June, 2025

EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining.
CoRR, March, 2025

Scaling Large Motion Models with Million-Level Human Motions.
Proceedings of the Forty-second International Conference on Machine Learning, 2025

Do Egocentric Video-Language Models Truly Understand Hand-Object Interactions?
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Unified Multimodal Understanding via Byte-Pair Visual Encoding.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

VideoOrion: Tokenizing Object Dynamics in Videos.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

MotionCtrl: A Real-Time Controllable Vision-Language-Motion Model.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning.
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

SPAFormer: Sequential 3D Part Assembly with Transformers.
Proceedings of the International Conference on 3D Vision, 2025

2024
VideoOrion: Tokenizing Object Dynamics in Videos.
CoRR, 2024

Quo Vadis, Motion Generation? From Large Language Models to Large Motion Models.
CoRR, 2024

QuadrupedGPT: Towards a Versatile Quadruped Agent in Open-ended Worlds.
CoRR, 2024

EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions?
CoRR, 2024

LLaMA-Rider: Spurring Large Language Models to Explore the Open World.
Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, 2024

Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

UniCode: Learning a Unified Codebook for Multimodal Large Language Models.
Proceedings of the Computer Vision - ECCV 2024, 2024

2023
No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection.
CoRR, 2023

POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-view World.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

Anchor-Based Detection for Natural Language Localization in Ego-Centric Videos.
Proceedings of the IEEE International Conference on Consumer Electronics, 2023

Open-Category Human-Object Interaction Pre-training via Language Modeling Framework.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Accommodating Audio Modality in CLIP for Multimodal Processing.
Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022
Exploring Anchor-based Detection for Ego4D Natural Language Query.
CoRR, 2022

Few-Shot Action Recognition with Hierarchical Matching and Contrastive Learning.
Proceedings of the Computer Vision - ECCV 2022, 2022

VRDFormer: End-to-End Video Visual Relation Detection with Transformers.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021
MR imaging for the quantitative assessment of brain iron in aceruloplasminemia: A postmortem validation study.
NeuroImage, 2021

2020
Skeleton-Based Interactive Graph Network For Human Object Interaction Detection.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2020

2019
Visual Relation Detection with Multi-Level Attention.
Proceedings of the 27th ACM International Conference on Multimedia, 2019

Relation Understanding in Videos.
Proceedings of the 27th ACM International Conference on Multimedia, 2019

