Sipeng Zheng

Orcid: 0000-0001-5331-6314

According to our database, Sipeng Zheng authored at least 41 papers between 2019 and 2026.

Collaborative distances:

Dijkstra number of five.
Erdős number of four.

Bibliography

2026

Being-H0.7: A Latent World-Action Model from Egocentric Videos.

CoRR, May, 2026

Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models.

CoRR, April, 2026

OpenT2M: No-frill Motion Generation with Open-source, Large-scale, High-quality Data.

CoRR, March, 2026

Conservative Offline Robot Policy Learning via Posterior-Transition Reweighting.

CoRR, March, 2026

Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild.

CoRR, February, 2026

Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization.

CoRR, February, 2026

Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization.

CoRR, January, 2026

2025

Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos.

CoRR, December, 2025

Robust Motion Generation using Part-level Reliable Data from Videos.

CoRR, December, 2025

DiG-Flow: Discrepancy-Guided Flow Matching for Robust VLA Models.

CoRR, December, 2025

Being-M0.5: A Real-Time Controllable Vision-Language-Motion Model.

CoRR, August, 2025

Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos.

CoRR, July, 2025

RL from Physical Feedback: Aligning Large Motion Models with Humanoid Control.

CoRR, June, 2025

EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

Scaling Large Motion Models with Million-Level Human Motions.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

Do Egocentric Video-Language Models Truly Understand Hand-Object Interactions?

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Unified Multimodal Understanding via Byte-Pair Visual Encoding.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

VideoOrion: Tokenizing Object Dynamics in Videos.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

MotionCtrl: A Real-Time Controllable Vision-Language-Motion Model.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning.

…, …, Börje F. Karlsson, …

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

SPAFormer: Sequential 3D Part Assembly with Transformers.

Proceedings of the International Conference on 3D Vision, 2025

2024

VideoOrion: Tokenizing Object Dynamics in Videos.

CoRR, 2024

Quo Vadis, Motion Generation? From Large Language Models to Large Motion Models.

CoRR, 2024

QuadrupedGPT: Towards a Versatile Quadruped Agent in Open-ended Worlds.

CoRR, 2024

EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions?

CoRR, 2024

LLaMA-Rider: Spurring Large Language Models to Explore the Open World.

Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, 2024

Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

UniCode: Learning a Unified Codebook for Multimodal Large Language Models.

Proceedings of the Computer Vision - ECCV 2024, 2024

2023

No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection.

CoRR, 2023

POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-view World.

Proceedings of the 31st ACM International Conference on Multimedia, 2023

Anchor-Based Detection for Natural Language Localization in Ego-Centric Videos.

…, …, …, Wen-Huang Cheng

Proceedings of the IEEE International Conference on Consumer Electronics, 2023

Open-Category Human-Object Interaction Pre-training via Language Modeling Framework.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Accommodating Audio Modality in CLIP for Multimodal Processing.

Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022

Exploring Anchor-based Detection for Ego4D Natural Language Query.

CoRR, 2022

Few-Shot Action Recognition with Hierarchical Matching and Contrastive Learning.

Proceedings of the Computer Vision - ECCV 2022, 2022

VRDFormer: End-to-End Video Visual Relation Detection with Transformers.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021

MR imaging for the quantitative assessment of brain iron in aceruloplasminemia: A postmortem validation study.

Lena H. P. Vroegindeweij, Piotr Wielopolski, Agnita J. W. Boon, J. H. Paul Wilson, …, …, Sylvestre Bonnet, …, Louise van der Weerd, Juan Antonio Hernández Tamames, Janneke G. Langendonk

NeuroImage, 2021

2020

Skeleton-Based Interactive Graph Network For Human Object Interaction Detection.

Proceedings of the IEEE International Conference on Multimedia and Expo, 2020

2019

Visual Relation Detection with Multi-Level Attention.

Proceedings of the 27th ACM International Conference on Multimedia, 2019

Relation Understanding in Videos.

Proceedings of the 27th ACM International Conference on Multimedia, 2019
