Zheng Ge

Orcid: 0000-0002-8630-8270

According to our database, Zheng Ge authored at least 62 papers between 2018 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale.
CoRR, August, 2025

Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning.
CoRR, July, 2025

DGAE: Diffusion-Guided Autoencoder for Efficient Latent Representation Learning.
CoRR, June, 2025

Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model.
CoRR, June, 2025

Step1X-Edit: A Practical Framework for General Image Editing.
CoRR, April, 2025

Perception-R1: Pioneering Perception Policy with Reinforcement Learning.
CoRR, April, 2025

Perception in Reflection.
CoRR, April, 2025

M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization?
CoRR, March, 2025

Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model.
CoRR, March, 2025

Unhackable Temporal Rewarding for Scalable Video MLLMs.
CoRR, February, 2025

Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model.
CoRR, February, 2025

PerPO: Perceptual Preference Optimization via Discriminative Rewarding.
CoRR, February, 2025

Taming Teacher Forcing for Masked Autoregressive Video Generation.
CoRR, January, 2025

Unhackable Temporal Reward for Scalable Video MLLMs.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Reconstructive Visual Instruction Tuning.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Taming Teacher Forcing for Masked Autoregressive Video Generation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2024
GroupLane: End-to-End 3D Lane Detection With Channel-Wise Grouping.
IEEE Robotics Autom. Lett., November, 2024

Exploring Recurrent Long-Term Temporal Fusion for Multi-View 3D Perception.
IEEE Robotics Autom. Lett., July, 2024

Fourier-Transform-Based Unmixing Method for Fusion of Multiresolution Satellite Images.
IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens., 2024

Slow Perception: Let's Perceive Geometric Figures Step-by-step.
CoRR, 2024

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model.
CoRR, 2024

DistTrain: Addressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models.
CoRR, 2024

Focus Anywhere for Fine-grained Multi-page Document Understanding.
CoRR, 2024

ShapeLLM: Universal 3D Object Understanding for Embodied Interaction.
CoRR, 2024

Small Language Model Meets with Reinforced Vision Vocabulary.
CoRR, 2024

Self-Supervised Visual Preference Alignment.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

OneChart: Purify the Chart Structural Extraction via One Auxiliary Token.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning.
Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024

DreamLLM: Synergistic Multimodal Comprehension and Creation.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Merlin: Empowering Multimodal LLMs with Foresight Minds.
Proceedings of the Computer Vision - ECCV 2024, 2024

Vary: Scaling up the Vision Vocabulary for Large Vision-Language Model.
Proceedings of the Computer Vision - ECCV 2024, 2024

ShapeLLM: Universal 3D Object Understanding for Embodied Interaction.
Proceedings of the Computer Vision - ECCV 2024, 2024

Align-DETR: Enhancing End-to-end Object Detection with Aligned Loss.
Proceedings of the 35th British Machine Vision Conference, 2024

2023
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models.
CoRR, 2023

GMM: Delving into Gradient Aware and Model Perceive Depth Mining for Monocular 3D Detection.
CoRR, 2023

The 1st-place Solution for CVPR 2023 OpenLane Topology in Autonomous Driving Challenge.
CoRR, 2023

Align-DETR: Improving DETR with Simple IoU-aware BCE loss.
CoRR, 2023

BEVStereo++: Accurate Depth Estimation in Multi-view 3D Object Detection via Dynamic Temporal Stereo.
CoRR, 2023

Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining.
Proceedings of the International Conference on Machine Learning, 2023

Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning?
Proceedings of the Eleventh International Conference on Learning Representations, 2023

MatrixVT: Efficient Multi-Camera to BEV Transformation for 3D Perception.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Implicit Identity Leakage: The Stumbling Block to Improving Deepfake Detection Generalization.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

BEVDepth: Acquisition of Reliable Depth for Multi-View 3D Object Detection.
Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

BEVStereo: Enhancing Depth Estimation in Multi-View 3D Object Detection with Temporal Stereo.
Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022
Towards 3D Object Detection with 2D Supervision.
CoRR, 2022

Towards A Robust Deepfake Detector: Common Artifact Deepfake Detection Model.
CoRR, 2022

BEVStereo: Enhancing Depth Estimation in Multi-view 3D Object Detection with Dynamic Temporal Stereo.
CoRR, 2022

STS: Surround-view Temporal Stereo for Multi-view 3D Detection.
CoRR, 2022

PersDet: Monocular 3D Detection in Perspective Bird's-Eye-View.
CoRR, 2022

Dense Teacher: Dense Pseudo-Labels for Semi-supervised Object Detection.
Proceedings of the Computer Vision - ECCV 2022, 2022

2021
LLA: Loss-aware label assignment for dense pedestrian detection.
Neurocomputing, 2021

Delving deep into the imbalance of positive proposals in two-stage object detection.
Neurocomputing, 2021

Workshop on Autonomous Driving at CVPR 2021: Technical Report for Streaming Perception Challenge.
CoRR, 2021

YOLOX: Exceeding YOLO Series in 2021.
CoRR, 2021

Premium Power Value-Added Service Product Decision-Making Method Based on Multi-Index Two-Sided Matching.
IEEE Access, 2021

OTA: Optimal Transport Assignment for Object Detection.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020
Delving into the Imbalance of Positive Proposals in Two-stage Object Detection.
CoRR, 2020

DualBox: Generating BBox Pair with Strong Correspondence via Occlusion Pattern Clustering and Proposal Refinement.
Proceedings of the 25th International Conference on Pattern Recognition, 2020

PS-RCNN: Detecting Secondary Human Instances in a Crowd via Primary Object Suppression.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2020

NMS by Representative Region: Towards Crowded Pedestrian Detection by Proposal Pairing.
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

2018
Fast Portrait Matting Using Spatial Detail-Preserving Network.
Proceedings of the Neural Information Processing - 25th International Conference, 2018

