Peng Gao

ORCID: 0009-0005-7881-712X

Affiliations:
  • Shanghai Artificial Intelligence Laboratory, OpenGVLab, Shanghai, China
  • Chinese University of Hong Kong, Multimedia Lab, Hong Kong (PhD 2021)


According to our database, Peng Gao authored at least 171 papers between 2018 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.


Bibliography

2025
OneVAE: Joint Discrete and Continuous Optimization Helps Discrete Video VAE Train Better.
CoRR, August, 2025

Lumina-mGPT 2.0: Stand-Alone AutoRegressive Image Modeling.
CoRR, July, 2025

Resurrect Mask AutoRegressive Modeling for Efficient and Scalable Image Generation.
CoRR, July, 2025

TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models.
IEEE Trans. Big Data, June, 2025

From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning.
CoRR, April, 2025

TrustGeoGen: Scalable and Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving.
CoRR, April, 2025

VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning.
CoRR, April, 2025

OmniCaptioner: One Captioner to Rule Them All.
CoRR, April, 2025

Lumina-OmniLV: A Unified Multimodal Framework for General Low-Level Vision.
CoRR, April, 2025

LVLM-EHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models.
IEEE Trans. Pattern Anal. Mach. Intell., March, 2025

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework.
CoRR, March, 2025

LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis.
CoRR, March, 2025

TIDE: Temporal-Aware Sparse Autoencoders for Interpretable Diffusion Transformers in Image Generation.
CoRR, March, 2025

MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency.
CoRR, February, 2025

Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT.
CoRR, February, 2025

Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step.
CoRR, January, 2025

IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models.
CoRR, January, 2025

Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models.
CoRR, January, 2025

EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation.
CoRR, January, 2025

3DAxisPrompt: Promoting the 3D grounding and reasoning in GPT-4o.
Neurocomputing, 2025

MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

MMSearch: Unveiling the Potential of Large Models as Multi-modal Search Engines.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Lumina-T2X: Scalable Flow-based Large Diffusion Transformer for Flexible Resolution Generation.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Let's Verify and Reinforce Image Generation Step by Step.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

EfficientQAT: Efficient Quantization-Aware Training for Large Language Models.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding.
Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, 2025

2024
FeatAug-DETR: Enriching One-to-Many Matching for DETRs With Feature Augmentation.
IEEE Trans. Pattern Anal. Mach. Intell., September, 2024

Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking.
Int. J. Comput. Vis., May, 2024

CLIP-Adapter: Better Vision-Language Models with Feature Adapters.
Int. J. Comput. Vis., February, 2024

POS-BERT: Point cloud one-stage BERT pre-training.
Expert Syst. Appl., 2024

TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction.
CoRR, 2024

Customize Your Visual Autoregressive Recipe with Set Autoregressive Modeling.
CoRR, 2024

LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models.
CoRR, 2024

I-Max: Maximize the Resolution Potential of Pre-trained Rectified Flow Transformers with Projected Flow.
CoRR, 2024

UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models.
CoRR, 2024

SKT: Integrating State-Aware Keypoint Trajectories with Vision-Language Models for Robotic Garment Manipulation.
CoRR, 2024

PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions.
CoRR, 2024

MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines.
CoRR, 2024

SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners.
CoRR, 2024

Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining.
CoRR, 2024

AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents.
CoRR, 2024

EfficientQAT: Efficient Quantization-Aware Training for Large Language Models.
CoRR, 2024

MAVIS: Mathematical Visual Instruction Tuning.
CoRR, 2024

VEnhancer: Generative Space-Time Enhancement for Video Generation.
CoRR, 2024

Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT.
CoRR, 2024

Phased Consistency Model.
CoRR, 2024

TerDiT: Ternary Diffusion Models with Transformers.
CoRR, 2024

Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers.
CoRR, 2024

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
CoRR, 2024

Searching a Lightweight Network Architecture for Thermal Infrared Pedestrian Tracking.
CoRR, 2024

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models.
CoRR, 2024

Uni3D-LLM: Unifying Point Cloud Perception, Generation and Editing with Large Language Models.
CoRR, 2024

ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning.
CoRR, 2024

Xiaoqing: A Q&A model for glaucoma based on LLMs.
Comput. Biol. Medicine, 2024

Efficient MAE towards Large-Scale Vision Transformers.
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024

Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Phased Consistency Models.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models.
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2024

Bridging Zero-shot Object Navigation and Foundation Models through Pixel-Guided Navigation Skill.
Proceedings of the IEEE International Conference on Robotics and Automation, 2024

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

InstructSpeech: Following Speech Editing Instructions via Large Language Models.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

LLaMA-Adapter: Efficient Fine-tuning of Large Language Models with Zero-initialized Attention.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Personalize Segment Anything Model with One Shot.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Unleashing the Potentials of Likelihood Composition for Multi-modal Language Models.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

MATHVERSE: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
Proceedings of the Computer Vision - ECCV 2024, 2024

SpatialFormer: Towards Generalizable Vision Transformers with Explicit Spatial Understanding.
Proceedings of the Computer Vision - ECCV 2024, 2024

Any2Point: Empowering Any-Modality Large Models for Efficient 3D Understanding.
Proceedings of the Computer Vision - ECCV 2024, 2024

SPHINX: A Mixer of Weights, Visual Embeddings and Image Scales for Multi-modal Large Language Models.
Proceedings of the Computer Vision - ECCV 2024, 2024

No Time to Train: Empowering Non-Parametric Networks for Few-Shot 3D Scene Segmentation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

OneLLM: One Framework to Align All Modalities with Language.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Digital Life Project: Autonomous 3D Characters with Social Intelligence.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Masked AutoDecoder is Effective Multi-Task Vision Generalist.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

A3VLM: Actionable Articulation-Aware Vision Language Model.
Proceedings of the Conference on Robot Learning, 6-9 November 2024, Munich, Germany., 2024

ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning.
Proceedings of the Findings of the Association for Computational Linguistics, 2024

Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation.
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023
UniFormer: Unifying Convolution and Self-Attention for Visual Recognition.
IEEE Trans. Pattern Anal. Mach. Intell., October, 2023

Hybrid token transformer for deep face recognition.
Pattern Recognit., July, 2023

P2FEViT: Plug-and-Play CNN Feature Embedded Hybrid Vision Transformer for Remote Sensing Image Classification.
Remote. Sens., April, 2023

Object-Centric Masked Image Modeling-Based Self-Supervised Pretraining for Remote Sensing Object Detection.
IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens., 2023

LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding.
CoRR, 2023

A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise.
CoRR, 2023

3DAxiesPrompts: Unleashing the 3D Spatial Task Capabilities of GPT-4V.
CoRR, 2023

ChatIllusion: Efficient-Aligning Interleaved Generation ability with Visual Instruction Model.
CoRR, 2023

SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models.
CoRR, 2023

Improving Compositional Text-to-image Generation with Large Vision-Language Models.
CoRR, 2023

ImageBind-LLM: Multi-modality Instruction Tuning.
CoRR, 2023

Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following.
CoRR, 2023

Less is More: Towards Efficient Few-shot 3D Semantic Segmentation via Training-free Networks.
CoRR, 2023

Tiny LVLM-eHub: Early Multimodal Experiments with Bard.
CoRR, 2023

Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation.
CoRR, 2023

Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model.
CoRR, 2023

Personalize Segment Anything Model with One Shot.
CoRR, 2023

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model.
CoRR, 2023

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention.
CoRR, 2023

Parameter is Not All You Need: Starting from Non-Parametric Networks for 3D Point Cloud Analysis.
CoRR, 2023

Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking.
CoRR, 2023

SUG: Single-dataset Unified Generalization for 3D Point Cloud Classification.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

Hybrid Transformer Network for Change Detection Under Self-Supervised Pretraining.
Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, 2023

Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

SparseMAE: Sparse Training Meets Masked Autoencoders.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Starting from Non-Parametric Networks for 3D Point Cloud Analysis.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Learning 3D Representations from 2D Pre-Trained Models via Image-to-Point Masked Autoencoders.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Prompt, Generate, Then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Stare at What You See: Masked Image Modeling without Reconstruction.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Q-DETR: An Efficient Low-Bit Quantized Detection Transformer.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Resilient Binary Neural Network.
Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022
Consecutive Pre-Training: A Knowledge Transfer Learning Strategy with Relevant Unlabeled Data for Remote Sensing Domain.
Remote. Sens., 2022

Hierarchical Disentangling Network for Building Extraction from Very High Resolution Optical Remote Sensing Imagery.
Remote. Sens., 2022

PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning.
CoRR, 2022

Collaboration of Pre-trained Models Makes Better Few-shot Learner.
CoRR, 2022

Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification.
CoRR, 2022

Consecutive Pretraining: A Knowledge Transfer Learning Strategy with Relevant Unlabeled Data for Remote Sensing Domain.
CoRR, 2022

Illumination Adaptive Transformer.
CoRR, 2022

ConvMAE: Masked Convolution Meets Masked Autoencoders.
CoRR, 2022

POS-BERT: Point Cloud One-Stage BERT Pre-Training.
CoRR, 2022

MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection.
CoRR, 2022

Distillation with Contrast is All You Need for Self-Supervised Point Cloud Representation Learning.
CoRR, 2022

TerViT: An Efficient Ternary Vision Transformer.
CoRR, 2022

UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning.
CoRR, 2022

RestoreDet: Degradation Equivariant Representation for Object Detection in Low Resolution Images.
CoRR, 2022

Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

MCMAE: Masked Convolution Meets Masked Autoencoders.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Adaptive Local Context Embedding for Small Vehicle Detection from Aerial Optical Remote Sensing Images.
Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, 2022

UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning.
Proceedings of the Tenth International Conference on Learning Representations, 2022

Audio-Visual Scene-Aware Dialog and Reasoning Using Audio-Visual Transformers with Joint Student-Teacher Learning.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2022

Tip-Adapter: Training-Free Adaption of CLIP for Few-Shot Classification.
Proceedings of the Computer Vision - ECCV 2022, 2022

IDa-Det: An Information Discrepancy-Aware Distillation for 1-Bit Detectors.
Proceedings of the Computer Vision - ECCV 2022, 2022

Recurrent Bilinear Optimization for Binary Neural Networks.
Proceedings of the Computer Vision - ECCV 2022, 2022

Frozen CLIP Models are Efficient Video Learners.
Proceedings of the Computer Vision - ECCV 2022, 2022

Prototypical Contrast Adaptation for Domain Adaptive Semantic Segmentation.
Proceedings of the Computer Vision - ECCV 2022, 2022

PointCLIP: Point Cloud Understanding by CLIP.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Unleashing the Potential of Vision-Language Models for Long-Tailed Visual Recognition.
Proceedings of the 33rd British Machine Vision Conference 2022, 2022

You Only Need 90K Parameters to Adapt Light: a Light Weight Transformer for Image Enhancement and Exposure Correction.
Proceedings of the 33rd British Machine Vision Conference 2022, 2022

2021
Multi-View Partial (MVP) Point Cloud Challenge 2021 on Completion and Registration: Methods and Results.
CoRR, 2021

A Simple Long-Tailed Recognition Baseline via Vision-Language Model.
CoRR, 2021

Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling.
CoRR, 2021

Oriented Object Detection with Transformer.
CoRR, 2021

Scalable Transformers for Neural Machine Translation.
CoRR, 2021

Container: Context Aggregation Network.
CoRR, 2021

Dual-stream Network for Visual Recognition.
CoRR, 2021

RomeBERT: Robust Training of Multi-Exit BERT.
CoRR, 2021

Dual-stream Network for Visual Recognition.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Container: Context Aggregation Networks.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Dense Contrastive Visual-Linguistic Pretraining.
Proceedings of the 29th ACM International Conference on Multimedia, 2021

Fast Convergence of DETR with Spatially Modulated Co-Attention.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

End-to-End Object Detection with Adaptive Clustering Transformer.
Proceedings of the 32nd British Machine Vision Conference 2021, 2021

Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers.
Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021

2020
End-to-End Object Detection with Adaptive Clustering Transformer.
CoRR, 2020

Multi-Pass Transformer for Machine Translation.
CoRR, 2020

Contrastive Visual-Linguistic Pretraining.
CoRR, 2020

Gradient Regularized Contrastive Learning for Continual Domain Adaptation.
CoRR, 2020

Spatio-Temporal Scene Graphs for Video Dialog.
CoRR, 2020

Character Matters: Video Story Understanding with Character-Aware Relations.
CoRR, 2020

Extreme Low-Light Imaging with Multi-granulation Cooperative Networks.
CoRR, 2020

Multi-Layer Content Interaction Through Quaternion Product for Visual Question Answering.
Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, 2020

Learning Where to Focus for Efficient Video Object Detection.
Proceedings of the Computer Vision - ECCV 2020, 2020

2019
Multi-Modality Latent Interaction Network for Visual Question Answering.
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

Video Object Detection with Locally-Weighted Deformable Neighbors.
Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019

2018
Question-Guided Hybrid Convolution for Visual Question Answering.
Proceedings of the Computer Vision - ECCV 2018, 2018
