Jifeng Dai

ORCID: 0000-0002-6785-0785

According to our database, Jifeng Dai authored at least 162 papers between 2011 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
A Survey of Reasoning with Foundation Models: Concepts, Methodologies, and Outlook.
ACM Comput. Surv., November, 2025

Grounding Actions in Camera Space: Observation-Centric Vision-Language-Action Policy.
CoRR, August, 2025

MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents.
CoRR, July, 2025

AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning.
CoRR, July, 2025

Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models.
CoRR, July, 2025

Spatial Frequency Modulation for Semantic Segmentation.
CoRR, July, 2025

InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models.
CoRR, June, 2025

CoMemo: LVLMs Need Image Context with Image Memory.
CoRR, June, 2025

OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis.
CoRR, June, 2025

Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces.
CoRR, June, 2025

ZeroGUI: Automating Online GUI Learning at Zero Human Cost.
CoRR, May, 2025

Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings.
CoRR, May, 2025

Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space.
CoRR, May, 2025

EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning.
CoRR, May, 2025

Demystify Transformers & Convolutions in Modern Image Deep Networks.
IEEE Trans. Pattern Anal. Mach. Intell., April, 2025

VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models.
CoRR, April, 2025

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models.
CoRR, April, 2025

BEVFormer: Learning Bird's-Eye-View Representation From LiDAR-Camera via Spatiotemporal Transformers.
IEEE Trans. Pattern Anal. Mach. Intell., March, 2025

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy.
CoRR, March, 2025

LangBridge: Interpreting Image as a Combination of Language Embeddings.
CoRR, March, 2025

GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing.
CoRR, March, 2025

VisualPRM: An Effective Process Reward Model for Multimodal Reasoning.
CoRR, March, 2025

Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding.
CoRR, January, 2025

Maintaining Structural Integrity in Parameter Spaces for Parameter Efficient Fine-tuning.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Docopilot: Improving Multimodal Models for Document-Level Understanding.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2024
FeatAug-DETR: Enriching One-to-Many Matching for DETRs With Feature Augmentation.
IEEE Trans. Pattern Anal. Mach. Intell., September, 2024

Delving Into the Devils of Bird's-Eye-View Perception: A Review, Evaluation and Recipe.
IEEE Trans. Pattern Anal. Mach. Intell., April, 2024

Mini-InternVL: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance.
Vis. Intell., 2024

V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding.
CoRR, 2024

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling.
CoRR, 2024

HoloDrive: Holistic 2D-3D Multi-Modal Street Scene Generation for Autonomous Driving.
CoRR, 2024

MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost.
CoRR, 2024

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization.
CoRR, 2024

Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance.
CoRR, 2024

Diffusion Transformer Policy.
CoRR, 2024

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation.
CoRR, 2024

big.LITTLE Vision Transformer for Efficient Visual Recognition.
CoRR, 2024

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training.
CoRR, 2024

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models.
CoRR, 2024

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output.
CoRR, 2024

Hierarchical Memory for Long Video QA.
CoRR, 2024

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text.
CoRR, 2024

Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams.
CoRR, 2024

LLMs Meet Multimodal Generation and Editing: A Survey.
CoRR, 2024

FLoRA: Low-Rank Core Space for N-dimension.
CoRR, 2024

MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer.
CoRR, 2024

Effect of a reduced arterial axial pre-stretch ratio during aging on the cardiac output and cerebral blood flow in the healthy elders.
Comput. Methods Programs Biomed., 2024

MMInstruct: a high-quality multi-modal instruction tuning dataset with extensive diversity.
Sci. China Inf. Sci., 2024

How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites.
Sci. China Inf. Sci., 2024

Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling.
Proceedings of the ACM SIGGRAPH 2024 Conference Papers, 2024

Parameter-Inverted Image Pyramid Networks.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Needle In A Multimodal Haystack.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Learning 1D Causal Visual Representation with De-focus Attention Networks.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

DI-MaskDINO: A Joint Object Detection and Instance Segmentation Model.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

CooHOI: Learning Cooperative Human-Object Interaction with Manipulated Object Dynamics.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

Bounding Box Stability against Feature Dropout Reflects Detector Generalization across Environments.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World.
Proceedings of the Computer Vision - ECCV 2024, 2024

ControlLLM: Augment Language Models with Tools by Searching on Graphs.
Proceedings of the Computer Vision - ECCV 2024, 2024

Distilling Knowledge from Large-Scale Image Models for Object Detection.
Proceedings of the Computer Vision - ECCV 2024, 2024

Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-End Oriented Object Detection with Single Point Supervision.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks.
CoRR, 2023

A Survey of Reasoning with Foundation Models.
CoRR, 2023

DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving.
CoRR, 2023

InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation.
CoRR, 2023

Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-end Oriented Object Detection with Single Point Supervision.
CoRR, 2023

ControlLLM: Augment Language Models with Tools by Searching on Graphs.
CoRR, 2023

Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models.
CoRR, 2023

FlowFormer: A Transformer Architecture and Its Masked Cost Volume Autoencoding for Optical Flow.
CoRR, 2023

Denoising Diffusion Semantic Segmentation with Mask Prior Modeling.
CoRR, 2023

Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory.
CoRR, 2023

InternGPT: Solving Vision-Centric Tasks by Interacting with Chatbots Beyond Language.
CoRR, 2023

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

JourneyDB: A Benchmark for Generative Image Understanding.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Vision Transformer Adapter for Dense Predictions.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

VideoFlow: Exploiting Temporal Cues for Multi-frame Optical Flow Estimation.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Learning Weather-General and Weather-Specific Features for Image Restoration Under Multiple Adverse Weather Conditions.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Video Dehazing via a Multi-Range Temporal Alignment Network with Physical Prior.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Siamese Image Modeling for Self-Supervised Vision Representation Learning.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Planning-oriented Autonomous Driving.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Towards All-in-One Pre-Training via Maximizing Multi-Modal Mutual Information.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

2022
Goal-oriented Autonomous Driving.
CoRR, 2022

Demystify Transformers & Convolutions in Modern Image Deep Networks.
CoRR, 2022

Delving into the Devils of Bird's-eye-view Perception: A Review, Evaluation and Recipe.
CoRR, 2022

Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification.
CoRR, 2022

Siamese Image Modeling for Self-Supervised Vision Representation Learning.
CoRR, 2022

ConvMAE: Masked Convolution Meets Masked Autoencoders.
CoRR, 2022

BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers.
CoRR, 2022

Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

MCMAE: Masked Convolution Meets Masked Autoencoders.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Tip-Adapter: Training-Free Adaption of CLIP for Few-Shot Classification.
Proceedings of the Computer Vision - ECCV 2022, 2022

VL-LTR: Learning Class-wise Visual-Linguistic Representation for Long-Tailed Visual Recognition.
Proceedings of the Computer Vision - ECCV 2022, 2022

Frozen CLIP Models are Efficient Video Learners.
Proceedings of the Computer Vision - ECCV 2022, 2022

BEVFormer: Learning Bird's-Eye-View Representation from Multi-camera Images via Spatiotemporal Transformers.
Proceedings of the Computer Vision - ECCV 2022, 2022

FlowFormer: A Transformer Architecture for Optical Flow.
Proceedings of the Computer Vision - ECCV 2022, 2022

Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Exploring the Equivalence of Siamese Self-Supervised Learning via A Unified Gradient Framework.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

AutoLoss-Zero: Searching Loss Functions from Scratch for Generic Tasks.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks.
CoRR, 2021

VL-LTR: Learning Class-wise Visual-Linguistic Representation for Long-Tailed Visual Recognition.
CoRR, 2021

Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling.
CoRR, 2021

Collaborative Visual Navigation.
CoRR, 2021

Scalable Transformers for Neural Machine Translation.
CoRR, 2021

Decoupled Spatial-Temporal Transformer for Video Inpainting.
CoRR, 2021

Searching Parameterized AP Loss for Object Detection.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Deformable DETR: Deformable Transformers for End-to-End Object Detection.
Proceedings of the 9th International Conference on Learning Representations, 2021

Auto Seg-Loss: Searching Metric Surrogates for Semantic Segmentation.
Proceedings of the 9th International Conference on Learning Representations, 2021

Exploring Cross-Image Pixel Contrast for Semantic Segmentation.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Influence Selection for Active Learning.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Fast Convergence of DETR with Spatially Modulated Co-Attention.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Unsupervised Object Detection With LIDAR Clues.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020
1st Place Solution of LVIS Challenge 2020: A Good Box is not a Guarantee of a Good Mask.
CoRR, 2020

VL-BERT: Pre-training of Generic Visual-Linguistic Representations.
Proceedings of the 8th International Conference on Learning Representations, 2020

Deformable Kernels: Adapting Effective Receptive Fields for Object Deformation.
Proceedings of the 8th International Conference on Learning Representations, 2020

Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation.
Proceedings of the Computer Vision - ECCV 2020, 2020

Resolution Adaptive Networks for Efficient Inference.
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

Hierarchical Human Parsing With Typed Part-Relation Reasoning.
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

2019
MMDetection: Open MMLab Detection Toolbox and Benchmark.
CoRR, 2019

An Empirical Study of Spatial Attention Mechanisms in Deep Networks.
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

Deformable ConvNets V2: More Deformable, Better Results.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

2018
Integrated Object Detection and Tracking with Tracklet-Conditioned Detection.
CoRR, 2018

Towards High Performance Video Object Detection for Mobiles.
CoRR, 2018

Learning Region Features for Object Detection.
Proceedings of the Computer Vision - ECCV 2018, 2018

Towards High Performance Video Object Detection.
Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018

Relation Networks for Object Detection.
Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018

2017
Flow-Guided Feature Aggregation for Video Object Detection.
Proceedings of the IEEE International Conference on Computer Vision, 2017

Deformable Convolutional Networks.
Proceedings of the IEEE International Conference on Computer Vision, 2017

Deep Feature Flow for Video Recognition.
Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017

Fully Convolutional Instance-Aware Semantic Segmentation.
Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017

2016
R-FCN: Object Detection via Region-based Fully Convolutional Networks.
Proceedings of the Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, 2016

Instance-Sensitive Fully Convolutional Networks.
Proceedings of the Computer Vision - ECCV 2016, 2016

ScribbleSup: Scribble-Supervised Convolutional Networks for Semantic Segmentation.
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016

Instance-Aware Semantic Segmentation via Multi-task Network Cascades.
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016

2015
Generative Modeling of Convolutional Neural Networks.
Proceedings of the 3rd International Conference on Learning Representations, 2015

BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation.
Proceedings of the 2015 IEEE International Conference on Computer Vision, 2015

Convolutional feature masking for joint object and stuff segmentation.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015

2014
Unsupervised Learning of Dictionaries of Hierarchical Compositional Models.
Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014

2013
Cosegmentation and Cosketch by Unsupervised Learning.
Proceedings of the IEEE International Conference on Computer Vision, 2013

2012
Robust and Efficient Ridge-Based Palmprint Matching.
IEEE Trans. Pattern Anal. Mach. Intell., 2012

Mining sub-categories for object detection.
Proceedings of the 21st International Conference on Pattern Recognition, 2012

2011
Multifeature-Based High-Resolution Palmprint Recognition.
IEEE Trans. Pattern Anal. Mach. Intell., 2011

