Linjie Li

Orcid: 0000-0003-0867-8863

According to our database, Linjie Li authored at least 121 papers between 2016 and 2025.

Bibliography

2025
STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models.
CoRR, July, 2025

A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning.
CoRR, July, 2025

GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them?
CoRR, July, 2025

Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers.
CoRR, June, 2025

Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models.
CoRR, June, 2025

MoTE: Mixture of Task-specific Experts for Pre-Trained Model-Based Class-incremental Learning.
CoRR, June, 2025

ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs.
CoRR, June, 2025

Audio-Aware Large Language Models as Judges for Speaking Styles.
CoRR, June, 2025

Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations.
CoRR, June, 2025

Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT.
CoRR, May, 2025

Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation.
CoRR, May, 2025

Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning.
CoRR, May, 2025

FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow.
CoRR, May, 2025

OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning.
CoRR, May, 2025

RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning.
CoRR, April, 2025

SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement.
CoRR, April, 2025

V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models.
CoRR, April, 2025

Measurement of LLM's Philosophies of Human Nature.
CoRR, April, 2025

Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising.
CoRR, March, 2025

Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models.
CoRR, March, 2025

ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning.
CoRR, March, 2025

EmoAssist: Emotional Assistant for Visual Impairment Community.
CoRR, February, 2025

TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation.
CoRR, February, 2025

Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback.
CoRR, January, 2025

Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark.
CoRR, January, 2025

MoTE: Mixture of task-specific experts for pre-trained model-based Class-incremental learning.
Knowl. Based Syst., 2025

EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

GenXD: Generating Any 3D and 4D Scenes.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

CertainlyUncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Synthetic Visual Genome.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

ShowUI: One Vision-Language-Action Model for GUI Visual Agent.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

LiVOS: Light Video Object Segmentation with Gated Linear Matching.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2024
MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities.
Dataset, December, 2024

Multimodal Foundation Models: From Specialists to General-Purpose Assistants.
Found. Trends Comput. Graph. Vis., 2024

An Iterative Resampling Deep Decoupling Domain Adaptation method for class-imbalance bearing fault diagnosis under variant working conditions.
Expert Syst. Appl., 2024

Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension.
CoRR, 2024

ShowUI: One Vision-Language-Action Model for GUI Visual Agent.
CoRR, 2024

MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities.
CoRR, 2024

Certainly Uncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness.
CoRR, 2024

List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs.
CoRR, 2024

Entity6K: A Large Open-Domain Evaluation Dataset for Real-World Entity Recognition.
CoRR, 2024

TaE: Task-aware Expandable Representation for Long Tail Class Incremental Learning.
CoRR, 2024

COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training.
CoRR, 2024

Interfacing Foundation Models' Embeddings.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

VideoGUI: A Benchmark for GUI Automation from Instructional Videos.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

OpenLEAF: A Novel Benchmark for Open-Domain Interleaved Image-Text Generation.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Bring Metric Functions into Diffusion Models.
Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

The Generative AI Paradox: "What It Can Create, It May Not Understand".
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Enhancing Human-to-Robot Skill Transfer: A Framework Integrating Movement and Variable Impedance Based on EMG.
Proceedings of the IEEE International Conference on Industrial Technology, 2024

IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation.
Proceedings of the Computer Vision - ECCV 2024, 2024

Idea2Img: Iterative Self-refinement with GPT-4V for Automatic Image Design and Generation.
Proceedings of the Computer Vision - ECCV 2024, 2024

MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Disco: Disentangled Control for Realistic Human Dance Generation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023
Interfacing Foundation Models' Embeddings.
CoRR, 2023

GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation.
CoRR, 2023

MM-VID: Advancing Video Understanding with GPT-4V(ision).
CoRR, 2023

DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design.
CoRR, 2023

Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation.
CoRR, 2023

OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation.
CoRR, 2023

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision).
CoRR, 2023

Spatial-Frequency U-Net for Denoising Diffusion Probabilistic Models.
CoRR, 2023

DisCo: Disentangled Control for Referring Human Dance Generation in Real World.
CoRR, 2023

Aligning Large Multi-Modal Model with Robust Instruction Tuning.
CoRR, 2023

MultiSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos.
CoRR, 2023

Segment Everything Everywhere All at Once.
CoRR, 2023

NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation.
CoRR, 2023

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action.
CoRR, 2023

Segment Everything Everywhere All at Once.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Learning 3D Photography Videos via Self-supervised Diffusion on Single Images.
Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023

Equivariant Similarity for Vision-Language Foundation Models.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

An Empirical Study of Multimodal Model Merging.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, 2023

Generalized Decoding for Pixel, Image, and Language.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

ReCo: Region-Controlled Text-to-Image Generation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Adaptive Human Matting for Dynamic Videos.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

2022
Global Profiling of 2-hydroxyisobutyrylome in Common Wheat.
Genom. Proteom. Bioinform., August, 2022

GIT: A Generative Image-to-text Transformer for Vision and Language.
Trans. Mach. Learn. Res., 2022

Vision-Language Pre-Training: Basics, Recent Advances, and Future Trends.
Found. Trends Comput. Graph. Vis., 2022

Cross-modal Representation Learning for Zero-shot Action Recognition.
CoRR, 2022

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Multiple Z-Complementary Code Sets With Low Inter-Set Cross-Correlation.
Proceedings of the 10th International Workshop on Signal Design and Its Applications in Communications, 2022

Crossmodal Representation Learning for Zero-shot Action Recognition.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

PREVAIL: Pre-trained Variational Adversarial Active Learning for Molecular Property Prediction.
Proceedings of the 8th IEEE International Conference on Cloud Computing and Intelligent Systems, 2022

TaE: Task-Aware Expandable Representation for Long Tail Class Incremental Learning.
Proceedings of the Computer Vision - ACCV 2024, 2022

Playing Lottery Tickets with Vision and Language.
Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022

2021
MLP Architectures for Vision-and-Language Modeling: An Empirical Study.
CoRR, 2021

VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling.
CoRR, 2021

Playing Lottery Tickets with Vision and Language.
CoRR, 2021

Meta Module Network for Compositional Visual Reasoning.
Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2021

VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation.
Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, 2021

LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval.
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021

Adversarial VQA: A New Benchmark for Evaluating the Robustness of VQA Models.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

UC2: Universal Cross-Lingual Cross-Modal Vision-and-Language Pre-Training.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

Less Is More: ClipBERT for Video-and-Language Learning via Sparse Sampling.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020
A Fault Diagnostic Scheme Based on Capsule Network for Rolling Bearing under Different Rotational Speeds.
Sensors, 2020

A Closer Look at the Robustness of Vision-and-Language Pre-trained Models.
CoRR, 2020

Large-Scale Adversarial Training for Vision-and-Language Representation Learning.
Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020

Graph Optimal Transport for Cross-Domain Alignment.
Proceedings of the 37th International Conference on Machine Learning, 2020

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training.
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020

UNITER: UNiversal Image-TExt Representation Learning.
Proceedings of the Computer Vision - ECCV 2020, 2020

Analysis of Vibration Characteristics of Rolling Linear Guides.
Proceedings of the AIAM2020: 2nd International Conference on Artificial Intelligence and Advanced Manufacture, 2020

2019
UNITER: Learning UNiversal Image-TExt Representations.
CoRR, 2019

Configuration Design and Simulation of Novel Petal Tooth Nutation Joint Drive for Robot.
Proceedings of the Intelligent Robotics and Applications - 12th International Conference, 2019

Relation-Aware Graph Attention Network for Visual Question Answering.
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog.
Proceedings of the 57th Conference of the Association for Computational Linguistics, 2019

2017
Learning to see people like people.
CoRR, 2017

Learning to See People like People: Predicting Social Perceptions of Faces.
Proceedings of the 39th Annual Meeting of the Cognitive Science Society, 2017

2016
Understanding human facial attractiveness from multiple views.
Proceedings of the 38th Annual Meeting of the Cognitive Science Society, 2016

Extracting Human Face Similarity Judgments: Pairs or Triplets?
Proceedings of the 38th Annual Meeting of the Cognitive Science Society, 2016

