Linjie Li

ORCID: 0009-0005-8582-2218

According to our database¹, Linjie Li authored at least 129 papers between 2016 and 2025.

Collaborative distances:

Dijkstra number² of four.
Erdős number³ of three.

Bibliography

2025

VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation.

CoRR, November, 2025

ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning.

CoRR, October, 2025

Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents.

CoRR, October, 2025

VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents.

CoRR, October, 2025

SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models.

CoRR, October, 2025

InfoAgent: Advancing Autonomous Information-Seeking Agents.

CoRR, September, 2025

EdiVal-Agent: An Object-Centric Framework for Automated, Scalable, Fine-Grained Evaluation of Multi-Turn Editing.

CoRR, September, 2025

STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models.

CoRR, July, 2025

A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning.

CoRR, July, 2025

GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them?

CoRR, July, 2025

Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers.

CoRR, June, 2025

Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models.

CoRR, June, 2025

MoTE: Mixture of Task-specific Experts for Pre-Trained Model-Based Class-incremental Learning.

Linjie Li

Zhenyu Wu

Yang Ji

CoRR, June, 2025

ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs.

CoRR, June, 2025

Audio-Aware Large Language Models as Judges for Speaking Styles.

CoRR, June, 2025

Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations.

CoRR, June, 2025

Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT.

CoRR, May, 2025

Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation.

CoRR, May, 2025

Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning.

CoRR, May, 2025

FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow.

CoRR, May, 2025

OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning.

CoRR, May, 2025

RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning.

CoRR, April, 2025

SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement.

CoRR, April, 2025

V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models.

CoRR, April, 2025

Measurement of LLM's Philosophies of Human Nature.

CoRR, April, 2025

Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising.

CoRR, March, 2025

Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models.

CoRR, March, 2025

ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning.

CoRR, March, 2025

TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation.

CoRR, February, 2025

Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark.

CoRR, January, 2025

MoTE: Mixture of task-specific experts for pre-trained model-based Class-incremental learning.

Linjie Li

Zhenyu Wu

Yang Ji

Knowl. Based Syst., 2025

EmoAssist: Emotional Assistant for Visual Impairment Community.

Proceedings of the International Joint Conference on Neural Networks, 2025

Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

GenXD: Generating Any 3D and 4D Scenes.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

CertainlyUncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness.

Khyathi Raghavi Chandu

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Synthetic Visual Genome.

Khyathi Raghavi Chandu

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

ShowUI: One Vision-Language-Action Model for GUI Visual Agent.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

LiVOS: Light Video Object Segmentation with Gated Linear Matching.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2024

MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities.

Dataset, December, 2024

Multimodal Foundation Models: From Specialists to General-Purpose Assistants.

Found. Trends Comput. Graph. Vis., 2024

An Iterative Resampling Deep Decoupling Domain Adaptation method for class-imbalance bearing fault diagnosis under variant working conditions.

Expert Syst. Appl., 2024

Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension.

CoRR, 2024

ShowUI: One Vision-Language-Action Model for GUI Visual Agent.

CoRR, 2024

MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities.

CoRR, 2024

Certainly Uncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness.

Khyathi Raghavi Chandu

CoRR, 2024

List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs.

CoRR, 2024

Entity6K: A Large Open-Domain Evaluation Dataset for Real-World Entity Recognition.

CoRR, 2024

TaE: Task-aware Expandable Representation for Long Tail Class Incremental Learning.

CoRR, 2024

COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training.

CoRR, 2024

Interfacing Foundation Models' Embeddings.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

VideoGUI: A Benchmark for GUI Automation from Instructional Videos.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

OpenLEAF: A Novel Benchmark for Open-Domain Interleaved Image-Text Generation.

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Bring Metric Functions into Diffusion Models.

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

The Generative AI Paradox: "What It Can Create, It May Not Understand".

Abhilasha Ravichander

Khyathi Raghavi Chandu

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Enhancing Human-to-Robot Skill Transfer: A Framework Integrating Movement and Variable Impedance Based on EMG.

Proceedings of the IEEE International Conference on Industrial Technology, 2024

IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation.

Proceedings of the Computer Vision - ECCV 2024, 2024

Idea2Img: Iterative Self-refinement with GPT-4V for Automatic Image Design and Generation.

Proceedings of the Computer Vision - ECCV 2024, 2024

MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Disco: Disentangled Control for Realistic Human Dance Generation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023

Interfacing Foundation Models' Embeddings.

CoRR, 2023

GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation.

CoRR, 2023

MM-VID: Advancing Video Understanding with GPT-4V(ision).

CoRR, 2023

DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design.

CoRR, 2023

Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation.

CoRR, 2023

OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation.

CoRR, 2023

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision).

CoRR, 2023

Spatial-Frequency U-Net for Denoising Diffusion Probabilistic Models.

CoRR, 2023

DisCo: Disentangled Control for Referring Human Dance Generation in Real World.

CoRR, 2023

Aligning Large Multi-Modal Model with Robust Instruction Tuning.

CoRR, 2023

MultiSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos.

CoRR, 2023

Segment Everything Everywhere All at Once.

CoRR, 2023

NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation.

CoRR, 2023

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action.

CoRR, 2023

Segment Everything Everywhere All at Once.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Learning 3D Photography Videos via Self-supervised Diffusion on Single Images.

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023

Equivariant Similarity for Vision-Language Foundation Models.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

An Empirical Study of Multimodal Model Merging.

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, 2023

Generalized Decoding for Pixel, Image, and Language.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

ReCo: Region-Controlled Text-to-Image Generation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Adaptive Human Matting for Dynamic Videos.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation.

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

2022

Global Profiling of 2-hydroxyisobutyrylome in Common Wheat.

Genom. Proteom. Bioinform., August, 2022

GIT: A Generative Image-to-text Transformer for Vision and Language.

Trans. Mach. Learn. Res., 2022

Vision-Language Pre-Training: Basics, Recent Advances, and Future Trends.

Found. Trends Comput. Graph. Vis., 2022

Cross-modal Representation Learning for Zero-shot Action Recognition.

CoRR, 2022

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone.

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Multiple Z-Complementary Code Sets With Low Inter-Set Cross-Correlation.

Proceedings of the 10th International Workshop on Signal Design and Its Applications in Communications, 2022

Crossmodal Representation Learning for Zero-shot Action Recognition.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

PREVAIL: Pre-trained Variational Adversarial Active Learning for Molecular Property Prediction.

Proceedings of the 8th IEEE International Conference on Cloud Computing and Intelligent Systems, 2022

TaE: Task-Aware Expandable Representation for Long Tail Class Incremental Learning.

Proceedings of the Computer Vision - ACCV 2024, 2022

Playing Lottery Tickets with Vision and Language.

Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022

2021

MLP Architectures for Vision-and-Language Modeling: An Empirical Study.

CoRR, 2021

VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling.

CoRR, 2021

Playing Lottery Tickets with Vision and Language.

CoRR, 2021

Meta Module Network for Compositional Visual Reasoning.

Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2021

VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation.

Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, 2021

LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval.

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021

Adversarial VQA: A New Benchmark for Evaluating the Robustness of VQA Models.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

UC2: Universal Cross-Lingual Cross-Modal Vision-and-Language Pre-Training.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

Less Is More: ClipBERT for Video-and-Language Learning via Sparse Sampling.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020

A Fault Diagnostic Scheme Based on Capsule Network for Rolling Bearing under Different Rotational Speeds.

Linjie Li

Mian Zhang

Kesheng Wang

Sensors, 2020

A Closer Look at the Robustness of Vision-and-Language Pre-trained Models.

Linjie Li

Zhe Gan

Jingjing Liu

CoRR, 2020

Large-Scale Adversarial Training for Vision-and-Language Representation Learning.

Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020

Graph Optimal Transport for Cross-Domain Alignment.

Proceedings of the 37th International Conference on Machine Learning, 2020

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training.

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020

UNITER: UNiversal Image-TExt Representation Learning.

Proceedings of the Computer Vision - ECCV 2020, 2020

Analysis of Vibration Characteristics of Rolling Linear Guides.

Proceedings of the AIAM2020: 2nd International Conference on Artificial Intelligence and Advanced Manufacture, 2020

2019

UNITER: Learning UNiversal Image-TExt Representations.

CoRR, 2019

Configuration Design and Simulation of Novel Petal Tooth Nutation Joint Drive for Robot.

Proceedings of the Intelligent Robotics and Applications - 12th International Conference, 2019

Relation-Aware Graph Attention Network for Visual Question Answering.

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog.

Proceedings of the 57th Conference of the Association for Computational Linguistics, 2019

2017

Learning to see people like people.

CoRR, 2017

Learning to See People like People: Predicting Social Perceptions of Faces.

Proceedings of the 39th Annual Meeting of the Cognitive Science Society, 2017

2016

Understanding human facial attractiveness from multiple views.

Proceedings of the 38th Annual Meeting of the Cognitive Science Society, 2016

Extracting Human Face Similarity Judgments: Pairs or Triplets?

Proceedings of the 38th Annual Meeting of the Cognitive Science Society, 2016
