Linjie Li

This is a disambiguation page: it lists papers by multiple people with the same or a similar name.

Bibliography

2026

Planning with the Views via Scene Self-Exploration.

[DOI]

,

,

,

,

,

,

,

Leonidas J. Guibas

,

,

CoRR, May, 2026

AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration.

[DOI]

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

CoRR, May, 2026

SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects.

[DOI]

,

,

,

,

Kevin Qinghong Lin

,

,

CoRR, May, 2026

Quantum-Gated Task-interaction Knowledge Distillation for Pre-trained Model-based Class-Incremental Learning.

[DOI]

,

,

,

,

CoRR, April, 2026

LDEPrompt: Layer-importance guided Dual Expandable Prompt Pool for Pre-trained Model-based Class-Incremental Learning.

[DOI]

,

,

,

CoRR, April, 2026

FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching.

[DOI]

,

,

,

,

,

,

,

,

,

Alex Jinpeng Wang

CoRR, April, 2026

RAGEN-2: Reasoning Collapse in Agentic RL.

[DOI]

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

CoRR, April, 2026

Gym-V: A Unified Vision Environment System for Agentic Vision Research.

[DOI]

,

,

,

,

,

,

,

,

,

,

,

,

Michael Qizhe Shieh

CoRR, March, 2026

GloSplat: Joint Pose-Appearance Optimization for Faster and More Accurate 3D Reconstruction.

[DOI]

,

,

,

CoRR, March, 2026

RE-TRAC: REcursive TRAjectory Compression for Deep Search Agents.

[DOI]

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

CoRR, February, 2026

AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning.

[DOI]

,

,

,

,

,

,

CoRR, January, 2026

VoMarkSplat: Robust watermarking for 3D Gaussian splatting with patch and multi-convolutional voting.

[DOI]

,

,

,

,

Pattern Recognit. Lett., 2026

Rethinking the refinement stage of 3D object detection: A multi-task learning perspective with Mixture-of-Experts.

[DOI]

,

,

,

,

,

,

J. Vis. Commun. Image Represent., 2026

Iterative mutual voting matching for efficient and accurate Structure-from-Motion.

[DOI]

,

,

,

,

,

,

J. Vis. Commun. Image Represent., 2026

Genetic interactions between bioactive ingredients in traditional Chinese medicine and major depressive disorder, bipolar disorder, and schizophrenia.

[DOI]

,

,

,

,

,

,

,

,

Comput. Biol. Chem., 2026

Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising.

[DOI]

,

,

,

,

,

Chung-Ching Lin

,

,

Gedas Bertasius

,

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2026

Shanks: Simultaneous Hearing and Thinking for Spoken Language Models.

[DOI]

Cheng-Han Chiang

,

,

,

Chung-Ching Lin

,

,

,

,

,

,

Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2026

TextGround4M: A Prompt-Aligned Dataset for Layout-Aware Text Rendering.

[DOI]

,

,

,

,

Alex Jinpeng Wang

Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

2025

ProImage-Bench: Rubric-Based Evaluation for Professional Image Generation.

[DOI]

,

,

,

,

Chung-Ching Lin

,

,

,

,

,

,

,

CoRR, December, 2025

Glance: Accelerating Diffusion Models with 1 Sample.

[DOI]

,

,

,

,

,

,

,

Alex Jinpeng Wang

CoRR, December, 2025

Computer-Use Agents as Judges for Generative User Interface.

[DOI]

Kevin Qinghong Lin

,

,

,

,

,

,

Mike Zheng Shou

CoRR, November, 2025

VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation.

[DOI]

Kevin Qinghong Lin

,

,

,

,

,

,

,

Alex Jinpeng Wang

CoRR, November, 2025

ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning.

[DOI]

,

,

Huichen Will Wang

,

,

Michael Qizhe Shieh

,

,

,

CoRR, October, 2025

Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents.

[DOI]

,

Alex Jinpeng Wang

,

,

,

Mike Zheng Shou

CoRR, October, 2025

VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents.

[DOI]

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

CoRR, October, 2025

InfoAgent: Advancing Autonomous Information-Seeking Agents.

[DOI]

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

CoRR, September, 2025

EdiVal-Agent: An Object-Centric Framework for Automated, Scalable, Fine-Grained Evaluation of Multi-Turn Editing.

[DOI]

,

,

,

,

,

,

,

,

,

,

Chung-Ching Lin

,

,

,

,

,

CoRR, September, 2025

STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models.

[DOI]

Cheng-Han Chiang

,

,

,

Chung-Ching Lin

,

,

,

,

,

,

CoRR, July, 2025

A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning.

[DOI]

,

,

,

,

,

,

,

CoRR, July, 2025

Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers.

[DOI]

,

,

,

,

,

,

,

,

,

,

,

,

,

,

Yi R. (May) Fung

CoRR, June, 2025

Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models.

[DOI]

,

,

,

Róbert Csordás

,

,

,

,

,

,

CoRR, June, 2025

MoTE: Mixture of Task-specific Experts for Pre-Trained ModelBased Class-incremental Learning.

[DOI]

,

,

CoRR, June, 2025

ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs.

[DOI]

,

,

,

,

,

,

,

,

Chung-Ching Lin

,

,

,

,

CoRR, June, 2025

Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations.

[DOI]

,

Mahtab Bigverdi

,

,

,

,

,

,

CoRR, June, 2025

Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT.

[DOI]

,

,

,

,

,

Alex Jinpeng Wang

,

,

CoRR, May, 2025

Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation.

[DOI]

,

,

,

,

CoRR, May, 2025

Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning.

[DOI]

,

,

,

Chung-Ching Lin

,

,

,

CoRR, May, 2025

FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow.

[DOI]

,

Huichen Will Wang

,

,

,

CoRR, May, 2025

OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning.

[DOI]

,

,

,

,

,

,

,

,

,

,

CoRR, May, 2025

RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning.

[DOI]

,

,

,

,

,

,

,

,

Minh Nhat Nguyen

,

,

,

,

,

,

,

,

,

CoRR, April, 2025

V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models.

[DOI]

,

,

,

,

Alex Jinpeng Wang

,

,

,

CoRR, April, 2025

Measurement of LLM's Philosophies of Human Nature.

[DOI]

,

,

,

,

,

Chung-Ching Lin

,

,

,

CoRR, April, 2025

Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models.

[DOI]

Alex Jinpeng Wang

,

,

,

,

CoRR, March, 2025

TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation.

[DOI]

Alex Jinpeng Wang

,

,

,

,

,

,

,

,

,

,

,

CoRR, February, 2025

MoTE: Mixture of task-specific experts for pre-trained model-based Class-incremental learning.

[DOI]

,

,

Knowl. Based Syst., 2025

VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents.

[DOI]

,

,

,

,

,

,

,

,

,

,

,

,

,

,

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement.

[DOI]

,

,

,

,

,

Chung-Ching Lin

,

,

,

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

EmoAssist: Emotional Assistant for Visual Impairment Community.

[DOI]

,

,

,

Proceedings of the International Joint Conference on Neural Networks, 2025

Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback.

[DOI]

,

,

,

,

Proceedings of the Forty-second International Conference on Machine Learning, 2025

Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark.

[DOI]

,

,

Huichen Will Wang

,

,

,

,

Proceedings of the Forty-second International Conference on Machine Learning, 2025

EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing.

[DOI]

,

,

,

,

,

,

,

,

,

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

GenXD: Generating Any 3D and 4D Scenes.

[DOI]

,

Chung-Ching Lin

,

,

,

,

,

,

,

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models.

[DOI]

,

,

,

,

,

,

,

,

,

,

,

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation.

[DOI]

,

,

,

,

,

,

,

Chung-Ching Lin

,

,

,

,

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos.

[DOI]

,

,

,

,

,

,

,

,

,

,

,

William Yang Wang

,

,

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

CertainlyUncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness.

[DOI]

Khyathi Raghavi Chandu

,

,

,

,

,

,

,

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension.

[DOI]

,

,

,

,

,

Chung-Ching Lin

,

,

,

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

ImageGen-CoT: Enhancing Text-to-Image in-context Learning with Chain-of-Thought Reasoning.

[DOI]

,

,

,

,

,

,

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them?

[DOI]

,

,

,

,

,

,

,

,

,

,

,

,

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

Audio-Aware Large Language Models as Judges for Speaking Styles.

[DOI]

Cheng-Han Chiang

,

,

Chung-Ching Lin

,

,

,

,

,

,

,

,

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2025, 2025

Synthetic Visual Genome.

[DOI]

,

,

,

,

,

,

Khyathi Raghavi Chandu

,

,

Norimasa Kobori

,

,

,

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

ShowUI: One Vision-Language-Action Model for GUI Visual Agent.

[DOI]

Kevin Qinghong Lin

,

,

,

,

,

,

Stan Weixian Lei

,

,

Mike Zheng Shou

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

LiVOS: Light Video Object Segmentation with Gated Linear Matching.

[DOI]

,

,

,

,

,

Marc Niethammer

,

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2024

MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities.

[DOI]

,

,

,

,

,

,

Chung-Ching Lin

,

,

,

Dataset, December, 2024

Multimodal Foundation Models: From Specialists to General-Purpose Assistants.

[DOI]

,

,

,

,

,

,

Found. Trends Comput. Graph. Vis., 2024

An Iterative Resampling Deep Decoupling Domain Adaptation method for class-imbalance bearing fault diagnosis under variant working conditions.

[DOI]

,

,

,

,

Expert Syst. Appl., 2024

ShowUI: One Vision-Language-Action Model for GUI Visual Agent.

[DOI]

Kevin Qinghong Lin

,

,

,

,

,

,

,

,

Mike Zheng Shou

CoRR, 2024

MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities.

[DOI]

,

,

,

,

,

,

Chung-Ching Lin

,

,

,

CoRR, 2024

Certainly Uncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness.

[DOI]

Khyathi Raghavi Chandu

,

,

,

,

,

,

,

CoRR, 2024

List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs.

[DOI]

,

,

,

,

,

,

,

,

Julian J. McAuley

,

,

CoRR, 2024

Entity6K: A Large Open-Domain Evaluation Dataset for Real-World Entity Recognition.

[DOI]

,

,

,

,

,

,

Christos Faloutsos

,

,

CoRR, 2024

TaE: Task-aware Expandable Representation for Long Tail Class Incremental Learning.

[DOI]

,

,

,

CoRR, 2024

COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training.

[DOI]

Alex Jinpeng Wang

,

,

Kevin Qinghong Lin

,

,

,

,

,

Mike Zheng Shou

CoRR, 2024

Interfacing Foundation Models' Embeddings.

[DOI]

,

,

,

,

,

,

,

,

,

,

Arul Aravinthan

,

,

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning.

[DOI]

Alex Jinpeng Wang

,

,

,

,

,

Mike Zheng Shou

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

VideoGUI: A Benchmark for GUI Automation from Instructional Videos.

[DOI]

Kevin Qinghong Lin

,

,

,

,

,

,

,

Mike Zheng Shou

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation.

[DOI]

,

,

,

,

,

Chung-Ching Lin

,

David S. Doermann

,

,

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

OpenLEAF: A Novel Benchmark for Open-Domain Interleaved Image-Text Generation.

[DOI]

,

,

,

,

,

,

,

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Bring Metric Functions into Diffusion Models.

[DOI]

,

,

,

,

,

,

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities.

[DOI]

,

,

,

,

,

,

,

Proceedings of the Forty-first International Conference on Machine Learning, 2024

The Generative AI Paradox: "What It Can Create, It May Not Understand".

[DOI]

,

,

,

,

,

,

,

,

Abhilasha Ravichander

,

Khyathi Raghavi Chandu

,

Benjamin Newman

,

,

Allyson Ettinger

,

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning.

[DOI]

,

,

,

,

,

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Enhancing Human-to-Robot Skill Transfer: A Framework Integrating Movement and Variable Impedance Based on EMG.

[DOI]

,

,

,

,

,

Proceedings of the IEEE International Conference on Industrial Technology, 2024

IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation.

[DOI]

,

,

,

Chung-Ching Lin

,

,

,

David S. Doermann

,

,

,

Proceedings of the Computer Vision - ECCV 2024, 2024

Idea2Img: Iterative Self-refinement with GPT-4V for Automatic Image Design and Generation.

[DOI]

,

,

,

,

Chung-Ching Lin

,

,

Proceedings of the Computer Vision - ECCV 2024, 2024

MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning.

[DOI]

,

,

,

,

,

Chung-Ching Lin

,

,

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Disco: Disentangled Control for Realistic Human Dance Generation.

[DOI]

,

,

,

,

Chung-Ching Lin

,

,

,

,

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos.

[DOI]

,

,

,

,

,

,

,

,

,

,

,

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation.

[DOI]

,

,

,

,

,

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023

Interfacing Foundation Models' Embeddings.

[DOI]

,

,

,

,

,

,

,

,

,

Arul Aravinthan

,

,

CoRR, 2023

GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation.

[DOI]

,

,

,

,

,

,

,

,

Julian J. McAuley

,

,

,

CoRR, 2023

MM-VID: Advancing Video Understanding with GPT-4V(ision).

[DOI]

,

,

,

Chung-Ching Lin

,

Ehsan Azarnasab

,

,

,

,

,

,

,

CoRR, 2023

DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design.

[DOI]

,

,

,

,

CoRR, 2023

Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation.

[DOI]

,

,

,

,

Chung-Ching Lin

,

,

CoRR, 2023

OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation.

[DOI]

,

,

,

,

,

,

,

CoRR, 2023

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision).

[DOI]

,

,

,

,

Chung-Ching Lin

,

,

CoRR, 2023

Spatial-Frequency U-Net for Denoising Diffusion Probabilistic Models.

[DOI]

,

,

,

,

,

,

CoRR, 2023

DisCo: Disentangled Control for Referring Human Dance Generation in Real World.

[DOI]

,

,

,

Chung-Ching Lin

,

,

,

,

CoRR, 2023

Aligning Large Multi-Modal Model with Robust Instruction Tuning.

[DOI]

,

,

,

,

,

CoRR, 2023

MultiSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos.

[DOI]

,

,

,

,

,

,

,

,

,

,

,

CoRR, 2023

Segment Everything Everywhere All at Once.

[DOI]

,

,

,

,

,

,

CoRR, 2023

NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation.

[DOI]

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

CoRR, 2023

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action.

[DOI]

,

,

,

,

Ehsan Azarnasab

,

,

,

,

,

CoRR, 2023

Segment Everything Everywhere All at Once.

[DOI]

,

,

,

,

,

,

,

,

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Learning 3D Photography Videos via Self-supervised Diffusion on Single Images.

[DOI]

,

,

,

,

,

,

,

,

,

,

,

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023

Equivariant Similarity for Vision-Language Foundation Models.

[DOI]

,

,

,

Chung-Ching Lin

,

,

,

,

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

An Empirical Study of Multimodal Model Merging.

[DOI]

,

,

,

,

,

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, 2023

Generalized Decoding for Pixel, Image, and Language.

[DOI]

,

,

,

,

,

,

,

,

,

,

,

,

,

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

ReCo: Region-Controlled Text-to-Image Generation.

[DOI]

,

,

,

,

,

,

,

,

,

,

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Adaptive Human Matting for Dynamic Videos.

[DOI]

Chung-Ching Lin

,

,

,

,

,

,

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling.

[DOI]

,

,

,

Chung-Ching Lin

,

,

,

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling.

[DOI]

,

,

,

,

William Yang Wang

,

,

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation.

[DOI]

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

2022

Global Profiling of 2-hydroxyisobutyrylome in Common Wheat.

[DOI]

,

,

,

,

,

,

,

Genom. Proteom. Bioinform., August, 2022

GIT: A Generative Image-to-text Transformer for Vision and Language.

[DOI]

,

,

,

,

,

,

,

,

Trans. Mach. Learn. Res., 2022

Vision-Language Pre-Training: Basics, Recent Advances, and Future Trends.

[DOI]

,

,

,

,

,

Found. Trends Comput. Graph. Vis., 2022

Cross-modal Representation Learning for Zero-shot Action Recognition.

[DOI]

Chung-Ching Lin

,

,

,

,

CoRR, 2022

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone.

[DOI]

,

Aishwarya Kamath

,

,

Pengchuan Zhang

,

,

,

,

,

,

,

,

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Multiple Z-Complementary Code Sets With Low Inter-Set Cross-Correlation.

[DOI]

,

,

,

Proceedings of the 10th International Workshop on Signal Design and Its Applications in Communications, 2022

Crossmodal Representation Learning for Zero-shot Action Recognition.

[DOI]

Chung-Ching Lin

,

,

,

,

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning.

[DOI]

,

,

Chung-Ching Lin

,

,

,

,

,

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

PREVAIL: Pre-trained Variational Adversarial Active Learning for Molecular Property Prediction.

[DOI]

,

,

,

Proceedings of the 8th IEEE International Conference on Cloud Computing and Intelligent Systems, 2022

TaE: Task-Aware Expandable Representation for Long Tail Class Incremental Learning.

[DOI]

,

,

,

Proceedings of the Computer Vision - ACCV 2024, 2022

Playing Lottery Tickets with Vision and Language.

[DOI]

,

,

,

,

,

,

,

,

Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022

2021

MLP Architectures for Vision-and-Language Modeling: An Empirical Study.

[DOI]

,

,

,

,

,

,

,

,

CoRR, 2021

VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling.

[DOI]

,

,

,

,

William Yang Wang

,

,

CoRR, 2021

Playing Lottery Tickets with Vision and Language.

[DOI]

,

,

,

,

,

,

CoRR, 2021

Meta Module Network for Compositional Visual Reasoning.

[DOI]

,

,

,

,

William Yang Wang

,

Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2021

VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation.

[DOI]

,

,

,

,

,

,

,

,

,

William Yang Wang

,

,

,

,

,

Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, 2021

LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval.

[DOI]

,

,

,

,

,

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021

Adversarial VQA: A New Benchmark for Evaluating the Robustness of VQA Models.

[DOI]

,

,

,

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

UC2: Universal Cross-Lingual Cross-Modal Vision-and-Language Pre-Training.

[DOI]

,

,

,

,

,

,

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

Less Is More: ClipBERT for Video-and-Language Learning via Sparse Sampling.

[DOI]

,

,

,

,

,

,

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020

A Fault Diagnostic Scheme Based on Capsule Network for Rolling Bearing under Different Rotational Speeds.

[DOI]

,

,

Sensors, 2020

A Closer Look at the Robustness of Vision-and-Language Pre-trained Models.

[DOI]

,

,

CoRR, 2020

Large-Scale Adversarial Training for Vision-and-Language Representation Learning.

[DOI]

,

,

,

,

,

Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020

Graph Optimal Transport for Cross-Domain Alignment.

[DOI]

,

,

,

,

,

Proceedings of the 37th International Conference on Machine Learning, 2020

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training.

[DOI]

,

,

,

,

,

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020

UNITER: UNiversal Image-TExt Representation Learning.

[DOI]

,

,

,

,

,

,

,

Proceedings of the Computer Vision - ECCV 2020, 2020

Analysis of Vibration Characteristics of Rolling Linear Guides.

[DOI]

,

,

,

,

,

Proceedings of the AIAM2020: 2nd International Conference on Artificial Intelligence and Advanced Manufacture, 2020

2019

UNITER: Learning UNiversal Image-TExt Representations.

[DOI]

,

,

,

,

,

,

,

CoRR, 2019

Configuration Design and Simulation of Novel Petal Tooth Nutation Joint Drive for Robot.

[DOI]

,

,

,

Proceedings of the Intelligent Robotics and Applications - 12th International Conference, 2019

Relation-Aware Graph Attention Network for Visual Question Answering.

[DOI]

,

,

,

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog.

[DOI]

,

,

,

,

,

Proceedings of the 57th Conference of the Association for Computational Linguistics, 2019

2017

Learning to see people like people.

[DOI]

,

,

,

Garrison W. Cottrell

CoRR, 2017

Learning to See People like People: Predicting Social Perceptions of Faces.

[DOI]

,

,

,

Proceedings of the 39th Annual Meeting of the Cognitive Science Society, 2017

2016

Understanding human facial attractiveness from multiple views.

[DOI]

,

,

Vicente L. Malave

,

,

Proceedings of the 38th Annual Meeting of the Cognitive Science Society, 2016

Extracting Human Face Similarity Judgments: Pairs or Triplets?

[DOI]

,

Vicente L. Malave

,

,

Proceedings of the 38th Annual Meeting of the Cognitive Science Society, 2016
