Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025

Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark.

Yunzhuo Hao

Proceedings of the Forty-second International Conference on Machine Learning, 2025

ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding.

Dinei A. F. Florêncio

Cha Zhang

Proceedings of the Forty-second International Conference on Machine Learning, 2025

EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

GenXD: Generating Any 3D and 4D Scenes.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

SITE: Towards Spatial Intelligence Thorough Evaluation.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

ImageGen-CoT: Enhancing Text-to-Image in-context Learning with Chain-of-Thought Reasoning.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them?

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

Audio-Aware Large Language Models as Judges for Speaking Styles.

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2025, 2025

ShowUI: One Vision-Language-Action Model for GUI Visual Agent.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

LiVOS: Light Video Object Segmentation with Gated Linear Matching.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2024

MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities.

Dataset, December, 2024

Introduction to the Special Issue on AI-Generated Content for Multimedia.

Shengxi Li

Xuelong Li

Leonardo Chiariglione

IEEE Trans. Circuits Syst. Video Technol., August, 2024

Multimodal Foundation Models: From Specialists to General-Purpose Assistants.

Found. Trends Comput. Graph. Vis., 2024

OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation.

CoRR, 2024

ShowUI: One Vision-Language-Action Model for GUI Visual Agent.

CoRR, 2024

SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation.

CoRR, 2024

MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models.

CoRR, 2024

AutoDirector: Online Auto-scheduling Agents for Multi-sensory Composition.

CoRR, 2024

MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities.

CoRR, 2024

List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs.

CoRR, 2024

Entity6K: A Large Open-Domain Evaluation Dataset for Real-World Entity Recognition.

CoRR, 2024

Design2Code: How Far Are We From Automating Front-End Engineering?

CoRR, 2024

StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis.

CoRR, 2024

COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training.

CoRR, 2024

Interfacing Foundation Models' Embeddings.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

VideoGUI: A Benchmark for GUI Automation from Instructional Videos.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

OpenLEAF: A Novel Benchmark for Open-Domain Interleaved Image-Text Generation.

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Bring Metric Functions into Diffusion Models.

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

StrokeNUWA - Tokenizing Strokes for Vector Graphic Synthesis.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation.

Proceedings of the Computer Vision - ECCV 2024, 2024

Idea2Img: Iterative Self-refinement with GPT-4V for Automatic Image Design and Generation.

Proceedings of the Computer Vision - ECCV 2024, 2024

GRiT: A Generative Region-to-Text Transformer for Object Understanding.

Proceedings of the Computer Vision - ECCV 2024, 2024

MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Disco: Disentangled Control for Realistic Human Dance Generation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

SGFormer: Semantic Graph Transformer for Point Cloud-Based 3D Scene Graph Generation.

Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023

TransVG++: End-to-End Visual Grounding With Language Conditioned Vision Transformer.

IEEE Trans. Pattern Anal. Mach. Intell., November, 2023

InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models.

CoRR, 2023

Interfacing Foundation Models' Embeddings.

CoRR, 2023

GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation.

CoRR, 2023

MM-VID: Advancing Video Understanding with GPT-4V(ision).

CoRR, 2023

DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design.

CoRR, 2023

Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation.

CoRR, 2023

OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation.

CoRR, 2023

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision).

CoRR, 2023

Spatial-Frequency U-Net for Denoising Diffusion Probabilistic Models.

CoRR, 2023

DisCo: Disentangled Control for Referring Human Dance Generation in Real World.

CoRR, 2023

MultiSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos.

CoRR, 2023

NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation.

CoRR, 2023

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action.

CoRR, 2023

Revisiting Transformer for Point Cloud-based 3D Scene Graph Generation.

CoRR, 2023

Learning 3D Photography Videos via Self-supervised Diffusion on Single Images.

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023

Prompting GPT-3 To Be Reliable.

Jordan L. Boyd-Graber

Lijuan Wang

Proceedings of the Eleventh International Conference on Learning Representations, 2023

Equivariant Similarity for Vision-Language Foundation Models.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

ReCo: Region-Controlled Text-to-Image Generation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation.

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

2022

GIT: A Generative Image-to-text Transformer for Vision and Language.

Trans. Mach. Learn. Res., 2022

PromptCap: Prompt-Guided Task-Aware Image Captioning.

CoRR, 2022

Cross-modal Contrastive Distillation for Instructional Activity Anticipation.

Proceedings of the 26th International Conference on Pattern Recognition, 2022

Apple Counting Network Before Fruit Thinning Period Based On Dilated Convolution.

Proceedings of the 11th International Conference on Networks, Communication and Computing, 2022

UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling.

Proceedings of the Computer Vision - ECCV 2022, 2022

Scaling Up Vision-Language Pretraining for Image Captioning.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA.

Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022

2021

Grounding-Tracking-Integration.

IEEE Trans. Circuits Syst. Video Technol., 2021

Scaling Up Vision-Language Pre-training for Image Captioning.

CoRR, 2021

Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling.

CoRR, 2021

UFO: A UniFied TransfOrmer for Vision-Language Representation Learning.

CoRR, 2021

SAT: 2D Semantics Assisted Training for 3D Visual Grounding.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

TransVG: End-to-End Visual Grounding with Transformers.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

TAP: Text-Aware Pre-Training for Text-VQA and Text-Caption.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020

Dynamic Context-guided Capsule Network for Multimodal Machine Translation.

Proceedings of the MM '20: The 28th ACM International Conference on Multimedia, 2020

Weakly Supervised Body Part Segmentation with Pose based Part Priors.

Proceedings of the 25th International Conference on Pattern Recognition, 2020

Pose-based Body Language Recognition for Emotion and Psychiatric Symptom Interpretation.

Proceedings of the 25th International Conference on Pattern Recognition, 2020

Improving One-Stage Visual Grounding by Recursive Sub-query Construction.

Proceedings of the Computer Vision - ECCV 2020, 2020

A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation.

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020

2019

Action Recognition With Spatio-Temporal Visual Attention on Skeleton Image Sequences.

IEEE Trans. Circuits Syst. Video Technol., 2019

Grounding-Tracking-Integration.

CoRR, 2019

Weakly Supervised Body Part Parsing with Pose based Part Priors.

CoRR, 2019

Human-Centered Emotion Recognition in Animated GIFs.

Zhengyuan Yang

Yixuan Zhang

Jiebo Luo

Proceedings of the IEEE International Conference on Multimedia and Expo, 2019

A Fast and Accurate One-Stage Approach to Visual Grounding.

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

Attentive Relational Networks for Mapping Images to Scene Graphs.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

2018

End-to-end Multi-Modal Multi-Task Vehicle Control for Self-Driving Cars with Visual Perception.

CoRR, 2018

End-to-end Multi-Modal Multi-Task Vehicle Control for Self-Driving Cars with Visual Perceptions.

Proceedings of the 24th International Conference on Pattern Recognition, 2018

Action Recognition with Visual Attention on Skeleton Images.

Proceedings of the 24th International Conference on Pattern Recognition, 2018

2017

Personalized pose estimation for body language understanding.

Zhengyuan Yang

Jiebo Luo

Proceedings of the 2017 IEEE International Conference on Image Processing, 2017

2015

Curve fitting and optimal interpolation for CNC machining under confined error using quadratic B-splines.

Comput. Aided Des., 2015

Zhengyuan Yang
