Ranjay Krishna

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

MedBLINK: Probing Basic Perception in Multimodal Language Models for Medicine.

Mahtab Bigverdi

Pavan Kumar Anasosalu Vasu

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Agonistic Image Generation: Unsettling the Hegemony of Intention.

Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, 2025

Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency.

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2025, 2025

LATTE: Learning to Think with Vision Specialists.

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

Semantic and Expressive Variations in Image Captions Across Languages.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

RealEdit: Reddit Edits As a Large-scale Empirical Dataset for Image Transformations.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Synthetic Visual Genome.

Khyathi Raghavi Chandu

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

NVILA: Efficient Frontier Visual Language Models.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Perception Tokens Enhance Visual Reasoning in Multimodal Language Models.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

One Diffusion to Generate Them All.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Improving Interpersonal Communication by Simulating Audiences with Large Language Models.

Proceedings of the 47th Annual Meeting of the Cognitive Science Society, 2025

Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

2024

The One RING: a Robotic Indoor Navigation Generalist.

CoRR, 2024

Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming.

CoRR, 2024

SAT: Spatial Aptitude Training for Multimodal Language Models.

CoRR, 2024

ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models.

CoRR, 2024

TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action.

CoRR, 2024

NVILA: Efficient Frontier Visual Language Models.

CoRR, 2024

Negative Token Merging: Image-based Adversarial Feature Guidance.

CoRR, 2024

Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment.

CoRR, 2024

Language Model Preference Evaluation with Multiple Weak Evaluators.

CoRR, 2024

ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition.

CoRR, 2024

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models.

CoRR, 2024

Self-Enhancing Video Data Management System for Compositional Events with Large Language Models [Technical Report].

CoRR, 2024

Coarse Correspondence Elicit 3D Spacetime Understanding in Multimodal Language Model.

CoRR, 2024

Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions.

CoRR, 2024

Multilingual Diversity Improves Vision-Language Representations.

CoRR, 2024

EVE: Enabling Anyone to Train Robot using Augmented Reality.

CoRR, 2024

Training Language Model Agents without Modifying Language Models.

CoRR, 2024

Scaling Up LLM Reviews for Google Ads Content Moderation.

Proceedings of the 17th ACM International Conference on Web Search and Data Mining, 2024

EVE: Enabling Anyone to Train Robots using Augmented Reality.

Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, 2024

THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation.

Proceedings of the Robotics: Science and Systems XX, 2024

Task Me Anything.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

Multilingual Diversity Improves Vision-Language Representations.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

Offline Training of Language Model Agents with Functions as Learnable Weights.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

Selective Visual Representations Improve Convergence and Generalization for Embodied AI.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

ImageInWords: Unlocking Hyper-Detailed Image Descriptions.

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps.

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning.

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision.

Proceedings of the Computer Vision - ECCV 2024, 2024

m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks.

Proceedings of the Computer Vision - ECCV 2024, 2024

Efficient Inference of Vision Instruction-Following Models with Elastic Cache.

Proceedings of the Computer Vision - ECCV 2024, 2024

The Hard Positive Truth About Vision-Language Compositionality.

Proceedings of the Computer Vision - ECCV 2024, 2024

BLINK: Multimodal Large Language Models Can See but Not Perceive.

Proceedings of the Computer Vision - ECCV 2024, 2024

VIDEOSHOP: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion.

Xiang Fan

Anand Bhattad

Proceedings of the Computer Vision - ECCV 2024, 2024

Iterated Learning Improves Compositionality in Large Vision-Language Models.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Holodeck: Language Guided Generation of 3D Embodied AI Environments.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Modeling Collaborator: Enabling Subjective Vision Classification with Minimal Human Effort via LLM Tool-Use.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos.

Mehmet Saygin Seyfioglu

Fatemeh Ghezloo

Krishnamurthy Viswanathan

Linda G. Shapiro

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

MIMIC: Masked Image Modeling with Image Correspondences.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models.

Yushi Hu

Otilia Stretcu

Chun-Ta Lu

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

SPOC: Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

RoboPoint: A Vision-Language Model for Spatial Affordance Prediction in Robotics.

Adithyavairavan Murali

Arsalan Mousavian

Dieter Fox

Proceedings of the Conference on Robot Learning, 6-9 November 2024, Munich, Germany, 2024

I Can Tell What I am Doing: Toward Real-World Natural Language Grounding of Robot Experiences.

Proceedings of the Conference on Robot Learning, 6-9 November 2024, Munich, Germany, 2024

Manipulate-Anything: Automating Real-World Robots using Vision-Language Models.

Proceedings of the Conference on Robot Learning, 6-9 November 2024, Munich, Germany, 2024

Found in the middle: Calibrating Positional Attention Bias Improves Long Context Utilization.

Proceedings of the Findings of the Association for Computational Linguistics, 2024

2023

Guest Editorial: Introduction to the Special Section on Graphs in Vision and Pattern Analysis.

IEEE Trans. Pattern Anal. Mach. Intell., June, 2023

Explanations Can Reduce Overreliance on AI Systems During Decision-Making.

Helena Vasconcelos

Matthew Jörke

Tobias Gerstenberg

Michael S. Bernstein

Proc. ACM Hum. Comput. Interact., April, 2023

EQUI-VOCAL: Synthesizing Queries for Compositional Video Events from Limited User Interactions.

Proc. VLDB Endow., 2023

EQUI-VOCAL Demonstration: Synthesizing Video Queries from User Interactions.

Proc. VLDB Endow., 2023

VOCALExplore: Pay-as-You-Go Video Data Exploration and Model Building.

Proc. VLDB Endow., 2023

Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World.

CoRR, 2023

Lasagna: Layered Score Distillation for Disentangled Object Relighting.

CoRR, 2023

Improving Interpersonal Communication by Simulating Audiences with Language Models.

CoRR, 2023

Cultural and Linguistic Diversity Improves Visual Representations.

CoRR, 2023

EcoAssistant: Using LLM Assistant More Affordably and Accurately.

Jieyu Zhang

Ahmed Hassan Awadallah

Chi Wang

CoRR, 2023

Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models.

CoRR, 2023

MIMIC: Masked Image Modeling with Image Correspondences.

CoRR, 2023

COLA: How to adapt vision-language models to Compose Objects Localized with Attributes?

CoRR, 2023

EQUI-VOCAL: Synthesizing Queries for Compositional Video Events from Limited User Interactions [Technical Report].

CoRR, 2023

Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Cola: A Benchmark for Compositional Text-to-image Retrieval.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

OBJECT 3DIT: Language-guided 3D-aware Image Editing.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Quilt-1M: One Million Image-Text Pairs for Histopathology.

Mehmet Saygin Seyfioglu

Fatemeh Ghezloo

Dylan Stefan Chan Geva

Fatwir Sheikh Mohammed

Pavan Kumar Anand

Krishnamurthy Viswanathan

Linda G. Shapiro

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

DataComp: In search of the next generation of multimodal datasets.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Agile Modeling: From Concept to Classifier in Minutes.

Otilia Stretcu

Edward Vendrow

Kenji Hata

MohammadHossein Bateni

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

CREPE: Can Vision-Language Foundation Models Reason Compositionally?

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

AR2-D2: Training a Robot Without a Robot.

Proceedings of the Conference on Robot Learning, 2023

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes.

Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023

2022

AGQA 2.0: An Updated Benchmark for Compositional Spatio-Temporal Reasoning.

Maneesh Agrawala

CoRR, 2022

ELIGN: Expectation Alignment as a Multi-Agent Intrinsic Reward.

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Measuring Compositional Consistency for Video Question Answering.

Mona Gandhi

Mustafa Omer Gul

Eva Prakash

Maneesh Agrawala

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

VOCAL: Video Organization and Interactive Compositional AnaLytics.

Proceedings of the 12th Conference on Innovative Data Systems Research, 2022

2021

Visual intelligence through human learning.

PhD thesis, 2021

Visual Intelligence through Human Interaction.

CoRR, 2021

On the Opportunities and Risks of Foundation Models.

et al.

CoRR, 2021

AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning.

Maneesh Agrawala

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

Mind Your Outliers! Investigating the Negative Impact of Outliers on Active Learning for Visual Question Answering.

Siddharth Karamcheti

Li Fei-Fei

Christopher D. Manning

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021

2020

Conceptual Metaphors Impact Perceptions of Human-AI Collaboration.

Proc. ACM Hum. Comput. Interact., 2020

Action Genome: Actions As Compositions of Spatio-Temporal Scene Graphs.

Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

Determining Question-Answer Plausibility in Crowdsourced Datasets Using Multi-Task Learning.

Proceedings of the Sixth Workshop on Noisy User-generated Text, 2020

2019

Action Genome: Actions as Composition of Spatio-temporal Scene Graphs.

CoRR, 2019

Deep Bayesian Active Learning for Multiple Correct Outputs.

CoRR, 2019

HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models.

Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, 2019

HYPE: Human-eYe Perceptual Evaluation of Generative Models.

Proceedings of the Deep Generative Models for Highly Structured Data, 2019

Visual Relationships as Functions: Enabling Few-Shot Scene Graph Prediction.

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshops, 2019

Scene Graph Prediction with Limited Labels.

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshops, 2019

AI-Based Request Augmentation to Increase Crowdsourcing Participation.

Proceedings of the Seventh AAAI Conference on Human Computation and Crowdsourcing, 2019

Information Maximizing Visual Question Generation.

Snehalkumar (Neil) S. Gaikwad

Michael S. Bernstein

Li Fei-Fei

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

Eevee: Transforming Images by Bridging High-level Goals and Low-level Edit Operations.

Proceedings of the Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, 2019

2018

The ActivityNet Large-Scale Activity Recognition Challenge 2018 Summary.

CoRR, 2018

Engagement Learning: Expanding Visual Knowledge by Engaging Online Participants.

Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology Adjunct Proceedings, 2018

Referring Relationships.

Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018

2017

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations.

Int. J. Comput. Vis., 2017

ActivityNet Challenge 2017 Summary.

CoRR, 2017

Crowd Research: Open and Scalable University Laboratories.

Rajan Vaish

Geza Kovacs

Andreas Veit