Yapeng Tian

CoRR, April, 2026

Mitigating the ID-OOD Tradeoff in Open-Set Test-Time Adaptation.

CoRR, April, 2026

Omni-MMSI: Toward Identity-attributed Social Interaction Understanding.

CoRR, April, 2026

High-Quality Sound Separation Across Diverse Categories via Visually-Guided Generative Modeling.

Int. J. Comput. Vis., March, 2026

A Skill-augmented Agentic Framework and Benchmark for Multi-Video Understanding.

CoRR, March, 2026

ARGaze: Autoregressive Transformers for Online Egocentric Gaze Estimation.

CoRR, February, 2026

Object-WIPER: Training-Free Object and Associated Effect Removal in Videos.

Sayan Nag

Siva Sai Nagender Vasireddy

Kuldeep Kulkarni

CoRR, January, 2026

Modality-Inconsistent Continual Learning of Multimodal Large Language Models.

Trans. Mach. Learn. Res., 2026

A Survey on Foundations and Frontiers of Multimodal Agentic Frameworks: Techniques and Applications.

Trans. Mach. Learn. Res., 2026

Towards Online Multimodal Social Interaction Understanding.

Trans. Mach. Learn. Res., 2026

Touch with Meaning: A Contextual Analysis of Social Touch.

Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems, 2026

Do Audio-Visual Segmentation Models Truly Segment Sounding Objects?

Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

Toward Gaze Target Detection of Young Autistic Children.

Shijian Deng

Erin E. Kosloski

Jia Li

Randi Sierra Sherwood

Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

2025

Explainable AI-Generated Image Detection RewardBench.

CoRR, November, 2025

ANNIE: Be Careful of Your Robots.

CoRR, September, 2025

From Waveforms to Pixels: A Survey on Audio-Visual Segmentation.

Jia Li

CoRR, August, 2025

VRSight: An AI-Driven Scene Description System to Improve Virtual Reality Accessibility for Blind People.

CoRR, August, 2025

AROMA: Mixed-Initiative AI Assistance for Non-Visual Cooking by Grounding Multi-modal Information Between Reality and Videos.

CoRR, July, 2025

AVROBUSTBENCH: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time.

Sarthak Kumar Maharana

CoRR, June, 2025

FreSca: Unveiling the Scaling Space in Diffusion Models.

CoRR, April, 2025

DiffI2I: Efficient Diffusion Model for Image-to-Image Translation.

IEEE Trans. Pattern Anal. Mach. Intell., March, 2025

Towards Online Multi-Modal Social Interaction Understanding.

CoRR, March, 2025

PRVQL: Progressive Knowledge-guided Refinement for Robust Egocentric Visual Query Localization.

CoRR, February, 2025

Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition.

IEEE Trans. Multim., 2025

TP-Blend: Textual-Prompt Attention Pairing for Precise Object-Style Blending in Diffusion Models.

Xin Jin

Yichuan Zhong

Trans. Mach. Learn. Res., 2025

MagicTalk: Implicit and Explicit Correlation Learning for Diffusion-Based Emotional Talking Face Generation.

Comput. Vis. Media, 2025

Joint Co-Speech Gesture and Expressive Talking Face Generation Using Diffusion with Adapters.

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025

AROMA: Mixed-Initiative AI Assistance for Non-Visual Cooking by Grounding Multimodal Information Between Reality and Videos.

Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology, 2025

VRSight: An AI-Driven Scene Description System to Improve Virtual Reality Accessibility for Blind People.

Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology, 2025

Language-Guided Adaptive Vision Token Pruning for Efficient Multimodal Large Language Models.

Omer Faruk Deniz

Tarik Arici

Fatemeh Sheikholeslami

Proceedings of the Advances in Knowledge Discovery and Data Mining, 2025

AV-DiT: Taming Image Diffusion Transformers for Efficient Joint Audio and Video Generation.

Proceedings of the 33rd ACM International Conference on Multimedia, 2025

Introduction to the First Workshop on Vision Foundation Models and Generative AI for Accessibility.

Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV 2025, 2025

SignLLM: Sign Language Production Large Language Models.

Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV 2025, 2025

PRVQL: Progressive Knowledge-Guided Refinement for Robust Egocentric Visual Query Localization.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

ZFusion: Efficient Deep Compositional Zero-Shot Learning for Blind Image Super-Resolution with Generative Diffusion Prior.

Alireza Esmaeilzehi

Hossein Zaredar

Laleh Seyyed-Kalantari

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Diff-SAGe: End-to-End Spatial Audio Generation Using Diffusion Models.

Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing, 2025

SignDiff: Diffusion Model for American Sign Language Production.

Proceedings of the 19th IEEE International Conference on Automatic Face and Gesture Recognition, 2025

Self-Improvement in Multimodal Large Language Models: A Survey.

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2025, 2025

VinTAGe: Joint Video and Text Conditioning for Holistic Audio Generation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Demonstration of VRSight: AI-Driven Real-Time Descriptions to Enhance VR Accessibility for Blind People.

Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, 2025

Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models.

Proceedings of the IEEE International Conference on Big Data, 2025

CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

2024

EgoVSR: Toward High-Quality Egocentric Video Super-Resolution.

IEEE Trans. Circuits Syst. Video Technol., November, 2024

STDAN: Deformable Attention Network for Space-Time Video Super-Resolution.

IEEE Trans. Neural Networks Learn. Syst., August, 2024

Cross Modality Bias in Visual Question Answering: A Causal View With Possible Worlds VQA.

IEEE Trans. Multim., 2024

Audio-Visual Dataset Distillation.

Siva Sai Nagender Vasireddy

Kai Wang

Angelica I. Avilés-Rivero

Trans. Mach. Learn. Res., 2024

STADNet: Spatial-Temporal Attention-Guided Dual-Path Network for cardiac cine MRI super-resolution.

Jing Qin

Medical Image Anal., 2024

Modality-Inconsistent Continual Learning of Multimodal Large Language Models.

CoRR, 2024

Efficient Self-Improvement in Multimodal Large Language Models: A Model-Level Judge-Free Approach.

CoRR, 2024

CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs.

CoRR, 2024

Scaling Concept With Text-Guided Diffusion Models.

CoRR, 2024

CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP.

CoRR, 2024

Semantic Grouping Network for Audio Source Separation.

CoRR, 2024

AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation.

CoRR, 2024

Scaling Diffusion Mamba with Bidirectional SSMs for Efficient Image and Video Generation.

Siva Sai Nagender Vasireddy

CoRR, 2024

SignLLM: Sign Languages Production Large Language Models.

CoRR, 2024

Robust Active Speaker Detection in Noisy Environments.

Chenxu Zhang

Xiaohu Guo

CoRR, 2024

Text-to-Audio Generation Synchronized with Videos.

Jing Shi

CoRR, 2024

Efficiently Leveraging Linguistic Priors for Scene Text Spotting.

Nguyen Nguyen

CoRR, 2024

OSCaR: Object State Captioning and State Change Representation.

CoRR, 2024

LAVSS: Location-Guided Audio-Visual Spatial Audio Separation.

Yuxin Ye

Wenming Yang

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024

CookAR: Affordance Augmentations in Wearable AR to Support Kitchen Tool Interactions for People with Low Vision.

Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, 2024

Continual Audio-Visual Sound Separation.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

OSCaR: Object State Captioning and State Change Representation.

Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, 2024

Towards AI-Powered AR for Enhancing Sports Playability for People with Low Vision: An Exploration of ARSports.

Proceedings of the IEEE International Symposium on Mixed and Augmented Reality Adjunct, 2024

SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering.

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

Towards Efficient Audio-Visual Learners via Empowering Pre-trained Vision Transformers with Cross-Modal Adaptation.

Kai Wang

Dimitrios Hatzinakos

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

T-VSL: Text-Guided Visual Sound Source Localization in Mixtures.

Tanvir Mahmud

Diana Marculescu

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

DiffTED: One-shot Audio-driven TED Talk Video Generation with Diffusion-based Co-speech Gestures.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

SPICA: Interactive Video Content Exploration through Augmented Audio Descriptions for Blind or Low-Vision Viewers.

Proceedings of the CHI Conference on Human Factors in Computing Systems, 2024

MIMOSA: Human-AI Co-Creation of Computational Spatial Audio Effects on Videos.

Proceedings of the 16th Conference on Creativity & Cognition, 2024

Benchmarking and Optimizing Federated Learning with Hardware-related Metrics.

Proceedings of the 35th British Machine Vision Conference, 2024

Language-Guided Joint Audio-Visual Editing via One-Shot Adaptation.

Proceedings of the Computer Vision - ACCV 2024, 2024

High-Quality Visually-Guided Sound Separation from Diverse Categories.

Proceedings of the Computer Vision - ACCV 2024, 2024

2023

Adaptive channel-modulated personalized federated learning for magnetic resonance image reconstruction.

Comput. Biol. Medicine, October, 2023

Meta-Learning-Based Degradation Representation for Blind Super-Resolution.

IEEE Trans. Image Process., 2023

GDSSR: Toward Real-World Ultra-High-Resolution Image Super-Resolution.

Yichen Chi

Wenming Yang

IEEE Signal Process. Lett., 2023

DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation.

CoRR, 2023

Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation.

CoRR, 2023

Neural Acoustic Context Field: Rendering Realistic Room Impulse Response With Neural Fields.

CoRR, 2023

CMRxRecon: An open cardiac MRI dataset for the competition of accelerated image reconstruction.

CoRR, 2023

SignDiff: Learning Diffusion Models for American Sign Language Production.

CoRR, 2023

DAVIS: High-Quality Audio-Visual Separation with Generative Diffusion Models.

CoRR, 2023

Towards Long Form Audio-visual Video Understanding.

CoRR, 2023

Unveiling Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA.

CoRR, 2023

EgoVSR: Towards High-Quality Egocentric Video Super-Resolution.

CoRR, 2023

DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment.

Jing Shi

CoRR, 2023

AV-SAM: Segment Anything Model Meets Audio-Visual Localization and Segmentation.

CoRR, 2023

PEANUT: A Human-AI Collaborative Tool for Annotating Audio-Visual Data.

Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023

Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Dual Arbitrary Scale Super-Resolution for Multi-contrast MRI.

Proceedings of the Medical Image Computing and Computer Assisted Intervention - MICCAI 2023, 2023

Knowledge Distillation based Degradation Estimation for Blind Super-Resolution.

Proceedings of the Eleventh International Conference on Learning Representations, 2023

Basic Binary Convolution Unit for Binarized Image Restoration Network.

Proceedings of the Eleventh International Conference on Learning Representations, 2023

DiffIR: Efficient Diffusion Model for Image Restoration.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Audio-Visual Class-Incremental Learning.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Class-Incremental Grouping Network for Continual Audio-Visual Learning.

Weiguo Pian

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Structured Sparsity Learning for Efficient Video Super-Resolution.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Audio-Visual Grouping Network for Sound Localization from Mixtures.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Egocentric Audio-Visual Object Localization.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Towards Unified, Explainable, and Robust Multisensory Perception.

Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022

Learning in Audio-visual Context: A Review, Analysis, and New Perspective.

CoRR, 2022

Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing.

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

DuDoCAF: Dual-Domain Cross-Attention Fusion with Recurrent Transformer for Fast Multi-contrast MR Imaging.

Proceedings of the Medical Image Computing and Computer Assisted Intervention - MICCAI 2022, 2022

Correspondences for image and video reconstruction.

Xiaoyu Xiang

Proceedings of the Imaging and Multimedia Analytics at the Edge 2022, 2022

Learning Spatio-Temporal Downsampling for Effective Video Upscaling.

Proceedings of the Computer Vision - ECCV 2022, 2022

Learning to Answer Questions in Dynamic Audio-Visual Scenarios.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Transformer-empowered Multi-scale Contextual Matching and Aggregation for Multi-contrast MRI Super-resolution.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Coarse-to-Fine Embedded PatchMatch and Multi-Scale Dynamic Aggregation for Reference-Based Super-resolution.

Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022

Efficient Non-local Contrastive Attention for Image Super-resolution.

Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022

2021

Residual Dense Network for Image Restoration.

IEEE Trans. Pattern Anal. Mach. Intell., 2021

Zooming SlowMo: An Efficient One-Stage Framework for Space-Time Video Super-Resolution.

CoRR, 2021

Video Matting via Consistency-Regularized Graph Neural Networks.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Can Audio-Visual Integration Strengthen Robustness Under Multimodal Attacks?

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation.

Di Hu

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

Space-Time Memory Network for Sounding Object Localization in Videos.

Sizhe Li

Proceedings of the 32nd British Machine Vision Conference 2021, 2021

2020

LCSCNet: Linear Compressing-Based Skip-Connecting Network for Image Super-Resolution.

IEEE Trans. Image Process., 2020

Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing.

Dingzeyu Li

Proceedings of the Computer Vision - ECCV 2020, 2020

Zooming Slow-Mo: Fast and Accurate One-Stage Space-Time Video Super-Resolution.

Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

TDAN: Temporally-Deformable Alignment Network for Video Super-Resolution.

Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

2019

Deep Learning for Single Image Super-Resolution: A Brief Review.

IEEE Trans. Multim., 2019

Deep Audio Prior.
