Di Hu

Proceedings of the 2025 International Conference on Multimedia Retrieval, 2025

Efficient Quantification of Multimodal Interaction at Sample Level.

Zequn Yang

Hongfa Wang

Proceedings of the Forty-second International Conference on Machine Learning, 2025

RollingQ: Reviving the Cooperation Dynamics in Multimodal Transformer.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Phoenix: A Motion-based Self-Reflection Framework for Fine-grained Robotic Action Correction.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Adaptive Unimodal Regulation for Balanced Multimodal Information Acquisition.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Towards Effective and Efficient Continual Pre-training of Large Language Models.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

2024

Geometric-inspired graph-based Incomplete Multi-view Clustering.

Pattern Recognit., March, 2024

Towards accurate knowledge transfer via target-awareness representation disentanglement.

Mach. Learn., February, 2024

YuLan: An Open-source Large Language Model.

CoRR, 2024

Learning Manipulation by Predicting Interaction.

Robotics: Science and Systems XX, 2024

Enhancing Modality Representation and Alignment for Multimodal Cold-start Active Learning.

Meng Shen

Yake Wei

Jianxiong (Terry) Yin

Deepu Rajan

Simon See

Proceedings of the 6th ACM International Conference on Multimedia in Asia, 2024

Unveiling and Mitigating Bias in Audio Visual Segmentation.

Peiwen Sun

Honggang Zhang

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Boosting Audio Visual Question Answering via Key Semantic-Aware Cues.

Guangyao Li

Henghui Du

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Depth Helps: Improving Pre-trained RGB-based Policy with Depth Information Injection.

Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2024

Kinematic-aware Prompting for Generalizable Articulated Object Manipulation with LLMs.

Proceedings of the IEEE International Conference on Robotics and Automation, 2024

MMPareto: Boosting Multimodal Learning with Innocent Unimodal Assistance.

Yake Wei

Proceedings of the Forty-first International Conference on Machine Learning, 2024

Quantifying and Enhancing Multi-modal Robustness with Modality Preference.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Diagnosing and Re-learning for Balanced Multimodal Learning.

Computer Vision - ECCV 2024, 2024

Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes.

Computer Vision - ECCV 2024, 2024

Can Textual Semantics Mitigate Sounding Object Segmentation Preference?

Computer Vision - ECCV 2024, 2024

Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation.

Computer Vision - ECCV 2024, 2024

Enhancing Multimodal Cooperation via Sample-Level Modality Valuation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

KOI: Accelerating Online Imitation Learning via Hybrid Key-state Guidance.

Proceedings of the Conference on Robot Learning, 6-9 November 2024, Munich, Germany, 2024

Play to the Score: Stage-Guided Dynamic Multi-Sensory Fusion for Robotic Manipulation.

Proceedings of the Conference on Robot Learning, 6-9 November 2024, Munich, Germany, 2024

Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer.

Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023

Self-supervised audiovisual representation learning for remote sensing data.

Int. J. Appl. Earth Obs. Geoinformation, February, 2023

Self-Supervised Learning for Heterogeneous Audiovisual Scene Analysis.

IEEE Trans. Multim., 2023

Kinematic-aware Prompting for Generalizable Articulated Object Manipulation with LLMs.

CoRR, 2023

Enhancing Multi-modal Cooperation via Fine-grained Modality Valuation.

CoRR, 2023

Towards Long Form Audio-visual Video Understanding.

CoRR, 2023

Robust Cross-Modal Knowledge Distillation for Unconstrained Videos.

CoRR, 2023

Balanced Audiovisual Dataset for Imbalance Analysis.

CoRR, 2023

Revisiting Pre-training in Audio-Visual Learning.

Ruoxuan Feng

Wenke Xia

CoRR, 2023

TikTalk: A Multi-Modal Dialogue Dataset for Real-World Chitchat.

CoRR, 2023

SeCo: Separating Unknown Musical Visual Sounds with Consistency Guidance.

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023

Exploiting Visual Context Semantics for Sound Source Localization.

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023

TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real World.

Proceedings of the 31st ACM International Conference on Multimedia, 2023

Progressive Spatio-temporal Perception for Audio-Visual Question Answering.

Guangyao Li

Wenxuan Hou

Proceedings of the 31st ACM International Conference on Multimedia, 2023

Multi-Scale Attention for Audio Question Answering.

Guangyao Li

Yixin Xu

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Towards Inadequately Pre-trained Models in Transfer Learning.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

MMCosine: Multi-Modal Cosine Loss Towards Balanced Audio-Visual Fine-Grained Learning.

Proceedings of the IEEE International Conference on Acoustics, 2023

2022

Class-Aware Sounding Objects Localization via Audiovisual Correspondence.

IEEE Trans. Pattern Anal. Mach. Intell., 2022

Learning in Audio-visual Context: A Review, Analysis, and New Perspective.

CoRR, 2022

SeCo: Separating Unknown Musical Visual Sounds with Consistency Guidance.

CoRR, 2022

Inadequately Pre-trained Models are Better Feature Extractors.

CoRR, 2022

Dual Domain-Adversarial Learning for Audio-Visual Saliency Prediction.

HCMA@MM 2022: Proceedings of the 3rd International Workshop on Human-Centric Multimedia Analysis, 2022

Balanced Multimodal Learning via On-the-fly Gradient Modulation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Learning to Answer Questions in Dynamic Audio-Visual Scenarios.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

SepFusion: Finding Optimal Fusion Structures for Visual Sound Separation.

Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022

Visual Sound Localization in the Wild by Cross-Modal Interference Erasing.

Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022

2021

Generalising combinatorial discriminant analysis through conditioning truncated Rayleigh flow.

Knowl. Inf. Syst., 2021

Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation.

Yapeng Tian

Chenliang Xu

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

Unsupervised Multi-Source Domain Adaptation for Person Re-Identification.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

Temporal Relational Modeling with Self-Supervision for Action Segmentation.

Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021

2020

Towards Accurate Knowledge Transfer via Target-awareness Representation Disentanglement.

CoRR, 2020

Cross-Task Transfer for Multimodal Aerial Scene Recognition.

CoRR, 2020

Ambient Sound Helps: Audiovisual Crowd Counting in Extreme Conditions.

CoRR, 2020

Curriculum Audiovisual Learning.

CoRR, 2020

Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching.

Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020

Multiple Sound Sources Localization from Coarse to Fine.

Computer Vision - ECCV 2020, 2020

Cross-Task Transfer for Geotagged Audiovisual Aerial Scene Recognition.

Computer Vision - ECCV 2020, 2020

2019

Discrete Spectral Hashing for Efficient Similarity Retrieval.

IEEE Trans. Image Process., 2019

Dense Multimodal Fusion for Hierarchically Joint Representation.

Proceedings of the IEEE International Conference on Acoustics, 2019

Deep Multimodal Clustering for Unsupervised Audiovisual Learning.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

Listen to the Image.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

2018

Dense Multimodal Fusion for Hierarchically Joint Representation.

CoRR, 2018

Deep LDA Hashing.

CoRR, 2018

Deep Co-Clustering for Unsupervised Audiovisual Learning.

CoRR, 2018

2017

Deep Binary Reconstruction for Cross-modal Hashing.

Proceedings of the 2017 ACM on Multimedia Conference, 2017

Image2song: Song Retrieval via Bridging Image Content and Lyric Words.

Xiaoqiang Lu

Proceedings of the IEEE International Conference on Computer Vision, 2017

Large Graph Hashing with Spectral Rotation.

Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017

2016

Multimodal Learning via Exploring Deep Semantic Similarity.

Xiaoqiang Lu

Proceedings of the 2016 ACM Conference on Multimedia Conference, 2016

Temporal Multimodal Learning in Audiovisual Speech Recognition.
