Tao Jin

Orcid: 0000-0003-3564-1628

Affiliations:
  • Zhejiang University, Hangzhou, China


According to our database, Tao Jin authored at least 74 papers between 2019 and 2025.

Collaborative distances:
  • Dijkstra number of five.
  • Erdős number of four.

Bibliography

2025
TAP: Parameter-efficient Task-Aware Prompting for Adverse Weather Removal.
CoRR, August, 2025

Open-set Cross Modal Generalization via Multimodal Unified Representation.
CoRR, July, 2025

APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization.
CoRR, June, 2025

Vela: Scalable Embeddings with Voice Large Language Models for Multimodal Retrieval.
CoRR, June, 2025

IRBridge: Solving Image Restoration Bridge with Pre-trained Generative Diffusion Models.
CoRR, May, 2025

Observe-R1: Unlocking Reasoning Abilities of MLLMs with Dynamic Progressive Reinforcement Learning.
CoRR, May, 2025

ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting.
CoRR, April, 2025

Unleashing the Power of Natural Audio Featuring Multiple Sound Sources.
CoRR, April, 2025

OmniChat: Enhancing Spoken Dialogue Systems with Scalable Synthetic Data for Diverse Scenarios.
CoRR, January, 2025

Recognize-and-tell: Generating video captions with textual cue in scene.
Expert Syst. Appl., 2025

Omni-Chart-600K: A Comprehensive Dataset of Chart Types for Chart Understanding.
Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, New Mexico, USA, April 29, 2025

Data-Efficiently Learn Large Language Model for Universal 3D Scene Perception.
Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, New Mexico, USA, April 29, 2025

Efficient Prompting for Continual Adaptation to Missing Modalities.
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025

Diff-Prompt: Diffusion-Driven Prompt Generator with Mask Supervision.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Smoothing the Shift: Towards Stable Test-Time Adaptation under Complex Multimodal Noises.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

VoxDialogue: Can Spoken Dialogue Systems Understand Information Beyond Words?
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Curriculum Learning aided Audio-Visual Speech Recognition with Arbitrary Speaker Number.
Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

Towards Transformer-Based Aligned Generation with Self-Coherence Guidance.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Non-Natural Image Understanding with Advancing Frequency-based Vision Encoders.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

ConceptGuard: Continual Personalized Text-to-Image Generation with Forgetting and Confusion Mitigation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis.
Proceedings of the Findings of the Association for Computational Linguistics, 2025

T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation via Fine-grained AI Feedback.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

Speech Watermarking with Discrete Intermediate Representations.
Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

A Wander Through the Multimodal Landscape: Efficient Transfer Learning via Low-rank Sequence Multimodal Adapter.
Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

Bridging the Gap for Test-Time Multimodal Sentiment Analysis.
Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

2024
Multi-Granularity Relational Attention Network for Audio-Visual Question Answering.
IEEE Trans. Circuits Syst. Video Technol., August, 2024

GTADT: Gated tone-sensitive acne grading via augmented domain transfer.
Multim. Tools Appl., 2024

OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup.
CoRR, 2024

EAGER: Two-Stream Generative Recommender with Behavior-Semantic Collaboration.
CoRR, 2024

Extending Multi-modal Contrastive Representations.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

E³: Exploring Embodied Emotion Through A Large-Scale Egocentric Video Dataset.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Action Imitation in Common Action Space for Customized Action Image Synthesis.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Classifier-guided Gradient Modulation for Enhanced Multimodal Learning.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt.
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024

SyncTalklip: Highly Synchronized Lip-Readable Speaker Generation with Multi-Task Learning.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Low-rank Prompt Interaction for Continual Vision-Language Retrieval.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Boosting Speech Recognition Robustness to Modality-Distortion with Contrast-Augmented Prompts.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Calibrating Prompt from History for Continual Vision-Language Retrieval and Grounding.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

EAGER: Two-Stream Generative Recommender with Behavior-Semantic Collaboration.
Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024

Non-confusing Generation of Customized Concepts in Diffusion Models.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

AudioVSR: Enhancing Video Speech Recognition with Audio Data.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

MPOD123: One Image to 3D Content Generation Using Mask-Enhanced Progressive Outline-to-Detail Optimization.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Uni-Dubbing: Zero-Shot Speech Synthesis from Visual Articulation.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

Rethinking the Multimodal Correlation of Multimodal Sequential Learning via Generalizable Attentional Results Alignment.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation.
Proceedings of the Findings of the Association for Computational Linguistics, 2024

2023
Electromagnetic Imaging Boosted Visual Object Recognition Under Difficult Visual Conditions.
IEEE Trans. Geosci. Remote. Sens., 2023

TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation.
CoRR, 2023

Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers.
CoRR, 2023

Extending Multi-modal Contrastive Representations.
CoRR, 2023

MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition.
CoRR, 2023

Rethinking Missing Modality Learning from a Decoding Perspective.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

Exploring Group Video Captioning with Efficient Relational Approximation.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Gloss Attention for Gloss-free Sign Language Translation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

DATE: Domain Adaptive Product Seeker for E-Commerce.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Weakly-Supervised Spoken Video Grounding via Semantic Interaction Learning.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

Semantic-conditioned Dual Adaptation for Cross-domain Query-based Visual Segmentation.
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023

TAVT: Towards Transferable Audio-Visual Text Generation.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

Contrastive Token-Wise Meta-Learning for Unseen Performer Visual Temporal-Aligned Translation.
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023

OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

2022
Interaction augmented transformer with decoupled decoding for video captioning.
Neurocomputing, 2022

MC-SLT: Towards Low-Resource Signer-Adaptive Sign Language Translation.
Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022

Prior Knowledge and Memory Enriched Transformer for Sign Language Translation.
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, 2022

2021
Generalizable Multi-linear Attention Network.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Contrastive Disentangled Meta-Learning for Signer-Independent Sign Language Translation.
Proceedings of the MM '21: ACM Multimedia Conference, Virtual Event, China, October 20, 2021

2020
SBAT: Video Captioning with Sparse Boundary-Aware Transformer.
Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 2020

Dual Low-Rank Multimodal Fusion.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, 2020

2019
Recurrent convolutional video captioning with global and local attention.
Neurocomputing, 2019

Low-Rank HOCA: Efficient High-Order Cross-Modal Attention for Video Captioning.
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019

