Xize Cheng

Orcid: 0000-0001-9708-3225

According to our database, Xize Cheng authored at least 62 papers between 2022 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
MARS-Sep: Multimodal-Aligned Reinforced Sound Separation.
CoRR, October, 2025

WavReward: Spoken Dialogue Models With Generalist Reward Evaluators.
CoRR, May, 2025

Unleashing the Power of Natural Audio Featuring Multiple Sound Sources.
CoRR, April, 2025

OmniChat: Enhancing Spoken Dialogue Systems with Scalable Synthetic Data for Diverse Scenarios.
CoRR, January, 2025

Multimodal Conditional Retrieval with High Controllability.
Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, V.2, 2025

GTA: Towards Generative Text-To-Audio Retrieval via Multi-Scale Tokenizer.
Proceedings of the 26th Annual Conference of the International Speech Communication Association, 2025

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

VoxDialogue: Can Spoken Dialogue Systems Understand Information Beyond Words?
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Enhancing Expressive Voice Conversion with Discrete Pitch-Conditioned Flow Matching Model.
Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

Curriculum Learning aided Audio-Visual Speech Recognition with Arbitrary Speaker Number.
Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

VoxpopuliTTS: a large-scale multilingual TTS corpus for zero-shot speech generation.
Proceedings of the 31st International Conference on Computational Linguistics, 2025

Rhythm Controllable and Efficient Zero-Shot Voice Conversion via Shortcut Flow Matching.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

CART: A Generative Cross-Modal Retrieval Framework With Coarse-To-Fine Semantic Modeling.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation via Fine-grained AI Feedback.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

A Wander Through the Multimodal Landscape: Efficient Transfer Learning via Low-rank Sequence Multimodal Adapter.
Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

2024
WavChat: A Survey of Spoken Dialogue Models.
CoRR, 2024

OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup.
CoRR, 2024

MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization.
CoRR, 2024

SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing.
CoRR, 2024

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling.
CoRR, 2024

OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces.
CoRR, 2024

ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling.
CoRR, 2024

ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec.
CoRR, 2024

AudioLCM: Text-to-Audio Generation with Latent Consistency Models.
CoRR, 2024

Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment.
CoRR, 2024

Extending Multi-modal Contrastive Representations.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

SyncTalklip: Highly Synchronized Lip-Readable Speaker Generation with Multi-Task Learning.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

AudioLCM: Efficient and High-Quality Text-to-Audio Generation with Minimal Inference Steps.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

VoiceTuner: Self-Supervised Pre-training and Efficient Fine-tuning For Voice Generation.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Boosting Speech Recognition Robustness to Modality-Distortion with Contrast-Augmented Prompts.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

InstructSpeech: Following Speech Editing Instructions via Large Language Models.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2024

AudioVSR: Enhancing Video Speech Recognition with Audio Data.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Wav2SQL: Direct Generalizable Speech-To-SQL Parsing.
Proceedings of the Findings of the Association for Computational Linguistics, 2024

Uni-Dubbing: Zero-Shot Speech Synthesis from Visual Articulation.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

Rethinking the Multimodal Correlation of Multimodal Sequential Learning via Generalizable Attentional Results Alignment.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

Text-to-Song: Towards Controllable Music Generation Incorporating Vocal and Accompaniment.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation.
Proceedings of the Findings of the Association for Computational Linguistics, 2024

2023
TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation.
CoRR, 2023

Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers.
CoRR, 2023

MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition.
CoRR, 2023

Connecting Multi-modal Contrastive Representations.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Rethinking Missing Modality Learning from a Decoding Perspective.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Exploring Group Video Captioning with Efficient Relational Approximation.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

Weakly-Supervised Spoken Video Grounding via Semantic Interaction Learning.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

Semantic-conditioned Dual Adaptation for Cross-domain Query-based Visual Segmentation.
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023

TAVT: Towards Transferable Audio-Visual Text Generation.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

Contrastive Token-Wise Meta-Learning for Unseen Performer Visual Temporal-Aligned Translation.
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023

AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

2022
Diffusion Denoising Process for Perceptron Bias in Out-of-distribution Detection.
CoRR, 2022

