Xinfa Zhu

Orcid: 0000-0001-9275-523X

According to our database, Xinfa Zhu authored at least 45 papers between 2022 and 2025.

Collaborative distances:

Dijkstra number of four.
Erdős number of four.

Bibliography

2025

DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching.

CoRR, October, 2025

PodEval: A Multimodal Evaluation Framework for Podcast Audio Generation.

CoRR, October, 2025

Qwen3-Omni Technical Report.

CoRR, September, 2025

MPO: Multidimensional Preference Optimization for Language Model-based Text-to-Speech.

CoRR, September, 2025

Llasa+: Free Lunch for Accelerated and Streaming Llama-Based Speech Synthesis.

CoRR, August, 2025

DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis.

CoRR, July, 2025

U-SAM: An audio language Model for Unified Speech, Audio, and Music Understanding.

CoRR, May, 2025

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens.

CoRR, March, 2025

LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement.

CoRR, March, 2025

Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought.

CoRR, February, 2025

Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis.

CoRR, February, 2025

CosyAudio: Improving Audio Generation with Confidence Scores and Synthetic Captions.

CoRR, January, 2025

OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia.

CoRR, January, 2025

U-SAM: An Audio Language Model for Unified Speech, Audio, and Music Understanding.

Proceedings of the 26th Annual Conference of the International Speech Communication Association, 2025

FlowSE: Efficient and High-Quality Speech Enhancement via Flow Matching.

Proceedings of the 26th Annual Conference of the International Speech Communication Association, 2025

Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR.

Proceedings of the 26th Annual Conference of the International Speech Communication Association, 2025

ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training.

Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

2024

METTS: Multilingual Emotional Text-to-Speech by Cross-Speaker and Cross-Lingual Emotion Transfer.

IEEE ACM Trans. Audio Speech Lang. Process., 2024

U-Style: Cascading U-Nets With Multi-Level Speaker and Style Modeling for Zero-Shot Voice Cloning.

IEEE ACM Trans. Audio Speech Lang. Process., 2024

Autoregressive Speech Synthesis with Next-Distribution Prediction.

Xinfa Zhu, Wenjie Tian, Lei Xie

CoRR, 2024

YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls.

CoRR, 2024

CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion.

CoRR, 2024

The NPU-HWC System for the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge.

CoRR, 2024

Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy.

CoRR, 2024

UniStyle: Unified Style Modeling for Speaking Style Captioning and Stylistic Speech Synthesis.

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Contrastive Context-Speech Pretraining for Expressive Text-to-Speech Synthesis.

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

The NPU-HWC System for the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge.

Proceedings of the 14th IEEE International Symposium on Chinese Spoken Language Processing, 2024

Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

Text-aware and Context-aware Expressive Audiobook Speech Synthesis.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

Boosting Multi-Speaker Expressive Speech Synthesis with Semi-Supervised Contrastive Learning.

Proceedings of the IEEE International Conference on Multimedia and Expo, 2024

SELM: Speech Enhancement using Discrete Tokens and Language Models.

Proceedings of the IEEE International Conference on Acoustics, 2024

Spontts: Modeling and Transferring Spontaneous Style for TTS.

Proceedings of the IEEE International Conference on Acoustics, 2024

2023

DiCLET-TTS: Diffusion Model Based Cross-Lingual Emotion Transfer for Text-to-Speech - A Study Between English and Mandarin.

IEEE ACM Trans. Audio Speech Lang. Process., 2023

Accent-VITS: accent transfer for end-to-end TTS.

CoRR, 2023

SponTTS: modeling and transferring spontaneous style for TTS.

CoRR, 2023

Multi-Speaker Expressive Speech Synthesis via Semi-supervised Contrastive Learning.

CoRR, 2023

Vec-Tok Speech: speech vectorization and tokenization for neural speech generation.

CoRR, 2023

Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling.

Proceedings of the IEEE International Conference on Acoustics, 2023

Zero-Shot Emotion Transfer for Cross-Lingual Speech Synthesis.

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

HIGNN-TTS: Hierarchical Prosody Modeling With Graph Neural Networks for Expressive Long-Form TTS.

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

2022

Cross-Speaker Emotion Transfer Through Information Perturbation in Emotional Speech Synthesis.

IEEE Signal Process. Lett., 2022

The NPU-ASLP System for The ISCSLP 2022 Magichub Code-Swiching ASR Challenge.

CoRR, 2022

The NPU-ASLP System for The ISCSLP 2022 Magichub Code-Swiching ASR Challenge.

Proceedings of the 13th International Symposium on Chinese Spoken Language Processing, 2022
