Xie Chen

ORCID: 0000-0001-7423-617X

Affiliations:
  • Shanghai Jiao Tong University, China
  • Microsoft, Redmond, WA, USA (former)
  • University of Cambridge, UK (former)


According to our database, Xie Chen authored at least 142 papers between 2011 and 2025.

Bibliography

2025
MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows.
CoRR, August, 2025

FISHER: A Foundation Model for Multi-Modal Industrial Signal Comprehensive Representation.
CoRR, July, 2025

Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy.
CoRR, June, 2025

CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate.
CoRR, June, 2025

NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025.
CoRR, June, 2025

MagiCodec: Simple Masked Gaussian-Injected Codec for High-Fidelity Reconstruction and Generation.
CoRR, June, 2025

Towards General Discrete Speech Codec for Complex Acoustic Environments: A Study of Reconstruction and Downstream Task Consistency.
CoRR, May, 2025

VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining.
CoRR, May, 2025

Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling.
CoRR, May, 2025

Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment.
CoRR, May, 2025

Unlocking Temporal Flexibility: Neural Speech Codec with Variable Frame Rate.
CoRR, May, 2025

MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix.
CoRR, May, 2025

Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation.
CoRR, May, 2025

MER 2025: When Affective Computing Meets Large Language Models.
CoRR, April, 2025

EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting.
CoRR, April, 2025

Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis.
CoRR, April, 2025

YuE: Scaling Open Foundation Models for Long-Form Music Generation.
CoRR, March, 2025

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens.
CoRR, March, 2025

URO-Bench: A Comprehensive Benchmark for End-to-End Spoken Dialogue Models.
CoRR, February, 2025

Recent Advances in Discrete Speech Tokens: A Review.
CoRR, February, 2025

Characteristic-Specific Partial Fine-Tuning for Efficient Emotion and Speaker Adaptation in Codec Language Text-to-Speech Models.
CoRR, January, 2025

Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model.
CoRR, January, 2025

MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization.
CoRR, January, 2025

Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap.
Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

DRCap: Decoding CLAP Latents with Retrieval-Augmented Generation for Zero-shot Audio Captioning.
Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech.
Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs.
Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

Making LLMs Better Many-to-Many Speech-to-Text Translators with Curriculum Learning.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training.
Proceedings of the Findings of the Association for Computational Linguistics, 2025

GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

Towards Reliable Large Audio Language Model.
Proceedings of the Findings of the Association for Computational Linguistics, 2025

ELLA-V: Stable Neural Codec Language Modeling with Alignment-Guided Sequence Reordering.
Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

Speech Recognition Meets Large Language Model: Benchmarking, Models, and Exploration.
Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

Language Model Can Listen While Speaking.
Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

VQTalker: Towards Multilingual Talking Avatars Through Facial Motion Tokenization.
Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

2024
E$^{3}$TTS: End-to-End Text-Based Speech Editing TTS System and Its Applications.
IEEE ACM Trans. Audio Speech Lang. Process., 2024

Advanced Long-Content Speech Recognition With Factorized Neural Transducer.
IEEE ACM Trans. Audio Speech Lang. Process., 2024

Investigating Acoustic-Textual Emotional Inconsistency Information for Automatic Depression Detection.
CoRR, 2024

Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective.
CoRR, 2024

LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec.
CoRR, 2024

CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought.
CoRR, 2024

Exploring SSL Discrete Tokens for Multilingual ASR.
CoRR, 2024

Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR.
CoRR, 2024

vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders.
CoRR, 2024

Progressive Residual Extraction based Pre-training for Speech Representation Learning.
CoRR, 2024

AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection.
CoRR, 2024

GSTalker: Real-time Audio-Driven Talking Face Generation via Deformable Gaussian Splatting.
CoRR, 2024

The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge.
CoRR, 2024

An Embarrassingly Simple Approach for LLM with Strong ASR Capacity.
CoRR, 2024

CTC-Assisted LLM-Based Contextual ASR.
Proceedings of the IEEE Spoken Language Technology Workshop, 2024

Attention-Constrained Inference For Robust Decoder-Only Text-to-Speech.
Proceedings of the IEEE Spoken Language Technology Workshop, 2024

NDVQ: Robust Neural Audio Codec With Normal Distribution-Based Vector Quantization.
Proceedings of the IEEE Spoken Language Technology Workshop, 2024

1st Place Solution to Odyssey Emotion Recognition Challenge Task1: Tackling Class Imbalance Problem.
Proceedings of the Odyssey 2024: The Speaker and Language Recognition Workshop, 2024

MER 2024: Semi-Supervised Learning, Noise Robustness, and Open-Vocabulary Multimodal Emotion Recognition.
Proceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing, 2024

AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Improving Emotion Recognition with Pre-Trained Models, Multimodality, and Contextual Information.
Proceedings of the 14th IEEE International Symposium on Chinese Spoken Language Processing, 2024

The X-Lance Technical Report for Interspeech 2024 Speech Processing using Discrete Speech Unit Challenge.
Proceedings of the 14th IEEE International Symposium on Chinese Spoken Language Processing, 2024

MaLa-ASR: Multimedia-Assisted LLM-Based ASR.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

Incorporating Class-based Language Model for Named Entity Recognition in Factorized Neural Transducer.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

Improved Factorized Neural Transducer Model For Text-only Domain Adaptation.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

On the Effectiveness of Acoustic BPE in Decoder-Only TTS.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

The Interspeech 2024 Challenge on Speech Processing Using Discrete Units.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

EAT: Self-Supervised Pre-Training with Efficient Audio Transformer.
Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024

BAT: Learning to Reason about Spatial Sounds with Large Language Models.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

Improving Acoustic Scene Classification via Self-Supervised and Semi-Supervised Learning with Efficient Audio Transformer.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2024

Semi-Supervised Acoustic Scene Classification with Test-Time Adaptation.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2024

Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS.
Proceedings of the IEEE International Conference on Acoustics, 2024

Acoustic BPE for Speech Generation with Discrete Tokens.
Proceedings of the IEEE International Conference on Acoustics, 2024

Leveraging Speech PTM, Text LLM, And Emotional TTS For Speech Emotion Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2024

StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations.
Proceedings of the IEEE International Conference on Acoustics, 2024

SEF-VC: Speaker Embedding Free Zero-Shot Voice Conversion with Cross Attention.
Proceedings of the IEEE International Conference on Acoustics, 2024

VoiceFlow: Efficient Text-To-Speech with Rectified Flow Matching.
Proceedings of the IEEE International Conference on Acoustics, 2024

emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation.
Proceedings of the Findings of the Association for Computational Linguistics, 2024

UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding.
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023
Speaker Adaptive Text-to-Speech With Timbre-Normalized Vector-Quantized Feature.
IEEE ACM Trans. Audio Speech Lang. Process., 2023

Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations.
CoRR, 2023

Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation.
CoRR, 2023

DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

Unsupervised Active Learning: Optimizing Labeling Cost-Effectiveness for Automatic Speech Recognition.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Blank-regularized CTC for Frame Skipping in Neural Transducer.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Improving Code-Switching and Name Entity Recognition in ASR with Speech Editing based Data Augmentation.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Towards Effective and Compact Contextual Representation for Conformer Transducer Speech Recognition Systems.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

An Adapter Based Multi-Label Pre-Training for Speech Separation and Enhancement.
Proceedings of the IEEE International Conference on Acoustics, 2023

Emodiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance.
Proceedings of the IEEE International Conference on Acoustics, 2023

Factorized AED: Factorized Attention-Based Encoder-Decoder for Text-Only Domain Adaptive ASR.
Proceedings of the IEEE International Conference on Acoustics, 2023

LongFNT: Long-Form Speech Recognition with Factorized Neural Transducer.
Proceedings of the IEEE International Conference on Acoustics, 2023

Front-End Adapter: Adapting Front-End Input of Speech Based Self-Supervised Learning for Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2023

Improving Few-Shot Learning for Talking Face System with TTS Data Augmentation.
Proceedings of the IEEE International Conference on Acoustics, 2023

Fast-HuBERT: an Efficient Training Framework for Self-Supervised Speech Representation Learning.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

2022
Exploring Effective Fusion Algorithms for Speech Based Self-Supervised Learning Models.
CoRR, 2022

Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Factorized Neural Transducer for Efficient Language Model Adaptation.
Proceedings of the IEEE International Conference on Acoustics, 2022

2021
Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Improving RNN-T for Domain Scaling Using Semi-Supervised Training with Neural TTS.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Memory-Efficient Pipeline-Parallel DNN Training.
Proceedings of the 38th International Conference on Machine Learning, 2021

Internal Language Model Training for Domain-Adaptive End-To-End Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2021

Developing Real-Time Streaming Transformer Transducer for Speech Recognition on Large-Scale Dataset.
Proceedings of the IEEE International Conference on Acoustics, 2021

2020
LSTM-LM with Long-Term History for First-Pass Decoding in Conversational Speech Recognition.
CoRR, 2020

Low-bit Quantization of Recurrent Neural Network Language Models Using Alternating Direction Methods of Multipliers.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

2019
Exploiting Future Word Contexts in Neural Network Language Models for Speech Recognition.
IEEE ACM Trans. Audio Speech Lang. Process., 2019

Long-span language modeling for speech recognition.
CoRR, 2019

Recurrent Neural Network Language Model Training Using Natural Gradient.
Proceedings of the IEEE International Conference on Acoustics, 2019

Gaussian Process LSTM Recurrent Neural Network Language Models for Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2019

Investigation of Sampling Techniques for Maximum Entropy Language Modeling Training.
Proceedings of the IEEE International Conference on Acoustics, 2019

2018
Active Memory Networks for Language Modeling.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Neural Network Language Modeling with Letter-Based Features and Importance Sampling.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

Phonetic and Graphemic Systems for Multi-Genre Broadcast Transcription.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

Limited-Memory BFGS Optimization of Recurrent Neural Network Language Models for Speech Recognition.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

The Effect of Adding Authorship Knowledge in Automated Text Scoring.
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications@NAACL-HLT 2018, 2018

2017
Future Word Contexts in Neural Network Language Models.
CoRR, 2017

Investigating Bidirectional Recurrent Neural Network Language Models for Speech Recognition.
Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

Exploiting the Tibetan Radicals in Recurrent Neural Network for Low-Resource Language Models.
Proceedings of the Neural Information Processing - 24th International Conference, 2017

Recurrent neural network language models for keyword search.
Proceedings of the 2017 IEEE International Conference on Acoustics, 2017

Future word contexts in neural network language models.
Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop, 2017

2016
Two Efficient Lattice Rescoring Methods Using Recurrent Neural Network Language Models.
IEEE ACM Trans. Audio Speech Lang. Process., 2016

Efficient Training and Evaluation of Recurrent Neural Network Language Models for Automatic Speech Recognition.
IEEE ACM Trans. Audio Speech Lang. Process., 2016

Multi-Language Neural Network Language Models.
Proceedings of the 17th Annual Conference of the International Speech Communication Association, 2016

CUED-RNNLM - An open-source toolkit for efficient training and evaluation of recurrent neural network language models.
Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

2015
Recurrent neural network language model adaptation for multi-genre broadcast speech recognition.
Proceedings of the 16th Annual Conference of the International Speech Communication Association, 2015

Paraphrastic recurrent neural network language models.
Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

Robust excitation-based features for Automatic Speech Recognition.
Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

Recurrent neural network language model training with noise contrastive estimation for speech recognition.
Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

Improving the training and evaluation efficiency of recurrent neural network language models.
Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

Investigation of back-off based interpolation between recurrent neural network and n-gram language models.
Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, 2015

2014
Efficient GPU-based training of recurrent neural network language models using spliced sentence bunch.
Proceedings of the 15th Annual Conference of the International Speech Communication Association, 2014

An initial investigation of long-term adaptation for meeting transcription.
Proceedings of the 15th Annual Conference of the International Speech Communication Association, 2014

Impact of single-microphone dereverberation on DNN-based meeting transcription systems.
Proceedings of the IEEE International Conference on Acoustics, 2014

Efficient lattice rescoring using recurrent neural network language models.
Proceedings of the IEEE International Conference on Acoustics, 2014

2012
Pipelined Back-Propagation for Context-Dependent Deep Neural Networks.
Proceedings of the 13th Annual Conference of the International Speech Communication Association, 2012

2011
Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription.
Proceedings of the 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, 2011

