Xuankai Chang

Orcid: 0000-0002-5221-5412

According to our database¹, Xuankai Chang authored at least 87 papers between 2016 and 2026.

Collaborative distances:

Dijkstra number² of four.
Erdős number³ of four.

Bibliography

2026

TMT: Tri-Modal Translation Between Speech, Image, and Text by Processing Different Modalities as Different Languages.

[DOI]

,

,

,

,

,

,

Shinji Watanabe

,

IEEE Trans. Multim., 2026

An end-to-end integration of speech separation and recognition with self-supervised learning representation.

[DOI]

Yoshiki Masuyama

,

,

,

Samuele Cornell

,

,

,

,

Shinji Watanabe

Comput. Speech Lang., 2026

Recent trends in distant conversational speech recognition: A review of CHiME-7 and 8 DASR challenges.

[DOI]

Samuele Cornell

,

Christoph Boeddeker

,

,

,

,

Matthew Wiesner

,

Yoshiki Masuyama

,

,

,

Stefano Squartini

,

,

Shinji Watanabe

Comput. Speech Lang., 2026

2025

Data-Centric Lessons To Improve Speech-Language Pretraining.

[DOI]

Vishaal Udandarao

,

,

,

,

,

Albin Madapally Jose

,

,

,

Chung-Cheng Chiu

CoRR, October, 2025

2024

Module-Based End-to-End Distant Speech Processing: A case study of far-field automatic speech recognition [Special Issue On Model-Based and Data-Driven Audio Signal Processing].

[DOI]

,

Shinji Watanabe

,

,

,

,

IEEE Signal Process. Mag., November, 2024

Everyday Conversation Speech Recognition with End-to-End Neural Networks.

[DOI]

PhD thesis, 2024

A Large-Scale Evaluation of Speech Foundation Models.

[DOI]

IEEE ACM Trans. Audio Speech Lang. Process., 2024

MC-Whisper: Extending Speech Foundation Models to Multichannel Distant Speech Recognition.

[DOI]

,

,

,

Takashi Maekaku

,

Shinji Watanabe

IEEE Signal Process. Lett., 2024

SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR.

[DOI]

,

,

,

Shinji Watanabe

,

CoRR, 2024

SynesLM: A Unified Approach for Audio-visual Speech Recognition and Translation via Language Model and Synthetic Data.

[DOI]

,

,

,

,

,

Shinji Watanabe

CoRR, 2024

The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization.

[DOI]

Samuele Cornell

,

,

,

Christoph Böddeker

,

,

Matthew Maciejewski

,

Matthew Wiesner

,

,

Shinji Watanabe

CoRR, 2024

Robust Audiovisual Speech Recognition Models with Mixture-of-Experts.

[DOI]

,

,

,

,

,

Shinji Watanabe

Proceedings of the IEEE Spoken Language Technology Workshop, 2024

ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets.

[DOI]

,

,

,

Martijn Bartelds

,

Vanya Bannihatti Kumar

,

,

,

,

,

,

Shinji Watanabe

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer.

[DOI]

,

,

,

,

,

,

Muhammad Shakeel

,

,

,

,

,

Shinji Watanabe

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

The Interspeech 2024 Challenge on Speech Processing Using Discrete Units.

[DOI]

,

,

,

,

,

,

Shinji Watanabe

,

,

,

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

UniAudio: Towards Universal Audio Generation with Large Language Models.

[DOI]

,

,

,

,

,

,

,

,

,

,

,

,

Proceedings of the Forty-first International Conference on Machine Learning, 2024

Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing.

[DOI]

,

,

Antonios Anastasopoulos

,

,

Shinji Watanabe

Proceedings of the IEEE International Conference on Acoustics, 2024

Improving Audio Captioning Models with Fine-Grained Audio Features, Text Embedding Supervision, and LLM Mix-Up Augmentation.

[DOI]

,

,

,

,

François G. Germain

,

Jonathan Le Roux

,

Shinji Watanabe

Proceedings of the IEEE International Conference on Acoustics, 2024

VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition, Synthesis and Speech, Text Continuation Tasks.

[DOI]

,

,

,

,

,

Shinji Watanabe

Proceedings of the IEEE International Conference on Acoustics, 2024

HuBERTopic: Enhancing Semantic Representation of HuBERT Through Self-Supervision Utilizing Topic Model.

[DOI]

Takashi Maekaku

,

,

,

,

Shinji Watanabe

Proceedings of the IEEE International Conference on Acoustics, 2024

Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study.

[DOI]

,

,

,

,

,

,

Roshan S. Sharma

,

,

,

Shinji Watanabe

,

,

Takashi Maekaku

,

,

,

,

,

Hsiu-Hsuan Wang

Proceedings of the IEEE International Conference on Acoustics, 2024

Towards Robust Speech Representation Learning for Thousands of Languages.

[DOI]

,

,

,

,

,

,

,

,

,

Shinji Watanabe

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners.

[DOI]

,

,

,

,

,

,

,

,

,

,

,

,

,

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head.

[DOI]

,

,

,

,

,

,

,

,

,

,

,

,

,

Shinji Watanabe

Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023

Software Design and User Interface of ESPnet-SE++: Speech Enhancement for Robust Speech Processing.

[DOI]

,

,

,

,

Samuele Cornell

,

,

Yoshiki Masuyama

,

,

Robin Scheibler

,

,

,

,

Shinji Watanabe

J. Open Source Softw., November, 2023

Software Design and User Interface of ESPnet-SE++: Speech Enhancement for Robust Speech Processing (espnet-v.202310).

[DOI]

,

,

,

,

Samuele Cornell

,

,

Yoshiki Masuyama

,

,

Robin Scheibler

,

,

,

,

Shinji Watanabe

Dataset, October, 2023

Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond.

[DOI]

,

,

,

Hsiu-Hsuan Wang

,

,

,

,

,

,

,

Abdelrahman Mohamed

,

,

Shinji Watanabe

CoRR, 2023

UniAudio: An Audio Foundation Model Toward Universal Audio Generation.

[DOI]

,

,

,

,

,

,

,

,

,

,

,

Shinji Watanabe

,

CoRR, 2023

The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios.

[DOI]

Samuele Cornell

,

Matthew Wiesner

,

Shinji Watanabe

,

,

,

,

Yoshiki Masuyama

,

,

Stefano Squartini

,

Sanjeev Khudanpur

CoRR, 2023

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head.

[DOI]

,

,

,

,

,

,

,

,

,

,

,

,

Shinji Watanabe

CoRR, 2023

Improving Perceptual Quality, Intelligibility, and Acoustics on VoIP Platforms.

[DOI]

,

,

Shikhar Agnihotri

,

,

,

,

,

,

,

,

,

,

,

,

CoRR, 2023

Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation.

[DOI]

Yoshiki Masuyama

,

,

,

Samuele Cornell

,

,

,

,

Shinji Watanabe

Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2023

A New Benchmark of Aphasia Speech Recognition and Detection Based on E-Branchformer and Multi-task Learning.

[DOI]

,

,

,

Shinji Watanabe

,

Brian MacWhinney

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

ML-SUPERB: Multilingual Speech Universal PERformance Benchmark.

[DOI]

,

,

,

,

,

,

,

,

Abdelrahman Mohamed

,

,

Shinji Watanabe

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition.

[DOI]

,

,

,

,

Marco Tagliasacchi

,

,

John R. Hershey

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute.

[DOI]

,

,

,

,

,

Shinji Watanabe

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning.

[DOI]

,

,

,

Takashi Maekaku

,

Shinji Watanabe

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Fully Unsupervised Topic Clustering of Unlabelled Spoken Audio Using Self-Supervised Representation Learning and Topic Model.

[DOI]

Takashi Maekaku

,

,

,

Shinji Watanabe

Proceedings of the IEEE International Conference on Acoustics, 2023

FindAdaptNet: Find and Insert Adapters by Learned Layer Importance.

[DOI]

,

Karthik Ganesan

,

,

,

,

,

Shinji Watanabe

Proceedings of the IEEE International Conference on Acoustics, 2023

Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond.

[DOI]

,

,

,

Hsiu-Hsuan Wang

,

,

,

,

,

,

,

Abdelrahman Mohamed

,

,

Shinji Watanabe

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

Reproducing Whisper-Style Training Using An Open-Source Toolkit And Publicly Available Data.

[DOI]

,

,

,

,

,

,

,

,

,

Roshan S. Sharma

,

,

,

Muhammad Shakeel

,

,

,

Shinji Watanabe

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

LV-CTC: Non-Autoregressive ASR With CTC and Latent Variable Models.

[DOI]

,

Shinji Watanabe

,

,

Takashi Maekaku

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

Joint Prediction and Denoising for Large-Scale Multilingual Self-Supervised Learning.

[DOI]

,

,

,

,

,

,

,

,

Shinji Watanabe

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

2022

End-to-End Dereverberation, Beamforming, and Speech Recognition in a Cocktail Party.

[DOI]

,

,

Christoph Böddeker

,

Tomohiro Nakatani

,

Shinji Watanabe

,

IEEE ACM Trans. Audio Speech Lang. Process., 2022

Train from scratch: Single-stage joint training of speech separation and recognition.

[DOI]

,

,

Shinji Watanabe

,

Comput. Speech Lang., 2022

Muskits: an End-to-End Music Processing Toolkit for Singing Voice Synthesis.

[DOI]

,

,

,

,

,

,

,

,

,

,

Shinji Watanabe

,

CoRR, 2022

End-to-End Multi-Speaker ASR with Independent Vector Analysis.

[DOI]

Robin Scheibler

,

,

,

Shinji Watanabe

,

Proceedings of the IEEE Spoken Language Technology Workshop, 2022

A Study on the Integration of Pre-Trained SSL, ASR, LM and SLU Models for Spoken Language Understanding.

[DOI]

,

,

,

,

,

Karthik Ganesan

,

Siddharth Dalmia

,

,

Shinji Watanabe

Proceedings of the IEEE Spoken Language Technology Workshop, 2022

End-to-End Integration of Speech Recognition, Dereverberation, Beamforming, and Self-Supervised Learning Representation.

[DOI]

Yoshiki Masuyama

,

,

Samuele Cornell

,

Shinji Watanabe

,

Proceedings of the IEEE Spoken Language Technology Workshop, 2022

SUPERB @ SLT 2022: Challenge on Generalization and Efficiency of Self-Supervised Speech Representation Learning.

[DOI]

,

Shuyan Annie Dong

,

,

,

,

,

,

,

,

,

Shinji Watanabe

,

Abdelrahman Mohamed

,

,

Proceedings of the IEEE Spoken Language Technology Workshop, 2022

Muskits: an End-to-end Music Processing Toolkit for Singing Voice Synthesis.

[DOI]

,

,

,

,

,

,

,

,

,

Shinji Watanabe

,

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding.

[DOI]

,

,

,

,

Samuele Cornell

,

,

Yoshiki Masuyama

,

,

Robin Scheibler

,

,

,

,

Shinji Watanabe

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation.

[DOI]

,

Takashi Maekaku

,

,

Shinji Watanabe

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Two-Pass Low Latency End-to-End Spoken Language Understanding.

[DOI]

,

Siddharth Dalmia

,

,

,

,

Shinji Watanabe

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Joint Speech Recognition and Audio Captioning.

[DOI]

Chaitanya Narisetty

,

,

,

Yosuke Kashiwagi

,

Michael Hentschel

,

Shinji Watanabe

Proceedings of the IEEE International Conference on Acoustics, 2022

An Exploration of HuBERT with Large Number of Cluster Units and Model Assessment Using Bayesian Information Criterion.

[DOI]

Takashi Maekaku

,

,

,

Shinji Watanabe

Proceedings of the IEEE International Conference on Acoustics, 2022

Towards Low-Distortion Multi-Channel Speech Enhancement: The ESPnet-SE Submission to the L3DAS22 Challenge.

[DOI]

,

Samuele Cornell

,

,

,

,

,

,

Shinji Watanabe

Proceedings of the IEEE International Conference on Acoustics, 2022

Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR.

[DOI]

,

,

,

Shinji Watanabe

,

Jonathan Le Roux

Proceedings of the IEEE International Conference on Acoustics, 2022

ESPnet-SLU: Advancing Spoken Language Understanding Through ESPnet.

[DOI]

,

Siddharth Dalmia

,

,

,

,

,

,

,

Karthik Ganesan

,

,

,

,

Shinji Watanabe

Proceedings of the IEEE International Conference on Acoustics, 2022

SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities.

[DOI]

Hsiang-Sheng Tsai

,

,

,

,

Kushal Lakhotia

,

,

,

,

,

,

,

,

,

,

Shinji Watanabe

,

Abdelrahman Mohamed

,

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022

2021

Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem.

[DOI]

,

,

,

,

Shinji Watanabe

,

CoRR, 2021

ESPnet-SE: End-To-End Speech Enhancement and Separation Toolkit Designed for ASR Integration.

[DOI]

,

,

,

Aswin Shanmugam Subramanian

,

,

,

,

,

Christoph Böddeker

,

,

Shinji Watanabe

Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Investigation of End-to-End Speaker-Attributed ASR for Continuous Multi-Talker Recordings.

[DOI]

,

,

,

,

,

,

Takuya Yoshioka

Proceedings of the IEEE Spoken Language Technology Workshop, 2021

SUPERB: Speech Processing Universal PERformance Benchmark.

[DOI]

Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Streaming End-to-End ASR Based on Blockwise Non-Autoregressive Models.

[DOI]

,

,

,

Shinji Watanabe

Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Speech Representation Learning Combining Conformer CPC with Deep Cluster for the ZeroSpeech Challenge 2021.

[DOI]

Takashi Maekaku

,

,

,

,

Shinji Watanabe

,

Alexander I. Rudnicky

Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain.

[DOI]

,

,

Shinji Watanabe

,

Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Recent Developments on ESPnet Toolkit Boosted by Conformer.

[DOI]

,

,

,

,

,

Hirofumi Inaguma

,

,

,

Daniel Garcia-Romero

,

,

,

Shinji Watanabe

,

,

,

Proceedings of the IEEE International Conference on Acoustics, 2021

Hypothesis Stitcher for End-to-End Speaker-Attributed ASR on Long-Form Multi-Talker Recordings.

[DOI]

,

,

,

,

,

Takuya Yoshioka

Proceedings of the IEEE International Conference on Acoustics, 2021

An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition.

[DOI]

,

Takashi Maekaku

,

,

,

,

Aswin Shanmugam Subramanian

,

,

,

,

,

Shinji Watanabe

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2021

2020

Improving End-to-End Single-Channel Multi-Talker Speech Recognition.

[DOI]

,

,

,

Shinji Watanabe

IEEE ACM Trans. Audio Speech Lang. Process., 2020

The 2020 ESPnet update: new features, broadened applications, performance improvements, and future plans.

[DOI]

Shinji Watanabe

,

,

,

,

,

,

,

,

Hirofumi Inaguma

,

,

,

,

,

Aswin Shanmugam Subramanian

,

CoRR, 2020

Sequence to Multi-Sequence Learning via Conditional Chain Mapping for Mixture Signals.

[DOI]

,

,

,

Shinji Watanabe

,

,

,

,

Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020

End-to-End Far-Field Speech Recognition with Unified Dereverberation and Beamforming.

[DOI]

,

Aswin Shanmugam Subramanian

,

,

Shinji Watanabe

,

Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Insertion-Based Modeling for End-to-End Automatic Speech Recognition.

[DOI]

,

Shinji Watanabe

,

,

Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

End-to-End ASR with Adaptive Span Self-Attention.

[DOI]

,

Aswin Shanmugam Subramanian

,

,

Shinji Watanabe

,

,

Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

End-To-End Multi-Speaker Speech Recognition With Transformer.

[DOI]

,

,

,

Jonathan Le Roux

,

Shinji Watanabe

Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

2019

Knowledge Distillation for End-to-End Monaural Multi-Talker ASR System.

[DOI]

,

,

Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

End-to-end Monaural Multi-speaker ASR System without Pretraining.

[DOI]

,

,

,

Shinji Watanabe

Proceedings of the IEEE International Conference on Acoustics, 2019

MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition.

[DOI]

,

,

,

Jonathan Le Roux

,

Shinji Watanabe

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

2018

Single-channel multi-talker speech recognition with permutation invariant training.

[DOI]

,

,

Speech Commun., 2018

Erratum to: Past review, current progress, and challenges ahead on the cocktail party problem.

[DOI]

,

,

,

,

Frontiers Inf. Technol. Electron. Eng., 2018

Past review, current progress, and challenges ahead on the cocktail party problem.

[DOI]

,

,

,

,

Frontiers Inf. Technol. Electron. Eng., 2018

Monaural Multi-Talker Speech Recognition with Attention Mechanism and Gated Convolutional Networks.

[DOI]

,

,

Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Adaptive Permutation Invariant Training with Auxiliary Information for Monaural Multi-Talker Speech Recognition.

[DOI]

,

,

Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

2017

Recognizing Multi-Talker Speech with Permutation Invariant Training.

[DOI]

,

,

Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

2016

Unrestricted Vocabulary Keyword Spotting Using LSTM-CTC.

[DOI]

,

,

,

Proceedings of the 17th Annual Conference of the International Speech Communication Association, 2016
