Shinji Watanabe

ORCID: 0000-0002-5970-8631

Affiliations:
  • Carnegie Mellon University, Pittsburgh, PA, USA
  • Johns Hopkins University, Baltimore, MD, USA (former)
  • Mitsubishi Electric Research Laboratories, Cambridge, MA, USA (2012 - 2017)
  • NTT Communication Science Laboratories, Kyoto, Japan (2001 - 2011)
  • Waseda University, Tokyo, Japan (PhD 2006)


According to our database, Shinji Watanabe authored at least 531 papers between 2002 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2024
Text-Inductive Graphone-Based Language Adaptation for Low-Resource Speech Synthesis.
IEEE ACM Trans. Audio Speech Lang. Process., 2024

End-to-End Speech Recognition: A Survey.
IEEE ACM Trans. Audio Speech Lang. Process., 2024

Wav2Gloss: Generating Interlinear Glossed Text from Speech.
CoRR, 2024

TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages.
CoRR, 2024

OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification.
CoRR, 2024

Evaluating and Improving Continual Learning in Spoken Language Understanding.
CoRR, 2024

Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features?
CoRR, 2024

SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition.
CoRR, 2024

Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and KiSing-v2.
CoRR, 2024

ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models.
CoRR, 2024

SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics.
CoRR, 2024

OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer.
CoRR, 2024

Improving Design of Input Condition Invariant Speech Enhancement.
CoRR, 2024

Boosting Unknown-number Speaker Separation with Transformer Decoder-based Attractor.
CoRR, 2024

Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search.
CoRR, 2024

Improving ASR Contextual Biasing with Guided Attention.
CoRR, 2024

AugSumm: towards generalizable speech summarization using synthetic labels from large language model.
CoRR, 2024

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head.
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023
Software Design and User Interface of ESPnet-SE++: Speech Enhancement for Robust Speech Processing.
J. Open Source Softw., November, 2023

Software Design and User Interface of ESPnet-SE++: Speech Enhancement for Robust Speech Processing (espnet-v.202310).
Dataset, October, 2023

STFT-Domain Neural Speech Enhancement With Very Low Algorithmic Latency.
IEEE ACM Trans. Audio Speech Lang. Process., 2023

TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation.
IEEE ACM Trans. Audio Speech Lang. Process., 2023

Improving Speech Enhancement Performance by Leveraging Contextual Broad Phonetic Class Information.
IEEE ACM Trans. Audio Speech Lang. Process., 2023

Online Neural Diarization of Unlimited Numbers of Speakers Using Global and Local Attractors.
IEEE ACM Trans. Audio Speech Lang. Process., 2023

LegoNN: Building Modular Encoder-Decoder Models.
IEEE ACM Trans. Audio Speech Lang. Process., 2023

A dilemma of ground truth in noisy speech separation and an approach to lessen the impact of imperfect training data.
Comput. Speech Lang., 2023

Understanding Probe Behaviors through Variational Bounds of Mutual Information.
CoRR, 2023

Generative Context-aware Fine-tuning of Self-supervised Speech Models.
CoRR, 2023

Phoneme-aware Encoding for Prefix-tree-based Contextual ASR.
CoRR, 2023

Music ControlNet: Multiple Time-varying Controls for Music Generation.
CoRR, 2023

TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch.
CoRR, 2023

Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond.
CoRR, 2023

HuBERTopic: Enhancing Semantic Representation of HuBERT through Self-supervision Utilizing Topic Model.
CoRR, 2023

EFFUSE: Efficient Self-Supervised Feature Fusion for E2E ASR in Multilingual and Low Resource Scenarios.
CoRR, 2023

UniverSLU: Universal Spoken Language Understanding for Diverse Classification and Sequence Generation Tasks with a Single Network.
CoRR, 2023

One model to rule them all? Towards End-to-End Joint Speaker Diarization and Speech Recognition.
CoRR, 2023

UniAudio: An Audio Foundation Model Toward Universal Audio Generation.
CoRR, 2023

Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation.
CoRR, 2023

Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing.
CoRR, 2023

Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study.
CoRR, 2023

Enhancing End-to-End Conversational Speech Translation Through Target Language Context Utilization.
CoRR, 2023

Speech collage: code-switched audio generation by collaging monolingual corpora.
CoRR, 2023

Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff.
CoRR, 2023

Semi-Autoregressive Streaming ASR With Label Context.
CoRR, 2023

AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models.
CoRR, 2023

Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech.
CoRR, 2023

Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation.
CoRR, 2023

Visual Speech Recognition for Low-resource Languages with Automatic Labels From Whisper Model.
CoRR, 2023

Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens.
CoRR, 2023

The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction.
CoRR, 2023

VoxtLM: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks.
CoRR, 2023

Bayes Risk Transducer: Transducer with Controllable Alignment Prediction.
CoRR, 2023

Integration of Frame- and Label-synchronous Beam Search for Streaming Encoder-decoder Speech Recognition.
CoRR, 2023

Integrating Pretrained ASR and LM to Perform Sequence Generation for Spoken Language Understanding.
CoRR, 2023

BASS: Block-wise Adaptation for Speech Summarization.
CoRR, 2023

The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios.
CoRR, 2023

Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute.
CoRR, 2023

Exploration on HuBERT with Multiple Resolutions.
CoRR, 2023

Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning.
CoRR, 2023

DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models.
CoRR, 2023

A New Benchmark of Aphasia Speech Recognition and Detection Based on E-Branchformer and Multi-task Learning.
CoRR, 2023

Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization.
CoRR, 2023

A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks.
CoRR, 2023

ML-SUPERB: Multilingual Speech Universal PERformance Benchmark.
CoRR, 2023

Improving Cascaded Unsupervised Speech Translation with Denoising Back-translation.
CoRR, 2023

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head.
CoRR, 2023

Neural Speech Enhancement with Very Low Algorithmic Latency and Complexity via Integrated Full- and Sub-Band Modeling.
CoRR, 2023

Multi-Channel Target Speaker Extraction with Refinement: The WavLab Submission to the Second Clarity Enhancement Challenge.
CoRR, 2023

Unsupervised Data Selection for TTS: Using Arabic Broadcast News as a Case Study.
CoRR, 2023

Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation.
Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2023

Multilingual TTS Accent Impressions for Accented ASR.
Proceedings of the Text, Speech, and Dialogue - 26th International Conference, 2023

SigMoreFun Submission to the SIGMORPHON Shared Task on Interlinear Glossing.
Proceedings of the 20th SIGMORPHON workshop on Computational Research in Phonetics, 2023

UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

CMU's IWSLT 2023 Simultaneous Speech Translation System.
Proceedings of the 20th International Conference on Spoken Language Translation, 2023

Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining.
Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023

Efficient Sequence Transduction by Jointly Predicting Tokens and Durations.
Proceedings of the International Conference on Machine Learning, 2023

Bayes Risk CTC: Controllable CTC Alignment in Sequence-to-Sequence Tasks.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

TAPLoss: A Temporal Acoustic Parameter Loss for Speech Enhancement.
Proceedings of the IEEE International Conference on Acoustics, 2023

PAAPLoss: A Phonetic-Aligned Acoustic Parameter Loss for Speech Enhancement.
Proceedings of the IEEE International Conference on Acoustics, 2023

Towards Zero-Shot Code-Switched Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2023

Multi-Blank Transducers for Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2023

Wav2Seq: Pre-Training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages.
Proceedings of the IEEE International Conference on Acoustics, 2023

Speaker-Independent Acoustic-to-Articulatory Speech Inversion.
Proceedings of the IEEE International Conference on Acoustics, 2023

The Multimodal Information Based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2023

Neural Speech Enhancement with Very Low Algorithmic Latency and Complexity via Integrated Full- and Sub-Band Modeling.
Proceedings of the IEEE International Conference on Acoustics, 2023

TF-GridNet: Making Time-Frequency Domain Models Great Again for Monaural Speaker Separation.
Proceedings of the IEEE International Conference on Acoustics, 2023

Context-Aware Fine-Tuning of Self-Supervised Speech Models.
Proceedings of the IEEE International Conference on Acoustics, 2023

Enhancing Speech-To-Speech Translation with Multiple TTS Targets.
Proceedings of the IEEE International Conference on Acoustics, 2023

Bridging Speech and Textual Pre-Trained Models With Unsupervised ASR.
Proceedings of the IEEE International Conference on Acoustics, 2023

I3D: Transformer Architectures with Input-Dependent Dynamic Depth for Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2023

Structured Pruning of Self-Supervised Pre-Trained Models for Speech Recognition and Understanding.
Proceedings of the IEEE International Conference on Acoustics, 2023

Align, Write, Re-Order: Explainable End-to-End Speech Translation via Operation Sequence Generation.
Proceedings of the IEEE International Conference on Acoustics, 2023

SpeechLMScore: Evaluating Speech Generation Using Speech Language Model.
Proceedings of the IEEE International Conference on Acoustics, 2023

Fully Unsupervised Topic Clustering of Unlabelled Spoken Audio Using Self-Supervised Representation Learning and Topic Model.
Proceedings of the IEEE International Conference on Acoustics, 2023

Articulatory Representation Learning via Joint Factor Analysis and Neural Matrix Factorization.
Proceedings of the IEEE International Conference on Acoustics, 2023

E-Branchformer-Based E2E SLU Toward STOP On-Device Challenge.
Proceedings of the IEEE International Conference on Acoustics, 2023

Speech Summarization of Long Spoken Document: Improving Memory Efficiency of Speech/Text Encoders.
Proceedings of the IEEE International Conference on Acoustics, 2023

In Search of Strong Embedding Extractors for Speaker Diarisation.
Proceedings of the IEEE International Conference on Acoustics, 2023

FindAdaptNet: Find and Insert Adapters by Learned Layer Importance.
Proceedings of the IEEE International Conference on Acoustics, 2023

BECTRA: Transducer-Based End-To-End ASR with BERT-Enhanced Encoder.
Proceedings of the IEEE International Conference on Acoustics, 2023

InterMPL: Momentum Pseudo-Labeling With Intermediate CTC Loss.
Proceedings of the IEEE International Conference on Acoustics, 2023

EURO: ESPnet Unsupervised ASR Open-Source Toolkit.
Proceedings of the IEEE International Conference on Acoustics, 2023

Streaming Joint Speech Recognition and Disfluency Detection.
Proceedings of the IEEE International Conference on Acoustics, 2023

The Pipeline System of ASR and NLU with MLM-based Data Augmentation Toward STOP Low-Resource Challenge.
Proceedings of the IEEE International Conference on Acoustics, 2023

Multi-Channel Speaker Extraction with Adversarial Training: The WavLab Submission to the Clarity ICASSP 2023 Grand Challenge.
Proceedings of the IEEE International Conference on Acoustics, 2023

Improving Massively Multilingual ASR with Auxiliary CTC Objectives.
Proceedings of the IEEE International Conference on Acoustics, 2023

A Unified One-Shot Prosody and Speaker Conversion System with Self-Supervised Discrete Speech Units.
Proceedings of the IEEE International Conference on Acoustics, 2023

Summary on the Multimodal Information Based Speech Processing (MISP) 2022 Challenge.
Proceedings of the IEEE International Conference on Acoustics, 2023

Avoid Overthinking in Self-Supervised Models for Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2023

A Study on the Integration of Pipeline and E2E SLU Systems for Spoken Semantic Parsing Toward STOP Quality Challenge.
Proceedings of the IEEE International Conference on Acoustics, 2023

Joint Modelling of Spoken Language Understanding Tasks with Integrated Dialog History.
Proceedings of the IEEE International Conference on Acoustics, 2023

CTC Alignments Improve Autoregressive Translation.
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023

Toward Universal Speech Enhancement For Diverse Input Conditions.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

Segment-Level Vectorized Beam Search Based on Partially Autoregressive Inference.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

Domain Adaptation by Data Distribution Matching Via Submodularity For Speech Recognition.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

ESPnet-SUMM: Introducing a Novel Large Dataset, Toolkit, and a Cross-Corpora Evaluation of Speech Summarization Systems.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

A Single Speech Enhancement Model Unifying Dereverberation, Denoising, Speaker Counting, Separation, and Extraction.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

YODAS: YouTube-Oriented Dataset for Audio and Speech.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

Summarize While Translating: Universal Model With Parallel Decoding for Summarization and Translation.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

TorchAudio 2.1: Advancing Speech Recognition, Self-Supervised Learning, and Audio Processing Components for PyTorch.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

LV-CTC: Non-Autoregressive ASR With CTC and Latent Variable Models.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

Joint Prediction and Denoising for Large-Scale Multilingual Self-Supervised Learning.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

Synthetic Data Augmentation for ASR with Domain Filtering.
Proceedings of the Asia Pacific Signal and Information Processing Association Annual Summit and Conference, 2023

ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2023

SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech.
Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022
End-to-End Dereverberation, Beamforming, and Speech Recognition in a Cocktail Party.
IEEE ACM Trans. Audio Speech Lang. Process., 2022

Encoder-Decoder Based Attractors for End-to-End Neural Diarization.
IEEE ACM Trans. Audio Speech Lang. Process., 2022

Improving Frame-Online Neural Speech Enhancement With Overlapped-Frame Prediction.
IEEE Signal Process. Lett., 2022

Self-Supervised Speech Representation Learning: A Review.
IEEE J. Sel. Top. Signal Process., 2022

Editorial of Special Issue on Self-Supervised Learning for Speech and Audio Processing.
IEEE J. Sel. Top. Signal Process., 2022

Deep learning based multi-source localization with source splitting and its effectiveness in multi-talker speech recognition.
Comput. Speech Lang., 2022

An investigation of neural uncertainty estimation for target speaker extraction equipped RNN transducer.
Comput. Speech Lang., 2022

Train from scratch: Single-stage joint training of speech separation and recognition.
Comput. Speech Lang., 2022

A review of speaker diarization: Recent advances with deep learning.
Comput. Speech Lang., 2022

Arabic speech recognition by end-to-end, modular systems and human.
Comput. Speech Lang., 2022

Joint speaker diarization and speech recognition based on region proposal networks.
Comput. Speech Lang., 2022

4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders.
CoRR, 2022

Large-scale learning of generalised representations for speaker recognition.
CoRR, 2022

Online Neural Diarization of Unlimited Numbers of Speakers.
CoRR, 2022

Muskits: an End-to-End Music Processing Toolkit for Singing Voice Synthesis.
CoRR, 2022

HEAR 2021: Holistic Evaluation of Audio Representations.
CoRR, 2022

End-to-End Multi-Speaker ASR with Independent Vector Analysis.
Proceedings of the IEEE Spoken Language Technology Workshop, 2022

A Study on the Integration of Pre-Trained SSL, ASR, LM and SLU Models for Spoken Language Understanding.
Proceedings of the IEEE Spoken Language Technology Workshop, 2022

On Compressing Sequences for Self-Supervised Speech Models.
Proceedings of the IEEE Spoken Language Technology Workshop, 2022

End-to-End Integration of Speech Recognition, Dereverberation, Beamforming, and Self-Supervised Learning Representation.
Proceedings of the IEEE Spoken Language Technology Workshop, 2022

EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers.
Proceedings of the IEEE Spoken Language Technology Workshop, 2022

E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition.
Proceedings of the IEEE Spoken Language Technology Workshop, 2022

Mutual Learning of Single- and Multi-Channel End-to-End Neural Diarization.
Proceedings of the IEEE Spoken Language Technology Workshop, 2022

SUPERB @ SLT 2022: Challenge on Generalization and Efficiency of Self-Supervised Speech Representation Learning.
Proceedings of the IEEE Spoken Language Technology Workshop, 2022

Phone Inventories and Recognition for Every Language.
Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022

CMU's IWSLT 2022 Dialect Speech Translation System.
Proceedings of the 19th International Conference on Spoken Language Translation, 2022

Audio-Visual Wake Word Spotting in MISP2021 Challenge: Dataset Release and Deep Analysis.
Proceedings of the Interspeech 2022, 2022

Online Continual Learning of End-to-End Speech Recognition Models.
Proceedings of the Interspeech 2022, 2022

Improving Speech Enhancement through Fine-Grained Speech Characteristics.
Proceedings of the Interspeech 2022, 2022

Deep Speech Synthesis from Articulatory Representations.
Proceedings of the Interspeech 2022, 2022

Residual Language Model for End-to-end Speech Recognition.
Proceedings of the Interspeech 2022, 2022

Updating Only Encoders Prevents Catastrophic Forgetting of End-to-End ASR Models.
Proceedings of the Interspeech 2022, 2022

Streaming Automatic Speech Recognition with Re-blocking Processing Based on Integrated Voice Activity Detection.
Proceedings of the Interspeech 2022, 2022

Minimum latency training of sequence transducers for streaming end-to-end speech recognition.
Proceedings of the Interspeech 2022, 2022

VQ-T: RNN Transducers using Vector-Quantized Prediction Network States.
Proceedings of the Interspeech 2022, 2022

Muskits: an End-to-end Music Processing Toolkit for Singing Voice Synthesis.
Proceedings of the Interspeech 2022, 2022

When Is TTS Augmentation Through a Pivot Language Useful?
Proceedings of the Interspeech 2022, 2022

Attention Weight Smoothing Using Prior Distributions for Transformer-Based End-to-End ASR.
Proceedings of the Interspeech 2022, 2022

ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding.
Proceedings of the Interspeech 2022, 2022

ASR2K: Speech Recognition for Around 2000 Languages without Audio.
Proceedings of the Interspeech 2022, 2022

Memory-Efficient Training of RNN-Transducer with Sampled Softmax.
Proceedings of the Interspeech 2022, 2022

Better Intermediates Improve CTC Inference.
Proceedings of the Interspeech 2022, 2022

TriniTTS: Pitch-controllable End-to-end TTS without External Aligner.
Proceedings of the Interspeech 2022, 2022

SingAug: Data Augmentation for Singing Voice Synthesis with Cycle-consistent Training Strategy.
Proceedings of the Interspeech 2022, 2022

Blockwise Streaming Transformer for Spoken Language Understanding and Simultaneous Speech Translation.
Proceedings of the Interspeech 2022, 2022

Audio-Visual Speech Recognition in MISP2021 Challenge: Dataset Release and Deep Analysis.
Proceedings of the Interspeech 2022, 2022

End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation.
Proceedings of the Interspeech 2022, 2022

Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation.
Proceedings of the Interspeech 2022, 2022

Two-Pass Low Latency End-to-End Spoken Language Understanding.
Proceedings of the Interspeech 2022, 2022

Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding.
Proceedings of the International Conference on Machine Learning, 2022

Joint Modeling of Code-Switched and Monolingual ASR via Conditional Factorization.
Proceedings of the IEEE International Conference on Acoustics, 2022

Run-and-Back Stitch Search: Novel Block Synchronous Decoding For Streaming Encoder-Decoder ASR.
Proceedings of the IEEE International Conference on Acoustics, 2022

SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2022

Non-Autoregressive End-To-End Automatic Speech Recognition Incorporating Downstream Natural Language Processing.
Proceedings of the IEEE International Conference on Acoustics, 2022

Joint Speech Recognition and Audio Captioning.
Proceedings of the IEEE International Conference on Acoustics, 2022

Sequence Transduction with Graph-Based Supervision.
Proceedings of the IEEE International Conference on Acoustics, 2022

An Exploration of HuBERT with Large Number of Cluster Units and Model Assessment Using Bayesian Information Criterion.
Proceedings of the IEEE International Conference on Acoustics, 2022

Conditional Diffusion Probabilistic Model for Speech Enhancement.
Proceedings of the IEEE International Conference on Acoustics, 2022

Towards Low-Distortion Multi-Channel Speech Enhancement: The ESPnet-SE Submission to the L3DAS22 Challenge.
Proceedings of the IEEE International Conference on Acoustics, 2022

Integrating Multiple ASR Systems into NLP Backend with Attention Fusion.
Proceedings of the IEEE International Conference on Acoustics, 2022

S3PRL-VC: Open-Source Voice Conversion Framework with Self-Supervised Speech Representations.
Proceedings of the IEEE International Conference on Acoustics, 2022

Investigating Self-Supervised Learning for Speech Enhancement and Separation.
Proceedings of the IEEE International Conference on Acoustics, 2022

Multi-Channel End-To-End Neural Diarization with Distributed Microphones.
Proceedings of the IEEE International Conference on Acoustics, 2022

Improving Non-Autoregressive End-to-End Speech Recognition with Pre-Trained Acoustic and Language Models.
Proceedings of the IEEE International Conference on Acoustics, 2022

The First Multimodal Information Based Speech Processing (MISP) Challenge: Data, Tasks, Baselines and Results.
Proceedings of the IEEE International Conference on Acoustics, 2022

Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR.
Proceedings of the IEEE International Conference on Acoustics, 2022

ESPnet-SLU: Advancing Spoken Language Understanding Through ESPnet.
Proceedings of the IEEE International Conference on Acoustics, 2022

BERT Meets CTC: New Formulation of End-to-End Speech Recognition with Pre-trained Masked Language Model.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, 2022

Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, 2022

SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities.
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022

Zero-shot Learning for Grapheme to Phoneme Conversion with Language Ensemble.
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, 2022

2021
Non-Autoregressive Transformer for Speech Recognition.
IEEE Signal Process. Lett., 2021

Far-Field Automatic Speech Recognition.
Proc. IEEE, 2021

Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem.
CoRR, 2021

JTubeSpeech: corpus of Japanese speech collected from YouTube for speech recognition and speaker verification.
CoRR, 2021

TorchAudio: Building Blocks for Audio and Speech Processing.
CoRR, 2021

ESPnet2-TTS: Extending the Edge of TTS Research.
CoRR, 2021

Non-autoregressive End-to-end Speech Translation with Parallel Autoregressive Rescoring.
CoRR, 2021

Encoder-Decoder Based Attractor Calculation for End-to-End Neural Diarization.
CoRR, 2021

GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio.
CoRR, 2021

INTERSPEECH 2021 ConferencingSpeech Challenge: Towards Far-field Multi-Channel Speech Enhancement for Video Conferencing.
CoRR, 2021

The Hitachi-JHU DIHARD III System: Competitive End-to-End Neural Diarization and X-Vector Clustering Systems Combined by DOVER-Lap.
CoRR, 2021

Online End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers.
CoRR, 2021

Arabic Speech Recognition by End-to-End, Modular Systems and Human.
CoRR, 2021

Closing the Gap Between Time-Domain Multi-Channel Speech Enhancement on Real and Simulation Conditions.
Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2021

Online End-To-End Neural Diarization with Speaker-Tracing Buffer.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Sequential Multi-Frame Neural Beamforming for Speech Separation and Enhancement.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Streaming Transformer ASR with Blockwise Synchronous Beam Search.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

End-to-End Speaker Diarization Conditioned on Speech Activity and Overlap Detection.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

DOVER-Lap: A Method for Combining Overlap-Aware Diarization Outputs.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System Description, Comparison, and Analysis.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Dual-Path RNN for Long Recording Speech Separation.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

ESPnet-SE: End-To-End Speech Enhancement and Separation Toolkit Designed for ASR Integration.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

End-to-end ASR to jointly predict transcriptions and linguistic annotations.
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021

Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation.
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021

Searchable Hidden Intermediates for End-to-End Models of Decomposable Sequence Tasks.
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021

Self-Guided Curriculum Learning for Neural Machine Translation.
Proceedings of the 18th International Conference on Spoken Language Translation, 2021

ESPnet-ST IWSLT 2021 Offline Speech Translation System.
Proceedings of the 18th International Conference on Spoken Language Translation, 2021

Auxiliary Loss Function for Target Speech Extraction and Recognition with Weak Supervision Based on Speaker Characteristics.
Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August, 2021

SUPERB: Speech Processing Universal PERformance Benchmark.
Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August, 2021

Differentiable Allophone Graphs for Language-Universal Speech Recognition.
Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August, 2021

Online Streaming End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers.
Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August, 2021

Streaming End-to-End ASR Based on Blockwise Non-Autoregressive Models.
Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August, 2021

Data Augmentation Methods for End-to-End Speech Recognition on Distant-Talk Scenarios.
Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August, 2021

Semi-Supervised Training with Pseudo-Labeling for End-To-End Neural Diarization.
Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August, 2021

Leveraging Pre-Trained Language Model for Speech Sentiment Analysis.
Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August, 2021

SPGISpeech: 5,000 Hours of Transcribed Financial Audio for Fully Formatted End-to-End Speech Recognition.
Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August, 2021

Speech Representation Learning Combining Conformer CPC with Deep Cluster for the ZeroSpeech Challenge 2021.
Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August, 2021

Speaker Verification-Based Evaluation of Single-Channel Speech Separation.
Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August, 2021

Layer Pruning on Demand with Intermediate CTC.
Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August, 2021

Acoustic Event Detection with Classifier Chains.
Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August, 2021

Multi-Mode Transformer Transducer with Stochastic Future Context.
Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August, 2021

Target-Speaker Voice Activity Detection with Improved i-Vector Estimation for Unknown Number of Speaker.
Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August, 2021

Continuous Speech Separation Using Speaker Inventory for Long Recording.
Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August, 2021

Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain.
Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August, 2021

Toward Streaming ASR with Non-Autoregressive Insertion-Based Model.
Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August, 2021

GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio.
Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August, 2021

Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding.
Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August, 2021

End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend.
Proceedings of the IEEE International Conference on Acoustics, 2021

Directional ASR: A New Paradigm for E2E Multi-Speaker Speech Recognition with Source Localization.
Proceedings of the IEEE International Conference on Acoustics, 2021

Improving RNN Transducer with Target Speaker Extraction and Neural Uncertainty Estimation.
Proceedings of the IEEE International Conference on Acoustics, 2021

End-To-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings.
Proceedings of the IEEE International Conference on Acoustics, 2021

Training Noisy Single-Channel Speech Separation with Noisy Oracle Sources: A Large Gap and a Small Step.
Proceedings of the IEEE International Conference on Acoustics, 2021

Dual-Path Modeling for Long Recording Speech Separation in Meetings.
Proceedings of the IEEE International Conference on Acoustics, 2021

Intermediate Loss Regularization for CTC-Based Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2021

Gaussian Kernelized Self-Attention for Long Sequence Data and its Application to CTC-Based Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2021

ORTHROS: non-autoregressive end-to-end speech translation with dual-decoder.
Proceedings of the IEEE International Conference on Acoustics, 2021

End-To-End Speaker Diarization as Post-Processing.
Proceedings of the IEEE International Conference on Acoustics, 2021

Improved Mask-CTC for Non-Autoregressive End-to-End ASR.
Proceedings of the IEEE International Conference on Acoustics, 2021

Recent Developments on ESPnet Toolkit Boosted by Conformer.
Proceedings of the IEEE International Conference on Acoustics, 2021

EAT: Enhanced ASR-TTS for Self-Supervised Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2021

Leveraging End-to-End ASR for Endangered Language Documentation: An Empirical Study on Yolóxochitl Mixtec.
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021

Leveraging State-of-the-art ASR Techniques to Audio Captioning.
Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events 2021 (DCASE 2021), 2021

Cross-Lingual Transfer for Speech Processing Using Acoustic Language Similarity.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2021

ConferencingSpeech Challenge: Towards Far-Field Multi-Channel Speech Enhancement for Video Conferencing.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2021

Attention-Based Multi-Hypothesis Fusion for Speech Summarization.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2021

Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2021

On Prosody Modeling for ASR+TTS Based Voice Conversion.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2021

Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2021

A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2021

An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2021

A Study of Transducer Based End-to-End ASR with ESPnet: Architecture, Auxiliary Loss and Decoding Strategies.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2021

Understanding the Tradeoffs in Client-side Privacy for Downstream Speech Tasks.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2021

A Study on Speech Enhancement Based on Diffusion Probabilistic Model.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2021

2020
Automated Development of DNN Based Spoken Language Systems Using Evolutionary Algorithms.
Proceedings of the Deep Neural Evolution - Deep Learning with Evolutionary Computation, 2020

Improving End-to-End Single-Channel Multi-Talker Speech Recognition.
IEEE ACM Trans. Audio Speech Lang. Process., 2020

Multi-Stream End-to-End Speech Recognition.
IEEE ACM Trans. Audio Speech Lang. Process., 2020

The 2020 ESPnet update: new features, broadened applications, performance improvements, and future plans.
CoRR, 2020

Continuous Speech Separation Using Speaker Inventory for Long Multi-talker Recording.
CoRR, 2020

Convolutive Transfer Function Invariant SDR training criteria for Multi-Channel Reverberant Speech Separation.
CoRR, 2020

The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS.
CoRR, 2020

Augmentation adversarial training for unsupervised speaker recognition.
CoRR, 2020

Streaming Transformer ASR with Blockwise Synchronous Inference.
CoRR, 2020

The JHU Multi-Microphone Multi-Speaker ASR System for the CHiME-6 Challenge.
CoRR, 2020

Online End-to-End Neural Diarization with Speaker-Tracing Buffer.
CoRR, 2020

Neural Speaker Diarization with Speaker-Wise Chain Rule.
CoRR, 2020

DiscreTalk: Text-to-Speech as a Machine Translation Problem.
CoRR, 2020

CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings.
CoRR, 2020

End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-label Classification.
CoRR, 2020

Sequence to Multi-Sequence Learning via Conditional Chain Mapping for Mixture Signals.
Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020

End-to-End Far-Field Speech Recognition with Unified Dereverberation and Beamforming.
Proceedings of the Interspeech 2020, 2020

End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors.
Proceedings of the Interspeech 2020, 2020

Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict.
Proceedings of the Interspeech 2020, 2020

Insertion-Based Modeling for End-to-End Automatic Speech Recognition.
Proceedings of the Interspeech 2020, 2020

Learning Speaker Embedding from Text-to-Speech.
Proceedings of the Interspeech 2020, 2020

End-to-End ASR with Adaptive Span Self-Attention.
Proceedings of the Interspeech 2020, 2020

Speaker-Conditional Chain Model for Speech Separation and Extraction.
Proceedings of the Interspeech 2020, 2020

End-to-End Automatic Speech Recognition Integrated with CTC-Based Voice Activity Detection.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Far-Field Location Guided Target Speech Extraction Using End-to-End Speech Recognition Objectives.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Weakly-Supervised Sound Event Detection with Self-Attention.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

A Practical Two-Stage Training Strategy for Multi-Stream End-to-End Speech Recognition.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Semi-Supervised Speaker Adaptation for End-to-End Speech Synthesis with Pretrained Models.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Speaker Diarization with Region Proposal Network.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Attention-Based ASR with Lightweight and Dynamic Convolutions.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

End-To-End Multi-Speaker Speech Recognition With Transformer.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Conformer-Based Sound Event Detection with Semi-Supervised Learning and Data Augmentation.
Proceedings of the 5th Workshop on Detection and Classification of Acoustic Scenes and Events 2020 (DCASE 2020), 2020

ESPnet-ST: All-in-One Speech Translation Toolkit.
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020

2019
Evolution-Strategy-Based Automation of System Development for High-Performance Speech Recognition.
IEEE ACM Trans. Audio Speech Lang. Process., 2019

Speech Processing for Digital Home Assistants: Combining signal processing with deep-learning techniques.
IEEE Signal Process. Mag., 2019

Introduction to the Issue on Far-Field Speech Processing in the Era of Deep Learning: Speech Enhancement, Separation, and Recognition.
IEEE J. Sel. Top. Signal Process., 2019

Phasebook and Friends: Leveraging Discrete Representations for Source Separation.
IEEE J. Sel. Top. Signal Process., 2019

Listen and Fill in the Missing Letters: Non-Autoregressive Transformer for Speech Recognition.
CoRR, 2019

Towards Online End-to-end Transformer Automatic Speech Recognition.
CoRR, 2019

Self-supervised Sequence-to-sequence ASR using Unpaired Speech and Text.
CoRR, 2019

Dry, Focus, and Transcribe: End-to-End Integration of Dereverberation, Beamforming, and ASR.
CoRR, 2019

Generalized Weighted-Prediction-Error Dereverberation with Varying Source Priors For Reverberant Speech Recognition.
Proceedings of the 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2019

Speech Enhancement Using End-to-End Speech Recognition Objectives.
Proceedings of the 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2019

Analysis of Robustness of Deep Single-Channel Speech Separation Using Corpora Constructed From Multiple Domains.
Proceedings of the 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2019

Massively Multilingual Adversarial Speech Recognition.
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019

ESPnet How2 Speech Translation System for IWSLT 2019: Pre-training, Knowledge Distillation, and Going Deeper.
Proceedings of the 16th International Conference on Spoken Language Translation, 2019

Pretraining by Backtranslation for End-to-End ASR in Low-Resource Settings.
Proceedings of the Interspeech 2019, 2019

End-to-End Multilingual Multi-Speaker Speech Recognition.
Proceedings of the Interspeech 2019, 2019

Vectorized Beam Search for CTC-Attention-Based Speech Recognition.
Proceedings of the Interspeech 2019, 2019

Study of the Performance of Automatic Speech Recognition Systems in Speakers with Parkinson's Disease.
Proceedings of the Interspeech 2019, 2019

Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration.
Proceedings of the Interspeech 2019, 2019

Analysis of Multilingual Sequence-to-Sequence Speech Recognition Systems.
Proceedings of the Interspeech 2019, 2019

Auxiliary Interference Speaker Loss for Target-Speaker Speech Recognition.
Proceedings of the Interspeech 2019, 2019

Pre-Trained Text Embeddings for Enhanced Text-to-Speech Synthesis.
Proceedings of the Interspeech 2019, 2019

Speaker Recognition Benchmark Using the CHiME-5 Corpus.
Proceedings of the Interspeech 2019, 2019

End-to-End Neural Speaker Diarization with Permutation-Free Objectives.
Proceedings of the Interspeech 2019, 2019

End-to-End SpeakerBeam for Single Channel Target Speech Recognition.
Proceedings of the Interspeech 2019, 2019

Semi-Supervised Sequence-to-Sequence ASR Using Unpaired Speech and Text.
Proceedings of the Interspeech 2019, 2019

Weakly-Supervised Deep Recurrent Neural Networks for Basic Dance Step Generation.
Proceedings of the International Joint Conference on Neural Networks, 2019

Using ASR Methods for OCR.
Proceedings of the 2019 International Conference on Document Analysis and Recognition, 2019

Improving End-to-end Speech Recognition with Pronunciation-assisted Sub-word Modeling.
Proceedings of the IEEE International Conference on Acoustics, 2019

Stream Attention-based Multi-array End-to-end Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2019

The Phasebook: Building Complex Masks via Discrete Representations for Source Separation.
Proceedings of the IEEE International Conference on Acoustics, 2019

Acoustic Modeling for Overlapping Speech Recognition: JHU CHiME-5 Challenge System.
Proceedings of the IEEE International Conference on Acoustics, 2019

Joint Acoustic and Class Inference for Weakly Supervised Sound Event Detection.
Proceedings of the IEEE International Conference on Acoustics, 2019

Semi-supervised End-to-end Speech Recognition Using Text-to-speech and Autoencoders.
Proceedings of the IEEE International Conference on Acoustics, 2019

Acoustic Modeling for Distant Multi-talker Speech Recognition with Single- and Multi-channel Branches.
Proceedings of the IEEE International Conference on Acoustics, 2019

Transfer Learning of Language-independent End-to-end ASR with Language Model Fusion.
Proceedings of the IEEE International Conference on Acoustics, 2019

Cycle-consistency Training for End-to-end Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2019

Language Model Integration Based on Memory Control for Sequence to Sequence Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2019

End-to-end Monaural Multi-speaker ASR System without Pretraining.
Proceedings of the IEEE International Conference on Acoustics, 2019

Promising Accurate Prefix Boosting for Sequence-to-sequence ASR.
Proceedings of the IEEE International Conference on Acoustics, 2019

CNN-based Multichannel End-to-End Speech Recognition for Everyday Home Environments.
Proceedings of the 27th European Signal Processing Conference, 2019

Espresso: A Fast End-to-End Neural Speech Recognition Toolkit.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

Transformer ASR with Contextual Block Processing.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

A Comparative Study on Transformer vs RNN in Speech Applications.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

Simultaneous Speech Recognition and Speaker Diarization for Monaural Dialogue Recordings with Target-Speaker Acoustic Models.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

Multilingual End-to-End Speech Translation.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

End-to-End Neural Speaker Diarization with Self-Attention.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

2018
Low Resource Multi-modal Data Augmentation for End-to-end ASR.
CoRR, 2018

Multi-encoder multi-resolution framework for end-to-end speech recognition.
CoRR, 2018

Vectorization of hypotheses and speech for faster beam search in encoder decoder-based speech recognition.
CoRR, 2018

CNN-based MultiChannel End-to-End Speech Recognition for everyday home environments.
CoRR, 2018

Building Corpora for Single-Channel Speech Separation Across Multiple Domains.
CoRR, 2018

Low-Resource Contextual Topic Identification on Speech.
Proceedings of the 2018 IEEE Spoken Language Technology Workshop, 2018

End-to-end Speech Recognition With Word-Based RNN Language Models.
Proceedings of the 2018 IEEE Spoken Language Technology Workshop, 2018

Back-Translation-Style Data Augmentation for end-to-end ASR.
Proceedings of the 2018 IEEE Spoken Language Technology Workshop, 2018

Multilingual Sequence-to-Sequence Speech Recognition: Architecture, Transfer Learning, and Language Modeling.
Proceedings of the 2018 IEEE Spoken Language Technology Workshop, 2018

The JHU/KyotoU Speech Translation System for IWSLT 2018.
Proceedings of the 15th International Conference on Spoken Language Translation, 2018


Student-Teacher Learning for BLSTM Mask-based Speech Enhancement.
Proceedings of the Interspeech 2018, 2018

Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge.
Proceedings of the Interspeech 2018, 2018

Multi-Modal Data Augmentation for End-to-end ASR.
Proceedings of the Interspeech 2018, 2018

Semi-Supervised End-to-End Speech Recognition.
Proceedings of the Interspeech 2018, 2018

Multi-Head Decoder for End-to-End Speech Recognition.
Proceedings of the Interspeech 2018, 2018

Effectiveness of Single-Channel BLSTM Enhancement for Language Identification.
Proceedings of the Interspeech 2018, 2018

Auxiliary Feature Based Adaptation of End-to-end ASR Systems.
Proceedings of the Interspeech 2018, 2018

Building State-of-the-art Distant Speech Recognition Using the CHiME-4 Challenge with a Setup of Speech Enhancement Baseline.
Proceedings of the Interspeech 2018, 2018

The Fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, Task and Baselines.
Proceedings of the Interspeech 2018, 2018

End-to-End Multi-Speaker Speech Recognition.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

An End-to-End Language-Tracking Speech Recognizer for Mixed-Language Speech.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

Speaker Adaptation for Multichannel End-to-End Speech Recognition.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

A Purely End-to-End System for Multi-speaker Speech Recognition.
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018

2017
Duration-Controlled LSTM for Polyphonic Sound Event Detection.
IEEE ACM Trans. Audio Speech Lang. Process., 2017

Hybrid CTC/Attention Architecture for End-to-End Speech Recognition.
IEEE J. Sel. Top. Signal Process., 2017

Unified Architecture for Multichannel End-to-End Speech Recognition With Neural Beamforming.
IEEE J. Sel. Top. Signal Process., 2017

Prior-based Binary Masking and Discriminative Methods for Reverberant and Noisy Speech Recognition Using Distant Stereo Microphones.
J. Inf. Process., 2017

An analysis of environment, microphone and data simulation mismatches in robust speech recognition.
Comput. Speech Lang., 2017

Multi-microphone speech recognition integrating beamforming, robust feature extraction, and advanced DNN/RNN backend.
Comput. Speech Lang., 2017

The third 'CHiME' speech separation and recognition challenge: Analysis and outcomes.
Comput. Speech Lang., 2017

Multi-microphone speech recognition in everyday environments.
Comput. Speech Lang., 2017

Does speech enhancement work with end-to-end ASR objectives?: Experimental analysis of multichannel end-to-end ASR.
Proceedings of the 27th IEEE International Workshop on Machine Learning for Signal Processing, 2017

Coupled Initialization of Multi-Channel Non-Negative Matrix Factorization Based on Spatial and Spectral Information.
Proceedings of the Interspeech 2017, 2017

Semi-Supervised Learning of a Pronunciation Dictionary from Disjoint Phonemic Transcripts and Text.
Proceedings of the Interspeech 2017, 2017

Advances in Joint CTC-Attention Based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM.
Proceedings of the Interspeech 2017, 2017

Multichannel End-to-end Speech Recognition.
Proceedings of the 34th International Conference on Machine Learning, 2017

Student-teacher network learning with enhanced features.
Proceedings of the 2017 IEEE International Conference on Acoustics, 2017

Deep long short-term memory adaptive beamforming networks for multichannel robust speech recognition.
Proceedings of the 2017 IEEE International Conference on Acoustics, 2017

Joint CTC-attention based end-to-end speech recognition using multi-task learning.
Proceedings of the 2017 IEEE International Conference on Acoustics, 2017

BLSTM-HMM hybrid system combined with sound activity detection network for polyphonic Sound Event Detection.
Proceedings of the 2017 IEEE International Conference on Acoustics, 2017

Language independent end-to-end architecture for joint language identification and speech recognition.
Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop, 2017

Composite embedding systems for ZeroSpeech2017 Track1.
Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop, 2017

Multi-level language modeling and decoding for open vocabulary end-to-end speech recognition.
Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop, 2017

Joint CTC/attention decoding for end-to-end speech recognition.
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017

Discriminative Beamforming with Phase-Aware Neural Networks for Speech Enhancement and Recognition.
Proceedings of the New Era for Robust Speech Recognition, Exploiting Deep Learning., 2017

Toolkits for Robust Speech Processing.
Proceedings of the New Era for Robust Speech Recognition, Exploiting Deep Learning., 2017

Preliminaries.
Proceedings of the New Era for Robust Speech Recognition, Exploiting Deep Learning., 2017

Training Data Augmentation and Data Selection.
Proceedings of the New Era for Robust Speech Recognition, Exploiting Deep Learning., 2017

Novel Deep Architectures in Speech Processing.
Proceedings of the New Era for Robust Speech Recognition, Exploiting Deep Learning., 2017

Deep Recurrent Networks for Separation and Recognition of Single-Channel Speech in Nonstationary Background Audio.
Proceedings of the New Era for Robust Speech Recognition, Exploiting Deep Learning., 2017

The CHiME Challenges: Robust Speech Recognition in Everyday Environments.
Proceedings of the New Era for Robust Speech Recognition, Exploiting Deep Learning., 2017

2016
Automated structure discovery and parameter tuning of neural network language model based on evolution strategy.
Proceedings of the 2016 IEEE Spoken Language Technology Workshop, 2016

Dialog state tracking with attention-based sequence-to-sequence learning.
Proceedings of the 2016 IEEE Spoken Language Technology Workshop, 2016

Data Selection by Sequence Summarizing Neural Network in Mismatch Condition Training.
Proceedings of the Interspeech 2016, 2016

Single-Channel Multi-Speaker Separation Using Deep Clustering.
Proceedings of the Interspeech 2016, 2016

Context-Sensitive and Role-Dependent Spoken Language Understanding Using Bidirectional and Attention LSTMs.
Proceedings of the Interspeech 2016, 2016

Improved MVDR Beamforming Using Single-Channel Mask Prediction Networks.
Proceedings of the Interspeech 2016, 2016

Driver confusion status detection using recurrent neural networks.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2016

Deep beamforming networks for multi-channel speech recognition.
Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

Deep unfolding for multichannel source separation.
Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

Sequence summarizing neural network for speaker adaptation.
Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

Minimum word error training of long short-term memory recurrent neural network language models for speech recognition.
Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

Deep clustering: Discriminative embeddings for segmentation and separation.
Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

High-accuracy user identification using EEG biometrics.
Proceedings of the 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2016

Bidirectional LSTM-HMM Hybrid System for Polyphonic Sound Event Detection.
Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events, 2016

Beamforming networks using spatial covariance features for far-field speech recognition.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2016

2015
Effectiveness of dereverberation, feature transformation, discriminative training methods, and system combination approach for various reverberant environments.
EURASIP J. Adv. Signal Process., 2015

Uncertainty training and decoding methods of deep neural networks based on stochastic representation of enhanced features.
Proceedings of the INTERSPEECH 2015, 2015

Efficient learning for spoken language understanding tasks with word embedding based pre-training.
Proceedings of the INTERSPEECH 2015, 2015

Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks.
Proceedings of the INTERSPEECH 2015, 2015

Robust speech processing using observation uncertainty and uncertainty propagation: session and paper overview.
Proceedings of the INTERSPEECH 2015, 2015

Uncertainty propagation through deep neural networks.
Proceedings of the INTERSPEECH 2015, 2015

Discriminative method for recurrent neural network language models.
Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

Structure discovery of deep neural network based on evolutionary algorithms.
Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks.
Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR.
Proceedings of the Latent Variable Analysis and Signal Separation, 2015

Automation of system building for state-of-the-art large vocabulary speech recognition using evolution strategy.
Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, 2015

Robust speech recognition in unknown reverberant and noisy conditions.
Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, 2015

The MERL/SRI system for the 3RD CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition.
Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, 2015

The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines.
Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, 2015

Feature-space structural MAPLR with regression tree-based multiple transformation matrices for DNN.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2015

Bayesian Speech and Language Processing.
Cambridge University Press, ISBN: 9781107295360, 2015

2014
Structural Bayesian Linear Regression for Hidden Markov Models.
J. Signal Process. Syst., 2014

Discriminative NMF and its application to single-channel source separation.
Proceedings of the INTERSPEECH 2014, 2014

Cost-level integration of statistical and rule-based dialog managers.
Proceedings of the INTERSPEECH 2014, 2014

Sequential maximum mutual information linear discriminant analysis for speech recognition.
Proceedings of the INTERSPEECH 2014, 2014

Deep recurrent de-noising auto-encoder and blind de-reverberation for reverberated speech recognition.
Proceedings of the IEEE International Conference on Acoustics, 2014

Recurrent deep neural networks for robust speech recognition.
Proceedings of the IEEE International Conference on Acoustics, 2014

Black box optimization for automatic speech recognition.
Proceedings of the IEEE International Conference on Acoustics, 2014

Log-linear dialog manager.
Proceedings of the IEEE International Conference on Acoustics, 2014

Ensemble integration of calibrated speaker localization and statistical speech detection in domestic environments.
Proceedings of the 4th Joint Workshop on Hands-free Speech Communication and Microphone Arrays, 2014

Sequence discriminative training for low-rank deep neural networks.
Proceedings of the 2014 IEEE Global Conference on Signal and Information Processing, 2014

2013
Feature Enhancement With Joint Use of Consecutive Corrupted and Noise Feature Vectors With Discriminative Region Weighting.
IEEE Trans. Speech Audio Process., 2013

Influence relation estimation based on lexical entrainment in conversation.
Speech Commun., 2013

Prior-shared feature and model space speaker adaptation by consistently employing MAP estimation.
Speech Commun., 2013

Training data selection with user's physical characteristics data for acceleration-based activity modeling.
Pers. Ubiquitous Comput., 2013

Cluster-based dynamic variance adaptation for interconnecting speech enhancement pre-processor and speech recognizer.
Comput. Speech Lang., 2013

Speech recognition in living rooms: Integrated speech enhancement and recognition system based on spatial, spectral and temporal modeling of sounds.
Comput. Speech Lang., 2013

Ensemble learning for speech enhancement.
Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2013

Blocked Gibbs sampling based multi-scale mixture model for speaker clustering on noisy data.
Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing, 2013

Discriminative training of acoustic models for system combination.
Proceedings of the INTERSPEECH 2013, 2013

Statistical Dialogue Management using Intention Dependency Graph.
Proceedings of the Sixth International Joint Conference on Natural Language Processing, 2013

Stereo-based feature enhancement using dictionary learning.
Proceedings of the IEEE International Conference on Acoustics, 2013

The second 'CHiME' speech separation and recognition challenge: Datasets, tasks and baselines.
Proceedings of the IEEE International Conference on Acoustics, 2013

Effectiveness of discriminative training and feature transformation for reverberated and noisy speech.
Proceedings of the IEEE International Conference on Acoustics, 2013

The second 'CHiME' speech separation and recognition challenge: An overview of challenge systems and outcomes.
Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, 2013

A generalized discriminative training framework for system combination.
Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, 2013

2012
Statistical Voice Conversion Based on Noisy Channel Model.
IEEE Trans. Speech Audio Process., 2012

Structural Classification Methods Based on Weighted Finite-State Transducers for Automatic Speech Recognition.
IEEE Trans. Speech Audio Process., 2012

Low-Latency Real-Time Meeting Recognition and Understanding Using Distant Microphones and Omni-Directional Camera.
IEEE Trans. Speech Audio Process., 2012

Frame-wise model re-estimation method based on Gaussian pruning with weight normalization for noise robust voice activity detection.
Speech Commun., 2012

Fully Bayesian speaker clustering based on hierarchically structured utterance-oriented Dirichlet process mixture model.
Proceedings of the INTERSPEECH 2012, 2012

Bag Of ARCS: New representation of speech segment features based on finite state machines.
Proceedings of the 2012 IEEE International Conference on Acoustics, 2012

Fully Bayesian inference of multi-mixture Gaussian model and its evaluation using speaker clustering.
Proceedings of the 2012 IEEE International Conference on Acoustics, 2012

MFCC enhancement using joint corrupted and noise feature space for highly non-stationary noise environments.
Proceedings of the 2012 IEEE International Conference on Acoustics, 2012

Effect of dialog acts on word use in polylogue.
Proceedings of the 2012 IEEE International Conference on Acoustics, 2012

Basis vector orthogonalization for an improved kernel gradient matching pursuit method.
Proceedings of the 2012 IEEE International Conference on Acoustics, 2012

Decoding network optimization using minimum transition error training.
Proceedings of the 2012 IEEE International Conference on Acoustics, 2012

Noise suppression with unsupervised joint speaker adaptation and noise mixture model estimation.
Proceedings of the 2012 IEEE International Conference on Acoustics, 2012

Discriminative feature transforms using differenced maximum mutual information.
Proceedings of the 2012 IEEE International Conference on Acoustics, 2012

Handling uncertain observations in unsupervised topic-mixture language model adaptation.
Proceedings of the 2012 IEEE International Conference on Acoustics, 2012

2011
Topic tracking language model for speech recognition.
Comput. Speech Lang., 2011

Bayesian linear regression for Hidden Markov Model based on optimizing variational bounds.
Proceedings of the 2011 IEEE International Workshop on Machine Learning for Signal Processing, 2011

Unsupervised Activity Recognition with User's Physical Characteristics Data.
Proceedings of the 15th IEEE International Symposium on Wearable Computers (ISWC 2011), 2011

Model Adaptation for Automatic Speech Recognition Based on Multiple Time Scale Evolution.
Proceedings of the INTERSPEECH 2011, 2011

Speaker Clustering Based on Utterance-Oriented Dirichlet Process Mixture Model.
Proceedings of the INTERSPEECH 2011, 2011

Learning Influences from Word Use in Polylogue.
Proceedings of the INTERSPEECH 2011, 2011

A Robust Estimation Method of Noise Mixture Model for Noise Suppression.
Proceedings of the INTERSPEECH 2011, 2011

Fashion Coordinates Recommender System Using Photographs from Fashion Magazines.
Proceedings of the IJCAI 2011, 2011

Gibbs sampling based Multi-scale Mixture Model for speaker clustering.
Proceedings of the IEEE International Conference on Acoustics, 2011

High accurate model-integration-based voice conversion using dynamic features and model structure optimization.
Proceedings of the IEEE International Conference on Acoustics, 2011

Subspace pursuit method for kernel-log-linear models.
Proceedings of the IEEE International Conference on Acoustics, 2011

Non-stationary noise estimation method based on bias-residual component decomposition for robust speech recognition.
Proceedings of the IEEE International Conference on Acoustics, 2011

Variance Compensation for Recognition of Reverberant Speech with Dereverberation Preprocessing.
Proceedings of the Robust Speech Recognition of Uncertain or Missing Data, 2011

2010
Predictor-Corrector Adaptation by Using Time Evolution System With Macroscopic Time Scale.
IEEE Trans. Speech Audio Process., 2010

A Sequential Pattern Classifier Based on Hidden Markov Kernel Machine and Its Application to Phoneme Classification.
IEEE J. Sel. Top. Signal Process., 2010

Online Unsupervised Classification With Model Comparison in the Variational Bayes Framework for Voice Activity Detection.
IEEE J. Sel. Top. Signal Process., 2010

Application of topic tracking model to language model adaptation and meeting analysis.
Proceedings of the 2010 IEEE Spoken Language Technology Workshop, 2010

Real-time meeting recognition and understanding using distant microphones and omni-directional camera.
Proceedings of the 2010 IEEE Spoken Language Technology Workshop, 2010

Large vocabulary continuous speech recognition using WFST-based linear classifier for structured data.
Proceedings of the INTERSPEECH 2010, 2010

Probabilistic integration of joint density model and speaker model for voice conversion.
Proceedings of the INTERSPEECH 2010, 2010

A regularized discriminative training method of acoustic models derived by minimum relative entropy discrimination.
Proceedings of the INTERSPEECH 2010, 2010

Improvements of search error risk minimization in Viterbi beam search for speech recognition.
Proceedings of the INTERSPEECH 2010, 2010

Voice activity detection using frame-wise model re-estimation method based on Gaussian pruning with weight normalization.
Proceedings of the INTERSPEECH 2010, 2010

Minimum Error Classification with geometric margin control.
Proceedings of the IEEE International Conference on Acoustics, 2010

A discriminative model for continuous speech recognition based on Weighted Finite State Transducers.
Proceedings of the IEEE International Conference on Acoustics, 2010

Discriminative training based on an integrated view of MPE and MMI in margin and error space.
Proceedings of the IEEE International Conference on Acoustics, 2010

Search error risk minimization in Viterbi beam search for speech recognition.
Proceedings of the IEEE International Conference on Acoustics, 2010

Using online model comparison in the Variational Bayes framework for online unsupervised Voice Activity Detection.
Proceedings of the IEEE International Conference on Acoustics, 2010

Fast similarity search on a large speech data set with neighborhood graph indexing.
Proceedings of the IEEE International Conference on Acoustics, 2010

2009
Static and Dynamic Variance Compensation for Recognition of Reverberant Speech With Dereverberation Preprocessing.
IEEE Trans. Speech Audio Process., 2009

Margin-space integration of MPE loss via differencing of MMI functionals for generalized error-weighted discriminative training.
Proceedings of the INTERSPEECH 2009, 2009

Stereo-input speech recognition using sparseness-based time-frequency masking in a reverberant environment.
Proceedings of the INTERSPEECH 2009, 2009

Topic Tracking Model for Analyzing Consumer Purchase Behavior.
Proceedings of the IJCAI 2009, 2009

On-line adaptation and Bayesian detection of environmental changes based on a macroscopic time evolution system.
Proceedings of the IEEE International Conference on Acoustics, 2009

A unified view for discriminative objective functions based on negative exponential of difference measure between strings.
Proceedings of the IEEE International Conference on Acoustics, 2009

2008
A unified interpretation of adaptation approaches based on a macroscopic time evolution system and indirect/direct adaptation approaches.
Proceedings of the IEEE International Conference on Acoustics, 2008

Combined static and dynamic variance adaptation for efficient interconnection of speech enhancement pre-processor with speech recognizer.
Proceedings of the IEEE International Conference on Acoustics, 2008

2007
Incremental Adaptation Based on a Macroscopic Time Evolution System.
Proceedings of the IEEE International Conference on Acoustics, 2007

2006
Automatic determination of acoustic model topology using variational Bayesian estimation and clustering for large vocabulary continuous speech recognition.
IEEE Trans. Speech Audio Process., 2006

Speech Recognition Based on Student's t-Distribution Derived from Total Bayesian Framework.
IEICE Trans. Inf. Syst., 2006

Advanced computational models and learning theories for spoken language processing.
IEEE Comput. Intell. Mag., 2006

Acoustic Model Adaptation Based on Coarse/Fine Training of Transfer Vectors Using Directional Statistics.
Proceedings of the 2006 IEEE International Conference on Acoustics Speech and Signal Processing, 2006

2005
Selection of Shared-State Hidden Markov Model Structure Using Bayesian Criterion.
IEICE Trans. Inf. Syst., 2005

Effects of Bayesian predictive classification using variational Bayesian posteriors for sparse training data in speech recognition.
Proceedings of the INTERSPEECH 2005, 2005

2004
Variational Bayesian estimation and clustering for speech recognition.
IEEE Trans. Speech Audio Process., 2004

Acoustic model adaptation based on coarse/fine training of transfer vectors and its application to a speaker adaptation task.
Proceedings of the INTERSPEECH 2004, 2004

Bayesian modelling of the speech spectrum using mixture of Gaussians.
Proceedings of the 2004 IEEE International Conference on Acoustics, 2004

Automatic determination of acoustic model topology using variational Bayesian estimation and clustering.
Proceedings of the 2004 IEEE International Conference on Acoustics, 2004

2003
Application of variational Bayesian estimation and clustering to acoustic model adaptation.
Proceedings of the 2003 IEEE International Conference on Acoustics, 2003

2002
Application of Variational Bayesian Approach to Speech Recognition.
Proceedings of the Advances in Neural Information Processing Systems 15, 2002

Constructing shared-state hidden Markov models based on a Bayesian approach.
Proceedings of the 7th International Conference on Spoken Language Processing, ICSLP2002, 2002

