Yuki Saito
ORCID: 0000-0002-7967-2613
Affiliations:
- University of Tokyo, Department of Information Physics and Computing, Tokyo, Japan (PhD 2021)
According to our database, Yuki Saito authored at least 62 papers between 2017 and 2025.
Links
Online presence:
- linkedin.com
- twitter.com
- sython.org
- orcid.org
Bibliography
2025
Multi-Sampling-Frequency Naturalness MOS Prediction Using Self-Supervised Learning Model with Sampling-Frequency-Independent Layer.
CoRR, July, 2025
RELATE: Subjective evaluation dataset for automatic evaluation of relevance between text and audio.
CoRR, June, 2025
CoRR, June, 2025
Causal Speech Enhancement with Predicting Semantics based on Quantized Self-supervised Learning Features.
Proceedings of the 2025 IEEE International Conference on Acoustics, 2025
Measuring Time Delay Tolerance in Third-Person Live Commentary for Super Smash Bros. Ultimate.
Proceedings of the IEEE Conference on Games, 2025
2024
J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling.
CoRR, 2024
UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge.
CoRR, 2024
Building speech corpus with diverse voice characteristics for its prompt-based representation.
CoRR, 2024
JVNV: A Corpus of Japanese Emotional Speech With Verbal Content and Nonverbal Expressions.
IEEE Access, 2024
Cross-Dialect Text-to-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT.
Proceedings of the IEEE Spoken Language Technology Workshop, 2024
The T05 System for the VoiceMOS Challenge 2024: Transfer Learning from Deep Image Classifier to Naturalness MOS Prediction of High-Quality Synthetic Speech.
Proceedings of the IEEE Spoken Language Technology Workshop, 2024
Frame-Wise Breath Detection with Self-Training: An Exploration of Enhancing Breath Naturalness in Text-to-Speech.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024
Spatial Voice Conversion: Voice Conversion Preserving Spatial Information and Non-target Signals.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024
Noise-Robust Voice Conversion by Conditional Denoising Training Using Latent Variables of Recording Quality and Environment.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024
StyleCap: Automatic Speaking-Style Captioning from Speech Based on Speech and Language Self-Supervised Learning Models.
Proceedings of the IEEE International Conference on Acoustics, 2024
NecoBERT: Self-Supervised Learning Model Trained by Masked Language Modeling on Rich Acoustic Features Derived from Neural Audio Codec.
Proceedings of the Asia Pacific Signal and Information Processing Association Annual Summit and Conference, 2024
Real-Time Noise Estimation for Lombard-Effect Speech Synthesis in Human-Avatar Dialogue Systems.
Proceedings of the Asia Pacific Signal and Information Processing Association Annual Summit and Conference, 2024
2023
JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions.
Dataset, October, 2023
Proceedings of the 12th ISCA Speech Synthesis Workshop, 2023
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023
ChatGPT-EDSS: Empathetic Dialogue Speech Synthesis Trained from ChatGPT-derived Context Word Embeddings.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023
CALLS: Japanese Empathetic Dialogue Speech Corpus of Complaint Handling and Attentive Listening in Customer Center.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023
Duration-Aware Pause Insertion Using Pre-Trained Language Model for Multi-Speaker Text-To-Speech.
Proceedings of the IEEE International Conference on Acoustics, 2023
Mid-Attribute Speaker Generation Using Optimal-Transport-Based Interpolation of Gaussian Mixture Models.
Proceedings of the IEEE International Conference on Acoustics, 2023
Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-Based Control.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023
2022
CoRR, 2022
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022
Acoustic Modeling for End-to-End Empathetic Dialogue Speech Synthesis Using Linguistic and Prosodic Contexts of Dialogue History.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022
Predicting VQVAE-based Character Acting Style from Quotation-Annotated Text for Audiobook Speech Synthesis.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022
2021
Perceptual-Similarity-Aware Deep Speaker Representation Learning for Multi-Speaker Generative Modeling.
IEEE ACM Trans. Audio Speech Lang. Process., 2021
Real-Time Full-Band Voice Conversion with Sub-Band Modeling and Data-Driven Phase Estimation of Spectral Differentials.
IEICE Trans. Inf. Syst., 2021
DNN-Based Low-Musical-Noise Single-Channel Speech Enhancement Based on Higher-Order-Moments Matching.
IEICE Trans. Inf. Syst., 2021
Cross-Lingual Speaker Adaptation Using Domain Adaptation and Speaker Consistency Loss for Text-To-Speech Synthesis.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, 2021
HumanACGAN: Conditional Generative Adversarial Network with Human-Based Auxiliary Classifier and Its Evaluation in Phoneme Perception.
Proceedings of the IEEE International Conference on Acoustics, 2021
Emotion-Controllable Speech Synthesis Using Emotion Soft Labels and Fine-Grained Prosody Factors.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2021
2020
Phase reconstruction from amplitude spectrograms based on directional-statistics deep neural networks.
Signal Process., 2020
Joint Adversarial Training of Speech Recognition and Synthesis Models for Many-to-One Voice Conversion Using Phonetic Posteriorgrams.
IEICE Trans. Inf. Syst., 2020
Generative Moment Matching Network-Based Neural Double-Tracking for Synthesized and Natural Singing Voices.
IEICE Trans. Inf. Syst., 2020
Proceedings of The 12th Language Resources and Evaluation Conference, 2020
SMASH Corpus: A Spontaneous Speech Corpus Recording Third-person Audio Commentaries on Gameplay.
Proceedings of The 12th Language Resources and Evaluation Conference, 2020
Investigating Effective Additional Contextual Factors in DNN-Based Spontaneous Speech Synthesis.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020
Cross-Lingual Text-To-Speech Synthesis via Domain Adaptation and Perceptual Similarity Regression in Speaker Space.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020
Face2Speech: Towards Multi-Speaker Text-to-Speech Synthesis Using an Embedding Vector Predicted from a Face Image.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020
Lifter Training and Sub-Band Modeling for Computationally Efficient and High-Quality Voice Conversion Using Spectral Differentials.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020
HumanGAN: Generative Adversarial Network with Human-Based Discriminator and Its Evaluation in Speech Perception Modeling.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020
2019
Vocoder-free text-to-speech synthesis incorporating generative adversarial networks using low-/multi-frequency STFT amplitude spectra.
Comput. Speech Lang., 2019
DNN-based Speaker Embedding Using Subjective Inter-speaker Similarity for Multi-speaker Modeling in Speech Synthesis.
Proceedings of the 10th ISCA Speech Synthesis Workshop, 2019
Proceedings of the 10th ISCA Speech Synthesis Workshop, 2019
Generative Moment Matching Network-based Random Modulation Post-filter for DNN-based Singing Voice Synthesis and Neural Double-tracking.
Proceedings of the IEEE International Conference on Acoustics, 2019
2018
Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks.
IEEE ACM Trans. Audio Speech Lang. Process., 2018
Phase Reconstruction from Amplitude Spectrograms Based on Von-Mises-Distribution Deep Neural Network.
Proceedings of the 16th International Workshop on Acoustic Signal Enhancement, 2018
Text-to-Speech Synthesis Using STFT Spectra Based on Low-/Multi-Resolution Generative Adversarial Networks.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018
Non-Parallel Voice Conversion Using Variational Autoencoders Conditioned by Phonetic Posteriorgrams and D-Vectors.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018
Generative approach using the noise generation models for DNN-based speech synthesis trained from noisy speech.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2018
2017
IEICE Trans. Inf. Syst., 2017
Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities.
Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017
Training algorithm to deceive Anti-Spoofing Verification for DNN-based speech synthesis.
Proceedings of the 2017 IEEE International Conference on Acoustics, 2017