Yuki Saito

Orcid: 0000-0002-7967-2613

Affiliations:
  • University of Tokyo, Department of Information Physics and Computing, Tokyo, Japan (PhD 2021)


According to our database, Yuki Saito authored at least 62 papers between 2017 and 2025.

Collaborative distances:
  • Dijkstra number of five.
  • Erdős number of four.

Bibliography

2025
Multi-Sampling-Frequency Naturalness MOS Prediction Using Self-Supervised Learning Model with Sampling-Frequency-Independent Layer.
CoRR, July, 2025

RELATE: Subjective evaluation dataset for automatic evaluation of relevance between text and audio.
CoRR, June, 2025

Human-CLAP: Human-perception-based contrastive language-audio pretraining.
CoRR, June, 2025

Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis.
CoRR, May, 2025

Causal Speech Enhancement with Predicting Semantics based on Quantized Self-supervised Learning Features.
Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

Measuring Time Delay Tolerance in Third-Person Live Commentary for Super Smash Bros. Ultimate.
Proceedings of the IEEE Conference on Games, 2025

2024
J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling.
CoRR, 2024

UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge.
CoRR, 2024

Building speech corpus with diverse voice characteristics for its prompt-based representation.
CoRR, 2024

JVNV: A Corpus of Japanese Emotional Speech With Verbal Content and Nonverbal Expressions.
IEEE Access, 2024

Cross-Dialect Text-to-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT.
Proceedings of the IEEE Spoken Language Technology Workshop, 2024

The T05 System for the VoiceMOS Challenge 2024: Transfer Learning from Deep Image Classifier to Naturalness MOS Prediction of High-Quality Synthetic Speech.
Proceedings of the IEEE Spoken Language Technology Workshop, 2024

Frame-Wise Breath Detection with Self-Training: An Exploration of Enhancing Breath Naturalness in Text-to-Speech.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

Spatial Voice Conversion: Voice Conversion Preserving Spatial Information and Non-target Signals.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

SRC4VC: Smartphone-Recorded Corpus for Voice Conversion Benchmark.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

Noise-Robust Voice Conversion by Conditional Denoising Training Using Latent Variables of Recording Quality and Environment.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

StyleCap: Automatic Speaking-Style Captioning from Speech Based on Speech and Language Self-Supervised Learning Models.
Proceedings of the IEEE International Conference on Acoustics, 2024

NecoBERT: Self-Supervised Learning Model Trained by Masked Language Modeling on Rich Acoustic Features Derived from Neural Audio Codec.
Proceedings of the Asia Pacific Signal and Information Processing Association Annual Summit and Conference, 2024

Real-Time Noise Estimation for Lombard-Effect Speech Synthesis in Human-Avatar Dialogue Systems.
Proceedings of the Asia Pacific Signal and Information Processing Association Annual Summit and Conference, 2024

2023
JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions.
Dataset, October, 2023

Federated Learning for Human-in-the-Loop Many-to-Many Voice Conversion.
Proceedings of the 12th ISCA Speech Synthesis Workshop, 2023

HumanDiffusion: diffusion model using perceptual gradients.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

ChatGPT-EDSS: Empathetic Dialogue Speech Synthesis Trained from ChatGPT-derived Context Word Embeddings.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

CALLS: Japanese Empathetic Dialogue Speech Corpus of Complaint Handling and Attentive Listening in Customer Center.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Duration-Aware Pause Insertion Using Pre-Trained Language Model for Multi-Speaker Text-To-Speech.
Proceedings of the IEEE International Conference on Acoustics, 2023

Mid-Attribute Speaker Generation Using Optimal-Transport-Based Interpolation of Gaussian Mixture Models.
Proceedings of the IEEE International Conference on Acoustics, 2023

COCO-NUT: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-Based Control.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

2022
Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural Text-to-Speech.
CoRR, 2022

Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

STUDIES: Corpus of Japanese Empathetic Dialogue Speech Towards Friendly Voice Agent.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Acoustic Modeling for End-to-End Empathetic Dialogue Speech Synthesis Using Linguistic and Prosodic Contexts of Dialogue History.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Predicting VQVAE-based Character Acting Style from Quotation-Annotated Text for Audiobook Speech Synthesis.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

2021
Perceptual-Similarity-Aware Deep Speaker Representation Learning for Multi-Speaker Generative Modeling.
IEEE ACM Trans. Audio Speech Lang. Process., 2021

Real-Time Full-Band Voice Conversion with Sub-Band Modeling and Data-Driven Phase Estimation of Spectral Differentials.
IEICE Trans. Inf. Syst., 2021

DNN-Based Low-Musical-Noise Single-Channel Speech Enhancement Based on Higher-Order-Moments Matching.
IEICE Trans. Inf. Syst., 2021

Cross-Lingual Speaker Adaptation Using Domain Adaptation and Speaker Consistency Loss for Text-To-Speech Synthesis.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

HumanACGAN: Conditional Generative Adversarial Network with Human-Based Auxiliary Classifier and its Evaluation in Phoneme Perception.
Proceedings of the IEEE International Conference on Acoustics, 2021

Emotion-Controllable Speech Synthesis Using Emotion Soft Labels and Fine-Grained Prosody Factors.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2021

2020
Phase reconstruction from amplitude spectrograms based on directional-statistics deep neural networks.
Signal Process., 2020

Joint Adversarial Training of Speech Recognition and Synthesis Models for Many-to-One Voice Conversion Using Phonetic Posteriorgrams.
IEICE Trans. Inf. Syst., 2020

Generative Moment Matching Network-Based Neural Double-Tracking for Synthesized and Natural Singing Voices.
IEICE Trans. Inf. Syst., 2020

DNN-based Speech Synthesis Using Abundant Tags of Spontaneous Speech Corpus.
Proceedings of The 12th Language Resources and Evaluation Conference, 2020

SMASH Corpus: A Spontaneous Speech Corpus Recording Third-person Audio Commentaries on Gameplay.
Proceedings of The 12th Language Resources and Evaluation Conference, 2020

Investigating Effective Additional Contextual Factors in DNN-Based Spontaneous Speech Synthesis.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Cross-Lingual Text-To-Speech Synthesis via Domain Adaptation and Perceptual Similarity Regression in Speaker Space.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Real-Time, Full-Band, Online DNN-Based Voice Conversion System Using a Single CPU.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Face2Speech: Towards Multi-Speaker Text-to-Speech Synthesis Using an Embedding Vector Predicted from a Face Image.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Lifter Training and Sub-Band Modeling for Computationally Efficient and High-Quality Voice Conversion Using Spectral Differentials.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

HumanGAN: Generative Adversarial Network with Human-Based Discriminator and its Evaluation in Speech Perception Modeling.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

2019
Vocoder-free text-to-speech synthesis incorporating generative adversarial networks using low-/multi-frequency STFT amplitude spectra.
Comput. Speech Lang., 2019

JVS corpus: free Japanese multi-speaker voice corpus.
CoRR, 2019

DNN-based Speaker Embedding Using Subjective Inter-speaker Similarity for Multi-speaker Modeling in Speech Synthesis.
Proceedings of the 10th ISCA Speech Synthesis Workshop, 2019

V2S attack: building DNN-based voice conversion from automatic speaker verification.
Proceedings of the 10th ISCA Speech Synthesis Workshop, 2019

Generative Moment Matching Network-based Random Modulation Post-filter for DNN-based Singing Voice Synthesis and Neural Double-tracking.
Proceedings of the IEEE International Conference on Acoustics, 2019

2018
Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks.
IEEE ACM Trans. Audio Speech Lang. Process., 2018

Phase Reconstruction from Amplitude Spectrograms Based on Von-Mises-Distribution Deep Neural Network.
Proceedings of the 16th International Workshop on Acoustic Signal Enhancement, 2018

Text-to-Speech Synthesis Using STFT Spectra Based on Low-/Multi-Resolution Generative Adversarial Networks.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

Non-Parallel Voice Conversion Using Variational Autoencoders Conditioned by Phonetic Posteriorgrams and D-Vectors.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

Generative approach using the noise generation models for DNN-based speech synthesis trained from noisy speech.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2018

2017
Voice Conversion Using Input-to-Output Highway Networks.
IEICE Trans. Inf. Syst., 2017

Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities.
Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

Training algorithm to deceive Anti-Spoofing Verification for DNN-based speech synthesis.
Proceedings of the 2017 IEEE International Conference on Acoustics, 2017

