Ya Li

Orcid: 0000-0002-6284-5039

Affiliations:

Beijing University of Posts and Telecommunications, School of Artificial Intelligence, Beijing, China
Chinese Academy of Sciences (CAS), Institute of Automation, National Laboratory of Pattern Recognition, Beijing, China (PhD 2012)

According to our database¹, Ya Li authored at least 104 papers between 2009 and 2025.

Collaborative distances:

Dijkstra number² of four.
Erdős number³ of three.

Bibliography

2025

Video Demoireing Using Focused-Defocused Dual-Camera System.

IEEE Trans. Pattern Anal. Mach. Intell., December, 2025

DashFusion: Dual-Stream Alignment With Hierarchical Bottleneck Fusion for Multimodal Sentiment Analysis.

IEEE Trans. Neural Networks Learn. Syst., October, 2025

SynParaSpeech: Automated Synthesis of Paralinguistic Datasets for Speech Generation and Understanding.

CoRR, September, 2025

Deep Learning Approaches for Multimodal Intent Recognition: A Survey.

CoRR, July, 2025

MER 2025: When Affective Computing Meets Large Language Models.

CoRR, April, 2025

Psy-Copilot: Visual Chain of Thought for Counseling.

CoRR, March, 2025

Psy-Insight: Explainable Multi-turn Bilingual Dataset for Mental Health Counseling.

CoRR, March, 2025

SeeNet: A Soft Emotion Expert and Data Augmentation Method to Enhance Speech Emotion Recognition.

IEEE Trans. Affect. Comput., 2025

EEG-based Voice Conversion: Hearing the Voice of Your Brain.

Proceedings of the 26th Annual Conference of the International Speech Communication Association, 2025

OV-MER: Towards Open-Vocabulary Multimodal Emotion Recognition.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

DetailTTS: Learning Residual Detail Information for Zero-shot Text-to-speech.

Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

Beyond Surface Simplicity: Revealing Hidden Reasoning Attributes for Precise Commonsense Diagnosis.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

Controllable 3D Dance Generation Using Diffusion-Based Transformer U-Net.

Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

2024

DepressionMLP: A Multi-Layer Perceptron Architecture for Automatic Depression Level Prediction via Facial Keypoints and Action Units.

IEEE Trans. Circuits Syst. Video Technol., September, 2024

Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation.

IEEE ACM Trans. Audio Speech Lang. Process., 2024

Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab and Convolutional Recurrent Neural Networks.

Yingming Gao

Peter Birkholz

Ya Li

IEEE ACM Trans. Audio Speech Lang. Process., 2024

WavDepressionNet: Automatic Depression Level Prediction via Raw Speech Signals.

IEEE Trans. Affect. Comput., 2024

Mel-Refine: A Plug-and-Play Approach to Refine Mel-Spectrogram in Audio Generation.

CoRR, 2024

Open-vocabulary Multimodal Emotion Recognition: Dataset, Metric, and Benchmark.

CoRR, 2024

Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model.

CoRR, 2024

ExpressiveSinger: Synthesizing Expressive Singing Voice as an Instrument.

Proceedings of the 14th IEEE International Symposium on Chinese Spoken Language Processing, 2024

G2DiaR: Enhancing Commonsense Reasoning of LLMs with Graph-to-Dialogue & Reasoning.

Proceedings of the 14th IEEE International Symposium on Chinese Spoken Language Processing, 2024

ICAGC 2024: Inspirational and Convincing Audio Generation Challenge 2024.

Proceedings of the 14th IEEE International Symposium on Chinese Spoken Language Processing, 2024

Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

A Preliminary Study on Automatic Pronunciation Error Detection for Hearing-impaired Children.

Proceedings of the 10th International Conference on Communication and Information Processing, 2024

Frame-Level Emotional State Alignment Method for Speech Emotion Recognition.

Proceedings of the IEEE International Conference on Acoustics, 2024

Concss: Contrastive-based Context Comprehension for Dialogue-Appropriate Prosody in Conversational Speech Synthesis.

Proceedings of the IEEE International Conference on Acoustics, 2024

2023

Dual-Lens HDR using Guided 3D Exposure CNN and Guided Denoising Transformer.

ACM Trans. Multim. Comput. Commun. Appl., 2023

Dual Attention and Element Recalibration Networks for Automatic Depression Level Prediction.

IEEE Trans. Affect. Comput., 2023

M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis.

CoRR, 2023

Mining High-quality Samples from Raw Data and Majority Voting Method for Multimodal Emotion Recognition.

Qifei Li

Yingming Gao

Ya Li

Proceedings of the 31st ACM International Conference on Multimedia, 2023

CMCU-CSS: Enhancing Naturalness via Commonsense-based Multi-modal Context Understanding in Conversational Speech Synthesis.

Proceedings of the 31st ACM International Conference on Multimedia, 2023

FTA-net: A Frequency and Time Attention Network for Speech Depression Detection.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Exploring the interpretability in speech-based adolescent depression detection by SHAP.

Proceedings of the 9th International Conference on Communication and Information Processing, 2023

GaitParse: Gait Parsing Algorithm with Self-Supervised Fine-Tuning for Gait Recognition.

Proceedings of the 9th International Conference on Communication and Information Processing, 2023

M²-CTTS: End-to-End Multi-Scale Multi-Modal Conversational Text-to-Speech Synthesis.

Proceedings of the IEEE International Conference on Acoustics, 2023

2022

Selective Element and Two Orders Vectorization Networks for Automatic Depression Severity Diagnosis via Facial Changes.

IEEE Trans. Circuits Syst. Video Technol., 2022

Depressioner: Facial dynamic representation for automatic depression level prediction.

Expert Syst. Appl., 2022

A Keypoint Based Enhancement Method for Audio Driven Free View Talking Head Synthesis.

Proceedings of the 24th IEEE International Workshop on Multimedia Signal Processing, 2022

ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis.

Proceedings of the 13th International Symposium on Chinese Spoken Language Processing, 2022

Rhythm-controllable Attention with High Robustness for Long Sentence Speech Synthesis.

Proceedings of the 13th International Symposium on Chinese Spoken Language Processing, 2022

Automatic Respiratory Sound Classification Via Multi-Branch Temporal Convolutional Network.

Proceedings of the IEEE International Conference on Acoustics, 2022

Automatic Depression Level Assessment from Speech By Long-Term Global Information Embedding.

Proceedings of the IEEE International Conference on Acoustics, 2022

2021

Correction to: Semi-supervised Ladder Networks for Speech Emotion Recognition.

Int. J. Autom. Comput., 2021

2020

Expression Analysis Based on Face Regions in Real-world Conditions.

Int. J. Autom. Comput., 2020

2019

Semi-supervised Ladder Networks for Speech Emotion Recognition.

Int. J. Autom. Comput., 2019

Expression Analysis Based on Face Regions in Read-world Conditions.

CoRR, 2019

Speech Emotion Recognition via Contrastive Loss under Siamese Networks.

CoRR, 2019

Discriminative Video Representation with Temporal Order for Micro-expression Recognition.

Proceedings of the IEEE International Conference on Acoustics, 2019

2018

Investigation of Multimodal Features, Classifiers and Fusion Methods for Emotion Recognition.

CoRR, 2018

Deep Learning for Continuous Multiple Time Series Annotations.

Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop, 2018

Multimodal Continuous Emotion Recognition with Data Augmentation Using Recurrent Neural Networks.

Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop, 2018

BLSTM-CRF Based End-to-End Prosodic Boundary Prediction with Context Sensitive Embeddings in a Text-to-Speech Front-End.

Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Speech Emotion Recognition from Variable-Length Inputs with Triplet Loss Function.

Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

End-to-End Continuous Emotion Recognition from Video Using 3D Convlstm Networks.

Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

2017

Quantitative intonation modeling of interrogative sentences for Mandarin speech synthesis.

Speech Commun., 2017

CHEAVD: a Chinese natural emotional audio-visual database.

J. Ambient Intell. Humaniz. Comput., 2017

Continuous Multimodal Emotion Prediction Based on Long Short Term Memory Recurrent Neural Network.

Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, Mountain View, CA, USA, October 23, 2017

Investigating Efficient Feature Representation Methods and Training Objective for BLSTM-Based Phone Duration Prediction.

Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

Distilling Knowledge from an Ensemble of Models for Punctuation Prediction.

Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

The NLPR Speech Synthesis entry for Blizzard Challenge 2017.

Proceedings of the Blizzard Challenge 2017, Stockholm, Sweden, August 25, 2017

2016

Investigating Effect of Rich Syntactic Features on Mandarin Prosodic Boundaries Prediction.

J. Signal Process. Syst., 2016

Audio Visual Emotion Recognition with Temporal Alignment and Perception Attention.

CoRR, 2016

Investigating deep neural network adaptation for generating exclamatory and interrogative speech in Mandarin.

Proceedings of the 10th International Symposium on Chinese Spoken Language Processing, 2016

Text-based sentential stress prediction using continuous lexical embedding for Mandarin speech synthesis.

Proceedings of the 10th International Symposium on Chinese Spoken Language Processing, 2016

End-to-end keywords spotting based on connectionist temporal classification for Mandarin.

Proceedings of the 10th International Symposium on Chinese Spoken Language Processing, 2016

Improving Prosodic Boundaries Prediction for Mandarin Speech Synthesis by Using Enhanced Embedding Feature and Model Fusion Approach.

Proceedings of the 17th Annual Conference of the International Speech Communication Association, 2016

The Parameterized Phoneme Identity Feature as a Continuous Real-Valued Vector for Neural Network Based Speech Synthesis.

Zhengqi Wen

Ya Li

Jianhua Tao

Proceedings of the 17th Annual Conference of the International Speech Communication Association, 2016

Long short term memory recurrent neural network based encoding method for emotion recognition in video.

Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

MEC 2016: The Multimodal Emotion Recognition Challenge of CCPR 2016.

Proceedings of the Pattern Recognition - 7th Chinese Conference, 2016

BLSTM Guided Unit Selection Synthesis System for Blizzard Challenge 2016.

Proceedings of the Blizzard Challenge 2016, Cupertino, CA, USA, September 16, 2016

2015

Hierarchical stress modeling and generation in mandarin for expressive Text-to-Speech.

Speech Commun., 2015

Long Short Term Memory Recurrent Neural Network based Multimodal Dimensional Emotion Recognition.

Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge, 2015

A novel method of artificial bandwidth extension using deep architecture.

Proceedings of the 16th Annual Conference of the International Speech Communication Association, 2015

Voice quality: Not only about "you" but also about "your interlocutor".

Ya Li

Nick Campbell

Jianhua Tao

Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

From simulated speech to natural speech, what are the robust features for emotion recognition?

Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction, 2015

Multi task sequence learning for depression scale prediction from video.

Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction, 2015

2014

Phonological influences on the realization of final lowering evidence from dialogue Chinese Mandarin.

Proceedings of the 2014 17th Oriental Chapter of the International Committee for the Co-ordination and Standardization of Speech Databases and Assessment Techniques (COCOSDA), 2014

Multi-scale Temporal Modeling for Dimensional Emotion Recognition in Video.

Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, 2014

The expression of emotions by text and speech.

Proceedings of the 9th International Symposium on Chinese Spoken Language Processing, 2014

Survey on discriminative feature selection for speech emotion recognition.

Proceedings of the 9th International Symposium on Chinese Spoken Language Processing, 2014

Context features based pre-selection and weight prediction in concatenation speech synthesis system.

Proceedings of the 9th International Symposium on Chinese Spoken Language Processing, 2014

Efficient voice activity detection algorithm based on sub-band temporal envelope and sub-band long-term signal variability.

Proceedings of the 9th International Symposium on Chinese Spoken Language Processing, 2014

Investigating effect of rich syntactic features on Mandarin prosodic phrase boundaries prediction.

Proceedings of the 9th International Symposium on Chinese Spoken Language Processing, 2014

Improving generation performance of speech emotion recognition by denoising autoencoders.

Proceedings of the 9th International Symposium on Chinese Spoken Language Processing, 2014

Combining prosodic and spectral features for Mandarin intonation recognition.

Proceedings of the 9th International Symposium on Chinese Spoken Language Processing, 2014

A hierarchical viterbi algorithm for Mandarin hybrid speech synthesis system.

Proceedings of the 15th Annual Conference of the International Speech Communication Association, 2014

Improving Mandarin prosodic boundary prediction with rich syntactic features.

Hao Che

Jianhua Tao

Ya Li

Proceedings of the 15th Annual Conference of the International Speech Communication Association, 2014

A novel hybrid mandarin speech synthesis system using different base units for model training and concatenation.

Proceedings of the IEEE International Conference on Acoustics, 2014

2013

A novel unit selection method for concatenation speech system using similarity measure.

Proceedings of the 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2013

On Constructing a Chinese Task-Oriented Subjectivity Lexicon.

Xiaoying Xu

Jianhua Tao

Ya Li

Proceedings of the Chinese Lexical Semantics - 14th Workshop, 2013

Extended Decision Tree with or Relationship for HMM-Based Speech Synthesis.

Yang Wang

Jianhua Tao