Ya Li

ORCID: 0000-0002-6284-5039

Affiliations:
  • Beijing University of Posts and Telecommunications, School of Artificial Intelligence, Beijing, China
  • Chinese Academy of Sciences (CAS), Institute of Automation, National Laboratory of Pattern Recognition, Beijing, China (PhD 2012)


According to our database, Ya Li authored at least 97 papers between 2009 and 2025.

Bibliography

2025
Deep Learning Approaches for Multimodal Intent Recognition: A Survey.
CoRR, July, 2025

MER 2025: When Affective Computing Meets Large Language Models.
CoRR, April, 2025

Psy-Copilot: Visual Chain of Thought for Counseling.
CoRR, March, 2025

Psy-Insight: Explainable Multi-turn Bilingual Dataset for Mental Health Counseling.
CoRR, March, 2025

Beyond Surface Simplicity: Revealing Hidden Reasoning Attributes for Precise Commonsense Diagnosis.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

Controllable 3D Dance Generation Using Diffusion-Based Transformer U-Net.
Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

2024
DepressionMLP: A Multi-Layer Perceptron Architecture for Automatic Depression Level Prediction via Facial Keypoints and Action Units.
IEEE Trans. Circuits Syst. Video Technol., September, 2024

Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation.
IEEE ACM Trans. Audio Speech Lang. Process., 2024

Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab and Convolutional Recurrent Neural Networks.
IEEE ACM Trans. Audio Speech Lang. Process., 2024

WavDepressionNet: Automatic Depression Level Prediction via Raw Speech Signals.
IEEE Trans. Affect. Comput., 2024

Mel-Refine: A Plug-and-Play Approach to Refine Mel-Spectrogram in Audio Generation.
CoRR, 2024

Open-vocabulary Multimodal Emotion Recognition: Dataset, Metric, and Benchmark.
CoRR, 2024

Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model.
CoRR, 2024

ExpressiveSinger: Synthesizing Expressive Singing Voice as an Instrument.
Proceedings of the 14th IEEE International Symposium on Chinese Spoken Language Processing, 2024

G2DiaR: Enhancing Commonsense Reasoning of LLMs with Graph-to-Dialogue & Reasoning.
Proceedings of the 14th IEEE International Symposium on Chinese Spoken Language Processing, 2024

ICAGC 2024: Inspirational and Convincing Audio Generation Challenge 2024.
Proceedings of the 14th IEEE International Symposium on Chinese Spoken Language Processing, 2024

Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

A Preliminary Study on Automatic Pronunciation Error Detection for Hearing-impaired Children.
Proceedings of the 10th International Conference on Communication and Information Processing, 2024

Frame-Level Emotional State Alignment Method for Speech Emotion Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2024

Concss: Contrastive-based Context Comprehension for Dialogue-Appropriate Prosody in Conversational Speech Synthesis.
Proceedings of the IEEE International Conference on Acoustics, 2024

2023
Dual-Lens HDR using Guided 3D Exposure CNN and Guided Denoising Transformer.
ACM Trans. Multim. Comput. Commun. Appl., 2023

Dual Attention and Element Recalibration Networks for Automatic Depression Level Prediction.
IEEE Trans. Affect. Comput., 2023

M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis.
CoRR, 2023

Mining High-quality Samples from Raw Data and Majority Voting Method for Multimodal Emotion Recognition.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

CMCU-CSS: Enhancing Naturalness via Commonsense-based Multi-modal Context Understanding in Conversational Speech Synthesis.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

FTA-net: A Frequency and Time Attention Network for Speech Depression Detection.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Exploring the interpretability in speech-based adolescent depression detection by SHAP.
Proceedings of the 9th International Conference on Communication and Information Processing, 2023

GaitParse: Gait Parsing Algorithm with Self-Supervised Fine-Tuning for Gait Recognition.
Proceedings of the 9th International Conference on Communication and Information Processing, 2023

M2-CTTS: End-to-End Multi-Scale Multi-Modal Conversational Text-to-Speech Synthesis.
Proceedings of the IEEE International Conference on Acoustics, 2023

2022
Selective Element and Two Orders Vectorization Networks for Automatic Depression Severity Diagnosis via Facial Changes.
IEEE Trans. Circuits Syst. Video Technol., 2022

Depressioner: Facial dynamic representation for automatic depression level prediction.
Expert Syst. Appl., 2022

A Keypoint Based Enhancement Method for Audio Driven Free View Talking Head Synthesis.
Proceedings of the 24th IEEE International Workshop on Multimedia Signal Processing, 2022

ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis.
Proceedings of the 13th International Symposium on Chinese Spoken Language Processing, 2022

Rhythm-controllable Attention with High Robustness for Long Sentence Speech Synthesis.
Proceedings of the 13th International Symposium on Chinese Spoken Language Processing, 2022

Automatic Respiratory Sound Classification Via Multi-Branch Temporal Convolutional Network.
Proceedings of the IEEE International Conference on Acoustics, 2022

Automatic Depression Level Assessment from Speech By Long-Term Global Information Embedding.
Proceedings of the IEEE International Conference on Acoustics, 2022

2021
Correction to: Semi-supervised Ladder Networks for Speech Emotion Recognition.
Int. J. Autom. Comput., 2021

2020
Expression Analysis Based on Face Regions in Real-world Conditions.
Int. J. Autom. Comput., 2020

2019
Semi-supervised Ladder Networks for Speech Emotion Recognition.
Int. J. Autom. Comput., 2019

Expression Analysis Based on Face Regions in Real-world Conditions.
CoRR, 2019

Speech Emotion Recognition via Contrastive Loss under Siamese Networks.
CoRR, 2019

Discriminative Video Representation with Temporal Order for Micro-expression Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2019

2018
Investigating Deep Neural Network Adaptation for Generating Exclamatory and Interrogative Speech in Mandarin.
J. Signal Process. Syst., 2018

Investigation of Multimodal Features, Classifiers and Fusion Methods for Emotion Recognition.
CoRR, 2018

Deep Learning for Continuous Multiple Time Series Annotations.
Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop, 2018

Multimodal Continuous Emotion Recognition with Data Augmentation Using Recurrent Neural Networks.
Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop, 2018

BLSTM-CRF Based End-to-End Prosodic Boundary Prediction with Context Sensitive Embeddings in a Text-to-Speech Front-End.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Speech Emotion Recognition from Variable-Length Inputs with Triplet Loss Function.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

End-to-End Continuous Emotion Recognition from Video Using 3D Convlstm Networks.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

2017
Quantitative intonation modeling of interrogative sentences for Mandarin speech synthesis.
Speech Commun., 2017

CHEAVD: a Chinese natural emotional audio-visual database.
J. Ambient Intell. Humaniz. Comput., 2017

Continuous Multimodal Emotion Prediction Based on Long Short Term Memory Recurrent Neural Network.
Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, Mountain View, CA, USA, October 23, 2017

Investigating Efficient Feature Representation Methods and Training Objective for BLSTM-Based Phone Duration Prediction.
Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

Distilling Knowledge from an Ensemble of Models for Punctuation Prediction.
Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

The NLPR Speech Synthesis entry for Blizzard Challenge 2017.
Proceedings of the Blizzard Challenge 2017, Stockholm, Sweden, August 25, 2017, 2017

2016
Investigating Effect of Rich Syntactic Features on Mandarin Prosodic Boundaries Prediction.
J. Signal Process. Syst., 2016

Audio Visual Emotion Recognition with Temporal Alignment and Perception Attention.
CoRR, 2016

Text-based sentential stress prediction using continuous lexical embedding for Mandarin speech synthesis.
Proceedings of the 10th International Symposium on Chinese Spoken Language Processing, 2016

End-to-end keywords spotting based on connectionist temporal classification for Mandarin.
Proceedings of the 10th International Symposium on Chinese Spoken Language Processing, 2016

Improving Prosodic Boundaries Prediction for Mandarin Speech Synthesis by Using Enhanced Embedding Feature and Model Fusion Approach.
Proceedings of the 17th Annual Conference of the International Speech Communication Association, 2016

The Parameterized Phoneme Identity Feature as a Continuous Real-Valued Vector for Neural Network Based Speech Synthesis.
Proceedings of the 17th Annual Conference of the International Speech Communication Association, 2016

Long short term memory recurrent neural network based encoding method for emotion recognition in video.
Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

MEC 2016: The Multimodal Emotion Recognition Challenge of CCPR 2016.
Proceedings of the Pattern Recognition - 7th Chinese Conference, 2016

BLSTM Guided Unit Selection Synthesis System for Blizzard Challenge 2016.
Proceedings of the Blizzard Challenge 2016, Cupertino, CA, USA, September 16, 2016, 2016

2015
Hierarchical stress modeling and generation in mandarin for expressive Text-to-Speech.
Speech Commun., 2015

Long Short Term Memory Recurrent Neural Network based Multimodal Dimensional Emotion Recognition.
Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge, 2015

A novel method of artificial bandwidth extension using deep architecture.
Proceedings of the 16th Annual Conference of the International Speech Communication Association, 2015

Voice quality: Not only about "you" but also about "your interlocutor".
Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

From simulated speech to natural speech, what are the robust features for emotion recognition?
Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction, 2015

Multi task sequence learning for depression scale prediction from video.
Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction, 2015

2014
Phonological influences on the realization of final lowering evidence from dialogue Chinese Mandarin.
Proceedings of the 2014 17th Oriental Chapter of the International Committee for the Co-ordination and Standardization of Speech Databases and Assessment Techniques (COCOSDA), 2014

Multi-scale Temporal Modeling for Dimensional Emotion Recognition in Video.
Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, 2014

The expression of emotions by text and speech.
Proceedings of the 9th International Symposium on Chinese Spoken Language Processing, 2014

Survey on discriminative feature selection for speech emotion recognition.
Proceedings of the 9th International Symposium on Chinese Spoken Language Processing, 2014

Context features based pre-selection and weight prediction in concatenation speech synthesis system.
Proceedings of the 9th International Symposium on Chinese Spoken Language Processing, 2014

Efficient voice activity detection algorithm based on sub-band temporal envelope and sub-band long-term signal variability.
Proceedings of the 9th International Symposium on Chinese Spoken Language Processing, 2014

Investigating effect of rich syntactic features on Mandarin prosodic phrase boundaries prediction.
Proceedings of the 9th International Symposium on Chinese Spoken Language Processing, 2014

Improving generation performance of speech emotion recognition by denoising autoencoders.
Proceedings of the 9th International Symposium on Chinese Spoken Language Processing, 2014

Combining prosodic and spectral features for Mandarin intonation recognition.
Proceedings of the 9th International Symposium on Chinese Spoken Language Processing, 2014

A hierarchical viterbi algorithm for Mandarin hybrid speech synthesis system.
Proceedings of the 15th Annual Conference of the International Speech Communication Association, 2014

Improving Mandarin prosodic boundary prediction with rich syntactic features.
Proceedings of the 15th Annual Conference of the International Speech Communication Association, 2014

A novel hybrid mandarin speech synthesis system using different base units for model training and concatenation.
Proceedings of the IEEE International Conference on Acoustics, 2014

2013
A novel unit selection method for concatenation speech system using similarity measure.
Proceedings of the 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2013

On Constructing a Chinese Task-Oriental Subjectivity Lexicon.
Proceedings of the Chinese Lexical Semantics - 14th Workshop, 2013

Extended Decision Tree with or Relationship for HMM-Based Speech Synthesis.
Proceedings of the 2nd IAPR Asian Conference on Pattern Recognition, 2013

Bayesian Inference Based Temporal Modeling for Naturalistic Affective Expression Classification.
Proceedings of the 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, 2013

2012
A multimodal approach of generating 3D human-like talking agent.
J. Multimodal User Interfaces, 2012

2011
Utterance independent bimodal emotion recognition in spontaneous communication.
EURASIP J. Adv. Signal Process., 2011

Hierarchical Stress Modeling in Mandarin Text-to-Speech.
Proceedings of the 12th Annual Conference of the International Speech Communication Association, 2011

The Stability Analysis of Disyllabic Stress in Mandarin Speech.
Proceedings of the 17th International Congress of Phonetic Sciences, 2011

The CASIA Audio Emotion Recognition Method for Audio/Visual Emotion Challenge 2011.
Proceedings of the Affective Computing and Intelligent Interaction, 2011

2010
Text-based unstressed syllable prediction in Mandarin.
Proceedings of the 11th Annual Conference of the International Speech Communication Association, 2010

The WISTON Text to Speech System for Blizzard Challenge 2010.
Proceedings of the Blizzard Challenge 2010, Kansai Science City, Japan, September 25, 2010, 2010

2009
The WISTON Text-to-Speech System for Blizzard Challenge 2009.
Proceedings of the Blizzard Challenge 2009, Edinburgh, Scotland, UK, September 4, 2009, 2009

