Dan Su

Orcid: 0000-0003-2590-7090

Affiliations:
  • Tencent AI Lab, Shenzhen, China


According to our database, Dan Su authored at least 107 papers between 2017 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2024
Fuse after Align: Improving Face-Voice Association Learning via Multimodal Encoder.
CoRR, 2024

Prompt-guided Precise Audio Editing with Diffusion Models.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis.
Proceedings of the IEEE International Conference on Acoustics, 2024

Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

MM-LLMs: Recent Advances in MultiModal Large Language Models.
Proceedings of the Findings of the Association for Computational Linguistics, 2024

2023
A High Fidelity and Low Complexity Neural Audio Coding.
CoRR, 2023

DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis.
CoRR, 2023

Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Compressed MoE ASR Model Based on Knowledge Distillation and Quantization.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Multi-mode Neural Speech Coding Based on Deep Generative Networks.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Trinet: Stabilizing Self-Supervised Learning From Complete or Slow Collapse.
Proceedings of the IEEE International Conference on Acoustics, 2023

UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis.
Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022
Cross-Speaker Emotion Transfer Through Information Perturbation in Emotional Speech Synthesis.
IEEE Signal Process. Lett., 2022

The DKU-Tencent System for the VoxCeleb Speaker Recognition Challenge 2022.
CoRR, 2022

DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs.
CoRR, 2022

3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition.
Proceedings of the 13th International Symposium on Chinese Spoken Language Processing, 2022

End-to-End Voice Conversion with Information Perturbation.
Proceedings of the 13th International Symposium on Chinese Spoken Language Processing, 2022

AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation.
Proceedings of the 13th International Symposium on Chinese Spoken Language Processing, 2022

Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Enhancing Word-Level Semantic Representation via Dependency Structure for Expressive Text-to-Speech Synthesis.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Learning Noise-independent Speech Representation for High-quality Voice Conversion for Noisy Target Speakers.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Cross-Age Speaker Verification: Learning Age-Invariant Speaker Embeddings.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis.
Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, 2022

BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis.
Proceedings of the Tenth International Conference on Learning Representations, 2022

Multi-Channel Speaker Diarization Using Spatial Features for Meetings.
Proceedings of the IEEE International Conference on Acoustics, 2022

The CUHK-Tencent Speaker Diarization System for the ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge.
Proceedings of the IEEE International Conference on Acoustics, 2022

SpeechMoE2: Mixture-of-Experts Model with Improved Routing.
Proceedings of the IEEE International Conference on Acoustics, 2022

VCVTS: Multi-Speaker Video-to-Speech Synthesis Via Cross-Modal Knowledge Transfer from Voice Conversion.
Proceedings of the IEEE International Conference on Acoustics, 2022

Consistent Training and Decoding for End-to-End Speech Recognition Using Lattice-Free MMI.
Proceedings of the IEEE International Conference on Acoustics, 2022

Simple Attention Module Based Speaker Verification with Iterative Noisy Label Detection.
Proceedings of the IEEE International Conference on Acoustics, 2022

DP-DWA: Dual-Path Dynamic Weight Attention Network With Streaming DFSMN-SAN For Automatic Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2022

Referee: Towards Reference-Free Cross-Speaker Style Transfer with Low-Quality Data for Expressive Speech Synthesis.
Proceedings of the IEEE International Conference on Acoustics, 2022

Enhancing Speaking Styles in Conversational Text-to-Speech Synthesis with Graph-Based Multi-Modal Context Modeling.
Proceedings of the IEEE International Conference on Acoustics, 2022

2021
Meta-Voice: Fast few-shot style transfer for expressive voice cloning using meta learning.
CoRR, 2021

Bilateral Denoising Diffusion Models.
CoRR, 2021

GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio.
CoRR, 2021

Spoken Style Learning with Multi-modal Hierarchical Context Encoding for Conversational Text-to-Speech Synthesis.
CoRR, 2021

VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis based on Very Deep VAE with Residual Attention.
CoRR, 2021

Effective Low-Cost Time-Domain Audio Separation Using Globally Attentive Locally Recurrent Networks.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Exploring Cross-lingual Singing Voice Synthesis Using Speech Data.
Proceedings of the 12th International Symposium on Chinese Spoken Language Processing, 2021

SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

TeCANet: Temporal-Contextual Attention Network for Environment-Aware Speech Dereverberation.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Raw Waveform Encoder with Multi-Scale Globally Attentive Locally Recurrent Networks for End-to-End Speech Recognition.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Glow-WaveGAN: Learning Speech Representations from GAN-Based Variational Auto-Encoder for High Fidelity Flow-Based Speech Synthesis.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Controllable Context-Aware Conversational Speech Synthesis.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

TeNC: Low Bit-Rate Speech Coding with VQ-VAE and GAN.
Proceedings of the ICMI '21 Companion: Companion Publication of the 2021 International Conference on Multimodal Interaction, Montreal, QC, Canada, October 18, 2021

FastSVC: Fast Cross-Domain Singing Voice Conversion With Feature-Wise Linear Modulation.
Proceedings of the 2021 IEEE International Conference on Multimedia and Expo, 2021

A Joint Training Framework of Multi-Look Separator and Speaker Embedding Extractor for Overlapped Speech.
Proceedings of the IEEE International Conference on Acoustics, 2021

Contrastive Separative Coding for Self-Supervised Representation Learning.
Proceedings of the IEEE International Conference on Acoustics, 2021

Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input.
Proceedings of the IEEE International Conference on Acoustics, 2021

Replay and Synthetic Speech Detection with Res2Net Architecture.
Proceedings of the IEEE International Conference on Acoustics, 2021

Sandglasset: A Light Multi-Granularity Self-Attentive Network for Time-Domain Speech Separation.
Proceedings of the IEEE International Conference on Acoustics, 2021

Learned Transferable Architectures Can Surpass Hand-Designed Architectures for Large Scale Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2021

DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2021

Latency-Controlled Neural Architecture Search for Streaming Speech Recognition.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2021

Speaker Independent and Multilingual/Mixlingual Speech-Driven Talking Head Generation Using Phonetic Posteriorgrams.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2021

Tune-In: Training Under Negative Environments with Interference for Attention Networks Simulating Cocktail Party Effect.
Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021

2020
A Framework for Adapting DNN Speaker Embedding Across Languages.
IEEE ACM Trans. Audio Speech Lang. Process., 2020

On the localness modeling for the self-attention based end-to-end speech synthesis.
Neural Networks, 2020

Phonetic Posteriorgrams based Many-to-Many Singing Voice Conversion via Adversarial Training.
CoRR, 2020

Audio-Visual Multi-Channel Recognition of Overlapped Speech.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

DurIAN: Duration Informed Attention Network for Speech Synthesis.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Speech-XLNet: Unsupervised Acoustic Model Pretraining for Self-Attention Networks.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

SpecSwap: A Simple Data Augmentation Method for End-to-End Speech Recognition.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Transferring Source Style in Non-Parallel Voice Conversion.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Investigating Robustness of Adversarial Samples Detection for Automatic Speaker Verification.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

End-to-End Multi-Look Keyword Spotting.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

DFSMN-SAN with Persistent Memory Model for Automatic Speech Recognition.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

End-To-End Accent Conversion Without Using Native Utterances.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Multi-Level Deep Neural Network Adaptation for Speaker Verification Using MMD and Consistency Regularization.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Mixup-breakdown: A Consistency Training Method for Improving Generalization of Speech Separation Models.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Speaker-Aware Target Speaker Enhancement by Jointly Learning with Speaker Embedding Extraction.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Integration of Multi-Look Beamformers for Multi-Channel Keyword Spotting.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

A Random Gossip BMUF Process for Neural Language Modeling.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Enhancing End-to-End Multi-Channel Speech Separation Via Spatial Feature Learning.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Code-Switched Speech Synthesis Using Bilingual Phonetic Posteriorgram with Only Monolingual Corpora.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

2019
Speech-XLNet: Unsupervised Acoustic Model Pretraining For Self-Attention Networks.
CoRR, 2019

DurIAN: Duration Informed Attention Network For Multimodal Synthesis.
CoRR, 2019

Maximizing Mutual Information for Tacotron.
CoRR, 2019

Phrase-Level Class based Language Model for Mandarin Smart Speaker Query Recognition.
CoRR, 2019

End-to-End Multi-Channel Speech Separation.
CoRR, 2019

Extract, Adapt and Recognize: An End-to-End Neural Network for Corrupted Monaural Speech Recognition.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Neural Spatial Filter: Target Speaker Speech Separation Assisted with Directional Information.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-Trained BERT.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Teach an All-rounder with Experts in Different Domains.
Proceedings of the IEEE International Conference on Acoustics, 2019

Joint Training of Complex Ratio Mask Based Beamformer and Acoustic Model for Noise Robust ASR.
Proceedings of the IEEE International Conference on Acoustics, 2019

Quasi-fully Convolutional Neural Network with Variational Inference for Speech Synthesis.
Proceedings of the IEEE International Conference on Acoustics, 2019

Learning Discriminative Features in Sequence Training without Requiring Framewise Labelled Data.
Proceedings of the IEEE International Conference on Acoustics, 2019

Investigating End-to-end Speech Recognition for Mandarin-English Code-switching.
Proceedings of the IEEE International Conference on Acoustics, 2019

Component Fusion: Learning Replaceable Language Model Component for End-to-end Speech Recognition System.
Proceedings of the IEEE International Conference on Acoustics, 2019

Boundary Discriminative Large Margin Cosine Loss for Text-independent Speaker Verification.
Proceedings of the IEEE International Conference on Acoustics, 2019

Multi-band PIT and Model Integration for Improved Multi-channel Speech Separation.
Proceedings of the IEEE International Conference on Acoustics, 2019

Improving Speech Enhancement with Phonetic Embedding Features.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

Syllable-Dependent Discriminative Learning for Small Footprint Text-Dependent Speaker Verification.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

Automatic Prosodic Structure Labeling using DNN-BGRU-CRF Hybrid Neural Network.
Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2019

Prosodic Structure Prediction using Deep Self-attention Neural Network.
Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2019

2018
Improving Attention-Based End-to-End ASR Systems with Sequence-Based Loss Functions.
Proceedings of the 2018 IEEE Spoken Language Technology Workshop, 2018

Speech Super-Resolution Using Parallel WaveNet.
Proceedings of the 11th International Symposium on Chinese Spoken Language Processing, 2018

Text-Dependent Speech Enhancement for Small-Footprint Robust Keyword Detection.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Rapid Style Adaptation Using Residual Error Embedding for Expressive Speech Synthesis.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Improving Attention Based Sequence-to-Sequence Models for End-to-End English Conversational Speech Recognition.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Deep Extractor Network for Target Speaker Recovery from Single Channel Speech Mixtures.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Deep Discriminative Embeddings for Duration Robust Speaker Verification.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Permutation Invariant Training of Generative Adversarial Network for Monaural Speech Separation.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

2017
Chiral Buckybowl Molecules.
Symmetry, 2017

