Ziyang Ma

ORCID: 0000-0002-8195-3262

Affiliations:

Shanghai Jiao Tong University, Department of Computer Science and Engineering, AI Institute, MoE Key Lab of Artificial Intelligence, Shanghai, China
Shandong University, School of Computer Science and Technology, Shandong, China (until 2022)

According to our database, Ziyang Ma authored at least 79 papers between 2021 and 2025.

Collaborative distances:

Dijkstra number of four.
Erdős number of three.

Bibliography

2025

ISA-Bench: Benchmarking Instruction Sensitivity for Large Audio Language Models.

CoRR, October, 2025

UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models.

CoRR, October, 2025

SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization.

CoRR, October, 2025

Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception.

CoRR, October, 2025

Evaluating Bias in Spoken Dialogue LLMs for Real-World Decisions and Recommendations.

CoRR, October, 2025

Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis.

CoRR, September, 2025

Pay More Attention To Audio: Mitigating Imbalance of Cross-Modal Attention in Large Audio Language Models.

CoRR, September, 2025

Qwen3-Omni Technical Report.

CoRR, September, 2025

EMER-Ranker: Learning to Rank Emotion Descriptions in the Absence of Ground Truth.

CoRR, July, 2025

NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025.

CoRR, June, 2025

MagiCodec: Simple Masked Gaussian-Injected Codec for High-Fidelity Reconstruction and Generation.

CoRR, June, 2025

AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models.

CoRR, May, 2025

MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix.

CoRR, May, 2025

Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation.

CoRR, May, 2025

MER 2025: When Affective Computing Meets Large Language Models.

CoRR, April, 2025

EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting.

CoRR, April, 2025

YuE: Scaling Open Foundation Models for Long-Form Music Generation.

CoRR, March, 2025

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens.

CoRR, March, 2025

URO-Bench: A Comprehensive Benchmark for End-to-End Spoken Dialogue Models.

CoRR, February, 2025

Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model.

CoRR, January, 2025

MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization.

CoRR, January, 2025

Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling.

Proceedings of the 26th Annual Conference of the International Speech Communication Association, 2025

k2SSL: A Faster and Better Framework for Self-Supervised Speech Representation Learning.

Proceedings of the IEEE International Conference on Multimedia and Expo, 2025

A Progressive Generation Framework with Speech Pre-trained Model for Expressive Voice Conversion.

Proceedings of the IEEE International Conference on Multimedia and Expo, 2025

Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap.

Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing, 2025

DRCap: Decoding CLAP Latents with Retrieval-Augmented Generation for Zero-shot Audio Captioning.

Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing, 2025

SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs.

Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing, 2025

Making LLMs Better Many-to-Many Speech-to-Text Translators with Curriculum Learning.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training.

Proceedings of the Findings of the Association for Computational Linguistics, 2025

GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

Towards Reliable Large Audio Language Model.

Proceedings of the Findings of the Association for Computational Linguistics, 2025

ELLA-V: Stable Neural Codec Language Modeling with Alignment-Guided Sequence Reordering.

Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

Speech Recognition Meets Large Language Model: Benchmarking, Models, and Exploration.

Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

Language Model Can Listen While Speaking.

Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

VQTalker: Towards Multilingual Talking Avatars Through Facial Motion Tokenization.

Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

2024

Towards Weakly Supervised Text-to-Audio Grounding.

IEEE Trans. Multim., 2024

E³TTS: End-to-End Text-Based Speech Editing TTS System and Its Applications.

IEEE ACM Trans. Audio Speech Lang. Process., 2024

CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought.

CoRR, 2024

Progressive Residual Extraction based Pre-training for Speech Representation Learning.

CoRR, 2024

Foundation Models for Music: A Survey.

CoRR, 2024

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens.

CoRR, 2024

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs.

CoRR, 2024

MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series.

CoRR, 2024

MuPT: A Generative Symbolic Music Pretrained Transformer.

CoRR, 2024

The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge.

CoRR, 2024

Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model.

CoRR, 2024

ChatMusician: Understanding and Generating Music Intrinsically with LLM.

CoRR, 2024

An Embarrassingly Simple Approach for LLM with Strong ASR Capacity.

CoRR, 2024

CTC-Assisted LLM-Based Contextual ASR.

Proceedings of the IEEE Spoken Language Technology Workshop, 2024

NDVQ: Robust Neural Audio Codec With Normal Distribution-Based Vector Quantization.

Proceedings of the IEEE Spoken Language Technology Workshop, 2024

1st Place Solution to Odyssey Emotion Recognition Challenge Task1: Tackling Class Imbalance Problem.

Proceedings of the Odyssey 2024: The Speaker and Language Recognition Workshop, 2024

MER 2024: Semi-Supervised Learning, Noise Robustness, and Open-Vocabulary Multimodal Emotion Recognition.

Proceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing, 2024

Improving Emotion Recognition with Pre-Trained Models, Multimodality, and Contextual Information.

Proceedings of the 14th IEEE International Symposium on Chinese Spoken Language Processing, 2024

The X-Lance Technical Report for Interspeech 2024 Speech Processing using Discrete Speech Unit Challenge.

Proceedings of the 14th IEEE International Symposium on Chinese Spoken Language Processing, 2024

MaLa-ASR: Multimedia-Assisted LLM-Based ASR.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

EAT: Self-Supervised Pre-Training with Efficient Audio Transformer.

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024

BAT: Learning to Reason about Spatial Sounds with Large Language Models.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

Hourglass-AVSR: Down-Up Sampling-Based Computational Efficiency Model for Audio-Visual Speech Recognition.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

Leveraging Speech PTM, Text LLM, And Emotional TTS For Speech Emotion Recognition.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

VoiceFlow: Efficient Text-To-Speech with Rectified Flow Matching.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

ChatMusician: Understanding and Generating Music Intrinsically with LLM.

Proceedings of the Findings of the Association for Computational Linguistics, 2024

emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation.

Proceedings of the Findings of the Association for Computational Linguistics, 2024

2023

LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT.

CoRR, 2023

Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation.

CoRR, 2023

Unsupervised Active Learning: Optimizing Labeling Cost-Effectiveness for Automatic Speech Recognition.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Improving Code-Switching and Name Entity Recognition in ASR with Speech Editing based Data Augmentation.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Front-End Adapter: Adapting Front-End Input of Speech Based Self-Supervised Learning for Speech Recognition.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

Improving Few-Shot Learning for Talking Face System with TTS Data Augmentation.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

Fast-Hubert: an Efficient Training Framework for Self-Supervised Speech Representation Learning.

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition.

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

2022

TESSP: Text-Enhanced Self-Supervised Speech Pre-training.

CoRR, 2022

2021

Hierarchical Deep Residual Reasoning for Temporal Moment Localization.

Proceedings of the MMAsia '21: ACM Multimedia Asia, Gold Coast, Australia, December 1, 2021
