Guangzhi Sun

Warit Sirichotedumrong

Kasima Tharnpipitchai

Kunat Pipatanakul

Proceedings of the 26th Annual Conference of the International Speech Communication Association, 2025

CASE-Bench: Context-Aware SafEty Benchmark for Large Language Models.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

Improving LLM Video Understanding with 16 Frames Per Second.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

Bayesian WeakS-to-Strong from Text Classification to Generation.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation.

Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing, 2025

Audio-centric Video Understanding Benchmark without Text Shortcut.

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

Unlearning vs. Obfuscation: Are We Truly Removing Knowledge?

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

SkillAggregation: Reference-free LLM-Dependent Aggregation.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

2024

Graph Neural Networks for Contextual ASR With the Tree-Constrained Pointer Generator.

IEEE ACM Trans. Audio Speech Lang. Process., 2024

Cross-Utterance Conditioned VAE for Speech Generation.

IEEE ACM Trans. Audio Speech Lang. Process., 2024

SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation.

CoRR, 2024

Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization.

CoRR, 2024

Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation.

CoRR, 2024

Extract and Diffuse: Latent Integration for Improved Diffusion-based Speech and Vocal Enhancement.

CoRR, 2024

Speaker Adaptation for Quantised End-to-End ASR Models.

CoRR, 2024

Wav2Prompt: End-to-End Speech Prompt Generation and Tuning For LLM in Zero and Few-shot Learning.

Keqi Deng

CoRR, 2024

CrossCheckGPT: Universal Hallucination Ranking for Multimodal Foundation Models.

CoRR, 2024

Matching domain experts by training from scratch on domain knowledge.

Xiaoliang Luo

Bradley C. Love

CoRR, 2024

M³AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset.

CoRR, 2024

Large language models surpass human experts in predicting neuroscience results.

CoRR, 2024

Hierarchical Multi-Path and Multi-Model Selection For Fake Speech Detection.

Proceedings of the IEEE Spoken Language Technology Workshop, 2024

Affect Recognition in Conversations Using Large Language Models.

Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2024

SOT Triggered Neural Clustering for Speaker Attributed ASR.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper Models.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

SAML: Speaker Adaptive Mixture of LoRA Experts for End-to-End ASR.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

Can Large Language Models Understand Spatial Audio?

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

SALMONN: Towards Generic Hearing Abilities for Large Language Models.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Enhancing Quantised End-to-End ASR Models Via Personalisation.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

Connecting Speech Encoder and Large Language Model for ASR.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

Extending Large Language Models for Speech and Audio Captioning.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

Parameter Efficient Finetuning for Speech Emotion Recognition and Domain Adaptation.

Nineli Lashkarashvili

Wen Wu

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

Building Better AI Agents: A Provocation on the Utilisation of Persona in LLM-based Conversational Agents.

Xiao Zhan

Jose Such

Proceedings of the ACM Conversational User Interfaces 2024, 2024

Speech-based Slot Filling using Large Language Models.

Proceedings of the Findings of the Association for Computational Linguistics, 2024

M³AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset.

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

2023

Minimising Biasing Word Errors for Contextual ASR With the Tree-Constrained Pointer Generator.

IEEE ACM Trans. Audio Speech Lang. Process., 2023

TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch.

CoRR, 2023

Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models.

CoRR, 2023

Conditional Diffusion Model for Target Speaker Extraction.

CoRR, 2023

Affect Recognition in Conversations Using Large Language Models.

CoRR, 2023

Can Contextual Biasing Remain Effective with Whisper and GPT-2?

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

End-to-End Spoken Language Understanding with Tree-Constrained Pointer Generator.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

Spectral Clustering-Aware Learning of Embeddings for Speaker Diarisation.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

TorchAudio 2.1: Advancing Speech Recognition, Self-Supervised Learning, and Audio Processing Components for Pytorch.

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

2022

Tree-constrained Pointer Generator with Graph Neural Network Encodings for Contextual Speech Recognition.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech.

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022

2021

Combination of deep speaker embeddings for diarisation.

Neural Networks, 2021

Content-Aware Speaker Embeddings for Speaker Diarisation.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

Transformer Language Models with LSTM-Based Cross-Utterance Information Representation.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

Tree-Constrained Pointer Generator for End-to-End Contextual Speech Recognition.

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2021

2020

Cross-Utterance Language Models with Acoustic Error Sampling.

CoRR, 2020

Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior.

CoRR, 2020

Fully-Hierarchical Fine-Grained Prosody Modeling For Interpretable Speech Synthesis.

Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, 2020

Generating Diverse and Natural Text-to-Speech Samples Using a Quantized Fine-Grained VAE and Autoregressive Prosody Prior.

Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, 2020

2019

Speaker Diarisation Using 2D Self-attentive Combination of Embeddings.
