Stavros Petridis

CoRR, March, 2026

Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition.

Alexandros Haliassos

CoRR, February, 2026

KeySync: A Robust Approach for Leakage-free Lip Synchronization in High Resolution.

Stella Bounareli

Michal Stypulkowski

Trans. Mach. Learn. Res., 2026

2025

FlashLips: 100-FPS Mask-Free Latent Lip-Sync using Reconstruction Instead of Diffusion or GANs.

CoRR, December, 2025

Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models.

CoRR, November, 2025

Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs.

CoRR, October, 2025

MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition.

CoRR, October, 2025

FaceCrafter: Identity-Conditional Diffusion with Disentangled Control over Facial Pose, Expression, and Emotion.

Kazuaki Mishima

Kenji Suzuki

CoRR, May, 2025

Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis.

Proceedings of the 26th Annual Conference of the International Speech Communication Association, 2025

Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach.

Proceedings of the 26th Annual Conference of the International Speech Communication Association, 2025

Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Contextual Speech Extraction: Leveraging Textual History as an Implicit Cue for Target Speech Extraction.

Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing, 2025

Full-Rank No More: Low-Rank Weight Training for Modern Speech Recognition Models.

Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing, 2025

Large Language Models are Strong Audio-Visual Speech Recognition Learners.

Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing, 2025

KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation.

Michal Stypulkowski

Stella Bounareli

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs.

Umberto Cappellazzo

Minsu Kim

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2025

2024

Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation.

Michal Stypulkowski

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024

Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

Dynamic Data Pruning for Automatic Speech Recognition.

Qiao Xiao

Decebal Constantin Mocanu

Shiwei Liu

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

BRAVEn: Improving Self-supervised pre-training for Visual and Auditory Speech Recognition.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

Hearing Loss Detection From Facial Expressions in One-On-One Conversations.

Yufeng Yin

Ishwarya Ananthabhotla

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

EMOPortraits: Emotion-Enhanced Multimodal One-Shot Head Avatars.

Nikita Drobyshev

Zoe Landgraf

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023

KAN-AV dataset for audio-visual face and speech analysis in the wild.

Triantafyllos Kefalas

Image Vis. Comput., December, 2023

Self-Supervised Video-Centralised Transformer for Video Face Clustering.

IEEE Trans. Pattern Anal. Mach. Intell., November, 2023

End-to-End Video-to-Speech Synthesis Using Generative Adversarial Networks.

IEEE Trans. Cybern., June, 2023

Does Visual Self-Supervision Improve Learning of Speech Representations for Emotion Recognition?

IEEE Trans. Affect. Comput., 2023

TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch.

CoRR, 2023

Laughing Matters: Introducing Laughing-Face Generation using Diffusion Models.

Nikita Drobyshev

CoRR, 2023

Is dataset condensation a silver bullet for healthcare data sharing?

CoRR, 2023

SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision.

Xubo Liu

Egor Lakomkin

CoRR, 2023

SparseVSR: Lightweight and Noise Robust Visual Speech Recognition.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Streaming Audio-Visual Speech Recognition with Alignment Regularization.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Jointly Learning Visual and Auditory Speech Representations from Raw Data.

Proceedings of the Eleventh International Conference on Learning Representations, 2023

Learning Cross-Lingual Visual Speech Representations.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

LA-VocE: Low-SNR Audio-Visual Speech Enhancement Using Neural Vocoders.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels.

Alexandros Haliassos

Honglie Chen

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

SS-VAERR: Self-Supervised Apparent Emotional Reaction Recognition from Video.

Marija Jegorova

Proceedings of the 17th IEEE International Conference on Automatic Face and Gesture Recognition, 2023

SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision.

Xubo Liu

Egor Lakomkin

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Laughing Matters: Introducing Audio-Driven Laughing-Face Generation with Diffusion Models.

Nikita Drobyshev

Proceedings of the 34th British Machine Vision Conference 2023, 2023

2022

Visual speech recognition for multiple languages in the wild.

Rodrigo Schoburg Carrillo de Mira

Nat. Mach. Intell., November, 2022

SVTS: Scalable Video-to-Speech Synthesis.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Training Strategies for Improved Lip-Reading.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2022

Leveraging Real Talking Faces via Self-Supervision for Robust Forgery Detection.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021

Domain Generalisation for Apparent Emotional Facial Expression Recognition across Age-Groups.

CoRR, 2021

Lip-reading with Densely Connected Temporal Convolutional Networks.

Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2021

LiRA: Learning Visual Speech Representations from Audio Through Self-Supervision.

Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

DINO: A Conditional Energy-Based GAN for Domain Translation.

Proceedings of the 9th International Conference on Learning Representations, 2021

End-To-End Audio-Visual Speech Recognition with Conformers.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

Detecting Adversarial Attacks on Audiovisual Speech Recognition.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

Towards Practical Lipreading with Distilled and Efficient Models.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

Lips Don't Lie: A Generalisable and Robust Approach To Face Forgery Detection.

Alexandros Haliassos

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020

Blind Audio-Visual Localization and Separation via Low-Rank and Sparsity.

IEEE Trans. Cybern., 2020

End-to-end visual speech recognition for small-scale datasets.

Pattern Recognit. Lett., 2020

Realistic Speech-Driven Facial Animation with GANs.

Int. J. Comput. Vis., 2020

Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision.

CoRR, 2020

Does Visual Self-Supervision Improve Learning of Speech Representations?

CoRR, 2020

Domain Adversarial Neural Networks for Dysarthric Speech Recognition.

Dominika Woszczyk

David E. Millard

Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Visually Guided Self Supervised Learning of Speech Representations.

Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, 2020

Lipreading Using Temporal Convolutional Networks.

Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, 2020

Speech-Driven Facial Animation Using Polynomial Fusion of Features.

Triantafyllos Kefalas

Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, 2020

Towards Pose-Invariant Lip-Reading.

Shiyang Cheng

Georgios Tzimiropoulos

Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, 2020

2019

A real-time and unsupervised face re-identification system for human-robot interaction.

Pattern Recognit. Lett., 2019

Detecting Adversarial Attacks On Audio-Visual Speech Recognition.

CoRR, 2019

Video-Driven Speech Reconstruction Using Generative Adversarial Networks.

Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Investigating the Lombard Effect Influence on End-to-End Audio-Visual Speech Recognition.

Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

End-to-End Speech-Driven Realistic Facial Animation with Temporal GANs.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019

2018

Transfer Learning for Action Unit Recognition.

CoRR, 2018

Audio-Visual Speech Recognition with a Hybrid CTC/Attention Architecture.

Themos Stafylakis

Georgios Tzimiropoulos

Proceedings of the 2018 IEEE Spoken Language Technology Workshop, 2018

End-to-End Audiovisual Speech Recognition.

Georgios Tzimiropoulos

Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, 2018

Visual-Only Recognition of Normal, Whispered and Silent Speech.

Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, 2018

Online Attention for Interpretable Conflict Estimation in Political Debates.

Proceedings of the 13th IEEE International Conference on Automatic Face & Gesture Recognition, 2018

End-to-End Speech-Driven Facial Animation with Temporal GANs.

Proceedings of the British Machine Vision Conference 2018, 2018

2017

Local Deep Neural Networks for Age and Gender Classification.

Zukang Liao

CoRR, 2017

Audio-visual object localization and separation using low-rank and sparsity.

Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, 2017

End-to-end visual speech recognition with LSTMs.

Zuwei Li

Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, 2017

End-to-End Multi-View Lipreading.

Proceedings of the British Machine Vision Conference 2017, 2017

End-to-End Audiovisual Fusion with LSTMs.

Proceedings of the 14th International Conference on Auditory-Visual Speech Processing, 2017

2016

Discrimination Between Native and Non-Native Speech Using Visual Features Only.

Christos Georgakis

IEEE Trans. Cybern., 2016

Prediction-Based Audiovisual Fusion for Classification of Non-Linguistic Vocalisations.

IEEE Trans. Affect. Comput., 2016

Multi-modal Neural Conditional Ordinal Random Fields for agreement level estimation.

Proceedings of the 23rd International Conference on Pattern Recognition, 2016

Deep complementary bottleneck features for visual speech recognition.

Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, 2016

2015

The MAHNOB Mimicry Database: A database of naturalistic human interactions.

Pattern Recognit. Lett., 2015

Comparison of single-model and multiple-model prediction-based audiovisual fusion.

Varun Rajgarhia

Proceedings of the Auditory-Visual Speech Processing, 2015

Neural conditional ordinal random fields for agreement level estimation.

Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction, 2015

2014

Discriminating Native from Non-Native Speech Using Fusion of Visual Cues.

Christos Georgakis

Proceedings of the ACM International Conference on Multimedia, MM '14, Orlando, FL, USA, November 03, 2014

Visual-only discrimination between native and non-native speech.

Christos Georgakis

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2014

2013

The MAHNOB Laughter database.

Brais Martínez

Image Vis. Comput., 2013

Bimodal log-linear regression for fusion of audio and visual features.

Ognjen Rudovic

Proceedings of the ACM Multimedia Conference, 2013

Audiovisual Detection of Laughter in Human-Machine Interaction.

Maelle Leveque

Proceedings of the 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, 2013

Audiovisual Detection of Behavioural Mimicry.

Sanjay Bilakhia

Proceedings of the 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, 2013

2012

Audiovisual discrimination between laughter and speech.

PhD thesis, 2012

Comparison of prediction-based fusion and feature-level fusion across different learning models.

Sanjay Bilakhia

Proceedings of the 20th ACM Multimedia Conference, MM '12, Nara, Japan, October 29, 2012

Audiovisual vocal outburst classification in noisy acoustic conditions.

Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, 2012

2011

Audiovisual Discrimination Between Speech and Laughter: Why and When Visual Information Might Help.

IEEE Trans. Multim., 2011

Audiovisual classification of vocal outbursts in human conversation using Long-Short-Term Memory networks.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2011

Prediction-based classification for audiovisual discrimination between laughter and speech.

Jeffrey F. Cohn

Proceedings of the Ninth IEEE International Conference on Automatic Face and Gesture Recognition (FG 2011), 2011

2010

Classifying laughter and speech using audio-visual feature prediction.

Ali Asghar

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2010

2009

Static vs. dynamic modeling of human nonverbal behavior from multiple cues and modalities.

Proceedings of the 11th International Conference on Multimodal Interfaces, 2009

Is this joke really funny? judging the mirth by audiovisual laughter analysis.

Proceedings of the 2009 IEEE International Conference on Multimedia and Expo, 2009

2008

Learning to Detect Aircraft at Low Resolutions.

Christopher Geyer

Sanjiv Singh

Proceedings of the Computer Vision Systems, 6th International Conference, 2008

Audiovisual laughter detection based on temporal features.

Proceedings of the 10th International Conference on Multimodal Interfaces, 2008

Audiovisual discrimination between laughter and speech.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2008

Fusion of audio and visual cues for laughter detection.

Proceedings of the 7th ACM International Conference on Image and Video Retrieval, 2008

2007

Machine learned regression for abductive DNA sequencing.

David Thornley

Maxim Zverev

Proceedings of the Sixth International Conference on Machine Learning and Applications, 2007

Decoding Trace Peak Behaviour - A Neuro-Fuzzy Approach.

David Thornley

Proceedings of the FUZZ-IEEE 2007, 2007

2006

Construction of Neural Network Based Lyapunov Functions.

Vassilios Petridis

Proceedings of the International Joint Conference on Neural Networks, 2006

Machine Learning in Basecalling Decoding Trace Peak Behaviour.

David Thornley