Arsha Nagrani

Orcid: 0000-0003-2190-9013

According to our database¹, Arsha Nagrani authored at least 73 papers between 2017 and 2026.

Collaborative distances:

Dijkstra number² of four.
Erdős number³ of four.

Bibliography

2026

Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding.

Sudheendra Vijayanarasimhan

CoRR, May, 2026

CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning.

CoRR, April, 2026

CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning.

CoRR, January, 2026

2025

More than a Moment: Towards Coherent Sequences of Audio Descriptions.

CoRR, October, 2025

CAViAR: Critic-Augmented Video Agentic Reasoning.

CoRR, September, 2025

Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Minerva: Evaluating Complex Video Reasoning.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Flexible Frame Selection for Efficient Video Reasoning.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2024

The VoxCeleb Speaker Recognition Challenge: A Retrospective.

IEEE/ACM Trans. Audio Speech Lang. Process., 2024

Neptune: The Long Orbit to Benchmarking Long Video Understanding.

Nitesh Bharadwaj Gundavarapu

CoRR, 2024

Mixture of Nested Experts: Adaptive Processing of Visual Tokens.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

VIEWS: Entity-Aware News Video Captioning.

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Streaming Dense Video Captioning.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

MoReVQA: Exploring Modular Reasoning Models for Video Question Answering.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

VicTR: Video-conditioned Text Representations for Activity Recognition.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

AutoAD III: The Prequel - Back to the Pixels.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

On Scaling Up a Multilingual Vision and Language Model.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description.

Proceedings of the Computer Vision - ACCV 2024, 2024

2023

Video Summarization: Towards Entity-Aware Captions.

CoRR, 2023

PaLI-X: On Scaling up a Multilingual Vision and Language Model.

CoRR, 2023

VoxSRC 2022: The Fourth VoxCeleb Speaker Recognition Challenge.

CoRR, 2023

VidChapters-7M: Video Chapters at Scale.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

LanSER: Language-Model Supported Speech Emotion Recognition.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

UnLoc: A Unified Framework for Video Localization Tasks.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Verbs in Action: Improving verb understanding in video-language models.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

AutoAD II: The Sequel - Who, When, and What in Movie Audio Description.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR.

Paul Hongsuck Seo

Arsha Nagrani

Cordelia Schmid

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

AutoAD: Movie Description in Context.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Modular Visual Question Answering via Code Generation.

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2023

2022

AVATAR submission to the Ego4D AV Transcription Challenge.

Paul Hongsuck Seo

Arsha Nagrani

Cordelia Schmid

CoRR, 2022

TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency.

CoRR, 2022

M&M Mix: A Multimodal Multiview Transformer Ensemble.

CoRR, 2022

A CLIP-Hitchhiker's Guide to Long Video Retrieval.

CoRR, 2022

VoxSRC 2021: The Third VoxCeleb Speaker Recognition Challenge.

CoRR, 2022

Masking Modalities for Cross-modal Video Retrieval.

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022

AVATAR: Unconstrained Audiovisual Speech Recognition.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

TL;DW? Summarizing Instructional Videos with Task Relevance and Cross-Modal Saliency.

Proceedings of the Computer Vision - ECCV 2022, 2022

Learning Audio-Video Modalities from Image Captions.

Proceedings of the Computer Vision - ECCV 2022, 2022

End-to-end Generative Pretraining for Multimodal Video Captioning.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021

WiCV 2020: The Seventh Women In Computer Vision Workshop.

CoRR, 2021

Attention Bottlenecks for Multimodal Fusion.

Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Composable Augmentation Encoding for Video Representation Learning.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Slow-Fast Auditory Streams for Audio Recognition.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

Playing a Part: Speaker Verification at the movies.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

Look Before You Speak: Visually Contextualized Utterances.

Paul Hongsuck Seo

Arsha Nagrani

Cordelia Schmid

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

Localizing Visual Sounds the Hard Way.

Honglie Chen

Weidi Xie

Triantafyllos Afouras

Arsha Nagrani

Andrea Vedaldi

Andrew Zisserman

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition.

Proceedings of the 32nd British Machine Vision Conference 2021, 2021

Audio-Visual Synchronisation in the wild.

Triantafyllos Afouras

Proceedings of the 32nd British Machine Vision Conference 2021, 2021

2020

Video understanding using multimodal deep learning.

Arsha Nagrani

PhD thesis, 2020

Voxceleb: Large-scale speaker verification in the wild.

Comput. Speech Lang., 2020

VoxSRC 2020: The Second VoxCeleb Speaker Recognition Challenge.

CoRR, 2020

Cough Against COVID: Evidence of COVID-19 Signature in Cough Sounds.

CoRR, 2020

The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020).

CoRR, 2020

Spot the Conversation: Speaker Diarisation in the Wild.

Joon Son Chung

Jaesung Huh

Arsha Nagrani

Triantafyllos Afouras

Andrew Zisserman

Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Disentangled Speech Embeddings Using Cross-Modal Self-Supervision.

Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, 2020

Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos.

Proceedings of the Computer Vision - ECCV 2020, 2020

Speech2Action: Cross-Modal Supervision for Action Recognition.

Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

Condensed Movies: Story Based Retrieval with Contextual Embeddings.

Proceedings of the Computer Vision - ACCV 2020 - 15th Asian Conference on Computer Vision, Kyoto, Japan, November 30, 2020

2019

VoxSRC 2019: The first VoxCeleb Speaker Recognition Challenge.

CoRR, 2019

Count, Crop and Recognise: Fine-Grained Recognition in the Wild.

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshops, 2019

EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition.

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

Utterance-level Aggregation for Speaker Recognition in the Wild.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2019

WiCV 2019: The Sixth Women In Computer Vision Workshop.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019

Use What You Have: Video retrieval using representations from collaborative experts.

Proceedings of the 30th British Machine Vision Conference 2019, 2019

2018

Emotion Recognition in Speech using Cross-Modal Transfer in the Wild.

Proceedings of the 2018 ACM Multimedia Conference, 2018

VoxCeleb2: Deep Speaker Recognition.

Joon Son Chung

Arsha Nagrani

Andrew Zisserman

Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Learnable PINs: Cross-modal Embeddings for Person Identity.

Arsha Nagrani

Samuel Albanie

Andrew Zisserman

Proceedings of the Computer Vision - ECCV 2018, 2018

Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching.

Arsha Nagrani

Samuel Albanie

Andrew Zisserman

Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018

2017

VoxCeleb: A Large-Scale Speaker Identification Dataset.

Arsha Nagrani

Joon Son Chung

Andrew Zisserman

Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

From Benedict Cumberbatch to Sherlock Holmes: Character Identification in TV series without a Script.

Arsha Nagrani

Andrew Zisserman

Proceedings of the British Machine Vision Conference 2017, 2017
