Arsha Nagrani

ORCID: 0000-0003-2190-9013

According to our database, Arsha Nagrani authored at least 55 papers between 2017 and 2023.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2023
Video Summarization: Towards Entity-Aware Captions.
CoRR, 2023

LanSER: Language-Model Supported Speech Emotion Recognition.
CoRR, 2023

PaLI-X: On Scaling up a Multilingual Vision and Language Model.
CoRR, 2023

VicTR: Video-conditioned Text Representations for Activity Recognition.
CoRR, 2023

VoxSRC 2022: The Fourth VoxCeleb Speaker Recognition Challenge.
CoRR, 2023

VidChapters-7M: Video Chapters at Scale.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

UnLoc: A Unified Framework for Video Localization Tasks.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Verbs in Action: Improving verb understanding in video-language models.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

AutoAD II: The Sequel - Who, When, and What in Movie Audio Description.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

AutoAD: Movie Description in Context.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Modular Visual Question Answering via Code Generation.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2023

2022
AVATAR submission to the Ego4D AV Transcription Challenge.
CoRR, 2022

TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency.
CoRR, 2022

M&M Mix: A Multimodal Multiview Transformer Ensemble.
CoRR, 2022

A CLIP-Hitchhiker's Guide to Long Video Retrieval.
CoRR, 2022

VoxSRC 2021: The Third VoxCeleb Speaker Recognition Challenge.
CoRR, 2022

Masking Modalities for Cross-modal Video Retrieval.
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022

AVATAR: Unconstrained Audiovisual Speech Recognition.
Proceedings of the Interspeech 2022, 2022

TL;DW? Summarizing Instructional Videos with Task Relevance and Cross-Modal Saliency.
Proceedings of the Computer Vision - ECCV 2022, 2022

Learning Audio-Video Modalities from Image Captions.
Proceedings of the Computer Vision - ECCV 2022, 2022

End-to-end Generative Pretraining for Multimodal Video Captioning.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021
WiCV 2020: The Seventh Women In Computer Vision Workshop.
CoRR, 2021

Attention Bottlenecks for Multimodal Fusion.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Composable Augmentation Encoding for Video Representation Learning.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Slow-Fast Auditory Streams for Audio Recognition.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

Playing a Part: Speaker Verification at the movies.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

Look Before You Speak: Visually Contextualized Utterances.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

Localizing Visual Sounds the Hard Way.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition.
Proceedings of the 32nd British Machine Vision Conference 2021, 2021

Audio-Visual Synchronisation in the wild.
Proceedings of the 32nd British Machine Vision Conference 2021, 2021

2020
Video understanding using multimodal deep learning.
PhD thesis, 2020

Voxceleb: Large-scale speaker verification in the wild.
Comput. Speech Lang., 2020

VoxSRC 2020: The Second VoxCeleb Speaker Recognition Challenge.
CoRR, 2020

Cough Against COVID: Evidence of COVID-19 Signature in Cough Sounds.
CoRR, 2020

The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020).
CoRR, 2020

Spot the Conversation: Speaker Diarisation in the Wild.
Proceedings of the Interspeech 2020, 2020

Disentangled Speech Embeddings Using Cross-Modal Self-Supervision.
Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, 2020

Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos.
Proceedings of the Computer Vision - ECCV 2020, 2020

Speech2Action: Cross-Modal Supervision for Action Recognition.
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

Condensed Movies: Story Based Retrieval with Contextual Embeddings.
Proceedings of the Computer Vision - ACCV 2020 - 15th Asian Conference on Computer Vision, Kyoto, Japan, 2020

2019
VoxSRC 2019: The first VoxCeleb Speaker Recognition Challenge.
CoRR, 2019

Count, Crop and Recognise: Fine-Grained Recognition in the Wild.
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshops, 2019

EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition.
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

Utterance-level Aggregation for Speaker Recognition in the Wild.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2019

WiCV 2019: The Sixth Women In Computer Vision Workshop.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019

Use What You Have: Video retrieval using representations from collaborative experts.
Proceedings of the 30th British Machine Vision Conference 2019, 2019

2018
Emotion Recognition in Speech using Cross-Modal Transfer in the Wild.
Proceedings of the 2018 ACM Multimedia Conference on Multimedia Conference, 2018

VoxCeleb2: Deep Speaker Recognition.
Proceedings of the Interspeech 2018, 2018

Learnable PINs: Cross-modal Embeddings for Person Identity.
Proceedings of the Computer Vision - ECCV 2018, 2018

Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching.
Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018

2017
VoxCeleb: A Large-Scale Speaker Identification Dataset.
Proceedings of the Interspeech 2017, 2017

From Benedict Cumberbatch to Sherlock Holmes: Character Identification in TV series without a Script.
Proceedings of the British Machine Vision Conference 2017, 2017
