David F. Harwath

ORCID: 0000-0003-0206-0253

Affiliations:
  • Massachusetts Institute of Technology, Cambridge, USA (PhD 2018)


According to our database, David F. Harwath authored at least 63 papers between 2012 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2024
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild.
CoRR, 2024

SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data.
CoRR, 2024

Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model.
CoRR, 2024

BAT: Learning to Reason about Spatial Sounds with Large Language Models.
CoRR, 2024

2023
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models.
CoRR, 2023

Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos.
CoRR, 2023

Unit-based Speech-to-Speech Translation Without Parallel Data.
CoRR, 2023

Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages.
CoRR, 2023

Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model.
CoRR, 2023

Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization.
CoRR, 2023

Subject Generalization in Classifying Imagined and Spoken Speech with MEG.
Proceedings of the 11th International IEEE/EMBS Conference on Neural Engineering, 2023

Learning to Map Efficiently by Active Echolocation.
IROS, 2023

Contrastive Audio-Visual Masked Autoencoder.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval.
Proceedings of the IEEE International Conference on Acoustics, 2023

A Dataset for Foreground Speech Analysis With Smartwatches In Everyday Home Environments.
Proceedings of the IEEE International Conference on Acoustics, 2023

Unsupervised Fine-Tuning Data Selection for ASR Using Self-Supervised Speech Models.
Proceedings of the IEEE International Conference on Acoustics, 2023

Continual Learning for On-Device Speech Recognition Using Disentangled Conformers.
Proceedings of the IEEE International Conference on Acoustics, 2023

Learning Audio-Visual Dereverberation.
Proceedings of the IEEE International Conference on Acoustics, 2023

M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval.
Proceedings of the IEEE International Conference on Acoustics, 2023

Audio-Visual Neural Syntax Acquisition.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

When to Use Efficient Self Attention? Profiling Text, Speech and Image Transformer Variants.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2023

2022
Automated detection of foreground speech with wearable sensing in everyday home environments: A transfer learning approach.
CoRR, 2022

Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling.
CoRR, 2022

Phoneme Segmentation Using Self-Supervised Speech Models.
Proceedings of the IEEE Spoken Language Technology Workshop, 2022

SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model.
Proceedings of the IEEE Spoken Language Technology Workshop, 2022

Speak: A Toolkit Using Amazon Mechanical Turk to Collect and Validate Speech Audio Recordings.
Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022

Word Discovery in Visually Grounded, Self-Supervised Speech Models.
Proceedings of the Interspeech 2022, 2022

Exploring Few-Shot Fine-Tuning Strategies for Models of Visually Grounded Speech.
Proceedings of the Interspeech 2022, 2022

MAE-AST: Masked Autoencoding Audio Spectrogram Transformer.
Proceedings of the Interspeech 2022, 2022

Adversarial Input Ablation for Audio-Visual Learning.
Proceedings of the IEEE International Conference on Acoustics, 2022

Fast-Slow Transformer for Visually Grounding Speech.
Proceedings of the IEEE International Conference on Acoustics, 2022

Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality.
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

Everything at Once - Multi-modal Fusion Transformer for Video Retrieval.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021
Routing with Self-Attention for Multimodal Capsule Networks.
CoRR, 2021

AVLnet: Learning Audio-Visual Language Representations from Instructional Videos.
Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August, 2021

Cascaded Multilingual Audio-Visual Learning from Videos.
Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August, 2021

Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Spoken Moments: Learning Joint Audio-Visual Representations From Video Descriptions.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

Text-Free Image-to-Speech Synthesis Using Learned Segmental Units.
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021

2020
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input.
Int. J. Comput. Vis., 2020

Text-Free Image-to-Speech Synthesis Using Learned Segmental Units.
CoRR, 2020

AVLnet: Learning Audio-Visual Language Representations from Instructional Videos.
CoRR, 2020

Pair Expansion for Learning Multilingual Semantic Embeddings Using Disjoint Visually-Grounded Speech Audio Datasets.
Proceedings of the Interspeech 2020, 2020

Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech.
Proceedings of the 8th International Conference on Learning Representations, 2020

Trilingual Semantic Embeddings of Visually Grounded Speech with Self-Attention Mechanisms.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

2019
Transfer Learning from Audio-Visual Grounding to Speech Recognition.
Proceedings of the Interspeech 2019, 2019

Towards Bilingual Lexicon Discovery From Visually Grounded Speech Audio.
Proceedings of the Interspeech 2019, 2019

Towards Visually Grounded Sub-word Speech Unit Discovery.
Proceedings of the IEEE International Conference on Acoustics, 2019

Learning Words by Drawing Images.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

Grounding Spoken Words in Unlabeled Video.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019

2018
Learning spoken language through vision.
PhD thesis, 2018

Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

2017
Learning modality-invariant representations for speech and images.
Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop, 2017

Learning Word-Like Units from Joint Audio-Visual Analysis.
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017

2016
On the Use of Acoustic Unit Discovery for Language Recognition.
IEEE ACM Trans. Audio Speech Lang. Process., 2016

Look, listen, and decode: Multimodal speech recognition with images.
Proceedings of the 2016 IEEE Spoken Language Technology Workshop, 2016

Unsupervised Learning of Spoken Language with Visual Context.
Proceedings of the Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, 2016

2015
Deep multimodal semantic embeddings for speech and images.
Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, 2015

2014
Choosing useful word alternates for automatic speech recognition correction interfaces.
Proceedings of the INTERSPEECH 2014, 2014

Speech recognition without a lexicon - bridging the gap between graphemic and phonetic systems.
Proceedings of the INTERSPEECH 2014, 2014

2013

Zero resource spoken audio corpus analysis.
Proceedings of the IEEE International Conference on Acoustics, 2013

2012
Topic identification based extrinsic evaluation of summarization techniques applied to conversational speech.
Proceedings of the 2012 IEEE International Conference on Acoustics, 2012

