Yifan Peng

ORCID: 0000-0002-8581-8674

Affiliations:
  • NVIDIA Corporation, Santa Clara, CA, USA
  • Carnegie Mellon University, Department of Electrical and Computer Engineering, Pittsburgh, PA, USA (PhD 2025)


According to our database, Yifan Peng authored at least 52 papers between 2022 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
OpusLM: A Family of Open Unified Speech Language Models.
CoRR, June, 2025

DYNAC: Dynamic Vocabulary based Non-Autoregressive Contextualization for Speech Recognition.
CoRR, June, 2025

OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning.
CoRR, June, 2025

Granary: Speech Recognition and Translation Dataset in 25 European Languages.
CoRR, May, 2025

On The Landscape of Spoken Language Models: A Comprehensive Survey.
CoRR, April, 2025

ESPnet-SDS: Unified Toolkit and Demo for Spoken Dialogue Systems.
CoRR, March, 2025

ESPnet-SpeechLM: An Open Speech Language Model Toolkit.
CoRR, February, 2025

OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models.
CoRR, February, 2025

VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning.
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025

Context-aware Dynamic Pruning for Speech Foundation Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Enhancing Audiovisual Speech Recognition Through Bifocal Preference Optimization.
Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

2024
4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders.
CoRR, 2024

MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation.
CoRR, 2024

An Empirical Study of Speech Language Models for Prompt-Conditioned Speech Synthesis.
CoRR, 2024

SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition.
CoRR, 2024

Robust Audiovisual Speech Recognition Models with Mixture-of-Experts.
Proceedings of the IEEE Spoken Language Technology Workshop, 2024

Contextualized Automatic Speech Recognition With Dynamic Vocabulary.
Proceedings of the IEEE Spoken Language Technology Workshop, 2024

ESPnet-EZ: Python-Only ESPnet For Easy Fine-Tuning And Integration.
Proceedings of the IEEE Spoken Language Technology Workshop, 2024

UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions.
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024

On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

Multi-Convformer: Extending Conformer with Multiple Convolution Kernels.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

Contextualized Automatic Speech Recognition With Attention-Based Bias Phrase Boosted Beam Search.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition, Synthesis and Speech, Text Continuation Tasks.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

Dynamic-SUPERB: Towards a Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark For Speech.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

Towards Robust Speech Representation Learning for Thousands of Languages.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

2023
UniverSLU: Universal Spoken Language Understanding for Diverse Classification and Sequence Generation Tasks with a Single Network.
CoRR, 2023

CMU's IWSLT 2023 Simultaneous Speech Translation System.
Proceedings of the 20th International Conference on Spoken Language Translation, 2023

Time-synchronous one-pass Beam Search for Parallel Online and Offline Transducers with Dynamic Block Training.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Tensor decomposition for minimization of E2E SLU model toward on-device processing.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

I3D: Transformer Architectures with Input-Dependent Dynamic Depth for Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

Structured Pruning of Self-Supervised Pre-Trained Models for Speech Recognition and Understanding.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

SpeechLMScore: Evaluating Speech Generation Using Speech Language Model.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

E-Branchformer-Based E2E SLU Toward STOP On-Device Challenge.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

The Pipeline System of ASR and NLU with MLM-Based Data Augmentation Toward STOP Low-Resource Challenge.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

Improving Massively Multilingual ASR with Auxiliary CTC Objectives.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

A Study on the Integration of Pipeline and E2E SLU Systems for Spoken Semantic Parsing Toward STOP Quality Challenge.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

Reproducing Whisper-Style Training Using An Open-Source Toolkit And Publicly Available Data.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

Joint Prediction and Denoising for Large-Scale Multilingual Self-Supervised Learning.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2023

2022
A Study on the Integration of Pre-Trained SSL, ASR, LM and SLU Models for Spoken Language Understanding.
Proceedings of the IEEE Spoken Language Technology Workshop, 2022

E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition.
Proceedings of the IEEE Spoken Language Technology Workshop, 2022

CMU's IWSLT 2022 Dialect Speech Translation System.
Proceedings of the 19th International Conference on Spoken Language Translation, 2022

Attention Weight Smoothing Using Prior Distributions for Transformer-Based End-to-End ASR.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding.
Proceedings of the International Conference on Machine Learning, 2022

ESPnet-SLU: Advancing Spoken Language Understanding Through ESPnet.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2022

