Bowen Zhang

Orcid: 0000-0002-4971-4878

Affiliations:
  • Apple, USA
  • University of Southern California, Department of Computer Science, CA, USA (PhD 2022)
  • Tongji University, Department of Computer Science and Technology, Shanghai, China (former)


According to our database, Bowen Zhang authored at least 40 papers between 2014 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
AXLearn: Modular Large Model Training on Heterogeneous Infrastructure.
CoRR, July, 2025

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

MMEgo: Towards Building Egocentric Multimodal LLMs for Video QA.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Improve Vision Language Model Chain-of-thought Reasoning.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

2024
STIV: Scalable Text and Image Conditioned Video Generation.
CoRR, 2024

MM-Ego: Towards Building Egocentric Multimodal LLMs.
CoRR, 2024

Contrastive Localized Language-Image Pre-Training.
CoRR, 2024

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning.
CoRR, 2024

Apple Intelligence Foundation Language Models.
CoRR, 2024

Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models.
CoRR, 2024

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training.
CoRR, 2024

Ferret: Refer and Ground Anything Anywhere at Any Granularity.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

MOFI: Learning Image Representations from Noisy Entity Annotated Images.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

VeCLIP: Improving CLIP Training via Visual-Enriched Captions.
Proceedings of the Computer Vision - ECCV 2024, 2024

2023
Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts.
CoRR, 2023

MOFI: Learning Image Representations from Noisy Entity Annotated Images.
CoRR, 2023

Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness.
CoRR, 2023

STAIR: Learning Sparse Text and Image Representation in Grounded Tokens.
CoRR, 2023

STAIR: Learning Sparse Text and Image Representation in Grounded Tokens.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

2021
Co-training Transformer with Videos and Images Improves Action Recognition.
CoRR, 2021

CREATe: Clinical Report Extraction and Annotation Technology.
Proceedings of the 37th IEEE International Conference on Data Engineering, 2021

Systematic Generalization on gSCAN: What is Nearly Solved and What is Next?
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021

Visually Grounded Concept Composition.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, 2021

2020
A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus.
CoRR, 2020

Visual Storytelling via Predicting Anchor Word Embeddings in the Stories.
CoRR, 2020

Learning to Represent Image and Text with Denotation Graph.
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020

2019
Topic Augmented Generator for Abstractive Summarization.
CoRR, 2019

2018
Real-Time Action Recognition With Deeply Transferred Motion Vector CNNs.
IEEE Trans. Image Process., 2018

A Probabilistic Model for Joint Learning of Word Embeddings from Texts and Images.
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31, 2018

Cross-Modal and Hierarchical Modeling of Video and Text.
Proceedings of the Computer Vision - ECCV 2018, 2018

2017
Weakly Supervised PatchNets: Describing and Aggregating Local Patches for Scene Recognition.
IEEE Trans. Image Process., 2017

Learning correlations for human action recognition in videos.
Multim. Tools Appl., 2017

2016
CUHK & ETHZ & SIAT Submission to ActivityNet Challenge 2016.
CoRR, 2016

Real-Time Action Recognition with Enhanced Motion Vector CNNs.
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016

2015
Encoding scale into fisher vector for human action recognition.
Proceedings of the 2015 Visual Communications and Image Processing, 2015

MIC-TJU in MediaEval 2015 Affective Impact of Movies Task.
Proceedings of the Working Notes Proceedings of the MediaEval 2015 Workshop, 2015

2014
MIC_TJ at TRECVID 2014.
Proceedings of the 2014 TREC Video Retrieval Evaluation, 2014

MIC-TJU at MediaEval Violent Scenes Detection (VSD) 2014.
Proceedings of the Working Notes Proceedings of the MediaEval 2014 Workshop, 2014

