Bowen Zhang

Orcid: 0000-0002-4971-4878

Affiliations:

Apple, USA
University of Southern California, Department of Computer Science, CA, USA (PhD 2022)
Tongji University, Department of Computer Science and Technology, Shanghai, China (former)

According to our database, Bowen Zhang authored at least 42 papers between 2014 and 2025.

Collaborative distances:

Dijkstra number of four.
Erdős number of four.

Bibliography

2025

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer.

CoRR, September, 2025

AXLearn: Modular Large Model Training on Heterogeneous Infrastructure.

CoRR, July, 2025

DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation.

CoRR, March, 2025

Contrastive Localized Language-Image Pre-Training.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning.

Jean-Philippe Fauconnier

Zhengfeng Lai

Haoxuan You

Zirui Wang

et al.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

MMEgo: Towards Building Egocentric Multimodal LLMs for Video QA.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Improve Vision Language Model Chain-of-thought Reasoning.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

2024

STIV: Scalable Text and Image Conditioned Video Generation.

CoRR, 2024

MM-Ego: Towards Building Egocentric Multimodal LLMs.

CoRR, 2024

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning.

CoRR, 2024

Apple Intelligence Foundation Language Models.

Albin Madappally Jose

Hannah Gillis Coleman

CoRR, 2024

Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models.

CoRR, 2024

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training.

Brandon McKinzie

Zhe Gan

Jean-Philippe Fauconnier

CoRR, 2024

Ferret: Refer and Ground Anything Anywhere at Any Granularity.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

MOFI: Learning Image Representations from Noisy Entity Annotated Images.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

MM1: Methods, Analysis and Insights from Multimodal LLM Pre-training.

Brandon McKinzie

Zhe Gan

Jean-Philippe Fauconnier

Proceedings of the Computer Vision - ECCV 2024, 2024

VeCLIP: Improving CLIP Training via Visual-Enriched Captions.

Proceedings of the Computer Vision - ECCV 2024, 2024

2023

Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts.

CoRR, 2023

MOFI: Learning Image Representations from Noisy Entity Annotated Images.

CoRR, 2023

Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness.

CoRR, 2023

STAIR: Learning Sparse Text and Image Representation in Grounded Tokens.

Albin Madappally Jose

CoRR, 2023

STAIR: Learning Sparse Text and Image Representation in Grounded Tokens.

Albin Madappally Jose

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

2021

Co-training Transformer with Videos and Images Improves Action Recognition.

CoRR, 2021

CREATe: Clinical Report Extraction and Annotation Technology.

Proceedings of the 37th IEEE International Conference on Data Engineering, 2021

Systematic Generalization on gSCAN: What is Nearly Solved and What is Next?

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021

Visually Grounded Concept Composition.

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, 2021

2020

A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus.

CoRR, 2020

Visual Storytelling via Predicting Anchor Word Embeddings in the Stories.

Bowen Zhang

Hexiang Hu

Fei Sha

CoRR, 2020

Learning to Represent Image and Text with Denotation Graph.

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020

2019

Topic Augmented Generator for Abstractive Summarization.

Melissa Ailem

Bowen Zhang

Fei Sha

CoRR, 2019

2018

Real-Time Action Recognition With Deeply Transferred Motion Vector CNNs.

IEEE Trans. Image Process., 2018

A Probabilistic Model for Joint Learning of Word Embeddings from Texts and Images.

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31, 2018

Cross-Modal and Hierarchical Modeling of Video and Text.

Bowen Zhang

Hexiang Hu

Fei Sha

Proceedings of the Computer Vision - ECCV 2018, 2018

2017

Weakly Supervised PatchNets: Describing and Aggregating Local Patches for Scene Recognition.

IEEE Trans. Image Process., 2017

Learning correlations for human action recognition in videos.

Yun Yi

Hanli Wang

Bowen Zhang

Multim. Tools Appl., 2017

2016

CUHK & ETHZ & SIAT Submission to ActivityNet Challenge 2016.

CoRR, 2016

Real-Time Action Recognition with Enhanced Motion Vector CNNs.

Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016

2015

Encoding scale into fisher vector for human action recognition.

Bowen Zhang

Hanli Wang

Proceedings of the 2015 Visual Communications and Image Processing, 2015

MIC-TJU in MediaEval 2015 Affective Impact of Movies Task.

Proceedings of the Working Notes Proceedings of the MediaEval 2015 Workshop, 2015

2014

MIC_TJ at TRECVID 2014.

Proceedings of the 2014 TREC Video Retrieval Evaluation, 2014

MIC-TJU at MediaEval Violent Scenes Detection (VSD) 2014.

Proceedings of the Working Notes Proceedings of the MediaEval 2014 Workshop, 2014
