Haotian Zhang

Orcid: 0000-0001-6809-0426

Affiliations:

Apple AI/ML, Cupertino, CA, USA
University of Washington, Department of Electrical and Computer Engineering, Seattle, WA, USA

According to our database, Haotian Zhang authored at least 31 papers between 2019 and 2025.

Collaborative distances:

Dijkstra number of four.
Erdős number of four.

Bibliography

2025

Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents.

CoRR, September, 2025

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer.

CoRR, September, 2025

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning.

Jean-Philippe Fauconnier

Zhengfeng Lai

Haoxuan You

Zirui Wang

et al.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

MMEgo: Towards Building Egocentric Multimodal LLMs for Video QA.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms.

Mohana Prasad Sathya Moorthy

Jeffrey Nichols

Yinfei Yang

Zhe Gan

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Improve Vision Language Model Chain-of-thought Reasoning.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

2024

Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms.

Mohana Prasad Sathya Moorthy

Jeff Nichols

Yinfei Yang

Zhe Gan

CoRR, 2024

MM-Ego: Towards Building Egocentric Multimodal LLMs.

CoRR, 2024

Contrastive Localized Language-Image Pre-Training.

CoRR, 2024

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning.

CoRR, 2024

Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models.

CoRR, 2024

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training.

Brandon McKinzie

Zhe Gan

Jean-Philippe Fauconnier

CoRR, 2024

How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts.

CoRR, 2024

Empowering Unsupervised Domain Adaptation with Large-scale Pre-trained Vision-Language Models.

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024

Ferret: Refer and Ground Anything Anywhere at Any Granularity.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs.

Proceedings of the Computer Vision - ECCV 2024, 2024

MM1: Methods, Analysis and Insights from Multimodal LLM Pre-training.

Brandon McKinzie

Zhe Gan

Jean-Philippe Fauconnier

Proceedings of the Computer Vision - ECCV 2024, 2024

VeCLIP: Improving CLIP Training via Visual-Enriched Captions.

Proceedings of the Computer Vision - ECCV 2024, 2024

2023

From Scarcity to Efficiency: Improving CLIP Training via Visual-enriched Captions.

CoRR, 2023

2022

DIOR: DIstill Observations to Representations for Multi-Object Tracking and Segmentation.

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops, 2022

GLIPv2: Unifying Localization and Vision-Language Understanding.

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Grounded Language-Image Pre-training.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021

ROD2021 Challenge: A Summary for Radar Object Detection Challenge for Autonomous Driving Applications.

Proceedings of the ICMR '21: International Conference on Multimedia Retrieval, 2021

Monocular 3D Localization of Vehicles in Road Scenes.

Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2021

2020

Bundle Adjustment for Monocular Visual Odometry Based on Detections of Traffic Signs.

IEEE Trans. Veh. Technol., 2020

IA-MOT: Instance-Aware Multi-Object Tracking with Motion Consistency.

CoRR, 2020

2019

Eye in the Sky: Drone-Based Object Tracking and 3D Localization.

Proceedings of the 27th ACM International Conference on Multimedia, 2019

Exploit the Connectivity: Multi-Object Tracking with TrackletNet.

Proceedings of the 27th ACM International Conference on Multimedia, 2019

Bundle Adjustment for Monocular Visual Odometry Based on Detected Traffic Sign Features.

Proceedings of the 2019 IEEE International Conference on Image Processing, 2019

VisDrone-MOT2019: The Vision Meets Drone Multiple Object Tracking Challenge Results.

Kannappan Palaniappan

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshops, 2019
