Kaipeng Zhang

Orcid: 0000-0001-6105-6532

According to our database¹, Kaipeng Zhang authored at least 118 papers between 2016 and 2025.

Collaborative distances:

Dijkstra number² of three.
Erdős number³ of three.

Bibliography

2025

TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning.

CoRR, November, 2025

From Pixels to Paths: A Multi-Agent Framework for Editable Scientific Illustration.

CoRR, October, 2025

Dialogue as Discovery: Navigating Human Intent Through Principled Inquiry.

CoRR, October, 2025

OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling.

CoRR, September, 2025

Symbolic Graphics Programming with Large Language Models.

CoRR, September, 2025

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency.

CoRR, August, 2025

InMind: Evaluating LLMs in Capturing and Applying Individual Human Reasoning Styles.

CoRR, August, 2025

MDK12-Bench: A Comprehensive Evaluation of Multimodal Large Language Models on Multidisciplinary Exams.

CoRR, August, 2025

Yume: An Interactive World Generation Model.

CoRR, July, 2025

PyVision: Agentic Vision with Dynamic Tooling.

CoRR, July, 2025

Neural-Driven Image Editing.

CoRR, July, 2025

TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models.

IEEE Trans. Big Data, June, 2025

InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models.

CoRR, June, 2025

Sekai: A Video Dataset towards World Exploration.

CoRR, June, 2025

A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation.

CoRR, June, 2025

EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models.

CoRR, May, 2025

SridBench: Benchmark of Scientific Research Illustration Drawing of Image Generation Model.

CoRR, May, 2025

REPA Works Until It Doesn't: Early-Stopped, Holistic Alignment Supercharges Diffusion Training.

CoRR, May, 2025

IA-T2I: Internet-Augmented Text-to-Image Generation.

CoRR, May, 2025

DD-Ranking: Rethinking the Evaluation of Dataset Distillation.

Baharan Mirzasoleiman

Manolis Kellis

Konstantinos N. Plataniotis

CoRR, May, 2025

Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans.

CoRR, May, 2025

AI Idea Bench 2025: AI Research Idea Generation Benchmark.

CoRR, April, 2025

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models.

CoRR, April, 2025

MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models.

CoRR, April, 2025

LVLM-EHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models.

IEEE Trans. Pattern Anal. Mach. Intell., March, 2025

Improving Autoregressive Image Generation through Coarse-to-Fine Token Prediction.

Ziyao Guo

Kaipeng Zhang

Michael Qizhe Shieh

CoRR, March, 2025

Think or Not Think: A Study of Explicit Thinking in Rule-Based Visual Reinforcement Fine-Tuning.

CoRR, March, 2025

PEBench: A Fictitious Dataset to Benchmark Machine Unlearning for Multimodal Large Language Models.

CoRR, March, 2025

MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification.

CoRR, March, 2025

Neighboring Autoregressive Modeling for Efficient Visual Generation.

CoRR, March, 2025

MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning.

CoRR, March, 2025

ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges.

CoRR, March, 2025

ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy.

CoRR, March, 2025

Enhance-A-Video: Better Generated Video for Free.

CoRR, February, 2025

LiT: Delving into a Simplified Linear Diffusion Transformer for Image Generation.

CoRR, January, 2025

B-AVIBench: Toward Evaluating the Robustness of Large Vision-Language Model on Black-Box Adversarial Visual-Instructions.

IEEE Trans. Inf. Forensics Secur., 2025

TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts.

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, 2025

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

ZipAR: Parallel Autoregressive Image Generation through Spatial Locality.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

SAMRefiner: Taming Segment Anything Model for Universal Mask Refinement.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification.

Proceedings of the Findings of the Association for Computational Linguistics, 2025

EfficientQAT: Efficient Quantization-Aware Training for Large Language Models.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

2024

Open-Vocabulary Animal Keypoint Detection with Semantic-Feature Matching.

Int. J. Comput. Vis., December, 2024

HF-HRNet: A Simple Hardware Friendly High-Resolution Network.

IEEE Trans. Circuits Syst. Video Technol., August, 2024

Semantic Image Segmentation by Dynamic Discriminative Prototypes.

Kaipeng Zhang

Yoichi Sato

IEEE Trans. Multim., 2024

FMGNet: An efficient feature-multiplex group network for real-time vision task.

Pattern Recognit., 2024

ZipAR: Accelerating Auto-regressive Image Generation through Spatial Locality.

CoRR, 2024

GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation.

CoRR, 2024

Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping.

CoRR, 2024

ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression.

CoRR, 2024

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation.

CoRR, 2024

HRVMamba: High-Resolution Visual State Space Model for Dense Prediction.

CoRR, 2024

Prioritize Alignment in Dataset Distillation.

Konstantinos N. Plataniotis

Kai Wang

Yang You

CoRR, 2024

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models.

CoRR, 2024

Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model.

CoRR, 2024

EfficientQAT: Efficient Quantization-Aware Training for Large Language Models.

CoRR, 2024

Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT.

CoRR, 2024

PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models.

CoRR, 2024

GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices.

CoRR, 2024

UDKAG: Augmenting Large Vision-Language Models with Up-to-Date Knowledge.

CoRR, 2024

Adapting LLaMA Decoder to Vision Transformer.

CoRR, 2024

ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Capability for Large Vision-Language Models.

CoRR, 2024

AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Adversarial Visual-Instructions.

CoRR, 2024

Towards Implicit Prompt For Text-To-Image Models.

CoRR, 2024

RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation.

CoRR, 2024

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models.

CoRR, 2024

Two Trades is not Baffled: Condensing Graph via Crafting Rational Gradient Matching.

CoRR, 2024

ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning.

CoRR, 2024

Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Needle In A Multimodal Haystack.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Ablation Capability for Large Vision-Language Models.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up-to-Date Internet Knowledge.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

T3M: Text Guided 3D Human Motion Synthesis from Speech.

Wenshuo Peng

Kaipeng Zhang

Sai Qian Zhang

Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, 2024

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

Position: Towards Implicit Prompt For Text-To-Image Models.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Towards Lossless Dataset Distillation via Difficulty-Aligned Trajectory Matching.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Align, Adapt and Inject: Audio-Guided Image Generation, Editing and Stylization.

Proceedings of the IEEE International Conference on Acoustics, 2024

DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

OneLLM: One Framework to Align All Modalities with Language.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning.

Proceedings of the Findings of the Association for Computational Linguistics, 2024

Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification.

Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP without Training.

Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023

Toward High-quality Face-Mask Occluded Restoration.

ACM Trans. Multim. Comput. Commun. Appl., January, 2023

MLLMs-Augmented Visual-Language Representation Learning.

CoRR, 2023

DREAM+: Efficient Dataset Distillation by Bidirectional Representative Matching.

CoRR, 2023

Towards Unified and Effective Domain Generalization.

CoRR, 2023

Language-driven Open-Vocabulary Keypoint Detection for Animal Body and Face.

CoRR, 2023

ImageBind-LLM: Multi-modality Instruction Tuning.

CoRR, 2023

Tiny LVLM-eHub: Early Multimodal Experiments with Bard.

CoRR, 2023

Meta-Transformer: A Unified Framework for Multimodal Learning.

CoRR, 2023

Align, Adapt and Inject: Sound-guided Unified Image Generation.

CoRR, 2023

Foundation Model is Efficient Multimodal Multitask Model Selector.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

RaMLP: Vision MLP via Region-aware Mixing.

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023

DiffRate: Differentiable Compression Rate for Efficient Vision Transformers.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

2021

Neural Routing by Memory.

Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

2020

A Dual-Thread Method for Time-Optimal Trajectory Planning in Joint Space Based on Improved NGA.

Kaipeng Zhang

Ning Liu

Gao Wang

J. Robotics, 2020

FarSee-Net: Real-Time Semantic Segmentation by Efficient Multi-scale Context Aggregation and Feature Space Super-resolution.

Zhanpeng Zhang

Kaipeng Zhang

Proceedings of the 2020 IEEE International Conference on Robotics and Automation, 2020

2019

A Comprehensive Study on Center Loss for Deep Face Recognition.

Int. J. Comput. Vis., 2019

Bootstrap Model Ensemble and Rank Loss for Engagement Intensity Regression.

Proceedings of the International Conference on Multimodal Interaction, 2019

Exploring Regularizations with Face, Body and Image Cues for Group Cohesion Prediction.

Proceedings of the International Conference on Multimodal Interaction, 2019

2018

Cascade Attention Networks For Group Emotion Recognition with Face, Body and Image Cues.

Proceedings of the 2018 on International Conference on Multimodal Interaction, 2018

Super-Identity Convolutional Neural Network for Face Hallucination.

Proceedings of the Computer Vision - ECCV 2018, 2018

Deep Disguised Faces Recognition.

Kaipeng Zhang

Ya-Liang Chang

Winston H. Hsu

Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018

Attribute Augmented Convolutional Neural Network for Face Hallucination.

Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018

PIVTONS: Pose Invariant Virtual Try-On Shoe with Conditional Image Completion.

Proceedings of the Computer Vision - ACCV 2018, 2018

2017

Group emotion recognition with individual facial emotion CNNs and global image based CNNs.

Proceedings of the 19th ACM International Conference on Multimodal Interaction, 2017

Detecting Faces Using Inside Cascaded Contextual CNN.

Proceedings of the IEEE International Conference on Computer Vision, 2017

2016

Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks.

IEEE Signal Process. Lett., 2016

Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks.

CoRR, 2016

A Discriminative Feature Learning Approach for Deep Face Recognition.

Proceedings of the Computer Vision - ECCV 2016, 2016

Gender and Smile Classification Using Deep Convolutional Neural Networks.

Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016
