Jiaqi Wang

Orcid: 0000-0001-6877-5353

Affiliations:

Shanghai Artificial Intelligence Laboratory, China
Chinese University of Hong Kong, Multimedia Laboratory, Hong Kong (PhD)

According to our database, Jiaqi Wang authored at least 127 papers between 2018 and 2025.

Collaborative distances:

Dijkstra number of four.
Erdős number of four.

Bibliography

2025

SS4D: Native 4D Generative Model via Structured Spacetime Latents.

ACM Trans. Graph., December, 2025

Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning.

CoRR, October, 2025

STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence.

CoRR, October, 2025

G²RPO: Granular GRPO for Precise Reward in Flow Models.

CoRR, October, 2025

2nd Place Report of MOSEv2 Challenge 2025: Concept Guided Video Object Segmentation via SeC.

CoRR, September, 2025

CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning.

CoRR, September, 2025

SPARK: Synergistic Policy And Reward Co-Evolving Framework.

CoRR, September, 2025

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing.

CoRR, September, 2025

SIM-CoT: Supervised Implicit Chain-of-Thought.

CoRR, September, 2025

CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning.

CoRR, August, 2025

DiCache: Let Diffusion Model Determine Its Own Cache.

CoRR, August, 2025

SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience.

CoRR, August, 2025

Beyond Fixed: Training-Free Variable-Length Denoising for Diffusion Large Language Models.

CoRR, August, 2025

Language-Aware Vision Transformer for Referring Segmentation.

IEEE Trans. Pattern Anal. Mach. Intell., July, 2025

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction.

CoRR, July, 2025

ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing.

CoRR, June, 2025

VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning.

CoRR, June, 2025

Active Learning via Vision-Language Model Adaptation with Open Data.

Tong Wang

Jiaqi Wang

Shu Kong

CoRR, June, 2025

Visual Agentic Reinforcement Fine-Tuning.

CoRR, May, 2025

NeuroGen: Neural Network Parameter Generation via Large Language Models.

Jiaqi Wang

Yusen Zhang

Xi Li

CoRR, May, 2025

MM-IFEngine: Towards Multimodal Instruction Following.

CoRR, April, 2025

HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance.

CoRR, April, 2025

Unified Reward Model for Multimodal Understanding and Generation.

CoRR, March, 2025

Visual-RFT: Visual Reinforcement Fine-Tuning.

CoRR, March, 2025

OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference.

CoRR, February, 2025

Light-A-Video: Training-free Video Relighting via Progressive Light Fusion.

CoRR, February, 2025

RelightVid: Temporal-Consistent Diffusion Model for Video Relighting.

CoRR, January, 2025

BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning.

CoRR, January, 2025

GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models.

CoRR, January, 2025

VideoRoPE: What Makes for Good Video Rotary Position Embedding?

Proceedings of the Forty-second International Conference on Machine Learning, 2025

SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

MLLM-DataEngine: Closing the Loop of Multimodal Instruction Tuning Data Generation.

Proceedings of the IEEE International Conference on Multimedia and Expo, 2025

Beyond Multimodal Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization.

Proceedings of the IEEE International Conference on Multimedia and Expo, 2025

MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

MotionClone: Training-Free Motion Cloning for Controllable Video Generation.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Conical Visual Concentration for Efficient Large Vision-Language Models.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference.

Maosong Cao

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model.

Proceedings of the Findings of the Association for Computational Linguistics, 2025

Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings.

Proceedings of the Findings of the Association for Computational Linguistics, 2025

SongComposer: A Large Language Model for Lyric and Melody Generation in Song Composition.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

Utilize the Flow Before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning.

Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

2024

ConDA: state-based data augmentation for context-dependent text-to-SQL.

Int. J. Mach. Learn. Cybern., August, 2024

Prediction model of radiotherapy outcome for Ocular Adnexal Lymphoma using informative features selected by chemometric algorithms.

Comput. Biol. Medicine, March, 2024

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions.

CoRR, 2024

SimC3D: A Simple Contrastive 3D Pretraining Framework Using RGB Images.

CoRR, 2024

X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models.

CoRR, 2024

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction.

CoRR, 2024

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree.

CoRR, 2024

Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate.

CoRR, 2024

BroadWay: Boost Your Text-to-Video Generation Model in a Training-free Way.

CoRR, 2024

Tailor3D: Customized 3D Assets Editing and Generation with Dual-Side Images.

CoRR, 2024

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output.

CoRR, 2024

V3Det Challenge 2024 on Vast Vocabulary and Open Vocabulary Object Detection: Methods and Results.

CoRR, 2024

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions.

CoRR, 2024

Bootstrap3D: Improving 3D Content Creation with Synthetic Data.

CoRR, 2024

ReasonPix2Pix: Instruction Reasoning Dataset for Advanced Image Editing.

CoRR, 2024

Make-it-Real: Unleashing Large Multimodal Model's Ability for Painting 3D Objects with Realistic Materials.

CoRR, 2024

Unified Scene Representation and Reconstruction for 3D Large Language Models.

CoRR, 2024

Are We on the Right Way for Evaluating Large Vision-Language Models?

CoRR, 2024

InternLM2 Technical Report.

et al.

CoRR, 2024

RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition.

CoRR, 2024

SongComposer: A Large Language Model for Lyric and Melody Composition in Song Generation.

CoRR, 2024

DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models.

CoRR, 2024

SepRep-Net: Multi-source Free Domain Adaptation via Model Separation And Reparameterization.

Ying Jin

Jiaqi Wang

Dahua Lin

CoRR, 2024

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model.

CoRR, 2024

How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites.

Sci. China Inf. Sci., 2024

FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Streaming Long Video Understanding with Large Language Models.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

MMLONGBENCH-DOC: Benchmarking Long-context Document Understanding with Visualizations.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Make-it-Real: Unleashing Large Multimodal Model for Painting 3D Objects with Realistic Materials.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Are We on the Right Way for Evaluating Large Vision-Language Models?

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

VLMEvalKit: An Open-Source ToolKit for Evaluating Large Multi-Modality Models.

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

Long-CLIP: Unlocking the Long-Text Capability of CLIP.

Proceedings of the Computer Vision - ECCV 2024, 2024

Adversarial Prompt Tuning for Vision-Language Models.

Proceedings of the Computer Vision - ECCV 2024, 2024

MMBench: Is Your Multi-modal Model an All-Around Player?

Proceedings of the Computer Vision - ECCV 2024, 2024

ShareGPT4V: Improving Large Multi-modal Models with Better Captions.

Proceedings of the Computer Vision - ECCV 2024, 2024

GPT4Point: A Unified Framework for Point-Language Understanding and Generation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

OneLLM: One Framework to Align All Modalities with Language.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Alpha-CLIP: A CLIP Model Focusing on Wherever you Want.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Enhancing EEG-to-Text Decoding through Transferable Representations from Pre-trained Contrastive EEG-Text Masked Autoencoder.

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

VIGC: Visual Instruction Generation and Correction.

Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

OCBEV: Object-Centric BEV Transformer for Multi-View 3D Object Detection.

Proceedings of the International Conference on 3D Vision, 2024

2023

Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases.

CoRR, 2023

Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization.

CoRR, 2023

InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition.

CoRR, 2023

MLLM-DataEngine: An Iterative Refinement Approach for MLLM.

CoRR, 2023

WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models.

CoRR, 2023

OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation.

CoRR, 2023

HyperDreamer: Hyper-Realistic 3D Content Generation and Editing from a Single Image.

Proceedings of the SIGGRAPH Asia 2023 Conference Papers, 2023

Zero-shot Skeleton-based Action Recognition via Mutual Information Estimation and Maximization.

Proceedings of the 31st ACM International Conference on Multimedia, 2023

UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers.

Proceedings of the International Conference on Machine Learning, 2023

Voxurf: Voxel-based Efficient and Accurate Neural Surface Reconstruction.

Proceedings of the Eleventh International Conference on Learning Representations, 2023

V3Det: Vast Vocabulary Visual Detection Dataset.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Dense Distinct Query for End-to-End Object Detection.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Multi-Level Logit Distillation.

Ying Jin

Jiaqi Wang

Dahua Lin

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

BUOL: A Bottom-Up Framework with Occupancy-Aware Lifting for Panoptic 3D Scene Reconstruction From a Single Image.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Self-Supervised Action Representation Learning from Partial Spatio-Temporal Skeleton Sequences.

Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation.

Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022

CARAFE++: Unified Content-Aware ReAssembly of FEatures.

IEEE Trans. Pattern Anal. Mach. Intell., 2022

DG-STGCN: Dynamic Spatial-Temporal Modeling for Skeleton-based Action Recognition.

CoRR, 2022

What Are Expected Queries in End-to-End Object Detection?

CoRR, 2022

MINI: Mining Implicit Novel Instances for Few-Shot Object Detection.

CoRR, 2022

Semi-Supervised Semantic Segmentation via Gentle Teaching Assistant.

Ying Jin

Jiaqi Wang

Dahua Lin

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

PYSKL: Towards Good Practices for Skeleton Action Recognition.

Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022

LAVT: Language-Aware Vision Transformer for Referring Image Segmentation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021

Texture Memory-Augmented Deep Patch-Based Image Inpainting.

IEEE Trans. Image Process., 2021

Few-Shot Object Detection via Association and DIscrimination.

Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

MMFashion: An Open-Source Toolbox for Visual Fashion Analysis.

Proceedings of the MM '21: ACM Multimedia Conference, Virtual Event, China, October 20, 2021

Seesaw Loss for Long-Tailed Instance Segmentation.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020

Side-Aware Boundary Localization for More Precise Object Detection.

Proceedings of the Computer Vision - ECCV 2020, 2020

2019

MMDetection: Open MMLab Detection Toolbox and Benchmark.

CoRR, 2019

CARAFE: Content-Aware ReAssembly of FEatures.

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

Region Proposal by Guided Anchoring.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

Hybrid Task Cascade for Instance Segmentation.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

2018

A Height Correction Algorithm Applied in Underwater Photometric Stereo Reconstruction.

Proceedings of the 2018 IEEE International Conference on Signal Processing, 2018

Optimizing Video Object Detection via a Scale-Time Lattice.

Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018
