Jiaqi Wang

ORCID: 0000-0001-6877-5353

Affiliations:
  • Shanghai Artificial Intelligence Laboratory, China
  • Chinese University of Hong Kong, Multimedia Laboratory, Hong Kong (PhD)


According to our database, Jiaqi Wang authored at least 113 papers between 2018 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience.
CoRR, August, 2025

Beyond Fixed: Training-Free Variable-Length Denoising for Diffusion Large Language Models.
CoRR, August, 2025

Language-Aware Vision Transformer for Referring Segmentation.
IEEE Trans. Pattern Anal. Mach. Intell., July, 2025

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction.
CoRR, July, 2025

ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing.
CoRR, June, 2025

VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning.
CoRR, June, 2025

Active Learning via Vision-Language Model Adaptation with Open Data.
CoRR, June, 2025

Visual Agentic Reinforcement Fine-Tuning.
CoRR, May, 2025

NeuroGen: Neural Network Parameter Generation via Large Language Models.
CoRR, May, 2025

MM-IFEngine: Towards Multimodal Instruction Following.
CoRR, April, 2025

HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance.
CoRR, April, 2025

Unified Reward Model for Multimodal Understanding and Generation.
CoRR, March, 2025

Visual-RFT: Visual Reinforcement Fine-Tuning.
CoRR, March, 2025

OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference.
CoRR, February, 2025

SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation.
CoRR, February, 2025

Light-A-Video: Training-free Video Relighting via Progressive Light Fusion.
CoRR, February, 2025

VideoRoPE: What Makes for Good Video Rotary Position Embedding?
CoRR, February, 2025

RelightVid: Temporal-Consistent Diffusion Model for Video Relighting.
CoRR, January, 2025

BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning.
CoRR, January, 2025

GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models.
CoRR, January, 2025

MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

MotionClone: Training-Free Motion Cloning for Controllable Video Generation.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Conical Visual Concentration for Efficient Large Vision-Language Models.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model.
Proceedings of the Findings of the Association for Computational Linguistics, 2025

Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings.
Proceedings of the Findings of the Association for Computational Linguistics, 2025

SongComposer: A Large Language Model for Lyric and Melody Generation in Song Composition.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

Utilize the Flow Before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning.
Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

2024
ConDA: state-based data augmentation for context-dependent text-to-SQL.
Int. J. Mach. Learn. Cybern., August, 2024

Prediction model of radiotherapy outcome for Ocular Adnexal Lymphoma using informative features selected by chemometric algorithms.
Comput. Biol. Medicine, March, 2024

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions.
CoRR, 2024

SimC3D: A Simple Contrastive 3D Pretraining Framework Using RGB Images.
CoRR, 2024

X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models.
CoRR, 2024

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction.
CoRR, 2024

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree.
CoRR, 2024

Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate.
CoRR, 2024

BroadWay: Boost Your Text-to-Video Generation Model in a Training-free Way.
CoRR, 2024

Tailor3D: Customized 3D Assets Editing and Generation with Dual-Side Images.
CoRR, 2024

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output.
CoRR, 2024

V3Det Challenge 2024 on Vast Vocabulary and Open Vocabulary Object Detection: Methods and Results.
CoRR, 2024

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions.
CoRR, 2024

Bootstrap3D: Improving 3D Content Creation with Synthetic Data.
CoRR, 2024

ReasonPix2Pix: Instruction Reasoning Dataset for Advanced Image Editing.
CoRR, 2024

Make-it-Real: Unleashing Large Multimodal Model's Ability for Painting 3D Objects with Realistic Materials.
CoRR, 2024

Unified Scene Representation and Reconstruction for 3D Large Language Models.
CoRR, 2024

Are We on the Right Way for Evaluating Large Vision-Language Models?
CoRR, 2024

InternLM2 Technical Report.
CoRR, 2024

RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition.
CoRR, 2024

SongComposer: A Large Language Model for Lyric and Melody Composition in Song Generation.
CoRR, 2024

DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models.
CoRR, 2024

SepRep-Net: Multi-source Free Domain Adaptation via Model Separation And Reparameterization.
CoRR, 2024

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model.
CoRR, 2024

How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites.
Sci. China Inf. Sci., 2024

FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Streaming Long Video Understanding with Large Language Models.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

MMLONGBENCH-DOC: Benchmarking Long-context Document Understanding with Visualizations.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Make-it-Real: Unleashing Large Multimodal Model for Painting 3D Objects with Realistic Materials.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Are We on the Right Way for Evaluating Large Vision-Language Models?
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

VLMEvalKit: An Open-Source ToolKit for Evaluating Large Multi-Modality Models.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

Long-CLIP: Unlocking the Long-Text Capability of CLIP.
Proceedings of the Computer Vision - ECCV 2024, 2024

Adversarial Prompt Tuning for Vision-Language Models.
Proceedings of the Computer Vision - ECCV 2024, 2024

MMBench: Is Your Multi-modal Model an All-Around Player?
Proceedings of the Computer Vision - ECCV 2024, 2024

ShareGPT4V: Improving Large Multi-modal Models with Better Captions.
Proceedings of the Computer Vision - ECCV 2024, 2024

GPT4Point: A Unified Framework for Point-Language Understanding and Generation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

OneLLM: One Framework to Align All Modalities with Language.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Alpha-CLIP: A CLIP Model Focusing on Wherever you Want.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Enhancing EEG-to-Text Decoding through Transferable Representations from Pre-trained Contrastive EEG-Text Masked Autoencoder.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

VIGC: Visual Instruction Generation and Correction.
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

OCBEV: Object-Centric BEV Transformer for Multi-View 3D Object Detection.
Proceedings of the International Conference on 3D Vision, 2024

2023
Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases.
CoRR, 2023

Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization.
CoRR, 2023

InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition.
CoRR, 2023

MLLM-DataEngine: An Iterative Refinement Approach for MLLM.
CoRR, 2023

WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models.
CoRR, 2023

OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation.
CoRR, 2023

HyperDreamer: Hyper-Realistic 3D Content Generation and Editing from a Single Image.
Proceedings of the SIGGRAPH Asia 2023 Conference Papers, 2023

Zero-shot Skeleton-based Action Recognition via Mutual Information Estimation and Maximization.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers.
Proceedings of the International Conference on Machine Learning, 2023

Voxurf: Voxel-based Efficient and Accurate Neural Surface Reconstruction.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

V3Det: Vast Vocabulary Visual Detection Dataset.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Dense Distinct Query for End-to-End Object Detection.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Multi-Level Logit Distillation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

BUOL: A Bottom-Up Framework with Occupancy-Aware Lifting for Panoptic 3D Scene Reconstruction From a Single Image.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Self-Supervised Action Representation Learning from Partial Spatio-Temporal Skeleton Sequences.
Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation.
Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022
CARAFE++: Unified Content-Aware ReAssembly of FEatures.
IEEE Trans. Pattern Anal. Mach. Intell., 2022

DG-STGCN: Dynamic Spatial-Temporal Modeling for Skeleton-based Action Recognition.
CoRR, 2022

What Are Expected Queries in End-to-End Object Detection?
CoRR, 2022

MINI: Mining Implicit Novel Instances for Few-Shot Object Detection.
CoRR, 2022

Semi-Supervised Semantic Segmentation via Gentle Teaching Assistant.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

PYSKL: Towards Good Practices for Skeleton Action Recognition.
Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022

LAVT: Language-Aware Vision Transformer for Referring Image Segmentation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021
Texture Memory-Augmented Deep Patch-Based Image Inpainting.
IEEE Trans. Image Process., 2021

Few-Shot Object Detection via Association and DIscrimination.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

MMFashion: An Open-Source Toolbox for Visual Fashion Analysis.
Proceedings of the MM '21: ACM Multimedia Conference, Virtual Event, China, October 20, 2021

Seesaw Loss for Long-Tailed Instance Segmentation.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020
Side-Aware Boundary Localization for More Precise Object Detection.
Proceedings of the Computer Vision - ECCV 2020, 2020

2019
MMDetection: Open MMLab Detection Toolbox and Benchmark.
CoRR, 2019

CARAFE: Content-Aware ReAssembly of FEatures.
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

Region Proposal by Guided Anchoring.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

Hybrid Task Cascade for Instance Segmentation.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

2018
Optimizing Video Object Detection via a Scale-Time Lattice.
Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018

