Yixiao Ge

CoRR, June, 2025

Aligning Latent Spaces with Flow Priors.

CoRR, June, 2025

AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation.

CoRR, June, 2025

HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation.

CoRR, June, 2025

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

CoRR, May, 2025

TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation.

CoRR, May, 2025

Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1.

CoRR, March, 2025

HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding.

CoRR, March, 2025

Equivariant symmetries for inertial navigation systems.

Autom., 2025

Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots.

Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, New Mexico, USA, April 29, 2025

Equivariant Filter Design for Range-Only SLAM.

Proceedings of the IEEE International Conference on Robotics and Automation, 2025

LoRA-Gen: Specializing Large Language Model via Online LoRA Generation.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

SEED-Story: Multimodal Long Story Generation with Large Language Model.

Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV 2025, 2025

Scalable Image Tokenization with Index Backpropagation Quantization.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

VoCo-LLaMA: Towards Vision Compression with Large Language Models.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2024

Structured Domain Adaptation With Online Relation Regularization for Unsupervised Person Re-ID.

IEEE Trans. Neural Networks Learn. Syst., January, 2024

Vision-Language Instruction Tuning: A Review and Analysis.

Trans. Mach. Learn. Res., 2024

A Geometric Perspective on Fusing Gaussian Distributions on Lie Groups.

IEEE Control. Syst. Lett., 2024

DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models.

CoRR, 2024

Moto: Latent Motion Token as the Bridging Language for Robot Manipulation.

CoRR, 2024

Taming Scalable Visual Tokenizer for Autoregressive Image Generation.

CoRR, 2024

PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance.

CoRR, 2024

Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation.

CoRR, 2024

Geometric Data Fusion for Collaborative Attitude Estimation.

CoRR, 2024

VoCo-LLaMA: Towards Vision Compression with Large Language Models.

CoRR, 2024

GrootVL: Tree Topology is All You Need in State Space Model.

CoRR, 2024

SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing.

CoRR, 2024

SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension.

CoRR, 2024

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation.

CoRR, 2024

Supervised Fine-tuning in turn Improves Visual Foundation Models.

CoRR, 2024

Towards A Better Metric for Text-to-Video Generation.

CoRR, 2024

MambaTree: Tree Topology is All You Need in State Space Model.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

An Equivariant Approach to Robust State Estimation for the ArduPilot Autopilot System.

Proceedings of the IEEE International Conference on Robotics and Automation, 2024

Making LLaMA SEE and Draw with SEED Tokenizer.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

ST-LLM: Large Language Models Are Effective Temporal Learners.

Proceedings of the Computer Vision - ECCV 2024, 2024

DreamDiffusion: High-Quality EEG-to-Image Generation with Temporal Masked Signal Modeling and CLIP Alignment.

Proceedings of the Computer Vision - ECCV 2024, 2024

Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

SEED-Bench: Benchmarking Multimodal Large Language Models.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

VIT-LENS: Towards Omni-modal Representations.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

SmartEdit: Exploring Complex Instruction-Based Image Editing with Multimodal Large Language Models.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

YOLO-World: Real-Time Open-Vocabulary Object Detection.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

LLaMA Pro: Progressive LLaMA with Block Expansion.

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

Cached Transformers: Improving Transformers with Differentiable Memory Cache.

Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023

Cached Transformers: Improving Transformers with Differentiable Memory Cache.

CoRR, 2023

VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation.

CoRR, 2023

EgoPlan-Bench: Benchmarking Egocentric Embodied Planning with Multimodal Large Language Models.

CoRR, 2023

SEED-Bench-2: Benchmarking Multimodal Large Language Models.

CoRR, 2023

ViT-Lens-2: Gateway to Omni-modal Intelligence.

CoRR, 2023

One For All: Video Conversation is Feasible Without Video Instruction Tuning.

CoRR, 2023

ViT-Lens: Towards Omni-modal Representations.

CoRR, 2023

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension.

CoRR, 2023

Planting a SEED of Vision in Large Language Model.

CoRR, 2023

DreamDiffusion: Generating High-Quality Images from Brain EEG Signals.

CoRR, 2023

PTVD: A Large-Scale Plot-Oriented Multimodal Dataset Based on Television Dramas.

CoRR, 2023

TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter.

CoRR, 2023

Sticker820K: Empowering Interactive Retrieval with Stickers.

CoRR, 2023

TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale.

CoRR, 2023

What Makes for Good Visual Tokenizers for Large Language Models?

CoRR, 2023

Attack is Good Augmentation: Towards Skeleton-Contrastive Representation Learning.

CoRR, 2023

TagGPT: Large Language Models are Zero-shot Multimodal Taggers.

CoRR, 2023

Masked Visual Reconstruction in Language Semantic Space.

CoRR, 2023

Modeling Uncertain Feature Representation for Domain Generalization.

CoRR, 2023

GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Meta-Adapter: An Online Few-shot Learner for Vision-Language Model.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Binary Embedding-based Retrieval at Tencent.

Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023

π-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation.

Proceedings of the International Conference on Machine Learning, 2023

Masked Image Modeling with Denoising Contrast.

Proceedings of the Eleventh International Conference on Learning Representations, 2023

BoxSnake: Polygonal Instance Segmentation with Box Supervision.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Exploring Model Transferability through the Lens of Potential Energy.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Learning Transferable Spatiotemporal Representations from Natural Script Knowledge.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

RILS: Masked Visual Reconstruction in Language Semantic Space.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Accelerating Vision-Language Pretraining with Free Language Modeling.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

All in One: Exploring Unified Video-Language Pre-Training.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

A Note on the Extended Kalman Filter on a Manifold.

Proceedings of the 62nd IEEE Conference on Decision and Control, 2023

Darwinian Model Upgrades: Model Evolving with Selective Compatibility.

Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

Video-Text Pre-training with Learned Regions for Retrieval.

Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation.

CoRR, 2022

Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis.

CoRR, 2022

Privacy-Preserving Model Upgrades with Bidirectional Compatible Training in Image Retrieval.

CoRR, 2022

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval.

CoRR, 2022

Revitalize Region Feature for Democratizing Video-Language Pre-training.

CoRR, 2022

All in One: Exploring Unified Video-Language Pre-training.

CoRR, 2022

Hot-Refresh Model Upgrades with Regression-Alleviating Compatible Training in Image Retrieval.

CoRR, 2022

BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions.

CoRR, 2022

Towards Universal Backward-Compatible Representation Learning.

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, 2022

Hot-Refresh Model Upgrades with Regression-Free Compatible Training in Image Retrieval.

Proceedings of the Tenth International Conference on Learning Representations, 2022

Dynamic Token Normalization improves Vision Transformers.

Proceedings of the Tenth International Conference on Learning Representations, 2022

Uncertainty Modeling for Out-of-Distribution Generalization.

Proceedings of the Tenth International Conference on Learning Representations, 2022

Not All Models Are Equal: Predicting Model Transferability in a Self-challenging Fisher Space.

Proceedings of the Computer Vision - ECCV 2022, 2022

mc-BEiT: Multi-choice Discretization for Image BERT Pre-training.

Proceedings of the Computer Vision - ECCV 2022, 2022

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-Text Retrieval.

Proceedings of the Computer Vision - ECCV 2022, 2022

Object-aware Video-language Pre-training for Retrieval.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Bridging Video-text Retrieval with Multiple Choice Questions.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Equivariant Filter Design for Discrete-time Systems.

Proceedings of the 61st IEEE Conference on Decision and Control, 2022

2021

Dynamic Token Normalization Improves Vision Transformer.

CoRR, 2021

Video-Text Pre-training with Learned Regions.

CoRR, 2021

Self-distillation with Batch Knowledge Ensembling Improves ImageNet Classification.

CoRR, 2021

Consensus-Guided Correspondence Denoising.

CoRR, 2021

Online Pseudo Label Generation by Hierarchical Cluster Dynamics for Adaptive Person Re-identification.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Progressive Correspondence Pruning by Consensus Learning.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Refining Pseudo Labels With Clustering Consensus Over Generations for Unsupervised Object Re-Identification.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

Mutual CRF-GNN for Few-Shot Learning.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

DivCo: Diverse Conditional Image Synthesis via Contrastive Generative Adversarial Network.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020

Improved Mutual Mean-Teaching for Unsupervised Domain Adaptive Re-ID.

Shijie Yu

Dapeng Chen

CoRR, 2020

Structured Domain Adaptation for Unsupervised Person Re-identification.

CoRR, 2020

Self-paced Contrastive Learning with Hybrid Memory for Domain Adaptive Object Re-ID.

Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020

Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification.
