Zuxuan Wu

ORCID: 0000-0002-8689-5807

According to our database¹, Zuxuan Wu authored at least 255 papers between 2014 and 2026.

Collaborative distances:

Dijkstra number² of four.
Erdős number³ of three.

Bibliography

2026

LRANet++: Low-Rank Approximation Network for Accurate and Efficient Text Spotting.

IEEE Trans. Pattern Anal. Mach. Intell., May, 2026

CameraNoise: Enabling Faithful Camera Control in Video Diffusion through Geometry-Flow-Guided Noise Warping.

CoRR, May, 2026

VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models.

CoRR, May, 2026

Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization.

CoRR, May, 2026

Channel-wise Vector Quantization.

CoRR, May, 2026

Baton: Explicit Semantic Blueprints for Joint Video-Audio Generation.

CoRR, May, 2026

DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders.

CoRR, May, 2026

Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling.

CoRR, May, 2026

Bench2Drive-Robust: Benchmarking Closed-Loop Autonomous Driving under Deployment Perturbations.

CoRR, May, 2026

DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models.

CoRR, May, 2026

GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization.

CoRR, May, 2026

Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval.

CoRR, May, 2026

Safety in Embodied AI: A Survey of Risks, Attacks, and Defenses.

CoRR, May, 2026

GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models.

CoRR, May, 2026

HazardArena: Evaluating Semantic Safety in Vision-Language-Action Models.

CoRR, April, 2026

CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation.

CoRR, April, 2026

Steering the Verifiability of Multimodal AI Hallucinations.

CoRR, April, 2026

HAD: Combining Hierarchical Diffusion with Metric-Decoupled RL for End-to-End Driving.

CoRR, April, 2026

FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance.

CoRR, March, 2026

WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing.

CoRR, March, 2026

CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization.

CoRR, March, 2026

FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding.

CoRR, March, 2026

Preference Score Distillation: Leveraging 2D Rewards to Align Text-to-3D Generation with Human Preference.

CoRR, March, 2026

Learning Accurate Segmentation Purely from Self-Supervision.

Zuyao You

Zuxuan Wu

Yu-Gang Jiang

CoRR, February, 2026

UniHand: A Unified Model for Diverse Controlled 4D Hand Motion Modeling.

CoRR, February, 2026

DCDM: Divide-and-Conquer Diffusion Models for Consistency-Preserving Video Generation.

CoRR, February, 2026

ArcFlow: Unleashing 2-Step Text-to-Image Generation via High-Precision Non-Linear Flow Distillation.

CoRR, February, 2026

Efficient-LVSM: Faster, Cheaper, and Better Large View Synthesis Model via Decoupled Co-Refinement Attention.

CoRR, February, 2026

CL-bench: A Benchmark for Context Learning.

CoRR, February, 2026

VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding.

CoRR, January, 2026

FRoM-W1: Towards General Humanoid Whole-Body Control with Language Instructions.

CoRR, January, 2026

A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5.

CoRR, January, 2026

VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding.

CoRR, January, 2026

Thinking with Deltas: Incentivizing Reinforcement Learning via Differential Visual Reasoning Policy.

CoRR, January, 2026

LSTD: Long Short-Term Temporal Diffusion for Video Generation.

IEEE Trans. Multim., 2026

DriveSuprim: Towards Precise Trajectory Selection for End-to-End Planning.

Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

Human2Robot: Learning Robot Actions from Paired Human-Robot Videos.

Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

2025

FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction.

CoRR, December, 2025

HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies.

CoRR, December, 2025

DeRA: Decoupled Representation Alignment for Video Tokenization.

CoRR, December, 2025

Stable Offline Hand-Eye Calibration for any Robot with Just One Mark.

CoRR, November, 2025

UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning.

CoRR, November, 2025

TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction.

CoRR, November, 2025

Preserving Cross-Modal Consistency for CLIP-based Class-Incremental Learning.

CoRR, November, 2025

PreferThinker: Reasoning-based Personalized Image Preference Assessment.

CoRR, November, 2025

ZTRS: Zero-Imitation End-to-end Autonomous Driving with Trajectory Scoring.

CoRR, October, 2025

RoboOmni: Proactive Robot Manipulation in Omni-modal Context.

CoRR, October, 2025

COSMO-RL: Towards Trustworthy LMRMs via Joint Safety and Stability.

CoRR, October, 2025

FreezeVLA: Action-Freezing Attacks against Vision-Language-Action Models.

CoRR, September, 2025

Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue.

CoRR, September, 2025

AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning.

CoRR, September, 2025

DiffusionAD: Norm-Guided One-Step Denoising Diffusion for Anomaly Detection.

IEEE Trans. Pattern Anal. Mach. Intell., August, 2025

Repeating Words for Video-Language Retrieval with Coarse-to-Fine Objectives.

CoRR, August, 2025

StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation.

CoRR, August, 2025

A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models.

CoRR, August, 2025

Multimodal Referring Segmentation: A Survey.

CoRR, August, 2025

Multi-Prompt Progressive Alignment for Multi-Source Unsupervised Domain Adaptation.

CoRR, July, 2025

StableAnimator++: Overcoming Pose Misalignment and Face Distortion for Human Image Animation.

CoRR, July, 2025

FreeLoRA: Enabling Training-Free LoRA Fusion for Autoregressive Multi-Subject Personalization.

CoRR, July, 2025

Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning.

Zuyao You

Zuxuan Wu

CoRR, June, 2025

Generalized Trajectory Scoring for End-to-end Multimodal Planning.

CoRR, June, 2025

Seg2Any: Open-set Segmentation-Mask-to-Image Generation with Precise Shape and Semantic Control.

CoRR, June, 2025

CreatiDesign: A Unified Multi-Conditional Diffusion Transformer for Creative Graphic Design.

CoRR, May, 2025

Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities.

Ziwei Zhou

Rui Wang

Zuxuan Wu

CoRR, May, 2025

ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning.

CoRR, May, 2025

OmniTracker: Unifying Visual Object Tracking by Tracking-With-Detection.

IEEE Trans. Pattern Anal. Mach. Intell., April, 2025

Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks.

CoRR, April, 2025

SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL.

CoRR, April, 2025

Aligning Anime Video Generation with Human Feedback.

CoRR, April, 2025

Fighting Malicious Media Data: A Survey on Tampering Detection and Deepfake Detection.

Proc. IEEE, March, 2025

DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation.

CoRR, March, 2025

CoMP: Continual Multimodal Pre-training for Vision Foundation Models.

CoRR, March, 2025

Hydra-MDP++: Advancing End-to-End Driving via Expert-Guided Hydra-Distillation.

CoRR, March, 2025

A Survey on Video Diffusion Models.

ACM Comput. Surv., February, 2025

Human2Robot: Learning Robot Actions from Paired Human-Robot Videos.

CoRR, February, 2025

Safety at Scale: A Comprehensive Survey of Large Model Safety.

CoRR, February, 2025

Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning.

CoRR, January, 2025

FNIN: A Fourier Neural Operator-based Numerical Integration Network for Surface-form-gradients.

CoRR, January, 2025

The Role of ViT Design and Training in Robustness to Common Corruptions.

IEEE Trans. Multim., 2025

BMB: Balanced Memory Bank for Long-Tailed Semi-Supervised Learning.

IEEE Trans. Multim., 2025

Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety.

Found. Trends Priv. Secur., 2025

UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

ForgerySleuth: Empowering Multimodal Large Language Models for Image Manipulation Detection.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

Adaptive Retention & Correction: Test-Time Training for Continual Learning.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Rethinking Discrete Tokens: Treating Them as Conditions for Continuous Autoregressive Image Synthesis.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Aid: Adapting Image2video Diffusion Models for Instruction-Guided Video Prediction.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

MotionFollower: Editing Video Motion via Score-Guided Diffusion.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

REDUCIO! Generating 1K Video Within 16 Seconds Using Extremely Compressed Motion Latents.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Hydra-NeXt: Robust Closed-Loop Driving with Open-Loop Training.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Achieving More with Less: Additive Prompt Tuning for Rehearsal-Free Class-Incremental Learning.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Advancing Dark Action Recognition via Modality Fusion and Dark-to-Light Diffusion Model.

Yuxuan Wang

Zhen Xing

Zuxuan Wu

Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

ProLongVid: A Simple but Strong Baseline for Long-context Video Instruction Tuning.

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

BlockDance: Reuse Structurally Similar Spatio-Temporal Features to Accelerate Diffusion Transformers.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

StableAnimator: High-Quality Identity-Preserving Human Image Animation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

AgentGym: Evaluating and Training Large Language Model-based Agents across Diverse Environments.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

AdaDiff: Adaptive Step Selection for Fast Diffusion Models.

Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, 2025

FOCUS: Towards Universal Foreground Segmentation.

Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, 2025

FNIN: A Fourier Neural Operator-based Numerical Integration Network for Surface-from-gradients.

Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, 2025

Comprehensive Multi-Modal Prototypes Are Simple and Effective Classifiers for Vast-Vocabulary Object Detection.

Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, 2025

2024

Adaptive Cross-Modal Transferable Adversarial Attacks From Images to Videos.

IEEE Trans. Pattern Anal. Mach. Intell., May, 2024

HCMS: Hierarchical and Conditional Modality Selection for Efficient Video Recognition.

ACM Trans. Multim. Comput. Commun. Appl., February, 2024

Building an Open-Vocabulary Video CLIP Model With Better Architectures, Optimization and Data.

IEEE Trans. Pattern Anal. Mach. Intell., 2024

CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation.

CoRR, 2024

Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning.

CoRR, 2024

Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision.

CoRR, 2024

REDUCIO! Generating 1024⨉1024 Video within 16 Seconds using Extremely Compressed Motion Latents.

CoRR, 2024

Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders.

CoRR, 2024

Downstream Transfer Attack: Adversarial Attacks on Downstream Models with Pre-trained Vision Transformers.

CoRR, 2024

V3Det Challenge 2024 on Vast Vocabulary and Open Vocabulary Object Detection: Methods and Results.

CoRR, 2024

AutoTVG: A New Vision-language Pre-training Paradigm for Temporal Video Grounding.

CoRR, 2024

Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation.

CoRR, 2024

AgentGym: Evolving Large Language Model-based Agents across Diverse Environments.

CoRR, 2024

MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion.

CoRR, 2024

Adaptive Rentention & Correction for Continual Learning.

CoRR, 2024

PoseAnimate: Zero-shot high fidelity pose controllable character animation.

CoRR, 2024

FDGaussian: Fast Gaussian Splatting from Single Image via Geometric-aware Diffusion Model.

CoRR, 2024

MouSi: Poly-Visual-Expert Vision-Language Models.

CoRR, 2024

Secrets of RLHF in Large Language Models Part II: Reward Modeling.

CoRR, 2024

Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

GenRec: Unifying Video Generation and Recognition with Diffusion Models.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

Fuse Your Latents: Video Editing with Multi-source Latent Diffusion Models.

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

ModelLock: Locking Your Model With a Spell.

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Zero-shot High-fidelity and Pose-controllable Character Animation.

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024

MagDiff: Multi-alignment Diffusion for High-Fidelity Video Generation and Editing.

Proceedings of the Computer Vision - ECCV 2024, 2024

DreamMesh: Jointly Manipulating and Texturing Triangle Meshes for Text-to-3D Generation.

Proceedings of the Computer Vision - ECCV 2024, 2024

SegIC: Unleashing the Emergent Correspondence for In-Context Segmentation.

Proceedings of the Computer Vision - ECCV 2024, 2024

PromptFusion: Decoupling Stability and Plasticity for Continual Learning.

Proceedings of the Computer Vision - ECCV 2024, 2024

SimDA: Simple Diffusion Adapter for Efficient Video Generation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

OmniViD: A Generative Framework for Universal Video Understanding.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

MotionEditor: Editing Video Motion via Content-Aware Diffusion.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Learning to Rank Patches for Unbiased Image Redundancy Reduction.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023

Cross-Domain Contrastive Learning for Unsupervised Domain Adaptation.

IEEE Trans. Multim., 2023

FT-TDR: Frequency-Guided Transformer and Top-Down Refinement Network for Blind Face Inpainting.

IEEE Trans. Multim., 2023

Self-Supervised Learning for Semi-Supervised Temporal Language Grounding.

IEEE Trans. Multim., 2023

Towards Transferable Adversarial Attacks on Image and Video Transformers.

IEEE Trans. Image Process., 2023

Multimodal Pre-training Method for Vision-language Understanding and Generation.

Int. J. Softw. Informatics, 2023

VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models.

CoRR, 2023

VideoAssembler: Identity-Consistent Video Generation with Reference Entities using Diffusion Model.

CoRR, 2023

AdaDiff: Adaptive Step Selection for Fast Diffusion.

CoRR, 2023

To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning.

CoRR, 2023

Fuse Your Latents: Video Editing with Multi-source Latent Diffusion Models.

CoRR, 2023

Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation.

CoRR, 2023

Prompting Large Language Models to Reformulate Queries for Moment Localization.

CoRR, 2023

BMB: Balanced Memory Bank for Imbalanced Semi-supervised Learning.

CoRR, 2023

ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System.

CoRR, 2023

OmniTracker: Unifying Object Tracking by Tracking-with-Detection.

CoRR, 2023

DiffusionAD: Denoising Diffusion for Anomaly Detection.

CoRR, 2023

Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization.

CoRR, 2023

Learning from Rich Semantics and Coarse Locations for Long-tailed Object Detection.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Multi-Prompt Alignment for Multi-Source Unsupervised Domain Adaptation.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

On the Importance of Spatial Relations for Few-shot Action Recognition.

Proceedings of the 31st ACM International Conference on Multimedia, 2023

GCMA: Generative Cross-Modal Transferable Adversarial Attacks from Images to Videos.

Proceedings of the 31st ACM International Conference on Multimedia, 2023

Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization.

Proceedings of the International Conference on Machine Learning, 2023

Downstream Task-agnostic Transferable Attacks on Language-Image Pre-training Models.

Proceedings of the IEEE International Conference on Multimedia and Expo, 2023

Implicit Temporal Modeling with Learnable Alignment for Video Recognition.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

SVFormer: Semi-supervised Video Transformer for Action Recognition.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Enhancing the Self-Universality for Transferable Targeted Attacks.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Look Before You Match: Instance Understanding Matters in Video Object Segmentation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

ResFormer: Scaling ViTs with Multi-Resolution Training.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Vision Transformers are Good Mask Auto-Labelers.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Prototypical Residual Networks for Anomaly Detection and Localization.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Towards Scalable Neural Representation for Diverse Videos.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Resolving Task Confusion in Dynamic Expansion Architectures for Class Incremental Learning.

Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022

SAM: Modeling Scene, Object and Action With Semantics Attention Modules for Video Recognition.

Xing Zhang

Zuxuan Wu

Yu-Gang Jiang

IEEE Trans. Multim., 2022

Spatial-Temporal Graphs for Cross-Modal Text2Video Retrieval.

IEEE Trans. Multim., 2022

A Dynamic Frame Selection Framework for Fast Video Recognition.

IEEE Trans. Pattern Anal. Mach. Intell., 2022

Multi-Prompt Alignment for Multi-source Unsupervised Domain Adaptation.

Haoran Chen

Zuxuan Wu

Yu-Gang Jiang

CoRR, 2022

Incorporating Locality of Images to Generate Targeted Transferable Adversarial Examples.

CoRR, 2022

Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling.

CoRR, 2022

Deeper Insights into ViTs Robustness towards Common Corruptions.

CoRR, 2022

M3DETR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers.

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022

OmniVL: One Foundation Model for Image-Language and Video-Language Tasks.

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection.

Proceedings of the ICMR '22: International Conference on Multimedia Retrieval, Newark, NJ, USA, June 27, 2022

Semi-supervised Single-View 3D Reconstruction via Prototype Shape Priors.

Proceedings of the Computer Vision - ECCV 2022, 2022

Semi-supervised Vision Transformers.

Proceedings of the Computer Vision - ECCV 2022, 2022

Efficient Video Transformers with Spatial-Temporal Token Selection.

Proceedings of the Computer Vision - ECCV 2022, 2022

Cross-Modal Transferable Adversarial Attacks from Images to Videos.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

ObjectFormer for Image Manipulation Detection and Localization.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

BEVT: BERT Pretraining of Video Transformers.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

AdaViT: Adaptive Vision Transformers for Efficient Image Recognition.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Robust Optimization as Data Augmentation for Large-scale Graphs.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Boosting the Transferability of Video Adversarial Examples via Temporal Translation.

Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022

Towards Transferable Adversarial Attacks on Vision Transformers.

Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022

Rethinking Pseudo Labels for Semi-supervised Object Detection.

Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022

Attacking Video Recognition Models with Bullet-Screen Comments.

Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022

2021

A Coarse-to-Fine Framework for Resource Efficient Video Recognition.

Int. J. Comput. Vis., 2021

Rethinking Nearest Neighbors for Visual Classification.

CoRR, 2021

Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation.

CoRR, 2021

Efficient Video Transformers with Spatial-Temporal Token Selection.

CoRR, 2021

M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection.

CoRR, 2021

HMS: Hierarchical Modality Selection for Efficient Video Recognition.

CoRR, 2021

THAT: Two Head Adversarial Training for Improving Robustness at Scale.

CoRR, 2021

Encoding Robustness to Image Style via Adversarial Feature Perturbations.

Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

A Multimodal Framework for Video Ads Understanding.

Proceedings of the MM '21: ACM Multimedia Conference, Virtual Event, China, October 20, 2021

VideoLT: Large-scale Long-tailed Video Recognition.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Exploring Visual Engagement Signals for Representation Learning.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Recognition.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

Intentonomy: A Dataset and Study Towards Human Intent Understanding.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

Efficient Object Embedding for Spliced Image Retrieval.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

GTA: Global Temporal Attention for Video Action Understanding.

Proceedings of the 32nd British Machine Vision Conference 2021, 2021

Deep Video Inpainting Detection.

Proceedings of the 32nd British Machine Vision Conference 2021, 2021

2020

Image and video Understanding with constrained Resources.

Zuxuan Wu

PhD thesis, 2020

FLAG: Adversarial Data Augmentation for Graph Neural Networks.

CoRR, 2020

Prepare for the Worst: Generalizing across Domain Shifts with Adversarial Batch Normalization.

CoRR, 2020

Making an Invisibility Cloak: Real World Adversarial Attacks on Object Detectors.

Proceedings of the Computer Vision - ECCV 2020, 2020

Learning From Noisy Anchors for One-Stage Object Detection.

Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

M2KD: Incremental Learning via Multi-model and Multi-level Knowledge Distillation.

Proceedings of the 31st British Machine Vision Conference 2020, 2020

Recognizing Instagram Filtered Images with Feature De-Stylization.

[BibT_eX]

[DOI]

Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020

2019

Visual Content Recognition by Exploiting Semantic Feature Map with Attention and Multi-task Learning.

[BibT_eX]

[DOI]

ACM Trans. Multim. Comput. Commun. Appl., 2019

An Analysis of Pre-Training on Object Detection.

[BibT_eX]

[DOI]

CoRR, 2019

M2KD: Multi-model and Multi-level Knowledge Distillation for Incremental Learning.

[BibT_eX]

[DOI]

CoRR, 2019

Compatible and Diverse Fashion Image Inpainting.

[BibT_eX]

[DOI]

CoRR, 2019

Weakly-Supervised Spatial Context Networks.

[BibT_eX]

[DOI]

Zuxuan Wu

Larry Davis

Leonid Sigal

Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2019

LiteEval: A Coarse-to-Fine Framework for Resource Efficient Video Recognition.

[BibT_eX]

[DOI]

Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, 2019

Self-Monitoring Navigation Agent via Auxiliary Progress Estimation.

[BibT_eX]

[DOI]

Proceedings of the 7th International Conference on Learning Representations, 2019

ACE: Adapting to Changing Environments for Semantic Segmentation.

[BibT_eX]

[DOI]

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

FiNet: Compatible and Diverse Fashion Image Inpainting.

[BibT_eX]

[DOI]

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

AdaFrame: Adaptive Frame Selection for Fast Video Recognition.

[BibT_eX]

[DOI]

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

The Regretful Agent: Heuristic-Aided Navigation Through Progress Estimation.

[BibT_eX]

[DOI]

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

2018

Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification.

[BibT_eX]

[DOI]

IEEE Trans. Multim., 2018

Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks.

[BibT_eX]

[DOI]

IEEE Trans. Pattern Anal. Mach. Intell., 2018

DCAN: Dual Channel-Wise Alignment Networks for Unsupervised Scene Adaptation.

[BibT_eX]

[DOI]

Zuxuan Wu

Xintong Han

Yen-Liang Lin

Mustafa Gökhan Uzunbas

Tom Goldstein

Ser-Nam Lim

Larry S. Davis

Computer Vision - ECCV 2018, 2018

BlockDrop: Dynamic Inference Paths in Residual Networks.

[BibT_eX]

[DOI]

Rogério Schmidt Feris

Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018

VITON: An Image-Based Virtual Try-On Network.

[BibT_eX]

[DOI]

Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018

Deep learning for video classification and captioning.

[BibT_eX]

[DOI]

Frontiers of Multimedia Research, 2018

2017

Aggregating Frame-level Features for Large-Scale Video Classification.

[BibT_eX]

[DOI]

CoRR, 2017

Learning Semantic Feature Map for Visual Content Recognition.

[BibT_eX]

[DOI]

Proceedings of the 2017 ACM on Multimedia Conference, 2017

LSVC2017: Large-Scale Video Classification Challenge.

[BibT_eX]

[DOI]

Proceedings of the 2017 ACM on Multimedia Conference, 2017

Learning Fashion Compatibility with Bidirectional LSTMs.

[BibT_eX]

[DOI]

Proceedings of the 2017 ACM on Multimedia Conference, 2017

Automatic Spatially-Aware Fashion Concept Discovery.

[BibT_eX]

[DOI]

Proceedings of the IEEE International Conference on Computer Vision, 2017

2016

Deep Learning for Video Classification and Captioning.

[BibT_eX]

[DOI]

CoRR, 2016

Multi-Stream Multi-Class Fusion of Deep Networks for Video Classification.

[BibT_eX]

[DOI]

Proceedings of the 2016 ACM Conference on Multimedia Conference, 2016

Exploiting Objects with LSTMs for Video Categorization.

[BibT_eX]

[DOI]

Proceedings of the 2016 ACM Conference on Multimedia Conference, 2016

Emotion in Context: Deep Semantic Feature Fusion for Video Emotion Recognition.

[BibT_eX]

[DOI]

Chen Chen

Zuxuan Wu

Yu-Gang Jiang

Proceedings of the 2016 ACM Conference on Multimedia Conference, 2016

Harnessing Object and Scene Semantics for Large-Scale Video Understanding.

[BibT_eX]

[DOI]

Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016

2015

Fusing Multi-Stream Deep Networks for Video Classification.

[BibT_eX]

[DOI]

CoRR, 2015

Fudan at TRECVID 2015: Adaptive Feature Fusion for Multimedia Event Detection in Videos.

[BibT_eX]

[DOI]

Proceedings of the 2015 TREC Video Retrieval Evaluation, 2015

NTT-Fudan Team @ TRECVID 2015: Multimedia Event Detection.

[BibT_eX]

[DOI]

Proceedings of the 2015 TREC Video Retrieval Evaluation, 2015

Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification.

[BibT_eX]

[DOI]

Proceedings of the 23rd Annual ACM Conference on Multimedia Conference, MM '15, Brisbane, Australia, October 26, 2015

Evaluating Two-Stream CNN for Video Classification.

[BibT_eX]

[DOI]

Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, 2015

Fudan-Huawei at MediaEval 2015: Detecting Violent Scenes and Affective Impact in Movies with Deep Learning.

[BibT_eX]

[DOI]

Working Notes Proceedings of the MediaEval 2015 Workshop, 2015

2014

Fudan Team at TRECVID 2014: Multimedia Event Detection.

[BibT_eX]

[DOI]

Zuxuan Wu

Rui-Wei Zhao

Proceedings of the 2014 TREC Video Retrieval Evaluation, 2014

Exploring Inter-feature and Inter-class Relationships with Deep Neural Networks for Video Classification.

[BibT_eX]

[DOI]

Proceedings of the ACM International Conference on Multimedia, MM '14, Orlando, FL, USA, November 03, 2014

Fudan-NJUST at MediaEval 2014: Violent Scenes Detection Using Deep Neural Networks.

[BibT_eX]

[DOI]

Working Notes Proceedings of the MediaEval 2014 Workshop, 2014

Challenge Huawei challenge: Fusing multimodal features with deep neural networks for Mobile Video Annotation.

[BibT_eX]

[DOI]

Proceedings of the 2014 IEEE International Conference on Multimedia and Expo Workshops, 2014
