Enze Xie

ORCID: 0000-0001-6890-1049

According to our database, Enze Xie authored at least 137 papers between 2018 and 2026.

Collaborative distances:

Dijkstra number of four.
Erdős number of three.

Bibliography

2026

Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models.

CoRR, May, 2026

SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer.

CoRR, May, 2026

Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving.

CoRR, May, 2026

LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation.

CoRR, May, 2026

SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer.

CoRR, May, 2026

FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling.

CoRR, April, 2026

Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM.

CoRR, April, 2026

MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head.

CoRR, January, 2026

S2I-DiT: Unlocking the semantic-to-image transferability by fine-tuning large diffusion transformer models.

Pattern Recognit., 2026

2025

Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed.

Yonggan Fu

Lexington Allen Whalen

CoRR, December, 2025

A Survey of Reasoning with Foundation Models: Concepts, Methodologies, and Outlook.

ACM Comput. Surv., November, 2025

ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation.

CoRR, October, 2025

Fast-dLLM v2: Efficient Block-Diffusion LLM.

CoRR, September, 2025

Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel.

CoRR, September, 2025

DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder.

CoRR, September, 2025

DC-Gen: Post-Training Diffusion Acceleration with Deeply Compressed Latent Space.

CoRR, September, 2025

SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer.

CoRR, September, 2025

LongLive: Real-time Interactive Long Video Generation.

CoRR, September, 2025

TextBlockV2: Towards Precise-Detection-Free Scene Text Spotting with Pre-trained Language Model.

ACM Trans. Multim. Comput. Commun. Appl., June, 2025

T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-Image Generation.

IEEE Trans. Pattern Anal. Mach. Intell., May, 2025

Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding.

CoRR, May, 2025

BEVFormer: Learning Bird's-Eye-View Representation From LiDAR-Camera via Spatiotemporal Transformers.

IEEE Trans. Pattern Anal. Mach. Intell., March, 2025

AlgoFormer: An Efficient Transformer Framework with Algorithmic Structures.

Trans. Mach. Learn. Res., 2025

SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

SANA: Efficient High-Resolution Text-to-Image Synthesis with Linear Diffusion Transformers.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

HART: Efficient Visual Generation with Hybrid Autoregressive Transformer.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

SVDQuant: Absorbing Outliers by Low-Rank Component for 4-Bit Diffusion Models.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Char-SAM: Turning Segment Anything Model into Scene Text Segmentation Annotator with Character-level Visual Prompts.

Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing, 2025

SplatMesh: Interactive 3D Segmentation and Editing Using Mesh-Based Gaussian Splatting.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025

2024

Fast-BEV: A Fast and Strong Bird's-Eye View Perception Baseline.

IEEE Trans. Pattern Anal. Mach. Intell., December, 2024

DriveGPT4: Interpretable End-to-End Autonomous Driving Via Large Language Model.

IEEE Robotics Autom. Lett., October, 2024

Delving Into the Devils of Bird's-Eye-View Perception: A Review, Evaluation and Recipe.

IEEE Trans. Pattern Anal. Mach. Intell., April, 2024

Deeply Unsupervised Patch Re-Identification for Pre-Training Object Detectors.

IEEE Trans. Pattern Anal. Mach. Intell., March, 2024

Lyra: Orchestrating Dual Correction in Automated Theorem Proving.

Trans. Mach. Learn. Res., 2024

Char-SAM: Turning Segment Anything Model into Scene Text Segmentation Annotator with Character-level Visual Prompts.

CoRR, 2024

SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models.

CoRR, 2024

Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models.

CoRR, 2024

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers.

CoRR, 2024

DriveCoT: Integrating Chain-of-Thought Reasoning with End-to-End Driving.

CoRR, 2024

Editing Massive Concepts in Text-to-Image Diffusion Models.

CoRR, 2024

TextBlockV2: Towards Precise-Detection-Free Scene Text Spotting with Pre-trained Language Model.

CoRR, 2024

On the Expressive Power of a Variant of the Looped Transformer.

CoRR, 2024

Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation.

CoRR, 2024

CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects.

CoRR, 2024

PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models.

CoRR, 2024

SF3D: SlowFast Temporal 3D Object Detection.

Proceedings of the IEEE Intelligent Vehicles Symposium, 2024

DQ-LoRe: Dual Queries with Low Rank Approximation Re-ranking for In-Context Learning.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

LEGO-Prover: Neural Theorem Proving with Growing Libraries.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Large Language Models as Automated Aligners for benchmarking Vision-Language Models.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

GeoDiffusion: Text-Prompted Geometric Control for Object Detection Data Generation.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

MagicDrive: Street View Generation with Diverse 3D Geometry Control.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation.

Proceedings of the Computer Vision - ECCV 2024, 2024

Segment, Lift and Fit: Automatic 3D Shape Labeling from 2D Prompts.

Proceedings of the Computer Vision - ECCV 2024, 2024

PIXART-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation.

Proceedings of the Computer Vision - ECCV 2024, 2024

Accelerating Diffusion Sampling with Optimized Time Steps.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

DeepAccident: A Motion and Accident Prediction Benchmark for V2X Autonomous Driving.

Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023

CycleMLP: A MLP-Like Architecture for Dense Visual Predictions.

IEEE Trans. Pattern Anal. Mach. Intell., December, 2023

SERF: Fine-Grained Interactive 3D Segmentation and Editing with Radiance Fields.

CoRR, 2023

A Survey of Reasoning with Foundation Models.

CoRR, 2023

Drag-A-Video: Non-rigid Video Editing with Point-based Interaction.

CoRR, 2023

Animate124: Animating One Image to 4D Dynamic Scene.

CoRR, 2023

Large Language Models as Automated Aligners for benchmarking Vision-Language Models.

CoRR, 2023

LEGO-Prover: Neural Theorem Proving with Growing Libraries.

CoRR, 2023

PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis.

CoRR, 2023

DiffFlow: A Unified SDE Framework for Score-Based Diffusion Models and Generative Adversarial Networks.

CoRR, 2023

DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation.

CoRR, 2023

Integrating Geometric Control into Text-to-Image Diffusion Models for High-Quality Detection Data Generation via Text Prompt.

CoRR, 2023

Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts.

CoRR, 2023

MetaBEV: Solving Sensor Failures for BEV Detection and Map Segmentation.

CoRR, 2023

Progressive-Hint Prompting Improves Reasoning in Large Language Models.

CoRR, 2023

Vehicle-Infrastructure Cooperative 3D Object Detection via Feature Flow Prediction.

CoRR, 2023

Fast-BEV: Towards Real-time On-vehicle Bird's-Eye View Perception.

CoRR, 2023

Feature Enhancement with Text-Specific Region Contrast for Scene Text Detection.

Proceedings of the Pattern Recognition and Computer Vision - 6th Chinese Conference, 2023

Flow-Based Feature Fusion for Vehicle-Infrastructure Cooperative 3D Object Detection.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

DiffComplete: Diffusion-based Generative 3D Shape Completion.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Parametric Depth Based Feature Representation Learning for Object Detection and Segmentation in Bird's-Eye View.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

DDP: Diffusion Model for Dense Visual Prediction.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Beyond One-to-One: Rethinking the Referring Image Segmentation.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

MetaBEV: Solving Sensor Failures for 3D Detection and Map Segmentation.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

DT-Solver: Automated Theorem Proving with Dynamic-Tree Sampling Guided by Proof-level Value Function.

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

2022

Improving Monocular Visual Odometry Using Learned Depth.

IEEE Trans. Robotics, 2022

PolarMask++: Enhanced Polar Representation for Single-Shot Instance Segmentation and Beyond.

IEEE Trans. Pattern Anal. Mach. Intell., 2022

PAN++: Towards Efficient and Accurate End-to-End Spotting of Arbitrarily-Shaped Text.

IEEE Trans. Pattern Anal. Mach. Intell., 2022

PVT v2: Improved baselines with Pyramid Vision Transformer.

Comput. Vis. Media, 2022

Delving into the Devils of Bird's-eye-view Perception: A Review, Evaluation and Recipe.

CoRR, 2022

M²BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation.

CoRR, 2022

BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers.

CoRR, 2022

WegFormer: Transformers for Weakly Supervised Semantic Segmentation.

CoRR, 2022

Understanding The Robustness in Vision Transformers.

Animashree Anandkumar

Jiashi Feng

José M. Álvarez

Proceedings of the International Conference on Machine Learning, 2022

UNITS: Unsupervised Intermediate Training Stage for Scene Text Detection.

Proceedings of the IEEE International Conference on Multimedia and Expo, 2022

CycleMLP: A MLP-like Architecture for Dense Prediction.

Proceedings of the Tenth International Conference on Learning Representations, 2022

Polygon-Free: Unconstrained Scene Text Detection with Box Annotations.

Proceedings of the 2022 IEEE International Conference on Image Processing, 2022

BEVFormer: Learning Bird's-Eye-View Representation from Multi-camera Images via Spatiotemporal Transformers.

Proceedings of the Computer Vision - ECCV 2022, 2022

Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Towards Ultra-Resolution Neural Style Transfer via Thumbnail Instance Normalization.

Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022

2021

FAST: Searching for a Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation.

CoRR, 2021

Panoptic SegFormer.

CoRR, 2021

PVTv2: Improved Baselines with Pyramid Vision Transformer.

CoRR, 2021

PAN++: Towards Efficient and Accurate End-to-End Spotting of Arbitrarily-Shaped Text.

CoRR, 2021

FakeMix Augmentation Improves Transparent Object Detection.

CoRR, 2021

Unsupervised Pretraining for Object Detection by Patch Reidentification.

CoRR, 2021

DetCo: Unsupervised Contrastive Learning for Object Detection.

CoRR, 2021

Trans2Seg: Transparent Object Segmentation with Transformer.

CoRR, 2021

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers.

Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Segmenting Transparent Objects in the Wild with Transformer.

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, 2021

What Makes for End-to-End Object Detection?

Proceedings of the 38th International Conference on Machine Learning, 2021

DetCo: Unsupervised Contrastive Learning for Object Detection.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Watch Only Once: An End-to-End Video Action Detection Framework.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

2020

TransTrack: Multiple-Object Tracking with Transformer.

CoRR, 2020

OneNet: Towards End-to-End One-Stage Object Detection.

CoRR, 2020

SelfText Beyond Polygon: Unconstrained Text Detection with Box Supervision and Dynamic Self-Training.

CoRR, 2020

Synthetic-to-Real Unsupervised Domain Adaptation for Scene Text Detection in the Wild.

Weijia Wu

Ning Lu

Enze Xie

CoRR, 2020

1st Place Solutions for OpenImage2019 - Object Detection and Instance Segmentation.

CoRR, 2020

Segmenting Transparent Objects in the Wild.

Proceedings of the Computer Vision - ECCV 2020, 2020

Scene Text Image Super-Resolution in the Wild.

Proceedings of the Computer Vision - ECCV 2020, 2020

AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting.

Proceedings of the Computer Vision - ECCV 2020, 2020

Differentiable Hierarchical Graph Grouping for Multi-person Pose Estimation.

Proceedings of the Computer Vision - ECCV 2020, 2020

PolarMask: Single Shot Instance Segmentation With Polar Representation.

Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

Synthetic-to-Real Unsupervised Domain Adaptation for Scene Text Detection in the Wild.

Proceedings of the Computer Vision - ACCV 2020 - 15th Asian Conference on Computer Vision, Kyoto, Japan, November 30, 2020

2019

TextSR: Content-Aware Text Super-Resolution Guided by Recognition.

CoRR, 2019

Shape Robust Text Detection with Progressive Scale Expansion Network.

CoRR, 2019

Efficient and Accurate Arbitrary-Shaped Text Detection With Pixel Aggregation Network.

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

Shape Robust Text Detection With Progressive Scale Expansion Network.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

Scene Text Detection with Supervised Pyramid Context Network.

Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019

2018

Fast OBDD Reordering using Neural Message Passing on Hypergraph.

CoRR, 2018

Attention Cropping: A Novel Data Augmentation Method for Real-world Plant Species Identification.

CoRR, 2018

Improving Fine-Grained Object Classification Using Adversarial Generated Unlabelled Samples.

Enze Xie

Guangyao Li

Wenyu Liu

Proceedings of the Fourth IEEE International Conference on Multimedia Big Data, 2018
