Teng Wang

Orcid: 0000-0003-2331-3619

Affiliations:

Tencent ARC Lab, Shenzhen, China
University of Hong Kong, MMLab, Hong Kong (PhD)

According to our database, Teng Wang authored at least 51 papers between 2019 and 2026.

Collaborative distances:

Dijkstra number of four.
Erdős number of four.

Bibliography

2026

LiveVLN: Breaking the Stop-and-Go Loop in Vision-Language Navigation.

CoRR, April, 2026

OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video.

CoRR, April, 2026

Video Understanding With Large Language Models: A Survey.

IEEE Trans. Circuits Syst. Video Technol., February, 2026

NaVIDA: Vision-Language Navigation with Inverse Dynamics Augmentation.

CoRR, January, 2026

SynPO: Synergizing Descriptiveness and Preference Optimization for Video Detailed Captioning.

IEEE Trans. Image Process., 2026

MCoCa: Towards fine-grained multimodal control in image captioning.

Pattern Recognit., 2026

R-AVST: Empowering Video-LLMs with Fine-Grained Spatio-Temporal Reasoning in Complex Audio-Visual Scenarios.

Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

2025

TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs.

CoRR, December, 2025

UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization.

IEEE Trans. Pattern Anal. Mach. Intell., November, 2025

ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries.

CoRR, November, 2025

From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model.

CoRR, October, 2025

FastGRPO: Accelerating Policy Optimization via Concurrency-aware Speculative Decoding and Online Draft Learning.

CoRR, September, 2025

AudioStory: Generating Long-Form Narrative Audio with Large Language Models.

CoRR, August, 2025

ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts.

CoRR, July, 2025

Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder.

CoRR, June, 2025

SynPO: Synergizing Descriptiveness and Preference Optimization for Video Detailed Captioning.

CoRR, June, 2025

An Event-Aware Dual Representation Model With Mixture-of-Experts for Serious Adverse Events Prediction in Clinical Trials.

IEEE Trans. Consumer Electron., May, 2025

Reinforcing Video Reasoning with Focused Thinking.

CoRR, May, 2025

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

CoRR, May, 2025

TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation.

CoRR, May, 2025

Hallucination Reduction in Video-Language Models via Hierarchical Multimodal Consistency.

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, 2025

Diff-LMM: Diffusion Teacher-Guided Spatio-Temporal Perception for Video Large Multimodal Models.

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, 2025

Instruction-aware Memory Network for Video Recognition.

Proceedings of the IEEE International Conference on Multimedia and Expo, 2025

Quality-Guided Dynamic Memory for LLMs-based Long-Term Video Understanding.

Proceedings of the IEEE International Conference on Multimedia and Expo, 2025

Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Seeing More, Saying More: Lightweight Language Experts are Dynamic Video Token Compressors.

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2024

UniAV: Unified Audio-Visual Perception for Multi-Task Video Localization.

CoRR, 2024

Two in One Go: Single-stage Emotion Recognition with Decoupled Subject-context Transformer.

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models.

Proceedings of the Computer Vision - ECCV 2024, 2024

2023

Show, Tell and Rephrase: Diverse Video Captioning via Two-Stage Progressive Training.

IEEE Trans. Multim., 2023

Video Understanding with Large Language Models: A Survey.

CoRR, 2023

PTVD: A Large-Scale Plot-Oriented Multimodal Dataset Based on Television Dramas.

CoRR, 2023

LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning.

CoRR, 2023

Caption Anything: Interactive Image Description with Diverse Multimodal Controls.

CoRR, 2023

Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos.

CoRR, 2023

π-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation.

Proceedings of the International Conference on Machine Learning, 2023

Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Transferable Decoding with Visual Entities for Zero-Shot Image Captioning.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Accelerating Vision-Language Pretraining with Free Language Modeling.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

2022

Exploiting Context Information for Generic Event Boundary Captioning.

CoRR, 2022

Semantic-Aware Pretraining for Dense Video Captioning.

CoRR, 2022

VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix.

Proceedings of the International Conference on Machine Learning, 2022

Multi-modal Segment Assemblage Network for Ad Video Editing with Importance-Coherence Reward.

Proceedings of the Computer Vision - ACCV 2022, 2022

2021

Event-Centric Hierarchical Representation for Dense Video Captioning.

IEEE Trans. Circuits Syst. Video Technol., 2021

End-to-End Dense Video Captioning with Parallel Decoding.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

2020

Dense-Captioning Events in Videos: SYSU Submission to ActivityNet Challenge 2020.

Teng Wang

Huicheng Zheng

Mingjing Yu

CoRR, 2020

2019

Image Caption with Endogenous-Exogenous Attention.

Teng Wang

Haifeng Hu

Chen He

Neural Process. Lett., 2019
