Teng Wang

ORCID: 0000-0003-2331-3619

Affiliations:
  • Tencent ARC Lab, Shenzhen, China
  • University of Hong Kong, MMLab, Hong Kong (PhD)


According to our database, Teng Wang authored at least 51 papers between 2019 and 2026.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2026
LiveVLN: Breaking the Stop-and-Go Loop in Vision-Language Navigation.
CoRR, April, 2026

OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video.
CoRR, April, 2026

Video Understanding With Large Language Models: A Survey.
IEEE Trans. Circuits Syst. Video Technol., February, 2026

NaVIDA: Vision-Language Navigation with Inverse Dynamics Augmentation.
CoRR, January, 2026

SynPO: Synergizing Descriptiveness and Preference Optimization for Video Detailed Captioning.
IEEE Trans. Image Process., 2026

MCoCa: Towards fine-grained multimodal control in image captioning.
Pattern Recognit., 2026

R-AVST: Empowering Video-LLMs with Fine-Grained Spatio-Temporal Reasoning in Complex Audio-Visual Scenarios.
Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

2025
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs.
CoRR, December, 2025

UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization.
IEEE Trans. Pattern Anal. Mach. Intell., November, 2025

ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries.
CoRR, November, 2025

From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model.
CoRR, October, 2025

FastGRPO: Accelerating Policy Optimization via Concurrency-aware Speculative Decoding and Online Draft Learning.
CoRR, September, 2025

AudioStory: Generating Long-Form Narrative Audio with Large Language Models.
CoRR, August, 2025

ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts.
CoRR, July, 2025

Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder.
CoRR, June, 2025

SynPO: Synergizing Descriptiveness and Preference Optimization for Video Detailed Captioning.
CoRR, June, 2025

An Event-Aware Dual Representation Model With Mixture-of-Experts for Serious Adverse Events Prediction in Clinical Trials.
IEEE Trans. Consumer Electron., May, 2025

Reinforcing Video Reasoning with Focused Thinking.
CoRR, May, 2025

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
CoRR, May, 2025

TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation.
CoRR, May, 2025

Hallucination Reduction in Video-Language Models via Hierarchical Multimodal Consistency.
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, 2025

Diff-LMM: Diffusion Teacher-Guided Spatio-Temporal Perception for Video Large Multimodal Models.
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, 2025

Instruction-aware Memory Network for Video Recognition.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2025

Quality-Guided Dynamic Memory for LLMs-based Long-Term Video Understanding.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2025

Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Seeing More, Saying More: Lightweight Language Experts are Dynamic Video Token Compressors.
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2024
UniAV: Unified Audio-Visual Perception for Multi-Task Video Localization.
CoRR, 2024

Two in One Go: Single-stage Emotion Recognition with Decoupled Subject-context Transformer.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models.
Proceedings of the Computer Vision - ECCV 2024, 2024

2023
Show, Tell and Rephrase: Diverse Video Captioning via Two-Stage Progressive Training.
IEEE Trans. Multim., 2023

Video Understanding with Large Language Models: A Survey.
CoRR, 2023

PTVD: A Large-Scale Plot-Oriented Multimodal Dataset Based on Television Dramas.
CoRR, 2023

LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning.
CoRR, 2023

Caption Anything: Interactive Image Description with Diverse Multimodal Controls.
CoRR, 2023

Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos.
CoRR, 2023

π-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation.
Proceedings of the International Conference on Machine Learning, 2023

Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Transferable Decoding with Visual Entities for Zero-Shot Image Captioning.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Accelerating Vision-Language Pretraining with Free Language Modeling.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

2022
Exploiting Context Information for Generic Event Boundary Captioning.
CoRR, 2022

Semantic-Aware Pretraining for Dense Video Captioning.
CoRR, 2022

VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix.
Proceedings of the International Conference on Machine Learning, 2022

Multi-modal Segment Assemblage Network for Ad Video Editing with Importance-Coherence Reward.
Proceedings of the Computer Vision - ACCV 2022, 2022

2021
Event-Centric Hierarchical Representation for Dense Video Captioning.
IEEE Trans. Circuits Syst. Video Technol., 2021

End-to-End Dense Video Captioning with Parallel Decoding.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

2020
Dense-Captioning Events in Videos: SYSU Submission to ActivityNet Challenge 2020.
CoRR, 2020

2019
Image Caption with Endogenous-Exogenous Attention.
Neural Process. Lett., 2019
