Xinyu Li

ORCID: 0000-0001-5398-3707

Affiliations:

Amazon Web Services, AWS AI Labs, Seattle, USA
Rutgers University, Department of Electrical and Computer Engineering, NJ, USA

According to our database¹, Xinyu Li authored at least 52 papers between 2016 and 2026.

Collaborative distances:

Dijkstra number² of four.
Erdős number³ of three.

Bibliography

2026

STORM: End-to-End Referring Multi-Object Tracking in Videos.

CoRR, April, 2026

Learning Compact Video Representations for Efficient Long-form Video Understanding in Large Multimodal Models.

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2026

2025

GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-Grained Video-Language Learning.

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025

Now you see Me: Context-Aware Automatic Audio Description.

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025

2024

NowYouSee Me: Context-Aware Automatic Audio Description.

CoRR, 2024

Video Token Merging for Long-form Video Understanding.

CoRR, 2024

Video Token Merging for Long Video Understanding.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

Text-Guided Video Masked Autoencoder.

Proceedings of the Computer Vision - ECCV 2024, 2024

2023

Nearest-Neighbor Inter-Intra Contrastive Learning from Unlabeled Videos.

CoRR, 2023

MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation.

Hector J. Santos-Villalobos

Vimal Bhat

Rohith MV

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Motion-Guided Masking for Spatiotemporal Representation Learning.

Hector J. Santos-Villalobos

Rohith MV

Xinyu Li

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

2022

SSCAP: Self-supervised Co-occurrence Action Parsing for Unsupervised Temporal Action Segmentation.

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022

NUTA: Non-uniform Temporal Aggregation for Action Recognition.

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022

TubeR: Tubelet Transformer for Video Action Detection.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Id-Free Person Similarity Learning.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Stochastic Backpropagation: A Memory Efficient Strategy for Training Video Models.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021

Unsupervised Action Segmentation with Self-supervised Feature Learning and Co-occurrence Parsing.

CoRR, 2021

TubeR: Tube-Transformer for Action Detection.

CoRR, 2021

Long Short-Term Transformer for Online Action Detection.

Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Video Contrastive Learning with Global Context.

Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2021

VidTr: Video Transformer Without Convolutions.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Selective Feature Compression for Efficient Activity Recognition Inference.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Multi-Label Activity Recognition Using Activity-Specific Features and Activity Correlations.

Yanyi Zhang

Xinyu Li

Ivan Marsic

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

SiamMOT: Siamese Multi-Object Tracking.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020

A Comprehensive Study of Deep Video Action Recognition.

Yi Zhu

Xinyu Li

Chunhui Liu

Mohammadreza Zolfaghari

CoRR, 2020

Multi-Label Activity Recognition using Activity-specific Features.

Yanyi Zhang

Xinyu Li

Ivan Marsic

CoRR, 2020

Application of Multi-Object Tracking with Siamese Track-RCNN to the Human in Events Dataset.

Proceedings of the MM '20: The 28th ACM International Conference on Multimedia, 2020

Directional Temporal Modeling for Action Recognition.

Xinyu Li

Bing Shuai

Joseph Tighe

Proceedings of the Computer Vision - ECCV 2020, 2020

2019

Mutual Correlation Attentive Factors in Dyadic Fusion Networks for Speech Emotion Recognition.

Proceedings of the 27th ACM International Conference on Multimedia, 2019

Multi-Stream Network with Temporal Attention for Environmental Sound Classification.

Xinyu Li

Venkata Chebiyyam

Katrin Kirchhoff

Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Speech Audio Super-Resolution for Speech Recognition.

Xinyu Li

Venkata Chebiyyam

Katrin Kirchhoff

Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

2018

Tri-axial Self-Attention for Concurrent Activity Recognition.

CoRR, 2018

Human Conversation Analysis Using Attentive Multimodal Networks with Hierarchical Encoder-Decoder.

Proceedings of the 2018 ACM Multimedia Conference on Multimedia Conference, 2018

Hybrid Attention based Multimodal Network for Spoken Language Classification.

Proceedings of the 27th International Conference on Computational Linguistics, 2018

Multimodal Affective Analysis Using Hierarchical Attention Strategy with Word-Level Alignment.

Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018

2017

Progress Estimation and Phase Detection for Sequential Processes.

Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., 2017

A Framework for Evaluating Trace Alignments.

CoRR, 2017

Concurrent Activity Recognition with Multimodal CNN-LSTM Structure.

CoRR, 2017

Process Progress Estimation and Phase Detection.

CoRR, 2017

Online People Tracking and Identification with RFID and Kinect.

CoRR, 2017

Region-based Activity Recognition Using Conditional GAN.

Proceedings of the 2017 ACM on Multimedia Conference, 2017

CAR - a deep learning structure for concurrent activity recognition: poster abstract.

Proceedings of the 16th ACM/IEEE International Conference on Information Processing in Sensor Networks, 2017

3D activity localization with multiple sensors: poster abstract.

Proceedings of the 16th ACM/IEEE International Conference on Information Processing in Sensor Networks, 2017

Evaluation of Trace Alignment Quality and its Application in Medical Process Mining.

Proceedings of the 2017 IEEE International Conference on Healthcare Informatics, 2017

Language-Based Process Phase Detection in the Trauma Resuscitation.

Proceedings of the 2017 IEEE International Conference on Healthcare Informatics, 2017

Speech Intention Classification with Multimodal Deep Learning.

Proceedings of the Advances in Artificial Intelligence, 2017

2016

Online process phase detection using multimodal deep learning.

Proceedings of the 7th IEEE Annual Ubiquitous Computing, Electronics & Mobile Communication Conference, 2016

Deep Learning for RFID-Based Activity Recognition.

Proceedings of the 14th ACM Conference on Embedded Network Sensor Systems, SenSys 2016, 2016

Activity recognition for medical teamwork based on passive RFID.

Proceedings of the 2016 IEEE International Conference on RFID, 2016

Deep neural network for RFID-based activity recognition.

Proceedings of the Eighth Wireless of the Students, by the Students, and for the Students Workshop, 2016

Privacy Preserving Dynamic Room Layout Mapping.

Proceedings of the Image and Signal Processing - 7th International Conference, 2016
