Yifei Huang

ORCID: 0000-0001-8067-6227

Affiliations:
  • University of Tokyo, Sato Laboratory, Tokyo, Japan
  • Shanghai AI Lab, Shanghai, China
  • Shanghai Jiao Tong University, China (until 2015)


According to our database, Yifei Huang authored at least 64 papers between 2017 and 2025.

Bibliography

2025
Vinci: A Real-time Smart Assistant Based on Egocentric Vision-language Model for Portable Devices.
Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., September, 2025

EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs.
CoRR, July, 2025

Bridging Perspectives: A Survey on Cross-view Collaborative Intelligence with Egocentric-Exocentric Vision.
CoRR, June, 2025

Egocentric Action-aware Inertial Localization in Point Clouds.
CoRR, May, 2025

Weakly Supervised Temporal Sentence Grounding via Positive Sample Mining.
CoRR, May, 2025

Learning Streaming Video Representation via Multitask Training.
CoRR, April, 2025

An Egocentric Vision-Language Model based Portable Real-time Smart Assistant.
CoRR, March, 2025

AutoGaze: A Very Initial Exploration in A SAM2-based Pipeline for Automated Eye-Object Interaction Analysis in First-Person Videos.
Proceedings of the IEEE Conference on Virtual Reality and 3D User Interfaces, 2025

EgoExo-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

SiMHand: Mining Similar Hands for Large-Scale 3D Hand Pose Pre-training.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024
Matching Compound Prototypes for Few-Shot Action Recognition.
Int. J. Comput. Vis., September, 2024

Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model.
CoRR, 2024

Pre-Training for 3D Hand Pose Estimation with Contrastive Learning on Large-Scale Hand Images in the Wild.
CoRR, 2024

EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation.
CoRR, 2024

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding.
CoRR, 2024

Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding.
CoRR, 2024

FineBio: A Fine-Grained Video Dataset of Biological Experiments with Hierarchical Annotation.
CoRR, 2024

Masked Video and Body-Worn IMU Autoencoder for Egocentric Action Recognition.
Proceedings of the Computer Vision - ECCV 2024, 2024

InternVideo2: Scaling Foundation Models for Multimodal Video Understanding.
Proceedings of the Computer Vision - ECCV 2024, 2024

ActionVOS: Actions as Prompts for Video Object Segmentation.
Proceedings of the Computer Vision - ECCV 2024, 2024

Retrieval-Augmented Egocentric Video Captioning.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023
MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding.
CoRR, 2023

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives.
CoRR, 2023

VideoLLM: Modeling Video Sequence with Large Language Models.
CoRR, 2023

Fine-grained Affordance Annotation for Egocentric Hand-Object Interaction Videos.
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023

3D Segmenter: 3D Transformer based Semantic Segmentation via 2D Panoramic Distillation.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

Memory-and-Anticipation Transformer for Online Action Understanding.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Weakly Supervised Temporal Sentence Grounding with Uncertainty-Guided Self-training.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Structural Multiplane Image: Bridging Neural View Synthesis and 3D Reconstruction.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

First Bite/Chew: distinguish different types of food by first biting/chewing and the corresponding hand movement.
Proceedings of the Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, 2023

Proposal-based Temporal Action Localization with Point-level Supervision.
Proceedings of the 34th British Machine Vision Conference 2023, 2023

First Bite/Chew: distinguish typical allergic food by two IMUs.
Proceedings of the Augmented Humans International Conference 2023, 2023

2022
Spatio-Temporal Perturbations for Video Attribution.
IEEE Trans. Circuits Syst. Video Technol., 2022

InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges.
CoRR, 2022

Precise Affordance Annotation for Egocentric Action Video Datasets.
CoRR, 2022

Seeing our Blind Spots: Smart Glasses-based Simulation to Increase Design Students' Awareness of Visual Impairment.
Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, 2022

Inner self drawing machine.
Proceedings of the SIGGRAPH Asia 2022 Art Gallery, 2022

GazeSync: Eye Movement Transfer Using an Optical Eye Tracker and Monochrome Liquid Crystal Displays.
Proceedings of the IUI 2022: 27th International Conference on Intelligent User Interfaces, 2022

Compound Prototype Matching for Few-Shot Action Recognition.
Proceedings of the Computer Vision - ECCV 2022, 2022

Interact before Align: Leveraging Cross-Modal Knowledge for Domain Adaptive Action Recognition.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021
Ego4D: Around the World in 3,000 Hours of Egocentric Video.
CoRR, 2021

EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition 2021: Team M3EM Technical Report.
CoRR, 2021

Towards Visually Explaining Video Understanding Networks with Perturbation.
Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2021

Precise Multi-Modal In-Hand Pose Estimation using Low-Precision Sensors for Robotic Assembly.
Proceedings of the IEEE International Conference on Robotics and Automation, 2021

Goal-Oriented Gaze Estimation for Zero-Shot Learning.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

Stacked Temporal Attention: Improving First-person Action Recognition by Emphasizing Discriminative Clips.
Proceedings of the 32nd British Machine Vision Conference 2021, 2021

Leveraging Human Selective Attention for Medical Image Analysis with Limited Training Data.
Proceedings of the 32nd British Machine Vision Conference 2021, 2021

Commonsense Knowledge Aware Concept Selection For Diverse and Informative Visual Storytelling.
Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021

2020
Mutual Context Network for Jointly Estimating Egocentric Gaze and Action.
IEEE Trans. Image Process., 2020

An Ego-Vision System for Discovering Human Joint Attention.
IEEE Trans. Hum. Mach. Syst., 2020

Learn to Extract Building Outline from Misaligned Annotation through Nearest Feature Selector.
Remote. Sens., 2020

A Comprehensive Study on Visual Explanations for Spatio-temporal Networks.
CoRR, 2020

Learn to Recover Visible Color for Video Surveillance in a Day.
Proceedings of the Computer Vision - ECCV 2020, 2020

Improving Action Segmentation via Graph-Based Temporal Reasoning.
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

2019
Mutual Context Network for Jointly Estimating Egocentric Gaze and Actions.
CoRR, 2019

Manipulation-Skill Assessment from Videos with Spatial Attention Network.
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshops, 2019

2018
Predicting Gaze in Egocentric Video by Learning Task-Dependent Attention Transition.
Proceedings of the Computer Vision - ECCV 2018, 2018

Semantic Aware Attention Based Deep Object Co-segmentation.
Proceedings of the Computer Vision - ACCV 2018, 2018

2017
Temporal Localization and Spatial Segmentation of Joint Attention in Multiple First-Person Videos.
Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops, 2017

