Shizhe Chen

Orcid: 0000-0002-7313-9703

According to our database, Shizhe Chen authored at least 99 papers between 2014 and 2026.

Collaborative distances:

Dijkstra number of four.
Erdős number of four.

Bibliography

2026

PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction.

[DOI]

,

,

Cordelia Schmid

CoRR, May, 2026

HO-Flow: Generalizable Hand-Object Interaction Generation with Latent Flow Matching.

[DOI]

,

Rolandos Alexandros Potamias

,

,

,

Cordelia Schmid

,

Stefanos Zafeiriou

CoRR, April, 2026

FIRE-CIR: Fine-grained Reasoning for Composed Fashion Image Retrieval.

[DOI]

François Gardères

,

Camille-Sovanneary Gauthier

,

,

CoRR, April, 2026

MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping.

[DOI]

,

Antoine Guédon

,

,

Vincent Lepetit

CoRR, March, 2026

Robust Interacting Multiple Model Kalman Filter Based on Generalized Gaussian-Cauchy Mixture Correntropy for Non-Gaussian Noises.

[DOI]

,

,

,

IEEE Trans. Aerosp. Electron. Syst., 2026

2025

Guardian: Detecting Robotic Planning and Execution Errors with Vision-Language Models.

[DOI]

,

,

,

Cordelia Schmid

CoRR, December, 2025

FOM-Nav: Frontier-Object Maps for Object Goal Navigation.

[DOI]

,

,

,

Cordelia Schmid

CoRR, December, 2025

Hear: Hierarchically Enhanced Aesthetic Representations For Multidimensional Music Evaluation.

[DOI]

,

,

,

,

,

CoRR, November, 2025

FACap: A Large-scale Fashion Dataset for Fine-grained Composed Image Retrieval.

[DOI]

François Gardères

,

,

Camille-Sovanneary Gauthier

,

CoRR, July, 2025

Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation.

[DOI]

,

,

,

Cordelia Schmid

CoRR, June, 2025

ComposeAnything: Composite Object Priors for Text-to-Image Generation.

[DOI]

,

,

Cordelia Schmid

CoRR, May, 2025

Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-Guided 3D Policy.

[DOI]

,

,

Cordelia Schmid

Proceedings of the IEEE International Conference on Robotics and Automation, 2025

ViViDex: Learning Vision-Based Dexterous Manipulation from Human Videos.

[DOI]

,

,

,

,

Cordelia Schmid

Proceedings of the IEEE International Conference on Robotics and Automation, 2025

NextBestPath: Efficient 3D Mapping of Unseen Environments.

[DOI]

,

Antoine Guédon

,

Clémentin Boittiaux

,

,

Vincent Lepetit

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

HORT: Monocular Hand-held Objects Reconstruction with Transformers.

[DOI]

,

Rolandos Alexandros Potamias

,

,

Cordelia Schmid

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

MuKA: Multimodal Knowledge Augmented Visual Information-Seeking.

[DOI]

,

,

,

,

,

Proceedings of the 31st International Conference on Computational Linguistics, 2025

Online 3D Scene Reconstruction Using Neural Object Priors.

[DOI]

,

,

,

Cordelia Schmid

Proceedings of the International Conference on 3D Vision, 2025

2024

SOD-diffusion: Salient Object Detection via Diffusion-Based Image Generators.

[DOI]

,

,

,

,

,

Comput. Graph. Forum, October, 2024

Conan-embedding: General Text Embedding with More and Better Negative Samples.

[DOI]

,

,

,

CoRR, 2024

ViViDex: Learning Vision-based Dexterous Manipulation from Human Videos.

[DOI]

,

,

Cordelia Schmid

,

CoRR, 2024

Think-Program-reCtify: 3D Situated Reasoning with Large Language Models.

[DOI]

,

,

,

,

CoRR, 2024

SUGAR: Pre-training 3D Visual Representations for Robotics.

[DOI]

,

,

,

Cordelia Schmid

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023

A Seawater Salinity Sensor Based on Optimized Long Period Fiber Grating in the Dispersion Turning Point.

[DOI]

,

,

,

,

,

,

,

,

Sensors, 2023

Translating Text Synopses to Video Storyboards.

[DOI]

,

,

,

,

,

,

CoRR, 2023

TeViS: Translating Text Synopses to Video Storyboards.

[DOI]

,

,

,

,

,

,

,

Proceedings of the 31st ACM International Conference on Multimedia, 2023

Robust Visual Sim-to-Real Transfer for Robotic Manipulation.

[DOI]

,

,

,

,

,

Cordelia Schmid

Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2023

Object Goal Navigation with Recursive Implicit Maps.

[DOI]

,

,

,

Cordelia Schmid

Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2023

Explore and Tell: Embodied Visual Captioning in 3D Environments.

[DOI]

,

,

,

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction.

[DOI]

,

,

Cordelia Schmid

,

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation.

[DOI]

,

,

Cordelia Schmid

,

Proceedings of the Conference on Robot Learning, 2023

InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation.

[DOI]

,

,

,

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

2022

Enhancing Neural Machine Translation With Dual-Side Multimodal Awareness.

[DOI]

,

,

,

,

,

IEEE Trans. Multim., 2022

Language Conditioned Spatial Relation Reasoning for 3D Object Grounding.

[DOI]

,

Pierre-Louis Guhur

,

Makarand Tapaswi

,

Cordelia Schmid

,

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Few-Shot Action Recognition with Hierarchical Matching and Contrastive Learning.

[DOI]

,

,

Proceedings of the Computer Vision - ECCV 2022, 2022

Learning from Unlabeled 3D Environments for Vision-and-Language Navigation.

[DOI]

,

Pierre-Louis Guhur

,

Makarand Tapaswi

,

Cordelia Schmid

,

Proceedings of the Computer Vision - ECCV 2022, 2022

VRDFormer: End-to-End Video Visual Relation Detection with Transformers.

[DOI]

,

,

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation.

[DOI]

,

Pierre-Louis Guhur

,

Makarand Tapaswi

,

Cordelia Schmid

,

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Instruction-driven history-aware policies for robotic manipulations.

[DOI]

Pierre-Louis Guhur

,

,

,

Makarand Tapaswi

,

,

Cordelia Schmid

Proceedings of the Conference on Robot Learning, 2022

2021

Development of Capacitive Rain Gauge for Marine Environment.

[DOI]

,

,

,

,

,

,

,

,

J. Sensors, 2021

Team RUC_AIM3 Technical Report at ActivityNet 2021: Entities Object Localization.

[DOI]

,

,

,

,

CoRR, 2021

WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training.

[DOI]

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

CoRR, 2021

A Continuous Space Location Model and a Particle Swarm Optimization-Based Heuristic Algorithm for Maximizing the Allocation of Ocean-Moored Buoys.

[DOI]

,

,

,

,

,

,

,

,

IEEE Access, 2021

History Aware Multimodal Transformer for Vision-and-Language Navigation.

[DOI]

,

Pierre-Louis Guhur

,

Cordelia Schmid

,

Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Question-controlled Text-aware Image Captioning.

[DOI]

,

,

Proceedings of the MM '21: ACM Multimedia Conference, Virtual Event, China, October 20, 2021

Product-oriented Machine Translation with Cross-modal Cross-lingual Pre-training.

[DOI]

,

,

,

,

,

Proceedings of the MM '21: ACM Multimedia Conference, Virtual Event, China, October 20, 2021

MMPT'21: International Joint Workshop on Multi-Modal Pre-Training for Multimedia Understanding.

[DOI]

,

,

,

,

Alexander G. Hauptmann

,

Proceedings of the ICMR '21: International Conference on Multimedia Retrieval, 2021

Airbert: In-domain Pretraining for Vision-and-Language Navigation.

[DOI]

Pierre-Louis Guhur

,

Makarand Tapaswi

,

,

,

Cordelia Schmid

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Elaborative Rehearsal for Zero-shot Action Recognition.

[DOI]

,

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Sketch, Ground, and Refine: Top-Down Dense Video Captioning.

[DOI]

,

,

,

,

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

Towards Diverse Paragraph Captioning for Untrimmed Videos.

[DOI]

,

,

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020

The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020).

[DOI]

CoRR, 2020

2nd Place Solution to ECCV 2020 VIPriors Object Detection Challenge.

[DOI]

,

,

CoRR, 2020

Team RUC_AIM3 Technical Report at Activitynet 2020 Task 2: Exploring Sequential Events Detection for Dense Video Captioning.

[DOI]

,

,

,

CoRR, 2020

YouMakeup VQA Challenge: Towards Fine-grained Action Understanding in Domain-Specific Videos.

[DOI]

,

,

,

,

CoRR, 2020

RUC_AIM3 at TRECVID 2020: Ad-hoc Video Search & Video to Text Description.

[DOI]

,

,

,

Proceedings of the 2020 TREC Video Retrieval Evaluation, 2020

ICECAP: Information Concentrated Entity-aware Image Captioning.

[DOI]

,

,

Proceedings of the MM '20: The 28th ACM International Conference on Multimedia, 2020

Skeleton-Based Interactive Graph Network For Human Object Interaction Detection.

[DOI]

,

,

Proceedings of the IEEE International Conference on Multimedia and Expo, 2020

Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning.

[DOI]

,

,

,

Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

Say As You Wish: Fine-Grained Control of Image Caption Generation With Abstract Scene Graphs.

[DOI]

,

,

,

Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

2019

Generating Video Descriptions With Latent Topic Guidance.

[DOI]

,

,

,

Alexander G. Hauptmann

IEEE Trans. Multim., 2019

Integrating Temporal and Spatial Attentions for VATEX Video Captioning Challenge 2019.

[DOI]

,

,

,

,

CoRR, 2019

Activitynet 2019 Task 3: Exploring Contexts for Dense Captioning Events in Videos.

[DOI]

,

,

,

,

,

,

,

Alexander G. Hauptmann

CoRR, 2019

RUC_AIM3 at TRECVID 2019: Video to Text.

[DOI]

,

,

,

Proceedings of the 2019 TREC Video Retrieval Evaluation, 2019

Visual Relation Detection with Multi-Level Attention.

[DOI]

,

,

Proceedings of the 27th ACM International Conference on Multimedia, 2019

Relation Understanding in Videos.

[DOI]

,

,

,

Proceedings of the 27th ACM International Conference on Multimedia, 2019

Adversarial Domain Adaption for Multi-Cultural Dimensional Emotion Recognition in Dyadic Interactions.

[DOI]

,

,

,

,

Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, 2019

Neural Storyboard Artist: Visualizing Stories with Coherent Image Sequences.

[DOI]

,

,

,

,

,

,

,

,

Proceedings of the 27th ACM International Conference on Multimedia, 2019

Unpaired Cross-lingual Image Caption Generation with Self-Supervised Rewards.

[DOI]

,

,

,

Proceedings of the 27th ACM International Conference on Multimedia, 2019

Speech Emotion Recognition in Dyadic Dialogues with Attentive Interaction Modeling.

[DOI]

,

,

,

Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

From Words to Sentences: A Progressive Learning Approach for Zero-resource Machine Translation with Visual Pivots.

[DOI]

,

,

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019

Cross-culture Multimodal Emotion Recognition with Adversarial Learning.

[DOI]

,

,

,

,

,

Proceedings of the IEEE International Conference on Acoustics, 2019

YouMakeup: A Large-Scale Domain-Specific Multimodal Dataset for Fine-Grained Semantic Comprehension.

[DOI]

,

,

,

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019

Semi-supervised Multimodal Emotion Recognition with Improved Wasserstein GANs.

[DOI]

,

,

Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2019

Unsupervised Bilingual Lexicon Induction from Mono-Lingual Multimodal Data.

[DOI]

,

,

Alexander G. Hauptmann

Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019

2018

RUC+CMU: System Report for Dense Captioning Events in Videos.

[DOI]

,

,

,

,

,

Alexander G. Hauptmann

CoRR, 2018

Informedia @ TRECVID 2018: Ad-hoc Video Search, Video to Text Description, Activities in Extended video.

[DOI]

,

,

,

Alexander G. Hauptmann

,

,

,

,

,

,

,

,

,

,

,

,

,

Ruslan Salakhutdinov

,

,

Proceedings of the 2018 TREC Video Retrieval Evaluation, 2018

Multimodal Dimensional and Continuous Emotion Recognition in Dyadic Video Interactions.

[DOI]

,

,

Proceedings of the Advances in Multimedia Information Processing - PCM 2018, 2018

iMakeup: Makeup Instructional Video Dataset for Fine-Grained Dense Video Captioning.

[DOI]

,

,

,

,

Proceedings of the Advances in Multimedia Information Processing - PCM 2018, 2018

Multi-modal Multi-cultural Dimensional Continues Emotion Recognition in Dyadic Interactions.

[DOI]

,

,

,

Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop, 2018

Class-aware Self-Attention for Audio Event Recognition.

[DOI]

,

,

,

Alexander G. Hauptmann

Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, 2018

RUC at MediaEval 2018: Visual and Textual Features Exploration for Predicting Media Memorability.

[DOI]

,

,

,

Proceedings of the Working Notes Proceedings of the MediaEval 2018 Workshop, 2018

2017

Informedia @ TRECVID 2017.

[DOI]

,

,

,

,

,

,

Alexander G. Hauptmann

Proceedings of the 2017 TREC Video Retrieval Evaluation, 2017

Knowing Yourself: Improving Video Caption via In-depth Recap.

[DOI]

,

,

,

Alexander G. Hauptmann

Proceedings of the 2017 ACM on Multimedia Conference, 2017

Multimodal Multi-task Learning for Dimensional and Continuous Emotion Recognition.

[DOI]

,

,

,

Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, Mountain View, CA, USA, October 23, 2017

Video Captioning with Guidance of Multimodal Latent Topics.

[DOI]

,

,

,

Alexander G. Hauptmann

Proceedings of the 2017 ACM on Multimedia Conference, 2017

Generating Video Descriptions with Topic Guidance.

[DOI]

,

,

Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, 2017

RUC at MediaEval 2017: Predicting Media Interestingness Task.

[DOI]

,

,

,

,

Proceedings of the Working Notes Proceedings of the MediaEval 2017 Workshop co-located with the Conference and Labs of the Evaluation Forum (CLEF 2017), 2017

Emotion recognition with multimodal features and temporal models.

[DOI]

,

,

,

,

,

,

Proceedings of the 19th ACM International Conference on Multimodal Interaction, 2017

Facial Action Units Detection with Multi-Features and -AUs Fusion.

[DOI]

,

,

Proceedings of the 12th IEEE International Conference on Automatic Face & Gesture Recognition, 2017

2016

Describing Videos using Multi-modal Fusion.

[DOI]

,

,

,

,

Alexander G. Hauptmann

Proceedings of the 2016 ACM Conference on Multimedia Conference, 2016

Multi-modal Conditional Attention Fusion for Dimensional Emotion Prediction.

[DOI]

,

Proceedings of the 2016 ACM Conference on Multimedia Conference, 2016

RUC at MediaEval 2016 Emotional Impact of Movies Task: Fusion of Multimodal Features.

[DOI]

,

Proceedings of the Working Notes Proceedings of the MediaEval 2016 Workshop, 2016

RUC at MediaEval 2016: Predicting Media Interestingness Task.

[DOI]

,

,

Proceedings of the Working Notes Proceedings of the MediaEval 2016 Workshop, 2016

Video emotion recognition in the wild based on fusion of multimodal features.

[DOI]

,

,

,

,

Proceedings of the 18th ACM International Conference on Multimodal Interaction, 2016

Emotion Recognition in Videos via Fusing Multimodal Features.

[DOI]

,

,

,

,

,

,

Proceedings of the Pattern Recognition - 7th Chinese Conference, 2016

2015

基于声学特征的语言情感识别 (Speech Emotion Recognition Based on Acoustic Features).

[DOI]

,

,

,

,

Computer Science (计算机科学), 2015

Multi-modal Dimensional Emotion Recognition using Recurrent Neural Networks.

[DOI]

,

Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge, 2015

Speech emotion recognition with acoustic and lexical features.

[DOI]

,

,

,

Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

2014

Speech emotion classification using acoustic features.

[DOI]

,

,

,

,

Proceedings of the 9th International Symposium on Chinese Spoken Language Processing, 2014
