Xiaoda Yang

Orcid: 0009-0002-7297-4536

According to our database, Xiaoda Yang authored at least 36 papers between 2024 and 2026.

Bibliography

2026
From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning.
CoRR, April, 2026

A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning.
CoRR, April, 2026

ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks.
CoRR, April, 2026

SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation.
CoRR, March, 2026

SpatialLogic-Bench: A Diagnostic Benchmark for Task-Oriented Spatiotemporal Reasoning.
Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling.
Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

2025
One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer.
CoRR, November, 2025

VividAnimator: An End-to-End Audio and Pose-driven Half-Body Human Animation Framework.
CoRR, October, 2025

CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation.
CoRR, June, 2025

OmniCam: Unified Multimodal Video Generation via Camera Control.
CoRR, April, 2025

Astrea: A MOE-based Visual Understanding Model with Progressive Alignment.
CoRR, March, 2025

Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis.
CoRR, February, 2025

OmniChat: Enhancing Spoken Dialogue Systems with Scalable Synthetic Data for Diverse Scenarios.
CoRR, January, 2025

EAGER-LLM: Enhancing Large Language Models as Recommenders through Exogenous Behavior-Semantic Integration.
Proceedings of the ACM on Web Conference 2025, 2025

EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model.
Proceedings of the 33rd ACM International Conference on Multimedia, 2025

Choose Your Expert: Uncertainty-Guided Expert Selection for Continual Deepfake Detection.
Proceedings of the 33rd ACM International Conference on Multimedia, 2025

Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation.
Proceedings of the 33rd ACM International Conference on Multimedia, 2025

Multimodal Conditional Retrieval with High Controllability.
Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, V.2, 2025

MelRe: Vision-Based Mel-Spectrogram Restoration.
Proceedings of the 26th Annual Conference of the International Speech Communication Association, 2025

Vela: Scalable Embeddings with Voice Large Language Models for Multimodal Retrieval.
Proceedings of the 26th Annual Conference of the International Speech Communication Association, 2025

GTA: Towards Generative Text-To-Audio Retrieval via Multi-Scale Tokenizer.
Proceedings of the 26th Annual Conference of the International Speech Communication Association, 2025

Diff-Prompt: Diffusion-Driven Prompt Generator with Mask Supervision.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

VoxDialogue: Can Spoken Dialogue Systems Understand Information Beyond Words?
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

PACHAT: Persona-Aware Speech Assistant for Multi-party Dialogue.
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

BrainLoc: Brain Signal-Based Object Detection with Multi-modal Alignment.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2025, 2025

VoxpopuliTTS: a large-scale multilingual TTS corpus for zero-shot speech generation.
Proceedings of the 31st International Conference on Computational Linguistics, 2025

Rhythm Controllable and Efficient Zero-Shot Voice Conversion via Shortcut Flow Matching.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

CART: A Generative Cross-Modal Retrieval Framework With Coarse-To-Fine Semantic Modeling.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

Storynizor: Consistent Story Generation via Inter-Frame Synchronized and Shuffled ID Injection.
Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, 2025

2024
WavChat: A Survey of Spoken Dialogue Models.
CoRR, 2024

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling.
CoRR, 2024

ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling.
CoRR, 2024

SyncTalklip: Highly Synchronized Lip-Readable Speaker Generation with Multi-Task Learning.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Boosting Speech Recognition Robustness to Modality-Distortion with Contrast-Augmented Prompts.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

AudioVSR: Enhancing Video Speech Recognition with Audio Data.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

