Zhoufutu Wen

ORCID: 0009-0000-0894-5824

According to our database, Zhoufutu Wen authored at least 26 papers between 2023 and 2026.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2026
When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors.
CoRR, April, 2026

Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation.
CoRR, April, 2026

CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs.
CoRR, March, 2026

WorldTravel: A Realistic Multimodal Travel-Planning Benchmark with Tightly Coupled Constraints.
CoRR, February, 2026

2025
DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains.
CoRR, November, 2025

MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity.
CoRR, November, 2025

COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes.
CoRR, October, 2025

Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures.
CoRR, October, 2025

FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning.
CoRR, September, 2025

TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling.
CoRR, August, 2025

FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction.
CoRR, August, 2025

First Return, Entropy-Eliciting Explore.
CoRR, July, 2025

SciDA: Scientific Dynamic Assessor of LLMs.
CoRR, June, 2025

KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation.
CoRR, May, 2025

IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs.
CoRR, April, 2025

SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models.
CoRR, February, 2025

CryptoX: Compositional Reasoning Evaluation of Large Language Models.
CoRR, February, 2025

Distillation Quantification for Large Language Models.
CoRR, January, 2025

KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2025, 2025

SUMMA: A Multimodal Large Language Model for Advertisement Summarization.
Proceedings of the 34th ACM International Conference on Information and Knowledge Management, 2025

Quantification of Large Language Model Distillation.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

2024
KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks.
CoRR, 2024

2023
Enhancing Dynamic Image Advertising with Vision-Language Pre-training.
Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2023

