Yilun Zhao

Orcid: 0000-0002-7470-6124

Affiliations:
  • Yale University, New Haven, CT, USA
  • Zhejiang University, Hangzhou, China (former)


According to our database, Yilun Zhao authored at least 83 papers between 2020 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers.
CoRR, July, 2025

SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks.
CoRR, July, 2025

MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation.
CoRR, June, 2025

Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure.
CoRR, June, 2025

SUCEA: Reasoning-Intensive Retrieval for Adversarial Fact-checking through Claim Decomposition and Editing.
CoRR, June, 2025

Table-R1: Inference-Time Scaling for Table Reasoning.
CoRR, May, 2025

Judging with Many Minds: Do More Perspectives Mean Less Prejudice?
CoRR, May, 2025

Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective.
CoRR, May, 2025

Z1: Efficient Test-time Scaling with Code.
CoRR, April, 2025

MCTS-RAG: Enhancing Retrieval-Augmented Generation with Monte Carlo Tree Search.
CoRR, March, 2025

Experience Retrieval-Augmentation with Electronic Health Records Enables Accurate Discharge QA.
CoRR, March, 2025

Survey on Evaluation of LLM-based Agents.
CoRR, March, 2025

MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning.
CoRR, March, 2025

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding.
CoRR, January, 2025

ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning.
CoRR, January, 2025

Are Multimodal LLMs Robust Against Adversarial Perturbations? RoMMath: A Systematic Evaluation on Multimodal Math Reasoning.
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025

IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval.
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025

ReIFE: Re-evaluating Instruction-Following Evaluation.
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025

ChemAgent: Self-updating Memories in Large Language Models Improves Chemical Reasoning.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task.
Proceedings of the Findings of the Association for Computational Linguistics, 2025

Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

Physics: Benchmarking Foundation Models on University-Level Physics Problem Solving.
Proceedings of the Findings of the Association for Computational Linguistics, 2025

Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers.
Proceedings of the Findings of the Association for Computational Linguistics, 2025

AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

2024
L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models.
Trans. Assoc. Comput. Linguistics, 2024

HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation.
CoRR, 2024

ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain.
CoRR, 2024

FinDVer: Explainable Claim Verification over Long and Hybrid-Content Financial Documents.
CoRR, 2024

Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications.
CoRR, 2024

Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation.
CoRR, 2024

Step-Back Profiling: Distilling User History for Personalized Scientific Writing.
CoRR, 2024

MIMIR: A Streamlined Platform for Personalized Agent Tuning in Domain Expertise.
CoRR, 2024

Evaluating LLMs at Detecting Errors in LLM Responses.
CoRR, 2024

Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science.
CoRR, 2024

Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models.
CoRR, 2024

Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in LLMs.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Struc-Bench: Are Large Language Models Good at Generating Complex Structured Tabular Data?
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Short Papers, 2024

On Evaluating the Integration of Reasoning and Action in LLM Agents with Database Question Answering.
Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, 2024

Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization.
Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, 2024

Investigating Data Contamination in Modern Benchmarks for Large Language Models.
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024

Revisiting Automated Evaluation for Long-form Table Question Answering.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

OMG-QA: Building Open-Domain Multi-Modal Generative Question Answering Systems.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: EMNLP 2024, 2024

M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

FinDVer: Explainable Claim Verification over Long and Hybrid-content Financial Documents.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning.
Proceedings of the Findings of the Association for Computational Linguistics, 2024

Unveiling the Spectrum of Data Contamination in Language Model: A Survey from Detection to Remediation.
Proceedings of the Findings of the Association for Computational Linguistics, 2024

KnowledgeFMath: A Knowledge-Intensive Math Reasoning Dataset in Finance Domains.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

DocMath-Eval: Evaluating Math Reasoning Capabilities of LLMs in Understanding Financial Documents.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

TaPERA: Enhancing Faithfulness and Interpretability in Long-Form Table QA by Content Planning and Execution-based Reasoning.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

2023
MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning.
CoRR, 2023

ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks.
CoRR, 2023

DocMath-Eval: Evaluating Numerical Reasoning Capabilities of LLMs in Understanding Long Documents with Tabular Data.
CoRR, 2023

KnowledgeMath: Knowledge-Intensive Math Word Problem Solving in Finance Domains.
CoRR, 2023

L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models.
CoRR, 2023

Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?
CoRR, 2023

ODSum: New Benchmarks for Open Domain Multi-Document Summarization.
CoRR, 2023

Large Language Models are Effective Table-to-Text Generators, Evaluators, and Feedback Providers.
CoRR, 2023

QTSumm: A New Benchmark for Query-Focused Table Summarization.
CoRR, 2023

Enhancing Few-shot Text-to-SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies.
CoRR, 2023

Enhancing Text-to-SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, 2023

Towards Interpretable and Efficient Automatic Reference-Based Summarization Evaluation.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

Investigating Table-to-Text Generation Capabilities of Large Language Models in Real-World Information Seeking Scenarios.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: EMNLP 2023, 2023

QTSumm: Query-Focused Summarization over Tabular Data.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

LoFT: Enhancing Faithfulness and Diversity for Table-to-Text Generation via Logic Form Control.
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023

RobuT: A Systematic Study of Table QA Robustness Against Human-Annotated Adversarial Perturbations.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

OpenRT: An Open-source Framework for Reasoning Over Tabular Data.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2023

Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

2022
Apparel-Invariant Feature Learning for Person Re-Identification.
IEEE Trans. Multim., 2022

FOLIO: Natural Language Reasoning with First-Order Logic.
CoRR, 2022

FinMath: Injecting a Tree-structured Solver for Question Answering over Financial Reports.
Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022

ReasTAP: Injecting Table Reasoning Skills During Pre-training via Synthetic Reasoning Examples.
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

R2D2: Robust Data-to-Text with Replacement Detection.
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

MultiHiertt: Numerical Reasoning over Multi Hierarchical Tabular and Textual Data.
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022

2021
MusiCoder: A Universal Music-Acoustic Encoder Based on Transformer.
Proceedings of the MultiMedia Modeling - 27th International Conference, 2021

2020
LAMP: Label Augmented Multimodal Pretraining.
CoRR, 2020

Apparel-invariant Feature Learning for Apparel-changed Person Re-identification.
CoRR, 2020

MusiCoder: A Universal Music-Acoustic Encoder Based on Transformers.
CoRR, 2020

