Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025

IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval.

Tingyu Song

Guo Gan

Mingsheng Shang

Yilun Zhao

Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025

ReIFE: Re-evaluating Instruction-Following Evaluation.

Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025

ChemAgent: Self-updating Memories in Large Language Models Improves Chemical Reasoning.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task.

Proceedings of the Findings of the Association for Computational Linguistics, 2025

Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2025

Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

Physics: Benchmarking Foundation Models on University-Level Physics Problem Solving.

Proceedings of the Findings of the Association for Computational Linguistics, 2025

Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers.

Proceedings of the Findings of the Association for Computational Linguistics, 2025

AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

2024

L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models.

Trans. Assoc. Comput. Linguistics, 2024

HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation.

CoRR, 2024

ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain.

CoRR, 2024

FinDVer: Explainable Claim Verification over Long and Hybrid-Content Financial Documents.

CoRR, 2024

Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications.

VijayaSai Somasundaram

CoRR, 2024

Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation.

CoRR, 2024

Step-Back Profiling: Distilling User History for Personalized Scientific Writing.

CoRR, 2024

MIMIR: A Streamlined Platform for Personalized Agent Tuning in Domain Expertise.

CoRR, 2024

Evaluating LLMs at Detecting Errors in LLM Responses.

Ryo Kamoi

Sarkar Snigdha Sarathi Das

Sujeeth Reddy Vummanthala

CoRR, 2024

Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science.

CoRR, 2024

Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models.

CoRR, 2024

Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in LLMs.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Struc-Bench: Are Large Language Models Good at Generating Complex Structured Tabular Data?

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Short Papers, 2024

On Evaluating the Integration of Reasoning and Action in LLM Agents with Database Question Answering.

Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, 2024

Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization.

Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, 2024

Investigating Data Contamination in Modern Benchmarks for Large Language Models.

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024

Revisiting Automated Evaluation for Long-form Table Question Answering.

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

OMG-QA: Building Open-Domain Multi-Modal Generative Question Answering Systems.

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: EMNLP 2024, 2024

M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models.

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains.

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

FOLIO: Natural Language Reasoning with First-Order Logic.

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

FinDVer: Explainable Claim Verification over Long and Hybrid-content Financial Documents.

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning.

Proceedings of the Findings of the Association for Computational Linguistics, 2024

Unveiling the Spectrum of Data Contamination in Language Model: A Survey from Detection to Remediation.

Proceedings of the Findings of the Association for Computational Linguistics, 2024

KnowledgeFMath: A Knowledge-Intensive Math Reasoning Dataset in Finance Domains.

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

DocMath-Eval: Evaluating Math Reasoning Capabilities of LLMs in Understanding Financial Documents.

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

TaPERA: Enhancing Faithfulness and Interpretability in Long-Form Table QA by Content Planning and Execution-based Reasoning.

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

2023

MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning.

CoRR, 2023

ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks.

CoRR, 2023

DocMath-Eval: Evaluating Numerical Reasoning Capabilities of LLMs in Understanding Long Documents with Tabular Data.

CoRR, 2023

KnowledgeMath: Knowledge-Intensive Math Word Problem Solving in Finance Domains.

CoRR, 2023

L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models.

CoRR, 2023

Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?

CoRR, 2023

ODSum: New Benchmarks for Open Domain Multi-Document Summarization.

CoRR, 2023

Large Language Models are Effective Table-to-Text Generators, Evaluators, and Feedback Providers.

CoRR, 2023

QTSumm: A New Benchmark for Query-Focused Table Summarization.

CoRR, 2023

Enhancing Few-shot Text-to-SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies.

CoRR, 2023

Enhancing Text-to-SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies.

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, 2023

Towards Interpretable and Efficient Automatic Reference-Based Summarization Evaluation.

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

Investigating Table-to-Text Generation Capabilities of Large Language Models in Real-World Information Seeking Scenarios.

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: EMNLP 2023, 2023

QTSumm: Query-Focused Summarization over Tabular Data.

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

LoFT: Enhancing Faithfulness and Diversity for Table-to-Text Generation via Logic Form Control.

Yilun Zhao

Zhenting Qi

Linyong Nan

Lorenzo Jaime Yu Flores

Dragomir Radev

Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023

RobuT: A Systematic Study of Table QA Robustness Against Human-Annotated Adversarial Perturbations.

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

OpenRT: An Open-source Framework for Reasoning Over Tabular Data.

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2023

Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation.

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

2022

Apparel-Invariant Feature Learning for Person Re-Identification.

IEEE Trans. Multim., 2022

FOLIO: Natural Language Reasoning with First-Order Logic.

CoRR, 2022

FinMath: Injecting a Tree-structured Solver for Question Answering over Financial Reports.

Chenying Li

Wenbo Ye

Yilun Zhao

Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022

ReasTAP: Injecting Table Reasoning Skills During Pre-training via Synthetic Reasoning Examples.

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

R2D2: Robust Data-to-Text with Replacement Detection.

Linyong Nan

Lorenzo Jaime Yu Flores

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

MultiHiertt: Numerical Reasoning over Multi Hierarchical Tabular and Textual Data.

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022

2021

MusiCoder: A Universal Music-Acoustic Encoder Based on Transformer.

Yilun Zhao

Jia Guo

Proceedings of the MultiMedia Modeling - 27th International Conference, 2021

2020

LAMP: Label Augmented Multimodal Pretraining.

CoRR, 2020

Apparel-invariant Feature Learning for Apparel-changed Person Re-identification.

CoRR, 2020

MusiCoder: A Universal Music-Acoustic Encoder Based on Transformers.

CoRR, 2020

Yilun Zhao
