Catherine Arnett

Orcid: 0000-0003-0448-5415

According to our database, Catherine Arnett authored at least 21 papers between 2023 and 2026.

Bibliography

2026
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs.
CoRR, May, 2026

Weight Tying Biases Token Embeddings Towards the Output Space.
CoRR, March, 2026

How Open Must Language Models be to Enable Reliable Scientific Inference?
CoRR, March, 2026

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data.
CoRR, January, 2026

2025
Disaggregation Reveals Hidden Training Dynamics: The Case of Agreement Attraction.
CoRR, October, 2025

Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures.
CoRR, October, 2025

Explaining and Mitigating Crosslingual Tokenizer Inequities.
CoRR, October, 2025

Evaluating Morphological Alignment of Tokenizers in 70 Languages.
CoRR, July, 2025

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training.
CoRR, June, 2025

BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization.
CoRR, May, 2025

Why do language models perform worse for morphologically complex languages?
Proceedings of the 31st International Conference on Computational Linguistics, 2025

On the Acquisition of Shared Grammatical Representations in Bilingual Language Models.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

2024
Toxicity of the Commons: Curating Open-Source Pre-Training Data.
CoRR, 2024

Goldfish: Monolingual Language Models for 350 Languages.
CoRR, 2024

Revenge of the Fallen? Recurrent Models Match Transformers at Predicting Human Language Comprehension Metrics.
CoRR, 2024

Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement.
CoRR, 2024

A Bit of a Problem: Measurement Disparities in Dataset Sizes Across Languages.
CoRR, 2024

BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

2023
Crosslingual Structural Priming and the Pre-Training Dynamics of Bilingual Language Models.
CoRR, 2023

Structural Priming Demonstrates Abstract Grammatical Representations in Multilingual Language Models.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

