Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025

Specializing Large Language Models to Simulate Survey Response Distributions for Global Populations.

Yong Cao

Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025

Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Personalization up to a Point: Why Personalized Content Moderation Needs Boundaries, and How We Can Enforce Them.

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent.

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

Principled Personas: Defining and Measuring the Intended Effects of Persona Prompting on Task Performance.

Pedro Henrique Luz de Araujo

Paul Röttger

Dirk Hovy

Benjamin Roth

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

HateDay: Insights from a Global Hate Speech Dataset Representative of a Day on Twitter.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

Beyond Demographics: Fine-tuning Large Language Models to Predict Individuals' Subjective Text Perceptions.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

Around the World in 24 Hours: Probing LLM Knowledge of Time and Place.

Carolin Holtermann

Paul Röttger

Anne Lauscher

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

SafetyPrompts: A Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety.

Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, 2025

2024

The benefits, risks and bounds of personalizing the alignment of large language models to individuals.

Nature Machine Intelligence, 2024

Evidence of a log scaling law for political persuasion with large language models.

CoRR, 2024

From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets.

CoRR, 2024

Near to Mid-term Risks and Opportunities of Open Source Generative AI.

CoRR, 2024

The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models.

CoRR, 2024

Introducing v0.5 of the AI Safety Benchmark from MLCommons.

Borhane Blili-Hamelin

Kurt Bollacker

Rishi Bommasani

Marisa Ferrara Boston

Zacharie Delpierre Coudert

Joseph Marvin Imperial

Dinesh Jinenhally Naganna

Forough Poursabzi-Sangdeh

Alice Schoenauer Sebag

Elizabeth Anne Watkins

CoRR, 2024

Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think.

CoRR, 2024

The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models.

Rafael Mosquera Gómez

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models.

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024

Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset.

Janis Goldzycher

Paul Röttger

Gerold Schneider

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024

Position: Near to Mid-term Risks and Opportunities of Open-Source Generative AI.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Beyond Flesch-Kincaid: Prompt-based Metrics Improve Difficulty Classification of Educational Texts.

Donya Rooein

Paul Röttger

Anastassia Shaitarova

Dirk Hovy

Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications, 2024

Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models.

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

Compromesso! Italian Many-Shot Jailbreaks undermine the safety of Large Language Models.

Fabio Pernisi

Dirk Hovy

Paul Röttger

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), 2024

Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ.

Proceedings of the Findings of the Association for Computational Linguistics, 2024

"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models.

Proceedings of the Findings of the Association for Computational Linguistics, 2024

2023

SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models.

CoRR, 2023

The Empty Signifier Problem: Towards Clearer Paradigms for Operationalising "Alignment" in Large Language Models.

CoRR, 2023

The Ecological Fallacy in Annotation: Modelling Human Label Variation goes beyond Sociodemographics.

CoRR, 2023

Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback.

CoRR, 2023

SemEval-2023 Task 10: Explainable Detection of Online Sexism.

Proceedings of the 17th International Workshop on Semantic Evaluation, 2023

The Past, Present and Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values.

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

The Ecological Fallacy in Annotation: Modeling Human Label Variation goes beyond Sociodemographics.

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2023

Improving the Detection of Multilingual Online Attacks with Rich Social Media Data from Singapore.

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

2022

Multilingual HateCheck: Functional Tests for Multilingual Hate Speech Detection Models.

CoRR, 2022

Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks.

Paul Röttger

Bertie Vidgen

Dirk Hovy

Janet B. Pierrehumbert

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022

Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-Based Hate.

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022

Data-Efficient Strategies for Expanding Hate Speech Detection into Under-Resourced Languages.

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

2021

Temporal Adaptation of BERT and Performance on Downstream Document Classification: Insights from Social Media.

Paul Röttger

Janet B. Pierrehumbert

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, 2021

HateCheck: Functional Tests for Hate Speech Detection Models.

Janet B. Pierrehumbert

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021

Paul Röttger
