Ethan Perez

Orcid: 0009-0004-7851-1190

According to our database, Ethan Perez authored at least 71 papers between 2016 and 2026.

Collaborative distances:

Dijkstra number of four.
Erdős number of three.

Bibliography

2026

The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?

[DOI]

Alexander Hägele

,

Aryo Pradipta Gema

,

,

,

Jascha Sohl-Dickstein

CoRR, January, 2026

Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks.

[DOI]

CoRR, January, 2026

Integrating Computer Science in Middle School Lessons through Block-Based Coding.

[DOI]

Shreelakshya Reddi

,

,

Veronica Cateté

Proceedings of the 57th ACM Technical Symposium on Computer Science Education V.2, 2026

2025

Natural Emergent Misalignment from Reward Hacking in Production RL.

[DOI]

CoRR, November, 2025

Agentic Misalignment: How LLMs Could Be Insider Threats.

[DOI]

,

Benjamin Wright

,

,

Stuart J. Ritchie

,

Sören Mindermann

,

,

,

CoRR, October, 2025

Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks.

[DOI]

,

Mohammed Mahfoud

,

,

,

,

CoRR, August, 2025

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety.

[DOI]

CoRR, July, 2025

Unsupervised Elicitation of Language Models.

[DOI]

,

,

,

,

,

Jacob Goldman-Wetzler

,

,

,

,

,

,

,

CoRR, June, 2025

Reasoning Models Don't Always Say What They Think.

[DOI]

,

,

Ansh Radhakrishnan

,

Jonathan Uesato

,

,

,

,

,

,

,

Vladimir Mikulik

,

Samuel R. Bowman

,

,

,

CoRR, May, 2025

Forecasting Rare Language Model Behaviors.

[DOI]

,

,

,

Mohammed Mahfoud

,

,

Roger B. Grosse

,

,

William Fithian

,

,

CoRR, February, 2025

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming.

[DOI]

CoRR, January, 2025

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs.

[DOI]

Abhay Sheshadri

,

,

,

,

,

,

,

Asa Cooper Stickland

,

,

Dylan Hadfield-Menell

,

Trans. Mach. Learn. Res., 2025

Inverse Scaling in Test-Time Compute.

[DOI]

Aryo Pradipta Gema

,

Alexander Hägele

,

,

,

Jacob Goldman-Wetzler

,

Kit Fraser-Taliente

,

,

,

,

,

Pasquale Minervini

,

,

,

Trans. Mach. Learn. Res., 2025

Quantifying Elicitation of Latent Capabilities in Language Models.

[DOI]

Elizabeth Donoway

,

,

,

,

,

Michael R. DeWeese

,

,

,

,

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

Language Models Learn to Mislead Humans via RLHF.

[DOI]

,

,

,

,

Jacob Steinhardt

,

,

Samuel R. Bowman

,

,

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats.

[DOI]

,

,

,

,

Ansh Radhakrishnan

,

,

,

,

,

,

,

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Failures to Find Transferable Image Jailbreaks Between Vision-Language Models.

[DOI]

Rylan Schaeffer

,

,

,

,

Cristóbal Eyzaguirre

,

,

,

,

,

,

,

Rajashree Agrawal

,

,

,

,

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Looking Inward: Language Models Can Learn About Themselves by Introspection.

[DOI]

Felix Jedidja Binder

,

,

,

,

,

,

,

,

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024

Learning from Natural Language Feedback.

[DOI]

,

Jérémy Scheurer

,

Jon Ander Campos

,

,

,

Samuel R. Bowman

,

,

Trans. Mach. Learn. Res., 2024

Alignment faking in large language models.

[DOI]

CoRR, 2024

Best-of-N Jailbreaking.

[DOI]

,

,

,

Rylan Schaeffer

,

,

,

,

,

,

CoRR, 2024

Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach.

[DOI]

,

,

,

Rylan Schaeffer

,

Rajashree Agrawal

,

,

,

,

,

CoRR, 2024

A dataset of questions on decision-theoretic reasoning in Newcomb-like problems.

[DOI]

Caspar Oesterheld

,

,

,

Linh Chi Nguyen

,

CoRR, 2024

Rapid Response: Mitigating LLM Jailbreaks with a Few Examples.

[DOI]

,

,

,

,

CoRR, 2024

Sabotage Evaluations for Frontier Models.

[DOI]

,

,

Eric Christiansen

,

,

,

,

,

,

,

,

,

Holden Karnofsky

,

,

Roger B. Grosse

,

Samuel R. Bowman

,

CoRR, 2024

Looking Inward: Language Models Can Learn About Themselves by Introspection.

[DOI]

Felix J. Binder

,

,

,

,

,

,

,

,

CoRR, 2024

Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs.

[DOI]

Abhay Sheshadri

,

,

,

,

,

,

,

Asa Cooper Stickland

,

,

Dylan Hadfield-Menell

,

CoRR, 2024

When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?

[DOI]

Rylan Schaeffer

,

,

,

,

Cristóbal Eyzaguirre

,

,

,

,

,

,

Rajashree Agrawal

,

,

,

,

CoRR, 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models.

[DOI]

,

Monte MacDiarmid

,

,

,

,

,

Nicholas Schiefer

,

,

,

,

,

Samuel R. Bowman

,

,

CoRR, 2024

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought.

[DOI]

,

,

,

Samuel R. Bowman

,

,

,

CoRR, 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.

[DOI]

CoRR, 2024

Many-shot Jailbreaking.

[DOI]

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

Debating with More Persuasive LLMs Leads to More Truthful Answers.

[DOI]

,

,

,

,

,

Ansh Radhakrishnan

,

Edward Grefenstette

,

Samuel R. Bowman

,

Tim Rocktäschel

,

Proceedings of the Forty-first International Conference on Machine Learning, 2024

Towards Understanding Sycophancy in Language Models.

[DOI]

,

,

,

,

,

Samuel R. Bowman

,

,

Zac Hatfield-Dodds

,

Scott R. Johnston

,

,

Timothy Maxwell

,

,

,

,

Nicholas Schiefer

,

,

,

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning.

[DOI]

,

Victoriano Montesinos

,

,

,

Proceedings of the Twelfth International Conference on Learning Representations, 2024

2023

Inverse Scaling: When Bigger Isn't Better.

[DOI]

Trans. Mach. Learn. Res., 2023

Towards Evaluating AI Systems for Moral Status Using Self-Reports.

[DOI]

,

CoRR, 2023

Specific versus General Principles for Constitutional AI.

[DOI]

CoRR, 2023

Towards Understanding Sycophancy in Language Models.

[DOI]

,

,

,

,

,

Samuel R. Bowman

,

,

,

Zac Hatfield-Dodds

,

Scott R. Johnston

,

,

Timothy Maxwell

,

,

,

,

Nicholas Schiefer

,

,

,

CoRR, 2023

Studying Large Language Model Generalization with Influence Functions.

[DOI]

Roger B. Grosse

,

,

,

,

,

Amirhossein Tajdini

,

,

,

,

,

,

Kamile Lukosiute

,

,

Nicholas Joseph

,

,

,

Samuel R. Bowman

CoRR, 2023

Measuring Faithfulness in Chain-of-Thought Reasoning.

[DOI]

CoRR, 2023

Question Decomposition Improves the Faithfulness of Model-Generated Reasoning.

[DOI]

CoRR, 2023

Training Language Models with Language Feedback at Scale.

[DOI]

Jérémy Scheurer

,

Jon Ander Campos

,

,

,

,

,

CoRR, 2023

Improving Code Generation by Training with Natural Language Feedback.

[DOI]

,

Jérémy Scheurer

,

,

Jon Ander Campos

,

,

Samuel R. Bowman

,

,

CoRR, 2023

The Capacity for Moral Self-Correction in Large Language Models.

[DOI]

CoRR, 2023

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting.

[DOI]

,

,

,

Samuel R. Bowman

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Pretraining Language Models with Human Preferences.

[DOI]

,

,

,

Rasika Vinayak Bhalerao

,

Christopher L. Buckley

,

,

Samuel R. Bowman

,

Proceedings of the International Conference on Machine Learning, 2023

Discovering Language Model Behaviors with Model-Written Evaluations.

[DOI]

,

,

Kamile Lukosiute

,

,

,

,

,

Catherine Olsson

,

,

Saurav Kadavath

,

,

,

,

,

,

Cameron McKinnon

,

Christopher Olah

,

,

,

,

,

,

Eli Tran-Johnson

,

,

Jackson Kernion

,

,

,

,

,

,

,

Landon Goldberg

,

,

,

Michael Sellitto

,

,

Neerav Kingsland

,

,

Nicholas Joseph

,

,

,

,

,

,

,

,

,

,

Timothy Telleen-Lawton

,

,

,

,

,

Zac Hatfield-Dodds

,

,

Samuel R. Bowman

,

,

Roger B. Grosse

,

Danny Hernandez

,

,

,

Nicholas Schiefer

,

Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023

Few-shot Adaptation Works with UnpredicTable Data.

[DOI]

,

,

,

Jérémy Scheurer

,

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

2022

Finding and Fixing Undesirable Behaviors in Pretrained Language Models.

[DOI]

PhD thesis, 2022

Discovering Language Model Behaviors with Model-Written Evaluations.

[DOI]

,

,

Kamile Lukosiute

,

,

,

,

,

Catherine Olsson

,

,

Saurav Kadavath

,

,

,

,

,

,

Cameron McKinnon

,

Christopher Olah

,

,

,

,

,

,

Eli Tran-Johnson

,

,

Jackson Kernion

,

,

,

,

,

,

,

Landon Goldberg

,

,

,

Michael Sellitto

,

,

Neerav Kingsland

,

,

Nicholas Joseph

,

,

,

,

,

,

,

,

,

,

Timothy Telleen-Lawton

,

,

,

,

,

Zac Hatfield-Dodds

,

,

Samuel R. Bowman

,

,

Roger B. Grosse

,

Danny Hernandez

,

,

,

Nicholas Schiefer

,

CoRR, 2022

Constitutional AI: Harmlessness from AI Feedback.

[DOI]

,

Saurav Kadavath

,

,

,

Jackson Kernion

,

,

,

,

Azalia Mirhoseini

,

Cameron McKinnon

,

,

Catherine Olsson

,

Christopher Olah

,

Danny Hernandez

,

,

,

,

Eli Tran-Johnson

,

,

,

,

,

,

,

Kamile Lukosiute

,

,

Michael Sellitto

,

,

Nicholas Schiefer

,

,

,

,

,

,

,

,

,

,

,

Timothy Telleen-Lawton

,

,

,

,

Samuel R. Bowman

,

Zac Hatfield-Dodds

,

,

,

Nicholas Joseph

,

,

,

CoRR, 2022

Measuring Progress on Scalable Oversight for Large Language Models.

[DOI]

CoRR, 2022

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned.

[DOI]

CoRR, 2022

Language Models (Mostly) Know What They Know.

[DOI]

CoRR, 2022

Learning from Natural Language Feedback.

[DOI]

Jérémy Scheurer

,

Jon Ander Campos

,

,

,

,

CoRR, 2022

Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions.

[DOI]

,

,

,

,

,

,

Samuel R. Bowman

CoRR, 2022

Red Teaming Language Models with Language Models.

[DOI]

,

,

H. Francis Song

,

,

,

,

,

,

Geoffrey Irving

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

RL with KL penalties is better viewed as Bayesian inference.

[DOI]

,

,

Christopher L. Buckley

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, 2022

2021

True Few-Shot Learning with Language Models.

[DOI]

,

,

Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Rissanen Data Analysis: Examining Dataset Characteristics via Description Length.

[DOI]

,

,

Proceedings of the 38th International Conference on Machine Learning, 2021

Case-based Reasoning for Natural Language Queries over Knowledge Bases.

[DOI]

,

,

,

,

,

,

,

Lazaros Polymenakos

,

Andrew McCallum

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021

2020

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.

[DOI]

,

,

Aleksandra Piktus

,

,

Vladimir Karpukhin

,

,

Heinrich Küttler

,

,

,

Tim Rocktäschel

,

Sebastian Riedel

,

Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020

Unsupervised Question Decomposition for Question Answering.

[DOI]

,

,

,

,

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020

2019

Finding Generalizable Evidence by Learning to Convince Q&A Models.

[DOI]

,

Siddharth Karamcheti

,

,

,

,

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019

ELI5: Long Form Question Answering.

[DOI]

,

,

,

,

,

Proceedings of the 57th Conference of the Association for Computational Linguistics, 2019

2018

HoME: a Household Multimodal Environment.

[DOI]

,

,

,

,

,

,

,

Hugo Larochelle

,

Aaron C. Courville

Proceedings of the 6th International Conference on Learning Representations, 2018

Visual Reasoning with Multi-hop Feature Modulation.

[DOI]

,

,

,

,

,

,

Aaron C. Courville

,

Olivier Pietquin

Proceedings of the Computer Vision - ECCV 2018, 2018

FiLM: Visual Reasoning with a General Conditioning Layer.

[DOI]

,

,

,

Vincent Dumoulin

,

Aaron C. Courville

Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018

2017

Learning Visual Reasoning Without Strong Priors.

[DOI]

,

,

,

Vincent Dumoulin

,

Aaron C. Courville

CoRR, 2017

2016

Semi-Supervised Learning with the Deep Rendering Mixture Model.

[DOI]

Minh Tan Nguyen

,

,

,

Richard G. Baraniuk

,

CoRR, 2016
