Owain Evans

According to our database¹, Owain Evans authored at least 40 papers between 2009 and 2026.

Collaborative distances:

Dijkstra number² of four.
Erdős number³ of three.

Bibliography

2026

Negation Neglect: When models fail to learn negations in training.

CoRR, May, 2026

Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers.

Anna Sztyber-Betley

CoRR, April, 2026

The Consciousness Cluster: Emergent preferences of Models that Claim to be Conscious.

CoRR, April, 2026

2025

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers.

Kit Fraser-Taliente, Subhash Kantamneni, Arnab Sen Sharma

CoRR, December, 2025

Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs.

Anna Sztyber-Betley

CoRR, December, 2025

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs.

Johannes Treutlein

CoRR, August, 2025

Persona Vectors: Monitoring and Controlling Character Traits in Language Models.

CoRR, July, 2025

Subliminal Learning: Language models transmit behavioral traits via hidden signals in data.

Anna Sztyber-Betley

CoRR, July, 2025

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety.

CoRR, July, 2025

Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models.

CoRR, June, 2025

Inference-Time-Compute: More Faithful? A Research Note.

CoRR, January, 2025

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs.

Daniel Chee Hian Tan, Anna Sztyber-Betley

Proceedings of the Forty-second International Conference on Machine Learning, 2025

Looking Inward: Language Models Can Learn About Themselves by Introspection.

Felix Jedidja Binder

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Tell me about yourself: LLMs are aware of their learned behaviors.

Anna Sztyber-Betley

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024

The Two-Hop Curse: LLMs trained on A->B, B->C fail to learn A->C.

CoRR, 2024

Towards evaluations-based safety cases for AI scheming.

Marius Hobbhahn, Alexander Meinke, Jérémy Scheurer, Nicholas Goldowsky-Dill, Daniel Kokotajlo

CoRR, 2024

Looking Inward: Language Models Can Learn About Themselves by Introspection.

Felix J. Binder

CoRR, 2024

Can Language Models Explain Their Own Classification Behavior?

CoRR, 2024

Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data.

Johannes Treutlein, Roger B. Grosse

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs.

Kaivalya Hariharan, Jérémy Scheurer, Marius Hobbhahn, Alexander Meinke

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions.

Lorenzo Pacchiardi, Alex James Chan, Sören Mindermann, Jan Markus Brauner

Proceedings of the Twelfth International Conference on Learning Representations, 2024

The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A".

Maximilian Kaufmann, Asa Cooper Stickland

Proceedings of the Twelfth International Conference on Learning Representations, 2024

2023

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models.

Aarohi Srivastava, Abhinav Rastogi, Abu Awal Md Shoeb, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Alexander W. Kocurek, Anantharaman S. Iyer, Anders Andreassen, Andrea Santilli, Andreas Stuhlmüller, Andrew K. Lampinen, Antonio Norelli, Arash Gholamidavoodi, Arun Kirubarajan, Asher Mullokandov, Ashish Sabharwal, B. Ryan Roberts, Bartlomiej Bojanowski, Batuhan Özyurt, Behnam Hedayatnia, Behnam Neyshabur, Bill Yuchen Lin, Catherine Stinson, Cedrick Argueta, Cèsar Ferri Ramírez, Charles Rathkopf, Chris Callison-Burch, Christian Voigt, Christopher D. Manning, Christopher Potts, Clara E. Rivera, Courtney Ashcraft, Cristina Garbacea, Daniel Khashabi, Daniel Moseguí González, Danielle Perszyk, Danny Hernandez, Daphne Ippolito, Debajyoti Datta, Dimitri Coelho Mollo, Ekaterina Shutova, Ekin Dogus Cubuk, Eleanor Hagerman, Elizabeth Barnes, Elizabeth Donoway, Emanuele Rodolà, Ethan J. Jerzak, Eunice Engefu Manyasi, Evgenii Zheltonozhskii, Fernando Martínez-Plumed, Francesca Happé, François Chollet, Genta Indra Winata, Germán Kruszewski, Giambattista Parascandolo, Giorgio Mariani, Gonzalo Jaimovitch-López, Hana Galijasevic, Hannaneh Hajishirzi, Hinrich Schütze, Jack Geissinger, Jackson Kernion, Jaime Fernández Fisac, Janelle Wingfield, Jascha Sohl-Dickstein, Jekaterina Novikova, Jonathan Batchelder, Jonathan Berant, José Hernández-Orallo, Joseph Boudeman, Joshua B. Tenenbaum, Karthik Gopalakrishnan, Katerina Ignatyeva, Kaustubh D. Dhole, Kory W. Mathewson, Kristen Chiafullo, Ksenia Shkaruta, Kyle Richardson, Lidia Contreras Ochando, Louis-Philippe Morency, Luis Oliveros Colón, Lütfi Kerem Senel, Maartje ter Hoeve, María José Ramírez-Quintana, Mario Giulianelli, Martin Potthast, Matthew L. Leavitt, Mátyás Schubert, Medina Baitemirova, Melvin McElrath, Michael I. Ivanitskiy, Michael Starritt, Michal Swedrowski, Michele Bevilacqua, Michihiro Yasunaga, Moin Aminnaseri, Mukund Varma T., Neta Gur-Ari Krakover, Nicholas Cameron, Nicholas Roberts, Nicole Martinez, Niklas Muennighoff, Nitish Shirish Keskar, Omar Elbaghdadi, Pablo Antonio Moreno Casares, Pegah Alipoormolabashi, Peter Eckersley, Piotr Milkowski, Pouya Pezeshkpour, Rachel Etta Rudolph, Raphaël Millière, Robbe Raymaekers, Ruslan Salakhutdinov, Saif M. Mohammad, Samuel Gruetter, Samuel R. Bowman, Samuel S. Schoenholz, Sarik Ghazarian, Sebastian Bischoff, Sebastian Gehrmann, Sebastian Schuster, Sepideh Sadeghi, Shashank Srivastava, Shixiang Shane Gu, Shubh Pachchigar, Shubham Toshniwal, Shyamolima (Shammie) Debnath, Simon Thormeyer, Sneha Priscilla Makini, Sriharsha Hatwar, Stanislas Dehaene, Stella Biderman, Steven T. Piantadosi, Stuart M. Shieber, Summer Misherghi, Svetlana Kiritchenko, Tatsu Hashimoto, Théo Desbordes, Theodore Rothschild, Tiberius Nkinyili, Tobias Gerstenberg, Trishala Neeraj, Victoria Nyamai, Vinay V. Ramasesh, Vinay Uday Prabhu, Vishakh Padmakumar, William Saunders, Yadollah Yaghoobzadeh, Yonatan Belinkov

Trans. Mach. Learn. Res., 2023

Tell, don't show: Declarative facts influence how LLMs generalize.

Alexander Meinke

CoRR, 2023

Taken out of context: On measuring situational awareness in LLMs.

Asa Cooper Stickland, Maximilian Kaufmann, Daniel Kokotajlo

CoRR, 2023

2022

Teaching Models to Express Their Uncertainty in Words.

Trans. Mach. Learn. Res., 2022

Forecasting Future World Events With Neural Networks.

Jacob Steinhardt

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

TruthfulQA: Measuring How Models Mimic Human Falsehoods.

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022

2021

Truthful AI: Developing and governing AI that does not lie.

Owen Cotton-Barratt, Lukas Finnveden, William Saunders

CoRR, 2021

2020

Active Reinforcement Learning: Observing Rewards at a Cost.

CoRR, 2020

2019

Sensory Optimization: Neural Networks as a Model for Understanding and Creating Art.

CoRR, 2019

Generalizing from a few environments in safety-critical reinforcement learning.

CoRR, 2019

2018

Viewpoint: When Will AI Exceed Human Performance? Evidence from AI Experts.

J. Artif. Intell. Res., 2018

Active Reinforcement Learning with Monte-Carlo Tree Search.

Sebastian Schulze

CoRR, 2018

The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation.

CoRR, 2018

Trial without Error: Towards Safe Reinforcement Learning via Human Intervention.

William Saunders, Andreas Stuhlmüller

Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, 2018

2017

When Will AI Exceed Human Performance? Evidence from AI Experts.

CoRR, 2017

Agent-Agnostic Human-in-the-Loop Reinforcement Learning.

Andreas Stuhlmüller

CoRR, 2017

2016

Learning the Preferences of Ignorant, Inconsistent Agents.

Andreas Stuhlmüller, Noah D. Goodman

Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016

2009

Help or Hinder: Bayesian Models of Social Goal Inference.

Tomer D. Ullman, Noah D. Goodman, Joshua B. Tenenbaum

Proceedings of the Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009. Proceedings of a meeting held 7-10 December 2009, 2009
