Jan Leike

Affiliations:

Anthropic PBC, San Francisco, CA, USA
OpenAI, San Francisco, CA, USA (former)
Australian National University, Canberra, ACT, Australia (PhD)
University of Freiburg, Germany

According to our database, Jan Leike authored at least 60 papers between 2013 and 2026.

Collaborative distances:

Dijkstra number of four.
Erdős number of four.

Bibliography

2026

Excess Description Length of Learning Generalizable Predictors.

CoRR, January, 2026

Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks.

CoRR, January, 2026

2025

Natural Emergent Misalignment from Reward Hacking in Production RL.

CoRR, November, 2025

Limit-Computable Grains of Truth for Arbitrary Computable Extensive-Form (Un)Known Games.

CoRR, August, 2025

Unsupervised Elicitation of Language Models.

Jacob Goldman-Wetzler

CoRR, June, 2025

Reasoning Models Don't Always Say What They Think.

CoRR, May, 2025

Auditing language models for hidden objectives.

CoRR, March, 2025

Forecasting Rare Language Model Behaviors.

CoRR, February, 2025

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming.

CoRR, January, 2025

Quantifying Elicitation of Latent Capabilities in Language Models.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

Scaling and evaluating sparse autoencoders.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024

Prover-Verifier Games improve legibility of LLM outputs.

CoRR, 2024

LLM Critics Help Catch LLM Bugs.

Nat McAleese

Rai Michael Pokorny

Juan Felipe Ceron Uribe

Evgenia Nitishinskaya

Maja Trebacz

Jan Leike

CoRR, 2024

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision.

Leopold Aschenbrenner

Proceedings of the Forty-first International Conference on Machine Learning, 2024

Let's Verify Step by Step.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

2022

Self-critiquing models for assisting human evaluators.

CoRR, 2022

Safe Deep RL in 3D Environments using Human Feedback.

CoRR, 2022

Training language models to follow instructions with human feedback.

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

2021

Institutionalizing ethics in AI through broader impact requirements.

Nat. Mach. Intell., 2021

Recursively Summarizing Books with Human Feedback.

CoRR, 2021

Evaluating Large Language Models Trained on Code.

Henrique Pondé de Oliveira Pinto

CoRR, 2021

Institutionalising Ethics in AI through Broader Impact Requirements.

CoRR, 2021

Quantifying Differences in Reward Functions.

Proceedings of the 9th International Conference on Learning Representations, 2021

2020

Active Reinforcement Learning: Observing Rewards at a Cost.

CoRR, 2020

Hidden Incentives for Auto-Induced Distributional Shift.

David Krueger

Tegan Maharaj

Jan Leike

CoRR, 2020

Pitfalls of Learning a Reward Function Online.

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 2020

Learning Human Objectives by Evaluating Hypothetical Behavior.

Proceedings of the 37th International Conference on Machine Learning, 2020

2019

Learning to Understand Goal Specifications by Modelling Reward.

Proceedings of the 7th International Conference on Learning Representations, 2019

2018

On the computability of Solomonoff induction and AIXI.

Jan Leike

Marcus Hutter

Theor. Comput. Sci., 2018

Scaling shared model governance via model splitting.

CoRR, 2018

Scalable agent alignment via reward modeling: a research direction.

CoRR, 2018

Learning to Follow Language Instructions with Adversarial Reward Induction.

CoRR, 2018

Geometric Nontermination Arguments.

Jan Leike

Matthias Heizmann

Proceedings of the Tools and Algorithms for the Construction and Analysis of Systems, 2018

Reward learning from human preferences and demonstrations in Atari.

Proceedings of the Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, 2018

Jointly Learning "What" and "How" from Instructions and Goal-States.

Proceedings of the 6th International Conference on Learning Representations, 2018

2017

AI Safety Gridworlds.

CoRR, 2017

Generalised Discount Functions applied to a Monte-Carlo AImu Implementation.

CoRR, 2017

Deep Reinforcement Learning from Human Preferences.

Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 2017

On Thompson Sampling and Asymptotic Optimality.

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, 2017

Universal Reinforcement Learning Algorithms: Survey and Experiments.

John Aslanides

Jan Leike

Marcus Hutter

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, 2017

Generalised Discount Functions applied to a Monte-Carlo AIμ Implementation.

Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, 2017

2016

Nonparametric General Reinforcement Learning.

Jan Leike

CoRR, 2016

Exploration Potential.

Jan Leike

CoRR, 2016

A Formal Solution to the Grain of Truth Problem.

Jan Leike

Jessica Taylor

Benya Fallenstein

Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, 2016

Thompson Sampling is Asymptotically Optimal in General Environments.

Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, 2016

Ultimate Automizer with Two-track Proofs - (Competition Contribution).

Proceedings of the Tools and Algorithms for the Construction and Analysis of Systems, 2016

Loss Bounds and Time Complexity for Speed Priors.

Daniel Filan

Jan Leike

Marcus Hutter

Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 2016

2015

On the Computability of AIXI.

Jan Leike

Marcus Hutter

Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, 2015

Ultimate Automizer with Array Interpolation - (Competition Contribution).

Proceedings of the Tools and Algorithms for the Construction and Analysis of Systems, 2015

Bad Universal Priors and Notions of Optimality.

Jan Leike

Marcus Hutter

Proceedings of The 28th Conference on Learning Theory, 2015

On the Computability of Solomonoff Induction and Knowledge-Seeking.

Jan Leike

Marcus Hutter

Proceedings of the Algorithmic Learning Theory - 26th International Conference, 2015

Solomonoff Induction Violates Nicod's Criterion.

Jan Leike

Marcus Hutter

Proceedings of the Algorithmic Learning Theory - 26th International Conference, 2015

Sequential Extensions of Causal and Evidential Decision Theory.

Tom Everitt

Jan Leike

Marcus Hutter

Proceedings of the Algorithmic Decision Theory - 4th International Conference, 2015

A Definition of Happiness for Reinforcement Learning Agents.

Mayank Daswani

Jan Leike

Proceedings of the Artificial General Intelligence, 2015

2014

Geometric Series as Nontermination Arguments for Linear Lasso Programs.

Jan Leike

Matthias Heizmann

CoRR, 2014

Ranking Function Synthesis for Linear Lasso Programs.

Jan Leike

CoRR, 2014

Synthesis for Polynomial Lasso Programs.

Jan Leike

Ashish Tiwari

Proceedings of the Verification, Model Checking, and Abstract Interpretation, 2014

Ranking Templates for Linear Loops.

Jan Leike

Matthias Heizmann

Proceedings of the Tools and Algorithms for the Construction and Analysis of Systems, 2014

Indefinitely Oscillating Martingales.

Jan Leike

Marcus Hutter

Proceedings of the Algorithmic Learning Theory - 25th International Conference, 2014

2013

Linear Ranking for Linear Lasso Programs.

Proceedings of the Automated Technology for Verification and Analysis, 2013
