Adam Gleave

Orcid: 0000-0002-3467-528X

According to our database, Adam Gleave authored at least 45 papers between 2016 and 2026.

Collaborative distances:

Dijkstra number of four.
Erdős number of three.

Bibliography

2026

Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs.

David Gros

Adam Gleave

CoRR, May, 2026

The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes.

CoRR, February, 2026

Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution.

CoRR, February, 2026

Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks.

Lukas Struppek

Adam Gleave

Kellin Pelrine

CoRR, February, 2026

Large language models can effectively convince people to believe conspiracies.

Jean-François Godbout

Adam Gleave

David Rand

Gordon Pennycook

CoRR, January, 2026

STACK: Adversarial Attacks on LLM Safeguard Pipelines.

Ian R. McKenzie

Oskar John Hollinsworth

Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

2025

Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility.

Brendan Murphy

Dillon Bowen

Shahrad Mohammadzadeh

Julius Broomfield

Adam Gleave

Kellin Pelrine

CoRR, July, 2025

The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models.

Ann-Kathrin Dombrowski

Dillon Bowen

Adam Gleave

Chris Cundy

CoRR, July, 2025

The Singapore Consensus on Global AI Safety Research Priorities.

Vidhisha Balachandran

Bryan Low Kian Hsiang

CoRR, June, 2025

Interpreting learned search: finding a transition model and value function in an RNN that plays Sokoban.

CoRR, June, 2025

It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics.

Matthew Kowal

Jasper Timm

Jean-François Godbout

CoRR, June, 2025

AI Companies Should Report Pre- and Post-Mitigation Safety Evaluations.

Dillon Bowen

Ann-Kathrin Dombrowski

Adam Gleave

Chris Cundy

CoRR, March, 2025

Multi-Agent Risks from Advanced AI.

CoRR, February, 2025

Preference Learning with Lie Detectors can Induce Honesty or Evasion.

Chris Cundy

Adam Gleave

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

Scaling Trends in Language Model Robustness.

Nikolaus H. R. Howe

Ian R. McKenzie

Oskar John Hollinsworth

Proceedings of the Forty-second International Conference on Machine Learning, 2025

Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility.

Brendan Murphy

Dillon Bowen

Shahrad Mohammadzadeh

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

Can Go AIs Be Adversarially Robust?

Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, 2025

Scaling Trends for Data Poisoning in LLMs.

Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, 2025

2024

Scaling Laws for Data Poisoning in LLMs.

CoRR, 2024

Exploring Scaling Trends in LLM Robustness.

Nikolaus H. R. Howe

Michal Zajac

Ian R. McKenzie

Oskar John Hollinsworth

Tom Tseng

Pierre-Luc Bacon

Adam Gleave

CoRR, 2024

Planning behavior in a recurrent neural network that plays Sokoban.

Adrià Garriga-Alonso

Mohammad Taufeeque

Adam Gleave

CoRR, 2024

Uncovering Latent Human Wellbeing in Language Model Embeddings.

CoRR, 2024

STARC: A General Framework For Quantifying Differences Between Reward Functions.

Joar Max Viktor Skalse

Lucy Farnik

Sumeet Ramesh Motwani

Erik Jenner

Adam Gleave

Alessandro Abate

Proceedings of the Twelfth International Conference on Learning Representations, 2024

2023

Exploiting Novel GPT-4 APIs.

CoRR, 2023

On The Fragility of Learned Reward Functions.

CoRR, 2023

Adversarial Policies Beat Superhuman Go AIs.

Proceedings of the International Conference on Machine Learning, 2023

Invariance in Policy Optimisation and Partial Identifiability in Reward Learning.

Joar Max Viktor Skalse

Matthew Farrugia-Roberts

Stuart Russell

Alessandro Abate

Adam Gleave

Proceedings of the International Conference on Machine Learning, 2023

2022

Towards Trustworthy Machine Learning.

Adam Gleave

PhD thesis, 2022

imitation: Clean Imitation Learning Implementations.

CoRR, 2022

Adversarial Policies Beat Professional-Level Go AIs.

CoRR, 2022

Calculus on MDPs: Potential Shaping as a Gradient.

Erik Jenner

Herke van Hoof

Adam Gleave

CoRR, 2022

Reducing Exploitability with Population Based Training.

Pavel Czempin

Adam Gleave

CoRR, 2022

Preprocessing Reward Functions for Interpretability.

Erik Jenner

Adam Gleave

CoRR, 2022

A Primer on Maximum Causal Entropy Inverse Reinforcement Learning.

Adam Gleave

Sam Toyer

CoRR, 2022

Uncertainty Estimation for Language Reward Models.

Adam Gleave

Geoffrey Irving

CoRR, 2022

2021

Stable-Baselines3: Reliable Reinforcement Learning Implementations.

J. Mach. Learn. Res., 2021

Quantifying Differences in Reward Functions.

Proceedings of the 9th International Conference on Learning Representations, 2021

2020

Understanding Learned Reward Functions.

Eric J. Michaud

Adam Gleave

Stuart Russell

CoRR, 2020

DERAIL: Diagnostic Environments for Reward And Imitation Learning.

CoRR, 2020

Adversarial Policies: Attacking Deep Reinforcement Learning.

Proceedings of the 8th International Conference on Learning Representations, 2020

2018

Inverse reinforcement learning for video games.

Aaron Tucker

Adam Gleave

Stuart Russell

CoRR, 2018

Active Inverse Reward Design.

Sören Mindermann

Rohin Shah

Adam Gleave

Dylan Hadfield-Menell

CoRR, 2018

Multi-task Maximum Entropy Inverse Reinforcement Learning.

Adam Gleave

Oliver Habryka

CoRR, 2018

2017

Making Compression Algorithms for Unicode Text.

Adam Gleave

Christian Steinruecken

Proceedings of the 2017 Data Compression Conference, 2017

2016

Firmament: Fast, Centralized Cluster Scheduling at Scale.

Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, 2016
