Adam Gleave

ORCID: 0000-0002-3467-528X

According to our database, Adam Gleave authored at least 44 papers between 2016 and 2026.

Bibliography

2026
The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes.
CoRR, February, 2026

Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution.
CoRR, February, 2026

Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks.
CoRR, February, 2026

Large language models can effectively convince people to believe conspiracies.
CoRR, January, 2026

STACK: Adversarial Attacks on LLM Safeguard Pipelines.
Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

2025
Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility.
CoRR, July, 2025

The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models.
CoRR, July, 2025

The Singapore Consensus on Global AI Safety Research Priorities.
CoRR, June, 2025

Interpreting learned search: finding a transition model and value function in an RNN that plays Sokoban.
CoRR, June, 2025

It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics.
CoRR, June, 2025

Preference Learning with Lie Detectors can Induce Honesty or Evasion.
CoRR, May, 2025

AI Companies Should Report Pre- and Post-Mitigation Safety Evaluations.
CoRR, March, 2025

Multi-Agent Risks from Advanced AI.
CoRR, February, 2025

Scaling Trends in Language Model Robustness.
Proceedings of the Forty-second International Conference on Machine Learning, 2025

Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility.
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

Can Go AIs Be Adversarially Robust?
Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, 2025

Scaling Trends for Data Poisoning in LLMs.
Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, 2025

2024
Scaling Laws for Data Poisoning in LLMs.
CoRR, 2024

Exploring Scaling Trends in LLM Robustness.
CoRR, 2024

Planning behavior in a recurrent neural network that plays Sokoban.
CoRR, 2024

Uncovering Latent Human Wellbeing in Language Model Embeddings.
CoRR, 2024

STARC: A General Framework For Quantifying Differences Between Reward Functions.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

2023
Exploiting Novel GPT-4 APIs.
CoRR, 2023

On The Fragility of Learned Reward Functions.
CoRR, 2023

Adversarial Policies Beat Superhuman Go AIs.
Proceedings of the International Conference on Machine Learning, 2023

Invariance in Policy Optimisation and Partial Identifiability in Reward Learning.
Proceedings of the International Conference on Machine Learning, 2023

2022
Towards Trustworthy Machine Learning.
PhD thesis, 2022

imitation: Clean Imitation Learning Implementations.
CoRR, 2022

Adversarial Policies Beat Professional-Level Go AIs.
CoRR, 2022

Calculus on MDPs: Potential Shaping as a Gradient.
CoRR, 2022

Reducing Exploitability with Population Based Training.
CoRR, 2022

Preprocessing Reward Functions for Interpretability.
CoRR, 2022

A Primer on Maximum Causal Entropy Inverse Reinforcement Learning.
CoRR, 2022

Uncertainty Estimation for Language Reward Models.
CoRR, 2022

2021
Stable-Baselines3: Reliable Reinforcement Learning Implementations.
J. Mach. Learn. Res., 2021

Quantifying Differences in Reward Functions.
Proceedings of the 9th International Conference on Learning Representations, 2021

2020
Understanding Learned Reward Functions.
CoRR, 2020

DERAIL: Diagnostic Environments for Reward And Imitation Learning.
CoRR, 2020

Adversarial Policies: Attacking Deep Reinforcement Learning.
Proceedings of the 8th International Conference on Learning Representations, 2020

2018
Inverse reinforcement learning for video games.
CoRR, 2018

Active Inverse Reward Design.
CoRR, 2018

Multi-task Maximum Entropy Inverse Reinforcement Learning.
CoRR, 2018

2017
Making Compression Algorithms for Unicode Text.
Proceedings of the 2017 Data Compression Conference, 2017

2016
Firmament: Fast, Centralized Cluster Scheduling at Scale.
Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, 2016

