Adam Gleave

ORCID: 0000-0002-3467-528X

According to our database, Adam Gleave authored at least 38 papers between 2016 and 2025.


Bibliography

2025
Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility.
CoRR, July, 2025

The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models.
CoRR, July, 2025

STACK: Adversarial Attacks on LLM Safeguard Pipelines.
CoRR, June, 2025

The Singapore Consensus on Global AI Safety Research Priorities.
CoRR, June, 2025

Interpreting learned search: finding a transition model and value function in an RNN that plays Sokoban.
CoRR, June, 2025

It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics.
CoRR, June, 2025

Preference Learning with Lie Detectors can Induce Honesty or Evasion.
CoRR, May, 2025

AI Companies Should Report Pre- and Post-Mitigation Safety Evaluations.
CoRR, March, 2025

Multi-Agent Risks from Advanced AI.
CoRR, February, 2025

Can Go AIs Be Adversarially Robust?
Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-25), February 2025

Scaling Trends for Data Poisoning in LLMs.
Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-25), February 2025

2024
Scaling Laws for Data Poisoning in LLMs.
CoRR, 2024

Exploring Scaling Trends in LLM Robustness.
CoRR, 2024

Planning behavior in a recurrent neural network that plays Sokoban.
CoRR, 2024

Uncovering Latent Human Wellbeing in Language Model Embeddings.
CoRR, 2024

STARC: A General Framework For Quantifying Differences Between Reward Functions.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

2023
Exploiting Novel GPT-4 APIs.
CoRR, 2023

On The Fragility of Learned Reward Functions.
CoRR, 2023

Adversarial Policies Beat Superhuman Go AIs.
Proceedings of the International Conference on Machine Learning, 2023

Invariance in Policy Optimisation and Partial Identifiability in Reward Learning.
Proceedings of the International Conference on Machine Learning, 2023

2022
Towards Trustworthy Machine Learning.
PhD thesis, 2022

imitation: Clean Imitation Learning Implementations.
CoRR, 2022

Adversarial Policies Beat Professional-Level Go AIs.
CoRR, 2022

Calculus on MDPs: Potential Shaping as a Gradient.
CoRR, 2022

Reducing Exploitability with Population Based Training.
CoRR, 2022

Preprocessing Reward Functions for Interpretability.
CoRR, 2022

A Primer on Maximum Causal Entropy Inverse Reinforcement Learning.
CoRR, 2022

Uncertainty Estimation for Language Reward Models.
CoRR, 2022

2021
Stable-Baselines3: Reliable Reinforcement Learning Implementations.
J. Mach. Learn. Res., 2021

Quantifying Differences in Reward Functions.
Proceedings of the 9th International Conference on Learning Representations, 2021

2020
Understanding Learned Reward Functions.
CoRR, 2020

DERAIL: Diagnostic Environments for Reward And Imitation Learning.
CoRR, 2020

Adversarial Policies: Attacking Deep Reinforcement Learning.
Proceedings of the 8th International Conference on Learning Representations, 2020

2018
Inverse reinforcement learning for video games.
CoRR, 2018

Active Inverse Reward Design.
CoRR, 2018

Multi-task Maximum Entropy Inverse Reinforcement Learning.
CoRR, 2018

2017
Making Compression Algorithms for Unicode Text.
Proceedings of the 2017 Data Compression Conference, 2017

2016
Firmament: Fast, Centralized Cluster Scheduling at Scale.
Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, 2016
