Buck Shlegeris

According to our database, Buck Shlegeris authored at least 22 papers between 2018 and 2025.

Bibliography

2025
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety.
CoRR, July, 2025

The Singapore Consensus on Global AI Safety Research Priorities.
CoRR, June, 2025

SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents.
CoRR, June, 2025

Ctrl-Z: Controlling AI Agents via Resampling.
CoRR, April, 2025

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence.
CoRR, April, 2025

A sketch of an AI control safety case.
CoRR, January, 2025

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024
Language Models Are Better Than Humans at Next-token Prediction.
Trans. Mach. Learn. Res., 2024

Alignment faking in large language models.
CoRR, 2024

Subversion Strategy Eval: Evaluating AI's stateless strategic capabilities against control protocols.
CoRR, 2024

Towards evaluations-based safety cases for AI scheming.
CoRR, 2024

Sabotage Evaluations for Frontier Models.
CoRR, 2024

Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols.
CoRR, 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models.
CoRR, 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.
CoRR, 2024

AI Control: Improving Safety Despite Intentional Subversion.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

2023
Generalized Wick Decompositions.
CoRR, 2023

Measurement Tampering Detection Benchmark.
CoRR, 2023

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

2022
Polysemanticity and Capacity in Neural Networks.
CoRR, 2022

Adversarial training for high-stakes reliability.
Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

2018
Supervising strong learners by amplifying weak experts.
CoRR, 2018

