Xander Davies

According to our database, Xander Davies authored at least 22 papers between 2023 and 2026.

Bibliography

2026
Evaluating whether AI models would sabotage AI safety research.
CoRR, April, 2026

UK AISI Alignment Evaluation Case-Study.
CoRR, April, 2026

How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition.
CoRR, March, 2026

Boundary Point Jailbreaking of Black-Box LLMs.
CoRR, February, 2026

STACK: Adversarial Attacks on LLM Safeguard Pipelines.
Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

2025
Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents.
CoRR, October, 2025

Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples.
CoRR, October, 2025

RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents.
CoRR, October, 2025

Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs.
CoRR, August, 2025

Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition.
CoRR, July, 2025

Existing Large Language Model Unlearning Evaluations Are Inconclusive.
CoRR, June, 2025

An Example Safety Case for Safeguards Against Misuse.
CoRR, May, 2025

Fundamental Limitations in Defending LLM Finetuning APIs.
CoRR, February, 2025

Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents.
CoRR, 2024

2023
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.
Trans. Mach. Learn. Res., 2023

Circuit Breaking: Removing Model Behaviors with Targeted Ablation.
CoRR, 2023

Discovering Variable Binding Circuitry with Desiderata.
CoRR, 2023

Unifying Grokking and Double Descent.
CoRR, 2023

Sparse Distributed Memory is a Continual Learner.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

