Henry Sleight

According to our database, Henry Sleight authored at least 22 papers between 2024 and 2025.

Collaborative distances:

Dijkstra number of four.
Erdős number of three.

Bibliography

2025

Evaluating Control Protocols for Untrusted AI Agents.

CoRR, November, 2025

Believe It or Not: How Deeply do LLMs Believe Implanted Facts?

CoRR, October, 2025

All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language.

Shiyuan Guo

Henry Sleight

Fabien Roger

CoRR, October, 2025

Stress-Testing Model Specs Reveals Character Differences among Language Models.

CoRR, October, 2025

Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment.

CoRR, October, 2025

The LLM Has Left The Chat: Evidence of Bail Preferences in Large Language Models.

Danielle Ensign

Henry Sleight

Kyle Fish

CoRR, September, 2025

Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks.

CoRR, August, 2025

Persona Vectors: Monitoring and Controlling Character Traits in Language Models.

CoRR, July, 2025

Inverse Scaling in Test-Time Compute.

Jacob Goldman-Wetzler

CoRR, July, 2025

SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents.

CoRR, June, 2025

Unsupervised Elicitation of Language Models.

Jacob Goldman-Wetzler

CoRR, June, 2025

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs.

Dylan Hadfield-Menell

Stephen Casper

Trans. Mach. Learn. Res., 2025

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Failures to Find Transferable Image Jailbreaks Between Vision-Language Models.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Looking Inward: Language Models Can Learn About Themselves by Introspection.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024

Best-of-N Jailbreaking.

CoRR, 2024

Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach.

CoRR, 2024

Rapid Response: Mitigating LLM Jailbreaks with a Few Examples.

CoRR, 2024

Looking Inward: Language Models Can Learn About Themselves by Introspection.

CoRR, 2024

Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs.

Dylan Hadfield-Menell

Stephen Casper

CoRR, 2024

When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?

CoRR, 2024

Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data.

Matthias Gerstgrasser

CoRR, 2024
