Henry Sleight

According to our database, Henry Sleight authored at least 14 papers between 2024 and 2025.

Bibliography

2025
Persona Vectors: Monitoring and Controlling Character Traits in Language Models.
CoRR, July, 2025

Inverse Scaling in Test-Time Compute.
CoRR, July, 2025

SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents.
CoRR, June, 2025

Unsupervised Elicitation of Language Models.
CoRR, June, 2025

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Failures to Find Transferable Image Jailbreaks Between Vision-Language Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Looking Inward: Language Models Can Learn About Themselves by Introspection.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024
Best-of-N Jailbreaking.
CoRR, 2024

Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach.
CoRR, 2024

Rapid Response: Mitigating LLM Jailbreaks with a Few Examples.
CoRR, 2024

Looking Inward: Language Models Can Learn About Themselves by Introspection.
CoRR, 2024

Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs.
CoRR, 2024

When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?
CoRR, 2024

Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data.
CoRR, 2024

