Mikita Balesni

According to our database¹, Mikita Balesni authored at least 12 papers between 2023 and 2025.

Collaborative distances:

Dijkstra number² of four.
Erdős number³ of four.

Bibliography

2025

Stress Testing Deliberative Alignment for Anti-Scheming Training.

Bronson Schoen

Evgenia Nitishinskaya

Nicholas Goldowsky-Dill

CoRR, September, 2025

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety.

CoRR, July, 2025

AI Behind Closed Doors: a Primer on The Governance of Internal Deployment.

CoRR, April, 2025

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence.

CoRR, April, 2025

2024

Frontier Models are Capable of In-context Scheming.

CoRR, 2024

The Two-Hop Curse: LLMs trained on A->B, B->C fail to learn A->C.

Mikita Balesni

Tomasz Korbak

Owain Evans

CoRR, 2024

Towards evaluations-based safety cases for AI scheming.

Nicholas Goldowsky-Dill

CoRR, 2024

Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack.

Leo McKee-Reid

Christoph Sträter

Maria Angelica Martinez

Joe Needham

Mikita Balesni

CoRR, 2024

Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A".

Proceedings of the Twelfth International Conference on Learning Representations, 2024

2023

Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure.

Jérémy Scheurer

Mikita Balesni

Marius Hobbhahn

CoRR, 2023

Taken out of context: On measuring situational awareness in LLMs.

CoRR, 2023
