Monte MacDiarmid

According to our database¹, Monte MacDiarmid authored at least 8 papers between 2023 and 2026.

Collaborative distances:

Dijkstra number² of four.
Erdős number³ of four.

Bibliography

2026

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.

CoRR, May, 2026

2025

Natural Emergent Misalignment from Reward Hacking in Production RL.

CoRR, November, 2025

Auditing language models for hidden objectives.

CoRR, March, 2025

2024

Alignment faking in large language models.

CoRR, 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models.

CoRR, 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.

CoRR, 2024

2023

Understanding and Controlling a Maze-Solving Policy Network.

Alexander Matt Turner

CoRR, 2023

Activation Addition: Steering Language Models Without Optimization.

Alexander Matt Turner

CoRR, 2023
