Monte MacDiarmid

According to our database, Monte MacDiarmid authored at least 7 papers between 2023 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
Natural Emergent Misalignment from Reward Hacking in Production RL.
CoRR, November, 2025

Auditing language models for hidden objectives.
CoRR, March, 2025

2024
Alignment faking in large language models.
CoRR, 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models.
CoRR, 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.
CoRR, 2024

2023
Understanding and Controlling a Maze-Solving Policy Network.
CoRR, 2023

Activation Addition: Steering Language Models Without Optimization.
CoRR, 2023

