Evan Hubinger

According to our database, Evan Hubinger authored at least 11 papers between 2019 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2024
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.
CoRR, 2024

2023
Steering Llama 2 via Contrastive Activation Addition.
CoRR, 2023

Studying Large Language Model Generalization with Influence Functions.
CoRR, 2023

Measuring Faithfulness in Chain-of-Thought Reasoning.
CoRR, 2023

Question Decomposition Improves the Faithfulness of Model-Generated Reasoning.
CoRR, 2023

Conditioning Predictive Models: Risks and Strategies.
CoRR, 2023

2022
Discovering Language Model Behaviors with Model-Written Evaluations.
CoRR, 2022

Engineering Monosemanticity in Toy Models.
CoRR, 2022

2020
An overview of 11 proposals for building safe advanced AI.
CoRR, 2020

2019
Risks from Learned Optimization in Advanced Machine Learning Systems.
CoRR, 2019