Nicholas Goldowsky-Dill

According to our database, Nicholas Goldowsky-Dill authored at least 9 papers between 2023 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
Stress Testing Deliberative Alignment for Anti-Scheming Training.
CoRR, September, 2025

Detecting Strategic Deception Using Linear Probes.
CoRR, February, 2025

Open Problems in Mechanistic Interpretability.
Trans. Mach. Learn. Res., 2025

Detecting Strategic Deception with Linear Probes.
Proceedings of the Forty-second International Conference on Machine Learning, 2025

2024
Towards evaluations-based safety cases for AI scheming.
CoRR, 2024

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks.
CoRR, 2024

Using Degeneracy in the Loss Landscape for Mechanistic Interpretability.
CoRR, 2024

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

2023
Localizing Model Behavior with Path Patching.
CoRR, 2023

