Lee Sharkey

Orcid: 0009-0009-2137-6027

According to our database, Lee Sharkey authored at least 17 papers between 2021 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
Stochastic Parameter Decomposition.
CoRR, June, 2025

Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video.
CoRR, April, 2025

AI Behind Closed Doors: a Primer on The Governance of Internal Deployment.
CoRR, April, 2025

Identifying Sparsely Active Circuits Through Local Loss Landscape Decomposition.
CoRR, April, 2025

Open Problems in Mechanistic Interpretability.
CoRR, January, 2025

Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition.
CoRR, January, 2025

Bilinear MLPs enable weight-based mechanistic interpretability.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Sparse Autoencoders Do Not Find Canonical Units of Analysis.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024
Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs.
CoRR, 2024

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Sparse Autoencoders Find Highly Interpretable Features in Language Models.
Proceedings of the Twelfth International Conference on Learning Representations, 2024


2023
A technical note on bilinear layers for interpretability.
CoRR, 2023

2022
Circumventing interpretability: How to defeat mind-readers.
CoRR, 2022

Interpreting Neural Networks through the Polytope Lens.
CoRR, 2022

Goal Misgeneralization in Deep Reinforcement Learning.
Proceedings of the International Conference on Machine Learning, 2022

2021
Objective Robustness in Deep Reinforcement Learning.
CoRR, 2021
