Lee Sharkey

Orcid: 0009-0009-2137-6027

According to our database¹, Lee Sharkey authored at least 18 papers between 2021 and 2026.

Collaborative distances:

Dijkstra number² of four.
Erdős number³ of four.

Bibliography

2026

Polysemantic Experts, Monosemantic Paths: Routing as Control in MoEs.

Charles Ye

Bo Yuan

Lee Sharkey

CoRR, April, 2026

2025

Stochastic Parameter Decomposition.

Lucius Bushnaq

Dan Braun

Lee Sharkey

CoRR, June, 2025

Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video.

CoRR, April, 2025

AI Behind Closed Doors: a Primer on The Governance of Internal Deployment.

CoRR, April, 2025

Identifying Sparsely Active Circuits Through Local Loss Landscape Decomposition.

Brianna Chrisman

Lucius Bushnaq

Lee Sharkey

CoRR, April, 2025

Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition.

CoRR, January, 2025

Open Problems in Mechanistic Interpretability.

Trans. Mach. Learn. Res., 2025

Bilinear MLPs enable weight-based mechanistic interpretability.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Sparse Autoencoders Do Not Find Canonical Units of Analysis.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs.

Kola Ayonrinde

Michael T. Pearce

Lee Sharkey

CoRR, 2024

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning.

Dan Braun

Jordan Taylor

Nicholas Goldowsky-Dill

Lee Sharkey

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

Sparse Autoencoders Find Highly Interpretable Features in Language Models.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Black-Box Access is Insufficient for Rigorous AI Audits.

Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 2024

2023

A technical note on bilinear layers for interpretability.

Lee Sharkey

CoRR, 2023

2022

Circumventing interpretability: How to defeat mind-readers.

Lee Sharkey

CoRR, 2022

Interpreting Neural Networks through the Polytope Lens.

CoRR, 2022

Goal Misgeneralization in Deep Reinforcement Learning.

Lauro Langosco di Langosco

Proceedings of the International Conference on Machine Learning, 2022

2021

Objective Robustness in Deep Reinforcement Learning.

CoRR, 2021
