Stefan Heimersheim

According to our database¹, Stefan Heimersheim authored at least 18 papers between 2023 and 2026.

Collaborative distances:

Dijkstra number² of four.
Erdős number³ of four.

Bibliography

2026

The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes.

CoRR, February, 2026

Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution.

CoRR, February, 2026

2025

SCALAR: Benchmarking SAE Interaction Sparsity in Toy LLMs.

CoRR, November, 2025

Benchmarking Deception Probes via Black-to-White Performance Boosts.

Avi Parrack

Carlo Leonardo Attubato

Stefan Heimersheim

CoRR, July, 2025

Transformers Don't Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and the Implications for Mechanistic Interpretability.

CoRR, July, 2025

Detecting Strategic Deception Using Linear Probes.

Nicholas Goldowsky-Dill

Bilal Chughtai

Stefan Heimersheim

Marius Hobbhahn

CoRR, February, 2025

Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition.

CoRR, January, 2025

Open Problems in Mechanistic Interpretability.

Trans. Mach. Learn. Res., 2025

Detecting Strategic Deception with Linear Probes.

Nicholas Goldowsky-Dill

Bilal Chughtai

Stefan Heimersheim

Marius Hobbhahn

Proceedings of the Forty-second International Conference on Machine Learning, 2025

2024

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs.

Daniel J. Lee

Stefan Heimersheim

CoRR, 2024

Evolution of SAE Features Across Layers in LLMs.

CoRR, 2024

Characterizing stable regions in the residual stream of LLMs.

CoRR, 2024

Evaluating Synthetic Activations composed of SAE Latents in GPT-2.

CoRR, 2024

You can remove GPT2's LayerNorm by fine-tuning.

Stefan Heimersheim

CoRR, 2024

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks.

Lucius Bushnaq

Stefan Heimersheim

Nicholas Goldowsky-Dill

CoRR, 2024

Using Degeneracy in the Loss Landscape for Mechanistic Interpretability.

Nicholas Goldowsky-Dill

Kaarel Hänni

Cindy Wu

Marius Hobbhahn

CoRR, 2024

How to use and interpret activation patching.

Stefan Heimersheim

Neel Nanda

CoRR, 2024

2023

Towards Automated Circuit Discovery for Mechanistic Interpretability.

Arthur Conmy

Augustine N. Mavor-Parker

Aengus Lynch

Stefan Heimersheim

Adrià Garriga-Alonso

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023
