Stefan Heimersheim

According to our database, Stefan Heimersheim authored at least 18 papers between 2023 and 2026.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2026
The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes.
CoRR, February, 2026

Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution.
CoRR, February, 2026

2025
SCALAR: Benchmarking SAE Interaction Sparsity in Toy LLMs.
CoRR, November, 2025

Benchmarking Deception Probes via Black-to-White Performance Boosts.
CoRR, July, 2025

Transformers Don't Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and the Implications for Mechanistic Interpretability.
CoRR, July, 2025

Detecting Strategic Deception Using Linear Probes.
CoRR, February, 2025

Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition.
CoRR, January, 2025

Open Problems in Mechanistic Interpretability.
Trans. Mach. Learn. Res., 2025

Detecting Strategic Deception with Linear Probes.
Proceedings of the Forty-second International Conference on Machine Learning, 2025

2024
Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs.
CoRR, 2024

Evolution of SAE Features Across Layers in LLMs.
CoRR, 2024

Characterizing stable regions in the residual stream of LLMs.
CoRR, 2024

Evaluating Synthetic Activations composed of SAE Latents in GPT-2.
CoRR, 2024

You can remove GPT2's LayerNorm by fine-tuning.
CoRR, 2024

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks.
CoRR, 2024

Using Degeneracy in the Loss Landscape for Mechanistic Interpretability.
CoRR, 2024

How to use and interpret activation patching.
CoRR, 2024

2023
Towards Automated Circuit Discovery for Mechanistic Interpretability.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

