Can Rager

According to our database, Can Rager authored at least 14 papers between 2023 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
Discovering Forbidden Topics in Language Models.
CoRR, May, 2025

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability.
CoRR, March, 2025

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024
Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks.
CoRR, 2024

The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability.
CoRR, 2024

NNsight and NDIF: Democratizing Access to Foundation Model Internals.
CoRR, 2024

Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

2023
Structured World Representations in Maze-Solving Transformers.
CoRR, 2023

Attribution Patching Outperforms Automated Circuit Discovery.
CoRR, 2023

An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l.
CoRR, 2023

A Configurable Library for Generating and Manipulating Maze Datasets.
CoRR, 2023

Safety of self-assembled neuromorphic hardware.
CoRR, 2023

Linearly Structured World Representations in Maze-Solving Transformers.
Proceedings of UniReps: the First Workshop on Unifying Representations in Neural Models, 2023

