Atticus Geiger

According to our database, Atticus Geiger authored at least 49 papers between 2018 and 2026.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2026
Do Sparse Autoencoders Capture Concept Manifolds?
CoRR, April, 2026

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought.
CoRR, March, 2026

Surgical Activation Steering via Generative Causal Mediation.
CoRR, February, 2026

From Directions to Regions: Decomposing Activations in Language Models via Local Geometry.
CoRR, February, 2026

The Shape of Beliefs: Geometry, Dynamics, and Interventions along Representation Manifolds of Language Models' Posteriors.
CoRR, February, 2026

2025
Are language models aware of the road not taken? Token-level uncertainty and hidden state dynamics.
CoRR, November, 2025

Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context.
CoRR, October, 2025

How Causal Abstraction Underpins Computational Explanation.
CoRR, August, 2025

Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization.
CoRR, June, 2025

HyperSteer: Activation Steering at Scale with Hypernetworks.
CoRR, June, 2025

Language Models use Lookbacks to Track Beliefs.
CoRR, May, 2025

Open Problems in Mechanistic Interpretability.
Trans. Mach. Learn. Res., 2025

Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability.
J. Mach. Learn. Res., 2025

How Do Transformers Learn Variable Binding in Symbolic Programs?
Proceedings of the Forty-second International Conference on Machine Learning, 2025

AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders.
Proceedings of the Forty-second International Conference on Machine Learning, 2025


HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Combining Causal Models for More Accurate Abstractions of Neural Networks.
Proceedings of the Conference on Causal Learning and Reasoning, Lausanne, Switzerland, 7-9 May 2025

Enhancing Automated Interpretability with Output-Centric Feature Descriptions.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

2024
Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small.
CoRR, 2024

Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations.
CoRR, 2024

A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments.
CoRR, 2024

ReFT: Representation Finetuning for Language Models.
Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions.
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations, 2024

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Updating CLIP to Prefer Descriptions Over Captions.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations.
Proceedings of the Conference on Causal Learning and Reasoning, 2024

RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

2023
Linear Representations of Sentiment in Large Language Models.
CoRR, 2023

Interpretability at Scale: Identifying Causal Mechanisms in Alpaca.
CoRR, 2023

Causal Abstraction for Faithful Model Interpretation.
CoRR, 2023

Interpretability at Scale: Identifying Causal Mechanisms in Alpaca.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Causal Proxy Models for Concept-based Model Explanations.
Proceedings of the International Conference on Machine Learning, 2023

A Semantics for Causing, Enabling, and Preventing Verbs Using Structural Causal Models.
Proceedings of the 45th Annual Meeting of the Cognitive Science Society, 2023

Causal Abstraction with Soft Interventions.
Proceedings of the Conference on Causal Learning and Reasoning, 2023

Rigorously Assessing Natural Language Explanations of Neurons.
Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2023

ScoNe: Benchmarking Negation Reasoning in Language Models With Fine-Tuning and In-Context Learning.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2023

2022
CEBaB: Estimating the Causal Effects of Real-World Concepts on NLP Model Behavior.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Causal Distillation for Language Models.
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022

Inducing Causal Structure for Interpretable Neural Networks.
Proceedings of the International Conference on Machine Learning, 2022

2021
Causal Abstractions of Neural Networks.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Dynabench: Rethinking Benchmarking in NLP.
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021

DynaSent: A Dynamic Benchmark for Sentiment Analysis.
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021

2020
Modular Representation Underlies Systematic Generalization in Neural Natural Language Inference Models.
CoRR, 2020

Relational reasoning and generalization using non-symbolic neural networks.
Proceedings of the 42nd Annual Meeting of the Cognitive Science Society, 2020

Neural Natural Language Inference Models Partially Embed Theories of Lexical Entailment and Negation.
Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2020

2019
Recursive Routing Networks: Learning to Compose Modules for Language Understanding.
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019

Posing Fair Generalization Tasks for Natural Language Inference.
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019

2018
Stress-Testing Neural Models of Natural Language Inference with Multiply-Quantified Sentences.
CoRR, 2018

