Neel Nanda

According to our database, Neel Nanda authored at least 58 papers between 2021 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning.
CoRR, July, 2025

Reasoning-Finetuning Repurposes Latent Representations in Base Models.
CoRR, July, 2025

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety.
CoRR, July, 2025

Simple Mechanistic Explanations for Out-Of-Context Reasoning.
CoRR, July, 2025

Thought Anchors: Which LLM Reasoning Steps Matter?
CoRR, June, 2025

Understanding Reasoning in Thinking Language Models via Steering Vectors.
CoRR, June, 2025

Because we have LLMs, we Can and Should Pursue Agentic Interpretability.
CoRR, June, 2025

How Visual Representations Map to Language Feature Space in Multimodal LLMs.
CoRR, June, 2025

Convergent Linear Representations of Emergent Misalignment.
CoRR, June, 2025

Model Organisms for Emergent Misalignment.
CoRR, June, 2025

Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models.
CoRR, May, 2025

Towards eliciting latent knowledge from LLMs with mechanistic interpretability.
CoRR, May, 2025

Scaling sparse feature circuit finding for in-context learning.
CoRR, April, 2025

Robustly identifying concepts introduced during chat fine-tuning using crosscoders.
CoRR, April, 2025

An Approach to Technical AGI Safety and Security.
CoRR, April, 2025

Learning Multi-Level Features with Matryoshka Sparse Autoencoders.
CoRR, March, 2025

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability.
CoRR, March, 2025

Chain-of-Thought Reasoning In The Wild Is Not Always Faithful.
CoRR, March, 2025

Are Sparse Autoencoders Useful? A Case Study in Sparse Probing.
CoRR, February, 2025

Open Problems in Mechanistic Interpretability.
CoRR, January, 2025

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Sparse Autoencoders Do Not Find Canonical Units of Analysis.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024
Universal Neurons in GPT2 Language Models.
Trans. Mach. Learn. Res., 2024

BatchTopK Sparse Autoencoders.
CoRR, 2024

Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks.
CoRR, 2024

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2.
CoRR, 2024

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders.
CoRR, 2024

Interpreting Attention Layer Outputs with Sparse Autoencoders.
CoRR, 2024

Refusal in Language Models Is Mediated by a Single Direction.
CoRR, 2024

Improving Dictionary Learning with Gated Sparse Autoencoders.
CoRR, 2024

How to use and interpret activation patching.
CoRR, 2024

AtP*: An efficient and scalable method for localizing LLM behaviour to components.
CoRR, 2024

Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs.
CoRR, 2024

Confidence Regulation Neurons in Language Models.
Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders.
Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Transcoders find interpretable LLM feature circuits.
Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Refusal in Language Models Is Mediated by a Single Direction.
Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Explorations of Self-Repair in Language Models.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing.
Trans. Mach. Learn. Res., 2023

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching.
CoRR, 2023

Training Dynamics of Contextual N-Grams in Language Models.
CoRR, 2023

Linear Representations of Sentiment in Large Language Models.
CoRR, 2023

Copy Suppression: Comprehensively Understanding an Attention Head.
CoRR, 2023

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla.
CoRR, 2023

Neuron to Graph: Interpreting Language Model Neurons at Scale.
CoRR, 2023

N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models.
CoRR, 2023

A Toy Model of Universality: Reverse Engineering how Networks Learn Group Operations.
Proceedings of the Fortieth International Conference on Machine Learning, 2023

Progress measures for grokking via mechanistic interpretability.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

Emergent Linear Representations in World Models of Self-Supervised Sequence Models.
Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2023

2022
Fully General Online Imitation Learning.
J. Mach. Learn. Res., 2022

In-context Learning and Induction Heads.
CoRR, 2022

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.
CoRR, 2022

Predictability and Surprise in Large Generative Models.
CoRR, 2022


2021
An Empirical Investigation of Learning from Biased Toxicity Labels.
CoRR, 2021

