Neel Nanda

According to our database, Neel Nanda authored at least 58 papers between 2021 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning.
CoRR, July, 2025

Reasoning-Finetuning Repurposes Latent Representations in Base Models.
CoRR, July, 2025

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety.
CoRR, July, 2025

Simple Mechanistic Explanations for Out-Of-Context Reasoning.
CoRR, July, 2025

Thought Anchors: Which LLM Reasoning Steps Matter?
CoRR, June, 2025

Understanding Reasoning in Thinking Language Models via Steering Vectors.
CoRR, June, 2025

Because we have LLMs, we Can and Should Pursue Agentic Interpretability.
CoRR, June, 2025

How Visual Representations Map to Language Feature Space in Multimodal LLMs.
CoRR, June, 2025

Convergent Linear Representations of Emergent Misalignment.
CoRR, June, 2025

Model Organisms for Emergent Misalignment.
CoRR, June, 2025

Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models.
CoRR, May, 2025

Towards eliciting latent knowledge from LLMs with mechanistic interpretability.
CoRR, May, 2025

Scaling sparse feature circuit finding for in-context learning.
CoRR, April, 2025

Robustly identifying concepts introduced during chat fine-tuning using crosscoders.
CoRR, April, 2025

An Approach to Technical AGI Safety and Security.
CoRR, April, 2025

Learning Multi-Level Features with Matryoshka Sparse Autoencoders.
CoRR, March, 2025

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability.
CoRR, March, 2025

Chain-of-Thought Reasoning In The Wild Is Not Always Faithful.
CoRR, March, 2025

Are Sparse Autoencoders Useful? A Case Study in Sparse Probing.
CoRR, February, 2025

Open Problems in Mechanistic Interpretability.
CoRR, January, 2025

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Sparse Autoencoders Do Not Find Canonical Units of Analysis.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024
Universal Neurons in GPT2 Language Models.
Trans. Mach. Learn. Res., 2024

BatchTopK Sparse Autoencoders.
CoRR, 2024

Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks.
CoRR, 2024

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2.
CoRR, 2024

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders.
CoRR, 2024

Interpreting Attention Layer Outputs with Sparse Autoencoders.
CoRR, 2024

Refusal in Language Models Is Mediated by a Single Direction.
CoRR, 2024

Improving Dictionary Learning with Gated Sparse Autoencoders.
CoRR, 2024

How to use and interpret activation patching.
CoRR, 2024

AtP*: An efficient and scalable method for localizing LLM behaviour to components.
CoRR, 2024

Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs.
CoRR, 2024

Confidence Regulation Neurons in Language Models.
Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders.
Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Transcoders find interpretable LLM feature circuits.
Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Refusal in Language Models Is Mediated by a Single Direction.
Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Explorations of Self-Repair in Language Models.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing.
Trans. Mach. Learn. Res., 2023

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching.
CoRR, 2023

Training Dynamics of Contextual N-Grams in Language Models.
CoRR, 2023

Linear Representations of Sentiment in Large Language Models.
CoRR, 2023

Copy Suppression: Comprehensively Understanding an Attention Head.
CoRR, 2023

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla.
CoRR, 2023

Neuron to Graph: Interpreting Language Model Neurons at Scale.
CoRR, 2023

N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models.
CoRR, 2023

A Toy Model of Universality: Reverse Engineering how Networks Learn Group Operations.
Proceedings of the Fortieth International Conference on Machine Learning, 2023

Progress measures for grokking via mechanistic interpretability.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

Emergent Linear Representations in World Models of Self-Supervised Sequence Models.
Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2023

2022
Fully General Online Imitation Learning.
J. Mach. Learn. Res., 2022

In-context Learning and Induction Heads.
CoRR, 2022

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.
CoRR, 2022

Predictability and Surprise in Large Generative Models.
CoRR, 2022


2021
An Empirical Investigation of Learning from Biased Toxicity Labels.
CoRR, 2021

