Neel Nanda

According to our database, Neel Nanda authored at least 22 papers between 2021 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.


Bibliography

2024
AtP*: An efficient and scalable method for localizing LLM behaviour to components.
CoRR, 2024

Explorations of Self-Repair in Language Models.
CoRR, 2024

Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs.
CoRR, 2024

Universal Neurons in GPT2 Language Models.
CoRR, 2024

2023
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching.
CoRR, 2023

Training Dynamics of Contextual N-Grams in Language Models.
CoRR, 2023

Linear Representations of Sentiment in Large Language Models.
CoRR, 2023

Copy Suppression: Comprehensively Understanding an Attention Head.
CoRR, 2023

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods.
CoRR, 2023

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla.
CoRR, 2023

Neuron to Graph: Interpreting Language Model Neurons at Scale.
CoRR, 2023

Finding Neurons in a Haystack: Case Studies with Sparse Probing.
CoRR, 2023

N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models.
CoRR, 2023

A Toy Model of Universality: Reverse Engineering how Networks Learn Group Operations.
Proceedings of the International Conference on Machine Learning, 2023

Progress measures for grokking via mechanistic interpretability.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

Emergent Linear Representations in World Models of Self-Supervised Sequence Models.
Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2023

2022
Fully General Online Imitation Learning.
J. Mach. Learn. Res., 2022

In-context Learning and Induction Heads.
CoRR, 2022

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.
CoRR, 2022

Predictability and Surprise in Large Generative Models.
CoRR, 2022
2021
An Empirical Investigation of Learning from Biased Toxicity Labels.
CoRR, 2021