Neel Nanda

According to our database, Neel Nanda authored at least 22 papers between 2021 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.


Bibliography

2024
AtP*: An efficient and scalable method for localizing LLM behaviour to components.
CoRR, 2024

Explorations of Self-Repair in Language Models.
CoRR, 2024

Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs.
CoRR, 2024

Universal Neurons in GPT2 Language Models.
CoRR, 2024

2023
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching.
CoRR, 2023

Training Dynamics of Contextual N-Grams in Language Models.
CoRR, 2023

Linear Representations of Sentiment in Large Language Models.
CoRR, 2023

Copy Suppression: Comprehensively Understanding an Attention Head.
CoRR, 2023

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods.
CoRR, 2023

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla.
CoRR, 2023

Neuron to Graph: Interpreting Language Model Neurons at Scale.
CoRR, 2023

Finding Neurons in a Haystack: Case Studies with Sparse Probing.
CoRR, 2023

N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models.
CoRR, 2023

A Toy Model of Universality: Reverse Engineering how Networks Learn Group Operations.
Proceedings of the International Conference on Machine Learning, 2023

Progress measures for grokking via mechanistic interpretability.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

Emergent Linear Representations in World Models of Self-Supervised Sequence Models.
Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2023

2022
Fully General Online Imitation Learning.
J. Mach. Learn. Res., 2022

In-context Learning and Induction Heads.
CoRR, 2022

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.
CoRR, 2022

Predictability and Surprise in Large Generative Models.
CoRR, 2022
2021
An Empirical Investigation of Learning from Biased Toxicity Labels.
CoRR, 2021