Neel Nanda
According to our database, Neel Nanda authored at least 22 papers between 2021 and 2024.
Bibliography
2024
CoRR, 2024
2023
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching.
CoRR, 2023
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods.
CoRR, 2023
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla.
CoRR, 2023
N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models.
CoRR, 2023
A Toy Model of Universality: Reverse Engineering how Networks Learn Group Operations.
Proceedings of the International Conference on Machine Learning, 2023
Proceedings of the Eleventh International Conference on Learning Representations, 2023
Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2023
2022
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.
CoRR, 2022
Proceedings of the FAccT '22: 2022 ACM Conference on Fairness, Accountability, and Transparency, Seoul, Republic of Korea, June 21, 2022
2021