Arthur Conmy

According to our database¹, Arthur Conmy authored at least 32 papers between 2022 and 2026.

Collaborative distances:

Dijkstra number² of four.
Erdős number³ of three.

Bibliography

2026

How do LLMs Compute Verbal Confidence.

CoRR, March, 2026

Automatically Finding Reward Model Biases.

Atticus Wang

Iván Arcuschin

Arthur Conmy

CoRR, February, 2026

Simple LLM Baselines are Competitive for Model Diffing.

CoRR, February, 2026

Fluid Representations in Reasoning Models.

CoRR, February, 2026

Building Production-Ready Probes For Gemini.

CoRR, January, 2026

2025

Base Models Know How to Reason, Thinking Models Learn When.

CoRR, October, 2025

Eliciting Secret Knowledge from Language Models.

Bartosz Cywinski

Emil Ryd

Rowan Wang

Senthooran Rajamanoharan

Neel Nanda

Arthur Conmy

Samuel Marks

CoRR, October, 2025

Thought Anchors: Which LLM Reasoning Steps Matter?

CoRR, June, 2025

Understanding Reasoning in Thinking Language Models via Steering Vectors.

CoRR, June, 2025

Line of Sight: On Linear Representations in VLLMs.

CoRR, June, 2025

Interpreting Large Text-to-Image Diffusion Models with Dictionary Learning.

CoRR, May, 2025

Scaling sparse feature circuit finding for in-context learning.

CoRR, April, 2025

An Approach to Technical AGI Safety and Security.

CoRR, April, 2025

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability.

CoRR, March, 2025

Chain-of-Thought Reasoning In The Wild Is Not Always Faithful.

Iván Arcuschin

Jett Janiak

Robert Krzyzanowski

Senthooran Rajamanoharan

Neel Nanda

Arthur Conmy

CoRR, March, 2025

Open Problems in Mechanistic Interpretability.

Trans. Mach. Learn. Res., 2025

Scaling Sparse Feature Circuits For Studying In-Context Learning.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

2024

Improving Steering Vectors by Targeting Sparse Autoencoder Features.

Sviatoslav Chalnev

Matthew Siu

Arthur Conmy

CoRR, 2024

Applying sparse autoencoders to unlearn knowledge in language models.

Eoin Farrell

Yeu-Tong Lau

Arthur Conmy

CoRR, 2024

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2.

Tom Lieberum

Senthooran Rajamanoharan

CoRR, 2024

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders.

Senthooran Rajamanoharan

CoRR, 2024

Interpreting Attention Layer Outputs with Sparse Autoencoders.

CoRR, 2024

Improving Dictionary Learning with Gated Sparse Autoencoders.

Senthooran Rajamanoharan

CoRR, 2024

Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders.

Senthooran Rajamanoharan

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

Stealing part of a production language model.

Nicholas Carlini

Daniel Paleka

Krishnamurthy Dj Dvijotham

Proceedings of the Forty-first International Conference on Machine Learning, 2024

Successor Heads: Recurring, Interpretable Attention Heads In The Wild.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

2023

Attribution Patching Outperforms Automated Circuit Discovery.

Aaquib Syed

Can Rager

Arthur Conmy

CoRR, 2023

Copy Suppression: Comprehensively Understanding an Attention Head.

CoRR, 2023

Towards Automated Circuit Discovery for Mechanistic Interpretability.

Arthur Conmy

Augustine N. Mavor-Parker

Aengus Lynch

Stefan Heimersheim

Adrià Garriga-Alonso

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small.

Proceedings of the Eleventh International Conference on Learning Representations, 2023

2022

Stylegan-Induced Data-Driven Regularization for Inverse Problems.

Arthur Conmy

Subhadip Mukherjee

Carola-Bibiane Schönlieb

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2022
