Adrià Garriga-Alonso

According to our database¹, Adrià Garriga-Alonso authored at least 25 papers between 2019 and 2025.

Collaborative distances:

Dijkstra number² of four.
Erdős number³ of three.

Bibliography

2025

Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders.

David Chanin

Adrià Garriga-Alonso

CoRR, August, 2025

Interpreting learned search: finding a transition model and value function in an RNN that plays Sokoban.

CoRR, June, 2025

Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders.

David Chanin

Tomás Dulka

Adrià Garriga-Alonso

CoRR, May, 2025

Among Us: A Sandbox for Agentic Deception.

Satvik Golechha

Adrià Garriga-Alonso

CoRR, April, 2025

Open Problems in Mechanistic Interpretability.

Trans. Mach. Learn. Res., 2025

Interpreting Emergent Planning in Model-Free Reinforcement Learning.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024

Planning behavior in a recurrent neural network that plays Sokoban.

Adrià Garriga-Alonso

Mohammad Taufeeque

Adam Gleave

CoRR, 2024

Adversarial Circuit Evaluation.

Niels uit de Bos

Adrià Garriga-Alonso

CoRR, 2024

Investigating the Indirect Object Identification circuit in Mamba.

Danielle Ensign

Adrià Garriga-Alonso

CoRR, 2024

Analyzing the Generalization and Reliability of Steering Vectors.

CoRR, 2024

Analysing the Generalisation and Reliability of Steering Vectors.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Hypothesis Testing the Circuit Hypothesis in LLMs.

Claudia Shi

Nicolas Beltran-Velez

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification.

Thomas Kwa

Drake Thomas

Adrià Garriga-Alonso

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques.

Rohan Gupta

Iván Arcuschin Moreno

Thomas Kwa

Adrià Garriga-Alonso

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

2023

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models.

Bartlomiej Bojanowski

Christopher D. Manning

Daniel Moseguí González

Eunice Engefu Manyasi

Evgenii Zheltonozhskii

Fanyue Xia

Fatemeh Siar

Fernando Martínez-Plumed

Giambattista Parascandolo

Giorgio Mariani

Gloria Wang

Gonzalo Jaimovitch-López

Jaime Fernández Fisac

Jascha Sohl-Dickstein

José Hernández-Orallo

Karthik Gopalakrishnan

Lidia Contreras Ochando

Louis-Philippe Morency

María José Ramírez-Quintana

Michael I. Ivanitskiy

Neta Gur-Ari Krakover

Nitish Shirish Keskar

Pablo Antonio Moreno Casares

Pegah Alipoormolabashi

Shyamolima (Shammie) Debnath

Sneha Priscilla Makini

Yadollah Yaghoobzadeh

Trans. Mach. Learn. Res., 2023

Towards Automated Circuit Discovery for Mechanistic Interpretability.

Arthur Conmy

Augustine N. Mavor-Parker

Aengus Lynch

Stefan Heimersheim

Adrià Garriga-Alonso

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

2022

Data augmentation in Bayesian neural networks and the cold posterior effect.

Proceedings of the Uncertainty in Artificial Intelligence, 2022

Bayesian Neural Network Priors Revisited.

Proceedings of the Tenth International Conference on Learning Representations, 2022

2021

BNNpriors: A library for Bayesian neural network inference with different prior distributions.

Softw. Impacts, 2021

BNNpriors: A library for Bayesian neural network inference with different prior distributions.

CoRR, 2021

Bayesian Neural Network Priors Revisited.

CoRR, 2021

Exact Langevin Dynamics with Stochastic Gradients.

Adrià Garriga-Alonso

Vincent Fortuin

CoRR, 2021

Correlated weights in infinite limits of deep convolutional neural networks.

Adrià Garriga-Alonso

Mark van der Wilk

Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, 2021

2020

Understanding Variational Inference in Function-Space.

CoRR, 2020

2019

Deep Convolutional Networks as shallow Gaussian Processes.

Adrià Garriga-Alonso

Carl Edward Rasmussen

Laurence Aitchison

Proceedings of the 7th International Conference on Learning Representations, 2019
