David Krueger

ORCID: 0000-0001-7256-0937

Affiliations:
  • University of Cambridge, UK
  • University of Montréal, MILA, Canada (former)


According to our database, David Krueger authored at least 88 papers between 2015 and 2025.


Bibliography

2025
Rethinking Safety in LLM Fine-tuning: An Optimization Perspective.
CoRR, August, 2025

How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations.
CoRR, August, 2025

Mitigating Goal Misgeneralization via Minimax Regret.
CoRR, July, 2025

Distributional Training Data Attribution.
CoRR, June, 2025

Detecting High-Stakes Interactions with Activation Probes.
CoRR, June, 2025

Understanding (Un)Reliability of Steering Vectors in Language Models.
CoRR, May, 2025

From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization.
CoRR, May, 2025

Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models.
CoRR, February, 2025

Pitfalls of Evidence-Based AI Policy.
CoRR, February, 2025

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development.
CoRR, January, 2025

Open Problems in Machine Unlearning for AI Safety.
CoRR, January, 2025

Understanding In-Context Learning of Linear Models in Transformers Through an Adversarial Lens.
Trans. Mach. Learn. Res., 2025

Analyzing (In)Abilities of SAEs via Formal Languages.
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025

Input Space Mode Connectivity in Deep Neural Networks.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Towards Interpreting Visual Information Processing in Vision-Language Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Influence Functions for Scalable Data Attribution in Diffusion Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Interpreting Emergent Planning in Model-Free Reinforcement Learning.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Protecting against simultaneous data poisoning attacks.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024
Blockwise Self-Supervised Learning at Scale.
Trans. Mach. Learn. Res., 2024

Foundational Challenges in Assuring Alignment and Safety of Large Language Models.
Trans. Mach. Learn. Res., 2024

Learning to Forget using Hypernetworks.
CoRR, 2024

Comparing Bottom-Up and Top-Down Steering Approaches on In-Context Learning Tasks.
CoRR, 2024

Adversarial Robustness of In-Context Learning in Transformers for Linear Regression.
CoRR, 2024

Noisy Zero-Shot Coordination: Breaking The Common Knowledge Assumption In Zero-Shot Coordination Games.
CoRR, 2024

Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders.
CoRR, 2024

Integrating uncertainty quantification into randomized smoothing based robustness guarantees.
CoRR, 2024

Towards Reliable Evaluation of Behavior Steering Interventions in LLMs.
CoRR, 2024

PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning.
CoRR, 2024

Exploring the design space of deep-learning-based weather forecasting systems.
CoRR, 2024

Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models.
CoRR, 2024

Permissive Information-Flow Analysis for Large Language Models.
CoRR, 2024

A deeper look at depth pruning of LLMs.
CoRR, 2024

The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret.
CoRR, 2024

Affirmative safety: An approach to risk management for high-risk AI.
CoRR, 2024

IDs for AI Systems.
CoRR, 2024

Foundational Challenges in Assuring Alignment and Safety of Large Language Models.
CoRR, 2024

Safety Cases: How to Justify the Safety of Advanced AI Systems.
CoRR, 2024

Interpreting Learned Feedback Patterns in Large Language Models.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Stress-Testing Capability Elicitation With Password-Locked Models.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Predicting Future Actions of Reinforcement Learning Agents.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

A Generative Model of Symmetry Transformations.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Implicit meta-learning may lead language models to trust more reliable sources.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Reward Model Ensembles Help Mitigate Overoptimization.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Visibility into AI Agents.
Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 2024

Implicitly Bayesian Prediction Rules in Deep Learning.
Proceedings of the Symposium on Advances in Approximate Bayesian Inference, 2024

2023
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.
Trans. Mach. Learn. Res., 2023

Hazards from Increasingly Accessible Fine-Tuning of Downloadable Foundation Models.
CoRR, 2023

Managing AI Risks in an Era of Rapid Progress.
CoRR, 2023

Meta- (out-of-context) learning in neural networks.
CoRR, 2023

Investigating the Nature of 3D Generalization in Deep Neural Networks.
CoRR, 2023

Unifying Grokking and Double Descent.
CoRR, 2023

On The Fragility of Learned Reward Functions.
CoRR, 2023

What Mechanisms Does Knowledge Distillation Distill?
Proceedings of UniReps: the First Workshop on Unifying Representations in Neural Models, 2023

Thinker: Learning to Plan and Act.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Mechanistic Mode Connectivity.
Proceedings of the International Conference on Machine Learning, 2023

Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

Broken Neural Scaling Laws.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

Characterizing Manipulation from AI Systems.
Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, 2023

2022
Domain Generalization for Robust Model-Based Offline Reinforcement Learning.
CoRR, 2022

Towards Out-of-Distribution Adversarial Robustness.
CoRR, 2022

Defining and Characterizing Reward Hacking.
CoRR, 2022

Defining and Characterizing Reward Gaming.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Goal Misgeneralization in Deep Reinforcement Learning.
Proceedings of the International Conference on Machine Learning, 2022

2021
Multi-Domain Balanced Sampling Improves Out-of-Distribution Generalization of Chest X-ray Pathology Prediction Models.
CoRR, 2021

Filling gaps in trustworthy development of AI.
CoRR, 2021

Out-of-Distribution Generalization via Risk Extrapolation (REx).
Proceedings of the 38th International Conference on Machine Learning, 2021

2020
Active Reinforcement Learning: Observing Rewards at a Cost.
CoRR, 2020

Hidden Incentives for Auto-Induced Distributional Shift.
CoRR, 2020

AI Research Considerations for Human Existential Safety (ARCHES).
CoRR, 2020

Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims.
CoRR, 2020

Out-of-Distribution Generalization via Risk Extrapolation (REx).
CoRR, 2020

2018
Scalable agent alignment via reward modeling: a research direction.
CoRR, 2018

Uncertainty in Multitask Transfer Learning.
CoRR, 2018

Neural Autoregressive Flows.
Proceedings of the 35th International Conference on Machine Learning, 2018

2017
Deep Prior.
CoRR, 2017

Bayesian Hypernetworks.
CoRR, 2017

A Closer Look at Memorization in Deep Networks.
Proceedings of the 34th International Conference on Machine Learning, 2017

Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations.
Proceedings of the 5th International Conference on Learning Representations, 2017

Deep Nets Don't Learn via Memorization.
Proceedings of the 5th International Conference on Learning Representations, 2017

Nested LSTMs.
Proceedings of The 9th Asian Conference on Machine Learning, 2017

2016
Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations.
CoRR, 2016

Regularizing RNNs by Stabilizing Activations.
Proceedings of the 4th International Conference on Learning Representations, 2016

2015
Zero-bias autoencoders and the benefits of co-adapting features.
Proceedings of the 3rd International Conference on Learning Representations, 2015

NICE: Non-linear Independent Components Estimation.
Proceedings of the 3rd International Conference on Learning Representations, 2015

Testing Visual Attention in Dynamic Environments.
CoRR, 2015
