David Krueger

ORCID: 0000-0001-7256-0937

Affiliations:
  • University of Cambridge, UK
  • University of Montréal, MILA, Canada (former)


According to our database, David Krueger authored at least 88 papers between 2015 and 2025.


Bibliography

2025
Rethinking Safety in LLM Fine-tuning: An Optimization Perspective.
CoRR, August, 2025

How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations.
CoRR, August, 2025

Mitigating Goal Misgeneralization via Minimax Regret.
CoRR, July, 2025

Distributional Training Data Attribution.
CoRR, June, 2025

Detecting High-Stakes Interactions with Activation Probes.
CoRR, June, 2025

Understanding (Un)Reliability of Steering Vectors in Language Models.
CoRR, May, 2025

From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization.
CoRR, May, 2025

Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models.
CoRR, February, 2025

Pitfalls of Evidence-Based AI Policy.
CoRR, February, 2025

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development.
CoRR, January, 2025

Open Problems in Machine Unlearning for AI Safety.
CoRR, January, 2025

Understanding In-Context Learning of Linear Models in Transformers Through an Adversarial Lens.
Trans. Mach. Learn. Res., 2025

Analyzing (In)Abilities of SAEs via Formal Languages.
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025

Input Space Mode Connectivity in Deep Neural Networks.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Towards Interpreting Visual Information Processing in Vision-Language Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Influence Functions for Scalable Data Attribution in Diffusion Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Interpreting Emergent Planning in Model-Free Reinforcement Learning.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Protecting against simultaneous data poisoning attacks.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024
Blockwise Self-Supervised Learning at Scale.
Trans. Mach. Learn. Res., 2024

Foundational Challenges in Assuring Alignment and Safety of Large Language Models.
Trans. Mach. Learn. Res., 2024

Learning to Forget using Hypernetworks.
CoRR, 2024

Comparing Bottom-Up and Top-Down Steering Approaches on In-Context Learning Tasks.
CoRR, 2024

Adversarial Robustness of In-Context Learning in Transformers for Linear Regression.
CoRR, 2024

Noisy Zero-Shot Coordination: Breaking The Common Knowledge Assumption In Zero-Shot Coordination Games.
CoRR, 2024

Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders.
CoRR, 2024

Integrating uncertainty quantification into randomized smoothing based robustness guarantees.
CoRR, 2024

Towards Reliable Evaluation of Behavior Steering Interventions in LLMs.
CoRR, 2024

PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning.
CoRR, 2024

Exploring the design space of deep-learning-based weather forecasting systems.
CoRR, 2024

Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models.
CoRR, 2024

Permissive Information-Flow Analysis for Large Language Models.
CoRR, 2024

A deeper look at depth pruning of LLMs.
CoRR, 2024

The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret.
CoRR, 2024

Affirmative safety: An approach to risk management for high-risk AI.
CoRR, 2024

IDs for AI Systems.
CoRR, 2024

Foundational Challenges in Assuring Alignment and Safety of Large Language Models.
CoRR, 2024

Safety Cases: How to Justify the Safety of Advanced AI Systems.
CoRR, 2024

Interpreting Learned Feedback Patterns in Large Language Models.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Stress-Testing Capability Elicitation With Password-Locked Models.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Predicting Future Actions of Reinforcement Learning Agents.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

A Generative Model of Symmetry Transformations.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Implicit meta-learning may lead language models to trust more reliable sources.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Reward Model Ensembles Help Mitigate Overoptimization.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Visibility into AI Agents.
Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 2024

Implicitly Bayesian Prediction Rules in Deep Learning.
Proceedings of the Symposium on Advances in Approximate Bayesian Inference, 2024

2023
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.
Trans. Mach. Learn. Res., 2023

Hazards from Increasingly Accessible Fine-Tuning of Downloadable Foundation Models.
CoRR, 2023

Managing AI Risks in an Era of Rapid Progress.
CoRR, 2023

Meta- (out-of-context) learning in neural networks.
CoRR, 2023

Investigating the Nature of 3D Generalization in Deep Neural Networks.
CoRR, 2023

Unifying Grokking and Double Descent.
CoRR, 2023

On The Fragility of Learned Reward Functions.
CoRR, 2023

What Mechanisms Does Knowledge Distillation Distill?
Proceedings of UniReps: the First Workshop on Unifying Representations in Neural Models, 2023

Thinker: Learning to Plan and Act.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Mechanistic Mode Connectivity.
Proceedings of the International Conference on Machine Learning, 2023

Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

Broken Neural Scaling Laws.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

Characterizing Manipulation from AI Systems.
Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, 2023

2022
Domain Generalization for Robust Model-Based Offline Reinforcement Learning.
CoRR, 2022

Towards Out-of-Distribution Adversarial Robustness.
CoRR, 2022

Defining and Characterizing Reward Hacking.
CoRR, 2022

Defining and Characterizing Reward Gaming.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Goal Misgeneralization in Deep Reinforcement Learning.
Proceedings of the International Conference on Machine Learning, 2022

2021
Multi-Domain Balanced Sampling Improves Out-of-Distribution Generalization of Chest X-ray Pathology Prediction Models.
CoRR, 2021

Filling gaps in trustworthy development of AI.
CoRR, 2021

Out-of-Distribution Generalization via Risk Extrapolation (REx).
Proceedings of the 38th International Conference on Machine Learning, 2021

2020
Active Reinforcement Learning: Observing Rewards at a Cost.
CoRR, 2020

Hidden Incentives for Auto-Induced Distributional Shift.
CoRR, 2020

AI Research Considerations for Human Existential Safety (ARCHES).
CoRR, 2020

Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims.
CoRR, 2020

Out-of-Distribution Generalization via Risk Extrapolation (REx).
CoRR, 2020

2018
Scalable agent alignment via reward modeling: a research direction.
CoRR, 2018

Uncertainty in Multitask Transfer Learning.
CoRR, 2018

Neural Autoregressive Flows.
Proceedings of the 35th International Conference on Machine Learning, 2018

2017
Deep Prior.
CoRR, 2017

Bayesian Hypernetworks.
CoRR, 2017

A Closer Look at Memorization in Deep Networks.
Proceedings of the 34th International Conference on Machine Learning, 2017

Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations.
Proceedings of the 5th International Conference on Learning Representations, 2017

Deep Nets Don't Learn via Memorization.
Proceedings of the 5th International Conference on Learning Representations, 2017

Nested LSTMs.
Proceedings of The 9th Asian Conference on Machine Learning, 2017

2016
Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations.
CoRR, 2016

Regularizing RNNs by Stabilizing Activations.
Proceedings of the 4th International Conference on Learning Representations, 2016

2015
Zero-bias autoencoders and the benefits of co-adapting features.
Proceedings of the 3rd International Conference on Learning Representations, 2015

NICE: Non-linear Independent Components Estimation.
Proceedings of the 3rd International Conference on Learning Representations, 2015

Testing Visual Attention in Dynamic Environments.
CoRR, 2015
