Dylan Hadfield-Menell

Affiliations:
  • University of California, Berkeley, USA


According to our database, Dylan Hadfield-Menell authored at least 58 papers between 2013 and 2024.

Bibliography

2024
Defending Against Unforeseen Failure Modes with Latent Adversarial Training.
CoRR, 2024

Eight Methods to Evaluate Robust Unlearning in LLMs.
CoRR, 2024

Black-Box Access is Insufficient for Rigorous AI Audits.
CoRR, 2024

2023
Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF.
CoRR, 2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.
CoRR, 2023

Measuring the Success of Diffusion Models at Imitating Human Artists.
CoRR, 2023

Explore, Establish, Exploit: Red Teaming Language Models from Scratch.
CoRR, 2023

Benchmarking Interpretability Tools for Deep Neural Networks.
CoRR, 2023

Recommending to Strategic Users.
CoRR, 2023

Red Teaming Deep Neural Networks with Feature Synthesis Tools.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

Get It in Writing: Formal Contracts Mitigate Social Dilemmas in Multi-Agent RL.
Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, 2023

White-Box Adversarial Policies in Deep Reinforcement Learning.
Proceedings of the Workshop on Artificial Intelligence Safety 2023 (SafeAI 2023) co-located with the Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI 2023), 2023

2022
Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks.
CoRR, 2022

Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks.
CoRR, 2022

Building Human Values into Recommender Systems: An Interdisciplinary Synthesis.
CoRR, 2022

How to talk so your robot will learn: Instructions, descriptions, and pragmatics.
CoRR, 2022

Linguistic communication as (inverse) reward design.
CoRR, 2022

Towards Psychologically-Grounded Dynamic Preference Models.
Proceedings of the RecSys '22: Sixteenth ACM Conference on Recommender Systems, 2022

How to talk so AI will learn: Instructions, descriptions, and autonomy.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Robust Feature-Level Adversaries are Interpretability Tools.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Estimating and Penalizing Induced Preference Shifts in Recommender Systems.
Proceedings of the International Conference on Machine Learning, 2022

A Penalty Default Approach to Preemptive Harm Disclosure and Mitigation for AI Systems.
Proceedings of the AIES '22: AAAI/ACM Conference on AI, Ethics, and Society, 2022

2021
When Curation Becomes Creation: Algorithms, microcontent, and the vanishing distinction between platforms and creators.
ACM Queue, 2021

What are you optimizing for? Aligning Recommender Systems with Human Values.
CoRR, 2021

When curation becomes creation.
Commun. ACM, 2021

Estimating and Penalizing Preference Shift in Recommender Systems.
Proceedings of the RecSys '21: Fifteenth ACM Conference on Recommender Systems, 2021

Guided Imitation of Task and Motion Planning.
Proceedings of the Conference on Robot Learning, 2021

2020
Multi-Principal Assistance Games: Definition and Collegial Mechanisms.
CoRR, 2020

Multi-Principal Assistance Games.
CoRR, 2020

Consequences of Misaligned AI.
Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020

Silly Rules Improve the Capacity of Agents to Learn Stable Enforcement and Compliance Behaviors.
Proceedings of the 19th International Conference on Autonomous Agents and Multiagent Systems, 2020

Conservative Agency via Attainable Utility Preservation.
Proceedings of the AIES '20: AAAI/ACM Conference on AI, Ethics, and Society, 2020

2019
An Extensible Interactive Interface for Agent Design.
CoRR, 2019

Adversarial Training with Voronoi Constraints.
CoRR, 2019

Conservative Agency.
Proceedings of the Workshop on Artificial Intelligence Safety 2019 co-located with the 28th International Joint Conference on Artificial Intelligence, 2019

On the Utility of Model Learning in HRI.
Proceedings of the 14th ACM/IEEE International Conference on Human-Robot Interaction, 2019

The Assistive Multi-Armed Bandit.
Proceedings of the 14th ACM/IEEE International Conference on Human-Robot Interaction, 2019

Human-AI Learning Performance in Multi-Armed Bandits.
Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 2019

Incomplete Contracting and AI Alignment.
Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 2019

Legible Normativity for AI Alignment: The Value of Silly Rules.
Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 2019

2018
On the Geometry of Adversarial Examples.
CoRR, 2018

Active Inverse Reward Design.
CoRR, 2018

Simplifying Reward Design through Divide-and-Conquer.
Proceedings of the Robotics: Science and Systems XIV, 2018

An Efficient, Generalized Bellman Update For Cooperative Inverse Reinforcement Learning.
Proceedings of the 35th International Conference on Machine Learning, 2018

2017
Inverse Reward Design.
Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 2017

Pragmatic-Pedagogic Value Alignment.
Proceedings of the Robotics Research, The 18th International Symposium, 2017

Should Robots be Obedient?
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, 2017

Expressive Robot Motion Timing.
Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction, 2017

The Off-Switch Game.
Proceedings of the Workshops of the The Thirty-First AAAI Conference on Artificial Intelligence, 2017

2016
Cooperative Inverse Reinforcement Learning.
Proceedings of the Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, 2016

Sequential quadratic programming for task plan optimization.
Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2016

Guided search for task and motion plans using learned heuristics.
Proceedings of the 2016 IEEE International Conference on Robotics and Automation, 2016

2015
Multitasking: Optimal Planning for Bandit Superprocesses.
Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, 2015

Modular task and motion planning in belief space.
Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2015

Beyond lowest-warping cost action selection in trajectory transfer.
Proceedings of the IEEE International Conference on Robotics and Automation, 2015

2014
Unifying scene registration and trajectory optimization for learning from demonstrations with application to manipulation of deformable objects.
Proceedings of the 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2014

2013
Optimization in the now: Dynamic peephole optimization for hierarchical planning.
Proceedings of the 2013 IEEE International Conference on Robotics and Automation, 2013
