David Lindner

ORCID: 0000-0001-7051-7433

According to our database, David Lindner authored at least 31 papers between 2019 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety.
CoRR, July, 2025

Early Signs of Steganographic Capabilities in Frontier LLMs.
CoRR, July, 2025

Large language models can learn and generalize steganographic chain-of-thought under process supervision.
CoRR, June, 2025

Evaluating Frontier Models for Stealth and Situational Awareness.
CoRR, May, 2025

An Approach to Technical AGI Safety and Security.
CoRR, April, 2025

MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking.
CoRR, January, 2025

2024
MISR: Measuring Instrumental Self-Reasoning in Frontier Models.
CoRR, 2024

ViSTa Dataset: Do vision-language models understand sequential tasks?
CoRR, 2024

Mapping out the Space of Human Feedback for Reinforcement Learning: A Conceptual Framework.
CoRR, 2024

Towards evaluations-based safety cases for AI scheming.
CoRR, 2024

Evaluating Frontier Models for Dangerous Capabilities.
CoRR, 2024

On scalable oversight with weak LLMs judging strong LLMs.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Learning Safety Constraints from Demonstrations with Unknown Rewards.
Proceedings of the International Conference on Artificial Intelligence and Statistics, 2024

2023
GoSafeOpt: Scalable safe exploration for global optimization of dynamical systems.
Artif. Intell., July, 2023

Algorithmic Foundations for Safe and Efficient Reinforcement Learning from Human Feedback.
PhD thesis, 2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.
Trans. Mach. Learn. Res., 2023

RLHF-Blender: A Configurable Interactive Interface for Learning from Diverse Human Feedback.
CoRR, 2023

Tracr: Compiled Transformers as a Laboratory for Interpretability.
CoRR, 2023

Tracr: Compiled Transformers as a Laboratory for Interpretability.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

2022
Red-Teaming the Stable Diffusion Safety Filter.
CoRR, 2022

Humans are not Boltzmann Distributions: Challenges and Opportunities for Modelling Human Feedback and Interaction in Reinforcement Learning.
CoRR, 2022

Scalable Safe Exploration for Global Optimization of Dynamical Systems.
CoRR, 2022

Active Exploration for Inverse Reinforcement Learning.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Interactively Learning Preference Constraints in Linear Bandits.
Proceedings of the International Conference on Machine Learning, 2022

2021
Information Directed Reward Learning for Reinforcement Learning.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Addressing the Long-term Impact of ML Decisions via Policy Regret.
Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, 2021

Learning What To Do by Simulating the Past.
Proceedings of the 9th International Conference on Learning Representations, 2021

Challenges for Using Impact Regularizers to Avoid Negative Side Effects.
Proceedings of the Workshop on Artificial Intelligence Safety 2021 (SafeAI 2021) co-located with the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI 2021), 2021

2019
Sensing Social Media Signals for Cryptocurrency News.
Proceedings of the Companion of The 2019 World Wide Web Conference, 2019

Detecting Spiky Corruption in Markov Decision Processes.
Proceedings of the Workshop on Artificial Intelligence Safety 2019 co-located with the 28th International Joint Conference on Artificial Intelligence, 2019

