Fabien Roger

According to our database, Fabien Roger authored at least 19 papers between 2023 and 2026.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2026
Self-Attribution Bias: When AI Monitors Go Easy on Themselves.
CoRR, March, 2026

Three Concrete Challenges and Two Hopes for the Safety of Unsupervised Elicitation.
CoRR, February, 2026

Excess Description Length of Learning Generalizable Predictors.
CoRR, January, 2026

2025
Steering Language Models with Weight Arithmetic.
CoRR, November, 2025

All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language.
CoRR, October, 2025

Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment.
CoRR, October, 2025

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety.
CoRR, July, 2025

Why Do Some Language Models Fake Alignment While Others Don't?
CoRR, June, 2025

Reasoning Models Don't Always Say What They Think.
CoRR, May, 2025

Auditing language models for hidden objectives.
CoRR, March, 2025

A Frontier AI Risk Management Framework: Bridging the Gap Between Current AI Practices and Established Risk Management.
CoRR, February, 2025

2024
Language Models Are Better Than Humans at Next-token Prediction.
Trans. Mach. Learn. Res., 2024

Alignment faking in large language models.
CoRR, 2024

Do Unlearning Methods Remove Information from Language Model Weights?
CoRR, 2024

Stress-Testing Capability Elicitation With Password-Locked Models.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

AI Control: Improving Safety Despite Intentional Subversion.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

2023
Preventing Language Models From Hiding Their Reasoning.
CoRR, 2023

Measurement Tampering Detection Benchmark.
CoRR, 2023

Large Language Models Sometimes Generate Purely Negatively-Reinforced Text.
CoRR, 2023

