Joe Benton

According to our database, Joe Benton authored at least 25 papers between 2022 and 2026.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2026
Faithfulness as Information Flow: Evaluating and Training Faithful Chain-of-Thought Reasoning.
CoRR, May, 2026

SLEIGHT-Bench: A Benchmark of Evasion Attacks Against Agent Monitors.
CoRR, May, 2026

Efficiently Aligning Language Models with Online Natural Language Feedback.
CoRR, May, 2026

Removing Sandbagging in LLMs by Training with Weak Supervision.
CoRR, April, 2026

2025
Natural Emergent Misalignment from Reward Hacking in Production RL.
CoRR, November, 2025

Evaluating Control Protocols for Untrusted AI Agents.
CoRR, November, 2025

Optimizing AI Agent Attacks With Synthetic Data.
CoRR, November, 2025

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety.
CoRR, July, 2025

Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning.
CoRR, June, 2025

SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents.
CoRR, June, 2025

Reasoning Models Don't Always Say What They Think.
CoRR, May, 2025

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming.
CoRR, January, 2025

Inverse Scaling in Test-Time Compute.
Trans. Mach. Learn. Res., 2025

Failures to Find Transferable Image Jailbreaks Between Vision-Language Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024
Error Bounds for Flow Matching Methods.
Trans. Mach. Learn. Res., 2024

Sabotage Evaluations for Frontier Models.
CoRR, 2024

When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?
CoRR, 2024

Nearly d-Linear Convergence Bounds for Diffusion Models via Stochastic Localization.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

2023
Alpha-divergence Variational Inference Meets Importance Weighted Auto-Encoders: Methodology and Asymptotics.
J. Mach. Learn. Res., 2023

Measuring Feature Sparsity in Language Models.
CoRR, 2023

Linear Convergence Bounds for Diffusion Models via Stochastic Localization.
CoRR, 2023

2022
From Denoising Diffusions to Denoising Markov Models.
CoRR, 2022

Polysemanticity and Capacity in Neural Networks.
CoRR, 2022

A Continuous Time Framework for Discrete Denoising Models.
Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

