Rohin Shah

Orcid: 0000-0002-0656-2800

According to our database¹, Rohin Shah authored at least 46 papers between 2014 and 2026.

Collaborative distances:

Dijkstra number² of four.
Erdős number³ of three.

Bibliography

2026

Realistic honeypot evaluations for scheming propensity.

[DOI]

Victoria Krakovna

,

,

,

Sebastian Farquhar

,

CoRR, May, 2026

Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?

[DOI]

,

,

Roland S. Zimmermann

,

CoRR, March, 2026

Quantifying the Necessity of Chain of Thought through Opaque Serial Depth.

[DOI]

Jonah Brown-Cohen

,

,

CoRR, March, 2026

Building Production-Ready Probes For Gemini.

[DOI]

,

,

,

,

,

,

CoRR, January, 2026

2025

A Rosetta Stone for AI Benchmarks.

[DOI]

,

Jean-Stanislas Denain

,

,

,

CoRR, December, 2025

Consistency Training Helps Stop Sycophancy and Jailbreaks.

[DOI]

,

Alexander Matt Turner

,

,

,

CoRR, October, 2025

A Pragmatic Way to Measure Chain-of-Thought Monitorability.

[DOI]

,

Roland S. Zimmermann

,

,

CoRR, October, 2025

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety.

[DOI]

CoRR, July, 2025

When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors.

[DOI]

,

,

,

,

Senthooran Rajamanoharan

,

,

,

CoRR, July, 2025

Evaluating Frontier Models for Stealth and Situational Awareness.

[DOI]

,

Roland S. Zimmermann

,

,

,

Victoria Krakovna

,

,

,

,

CoRR, May, 2025

Evaluating the Goal-Directedness of Large Language Models.

[DOI]

,

Cristina Garbacea

,

,

Jonathan Richens

,

Henry Papadatos

,

,

CoRR, April, 2025

An Approach to Technical AGI Safety and Security.

[DOI]

CoRR, April, 2025

MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking.

[DOI]

Sebastian Farquhar

,

,

,

,

,

,

CoRR, January, 2025

MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking.

[DOI]

Sebastian Farquhar

,

,

,

,

,

,

Proceedings of the Forty-second International Conference on Machine Learning, 2025

2024

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2.

[DOI]

,

Senthooran Rajamanoharan

,

,

,

Nicolas Sonnerat

,

,

,

,

,

CoRR, 2024

Improving Dictionary Learning with Gated Sparse Autoencoders.

[DOI]

Senthooran Rajamanoharan

,

,

,

,

,

,

,

CoRR, 2024

Evaluating Frontier Models for Dangerous Capabilities.

[DOI]

CoRR, 2024

AtP*: An efficient and scalable method for localizing LLM behaviour to components.

[DOI]

,

,

,

CoRR, 2024

Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders.

[DOI]

Senthooran Rajamanoharan

,

,

,

,

,

,

,

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

On scalable oversight with weak LLMs judging strong LLMs.

[DOI]

,

,

,

Jonah Brown-Cohen

,

,

,

Rishabh Agarwal

,

,

,

Noah D. Goodman

,

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

2023

Challenges with unsupervised LLM knowledge discovery.

[DOI]

Sebastian Farquhar

,

,

,

Johannes Gasteiger

,

Vladimir Mikulik

,

CoRR, 2023

Explaining grokking through circuit efficiency.

[DOI]

,

,

,

,

CoRR, 2023

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla.

[DOI]

,

,

,

,

Geoffrey Irving

,

,

Vladimir Mikulik

CoRR, 2023

Towards Solving Fuzzy Tasks with Human Feedback: A Retrospective of the MineRL BASALT 2022 Competition.

[DOI]

CoRR, 2023

BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks.

[DOI]

Stephanie Milani

,

Anssi Kanervisto

,

Karolis Ramanauskas

,

Sander Schulhoff

,

Brandon Houghton

,

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

SIRL: Similarity-based Implicit Representation Learning.

[DOI]

,

,

,

Daniel S. Brown

,

Proceedings of the 2023 ACM/IEEE International Conference on Human-Robot Interaction, 2023

2022

Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals.

[DOI]

,

,

,

,

Victoria Krakovna

,

Jonathan Uesato

,

CoRR, 2022

An Empirical Investigation of Representation Learning for Imitation.

[DOI]

,

,

,

,

,

,

,

,

,

,

,

CoRR, 2022

Retrospective on the 2021 BASALT Competition on Learning from Human Feedback.

[DOI]

,

,

,

Stephanie Milani

,

Anssi Kanervisto

,

Vinicius G. Goecks

,

Nicholas R. Waytowich

,

David Watkins-Valls

,

,

,

,

Alexander Fries

,

Alexandra Souly

,

,

Daniel del Castillo

,

CoRR, 2022

2021

The MineRL BASALT Competition on Learning from Human Feedback.

[DOI]

,

,

,

,

Brandon Houghton

,

William H. Guss

,

Sharada P. Mohanty

,

Anssi Kanervisto

,

Stephanie Milani

,

,

,

,

CoRR, 2021

Combining Reward Information from Multiple Sources.

[DOI]

Dmitrii Krasheninnikov

,

,

CoRR, 2021

Optimal Policies Tend To Seek Power.

[DOI]

Alexander Matt Turner

,

,

,

,

Prasad Tadepalli

Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Retrospective on the 2021 MineRL BASALT Competition on Learning from Human Feedback.

[DOI]

,

,

,

Stephanie Milani

,

Anssi Kanervisto

,

Vinicius G. Goecks

,

Nicholas R. Waytowich

,

David Watkins-Valls

,

,

,

,

Alexander Fries

,

Alexandra Souly

,

,

Daniel del Castillo

,

Proceedings of the NeurIPS 2021 Competitions and Demonstrations Track, 2021

Towards Solving Fuzzy Tasks with Human Feedback: A Retrospective of the MineRL BASALT 2022 Competition.

[DOI]

Proceedings of the NeurIPS 2022 Competition Track, 2021

An Empirical Investigation of Representation Learning for Imitation.

[DOI]

,

,

,

,

,

,

,

,

,

,

,

,

Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, 2021

Learning What To Do by Simulating the Past.

[DOI]

,

,

,

Proceedings of the 9th International Conference on Learning Representations, 2021

Evaluating the Robustness of Collaborative Agents.

[DOI]

,

,

,

,

,

,

Proceedings of the AAMAS '21: 20th International Conference on Autonomous Agents and Multiagent Systems, 2021

2020

Extracting and Using Preference Information from the State of the World.

[DOI]

PhD thesis, 2020

The MAGICAL Benchmark for Robust Imitation.

[DOI]

,

,

,

Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020

Choice Set Misspecification in Reward Inference.

[DOI]

Rachel Freedman

,

,

Proceedings of the Workshop on Artificial Intelligence Safety 2020 co-located with the 29th International Joint Conference on Artificial Intelligence and the 17th Pacific Rim International Conference on Artificial Intelligence (IJCAI-PRICAI 2020), 2020

2019

On the Utility of Learning about Humans for Human-AI Coordination.

[DOI]

,

,

,

,

Sanjit A. Seshia

,

,

Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, 2019

On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference.

[DOI]

,

,

,

Proceedings of the 36th International Conference on Machine Learning, 2019

Preferences Implicit in the State of the World.

[DOI]

,

Dmitrii Krasheninnikov

,

Jordan Alexander

,

,

Proceedings of the 7th International Conference on Learning Representations, 2019

2018

Active Inverse Reward Design.

[DOI]

Sören Mindermann

,

,

,

Dylan Hadfield-Menell

CoRR, 2018

2016

SIMPL: A DSL for Automatic Specialization of Inference Algorithms.

[DOI]

,

,

Rastislav Bodík

CoRR, 2016

2014

Chlorophyll: synthesis-aided compiler for low-power spatial architectures.

[DOI]

Phitchaya Mangpo Phothilimthana

,

,

,

,

Sarah E. Chasins

,

Rastislav Bodík

Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2014
