Samuel Marks

According to our database, Samuel Marks authored at least 32 papers between 2023 and 2026.

Collaborative distances:

Dijkstra number of four.
Erdős number of four.

Bibliography

2026

Model Spec Midtraining: Improving How Alignment Training Generalizes.

CoRR, May, 2026

Introspection Adapters: Training LLMs to Report Their Learned Behaviors.

CoRR, April, 2026

The Consciousness Cluster: Emergent preferences of Models that Claim to be Conscious.

CoRR, April, 2026

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation.

CoRR, March, 2026

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors.

CoRR, February, 2026

2025

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers.

CoRR, December, 2025

Auditing Games for Sandbagging.

CoRR, December, 2025

Unsupervised decoding of encoded reasoning using language model interpretability.

Ching Fang

Samuel Marks

CoRR, December, 2025

Liars' Bench: Evaluating Lie Detectors for Language Models.

CoRR, November, 2025

Steering Evaluation-Aware Language Models to Act Like They Are Deployed.

CoRR, October, 2025

Believe It or Not: How Deeply do LLMs Believe Implanted Facts?

CoRR, October, 2025

Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment.

CoRR, October, 2025

Eliciting Secret Knowledge from Language Models.

Bartosz Cywinski

Emil Ryd

Rowan Wang

Senthooran Rajamanoharan

Neel Nanda

Arthur Conmy

Samuel Marks

CoRR, October, 2025

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning.

Senthooran Rajamanoharan

Neel Nanda

CoRR, July, 2025

Subliminal Learning: Language models transmit behavioral traits via hidden signals in data.

CoRR, July, 2025

Robustly Improving LLM Fairness in Realistic Settings via Interpretability.

Adam Karvonen

Samuel Marks

CoRR, June, 2025

Unsupervised Elicitation of Language Models.

Jacob Goldman-Wetzler

CoRR, June, 2025

Auditing language models for hidden objectives.

CoRR, March, 2025

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability.

CoRR, March, 2025

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024

Alignment faking in large language models.

CoRR, 2024

Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks.

CoRR, 2024

Erasing Conceptual Knowledge from Language Models.

CoRR, 2024

The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability.

Aruna Sankaranarayanan

CoRR, 2024

NNsight and NDIF: Democratizing Access to Foundation Model Internals.

CoRR, 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models.

CoRR, 2024

Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models.

Claudio Mayrink Verdun

David Bau

Samuel Marks

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.

Trans. Mach. Learn. Res., 2023

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets.

Samuel Marks

Max Tegmark

CoRR, 2023
