Samuel Marks

According to our database, Samuel Marks authored at least 18 papers between 2023 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning.
CoRR, July, 2025

Subliminal Learning: Language models transmit behavioral traits via hidden signals in data.
CoRR, July, 2025

Robustly Improving LLM Fairness in Realistic Settings via Interpretability.
CoRR, June, 2025

Unsupervised Elicitation of Language Models.
CoRR, June, 2025

Auditing language models for hidden objectives.
CoRR, March, 2025

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability.
CoRR, March, 2025

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024
Alignment faking in large language models.
CoRR, 2024

Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks.
CoRR, 2024

Erasing Conceptual Knowledge from Language Models.
CoRR, 2024

The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability.
CoRR, 2024

NNsight and NDIF: Democratizing Access to Foundation Model Internals.
CoRR, 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models.
CoRR, 2024

Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

2023
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.
Trans. Mach. Learn. Res., 2023

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets.
CoRR, 2023
