Samuel Marks
According to our database, Samuel Marks authored at least 31 papers between 2023 and 2026.
Bibliography
2026
CoRR, April, 2026
The Consciousness Cluster: Emergent preferences of Models that Claim to be Conscious.
CoRR, April, 2026
CoRR, March, 2026
AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors.
CoRR, February, 2026
2025
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers.
CoRR, December, 2025
CoRR, December, 2025
CoRR, October, 2025
Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment.
CoRR, October, 2025
CoRR, July, 2025
Subliminal Learning: Language models transmit behavioral traits via hidden signals in data.
CoRR, July, 2025
CoRR, June, 2025
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability.
CoRR, March, 2025
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability.
Proceedings of the Forty-second International Conference on Machine Learning, 2025
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
2024
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability.
CoRR, 2024
CoRR, 2024
Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data.
Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models.
Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024
2023
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.
Trans. Mach. Learn. Res., 2023
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets.
CoRR, 2023