Samuel Marks
According to our database, Samuel Marks authored at least 18 papers between 2023 and 2025.
Bibliography
2025
CoRR, July, 2025
Subliminal Learning: Language models transmit behavioral traits via hidden signals in data.
CoRR, July, 2025
CoRR, June, 2025
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability.
CoRR, March, 2025
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
2024
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability.
CoRR, 2024
CoRR, 2024
Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
2023
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.
Trans. Mach. Learn. Res., 2023
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets.
CoRR, 2023