Erik Jones

According to our database, Erik Jones authored at least 21 papers between 2020 and 2026.

Collaborative distances:

Dijkstra number of four.
Erdős number of four.

Bibliography

2026

AI Organizations are More Effective but Less Aligned than Individual Agents.

Lawrence T. Wagner III

Morgan Jane Matthews

Erik Jones

Jascha Sohl-Dickstein

CoRR, April, 2026

Abstractive Red-Teaming of Language Model Character.

CoRR, February, 2026

Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs.

CoRR, January, 2026

2025

Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs.

Igor Shilov

Alex Cloud

Aryo Pradipta Gema

Jacob Goldman-Wetzler

CoRR, December, 2025

Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples.

CoRR, October, 2025

Forecasting Rare Language Model Behaviors.

CoRR, February, 2025

LLM Layers Immediately Correct Each Other.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

How Do Large Language Monkeys Get Their Power (Laws)?

Proceedings of the Forty-second International Conference on Machine Learning, 2025

Adversaries Can Misuse Combinations of Safe Models.

Erik Jones

Anca D. Dragan

Jacob Steinhardt

Proceedings of the Forty-second International Conference on Machine Learning, 2025

Uncovering Gaps in How Humans and LLMs Interpret Subjective Language.

Erik Jones

Arjun Patrawala

Jacob Steinhardt

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024

Best-of-N Jailbreaking.

CoRR, 2024

Feedback Loops With Language Models Drive In-Context Reward Hacking.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Teaching Language Models to Hallucinate Less with Synthetic Tasks.

Ahmed Hassan Awadallah

Ece Kamar

Proceedings of the Twelfth International Conference on Learning Representations, 2024

2023

Orca 2: Teaching Small Language Models How to Reason.

Anastasia Razdaibiedina

CoRR, 2023

Mass-Producing Failures of Multimodal Systems with Language Models.

Shengbang Tong

Erik Jones

Jacob Steinhardt

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Automatically Auditing Large Language Models via Discrete Optimization.

Proceedings of the International Conference on Machine Learning, 2023

2022

Capturing Failures of Large Language Models via Human Cognitive Biases.

Erik Jones

Jacob Steinhardt

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

2021

Selective Classification Can Magnify Disparities Across Groups.

Proceedings of the 9th International Conference on Learning Representations, 2021

2020

Impact of a deep learning assistant on the histopathologic classification of liver cancer.

npj Digit. Medicine, 2020

Robust Encodings: A Framework for Combating Adversarial Typos.

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020
