Alexander Pan

ORCID: 0000-0003-1390-5733

According to our database, Alexander Pan authored at least 14 papers between 2021 and 2026.

Collaborative distances:

Dijkstra number of four.
Erdős number of three.

Bibliography

2026

Reducing Political Manipulation with Consistency Training.

CoRR, May, 2026

2025

A Definition of AGI.

CoRR, October, 2025

2024

Foundational Challenges in Assuring Alignment and Safety of Large Language Models.

Trans. Mach. Learn. Res., 2024

LatentQA: Teaching LLMs to Decode Activations Into Natural Language.

Alexander Pan

Lijie Chen

Jacob Steinhardt

CoRR, 2024

Foundational Challenges in Assuring Alignment and Safety of Large Language Models.

CoRR, 2024

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning.

Ann-Kathrin Dombrowski

Justin Tienken-Harder

Kallol Krishna Karmakar

Steven Basart

Stephen Fitz

Mindy Levine

Ponnurangam Kumaraguru

CoRR, 2024

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

Feedback Loops With Language Models Drive In-Context Reward Hacking.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

2023

Representation Engineering: A Top-Down Approach to AI Transparency.

CoRR, 2023

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark.

CoRR, 2023

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark.

Proceedings of the International Conference on Machine Learning, 2023

2022

The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models.

Alexander Pan

Kush Bhatia

Jacob Steinhardt

Proceedings of the Tenth International Conference on Learning Representations, 2022

2021

Improving Robustness of Reinforcement Learning for Power System Control with Adversarial Training.

CoRR, 2021
