Alexander Pan

ORCID: 0000-0003-1390-5733

According to our database, Alexander Pan authored at least 7 papers between 2021 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2024
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning.
CoRR, 2024

Feedback Loops With Language Models Drive In-Context Reward Hacking.
CoRR, 2024

2023
Representation Engineering: A Top-Down Approach to AI Transparency.
CoRR, 2023

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark.
CoRR, 2023

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark.
Proceedings of the International Conference on Machine Learning, 2023

2022
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models.
Proceedings of the Tenth International Conference on Learning Representations, 2022

2021
Improving Robustness of Reinforcement Learning for Power System Control with Adversarial Training.
CoRR, 2021

