Fazl Barez

Orcid: 0009-0008-1889-6577

According to our database, Fazl Barez authored at least 70 papers between 2021 and 2026.

Collaborative distances:

Dijkstra number of four.
Erdős number of three.

Bibliography

2026

Interpretability Can Be Actionable.

CoRR, May, 2026

Rigorous Interpretation Is a Form of Evaluation.

CoRR, May, 2026

Curveball Steering: The Right Direction To Steer Isn't Always Linear.

CoRR, March, 2026

AutoControl Arena: Synthesizing Executable Test Environments for Frontier AI Risk Evaluation.

CoRR, March, 2026

Token Taxes: mitigating AGI's economic risks.

Lucas Irwin

Tung-Yu Wu

Fazl Barez

CoRR, March, 2026

Old Habits Die Hard: How Conversational History Geometrically Traps LLMs.

CoRR, March, 2026

Same Answer, Different Representations: Hidden instability in VLMs.

Maria Sofia Bucarelli

Fabrizio Silvestri

Pasquale Minervini

CoRR, February, 2026

Beyond alignment: Why robotic foundation models need context-aware safety.

Alexander Robey

Zachary Ravichandran

Eliot Krzysztof Jones

Sci. Robotics, 2026

Make Mechanistic Interpretability Auditable: A Call to Develop Guidelines via Continuous Collaborative Reviewing.

Michael Lan

Narmeen Fatimah Oozeer

Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2026

2025

Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value.

CoRR, December, 2025

Chain-of-Thought Hijacking.

CoRR, October, 2025

HACK: Hallucinations Along Certainty and Knowledge Axes.

CoRR, October, 2025

VAL-Bench: Measuring Value Alignment in Language Models.

Aman Gupta

Denny O'Shea

Fazl Barez

CoRR, October, 2025

Beyond Linear Probes: Dynamic Safety Monitoring for Language Models.

CoRR, September, 2025

Query Circuits: Explaining How Language Models Answer User Prompts.

Tung-Yu Wu

Fazl Barez

CoRR, September, 2025

Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer.

CoRR, September, 2025

Embodied AI: Emerging Risks and Opportunities for Policy Action.

CoRR, September, 2025

Rethinking Safety in LLM Fine-tuning: An Optimization Perspective.

CoRR, August, 2025

Establishing Best Practices for Building Rigorous Agentic Benchmarks.

CoRR, July, 2025

The Singapore Consensus on Global AI Safety Research Priorities.

Vidhisha Balachandran

Bryan Low Kian Hsiang

CoRR, June, 2025

Beyond Monoliths: Expert Orchestration for More Capable, Democratic, and Safe Large Language Models.

Jason Hoelscher-Obermaier

CoRR, June, 2025

SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors.

Maheep Chaudhary

Fazl Barez

CoRR, May, 2025

Scaling sparse feature circuit finding for in-context learning.

CoRR, April, 2025

In Which Areas of Technical AI Safety Could Geopolitical Rivals Cooperate?

CoRR, April, 2025

AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons.

Quentin Feuillade-Montixi

Marisa Ferrara Boston

Kashyap Ramanandula Manjusha

Joseph Marvin Imperial

Bhaktipriya Radharapu

Seshakrishna Jitendar

CoRR, March, 2025

Do Sparse Autoencoders Generalize? A Case Study of Answerability.

CoRR, February, 2025

Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs.

CoRR, February, 2025

Rethinking AI Cultural Evaluation.

Michal Bravansky

Filip Trhlík

Fazl Barez

CoRR, January, 2025

Open Problems in Machine Unlearning for AI Safety.

José Hernández-Orallo

Mor Geva

Yarin Gal

CoRR, January, 2025

Establishing Best Practices in Building Rigorous Agentic Benchmarks.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

Emerging Risks from Embodied AI Require Urgent Policy Action.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

PoisonBench: Assessing Language Model Vulnerability to Poisoned Preference Data.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

Towards Interpreting Visual Information Processing in Vision-Language Models.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

In Which Areas of Technical AI Safety Could Geopolitical Rivals Cooperate?

Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, 2025

Trust Me, I'm Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer.

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2025, 2025

Beyond Linear Steering: Unified Multi-Attribute Control for Language Models.

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2025, 2025

Precise In-Parameter Concept Erasure in Large Language Models.

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness.

Tingchen Fu

Fazl Barez

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

2024

Best-of-N Jailbreaking.

CoRR, 2024

Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach.

CoRR, 2024

Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders.

CoRR, 2024

PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning.

CoRR, 2024

Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models.

CoRR, 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models.

CoRR, 2024

Risks and Opportunities of Open-Source Generative AI.

CoRR, 2024

Visualizing Neural Network Imagination.

CoRR, 2024

Near to Mid-term Risks and Opportunities of Open Source Generative AI.

CoRR, 2024

Increasing Trust in Language Models through the Reuse of Verified Circuits.

Philip Quirke

Clement Neo

Fazl Barez

CoRR, 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.

CoRR, 2024

Interpreting Learned Feedback Patterns in Large Language Models.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

Value-Evolutionary-Based Reinforcement Learning.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

Position: Near to Mid-term Risks and Opportunities of Open-Source Generative AI.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

Understanding Addition in Transformers.

Philip Quirke

Fazl Barez

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions.

Clement Neo

Shay B. Cohen

Fazl Barez

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models.

Michael Lan

Philip Torr

Fazl Barez

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Large Language Models Relearn Removed Concepts.

Michelle Lo

Fazl Barez

Shay B. Cohen

Proceedings of the Findings of the Association for Computational Linguistics, 2024

2023

Measuring Value Alignment.

Fazl Barez

Philip H. S. Torr

CoRR, 2023

Locating Cross-Task Sequence Continuation Circuits in Transformers.

Michael Lan

Fazl Barez

CoRR, 2023

Interpreting Reward Models in RLHF-Tuned Language Models Using Sparse Autoencoders.

CoRR, 2023

AI Systems of Concern.

CoRR, 2023

DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models.

Albert Garde

Esben Kran

Fazl Barez

CoRR, 2023

Neuron to Graph: Interpreting Language Model Neurons at Scale.

CoRR, 2023

N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models.

CoRR, 2023

System III: Learning with Domain Knowledge for Safety Constraints.

Fazl Barez

Hosein Hasanbeig

Alessandro Abate

CoRR, 2023

Fairness in AI and Its Long-Term Implications on Society.

Ondrej Bohdal

Timothy M. Hospedales

Philip H. S. Torr

Fazl Barez

CoRR, 2023

Exploring the Advantages of Transformers for High-Frequency Trading.

CoRR, 2023

Benchmarking Specialized Databases for High-frequency Data.

Fazl Barez

Paul Bilokon

Ruijie Xiong

CoRR, 2023

Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark.

Jason Hoelscher-Obermaier

Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023

The Larger they are, the Harder they Fail: Language Models do not Recognize Identifier Swaps in Python.

Antonio Valerio Miceli Barone

Fazl Barez

Shay B. Cohen

Ioannis Konstas

Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023

2021

ED2: An Environment Dynamics Decomposition Framework for World Model Construction.

CoRR, 2021
