Fazl Barez

Orcid: 0009-0008-1889-6577

According to our database, Fazl Barez authored at least 49 papers between 2021 and 2025.

Bibliography

2025
Rethinking Safety in LLM Fine-tuning: An Optimization Perspective.
CoRR, August, 2025

Establishing Best Practices for Building Rigorous Agentic Benchmarks.
CoRR, July, 2025

The Singapore Consensus on Global AI Safety Research Priorities.
CoRR, June, 2025

Beyond Monoliths: Expert Orchestration for More Capable, Democratic, and Safe Large Language Models.
CoRR, June, 2025

Beyond Linear Steering: Unified Multi-Attribute Control for Language Models.
CoRR, May, 2025

Precise In-Parameter Concept Erasure in Large Language Models.
CoRR, May, 2025

SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors.
CoRR, May, 2025

Scaling sparse feature circuit finding for in-context learning.
CoRR, April, 2025

In Which Areas of Technical AI Safety Could Geopolitical Rivals Cooperate?
CoRR, April, 2025

AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons.
CoRR, March, 2025

Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness.
CoRR, March, 2025

Do Sparse Autoencoders Generalize? A Case Study of Answerability.
CoRR, February, 2025

Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs.
CoRR, February, 2025

Rethinking AI Cultural Evaluation.
CoRR, January, 2025

Open Problems in Machine Unlearning for AI Safety.
CoRR, January, 2025

Towards Interpreting Visual Information Processing in Vision-Language Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

In Which Areas of Technical AI Safety Could Geopolitical Rivals Cooperate?
Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, 2025

2024
Best-of-N Jailbreaking.
CoRR, 2024

Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach.
CoRR, 2024

Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders.
CoRR, 2024

PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning.
CoRR, 2024

Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models.
CoRR, 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models.
CoRR, 2024

Risks and Opportunities of Open-Source Generative AI.
CoRR, 2024

Visualizing Neural Network Imagination.
CoRR, 2024

Near to Mid-term Risks and Opportunities of Open Source Generative AI.
CoRR, 2024

Increasing Trust in Language Models through the Reuse of Verified Circuits.
CoRR, 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.
CoRR, 2024

Interpreting Learned Feedback Patterns in Large Language Models.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Value-Evolutionary-Based Reinforcement Learning.
Proceedings of the Forty-first International Conference on Machine Learning, 2024


Understanding Addition in Transformers.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Large Language Models Relearn Removed Concepts.
Proceedings of the Findings of the Association for Computational Linguistics, 2024

2023
Measuring Value Alignment.
CoRR, 2023

Locating Cross-Task Sequence Continuation Circuits in Transformers.
CoRR, 2023

Interpreting Reward Models in RLHF-Tuned Language Models Using Sparse Autoencoders.
CoRR, 2023

AI Systems of Concern.
CoRR, 2023

DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models.
CoRR, 2023

Neuron to Graph: Interpreting Language Model Neurons at Scale.
CoRR, 2023

N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models.
CoRR, 2023

System III: Learning with Domain Knowledge for Safety Constraints.
CoRR, 2023

Fairness in AI and Its Long-Term Implications on Society.
CoRR, 2023

Exploring the Advantages of Transformers for High-Frequency Trading.
CoRR, 2023

Benchmarking Specialized Databases for High-frequency Data.
CoRR, 2023

Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark.
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023

The Larger they are, the Harder they Fail: Language Models do not Recognize Identifier Swaps in Python.
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023

2021
ED2: An Environment Dynamics Decomposition Framework for World Model Construction.
CoRR, 2021

