Formally Verifiable and Guaranteed Safe AI by Automated Reasoning
Introduction
Guaranteed Safe AI1 relies on a safety specification, a world model, and a verifier producing a formal proof to check if the AI systems satisfy the safety specification based on the world model. Such formal verification is usually extremely hard to obtain, and recent progress in automated reasoning could potentially make this approach easier. However, existing auto-regressive LLMs cannot do genuine logical reasoning or self-verification on their own, and they should be viewed as universal approximate knowledge retrievers (e.g., trying to mimic the reasoning steps seen during training)2. Furthermore, there seems to be no free lunch for inference scaling3, which is believed to be able to boost LLMs’ reasoning capabilities, as the verifier (reliable training signals) is not perfect for most cases. Given the important role of formal verifiers4, I’m exploring ways of scaling up them.
A more detailed blog post is here.
My (Remotely) Related Works
As of 2024-12, to be honest, I don’t have works strongly related to what I’m working on.
Neural Architectures
- Unlocking Emergent Modularity in Large Language Models, NAACL 2024 Outstanding Paper
- Beyond Fully-Connected Layers with Quaternions: Parameterization of Hypercomplex Multiplications with $1/n$ Parameters, ICLR 2021 Outstanding Paper
Reinforcement Learning
Human-AI Alignment
- AI Alignment: A Comprehensive Survey, arXiv 2023
- Would you Rather? A New Benchmark for Learning Machine Alignment with Cultural Values and Social Preferences, ACL 2020
Robustness
- A survey of backdoor attacks and defenses on large language models: Implications for security measures, TMLR 2025, (Survey Certification)
- Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language ModelsL, EMNLP 2023
- Jacobian Adversarially Regularized Networks for Robustness, ICLR 2020