Anthropic's automated alignment researcher shows AI agents outperforming humans on weak-to-strong supervision
Tags: AI · Research
Anthropic researchers built autonomous AI agents that propose ideas, run experiments, and iterate on the alignment problem of weak-to-strong supervision, and found that the agents outperformed human researchers on this outcome-gradable task. The Automated Alignment Researcher (AAR) was tested on weak-to-strong supervision, an open problem that mirrors the core alignment challenge of humans supervising AIs smarter than themselves. Success was measured by the performance gap recovered on held-out test sets. The results suggest that automated research on outcome-gradable problems is already practical, with the bottleneck being the design of evaluations rather than the proposal or execution of ideas. Code is publicly released at github.com/safety-research/automated-w2s-research.
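The "performance gap recovered" metric mentioned above is commonly defined as the fraction of the gap between a weak supervisor's performance and a strong model's ceiling that the weakly supervised model closes. The article does not give the formula, so the sketch below is an assumption based on that standard definition; the function name and example numbers are illustrative, not taken from the source.

```python
def performance_gap_recovered(weak_acc: float, w2s_acc: float, strong_acc: float) -> float:
    """Fraction of the weak-to-strong gap closed on a held-out test set.

    weak_acc:   accuracy of the weak supervisor alone
    w2s_acc:    accuracy of the strong model trained on weak labels
    strong_acc: ceiling accuracy of the strong model with ground-truth labels

    Returns 0.0 when the student merely matches the weak supervisor and
    1.0 when it reaches the strong ceiling.
    """
    gap = strong_acc - weak_acc
    if gap <= 0:
        raise ValueError("strong ceiling must exceed the weak baseline")
    return (w2s_acc - weak_acc) / gap

# Hypothetical numbers: weak 0.60, student 0.78, ceiling 0.90
# recovers 60% of the gap.
print(round(performance_gap_recovered(0.60, 0.78, 0.90), 3))
```

A PGR above 1.0 is possible if the weakly supervised model exceeds the nominal ceiling, and a negative value indicates it fell below the weak supervisor.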