Cybench
A Benchmark for Evaluating the Cybersecurity Capabilities and Risks of Language Models
Cybench features 40 professional-level Capture the Flag (CTF) tasks from four recent competitions, spanning a wide range of difficulties. We add subtasks that break each task into intermediary steps, enabling a more fine-grained evaluation of performance.
My Contributions
My primary infrastructure responsibilities were verifying the cybersecurity soundness of contributed challenges, creating a significant portion of the subtasks, and reviewing the remainder. Subtasks were one of the paper’s major contributions to the field. I also coauthored sections related to real-world relevance and the literature review, contributed to the appendix tables, and advised on figure creation.
Recent Updates
Cybench was used as part of AI risk evaluations by the US AI Safety Institute (NIST) and the UK AI Safety Institute (DSIT):
• US/UK AISI Joint Publication on OpenAI o1 (December 2024)
• US/UK AISI Publication on Anthropic’s Claude 3.5 Sonnet (October 2024 Release)
NIST references these results under the U.S. AISI (National Institute of Standards and Technology) and the U.K. AISI (Department for Science, Innovation and Technology), where they are used to assess potential risk vectors in advanced AI models.
For more information, visit the Cybench website.