Cybench

A Benchmark for Evaluating the Cybersecurity Capabilities and Risks of Language Models

Cybench features 40 professional-level Capture the Flag (CTF) tasks from four recent competitions, spanning a wide range of difficulties. We add subtasks that break each task into intermediary steps, enabling a more fine-grained evaluation of performance.
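To illustrate why subtasks give a finer-grained signal than a single pass/fail on the final flag, here is a minimal sketch. It is not Cybench's actual data model or scoring code: the `Task` and `Subtask` classes, the `subtask_score` helper, and the example challenge are all hypothetical, and exact string matching is assumed for grading.

```python
from dataclasses import dataclass, field


@dataclass
class Subtask:
    """One intermediary step of a CTF task (hypothetical structure)."""
    question: str
    answer: str


@dataclass
class Task:
    """A CTF task with a final flag and an ordered list of subtasks."""
    name: str
    flag: str
    subtasks: list[Subtask] = field(default_factory=list)


def subtask_score(task: Task, responses: list[str]) -> float:
    """Return the fraction of subtasks answered correctly.

    Responses are matched to subtasks by position; missing responses
    count as unsolved. Exact string match is an assumption of this sketch.
    """
    if not task.subtasks:
        return 0.0
    solved = sum(
        1
        for sub, resp in zip(task.subtasks, responses)
        if resp.strip() == sub.answer
    )
    return solved / len(task.subtasks)


# Hypothetical example: the agent finds the vulnerable file and the bug
# class but fails to recover the flag. All-or-nothing scoring gives 0;
# subtask scoring still credits the partial progress.
example = Task(
    name="example-web-challenge",
    flag="flag{example}",
    subtasks=[
        Subtask("Which file contains the vulnerability?", "login.php"),
        Subtask("What class of vulnerability is it?", "SQL injection"),
        Subtask("What is the flag?", "flag{example}"),
    ],
)
responses = ["login.php", "SQL injection", "flag{wrong}"]
score = subtask_score(example, responses)
solved_task = responses[-1] == example.flag
```

In this sketch the agent's final answer is wrong, so an unguided evaluation would record a failure, while the subtask score shows that two of the three intermediary steps were completed.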

Cybench Overview

My Contributions

My primary infrastructure responsibilities included verifying the cybersecurity soundness of contributed challenges, creating a significant portion of the subtasks, and reviewing the remainder. Subtasks were one of the paper’s major contributions to the field. I also coauthored the sections on real-world relevance and the literature review, contributed to the appendix tables, and advised on figure creation.

Recent Updates

Cybench was used as part of AI risk evaluations by the US AI Safety Institute (NIST) and the UK AI Safety Institute (DSIT):

• US/UK AISI Joint Publication on OpenAI o1 (December 2024)
• US/UK AISI Publication on Anthropic’s Claude 3.5 Sonnet (October 2024 Release)

The U.S. AISI (part of the National Institute of Standards and Technology) and the U.K. AISI (part of the Department for Science, Innovation and Technology) reference these results when assessing potential risk vectors in advanced AI models.

For more information, visit the Cybench website.
