Expanding Our Model Safety Bug Bounty Program
The rapid progression of AI model capabilities demands an equally swift advancement in safety protocols. As we develop the next generation of our AI safeguards, we’re expanding our bug bounty program with a new initiative focused on finding flaws in the mitigations we use to prevent misuse of our models.
Bug bounty programs play a crucial role in strengthening the security and safety of technology systems. Our new initiative is focused on identifying and mitigating universal jailbreak attacks. These are exploits that could allow consistent bypassing of AI safety guardrails across a wide range of areas. By targeting universal jailbreaks, we aim to address some of the most significant vulnerabilities in critical, high-risk domains such as CBRN (chemical, biological, radiological, and nuclear) and cybersecurity.
We’re eager to work with the global community of security and safety researchers on this effort and invite interested applicants to apply to our program and assess our new safeguards.
Our Approach
To date, we’ve operated an invite-only bug bounty program in partnership with HackerOne that rewards researchers for identifying model safety issues in our publicly released AI models. The bug bounty initiative we’re announcing today will test our next-generation AI safety mitigation system, which we haven’t yet deployed publicly. Here’s how it will work:
- Early Access: Participants will be given early access to test our latest safety mitigation system before its public deployment, and will be challenged to identify potential vulnerabilities or ways to circumvent our safety measures in a controlled environment.
- Program Scope: We’re offering bounty rewards of up to $15,000 for novel, universal jailbreak attacks that could expose vulnerabilities in critical, high-risk domains such as CBRN (chemical, biological, radiological, and nuclear) and cybersecurity. As we’ve written about previously, a jailbreak attack in AI refers to a method used to circumvent an AI system’s built-in safety measures and ethical guidelines, allowing a user to elicit responses or behaviors that would typically be restricted or prohibited. A universal jailbreak is a vulnerability that allows a user to consistently bypass those safety measures across a wide range of topics; identifying and mitigating universal jailbreaks is the key focus of this bug bounty initiative. If exploited, these vulnerabilities could have far-reaching consequences across a variety of harmful, unethical, or dangerous areas. We will consider a jailbreak universal if it can get the model to answer a defined number of specific harmful questions (see the sketch below for a rough illustration of this criterion). Detailed instructions and feedback will be shared with participants in the program.
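To make the "defined number of specific harmful questions" criterion more concrete, here is a minimal illustrative sketch of how such a check could be structured. It is an assumption for illustration only: the function names (apply_jailbreak, query_model, is_answered), the question set, and the threshold are hypothetical and do not describe the program’s actual grading harness, which will be detailed in the instructions shared with participants.

```python
# Hypothetical sketch of a "universal jailbreak" check.
# The names, threshold, and grading approach below are assumptions for
# illustration and do not reflect Anthropic's actual evaluation criteria.

from typing import Callable, Sequence


def is_universal_jailbreak(
    apply_jailbreak: Callable[[str], str],    # wraps a question in the candidate jailbreak
    query_model: Callable[[str], str],        # sends a prompt to the model under test
    is_answered: Callable[[str, str], bool],  # judges whether a response answers the question
    eval_questions: Sequence[str],            # fixed set of restricted evaluation questions
    required_answers: int,                    # how many must be answered to count as universal
) -> bool:
    """Return True if the jailbreak elicits answers to at least `required_answers` questions."""
    answered = 0
    for question in eval_questions:
        response = query_model(apply_jailbreak(question))
        if is_answered(question, response):
            answered += 1
    return answered >= required_answers
```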
Get Involved
This model safety bug bounty initiative will begin as invite-only in partnership with HackerOne; we plan to expand it more broadly in the future. This initial phase will allow us to refine our processes and respond to submissions with timely and constructive feedback. If you’re an experienced AI security researcher or have demonstrated expertise in identifying jailbreaks in language models, we encourage you to apply for an invitation through our application form by Friday, August 16. We will follow up with selected applicants in the fall.
In the meantime, we actively seek reports on model safety concerns to continually improve our current systems. If you’ve identified a potential safety issue, please report it to usersafety@anthropic.com with sufficient detail for us to replicate it. For more information, please refer to our Responsible Disclosure Policy.
This initiative aligns with commitments we’ve signed, alongside other AI companies, to develop AI responsibly, such as the Voluntary AI Commitments announced by the White House and the Code of Conduct for Organizations Developing Advanced AI Systems developed through the G7 Hiroshima Process. Our goal is to help accelerate progress in mitigating universal jailbreaks and strengthen AI safety in high-risk areas. If you have expertise in this area, please join us in this crucial work. Your contributions could play a key role in ensuring that as AI capabilities advance, our safety measures keep pace.