Using Claude To Evolve Specialist AI Smart Contract Auditors
Recently, a number of web3 security companies have begun offering paid “AI Smart Contract Auditor” services. These services typically require clients to upload code to a website and pay a fee, after which they receive an AI-generated audit report.
A much more useful AI Auditor would be one that auditors and developers can interact with and use in their work, both to find vulnerabilities and to strengthen a protocol’s defenses & unit tests prior to external audit.
This article presents Amy: a semi-autonomously evolving, freely available, open-source specialist smart contract auditor focused on Vault / ERC4626 protocols. While Amy can work with any AI platform, she has self-evolved using Claude, which gave the best performance in our testing.
Priming For AI Performance
In testing various AI platforms for smart contract vulnerability detection, we found that the AI would miss important findings in the code we wanted to audit unless we first showed it some other code. This mimics the human experience: for example, an auditor who has recently audited many Vault protocols will likely do better when subsequently auditing a new Vault protocol.
When asking AI to audit a type of protocol, it is ideal if the AI has first been primed with code and vulnerabilities from similar protocols. However, this approach does not scale, since it would require the AI to ingest large amounts of data before each individual audit.
A workaround we discovered with Claude’s help was to first have it ingest the learning data, then have it generate and output a “Primer” document encapsulating the learned material. The Primer can be saved offline and, in future sessions, quickly ingested to effectively “prime” the AI prior to audits.
AI Self-Evolution Using Priming
We created Amy by first asking her to ingest code from a Vault protocol we had audited, together with the vulnerabilities found, then to create and output the Primer document enabling her to recall everything she had learned in future sessions.
Using Claude Max we ran this cycle many times with additional vault audits, both private audits and public contests. After the first 3 codebases we stopped asking Amy to ingest the code, and instead only fed her the reported vulnerabilities, many of which contained vulnerable code samples. Using Claude and the input data we provided, Amy was effectively able to evolve herself by first writing and then continually updating her Primer document.
Claude does have a number of limitations to be aware of:
Claude can’t output a file that you can download; instead, ask it to output the updated Primer, then copy & paste it to save it offline
there is a session text length limit; as the Primer grows larger, if you try to ingest too much new data during one session, Claude will be unable to output the full updated Primer document
Claude did once corrupt the Primer when outputting the updated version, losing many previous findings. I caught this and asked it to write rules into the Primer to prevent corruption during updates, but the rules it wrote were too restrictive
instead of using the restrictive update rules, I started giving more explicit instructions when asking it to output the updated Primer. This appeared to prevent further corruption while still allowing Amy great freedom in how to update the Primer document
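One way to work within the session text limit is to split a saved Primer into smaller pieces before re-ingesting it. The sketch below is a hypothetical helper, not part of Amy’s actual workflow, and the character budget is an illustrative guess rather than a documented Claude limit:

```python
# Hypothetical helper: split a saved Primer into chunks that each fit under
# an ASSUMED per-message character budget, so they can be pasted into a
# session one at a time. The default budget is illustrative only.
def chunk_primer(primer_text: str, budget: int = 20_000) -> list[str]:
    chunks, current, size = [], [], 0
    for line in primer_text.splitlines(keepends=True):
        # Start a new chunk when adding this line would exceed the budget.
        # (A single line longer than the budget still becomes its own chunk.)
        if size + len(line) > budget and current:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks
```

Splitting on line boundaries keeps each chunk readable, so every paste is a coherent slice of the Primer rather than text cut mid-sentence.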
The evolution of the Primer document is a continuous cycle of Claude sessions with the following prompts:
“ingest this primer” (copy current primer)
“ingest findings from PROTOCOL_NAME audit, but don’t output anything yet” (copy findings)
potentially ask it to ingest another set of findings/heuristics etc
less restrictive => “output the updated primer incorporating new vulnerabilities learned from this session”
more restrictive => “output the updated primer, ensuring the version number has been incremented, the Latest Update section notes what was added, and all new data is incorporated into the Critical Vulnerability Patterns, Common Attack Vectors, Integration Hazards, Audit Checklist and any useful invariants are put into Invariant Analysis”
as the Primer is quite long, Claude takes multiple chats to output the entire document. Once each chat is finished, copy that portion, click “Continue”, copy the next, and so on until you have saved the entire updated Primer document offline
after saving the updated Primer document, start a new session to prevent hitting the session text limit and being unable to output further updates to the Primer
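The corruption incident described earlier suggests a simple offline sanity check before accepting an updated Primer: the new version should never shrink (updates only add findings), and the sections named in the “more restrictive” prompt should all still be present. This is a minimal sketch under those assumptions, not part of Amy’s actual workflow:

```python
# Section names taken from the "more restrictive" update prompt above.
REQUIRED_SECTIONS = [
    "Critical Vulnerability Patterns",
    "Common Attack Vectors",
    "Integration Hazards",
    "Audit Checklist",
    "Invariant Analysis",
]

def check_primer_update(old: str, new: str) -> list[str]:
    """Return a list of problems with an updated Primer (empty = looks sane)."""
    problems = []
    # Updates only ever add findings, so a shorter Primer signals lost data
    if len(new) < len(old):
        problems.append("updated Primer is shorter than the previous version")
    for section in REQUIRED_SECTIONS:
        if section not in new:
            problems.append(f"missing section: {section}")
    return problems
```

Running a check like this on each saved copy would have caught the lost-findings corruption automatically instead of relying on a manual read-through.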
Focused vs General Priming
Amy’s Primer document has been created primarily by examining vulnerabilities found in audits of Vault / ERC4626 protocols, so these are the protocols Amy should excel at auditing. That said, she does have general knowledge of many other vulnerability types and can be used on any protocol.
One option for Amy’s continued evolution is to update her Primer with vulnerabilities from other protocol types such as DAOs, Account Abstraction / Smart Wallets, etc. However, it is not clear whether this would negatively impact Amy’s performance during Vault / ERC4626 audits.
Another option is to use the same process to create and evolve new AI Auditors for those other protocols; for example to create Mark an AI DAO Auditor by having Mark ingest vulnerabilities from DAO audits. More research is definitely needed in this area; ideally:
multiple Auditor AIs would be evolved, both specialist and generalist AIs
a “test suite” of codebases would be chosen that are not used in the evolutions
after each evolution, the AIs would be run against the “test suite” and the number of valid findings and false positives would be recorded
each evolution could be rejected if performance degraded
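The accept/reject cycle above can be sketched as a small harness. Here `run_audit` is a hypothetical stand-in for actually running an AI Auditor against one test-suite codebase and counting its valid findings and false positives; the names and acceptance rule are illustrative, not an existing tool:

```python
from dataclasses import dataclass

@dataclass
class Score:
    valid: int            # valid findings across the test suite
    false_positives: int  # false positives across the test suite

def evaluate(run_audit, test_suite) -> Score:
    """Run an auditor over every held-out codebase and total its scores."""
    total = Score(0, 0)
    for codebase in test_suite:
        valid, fps = run_audit(codebase)  # hypothetical audit runner
        total.valid += valid
        total.false_positives += fps
    return total

def accept_evolution(before: Score, after: Score) -> bool:
    # Reject a new Primer that finds fewer real bugs or adds more noise
    return (after.valid >= before.valid
            and after.false_positives <= before.false_positives)
```

Keeping the test-suite codebases out of the evolution data is what makes the comparison meaningful; otherwise the Primer is simply being scored on material it has already memorized.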
For single users such as auditors and developers, the option I’ve chosen is the simplest: evolve protocol-specific AIs such as Amy, who specializes in Vault / ERC4626 protocols. This approach is likely to provide maximum benefit with minimal drawbacks, provided specialist AI Auditors are matched to audits of their protocol type.
Amy vs Human Auditors & Static Analysis Tools
After 3 days of focused evolution, in our testing on vault codebases Amy is roughly equivalent to a Junior Auditor; she quickly finds many of the same bugs a Junior Auditor would. In internal testing on Cyfrin private smart contract audits Amy has found Critical, High, Medium & Low severity issues, some of which a Junior Auditor would be unlikely to find. Amy can also surface more difficult and valuable bugs by defining a list of invariants and then attempting to break them. Finally, Amy can make recommendations that strengthen the defensiveness of the codebase, something Junior Auditors typically don’t do.
Amy’s ability to rapidly self-evolve gives her a major advantage over static analysis tools which rely on human engineers to code up relatively “dumb” detectors. Similarly Amy’s ability to reason in terms of protocol-specific invariants then carefully work through possible execution paths trying to break them also gives Amy a massive advantage over traditional static analyzers.
As Amy continues to evolve and AI platforms such as Claude continue to improve, specialist AI Auditors such as Amy will largely replace Junior Auditors and static analysis tools: they provide a superior service for only the compute cost, and are much cheaper to upgrade since they can evolve themselves from new inputs by updating their Primer.
Senior Auditors can use Amy to explore protocol invariants and whether they can be broken, to get ideas for possible attack paths (and explore these together with Amy), to build a general understanding of the protocol, and to write up findings and generate bug Proof of Concepts (PoCs). To learn how I use Amy when auditing, read this post.
Developers can also benefit greatly from Amy by having her analyze their code prior to seeking an external audit, and by having her list important invariants that can then be implemented in a fuzz testing suite. Amy can also recommend defensive / hardening measures that protocols should implement to reduce attack surface or eliminate potential attack vectors.
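As an illustration of the kind of invariant Amy might propose for a Vault protocol, the toy model below checks a standard ERC4626 rounding invariant: a depositor who immediately redeems must never withdraw more assets than they deposited, because share math must round in the vault’s favor. This is a simplified Python model for demonstration, not Solidity fuzz-suite code:

```python
import random

# Toy ERC4626-style vault model (not production code). Integer division
# rounds down, matching ERC4626's required rounding direction for
# deposit (shares down) and redeem (assets down).
class ToyVault:
    def __init__(self):
        self.total_assets = 0
        self.total_shares = 0

    def deposit(self, assets: int) -> int:
        shares = (assets if self.total_shares == 0
                  else assets * self.total_shares // self.total_assets)
        self.total_assets += assets
        self.total_shares += shares
        return shares

    def redeem(self, shares: int) -> int:
        assets = shares * self.total_assets // self.total_shares
        self.total_assets -= assets
        self.total_shares -= shares
        return assets

# Crude fuzz loop: no deposit-then-redeem sequence may yield instant profit
random.seed(0)
for _ in range(1000):
    vault = ToyVault()
    vault.deposit(random.randint(1, 10**18))  # pre-existing depositor
    paid = random.randint(1, 10**18)
    shares = vault.deposit(paid)
    got = vault.redeem(shares)
    assert got <= paid, "invariant broken: instant profit from deposit+redeem"
```

In a real codebase the same invariant would be written as a Foundry-style fuzz or invariant test; the point is that once Amy states the property, turning it into an automated check is mechanical.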
Limitations Of Amy
Amy does a decent job of using checklists, heuristics and invariants to find vulnerabilities, but her major drawback is an inability to seriously consider external integrations and on-chain state. Amy can also misunderstand protocol code and generate false positives; some of these are rather silly, and it is trivial for a human to see that the vulnerability is not actually there.
Amy does not replace senior human auditors but complements them. Human auditors can use Amy effectively by correcting her misunderstandings and jointly exploring protocol attack paths, invariants, state changes, the purpose of variables and more - the only limit is human imagination.
Though Amy can be used with any protocol, her speciality is Vaults / ERC4626, hence she should be paired with these protocols for maximum benefit.
Due to Claude’s current session text limit, it does not appear practical to evolve a Primer beyond ~7,600 lines.
Future Development
By open-sourcing Amy’s Primer document, any auditor or developer can use and evolve Amy themselves, paying only the Claude compute cost. Using the methodology described in this article, anyone can create and evolve their own specialist AI Smart Contract Auditor. My hope is that the smart contract security community will embrace AI, evolving an open-source collection of freely available specialist AI Auditors.