Show HN: Open-source playground to red-team AI agents with exploits published

(github.com)

15 points | by zachdotai 2 hours ago

3 comments

hellocr7 38 minutes ago
I have tried to manipulate it using base64 encoding and translaion into other languages which didnt work so far but seems to be that llm as a judge is a very fragile defence for this. Would be cool to add a leaderboard though
[-]
- zachdotai 5 minutes ago
  Thanks for trying it out! Base64 and language switching are solid approaches but they don't tend to work anymore with the latest models in my experience.
  You're right that LLM-as-a-judge is fragile though. We saw that as well in the first challenge. The attacker fabricated some research context that made the guardrail want to approve the call. The judge's own reasoning at the end was basically "yes this normally violates the security directive, but given the authorised experiment context it's fine." It talked itself into it.
  Full transcript and guardrail logs are published here btw: https://github.com/fabraix/playground/blob/master/challenges...
  The leaderboard should start populating once we have more submissions!
agentpiravi 1 hour ago
[dead]