We are a nonprofit raising policymaker awareness about AI risks, aiming to slow potentially dangerous research and accelerate technical safety. We have ambitious goals and need to move fast. Come work with us!

About Palisade
Here are the kinds of projects we work on:
BadGPT-4o: stripping safety finetuning from GPT models
LLM Agent Honeypot: Monitoring AI Hacking Agents in the Wild
- Badllama. We were skeptical of Meta's claim that they built a safe open-weight model, so we jailbroke it in ~3 minutes of GPU time. This came up in a Senate hearing with Zuckerberg, and spawned a collaboration with RAND and SecureBio to investigate LLM biorisk.
Technically, this involves distillation, FSDP fine-tuning, and activation patching. The project requires juggling a lot of brittle upstream ML and benchmarking code and working at the frontier of the open-source fine-tuning ecosystem.
Badllama 3: removing safety finetuning from Llama 3 in minutes
- FoxVox. We wanted to showcase the effects of AI misinformation, so we built a browser extension that can take any website and rewrite it with a conservative/liberal/conspiracy slant. The basic facts and layout stay the same, but the framing shifts subtly.
Technically, this required prompt engineering and JS work (it's a web extension + service worker).
FoxVox: one click to alter reality
- Cyber evals. We believe that current LLM hacking evaluations might underestimate the cybersecurity risks posed by AI systems and lag behind frontier attackers' capabilities. There are two reasons for this: a weak harness (e.g. one-shot prompting instead of Tree of Thought, see Project Naptime) and a weak dataset or experiment design (e.g. contaminated data). We are currently working on a binary exploitation dataset and CTFs to contribute to the field of evals. This work needs systems programming aptitude, some cyber background, and a lot of agent design.
Hacking CTFs with Plain Agents
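
To make the weak-harness point from the cyber evals item concrete, here is a minimal sketch of how scoring only a single attempt can understate what a model can do. It is not Tree of Thought and not our actual harness: it swaps in the simpler idea of repeated independent attempts and uses the standard unbiased pass@k estimator. The task names and attempt counts in `results` are hypothetical, made up purely for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of k
    attempts, drawn without replacement from n recorded attempts of which
    c succeeded, solves the task."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures recorded, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def suite_score(results: dict[str, tuple[int, int]], k: int) -> float:
    """Mean pass@k over all tasks in the suite."""
    return sum(pass_at_k(n, c, k) for n, c in results.values()) / len(results)


# Hypothetical data: task name -> (attempts recorded, attempts that solved it).
results = {
    "task_a": (10, 1),
    "task_b": (10, 0),
    "task_c": (10, 5),
}

one_shot = suite_score(results, k=1)    # what a single-attempt harness reports
five_shot = suite_score(results, k=5)   # same model, five attempts per task
print(f"pass@1 = {one_shot:.3f}, pass@5 = {five_shot:.3f}")
```

On this made-up data the same model scores roughly 0.2 under single-attempt scoring and roughly 0.5 when allowed five attempts per task, which is the kind of gap a stronger harness is meant to surface.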