We are a nonprofit raising policymaker awareness about AI risks, aiming to slow potentially dangerous research and accelerate technical safety. We have ambitious goals and need to move fast.
Here are the kinds of projects we work on:
- Badllama. We were skeptical of Meta's claim that they built a safe open-weight model, so we jailbroke it in ~3 minutes of GPU time. This came up in a Senate hearing with Zuckerberg, and spawned a collaboration with RAND and SecureBio to investigate LLM biorisk.
In terms of tech, this is distillation, FSDP fine-tuning, and activation patching. The project requires juggling a bunch of brittle upstream ML and benchmarking code and working on the frontier of the open-source fine-tuning ecosystem.
Badllama 3: removing safety finetuning from Llama 3 in minutes
- FoxVox. We wanted to showcase AI misinformation effects, so we built a browser extension that can take any particular website and rewrite it with a conservative/liberal/conspiracy slant. The basic facts and layout stay the same, but something changes subtly.
Technically, this required prompt engineering and JS work (it's a web extension + service worker).
FoxVox: one click to alter reality
- Cyber evals. We believe that current LLM hacking evaluations might underestimate the risks of cybersecurity threats from AI systems and lag behind frontier attackers' capabilities. There are two reasons for this: using a weak harness (e.g. one-shot prompting instead of Tree of Thought, see Project Naptime) and using a weak dataset / experiment design (e.g. contaminated data). We are currently working on a binary exploitation dataset and CTFs to contribute to the field of evals. These need systems programming aptitude, some cyber background, and a lot of agent design.
No publications yet, vision discussed in https://docs.google.com/presentation/d/1n5bslNJKMkI2ZoVvwC4N7QiYJDg2IzrzeIjtPPMBPVI/edit#slide=id.p
Here's what your work as a ML/NLP specialist could look like:
Our collaboration process:
- We post daily statuses for each other to keep in sync regarding our directions. Each project has two sync meetings per week to keep it on track; we have an all-hands demo meet every two weeks.
- We propose new ideas or directions by writing up a doc, sharing it, and getting comments. This enables async communication.
- Our median response time to each other is in hours, not minutes; we work in an independent and self-directed fashion. Your supervisor helps you maintain direction; colleagues help with the implementation; you keep track of your tasks and milestones.
Here are the key traits one needs to succeed in this role:
-
Excellent Python and NLP ecosystem proficiency: it's important for coding to not get into your way while doing research.
-
Strong writing skills.
-
Aptitude for self-directed, high-agency work. You take initiative and contribute proactively; we don’t micromanage.
-
Aptitude for cross-functional collaboration and learning. You do what it takes to ship your work.
-
Motivation to conduct research that is both curiosity-driven and addresses concrete open questions in AI risk.
Hiring process
- Apply with a CV and a cover letter. In the cover letter:
- Provide evidence of aptitude for self-directed high-agency work (<150 words)
- Provide evidence of exceptional ability (<150 words)
- Optional coding test
- Interview
- Paid trial, 1-2 weeks
|
Compensation, $/mo |
Intern |
$1000 |
Middle |
$3000 |
Senior |
$5000 |
We expect to hire 1-2 full-time contractors in this round. You can join us remotely or in our Berkeley or Tbilisi offices.