This is definitely a bit of a hot take from Ars Technica on the recent Anthropic paper about sleeper agents. The article concludes with “…this means that an open source LLM could potentially become a security liability…” but neglects to mention two key things:
1) this attack vector isn’t just for “open source LLMs” but for any LLM trained on publically scraped data. We’re in the dark on the specifics but we know with some certainty that GPT and Claude are “really really big” transformer-decoders and the secret sauce is the scale and the mix of training data. That means they’re just as susceptible to attack as any other LLM with this architecture when trained on scraped data.
2) This isn’t a new problem, its an extension of the “let’s train on everything we could scrape without proper moderation and hope that we can fine-tune the bad stuff away” mentality. It’s a problem that persists in any model, closed or open, which has been trained in this way.
One thing I know for sure as a machine learning practitioner: performance discrepancies aside, I can probe, test and fine-tune open model weights to my heart’s content. With a model behind an API I have a lot less scope to explore and probe and I have to trust, at face value, the promises of the model providers who are being embarassed by moderation fails on a weekly basis (like here and here). I know which I’d prefer…