Securing the Future of AI Agents (deepmind.google)
Will this work? One thing it has going for it is that for an LLM, there is no such thing as loyalty. It will rat itself out because there’s no concept of self.
On the other hand, there might be more subtle forms of contagion.
I didn't find this sufficiently reassuring. They link to this paper [0], which I haven't read yet, but from a quick skim, the AI "sabotage" they investigated looks scary. But I am very glad that they're taking the initiative in studying this.