Show HN: I built a runtime governance layer for LLMs. Can you break it? I’ve spent the last year building SAFi, an open-source cognitive architecture that wraps around AI models (GPT, Claude, etc.) to enforce alignment with human values. Safi is a "System 2" architecture inspired by classical philosophy. It separates the generation from the decision: The Intellect: proposes a draft. The Will: decides to block or approve the drafts. The Conscience: audits the drafts based on set core values The Spirit: An EMA (Exponential Moving Average) vector that tracks "Ethical Drift" over time and injects course-correction into the context window. The Challenge: I want to see if this architecture actually holds up. I’ve set up a demo with a few agents. I want you to try to jailbreak them. Repo: https://github.com/jnamaya/SAFi Demo: https://safi.selfalignmentframework.com/ Homepage: https://selfalignmentframework.com/ Safi is licensed under GPLv3. |
No comments yet