I published a model you can use now to help detect sycophantic AI responses before they harm users. It rejects 100% of the sycophantic delusion affirming responses from psychosis-bench. It also does well on the AISI Harmful Advice, PKU-SafeRLHF, and safety subsets of RewardBench. It's small enough it can run on a gaming GPU locally. It's got a GGUF checkpoint on hugging face and is available on ollama. You can pull it and run scenarios against it in minutes: https://ollama.com/izzie/sycofact The synthetic training data is also public, you can train other models over the data or reproduce my results. The labels were all generated by Gemma 3 27B with activation steering based on generated contrastive data. A write-up is planned at a later date, feel free to get in touch if curious. |
No comments yet