We used sparse autoencoders to explain LLM moderation flags of violent threats (variance.co)
6 points by karinemellata 1 year ago | 0 comments