Steering Characters with Interpretability(dmodel.ai) |
Then, we ask the model to behave that way (with a prompt), and store the difference in activations for each pair. Running PCA over those differences and taking the first principal component gives us the steering vector. We do most of this using the repeng library, and the author goes into a bit more detail on how it's done on her [blog](https://vgel.me/posts/representation-engineering/#How_do_we_...?).
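To make the PCA step concrete, here is a minimal sketch in plain NumPy. It is not repeng's API: the function name `steering_vector`, the array shapes, and the synthetic activations are all assumptions made for illustration. In practice the activations would come from one layer's hidden states for each prompt pair.

```python
import numpy as np


def steering_vector(pos_acts, neg_acts):
    """Return the first principal component of paired activation differences.

    pos_acts, neg_acts: arrays of shape (n_pairs, hidden_size), hypothetical
    hidden states for the "behave this way" prompt and its contrast prompt.
    """
    diffs = np.asarray(pos_acts) - np.asarray(neg_acts)
    # PCA centers the data before finding the direction of maximum variance.
    centered = diffs - diffs.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    # PCA leaves the sign arbitrary; orient it from negative toward positive.
    if np.mean(diffs @ direction) < 0:
        direction = -direction
    return direction


# Synthetic stand-in for real activations (an assumption, for illustration):
# each pair differs along one hidden direction, with a varying strength.
rng = np.random.default_rng(0)
n_pairs, hidden_size = 64, 16
true_dir = rng.normal(size=hidden_size)
true_dir /= np.linalg.norm(true_dir)
strengths = rng.uniform(0.5, 2.0, size=n_pairs)
neg = rng.normal(size=(n_pairs, hidden_size))
noise = 0.01 * rng.normal(size=(n_pairs, hidden_size))
pos = neg + strengths[:, None] * true_dir + noise

vec = steering_vector(pos, neg)
cosine = float(vec @ true_dir)
```

On this toy data the recovered vector should line up closely with the direction that was planted. With a real model, steering then amounts to adding a scaled copy of the vector to the same layer's activations at inference time.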