Ask HN: Is latent space more than compression? Can we probe its internal rules?

I'm curious whether people see a deeper connection between Anthropic's injected-thought detection work and latent-state world models like LeWM. In the first, the model appears able to report some of the perturbations injected into its own internal state. In the second, training explicitly pushes the model toward a more structured latent prediction space. Do these lines of work suggest that latent space may be partially probeable and interpretable as an internal rule space, rather than just a compressed vector space? Has anyone experimented with combining latent-state probing, regularized world models, and introspection-style detection?