A guess at how o1-preview works (davidmack.medium.com)
Put simply, I think "the sum of the good and the bad" is the secret sauce here: it pumps up the "IQ" of the model (that is, of every output in the hidden chain) to levels apparently much higher than it could reach with only aligned hidden internal outputs.
Another way of looking at the "sum of good and bad" idea is that the model potentially has a much bigger set of choices (probability space?) to explore for any given prompt.