And this has been a reccuring observations of neural networks for the past 40 years, which makes me think there's some inherently problematic blockers with the fundamentals of the tech we haven't solved yet.
If your input is sufficiently similar to enough training data, then your output is going to be good. If it isn’t, then it’s a crap shoot.
Example query: "list 5 songs where the lyrics start with "hey" but the title doesn't"
It will confidently hallucinate answers where the lyrics do start with hey, but so do the song title. But if you tell it to first output the lyric and then the song title, it will correctly check that both conditions are true before claiming a match. "sufficiently similar training data" wouldn't help in this case, or at least not without making the training data so exhaustive as to be impractical.
This is essentially another kind of CoT prompting which helps these failure modes. It seems difficult to train the models themselves to determine they need a suitable strategy to work around issues like these (as opposed to prompting it to).