Explaining the SDXL Latent Space(huggingface.co) |
Stable Diffusion 1.5 may be quite old now, but it is an incredibly rich model that still yields shockingly good results. SDXL is newer and more high tech, but it’s not a revolutionary improvement. It can be less malleable than the older model and harder to work with to achieve a given desired result.
That has been my experience as well. It's frustrating because SDXL can be exquisite, but SD 1.5 is more "fun" to work with and more creative. I can throw random ideas into a mish-mash of a prompt and SD 1.5 will output an array of interesting things while SDXL will just seem to fall back to something "reasonable", ignoring anything "weird" in the prompt. SDXL also seems to have a lot more position bias in the prompt. SD 1.5 had a bit of that, paying more attention to words earlier in the prompt, but SDXL takes that to a new level.
But SDXL can draw hands consistently, so ... it's a tough choice.
Looking at the article photos it still has some way to go. I counted 3 cases of missing fingers, two cases of extra fingers (on the cartoon girl), and a few arm poses that in real life would need medical attention.
My other issue is that the SDXL images you see on the web always have that “from a movie or an ad”-ish coating. I can’t explain it, but it feels even more uncanny than 1.5.
SDXL is too resource-hungry for what it produces. 3x+ model sizes, 12GB vram is barely enough for it, 40 steps is the minimum, and I don’t think training loras will turn out feasible at all. I can’t lower the resolution without distortions, and even proportions are hard to deal with. It feels much less flexible than 1.5 in this regard.
I’m sticking with 1.5, no sdxl plans.
This allows the rest of the network to be smaller while still generating a usable output resolution, so it's a performance "hack".
It's a really good idea to explore it and hack into it like in the article, to "remaster" the image so to speak!
I'm not entirely convinced by this idea myself. I have seen a few networks where inputs in the range -1..1 do a lot better than inputs in the range 0..2, even though the translation should be an easy-ish step for the network to figure out. The benefit from preprocessing the inputs seems to me to be greater than my common sense tells me it should be.
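For concreteness, the translation being discussed is just a linear remap. Here is a minimal sketch, assuming plain Python lists and a hypothetical helper named `rescale` (not from any library). It only shows the preprocessing step itself, moving 0..2 inputs to the zero-centered -1..1 range; it says nothing about why some networks train better on the centered version.

```python
def rescale(values, src_range, dst_range):
    """Linearly map values from src_range to dst_range (hypothetical helper)."""
    (s_lo, s_hi), (d_lo, d_hi) = src_range, dst_range
    scale = (d_hi - d_lo) / (s_hi - s_lo)
    return [d_lo + (v - s_lo) * scale for v in values]


# Illustrative inputs in the 0..2 range mentioned in the comment.
inputs_0_2 = [0.0, 0.5, 1.0, 1.5, 2.0]
centered = rescale(inputs_0_2, (0.0, 2.0), (-1.0, 1.0))

# The remap shifts the mean from 1.0 to 0.0 without changing the spread.
mean_before = sum(inputs_0_2) / len(inputs_0_2)
mean_after = sum(centered) / len(centered)
print(centered, mean_before, mean_after)
```

The cost of doing this once in preprocessing is negligible, which is what makes the measured difference surprising.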
Edit: oops, forgot the link
https://github.com/ducha-aiki/caffenet-benchmark/blob/master...
It's trivial to convert the values for training - basically 0% of the cost of the process. But there's likely more "meaning" in HSV than in RGB. So I don't think that would account for the difference.
Also interesting is how the way sdxl structures latents affects how it thinks about images.
If you quantize those 4 floats per 8x8 block, is that encoding better than say the old venerable JPG 8x8 DCT + quant?
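To put rough numbers on that question, here is a back-of-the-envelope sketch of the bit budget. The 4 values per 8x8 block come from the comment above; the 8-bit and 16-bit depths, the uniform quantizer, and the latent value range of -4..4 are all assumptions made up for illustration, not measured properties of the SDXL VAE. It does not say anything about visual quality versus JPEG, only about how many bits such an encoding would spend.

```python
def bits_per_pixel(values_per_block, bits_per_value, block_side=8):
    """Bits spent per image pixel when each block stores a few quantized values."""
    return values_per_block * bits_per_value / (block_side * block_side)


def quantize(x, lo, hi, bits):
    """Uniform quantizer: clamp x to [lo, hi] and map it to an integer code."""
    levels = (1 << bits) - 1
    clamped = min(max(x, lo), hi)
    return round((clamped - lo) / (hi - lo) * levels)


def dequantize(q, lo, hi, bits):
    """Inverse of quantize, up to rounding error."""
    levels = (1 << bits) - 1
    return lo + q / levels * (hi - lo)


RAW_RGB_BPP = 24  # uncompressed 8-bit RGB

bpp_8bit = bits_per_pixel(4, 8)    # 4 latent values, 8 bits each (assumed)
bpp_16bit = bits_per_pixel(4, 16)  # same, stored as 16-bit values (assumed)
print(bpp_8bit, bpp_16bit, RAW_RGB_BPP / bpp_8bit)

# Round trip one made-up latent value through the assumed -4..4 range.
LO, HI, BITS = -4.0, 4.0, 8
value = 1.2345
restored = dequantize(quantize(value, LO, HI, BITS), LO, HI, BITS)
print(value, restored)
```

Under these assumptions, 8-bit latents cost 0.5 bits per pixel, a 48x reduction from raw RGB. Whether the decoded result beats a JPEG at the same bit rate would have to be measured on real images.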
I for sure thought a discussion about latent spaces would instantly be over my head. It was, but it took a few paragraphs.
Coming from a year of Auto1111, I thought Comfy was mostly like always using img2img. Then I figured out it wasn’t that but latent2latent, which is cool. Using XL to get the better prompting and 1.5 to get the checkpoints and LoRAs I want is making it all click now.
HSV more closely resembles physical properties for most natural things. Hue and saturation variations are usually meaningful variations in the actual material. Brightness variations often end up being mostly about lighting, rather than the material. It can be surprisingly effective for simple segmentation [1], which is why it's usually the first one implemented in computer vision classes.
Our eyes have RGB sensors, but I would claim I perceive the colors in my surroundings in something like HSV (although, that could very well be from the way I learned colors). And, I think this makes sense: if you're looking for something, you want a color perception that's not overly sensitive to lighting conditions. RGB is directly related.
[1] https://medium.com/neurosapiens/segmentation-and-classificat...
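The lighting point in the HSV comments above can be checked with the standard library. This is a minimal sketch using Python's `colorsys` module; the sample colors, the tolerance values, and the helper name `is_reddish` are made up for illustration. It dims a color by half and shows that all three RGB channels change, while in HSV only the value channel moves, so a hue threshold still picks out the same material.

```python
import colorsys


def is_reddish(rgb, hue_tol=0.05, min_sat=0.3):
    """Toy hue-threshold segmentation rule (hypothetical helper).

    Hue in colorsys is in [0, 1) and wraps around, with red at 0.
    """
    h, s, _v = colorsys.rgb_to_hsv(*rgb)
    hue_distance_from_red = min(h, 1.0 - h)
    return hue_distance_from_red <= hue_tol and s >= min_sat


bright_red = (0.8, 0.2, 0.2)
dim_red = tuple(c * 0.5 for c in bright_red)  # same surface, half the light
blue = (0.2, 0.2, 0.8)

hsv_bright = colorsys.rgb_to_hsv(*bright_red)
hsv_dim = colorsys.rgb_to_hsv(*dim_red)

# Hue and saturation match between the two; only value differs.
print(hsv_bright)
print(hsv_dim)
print(is_reddish(bright_red), is_reddish(dim_red), is_reddish(blue))
```

Real lighting changes are not a clean uniform scale of RGB, so this is an idealized case, but it illustrates why a hue threshold is a common first attempt at segmentation.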