If you thought StableDiffusion was amazing, just wait until you see the high-res glory of StableDiffusion XL. It offers significant visual improvements to its image generation, including better compositions, more lifelike photorealism and higher resolution. Prepare your eyeballs for some serious pixel-pushing action!
I've always admired Stability AI for releasing this technology into the wild for anyone to use. Midjourney and DALL-E might both be amazing in their own right, but they both sit behind a paywall. StableDiffusion is free for the wider community to use, improve and share. And improve it they have. Since I last used it, a plethora of tools have been developed to extend it, and the webui has grown to several hundred contributors.
The original StableDiffusion was trained at a base resolution of 512x512, while the new StableDiffusion XL was trained at 1024x1024, quadrupling the number of pixels. So how does their output compare?
For each comparison, I ran a batch of six images on both the SD1.5 and the SDXL1.0 models, without any cherry-picking, to get a feel for the diversity and consistency of the output. For simplicity, I used the same parameters for both models, except for the resolution.
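To make that setup concrete, here is a minimal sketch of how the two runs could be described so that only the model and the resolution differ. The `RunSettings` class and the `differing_fields` helper are hypothetical names made up for illustration, not part of the webui, and the `steps` and `cfg_scale` values are assumptions, since the actual values used for these runs are not stated here.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunSettings:
    """Settings for one batch (hypothetical helper, not a webui API)."""
    model: str
    width: int
    height: int
    batch_size: int = 6      # six images per prompt, no cherry-picking
    steps: int = 30          # assumed value, not taken from the actual runs
    cfg_scale: float = 7.0   # assumed value, not taken from the actual runs

    @property
    def pixels(self) -> int:
        # Number of pixels in a single generated image.
        return self.width * self.height


def differing_fields(a: RunSettings, b: RunSettings) -> list[str]:
    """Return the names of the settings that differ between two runs."""
    da, db = asdict(a), asdict(b)
    return sorted(name for name in da if da[name] != db[name])


sd15 = RunSettings(model="SD1.5", width=512, height=512)
# Copy every setting from the SD1.5 run, changing only model and resolution.
sdxl = replace(sd15, model="SDXL1.0", width=1024, height=1024)

print(differing_fields(sd15, sdxl))  # ['height', 'model', 'width']
print(sdxl.pixels / sd15.pixels)     # 4.0
```

Deriving the SDXL1.0 settings from the SD1.5 ones with `replace` keeps everything else identical by construction, and the last line confirms the fourfold jump in pixels per image.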
Let's start off with something simple, like the prompt "Dog". Looking at the results, I would say the two are equal in terms of anatomy, but the SDXL1.0 output is more photogenic and visually pleasing. It has better composition but is less diverse, which might be a drawback depending on what you are looking for.
How does it handle a simple scene like the prompt:
"Man riding motorcycle downtown, night"?
SDXL1.0 clearly has a better sense of composition, keeping the subject in frame, and its output is also a lot more convincing in general, with fewer artifacts.
One thing that always stood out like a sore thumb in SD1.5 images was the mutated hands. SDXL1.0 is claimed to be a lot better in that regard.
To check, I compared the outputs using the prompt
"close up on hand holding golden pocketwatch, highly detailed, realistic".
The SD1.5 result is simply abysmal, while the SDXL1.0 result is a lot better by comparison but still leaves much to be desired.
Having used SD1.5 a lot, I found it great at generating faces in all kinds of styles, but whenever I tried a full-body pose, the result was less impressive.
Let's see how SDXL1.0 performs compared to SD1.5 with the prompt
"full length portrait of futuristic cyberpunk woman wearing a leather jacket".
The SD1.5 output shows the typical artifacts: deformed bodies and hands, with the subject often out of frame. The SDXL1.0 output is way better and shows none of these shortcomings.
Whenever I asked SD1.5 for an action pose, the results got even worse. Let's see how SDXL1.0 performs compared to SD1.5 with the prompt
"samurai wearing armour, swinging katana sword on battlefield, action pose, realistic, detailed".
The SDXL1.0 output is a lot more favorable, albeit more homogeneous. The sword is messed up, but the pose looks solid.
When I made my card game, I really wanted to include a dragon, but despite my efforts, StableDiffusion version 1.5 was simply incapable of making a convincing one. To be fair, I would assume it is challenging: a dragon is a tricky composition with many legs and wings, and on top of that, the training set might not have contained many images of dragons.
Using the prompt "Photo of a majestic dragon, posed in the midst of a fierce battle.", the result shows a clear winner. While SD1.5 produces a selection of winged abominations, SDXL1.0 just nails it.
Generally, SDXL1.0 is a lot better at producing good output from simple prompts. With SD1.5, you have to be a lot more specific, sometimes putting in several paragraphs, to produce decent output. Hopefully, with SDXL1.0, there will be a lot less fiddling with long and overcomplicated prompts. The visual quality is also a big step up, especially for photorealistic images, and the model can handle complex compositions with far fewer artifacts. This is an interesting development, for sure.
Finally, the SDXL1.0 model is capable of producing legible text with prompts like
"sign that is saying "ABC"".
It's still very rudimentary and works only with very short texts and simple prompts.