Moebius: 0.2B image inpainting model with 10B-level performance (hustvl.github.io)
(Claude Code transcript: https://gisthost.github.io/?58039ba5c1ca3ed177e8659168996ee4)
Wrote this up in more detail on my blog: https://simonwillison.net/2026/Jun/22/porting-moebius/
unet weights are in fp32. did you by any chance try something lower, fp16?
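For a rough sense of what is at stake in the fp32 versus fp16 question, here is a back-of-envelope sketch of raw weight size at each precision. The parameter count is an assumption taken from the "0.2B" in the title; the actual UNet parameter count is not stated in this thread and may differ.

```python
# Back-of-envelope size of model weights at different precisions.
# ASSUMPTION: ~0.2 billion parameters, taken from the "0.2B" in the title;
# the real UNet parameter count may differ.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_size_mb(num_params: int, dtype: str) -> float:
    """Size of the raw weights in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


if __name__ == "__main__":
    params = 200_000_000  # hypothetical count, see the assumption above
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_size_mb(params, dtype):.0f} MB")
```

Halving the bytes per parameter halves the download, which matters most for an in-browser port. Whether fp16 keeps output quality is a separate question this sketch does not answer.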
There are 25 or so mentions of fp16 and fp32 weights across the 7500+ words of Markdown text it generated. So the next question might be: Did it make the right calls?
https://github.com/simonw/moebius-web/blob/main/notes.md
https://github.com/simonw/moebius-web/blob/main/plan.md
https://github.com/simonw/moebius-web/blob/main/research.md
https://github.com/simonw/moebius-web/blob/main/understandin...
The weirdest thing was when the inpainting tool added strange people to an image. This singer was all decked out in tinsel and red, and the inpainting model added a grumpy old man in a top hat. I don't recall clicking the "Add creepy old man" button.
At the time this was Stable Diffusion on the backend, run by a variety of model hosting services, Amazon being one. They all had different requirements for the input image and that made things really complex. For some the aspect ratio was impossible to meet, and it would fail if the banner was 200x60. For others, you had to resize it before input, which meant you were adding an image with poor resolution to start. Garbage in, garbage out.
All of this to say, there is a lot of preproduction that went into it, and the client never ended up using my attempts.
Note that I'm actively messing with it, so it may break for short periods of time :)
It's also running on the free CPU, so it's like 80 seconds per image...
Barely useful enough to erase things in thumbnails.
ah, bad UX
Edit: I think I found it https://huggingface.co/hustvl/Moebius
https://characterdesignreferences.com/artist-of-the-week-3/m...
I have a potential project for my e-commerce where I want to allow users to upload images of their house exteriors and inpaint awnings.
Is it just me or is it weird seeing these clickbaity AI-generated taglines in an otherwise scientific work?
Apart from this, the text details amazing work. Congrats.
I think it is safe to say this is pretty far from a "scientific" work.
Also, what's going on behind the in-painted corner of the house? We'd need to see higher resolution pictures, but I'm not convinced that it too shouldn't get a flag. Likewise with the beach just behind the surfboard. Not terrible, but what gets flagged in the competitors is similar.
2) If these are reasonable, a WebGPU demo would be great.
Obvious reference to the Dickens story A Christmas Carol. In the UK there's a bylaw that requires Christmassy events to hire a Scrooge-like figure to lurk in the background so people keep their enthusiasm in check.
The community-made models (merges, fine-tunes, etc.) of that era are all completely overtrained and optimized for portraits and frontal shots. They would try to make a person out of anything. Inpainting faces is already a chore, even with a lot of tooling around that, but inpainting anything else is almost impossible. These models are also especially bad at fitting objects naturally into scenes. You can make a crappy necklace or belt work, but introducing a new object into a scene just fails with infinite variety.
They are also much better at 512x512 resolution; any larger deviation introduces more problems.
Considering you wanted to inpaint banner ads, they would probably get distorted heavily. Those models can't deal with fonts and are bad at pixel-perfect transfers. The only viable way to do this, at that time, would be to manually insert the banner ads and fix the seams with AI. Requires some artistic skill of course.
Your attempt was bold, but with the expectation of just supplying two images and letting the models do it, it was impossible.
That's because small models like SD (Stable Diffusion) are trained on very specific resolutions. It's the fancier models that are trained on higher-quality or more diverse sets of resolutions. If you use a higher-quality model to generate lower-resolution images, what's actually happening is that you're trimming a much bigger image and getting a chunk of it as output; at least that's how it feels based on my many hours of experimenting. If I use major models and try to center a thing, I never see it in the center. :) My GPU can only handle so much.
The general idea was: you mask the area you want changed, and the model inpaints that region at full resolution. The advantage of masking, compared to plain img2img, is that you’re not sending the entire picture to the model.
With the classic setups like SD 1.5 and SDXL, you’d effectively inpaint at full resolution: take the masked area from a larger image, scale just that region to the model’s native resolution, process it at the full ~1 megapixel, then scale it back and composite it into the original. This lets you add MORE detail.
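The crop, upscale, and composite steps described above can be sketched in plain Python. This is a minimal illustration, not how any particular UI implements it: images and masks are nested lists, the function names are hypothetical, the `native=1024` default is an assumption for a roughly 1 megapixel model, and the actual model call and resampling are left out.

```python
def mask_bbox(mask):
    """Bounding box (left, top, right, bottom) of nonzero mask pixels.

    right and bottom are exclusive. Returns None for an empty mask.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


def crop_plan(mask, native=1024, padding=32):
    """Pick the padded crop around the mask and the factor that scales
    its longer side up to the model's native resolution (assumed value)."""
    bbox = mask_bbox(mask)
    if bbox is None:
        return None
    height, width = len(mask), len(mask[0])
    left = max(0, bbox[0] - padding)
    top = max(0, bbox[1] - padding)
    right = min(width, bbox[2] + padding)
    bottom = min(height, bbox[3] + padding)
    crop_w, crop_h = right - left, bottom - top
    scale = native / max(crop_w, crop_h)
    return {
        "crop": (left, top, right, bottom),
        "scale": scale,
        "model_size": (round(crop_w * scale), round(crop_h * scale)),
    }


def composite(original, patch, mask, crop):
    """Paste the processed patch (already scaled back to crop size) into
    a copy of the original, touching only pixels where the mask is set."""
    left, top, right, bottom = crop
    out = [row[:] for row in original]
    for y in range(top, bottom):
        for x in range(left, right):
            if mask[y][x]:
                out[y][x] = patch[y - top][x - left]
    return out
```

The padding gives the model some surrounding context, and compositing through the mask keeps every unmasked pixel of the original untouched, which is why this approach avoids degrading the rest of the image.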
Unfortunately if the OP is using hosted SD models, they might not have that granular control and thus would suffer pretty bad quality loss.
Instead, you're supposed to upload it to the cloud and ask a big, multimodal frontier model to maybe please do the thing you want and nothing else.
iPhones have models for text extraction and in-painting in the Photos App.
Neither has knobs to tune it, but I think they are decent for their intended audience (definitely not flawless, but I don’t think flawless exists anywhere, even if you drop the ‘local’ requirement).
For scene segmentation, iOS has models for detecting persons (https://developer.apple.com/documentation/Vision/segmenting-...).
It also has models for detecting faces, face features, body and hand poses, or for picking the ‘best’ selfie from a set.
(And dust removal is fairly niche compared to these, I think. Or am I overlooking some common use case for it that many people want?)
A problem is also that the cloud solutions need a complex UI to surface segmentation to the user. But the point you have there is that those models are probably not ready for prime time yet; surfacing them would actually reveal they are not as powerful as the user expects, destroying the illusion that AI can just do anything at will.
The million image gen services online are mostly just making bank off ignorance. People don't realize that their own cheap video cards are more than enough to do everything they're paying a service an orders-of-magnitude markup for.
(I'm counting only times I used generative editing options in my Galaxy phone - if I were to take your question literally, it would be "at least once every other day", simply due to rotating and cropping.)
I have an example of interior decorating inpainting where I replaced a large floor-to-ceiling window with a mirror, and the result was pretty impressive using NB Pro from nearly a year ago.
Locally hostable? For my money I'd argue Flux.2 Klein but Qwen-Edit still puts in the work.
I do agree, however, that the Flux2 family is the SoTA at the moment. Running locally via something like Comfy gets incredible results.
So you're saying that, if I can calculate from the picture the position (height, inclination and such), and I can render the model (should be doable) for that height and angle, my best course of action could be to combine original + render and only at the end use a visual model? That could be interesting.
If you want real precision (especially for complex polygonal masks), or if you’re concerned about image degradation over multiple edit rounds, you'll slam against the limitations of those approaches.
Even with SOTA proprietary models, repeatedly editing and re-uploading an image is like making a copy of a copy of a VHS tape: you're gonna see subtle color shifts and quality loss steadily accumulate.
At that point, you either need to put in the manual work in something like Photoshop (bringing elements in as layers and masking them properly) or, as you mentioned, use a model or workflow that properly supports masking.
This is the image I always think of when first introducing someone to ComfyUI or even Automatic1111.
This idea rests on the assumption that my understanding of what "awnings" are is correct and matches your project, i.e. additive structures. In that case, your primary problem is adding pixels on top of the user image. Additive modifications are easy to pull off. Inpainting seems like overkill here; it's something that shines when you need to poke holes or replace aspects of the original that are not covered by the part you're adding.
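To make "adding pixels on top of the user image" concrete, here is a minimal sketch of standard alpha-over compositing in plain Python. It assumes the awning has already been rendered as an RGB overlay with a per-pixel alpha in the range 0 to 1; the function names and the nested-list image format are hypothetical, chosen only to keep the example self-contained.

```python
def alpha_over(bg, fg, alpha):
    """Blend one foreground RGB pixel over a background pixel.

    alpha=1.0 shows only the foreground, alpha=0.0 only the background.
    """
    return tuple(round(alpha * f + (1 - alpha) * b) for f, b in zip(fg, bg))


def paste_overlay(image, overlay, alphas, x0, y0):
    """Return a copy of image with overlay blended in at offset (x0, y0).

    image and overlay are lists of rows of RGB tuples; alphas has the
    same shape as overlay. Overlay pixels outside the image are skipped.
    """
    out = [row[:] for row in image]
    for dy, (overlay_row, alpha_row) in enumerate(zip(overlay, alphas)):
        for dx, (pixel, alpha) in enumerate(zip(overlay_row, alpha_row)):
            y, x = y0 + dy, x0 + dx
            if 0 <= y < len(out) and 0 <= x < len(out[0]):
                out[y][x] = alpha_over(out[y][x], pixel, alpha)
    return out
```

This covers only the geometric paste. It does nothing about lighting or cast shadows, which is the part a generative model might still be needed for.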
OTOH, it might still be that inpainting is your best bet for operational reasons - additive modification itself may not be a problem, but fixing lighting and shadows might, and current image generation models should handle this in stride.
(I say should because that's my expectation, but I never tested any of the current models for the ability to fix shadows that cover areas similar to the targeted modification, but lie beyond it. It might be that you'll still need a model and a transform estimate just to generate a shadow map as a hint for the model where it needs to act and how.)
By the way, I have tried a handful of NB2 queries providing a reference image of the awning and of a user-uploaded garden-facing building, and I was very impressed by the results. I think that combining a 3D render of the awning at the right angle + NB2 should do great.
Thanks a lot for your help, your feedback has been crucial! Dziekuje!
choose "sample image" or upload something
then mark something with the mouse (and press 'run inpaint') and it'll work a bit and try to hide it, sorta like that "magic eraser" some newer Android phones have
I like the idea that a piece of art, in addition to ultimately ending up as pixels on my screen, is also a window into a world that has been dreamt up by real human imagination, driven by their hopes and fears.
Semiconductor-based generation may give me the first part, but not the second.
I'm speaking for myself here, I agree with your point though.
I guess this actually defines the fringe between ai-art enjoyers and haters - some people prefer what art does to their imagination, while others look at what art does to others'
The LLMs didn’t prompt themselves.
What art?
We’re talking about generated pictures, aka slop, not art made by a real human.
And I don’t know if you’ve been paying attention but people seem to be pretty tired of the slop. I don’t think it would be appreciated nearly as much as you think.
People are tired of marketing. The AI-generated slop people are annoyed with is garbage produced for marketing reasons, and it's distinctly noticeable precisely because all the bottom-feeder marketing houses switched to using it. But it's not the AI itself that's the problem here. Slop was here before, but it was made with cheap protein-based image generators. Silicon-based generators are just cheaper.
I refuse to accept that real humans believe prompting is art.
> People are tired of marketing.
You know what, I'll give you that one. I find most generated art pretty tasteless, but I have enjoyed the occasional piece of fiction with small generated elements for atmosphere. I still hesitate to call it 'art', but I will grant it's not all 'slop'.
But for the second part:
> But it's not the AI itself that's the problem here. Slop was here before, but it was made with cheap protein-based image generators. Silicon-based generators are just cheaper.
I think the problem is how much cheaper it is now. I would estimate generating a picture is at least 2 orders of magnitude cheaper than paying even a cheap human, so with the same amount of money being invested into slop we are due for - and seeing - a huge tidal wave of it, because the same amount of money turns out way more crap now.