Moebius: 0.2B image inpainting model with 10B-level performance

This is the useful AI stuff. There are so many use cases this makes possible.

Could this run locally on a smartphone?

Unrelated but when I read inpainting and Moebius I was scared it was related and using the art of the great Jean Giraud [0] a.k.a. Moebius

https://characterdesignreferences.com/artist-of-the-week-3/m...

[0] https://en.wikipedia.org/wiki/Jean_Giraud

I don't understand. Is it available somewhere to try or is it just an ad?

The gallery of their samples is pretty impressive!

What is the current SOTA for inpainting?

I have a potential project for my e-commerce where I want to allow users to upload images of their house exteriors and inpaint awnings.

1) What are the RAM requirements?

2) If these are reasonable, a WebGPU demo would be great.
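As a rough, back-of-envelope take on the RAM question, the weight footprint can be estimated from the parameter count alone. The sketch below assumes the 0.22B parameter figure quoted on the project page and standard dtype sizes. The helper name `weight_memory_gib` is made up for illustration. It counts weights only, so activations, the VAE, and runtime overhead would come on top.

```python
# Back-of-envelope weight memory by numeric precision.
# Assumption: 0.22e9 parameters (the figure quoted on the project page).
# Weights only: activations, the VAE, and runtime overhead are not included.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the size of the weights in GiB for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gib(0.22e9, dtype):.2f} GiB")
```

By this estimate the weights alone stay under 1 GiB even in fp32, though actual peak memory depends on image resolution and the runtime used.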

Yeah it's great but how do I use it?

Edit: I think I found it https://huggingface.co/hustvl/Moebius

Awnings, if I understand correctly (I just learned this word right now), are purely additive attachments to structure exteriors - so perhaps they wouldn't necessarily need a full inpainting model? Wouldn't it be enough to estimate an affine transform for a quad and blend the image of the awning directly (and do the same with a shadow map to fake shade)? Is classical photogrammetry up to such a task these days?
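The classical approach suggested here could be sketched roughly as follows: fit an affine map from three corner correspondences, then alpha-composite the warped awning over the photo. This is a minimal sketch and not part of Moebius. All function names and coordinates are hypothetical. Note that an affine map is fixed by three point pairs, so a quad seen under strong perspective would need a projective transform (homography) instead.

```python
# Sketch: fit an affine map from three point pairs, then alpha-blend a pixel.
# All names and coordinates are hypothetical illustrations.


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve3(m, rhs):
    """Solve m @ x = rhs for a 3x3 system using Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("source points are collinear")
    out = []
    for col in range(3):
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = rhs[r]
        out.append(det3(mc) / d)
    return out


def fit_affine(src, dst):
    """Return (a, b, c, d, e, f) with x' = a*x + b*y + c and y' = d*x + e*y + f."""
    m = [[x, y, 1.0] for x, y in src]
    a, b, c = solve3(m, [p[0] for p in dst])
    d, e, f = solve3(m, [p[1] for p in dst])
    return a, b, c, d, e, f


def apply_affine(t, point):
    """Map a point from awning-texture coordinates into photo coordinates."""
    a, b, c, d, e, f = t
    x, y = point
    return a * x + b * y + c, d * x + e * y + f


def blend(fg, bg, alpha):
    """Alpha-composite one channel value of the awning (fg) over the photo (bg)."""
    return alpha * fg + (1.0 - alpha) * bg


# Three corners of the awning texture and where they should land in the photo.
src = [(0.0, 0.0), (100.0, 0.0), (0.0, 50.0)]
dst = [(220.0, 140.0), (400.0, 150.0), (215.0, 230.0)]
t = fit_affine(src, dst)
```

A full version would loop over the destination pixels, sample the awning texture through the inverse map, and darken the region under it with a shadow mask. The generative model would then only be needed, if at all, to harmonize lighting at the end.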

Proprietary? Either gpt-image-2 or NB2.

I have an example of interior decorating inpainting where I replaced a large floor-to-ceiling window with a mirror, and the result was pretty impressive using NB Pro from nearly a year ago.

https://imgpb.com/ZXkiXV

Locally hostable? For my money I'd argue Flux.2 Klein but Qwen-Edit still puts in the work.

Flux Klein with LoRA. GPT Image and Nano Banana often produce high-frequency artifacts when editing.

How many times have you edited a photo you took on your phone in the last 7 days?

With this size we could have an interactive web demo.

Scared for the same reason I found last year's 'Ghibli filter' craze upsetting: I would have personally hated to see this artist's legacy used to promote AI image generation.

For locally hostable image editing models, the edit variant of the recently released Boogu-Image[1] model is very good. Anecdotally, I'd say way better than Flux.2 Klein 9B and Qwen-Edit.

[1]: https://github.com/boogu-project/Boogu-Image

NB2 means "Nano Banana 2", a Google image generation model. https://blog.google/innovation-and-ai/technology/ai/nano-ban...

As far as I know, gpt-image-2 doesn't even let you define a mask unless you've already run it through one iteration, and once you do define the mask, it just ignores it 90% of the time. It's utterly useless for inpainting. Also, this and other proprietary models are severely limited in their output resolution.

I do agree, however, that the Flux2 family is the SoTA at the moment. Running locally via something like Comfy gets incredible results.

I think 3? I feel like that's often enough. Sometimes it's nice to do a quick dumb ass gag on a whim. If I am anything I am a man who loves a dumb ass gag.

Half a dozen at least.

(I'm counting only times I used generative editing options in my Galaxy phone - if I were to take your question literally, it would be "at least once every other day", simply due to rotating and cropping.)

Personally, about 9 times. It would be higher if it were even easier and cheaper.

I have no idea but I think you might be onto something.

So you're saying that, if I can calculate from the picture the position (height, inclination and such), and I can render the model (should be doable) for that height and angle, my best course of action could be to combine original + render and only at the end use a visual model? That could be interesting.

I'm quite perplexed by this comment. If I'm understanding you correctly, sure, what you describe is possible through significantly more effort, orchestration, and source photos. Or we can grab one still image and throw an inpainting model at it.

In case that happened then the rest of the world would probably appreciate the art, and a subset of it, the artist (and even a small subset of ~whole Internet-connected population is a lot of people). Some silver lining, perhaps.

Good on you. I've laughed at many dumbass gags but I've only been a passive consumer of them.

> In case that happened then the rest of the world would probably appreciate the art

What art?

We’re talking about generated pictures, aka slop, not art made by a real human.

And I don’t know if you’ve been paying attention but people seem to be pretty tired of the slop. I don’t think it would be appreciated nearly as much as you think.

It is possible to use generative AI in nonslop ways btw

This definition of "slop" doesn't cut reality just quite at the joints.

People are tired of marketing. The AI-generated slop people are annoyed with is garbage produced for marketing reasons, and it's distinctly noticeable precisely because all the bottom-feeder marketing houses switched to using it. But it's not the AI itself that's the problem here. Slop was here before, but it was made with cheap protein-based image generators. Silicon-based generators are just cheaper.

> This definition of "slop" doesn't cut reality just quite at the joints.

> People are tired of marketing.

You know what, I'll give you that one. I find most generated art pretty tasteless, but I have enjoyed the occasional piece of fiction with small generated elements for atmosphere. I still hesitate to call it 'art', but I will grant it's not all 'slop'.

But for the second part:

> But it's not the AI itself that's the problem here. Slop was here before, but it was made with cheap protein-based image generators. Silicon-based generators are just cheaper.

I think the problem is how much cheaper it is now. I would estimate generating a picture is at least 2 orders of magnitude cheaper than paying even a cheap human, so with the same amount of money being invested into slop we are due for - and seeing - a huge tidal wave of it, because the same amount of money turns out way more crap now.

Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance

(*) Equal Contribution, (†) Project Leader, (📧) Corresponding Author.

¹Huazhong University of Science and Technology, ²VIVO AI Lab

Abstract

While 10B-level industrial foundation models have pushed the boundaries of image inpainting, their prohibitive computational costs severely hinder practical deployment. Constructing a highly optimized task-specific specialist offers a promising solution; however, extreme structural compression inevitably triggers a severe representation bottleneck. To conquer this, we propose Moebius, a highly efficient lightweight inpainting framework. We systematically reconstruct the diffusion backbone by introducing the Local-λ Mix Interaction (LλMI) block. Comprising Local-λ and Interactive-λ modules, it elegantly summarizes spatial contexts and global semantic priors into fixed-size linear matrices, preserving complex latent interactions while drastically shedding parameters. Furthermore, to unlock the full representational capacity of this highly compact architecture, we synergistically pair it with an adaptive multi-granularity distillation strategy. Operating strictly within the latent space to avoid expensive pixel-space decoding, this strategy dynamically balances multiple gradient-based losses to achieve high-fidelity alignment. Extensive experiments across natural and portrait benchmarks demonstrate that this optimal synergy enables Moebius to rival or even surpass the generation quality of the 10B-level industrial generalist FLUX.1-Fill-Dev. Remarkably, Moebius achieves this using less than 2% of the parameters (0.22B vs. 11.9B) while delivering a >15× acceleration in total inference time, setting a new efficiency standard for high-fidelity inpainting.

Method

Overall pipeline of Moebius. We adopt the Latent Diffusion Model (LDM) framework equipped with Latent Categories Guidance (LCG). To achieve extreme architectural efficiency, the denoising U-Net is systematically restructured using our proposed LλMI blocks (detailed in Sec. 3.2). Furthermore, an adaptive multi-granularity distillation strategy (Sec. 3.3) is applied during training to align our lightweight specialist with the high-capacity teacher, successfully mitigating the capacity drop caused by extreme structural compression.
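The LλMI block is described as condensing context into fixed-size matrices rather than paying the quadratic cost of standard attention. The block itself is not specified on this page, so the following is only a generic kernelized linear-attention sketch of that general idea, not the authors' implementation. The function names and the elu(x) + 1 feature map are assumptions borrowed from the linear-attention literature.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1, a common choice in linear attention."""
    return x + 1.0 if x > 0 else math.exp(x)


def linear_attention(q, k, v):
    """Generic linear attention on nested lists.

    q: n x d queries, k: m x d keys, v: m x dv values. Returns n x dv.
    """
    d, dv = len(k[0]), len(v[0])
    # Fixed-size summary: `context` is d x dv and `z` has length d,
    # regardless of how many key/value tokens (m) are folded in.
    context = [[0.0] * dv for _ in range(d)]
    z = [0.0] * d
    for k_row, v_row in zip(k, v):
        fk = [phi(x) for x in k_row]
        for i in range(d):
            z[i] += fk[i]
            for j in range(dv):
                context[i][j] += fk[i] * v_row[j]
    out = []
    for q_row in q:
        fq = [phi(x) for x in q_row]
        norm = sum(a * b for a, b in zip(fq, z))
        out.append(
            [sum(fq[i] * context[i][j] for i in range(d)) / norm for j in range(dv)]
        )
    return out
```

Each query reads only from the d × dv summary, so the cost grows linearly with the number of tokens instead of quadratically.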

Highlights

📉 Extreme Parametric Efficiency (< 2%): Moebius operates with a mere 0.22B (226M) parameters, which represents less than 2% of the size of the colossal industrial giant FLUX.1-Fill-Dev (11.9B). It shatters the heavy-compute narrative, making high-quality inpainting accessible on consumer-grade and edge devices.
⚡ 15× Inference Speedup (26ms/step): Achieves a blistering inference latency of only 26.01 ms per step on a single GPU. Combined with optimized sampling steps, Moebius delivers an overall >15× total runtime acceleration compared to 10B-level models.

🏆 10B-Level Inpainting Quality (on-par-with/surpass FLUX.1-Fill-Dev across 6 benchmarks): Size contraction does not mean representation degradation. Through the synergistic optimization of architecture and distillation, Moebius performs on par with, and in certain scenarios (such as complex textures and facial plausibility), surpasses 10B-level state-of-the-art (SOTA) generalist models (FLUX.1-Fill-Dev, SD3.5 Large-Inpainting) across 6 comprehensive benchmarks spanning both natural scenes (Places2) and portrait scenes (CelebA-HQ, FFHQ).
💡 Synergistic Core Innovations:
- Architecture Design (LλMI Block): Reformulates both self- and cross-attention by condensing spatial context and global semantic priors into fixed-size linear matrices, bypassing quadratic computational overhead.
- Adaptive Multi-Granularity Distillation Strategy: Transfers the representational capacity from our PixelHacker (teacher) strictly within the latent space (avoiding expensive pixel-space decoding). It bridges the giant capacity gap by aligning multi-granularity supervision—ranging from microscopic intermediate features to macroscopic diffusion trajectories—while dynamically balancing training via a gradient norm adaptive loss weighting mechanism.
- Optimal Synergistic Balancing: Systematically explores the mutual constraint and upper bound between compact structure and distillation. By mapping this architecture-distillation synergy frontier, we ensure our 0.22B Moebius (student) absorbs the maximum semantic reasoning of PixelHacker (teacher) without triggering representation saturation.

🚀 Task-Specific Specialist over Bloated Generalists: Rather than blindly scaling up, Moebius answers a fundamental question: Can a model be smarter, lighter, and faster when the task is explicitly defined? It serves as a highly optimized specialist that liberates real-world image inpainting and AI object removal from parameter bloat.
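The highlights mention a gradient norm adaptive loss weighting mechanism without giving its formula. One simple way such a scheme could work is sketched below: scale each loss so that its gradient has the same norm as the average across losses. This is an assumption for illustration, not the rule used in the paper, and the loss names `feature` and `trajectory` are hypothetical.

```python
import math


def grad_norm(g):
    """L2 norm of a flattened gradient given as a list of floats."""
    return math.sqrt(sum(x * x for x in g))


def adaptive_weights(grads, eps=1e-12):
    """Return one weight per loss so that all weighted gradients share a norm.

    grads maps a loss name to its flattened gradient. Each weight is the
    mean gradient norm divided by that loss's own gradient norm.
    """
    norms = {name: grad_norm(g) for name, g in grads.items()}
    target = sum(norms.values()) / len(norms)
    return {name: target / (n + eps) for name, n in norms.items()}


# Hypothetical gradients from two distillation losses with very different scales.
grads = {"feature": [3.0, 4.0], "trajectory": [0.3, 0.4]}
weights = adaptive_weights(grads)
```

With these weights neither loss dominates the update purely because of its scale. In a real training loop the norms would be recomputed, and probably smoothed, every few steps.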

Visualizations

- Natural Scenes -

- Portrait Scenes -

Comparison on Natural Scenes (Places2)

Comparison on Portrait Scenes (CelebA-HQ, FFHQ)

BibTeX

@misc{DuanAndXu2026Moebius,
      title={Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance},
      author={Kangsheng Duan and Ziyang Xu and Wenyu Liu and Xiaohu Ruan and Xiaoxin Chen and Xinggang Wang},
      year={2026},
      eprint={2606.19195},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.19195},
}
