Qwen-Image: Crafting with native text rendering

qwenlm.github.io

We are thrilled to release Qwen-Image, a 20B MMDiT image foundation model that achieves significant advances in complex text rendering and precise image editing. To try the latest model, feel free to visit Qwen Chat and choose “Image Generation”.

The key features include:

Superior Text Rendering: Qwen-Image excels at complex text rendering, including multi-line layouts, paragraph-level semantics, and fine-grained details. It supports both alphabetic languages (e.g., English) and logographic languages (e.g., Chinese) with high fidelity.
Consistent Image Editing: Through our enhanced multi-task training paradigm, Qwen-Image achieves exceptional performance in preserving both semantic meaning and visual realism during editing operations.
Strong Cross-Benchmark Performance: Evaluated on multiple public benchmarks, Qwen-Image consistently outperforms existing models across diverse generation and editing tasks, establishing a strong foundation model for image generation.

Performance

We present a comprehensive evaluation of Qwen-Image across multiple public benchmarks, including GenEval, DPG, and OneIG-Bench for general image generation, as well as GEdit, ImgEdit, and GSO for image editing. Qwen-Image achieves state-of-the-art performance on all benchmarks, demonstrating its strong capabilities in both image generation and editing. Furthermore, results on LongText-Bench, ChineseWord, and TextCraft show that it excels in text rendering—particularly in Chinese text generation—outperforming existing state-of-the-art models by a significant margin. This highlights Qwen-Image’s unique position as a leading image generation model that combines broad general capability with exceptional text rendering precision.

Discussion (28 comments)

Not sure why this isn’t a bigger deal. It seems like this is the first open-source model to beat gpt-image-1 in all respects while also beating Flux Kontext in terms of editing ability. This seems huge.

Good release! I've added it to the GenAI Showdown site. Overall a pretty good model scoring around 40% - and definitely represents SOTA for something that could be reasonably hosted on consumer GPU hardware (even more so when it's quantized).

That being said, it still lags pretty far behind OpenAI's gpt-image-1 strictly in terms of prompt adherence for txt2img prompting. However as has already been mentioned elsewhere in the thread, this model can do a lot more around editing, etc.

https://genai-showdown.specr.net

The fact that it doesn’t change the images like 4o image gen is incredible. Often when I try to tweak someone’s clothing using 4o, it also tweaks their face. This seems to apply those recognizable AI artifacts only to the elements needing to be edited.

This may be obvious to people who do this regularly, but what kind of machine is required to run this? I downloaded & tried it on my Linux machine that has a 16GB GPU and 64GB of RAM. This machine can run SD easily. But Qwen-image ran out of space both when I tried it on the GPU and on the CPU, so that's obviously not enough. But am I off by a factor of two? An order of magnitude? Do I need some crazy hardware?
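
For a rough sense of scale, here is a back-of-the-envelope sketch of the memory needed just to hold 20B parameters at different precisions. It assumes the 20B figure from the announcement and standard bytes-per-parameter values. It ignores the text encoder, the VAE, and activation memory, so real requirements will be higher. The function name and the dtype table are illustrative, not part of any Qwen-Image tooling.

```python
# Rough weight-only memory estimate for a 20B-parameter model.
# Assumption: 20e9 parameters, as stated in the announcement.
# Not included: text encoder, VAE, activations, framework overhead.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the size of the weights alone, in GiB, for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


PARAMS = 20e9
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(PARAMS, dtype):.1f} GiB")
```

Under these assumptions the bf16 weights alone come to roughly 37 GiB, so a 16GB GPU is short by more than a factor of two before counting anything else, and only a 4-bit quantization of the weights would fit under 16 GiB.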

A silly question: do any of these models generate pixels and also vector overlays? I don't see why we need to solve the text problem pixel-for-pixel if we can just generate higher-level descriptions of the text (text, font, font size, etc). Ofc, it won't work in all situations, but it will result in high fidelity for common business cases (flyers, websites, brochures, etc).
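
To make the idea concrete, here is a minimal sketch of what such a higher-level description could look like: a list of text records (text, position, font, size) rendered as SVG text elements over a generated background image. Everything here is hypothetical, including the `TextOverlay` fields and the `overlays_to_svg` helper. It is not something Qwen-Image or any other model mentioned in this thread is known to produce.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape, quoteattr


@dataclass
class TextOverlay:
    """Hypothetical record describing one piece of text to draw as a vector."""
    text: str
    x: int
    y: int
    font_family: str = "sans-serif"
    font_size: int = 24
    fill: str = "#000000"


def overlays_to_svg(width: int, height: int, background_href: str,
                    overlays: list[TextOverlay]) -> str:
    """Build an SVG document: a raster background plus vector text on top."""
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">',
        # quoteattr escapes the value and wraps it in quotes for us.
        f'  <image href={quoteattr(background_href)} '
        f'width="{width}" height="{height}"/>',
    ]
    for o in overlays:
        parts.append(
            f'  <text x="{o.x}" y="{o.y}" '
            f'font-family={quoteattr(o.font_family)} '
            f'font-size="{o.font_size}" fill={quoteattr(o.fill)}>'
            f'{escape(o.text)}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


# Example: a flyer background (hypothetical file name) with two text layers.
svg = overlays_to_svg(
    800, 600, "background.png",
    [
        TextOverlay("Grand Opening", x=80, y=120, font_size=64),
        TextOverlay("Coffee & Books", x=80, y=200, font_family="serif"),
    ],
)
print(svg)
```

The text stays editable and pixel-perfect in this form, but as the comment notes, it would not cover text that has to follow the scene's lighting, perspective, or surface texture.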

In their own first example of English text rendering, it's mistakenly rendered "The silent patient" as "The silent Patient", "The night circus" as "The night Circus", and miskerned "When stars are scattered" as "When stars are sca t t e r e d".

The example further down has "down" not "dawn" in the poem.

For these to be their hero image examples, they're fairly poor; I know it's a significant improvement vs. many of the other current offerings, but it's clear the bar is still being set very low.

Does anyone know how they actually trained text rendering into these models?

To me they all seem to suffer from the same artifacts, that the text looks sort of unnatural and doesn't have the correct shadows/reflections as the rest of the image. This applies to all the models I have tried, from OpenAI to Flux. Presumably they are all using the same trick?

Insane how many good Chinese open source models they've been releasing. This really gives me hope

Can it generate images of a lone person standing in front of a column of tanks on Tiananmen Square?

I just tested it out, very impressive results. I wonder what the Qwen team did behind the scenes to make this work so well.

https://chat.qwen.ai/

(Select "Image Generation" and be sure to use the Qwen3-235B model - also tried selecting "Coder" but it errors out.)

“Qwen‑Image: Open‑source 20 B MMDiT model with stunning text rendering and image editing. Effortlessly create bilingual posters, infographics, slides, infill edits, comics.”

Go experience AI: https://www.qwenimagen.com/

Why it works: Highlights open‑source nature and 20 billion‑parameter strength. Emphasizes its superior multilingual, layout‑aware text rendering. Mentions real‑world use cases: posters, slides, graphics, image editing, comics/info visuals.

Jaw dropping. Because text rendering isn't easy even with regular programming SDKs etc.

Anyone thinking otherwise hasn't attempted implementing it or hasn't thought about it in depth.

For my first attempt I plugged in text and a description of a small new Unity package I'm working on, and it matched the intent/text extremely well.

There were a few small text mistakes and the image isn't quite as good as I've seen before, but overall it delivers on its promise.

> In this case, the paper is less than one-tenth of the entire image, and the paragraph of text is relatively long, but the model still accurately generates the text on the paper.

Nope. The text includes the line "That dawn will bloom" but the render reads "That down will bloom", which is meaningless.

A beast. Supposedly beats GPT-4o in image generation and Flux Kontext in image editing.

If it’s as good as they say, one less reason for that ChatGPT sub..

The text rendering is impressive, but I don't understand the value — wouldn't it be easier to add any text that you like in Figma?

I’m interested to see what this model can do, but also kinda annoyed at the use of a Studio Ghibli style image as one of the first examples. Miyazaki has said over and over that he hates AI image generation. Is it really so much to ask that people not deliberately train LoRAs and finetunes specifically on his work and use them in official documentation?

It reminds me of how CivitAI is full of “sexy Emma Watson” LoRAs, presumably because she very notably has said she doesn’t want to be portrayed in ways that objectify her body. There’s a really rotten vein of “anti-consent” pulsing through this community, where people deliberately seek out people who have asked to be left out of this and go “Oh yeah? Well there’s nothing you can do to stop us, here’s several terabytes of exactly what you didn’t want to happen”.

Still waiting for a model that generates 2D/3D environments to get rendered by tools like Blender

Wow, the text/writing is amazing! Also the editing in general, but the text really stands out

What is the lowest-end graphics card that can run this self-hosted with reasonable output?

Pelican riding a bicycle image came out really nicely: (it added the text)

https://cdn.qwenlm.ai/output/wV13g6892e758082439d7000d439ed5...

It will take years for people to use these but Adobe is not alone.

Is this an official site? https://qwen-image.ai

Very interesting. People here talk a lot about censorship. However, I just replied with something mentioning AIPAC, and my reply has been deleted. lol

Team Qwen: Please stop ripping off Studio Ghibli to demo your product.