Boogu-Image-0.1

Boosting Open-Source Unified Multimodal Understanding and Generation

Boogu-Image-0.1 is a competitive open-source family of unified image generation and editing models, released under Apache-2.0. The family includes Base, Turbo, Edit, and other variants, which provide stable, practical capabilities for high-quality text-to-image generation, fast generation, image editing, and Chinese-English text rendering. In many scenarios, its performance matches top closed-source models.


Boogu Vision

Closed-source multimodal understanding and generation systems such as Nano Banana Pro and GPT-Image-2 achieve remarkable performance not through a single model, but through a highly unified suite of system capabilities. Even so, we find that with training compute far below that of closed-source systems, systematically improving a model's understanding ability, data quality, and training pipeline can still significantly improve image generation and editing performance. Our training data is also roughly one order of magnitude smaller than that of some existing open-source models. We hope our empirical study and open-source release will help advance the open-source ecosystem for unified multimodal understanding and generation.

Competitive General Performance

Different models have different strengths, so it is difficult to name a single objectively best model. Relative rankings also shift from one benchmark to another. Still, Boogu demonstrates competitive performance across many scenarios and benchmarks.

Boogu Arena
Strong overall performance
Across all evaluated models, the Boogu-Image-0.1 family ranks among the very top.
Qwen-Image-Bench
Top open-source performance
Boogu-Image-0.1 ranks first among evaluated open-source models on Qwen-Image-Bench.
Image Editing
Unified generation and editing
Boogu keeps competitive image editing performance while preserving strong text-to-image generation quality.

Boogu Arena. Since we could not evaluate on LM Arena directly, we created Boogu Arena. The leaderboard below reports Arena-style preference results across leading closed-source and open-source image generation systems. Across all evaluated models, the Boogu-Image-0.1 family ranks among the very top. We welcome teams with questions about the results to contact us so that we can work toward more objective, fair, and reproducible evaluation.

We believe evaluation of image generation systems should also take inference time into account. However, because different models run on different hardware platforms and serving environments, we do not provide a direct latency comparison here. Notably, on high-performance hardware, the raw Boogu-Image-0.1-Turbo model can run a single inference in under 1 second.

Evaluation setup. Boogu Arena follows the spirit of LM Arena-style evaluation. We use an LLM to generate a large set of diverse user personas, then ask each persona to produce a number of image generation prompts, resulting in more than 1K test prompts in total. We will release these prompts publicly for community reproduction and review.
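
As a rough illustration of how Arena-style preference votes can be turned into a leaderboard, the sketch below applies the standard online Elo update to a handful of pairwise votes. This page does not describe the exact rating procedure used in Boogu Arena, so the update rule, the `k` factor, the starting rating, and the placeholder model names are all assumptions made for illustration, not the actual evaluation code or results.

```python
from collections import defaultdict


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(votes, k=16.0, base=1000.0):
    """Compute Elo ratings from (model_a, model_b, outcome) votes.

    outcome is 1.0 if model_a is preferred, 0.0 if model_b is, 0.5 for a tie.
    k and base are assumed values, not taken from the page.
    """
    ratings = defaultdict(lambda: base)
    for a, b, outcome in votes:
        e_a = expected_score(ratings[a], ratings[b])
        delta = k * (outcome - e_a)
        ratings[a] += delta  # winner gains what the loser gives up
        ratings[b] -= delta
    return dict(ratings)


# Toy votes with placeholder model names (not real results).
votes = [
    ("model_x", "model_y", 1.0),
    ("model_x", "model_z", 1.0),
    ("model_y", "model_z", 0.5),
    ("model_y", "model_x", 0.0),
]
ratings = elo_ratings(votes)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Online Elo depends on the order in which votes are processed, so a practical leaderboard would typically average over shuffled vote orders or fit a Bradley-Terry model instead.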

[Figure: Boogu Arena ELO chart]

Boogu Arena visual comparison:
Boogu-Image-0.1: our result.
Qwen-Image-2512: open-source baseline from a leading text-to-image evaluation setting.
HiDream-O1: open-source baseline from a leading arena-style setting.
Seedream 5.0: strong proprietary baseline for visual preference comparison.

Qwen-Image-Bench. Qwen-Image-Bench is a recent, high-quality benchmark that was released after we froze our T2I training data. Compared with long-standing benchmarks, it is less affected by common issues such as data leakage, which makes it a useful testbed for modern image generation models. On this benchmark, Boogu-Image-0.1 achieves top-tier performance among the evaluated open-source models. Due to time constraints, the evaluation does not yet cover all available open-source baselines.

[Figure: Parameter Efficiency. Final score vs. model size on Qwen-Image-Bench. x-axis: Parameters (B); y-axis: Final Score. Models plotted: GLM-Image (7B), Qwen-Image (20B), Qwen-Image-2512 (20B), Hunyuan-Image-3.0 (80B), and Boogu-Image-0.1 (10B, 53.58), which is highlighted against the other open-source baselines.]

Parameter efficiency on Qwen-Image-Bench. Boogu-Image-0.1 (10B) achieves the highest final score (53.58) among the compared open-source models, outperforming larger counterparts such as Qwen-Image-2512 (20B, 52.06) and Hunyuan-Image-3.0 (80B, 50.81). This suggests that competitive benchmark performance can be obtained without scaling to substantially larger parameter counts.
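
The comparison above can be checked with a few lines of arithmetic. The sketch below uses the parameter counts from the figure labels and the Overall scores reported on this page, assuming that the figure's "Final Score" is the same quantity as the table's Overall column and that the differently hyphenated model names refer to the same models. The `pareto_front` helper is an illustrative addition, not part of the benchmark.

```python
# (model, parameters in billions, final score) as reported on this page.
models = [
    ("GLM-Image", 7, 48.19),
    ("Qwen-Image", 20, 49.23),
    ("Qwen-Image-2512", 20, 52.06),
    ("Hunyuan-Image-3.0", 80, 50.81),
    ("Boogu-Image-0.1", 10, 53.58),
]


def pareto_front(rows):
    """Return models not dominated on (fewer parameters, higher score).

    A model is dominated if another model is no larger and scores at least
    as high, and is strictly better on at least one of the two.
    """
    front = []
    for name, params, score in rows:
        dominated = any(
            p <= params and s >= score and (p < params or s > score)
            for _, p, s in rows
        )
        if not dominated:
            front.append(name)
    return front


best = max(models, key=lambda m: m[2])
print("highest score:", best[0], best[2])
print("size/score Pareto front:", pareto_front(models))
```

Under these assumptions, only the smallest model and the highest-scoring model remain on the size/score front, which is one way to read the claim that a higher score was reached without a larger parameter count.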

| Model | Open Source | Quality ↑ | Aesthetics ↑ | Alignment ↑ | Real-world Fidelity ↑ | Creative Generation ↑ | Overall ↑ |
|---|---|---|---|---|---|---|---|
| GPT Image 2 | Closed | 58.65 | 67.53 | 65.85 | 57.38 | 75.23 | 64.69 |
| Nano Banana 2.0 | Closed | 54.77 | 61.08 | 62.40 | 54.28 | 67.05 | 59.82 |
| GPT Image 1.5 | Closed | 55.14 | 60.88 | 61.72 | 53.95 | 66.35 | 59.65 |
| Nano Banana Pro | Closed | 55.67 | 60.26 | 61.25 | 54.07 | 66.23 | 59.45 |
| Qwen Image 2.0 Pro | Closed | 54.39 | 58.67 | 59.28 | 51.83 | 64.94 | 57.84 |
| Seedream 5.0 | Closed | 52.55 | 58.40 | 58.90 | 51.92 | 65.29 | 57.22 |
| Seedream 4.5 | Closed | 54.41 | 58.72 | 57.31 | 51.69 | 60.64 | 56.78 |
| Seedream 4.0 | Closed | 54.01 | 58.81 | 56.64 | 51.05 | 58.15 | 56.21 |
| FLUX 2 Max | Closed | 53.64 | 56.85 | 57.35 | 49.35 | 56.50 | 55.33 |
| FLUX 2 Pro | Closed | 52.30 | 56.94 | 57.01 | 47.29 | 56.18 | 54.57 |
| GPT Image 1 | Closed | 52.34 | 55.09 | 56.28 | 48.14 | 55.78 | 54.07 |
| Boogu-Image-0.1 | Apache-2.0 | 51.19 | 55.42 | 55.78 | 48.01 | 55.55 | 53.58 |
| Qwen Image 2512 | Apache-2.0 | 51.76 | 54.74 | 52.72 | 47.00 | 50.19 | 52.06 |
| Imagen 4.0 Ultra | Closed | 50.90 | 54.25 | 54.02 | 45.59 | 51.14 | 51.99 |
| HunyuanImage 3.0 | Other | 50.35 | 53.57 | 52.00 | 44.31 | 49.12 | 50.81 |
| Imagen 4.0 | Closed | 50.16 | 52.68 | 51.64 | 44.84 | 47.94 | 50.29 |
| Qwen Image | Apache-2.0 | 48.44 | 52.25 | 50.72 | 43.16 | 47.30 | 49.23 |
| Kling Image 2.1 | Closed | 49.11 | 50.15 | 49.18 | 44.74 | 44.67 | 48.26 |
| GLM Image | Apache-2.0 | 49.26 | 50.64 | 47.90 | 44.69 | 45.23 | 48.19 |

Closed = closed-source  |  Open = open-source license not specified here

About ImgEdit. We include ImgEdit_O as a supplementary reference. In our observations, this benchmark does not always align well with human visual judgment and has limited coverage of In-Context Generation scenarios. As a result, it may not fully reflect the real user experience of current image editing models and may underestimate the performance of some closed-source models in interactive use cases. Whether ImgEdit should be used as a primary benchmark going forward should therefore be considered carefully; the results are kept here mainly for comparison with prior work.

| Model | ImgEdit_O ↑ |
|---|---|
| Boogu-Image-0.1-Edit | 4.64 |
| JoyAI | 4.57 |
| FireRed-Image-Edit | 4.56 |
| Qwen-Image-Edit-2511 | 4.51 |
| LongCat-Image-Edit | 4.50 |
| Nano Banana Pro | 4.37 |
| FLUX.2 [Dev] | 4.35 |
| Seedream 4.5 | 4.32 |
| Qwen-Image-Edit-2509 | 4.31 |
| Seedream 4.0 | 4.30 |
| Nano Banana | 4.29 |
| Step1X-Edit-v1.2 | 3.95 |

Five Powerful Variants, One Unified Family

The Boogu-Image-0.1 family offers a full suite of models covering generation, editing, and versatile foundation use cases. We look forward to growing the family together with the open-source community.

[Figure: Text-to-Image Variant Selection Sketch. x-axis: Inference Time; y-axis: Quality. Points shown: Boogu-Image-0.1-Turbo, Boogu-Image-0.1-Turbo-PE, Boogu-Image-0.1-Turbo-Thinking, Boogu-Image-0.1-Pro.]
Illustrative only: farther right means longer inference time, and higher means stronger generation quality.
Boogu-Image-0.1-Pro is a text-to-image system that combines Boogu-Image-0.1-Base and Boogu-Image-0.1-Turbo, with a stronger focus on high-quality generation scenarios.
Boogu-Image-0.1-Base
The core text-to-image foundation model behind Boogu-Image-0.1-Turbo. Focuses on high-quality generation, rich aesthetics, strong diversity, and controllability — ideal for creative workflows, fine-tuning, and downstream development. It is mainly intended for complex text-heavy scenarios such as ultra-dense text rendering with more than 100 characters; for photorealism, the Turbo model is usually the better default choice.
Boogu-Image-0.1-Turbo
A distilled Boogu-Image-0.1-Base variant with the same parameter count as the base model, typically requiring only 3-4 steps. We place particular emphasis on optimizing its photorealism while preserving bilingual text rendering and prompt adherence.
Boogu-Image-0.1-fp8
A quantized deployment-oriented variant for lower memory inference. It keeps the core Boogu-Image-0.1 behavior while reducing serving cost, making it a practical option for constrained hardware and high-throughput deployment.
Boogu-Image-0.1-Edit
Built for image editing and image-to-image workflows. Follows bilingual natural-language instructions for precise, creative edits — from local adjustments to imaginative transformations. It currently focuses on photography-oriented editing scenarios; performance remains more limited for In-Context Generation with large viewpoint or structure changes.
Boogu-Image-0.1-Edit-Turbo
A faster editing variant for image-to-image iteration. It targets lower-latency editing workflows while preserving practical instruction following for photography, text-heavy edits, and creative transformations.
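
To give a sense of why a quantized variant helps on constrained hardware, the sketch below estimates weight memory at two precisions. It assumes the 10B parameter count quoted for Boogu-Image-0.1 in the Qwen-Image-Bench figure applies to the weights being quantized, and that the unquantized weights are stored in bf16; neither is stated on this page. It counts weights only and ignores activations, the text encoder, and the VAE.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Approximate memory for model weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


params = 10e9  # 10B, assumed from the model size quoted on this page
for dtype in ("bf16", "fp8"):
    print(f"{dtype}: {weight_memory_gib(params, dtype):.1f} GiB")
```

Under these assumptions the weights take roughly 18.6 GiB in bf16 and 9.3 GiB in fp8. Actual serving memory will be higher and depends on the runtime.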

Things Many Practitioners Know, But Few Papers Emphasize

Our report focuses on a set of practical observations that are already familiar to strong image-generation teams, yet are still under-discussed in public technical reports.

Lesson 01
Understanding and reasoning are a larger bottleneck than they first appear.
Strong visual quality alone is not enough. Many failures come from weak multimodal understanding, compositional reasoning, and instruction interpretation. Teams behind GPT-Image, Nano Banana, and Seedream have extremely strong understanding models, which gives them a major advantage in this dimension. For open-source teams that can only rely on open-source understanding models, this gap is difficult to close.
Lesson 02
Stronger multimodal understanding models make better text encoders.
We find that using a stronger multimodal understanding model as the text encoder can significantly improve the model's ability to understand complex prompts, fine-grained concepts, and contextual relationships, leading to better generation and editing results.
Lesson 03
Caption quality matters even more than expected.
Better captions do not merely improve alignment; they reshape what the model learns to attend to, especially for fine-grained objects, layout, and intent. Longer is not always better, and shorter is not always better either—different concepts need tailored captions.
Lesson 04
Test-time scaling can reliably improve quality, but latency must be balanced.
Test-time scaling can often improve generation quality in a stable way, including strategies such as prompt rewriting, candidate output inspection, and feedback-based regeneration. However, these methods also lengthen the inference pipeline and increase generation time, so a practical system must choose an appropriate balance between quality gains and user waiting cost. A fair evaluation protocol should consider both output quality and inference time.
Lesson 05
Current benchmarks do not fully match user experience.
Existing benchmarks have become the elephant in the room: many people pretend not to see that they are no longer sufficient for fully evaluating model performance. These benchmarks once greatly advanced the field, but today they often diverge sharply from what real users perceive as good outputs in interactive products. Early in the project, over-indexing on public test sets significantly slowed down our development loop; eventually, we decided to completely stop optimizing around public test sets, and instead relied more on real usage scenarios, human review, and user-experience-oriented evaluation.
Lesson 06
Unified models matter, but scalable infrastructure matters more.
A unified understanding-and-generation architecture is valuable, but the bigger advantage is reusing mature LLM infrastructure, making training, serving, and scaling easier. We believe that for teams with weaker LLM infrastructure, adopting a unified architecture may bring limited returns.
Lesson 07
High-quality image data remains the hardest gap for open-source models.
High-quality image data is critical, but in practice the quality of open datasets is still far below carefully sourced or commercially licensed data used by major labs.
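
The quality-versus-latency trade-off described in Lesson 04 can be sketched as a small loop that rewrites the prompt, generates candidates, inspects each one, and stops once a candidate is good enough or the time budget is spent. Everything here is hypothetical: `generate`, `inspect_candidate`, and `rewrite` are stand-in stubs, and the thresholds are arbitrary; this is not the Boogu pipeline.

```python
import random
import time


def generate(prompt: str, seed: int) -> dict:
    """Hypothetical stand-in for an image generator; returns a fake candidate."""
    rng = random.Random(seed)
    return {"prompt": prompt, "seed": seed, "quality": rng.random()}


def inspect_candidate(candidate: dict) -> float:
    """Hypothetical stand-in for an output inspector, such as a VLM judge."""
    return candidate["quality"]


def rewrite(prompt: str) -> str:
    """Hypothetical prompt rewriter; here it only appends a fixed hint."""
    return prompt + ", detailed, coherent composition"


def best_of_n(prompt, max_candidates=4, good_enough=0.9, budget_s=5.0):
    """Generate candidates until one passes inspection or the budget runs out."""
    start = time.monotonic()
    prompt = rewrite(prompt)
    best, best_score, used = None, float("-inf"), 0
    for seed in range(max_candidates):
        candidate = generate(prompt, seed)
        used += 1
        score = inspect_candidate(candidate)
        if score > best_score:
            best, best_score = candidate, score
        # Stop early: quality target met, or waiting longer costs too much.
        if best_score >= good_enough or time.monotonic() - start > budget_s:
            break
    return best, best_score, used


best, score, used = best_of_n("a red bicycle leaning on a brick wall")
print(f"used {used} candidate(s), best score {score:.2f}")
```

Raising `max_candidates` or `good_enough` trades longer waits for a better expected result, which is why an evaluation that reports quality should also report how many candidates and how much time were spent.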

Current Limitations

Limitation 01
World knowledge lags far behind closed-source models, and the gap is extremely hard to evaluate.
For tasks that require rich common sense, domain knowledge, real brands or people, and complex contextual understanding (such as artistic styles, famous landmarks, celebrities, and products), Boogu still has a clear gap from strong closed-source systems. This capability is also extraordinarily expensive to measure: a single landmark category alone may require 3,000+ test samples to cover, and even Arena-style evaluation struggles to assess it fully. Existing benchmarks can therefore barely quantify this dimension, and the real gap is likely larger than measured scores suggest.
Limitation 02
Image-to-image consistency and some in-context scenarios still lag behind.
For editing tasks that require strict preservation of the input subject, identity, layout, or fine details, Boogu's image-to-image consistency is still not stable enough. Because our image-to-image capability focuses more on applications such as photography and text generation, Boogu still trails Seedream 5.0 and Nano Banana Pro in some in-context generation scenarios.
Limitation 03
Text rendering is not stable enough yet.
Boogu can handle many Chinese and English text scenarios, but long text, dense typography, small fonts, and complex design layouts can still produce typos, missing characters, or layout drift. Text rendering is currently focused on Chinese and English; other languages are not specifically optimized and may degrade noticeably.
Limitation 04
Body structure can still fail in complex poses.
In multi-person interaction, occlusion, exaggerated motion, or unusual viewpoints, hands, limbs, and body structure may still become unnatural or inconsistent.
Limitation 05
Small faces and small limbs remain challenging.
Because we use the open-source FLUX.1 VAE, the reconstruction loss is relatively large. As a result, details such as small faces, small limbs, eyes, and text may still show artifacts or instability.
Limitation 06
The open-source release is still limited in scope.
Due to resource constraints, engineering complexity, and release boundaries, we are not able to open-source every training and system detail. The current open-source release aims to balance reproducibility, usability, and sustainable maintenance while providing a reliable starting point for community research and improvement.
Acknowledgements. Closed-source systems such as GPT-Image, Nano Banana, and the Seedream series helped us better understand the frontier capabilities and practical boundaries of unified understanding-and-generation systems. We also thank the Qwen, Z-Image, OmniGen2, FLUX, and broader open-source communities for the valuable foundations and references they have made available. We additionally thank DeepSeek for providing sufficiently strong open-source understanding models, which offer important support for the development of open-source unified multimodal understanding-and-generation systems.

The full technical report is coming soon.