Our team trained a new SOTA model from scratch and set a new standard for excellence in image generation
Meet Recraft V3 - the only model in the world that can generate images with long texts, as opposed to just one or a couple of words. Over the past week, Recraft V3 participated in the industry-leading Text-to-Image Model Leaderboard by Artificial Analysis on Hugging Face, securing the #1 place with an ELO rating of 1172. Recraft's new model shows higher quality than the models from Midjourney, OpenAI, and all other major image generation companies.
If you've ever tried to generate text on an image, you’ve likely faced significant challenges. Of course, image generators are improving, but rendering anything longer than a few words on an image is still frustratingly hard. Before we dive into the technical details of our implementation, let’s discuss why this is so challenging.
Let’s look at an example generated by Recraft’s previous image generation model Recraft 20B (with prompt: “a cat with a sign 'Recraft generates text amazingly good!' in its paws”):
The rendered text is amazingly wrong. Our generator attempted to produce the first word, “Recraft,” but failed and abandoned the task.
Such issues stem from insufficiently detailed conditions provided to the model during training. The model is trained on a dataset of images and their corresponding captions. The captions typically describe the image content in general terms without providing specific details. This means there are many examples where the word “cat” is mentioned, and the model quickly learns how to draw cats. However, when it comes to generating text in images, the captions are usually far less helpful. In most cases, they ignore the text entirely or provide only basic information, such as a company name on a logo or a headline on a newspaper. As a result, the model learns to produce correct text only in relatively simple scenarios. In all other cases, it generates text that isn’t provided in the conditions at all, leading to hallucinations, where the model renders squiggles that only vaguely resemble proper words.
Moreover, our brains are trained to process text, so we are good at spotting errors in it. If the model generated text in a language with a writing system you’re not familiar with, like Chinese, it might be harder for you to detect any mistakes. Similarly, you are less likely to notice issues with a drawn cat’s anatomy if you don’t study cats daily. The point is that some errors are much more noticeable to us than others, and incorrect text is one of them.
To tackle this issue, we needed to provide more detailed input conditions to the image generation model. The original image would be the best possible condition because it contains all the information about the text, but a model trained with such a condition wouldn’t learn to generate images; it would simply learn to copy them.
So, we settled on the second-best option: a drawing of a text layout. This is a new image in which all the text from the original image is rendered. The model is provided with this text layout image in addition to the caption during both training and inference. The source of our inspiration for the model was the TextDiffuser-2 paper.
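To make the idea of a text layout image concrete, here is a simplified sketch (not our production code) of how such a conditioning image could be rendered from a list of (text, bounding box) pairs. The `render_text_layout` helper and the layout format are illustrative assumptions only.

```python
# Illustrative sketch: render a text layout image from (text, bounding box) pairs.
from PIL import Image, ImageDraw, ImageFont

def render_text_layout(words, size=(1024, 1024)):
    """words: list of (text, (x0, y0, x1, y1)) boxes in pixel coordinates."""
    layout = Image.new("RGB", size, "black")
    draw = ImageDraw.Draw(layout)
    # Default font for simplicity; a real pipeline would match font size to box height.
    font = ImageFont.load_default()
    for text, (x0, y0, x1, y1) in words:
        draw.rectangle((x0, y0, x1, y1), outline="gray")
        draw.text((x0, y0), text, fill="white", font=font)
    return layout

layout_image = render_text_layout([
    ("Recraft generates text", (120, 700, 900, 760)),
    ("amazingly good!", (120, 770, 700, 830)),
])
layout_image.save("text_layout.png")
```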
Before model training, we had to collect a new dataset with such text layouts. The challenge is that these text layouts must be as accurate as possible. Otherwise, the model might learn to ignore them (as it does with inaccurate captions). Optical character recognition (OCR) models are used to extract all the text from images with their corresponding positions. However, no open-source model was good enough to process our dataset. One reason is that our dataset distribution differs significantly from the datasets these models were typically trained on.
We had to train our own OCR model. After numerous experiments, we developed a model based on the paper Bridging the Gap Between End-to-End and Two-Step Text Spotting. We manually labeled images from our dataset and trained the model on this new dataset. Additionally, we implemented a model to filter OCR predictions, helping us create a large dataset with high-quality text layouts.
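As a simplified illustration (not our production pipeline), the dataset construction step could look like the following, where `run_ocr` and `quality_filter` stand in for the OCR and filtering models described above:

```python
# Illustrative sketch: build a text-layout dataset from OCR predictions,
# keeping only images whose predictions pass a quality filter.
def build_layout_dataset(image_paths, run_ocr, quality_filter, threshold=0.9):
    dataset = []
    for path in image_paths:
        words = run_ocr(path)                # list of (text, bounding_box, confidence)
        score = quality_filter(path, words)  # estimated quality of the OCR output
        if score >= threshold:
            dataset.append({"image": path, "layout": [(t, b) for t, b, _ in words]})
    return dataset
```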
With this dataset, we could now train a model to generate an image based on its caption and text layout. However, this model alone wasn’t sufficient to generate an image based on a prompt, because it expects a text layout as well. This means we needed to generate the layout ourselves, requiring an additional model capable of understanding the given prompt and producing a corresponding text layout. While we already had ground truth text layouts from our OCR model, we couldn’t simply train a model to generate these layouts based on image captions. As mentioned earlier, captions rarely contain information about the text in the image, whereas a text layout generator should be trained on captions where every word from the text layout is explicitly mentioned; otherwise, it might learn to hallucinate text layouts, just as a regular text-to-image model does. To solve this, we trained an image captioning model that can be conditioned on the text layout. The resulting model produced much higher-quality captions that were aligned with the text layouts.
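Conceptually, the training pairs for the text layout generator could then be assembled as in the sketch below, where `layout_captioner` stands in for the layout-conditioned captioning model (its interface here is hypothetical):

```python
# Illustrative sketch: assemble (caption -> layout) training pairs for the
# text layout generator using a layout-conditioned captioner.
def make_layout_generator_pairs(dataset, layout_captioner):
    pairs = []
    for example in dataset:
        # The caption is regenerated with the layout as a condition, so every
        # word in the layout is mentioned in the caption.
        caption = layout_captioner(example["image"], example["layout"])
        pairs.append({"input": caption, "target": example["layout"]})
    return pairs
```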
After preparing the data, we were finally able to train our pipeline. The following image shows the stages of this pipeline that we needed to train. Notably, it reverses the data preparation process: outputs from the models used to prepare the data are now used as inputs for the newly trained models.
Fortunately, the text layout generator and the image generator can be trained independently. Let’s start with the text layout generator training process.
The generator was based on a large language model. Its input (caption) and output (text layout in a specific format) are depicted in the image below:
The biggest challenges we faced while developing the model were related to its output format.
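To give a feeling for what such a format might look like, here is one hypothetical way to serialize a text layout into a sequence with quantized box coordinates; this is only an illustration, not the exact format used in production:

```python
# Hypothetical serialization of a text layout for a language-model-based
# layout generator: each word is paired with quantized box coordinates.
def serialize_layout(words, image_size=(1024, 1024), bins=128):
    w, h = image_size
    parts = []
    for text, (x0, y0, x1, y1) in words:
        # Quantize pixel coordinates into a small vocabulary of position tokens.
        qx0, qy0 = int(x0 / w * bins), int(y0 / h * bins)
        qx1, qy1 = int(x1 / w * bins), int(y1 / h * bins)
        parts.append(f"<{qx0},{qy0},{qx1},{qy1}> {text}")
    return " | ".join(parts)

print(serialize_layout([("Recraft", (120, 700, 480, 760))]))
# -> "<15,87,60,95> Recraft"
```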
Finally, we trained our image generation model. It was based on our original image generator but had an extra text layout input. The text layout was passed as an image, in a way similar to the ControlNet architecture.
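As a rough illustration of ControlNet-style conditioning (not our actual architecture), a small encoder can map the layout image to features that are added to the backbone’s features, with a zero-initialized projection so the conditioning starts as a no-op:

```python
# Illustrative sketch of ControlNet-style conditioning on a layout image.
import torch
import torch.nn as nn

class LayoutConditioner(nn.Module):
    def __init__(self, channels=320):
        super().__init__()
        # Small encoder that downsamples the layout image to the backbone's resolution.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, channels, kernel_size=3, stride=2, padding=1),
        )
        # Zero-initialized projection, as in ControlNet, so training starts
        # without disturbing the pretrained backbone.
        self.zero_proj = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, backbone_features, layout_image):
        cond = self.zero_proj(self.encoder(layout_image))
        return backbone_features + cond

# Example usage with dummy tensors (shapes are illustrative).
features = torch.randn(1, 320, 64, 64)
layout = torch.randn(1, 3, 256, 256)
conditioned = LayoutConditioner()(features, layout)
```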
Now, we can examine the inference pipeline of our new model.
A notable feature of this pipeline is that the text layout can be defined in two ways: manually on the canvas or by the text layout generator. This gives users precise control over word positioning or the option to let the system automate the process.
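Put together, the inference flow can be summarized with the following sketch, where `layout_generator`, `render_text_layout`, and `image_generator` stand in for the trained components (their interfaces are assumed for illustration):

```python
# Illustrative sketch of the two-path inference flow: use a user-drawn layout
# if one is provided, otherwise let the layout generator produce it.
def generate(prompt, image_generator, layout_generator, render_text_layout,
             manual_layout=None):
    layout = manual_layout if manual_layout is not None else layout_generator(prompt)
    layout_image = render_text_layout(layout)
    return image_generator(prompt, layout_image)
```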
This new pipeline is a significant leap forward in text generation quality, but it’s not the only improvement we’ve made. The new model understands user prompts much better, and the quality of its generated images is significantly higher than that of our previous model and of any competitor’s models.
We are incredibly excited to present this version!