In the mind of DALL-E

In recent weeks the Internet has been flooded with implausible images created with the second version of OpenAI's DALL.E neural network, whose beta is still restricted, and with its little brother, DALL.E Mini.

Versions of Kermit the Frog in different cinematographic contexts, reinterpretations of classic works such as Vermeer’s Girl with a Pearl Earring, or a “self-portrait” of Dalí himself as a cyborg: these are just a few examples of the creativity of the tandem between human and machine.

Source: OpenAI. Images generated from the text “vibrant portrait painting of Salvador Dalí with a robotic half face”

One might wonder, though, where the balance lies between the creativity of the human and that of the machine. On the one hand, the user only has to provide a description of what they want to create: a premise without which the model could not generate any result.

On the other hand, with that premise alone and without the generative capacity of the model, we would be unable to go beyond the description and produce any work.

Is the machine, the neural network, then the true creative agent? That would make us, the users, a kind of inspiration, a muse: the “initial spark” necessary in the creative process. It would also open the debate on whether or not machines can be creative.

According to the definition of the RAE (Real Academia Española), there is no doubt that DALL.E 2 has the power to create. But should we understand the creative process only as a black box that, given some inputs, whether extrinsic or intrinsic, is capable of generating an output? Or, on the contrary, is awareness of taking part in the creation process essential for us to be able to speak of creativity?

The controversial concept of consciousness has recently been the subject of dispute in the AI community, given the no less controversial statements by Blake Lemoine, a former Google engineer who claimed, after holding conversations with the LaMDA model (short for Language Model for Dialogue Applications), that it was self-aware.

Leaving aside the dichotomy between consciousness and trick, whether we users have become no more than a whisper in the creative process, and whether that process exists at all, we are going to delve into “the artist’s mind” to understand a little better how it works.

Source: original paper Hierarchical Text-Conditional Image Generation with CLIP Latents

In a classical approach, to generate an image from text we would first need to translate the text into a numerical representation, or embedding, and then apply an inverse transformation, decoding that embedding to obtain the image. These two steps would be carried out by two separate models, called the encoder and the decoder, respectively.

OpenAI starts from this classic approach and goes one step further: on the one hand, a prior model that transforms the text into a vector representation of the image and, on the other, a diffusion decoder capable of translating that representation into images consistent with the text.

Starting from a data set made up of images and their captions, the objective of CLIP is to learn a joint representation of texts and images in the same latent space. Bringing the embedding obtained with the text encoder mathematically closer to the one obtained with the image encoder gradually yields a direct translation between the text, that “inspiration” that the user provides to the model, and the numerical representation of the resulting image.
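The idea of pulling matching text and image embeddings together can be illustrated with a small contrastive loss. The sketch below is only an illustration in plain Python, not OpenAI's training code: the function names, the two-dimensional toy embeddings, and the temperature value of 0.07 are all assumptions made for the example.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity_matrix(text_embs, image_embs):
    """Entry [i][j] is the cosine similarity between text i and image j."""
    t = [normalize(v) for v in text_embs]
    i = [normalize(v) for v in image_embs]
    return [[sum(a * b for a, b in zip(tv, iv)) for iv in i] for tv in t]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[k]
    return total / len(logits)

def contrastive_loss(text_embs, image_embs, temperature=0.07):
    """Symmetric loss: low when caption k is most similar to image k.

    The temperature is an assumed value chosen for illustration.
    """
    sims = cosine_similarity_matrix(text_embs, image_embs)
    logits = [[s / temperature for s in row] for row in sims]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

# Toy data: two captions and two images in a shared 2-D latent space.
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[0.9, 0.1], [0.1, 0.9]]   # image k lies close to caption k
swapped_images = [[0.1, 0.9], [0.9, 0.1]]   # image k lies close to the wrong caption

loss_aligned = contrastive_loss(texts, aligned_images)
loss_swapped = contrastive_loss(texts, swapped_images)
```

In this toy setting the loss is lower when each caption sits next to its own image than when the pairs are swapped, which is the pressure that gradually aligns the two encoders during training.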

DALL.E and other similar models that have appeared recently, such as Parti from Google, are capable of generating new images with a realism and visual coherence not seen until now. They are going to become key contributors in fields such as photography, architecture or design, since they can combine different textual premises and interpolate between the intermediate embeddings to obtain new results.
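Interpolating between two embeddings can be sketched with spherical interpolation, which moves along the arc between two unit vectors rather than along the straight line joining them. The code below is a minimal sketch in plain Python; the function names and the toy two-dimensional vectors are assumptions for illustration and do not come from any DALL.E implementation.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def slerp(a, b, t):
    """Spherical interpolation between the directions of a and b.

    t = 0 returns the direction of a, t = 1 the direction of b.
    """
    a, b = normalize(a), normalize(b)
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    theta = math.acos(dot)  # angle between the two embeddings
    s = math.sin(theta)
    if abs(s) < 1e-8:
        if dot > 0:
            return list(a)  # (almost) identical directions: nothing to interpolate
        raise ValueError("slerp is undefined for opposite vectors")
    weight_a = math.sin((1.0 - t) * theta) / s
    weight_b = math.sin(t * theta) / s
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]

# Toy embeddings standing in for two different premises.
embedding_a = [1.0, 0.0]
embedding_b = [0.0, 1.0]
halfway = slerp(embedding_a, embedding_b, 0.5)
```

Each intermediate vector keeps unit length, so every point along the path could in principle be handed to a decoder as a blend of the two premises.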

Where, then, does the creative part of DALL.E 2 reside? Looking at the training scheme, we need to transform both the text and the image into latent representations that coexist in the same mathematical space. The second encoder, the image encoder, is where the “artist’s mind” is located.

During training, we start from this encoder and invert it to obtain a diffusion decoder, whose mission is the opposite: to go from the mathematical representation to an image generated by the model.

This decoder is non-deterministic: given the same input, it will not always produce the same output.
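A toy example can make this non-determinism concrete. The sketch below is not the DALL.E decoder and not a real diffusion model; it is a deliberately simplified stand-in, with a hypothetical `toy_decode` function, arbitrary step sizes, and a three-number "embedding", that starts from random noise and refines it step by step towards the same input.

```python
import random

def toy_decode(embedding, steps=50, seed=None):
    """Iteratively refine random noise towards the given embedding.

    Hypothetical stand-in for a stochastic decoder: every run starts from
    different noise and injects a shrinking amount of noise at each step.
    """
    rng = random.Random(seed)
    # Start from pure Gaussian noise, one value per embedding dimension.
    x = [rng.gauss(0.0, 1.0) for _ in embedding]
    for step in range(steps):
        # The injected noise shrinks linearly and reaches zero on the last step.
        noise_scale = 0.1 * (1.0 - (step + 1) / steps)
        x = [
            xi + 0.2 * (ei - xi) + noise_scale * rng.gauss(0.0, 1.0)
            for xi, ei in zip(x, embedding)
        ]
    return x

embedding = [0.5, -1.0, 2.0]
sample_a = toy_decode(embedding, seed=1)
sample_b = toy_decode(embedding, seed=2)
```

Both samples end up close to the input, yet they differ from each other because each run follows its own random path: similar results for the same premise, never identical ones.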


From a probabilistic perspective, DALL.E is made up of two blocks. The first, deterministic, provides an image embedding z given a textual premise y. The second, non-deterministic, takes that image embedding and, optionally, the encoding of the text, and generates images that are always similar to one another for the same premise, but never identical.
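The two blocks can be written as a factorization, following the notation of the original paper. Here x denotes the generated image, a symbol not used in the text above; z is the image embedding and y the textual premise.

```latex
\[
P(x \mid y) \;=\; P(x, z \mid y) \;=\; P(x \mid z, y)\, P(z \mid y)
\]
```

The term P(z | y) corresponds to the first block, which goes from the text to the image embedding, and P(x | z, y) to the decoder. As the paper argues, the first equality holds because z is a deterministic function of x: it is the embedding that the CLIP image encoder assigns to that image.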

Source: OpenAI

The non-determinism becomes evident when we take the numerical representation of an image and feed it to the decoder several times in succession, as we can see in the image with Dalí’s painting The Persistence of Memory or the logo of OpenAI itself: all the proposals resemble one another, but the same one is never repeated twice.

This opens the door to reinterpretations of any artistic work, and encourages reflection on what would have happened if the artist’s state of mind or circumstances had been marginally different during the creation process.

In the same way, it also opens the debate about whether the creativity underlying the decoder really is creativity. From one point of view, the result can take infinitely many values within the statistical distribution; from another, it is restricted to the limits and properties of that distribution. Is it creativity, then, because of the infinite possibilities it offers us? Or, on the contrary, can it be interpreted as an absence of creativity, since it lacks the spirit of protest that characterizes artistic movements?

Regardless of where we position ourselves in this debate, and now that we understand the mathematical brushstrokes that make up the creative genius of DALL.E, I encourage all readers to take part in this artistic symbiosis between user and model, visually capturing what for the moment only dwells in their imagination.

Signed: Francisco Espiga González (ESIC Professor and AI expert)
