AI images generated from the text prompts "a baby daikon radish in a tutu walking a dog" and "an armchair in the shape of an avocado" (Image: OpenAI)
A neural network uses text captions to create outlandish images, such as armchairs in the shape of avocados, demonstrating that it understands how language shapes visual culture.
OpenAI, an artificial intelligence company, developed the neural network, which it calls DALL-E. It is a version of the company's GPT-3 language model, which can create expansive written works from short text prompts, but DALL-E produces images instead.
"The world isn't just text," says Ilya Sutskever, co-founder of OpenAI. "Humans don't just talk: we also see. A lot of important context comes from looking."
DALL-E is trained using a set of images already associated with text prompts, and then uses what it learns to try to build an appropriate image when given a new text prompt.
It builds the image element by element, based on what it has understood from the text prompt. If it is given part of a pre-existing image alongside the text, it also takes the visual elements of that image into account.
"We can give the model a prompt, like 'a pentagonal green clock', and given the preceding [elements], the model is trying to predict the next one," says Aditya Ramesh of OpenAI.
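The element-by-element generation Ramesh describes is, in essence, autoregressive next-token prediction: each new piece of the image is chosen given the text and everything generated so far. The sketch below is a toy illustration of that loop only; the `next_token_scores` function is a made-up stand-in for DALL-E's real neural network, which learns its scores from data.

```python
# Toy sketch of autoregressive generation: text tokens condition the
# sequence, and each image token is predicted from everything before it.
# next_token_scores is a hypothetical stand-in, NOT DALL-E's actual model.

def next_token_scores(sequence, vocab_size=16):
    # Deterministic dummy scores derived from the running sequence;
    # a real model would compute these with a trained neural network.
    seed = sum(sequence) % vocab_size
    return [(tok + seed) % vocab_size for tok in range(vocab_size)]

def generate_image_tokens(text_tokens, n_image_tokens=8, vocab_size=16):
    sequence = list(text_tokens)  # conditioning context: the prompt
    image_tokens = []
    for _ in range(n_image_tokens):
        scores = next_token_scores(sequence, vocab_size)
        best = max(range(vocab_size), key=lambda t: scores[t])  # greedy pick
        image_tokens.append(best)
        sequence.append(best)  # new token becomes context for the next step
    return image_tokens

# Invented token IDs standing in for a prompt like "a pentagonal green clock".
tokens = generate_image_tokens([3, 1, 4])
print(tokens)
```

In the real system the image tokens would then be decoded back into pixels; here they are just integers, to show the predict-append loop.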
For instance, if given an image of the head of a T. rex, and the text prompt “a T. rex wearing a tuxedo”, DALL-E can draw the body of the T. rex underneath the head and add appropriate clothing.
The neural network can trip up on poorly worded prompts, and it struggles to position objects relative to each other, or to count.
"The more concepts that a system is able to sensibly blend together, the more likely the AI system both understands the semantics of the request and can demonstrate that understanding creatively," says Mark Riedl at the Georgia Institute of Technology in the US.
"I'm not really sure how to define what creativity is," says Ramesh, who admits he was impressed with the range of images DALL-E produced.
The model produces 512 images for each prompt, which are then filtered by a separate computer model developed by OpenAI, called CLIP, down to the 32 results CLIP judges "best".
CLIP is trained on 400 million images available online. "We find image-text pairs across the internet and train a system to predict which pieces of text will be paired with which images," says Alec Radford of OpenAI, who developed CLIP.
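The filtering step can be sketched as scoring every candidate image against the prompt and keeping only the top 32. In the sketch below the "embeddings" are plain lists of random numbers and the similarity measure is a simple dot product, standing in for the learned text and image representations CLIP actually compares.

```python
import random

# Sketch of CLIP-style re-ranking: score each candidate against the text
# prompt, keep the top-k. The feature vectors here are random placeholders,
# NOT CLIP's learned embeddings.

def similarity(text_embedding, image_embedding):
    # Dot product as a stand-in for CLIP's learned similarity score.
    return sum(t * i for t, i in zip(text_embedding, image_embedding))

def rerank(text_embedding, candidate_embeddings, k=32):
    # Sort candidate indices by similarity to the prompt, best first.
    order = sorted(
        range(len(candidate_embeddings)),
        key=lambda idx: similarity(text_embedding, candidate_embeddings[idx]),
        reverse=True,
    )
    return order[:k]  # indices of the k best-matching candidates

random.seed(0)
# 512 fake candidate "images", each reduced to a 4-number feature vector.
candidates = [[random.random() for _ in range(4)] for _ in range(512)]
prompt_embedding = [0.1, 0.9, 0.2, 0.5]  # invented prompt features
best_32 = rerank(prompt_embedding, candidates, k=32)
print(len(best_32))
```

Swapping the placeholder dot product for a trained model's similarity score is what turns this generic top-k filter into the re-ranking Radford describes.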
"This is really impressive work," says Serge Belongie at Cornell University, New York. He says further work is needed to examine the ethical implications of such a model, such as the risk of creating entirely faked images, including ones involving real people.
Effie Le Moignan at Newcastle University, UK, also calls the work impressive. "But the thing with natural language is although it's clever, it's very cultural and context-appropriate," she says.
For instance, Le Moignan wonders whether DALL-E, confronted with a request to produce an image of Admiral Nelson wearing gold lamé pants, would put the military hero in leggings or underpants, potential evidence of the gap between British and American English.