Generative AI is changing everything. But what’s left when the hype is gone?
Not this time. As they experimented with the model, everyone involved saw that this was something special. “It was very obvious that this was it, this was the product,” Altman says. “There was no discussion. It was never discussed.”
But nobody, neither Altman nor the DALL-E team, could have predicted how big this product would become. “This is the first AI technology to catch fire with regular people,” Altman says.
DALL-E 2 dropped in April 2022. In May, Google announced two text-to-image models of its own, Imagen and Parti. Midjourney, a text-to-image model made for artists, followed. And August brought Stable Diffusion, an open-source model released by Stability AI, a UK-based startup.
The doors were thrown open. OpenAI signed up more than a million users in just two and a half months. More than a million people started using Stable Diffusion via its paid-for service, Dream Studio, in less than half that time. Many more used Stable Diffusion through third-party apps or installed the free version on their own computers. (Emad Mostaque, the founder of Stability AI, says he wants to see a billion users.)
And then in October we had Round Two: a spate of text-to-video models from Google, Meta, and others. Instead of generating only still images, these can create short video clips, animations, and 3D pictures.
The pace of development has been breathtaking. In just a few months, the technology has inspired hundreds of magazine covers and newspaper headlines. It has also set off a social media firestorm that fueled a massive hype machine and provoked a fierce backlash.
“The shock and awe of this technology is amazing, and it’s fun, it’s what new technology should be,” says Mike Cook, an AI researcher at King’s College London who studies computational creativity. “It’s so fast that your initial impressions of the technology keep changing before you can even get used to it. I think we’re going to spend a while digesting it as a society.”
Artists are caught in the middle of one of the biggest upheavals in a generation. Some will lose their jobs, while others will find new opportunities. Several are heading to court to fight legal battles over what they see as the misappropriation of images to train models that could replace them.
Creators were caught off guard, says Don Allen Stevenson III, a digital artist based in California who has worked at visual-effects studios such as DreamWorks. “It’s scary for technically trained people like me,” he says. “Oh my God, that’s my whole job. I went into an existential crisis for the first month of using DALL-E.”
But while some are still reeling from the shock, many–including Stevenson–are finding ways to work with these tools and anticipate what comes next.
The exciting truth is that we don’t know. While the tech will have a major impact on creative industries such as fashion, architecture, marketing and entertainment media, it will also have a huge impact on everyone else. It could be used to create designs for almost everything, including new drugs, clothes, and buildings. The generative revolution is underway.
A magical revolution
For Chad Nelson, a digital creator who has worked on video games and TV shows, text-to-image models are a once-in-a-lifetime breakthrough. “This tech takes you from a lightbulb in your head to a first sketch in seconds,” he says. “The speed at which you can create and explore is revolutionary, beyond anything I’ve experienced in 30 years.”
Within weeks of their debut, people were using these tools to prototype and brainstorm everything from magazine illustrations and marketing layouts to video-game environments and movie concepts. People shared thousands of pieces of fan art online, including entire comic books. Altman even used DALL-E to generate designs for sneakers, which someone then made for him after he tweeted the image.
Amy Smith, a computer scientist at Queen Mary University of London who is also a tattoo artist, has been using DALL-E to create tattoo designs. “You can sit down and create designs together with the client,” she says. “We’re in a revolution of media generation.”
Paul Trillo, a digital and video artist based in California, thinks the technology will make it easier and faster to brainstorm ideas for visual effects. “People are saying that this is the death of effects artists or fashion designers,” he says. “I don’t believe it’s the death or the end of anything. It means that we don’t have .”
Stock photo companies are taking different positions. Getty has banned AI-generated images. Shutterstock has agreed to embed DALL-E in its website and plans to create a fund to pay artists whose work was used to train the models.
Stevenson says he has tried out DALL-E at every step of the process that an animation studio uses to produce a film, including designing characters and environments. With DALL-E, he was able to do the work of several departments in a few minutes. “It’s encouraging for all the people who’ve never been in a position to create because it was too costly or too technical,” he says. “It’s scary if you don’t want to change.”
Nelson believes there’s more to come. He sees this technology being adopted not only by media giants but also by architecture and design companies. But he admits that it isn’t ready yet. “Right now, it’s almost like you have a little magic box,” he says. That is great if you just want to keep generating images, but not if you want a creative partner. “If I want it to create stories and build worlds, it needs to have far more awareness of the things I’m creating,” he says. This is the problem: these models don’t know what they’re doing.
Inside the black box
To see why, let’s look at how these programs work. From the outside, the software is a black box. You type in a short description, known as a prompt, and wait a few seconds. What you get back is a handful of images that match the prompt (more or less). You may then need to adjust your text to coax the model into producing something closer to what you had in mind, or to refine a serendipitous result. This is known as prompt engineering.
Prompts for the most detailed, stylized images can run to several hundred words, which has made prompt engineering a valuable skill. Online marketplaces have sprung up where prompts that produce desirable results are bought and sold.
Prompts may contain phrases that tell the model to follow a certain style. For example, “trending on ArtStation” instructs the AI to imitate the (typically very detailed) style of images on ArtStation, and “Unreal Engine” invokes the familiar graphic style of certain video games. Users can even enter the names of specific artists to have the AI create pastiches of their work, which has made some artists very unhappy.
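The way style phrases get attached to a prompt can be sketched in a few lines of Python. This is only an illustration: `build_prompt` is a hypothetical helper, not part of any model's API, and joining phrases with commas is a habit among users rather than a requirement of the models.

```python
def build_prompt(subject, style_modifiers=()):
    """Join a subject and optional style phrases into one comma-separated prompt."""
    parts = [subject.strip()]
    # Skip empty modifiers so the prompt never contains stray commas.
    parts.extend(m.strip() for m in style_modifiers if m.strip())
    return ", ".join(parts)


prompt = build_prompt(
    "explosion in outer space",
    style_modifiers=["1980s photograph", "trending on ArtStation", "Unreal Engine"],
)
print(prompt)
```

Swapping modifiers in and out of the list is one simple way to explore how a model responds to different styles.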
“I tried to metaphorically represent AI with the prompt ‘the Big Bang’ and ended up with these abstract bubble-like forms (right). It wasn’t exactly what I wanted, so then I went more literal with ‘explosion in outer space 1980s photograph’ (left), which seemed too aggressive. I also tried to grow digital plants by adding ‘plant 8 bit pixel art’ (center).”
Under the hood, text-to-image models have two key components: one neural network trained to pair an image with text that describes that image, and another trained to generate images from scratch. The basic idea is to get the second neural network to generate an image that the first neural network accepts as a match for the prompt.
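A rough way to picture the first component is as a scorer that compares an image and a piece of text once both have been turned into lists of numbers. The sketch below assumes that setup, which is the approach taken by models such as OpenAI's CLIP. The three-number "embeddings" are made up by hand for illustration; a real model learns much longer vectors from data.

```python
import math


def cosine_similarity(a, b):
    """Score in [-1, 1] for how closely two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: pretend the image is a photo of a salmon in a river.
image_embedding = [0.9, 0.1, 0.2]
caption_embeddings = {
    "a salmon swimming in a river": [0.8, 0.2, 0.1],
    "a bat flying over a stadium": [0.1, 0.9, 0.3],
    "a bowl of fruit": [0.2, 0.1, 0.9],
}

# Score every caption against the image and keep the closest match.
scores = {
    caption: cosine_similarity(image_embedding, vector)
    for caption, vector in caption_embeddings.items()
}
best_caption = max(scores, key=scores.get)
print(best_caption)
```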
The big breakthrough behind the new models is in the way images get generated. The first version of DALL-E used an extension of the technology behind OpenAI’s language model GPT-3, producing images by predicting the next pixel in an image as if pixels were words in a sentence. This worked, but not well. “It was not a magical experience,” says Altman. “It’s amazing that it worked.”
Instead, DALL-E 2 uses a diffusion model. Diffusion models are neural networks trained to clean up images by removing pixelated noise that the training process adds. Training involves taking images and changing a few pixels at a time, over many steps, until the original images are erased and nothing but random pixels remains. “If you do it a thousand times, the image will look like you have pulled the antenna cable out of your TV set: it is just snow,” says Björn Ommer, who works on generative AI at the University of Munich in Germany and who helped create the diffusion model that powers Stable Diffusion. The neural network is then trained to reverse that process and predict what the less pixelated version of a given image would look like. The upshot is that if you give a diffusion model a mess of pixels, it will try to generate something a little cleaner. Plug the cleaned-up image back in, and the model will produce something cleaner still. Repeat this enough times and the model can take you all the way from TV snow to a high-resolution picture.
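The noising half of this process is simple enough to sketch directly. The following Python is a toy under stated assumptions: the "image" is a list of eight numbers, the noise is Gaussian, and the per-step noise level `beta` is a fixed value chosen for illustration. Real systems vary the noise level over time and rely on a trained neural network to run the process in reverse, which is not shown here.

```python
import math
import random


def add_noise_step(pixels, beta, rng):
    """One forward step: shrink the signal slightly and mix in Gaussian noise."""
    keep = math.sqrt(1.0 - beta)
    spread = math.sqrt(beta)
    return [keep * p + spread * rng.gauss(0.0, 1.0) for p in pixels]


def forward_diffusion(pixels, steps, beta=0.02, seed=0):
    """Return the list of progressively noisier images, starting with the original."""
    rng = random.Random(seed)
    trajectory = [list(pixels)]
    for _ in range(steps):
        trajectory.append(add_noise_step(trajectory[-1], beta, rng))
    return trajectory


def signal_fraction(steps, beta=0.02):
    """Fraction of each original pixel value that survives after `steps` steps."""
    return (1.0 - beta) ** (steps / 2.0)


image = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # toy 8-pixel "image"
trajectory = forward_diffusion(image, steps=1000)
noisy = trajectory[-1]
print(signal_fraction(10), signal_fraction(1000))
```

With these assumed settings, almost nothing of the original survives a thousand steps, which is the "TV snow" Ommer describes.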
AI generators don’t always work the way you want. They can produce horrendous results that look like distorted stock art. In my experience, adding a descriptor at the end that names a visually pleasing style is the best way to make the result look good.
The trick with text-to-image models is that this process is guided by the language model, which tries to match the prompt to the images the diffusion model is producing. This pushes the diffusion model toward images that the language model considers a good match.
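One way to picture guidance is as a loop that alternates a denoising step with a nudge toward the prompt. The sketch below is a deliberately crude toy, not how DALL-E 2 or Stable Diffusion is implemented: `toy_denoise` is a hypothetical placeholder for a trained diffusion model, and `prompt_vector` pretends the prompt has already been encoded as a list of numbers in the same space as the image.

```python
import math
import random


def cosine_similarity(a, b):
    """Score in [-1, 1] for how closely two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def toy_denoise(image, strength=0.1):
    """Hypothetical placeholder for a trained denoiser: damps every value slightly."""
    return [(1.0 - strength) * v for v in image]


def nudge_toward_prompt(image, prompt_vector, scale=0.2):
    """Guidance step: move each value a fraction of the way toward the prompt vector."""
    return [v + scale * (p - v) for v, p in zip(image, prompt_vector)]


def guided_generation(prompt_vector, steps=50, seed=0):
    """Start from random noise, then alternate denoising and guidance."""
    rng = random.Random(seed)
    image = [rng.gauss(0.0, 1.0) for _ in prompt_vector]
    for _ in range(steps):
        image = nudge_toward_prompt(toy_denoise(image), prompt_vector)
    return image


prompt_vector = [0.9, 0.1, 0.2, 0.7]  # made-up encoding of a prompt
result = guided_generation(prompt_vector)
match = cosine_similarity(result, prompt_vector)
print(match)
```

The point of the toy is the shape of the loop: each pass cleans up the image a little and then steers it toward whatever the prompt-matching component prefers.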
But the models don’t pull the links between text and images out of thin air. Most text-to-image models today are trained on a huge data set called LAION, which contains billions of pairings of text and images gathered from the internet. This means the images you get from a text-to-image model are not a representation of the real world but a distillation of the world as it is represented online, distorted by prejudice and pornography.
One last thing. There’s a subtle but important difference between the two most well-known models, DALL-E 2 and Stable Diffusion. DALL-E 2’s diffusion model works on full-size images. Stable Diffusion uses a technique called latent diffusion, which Ommer and his collaborators invented. It works on compressed versions of images, encoded within the neural network in what’s known as a latent space, where only the essential features of an image are retained.
This means Stable Diffusion needs less computing power to function. Unlike DALL-E 2, which runs on OpenAI’s powerful servers, Stable Diffusion can run on (good) personal computers. Much of the explosion of creativity and the rapid development of new apps comes down to the fact that Stable Diffusion is both open source, so programmers are free to modify it, build on it, and make money from it, and lightweight enough for people to run at home.
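A little arithmetic gives a feel for the savings from working in a latent space. The sizes below are assumptions for illustration, based on figures commonly cited for the first public release of Stable Diffusion: a 512-by-512 color image, and a latent that is 8 times smaller on each side with 4 channels.

```python
def value_count(height, width, channels):
    """Number of individual numbers needed to store a grid of this shape."""
    return height * width * channels


pixel_values = value_count(512, 512, 3)  # full-size RGB image
latent_values = value_count(64, 64, 4)   # assumed latent: 512 / 8 = 64 per side, 4 channels
compression = pixel_values / latent_values

print(pixel_values, latent_values, compression)
```

Under these assumptions, each denoising step handles 48 times fewer numbers than it would at full resolution.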
For some, these models are a step toward artificial general intelligence, or AGI, an over-hyped buzzword referring to a future AI that has general-purpose or even human-like abilities. OpenAI has made it clear that its goal is to achieve AGI. For that reason, Altman doesn’t care that DALL-E 2 now competes with a host of similar tools, some of them free. “We’re here for AGI, and not image generators,” he says. “It will fit into a larger product roadmap. It’s a small part of what AGI will do.”
That’s optimistic, to say the least: many experts believe that today’s AI will never reach that level. In terms of basic intelligence, text-to-image models are no smarter than the language-generating AIs that underpin them. Tools such as GPT-3 and Google’s PaLM regurgitate patterns of text drawn from the billions of documents they were trained on. Similarly, DALL-E and Stable Diffusion reproduce associations between text and images found across billions of online examples.
The results are stunning, but poke too hard and the illusion crumbles. These models make basic howlers, responding to “salmon in a river” with a picture of chopped-up fillets floating downstream, or to “a bat flying over a baseball stadium” with a picture of both a flying mammal and a wooden stick. That is because they are built on top of a technology that is nowhere near understanding the world as humans, or even most animals, do. Even so, it may only be a matter of time until these models learn new tricks. People say it’s not very good at this task now, says Cook. “But a hundred million dollars later, it could be.” That is OpenAI’s approach.
“We already know how to make it 10 times better,” says Altman. “We know that there are logical reasoning tasks that it fails at. We’re going to go down a list of things, and we’ll put out a new version that fixes all of the current problems.”
If claims about intelligence and understanding are overblown, what about creativity? Among humans, we hold up mathematicians, entrepreneurs, and kindergarten children alike as examples of creativity. But it is hard to pin down what these people have in common.
For some, the results are what matters most. Others argue that the way things are made, and whether there is intent behind their creation, is crucial.
Many people fall back on a definition of creativity given by Margaret Boden, an AI researcher and philosopher at the University of Sussex. She says that to be creative, an idea or artifact must be original, surprising, and valuable.
Beyond that, it’s often a case of knowing it when you see it. Researchers in the field known as computational creativity describe their work as using computers to produce results that would be considered creative if humans had produced them.
Smith, for one, is happy to call this new breed of generative models creative, despite their stupidity. “It is clear that there is innovation within these images that cannot be controlled by any human input,” she says. “The translation of text to image can often be surprising and beautiful.”
Maria Teresa Llano, who studies computational creativity at Monash University in Melbourne, Australia, agrees that text-to-image models are stretching previous definitions. But Llano doesn’t think they are creative. Use these programs a lot and the results can start to become repetitive, which means they may fall short of some or all of Boden’s requirements. This could be a fundamental limitation of the technology. A text-to-image model generates new images in the likeness of billions of images that already exist. Machine learning may only ever produce images that resemble what it has seen in the past.
That may not matter for computer graphics. Adobe already integrates text-to-image generation into Photoshop. Blender, Photoshop’s open-source cousin, has a Stable Diffusion plug-in. And OpenAI and Microsoft are collaborating on a text-to-image widget for Office.
It is in this kind of interaction, in future versions of these familiar tools, that the real impact may be felt: from machines that don’t replace human creativity but enhance it. Llano says that the creativity we see today comes not from the systems themselves but from the interaction with them: the back-and-forth, call-and-response required to get the result you want.
This view is shared by other researchers in computational creativity. It’s not just about what these machines do; it’s about how they do it. To make them true creative partners, you need to push them to be more independent, give them creative responsibility, and get them to curate as well as create. Aspects of this will soon be available. Someone has already written a program called CLIP Interrogator that analyzes an image and comes up with a prompt to generate more images like it. Others are using machine learning to augment simple prompts with phrases designed to give the image extra quality and fidelity, effectively automating prompt engineering, a task that has only existed for a handful of months.
Meanwhile, as the flood of images continues, we’re laying down other foundations too. Cook says that the internet is now forever contaminated by images made with AI. “The images that we made in 2022 will be a part of any model that is made from now on.”
We will have to wait to see exactly what lasting impact these tools will have on creative industries, and on the entire field of AI. Generative AI has become one more tool for expression. Altman says that he now uses generated images in personal messages the way he used to use emoji. Some of his friends don’t even bother to generate the image, he says; they just type the prompt.
But text-to-image models could be just the beginning. Generative AI could eventually be used to produce designs for everything from new buildings to new drugs–think text-to-X.
“People will soon realize that craft or technique is not the barrier. It’s now only their imagination that is the limit,” Nelson says.
Computers are already used in many industries to generate vast numbers of possible designs, which are then sifted for ones that might work. Text-to-X models would allow a human designer to fine-tune that generative process from the start, using words to guide computers through an infinite number of options toward results that are not just possible but desirable.
Computers can conjure spaces filled with infinite possibilities. Text-to-X will let us explore those spaces using words. “I think that’s what it is,” Altman says. “I think it is just going to seep everywhere.”
I’m a journalist who specializes in investigative reporting and writing. I have written for the New York Times and other publications.