Summary: Ted Chiang on how artificial intelligence still isn’t as intelligent as it is perceived to be and how its profound limitations should temper our fears about it replacing real art-making.

Why A.I. Isn’t Going to Make Art

Source: Ted Chiang - 1993-01-01T04:59:59.999Z

In 1953, Roald Dahl published “The Great Automatic Grammatizator,” a short story about an electrical engineer who secretly desires to be a writer. One day, after completing construction of the world’s fastest calculating machine, the engineer realizes that “English grammar is governed by rules that are almost mathematical in their strictness.” He constructs a fiction-writing machine that can produce a five-thousand-word short story in thirty seconds; a novel takes fifteen minutes and requires the operator to manipulate handles and foot pedals, as if he were driving a car or playing an organ, to regulate the levels of humor and pathos. The resulting novels are so popular that, within a year, half the fiction published in English is a product of the engineer’s invention.

Is there anything about art that makes us think it can’t be created by pushing a button, as in Dahl’s imagination? Right now, the fiction generated by large language models like ChatGPT is terrible, but one can imagine that such programs might improve in the future. How good could they get? Could they get better than humans at writing fiction—or making paintings or movies—in the same way that calculators are better at addition and subtraction?

Art is notoriously hard to define, and so are the differences between good art and bad art. But let me offer a generalization: art is something that results from making a lot of choices. This might be easiest to explain if we use fiction writing as an example. When you are writing fiction, you are—consciously or unconsciously—making a choice about almost every word you type; to oversimplify, we can imagine that a ten-thousand-word short story requires something on the order of ten thousand choices. When you give a generative-A.I. program a prompt, you are making very few choices; if you supply a hundred-word prompt, you have made on the order of a hundred choices.

If an A.I. generates a ten-thousand-word story based on your prompt, it has to fill in for all of the choices that you are not making. There are various ways it can do this. One is to take an average of the choices that other writers have made, as represented by text found on the Internet; that average is equivalent to the least interesting choices possible, which is why A.I.-generated text is often really bland. Another is to instruct the program to engage in style mimicry, emulating the choices made by a specific writer, which produces a highly derivative story. In neither case is it creating interesting art.

I think the same underlying principle applies to visual art, although it’s harder to quantify the choices that a painter might make. Real paintings bear the mark of an enormous number of decisions. By comparison, a person using a text-to-image program like DALL-E enters a prompt such as “A knight in a suit of armor fights a fire-breathing dragon,” and lets the program do the rest. (The newest version of DALL-E accepts prompts of up to four thousand characters—hundreds of words, but not enough to describe every detail of a scene.) Most of the choices in the resulting image have to be borrowed from similar paintings found online; the image might be exquisitely rendered, but the person entering the prompt can’t claim credit for that.

Some commentators imagine that image generators will affect visual culture as much as the advent of photography once did. Although this might seem superficially plausible, the idea that photography is similar to generative A.I. deserves closer examination. When photography was first developed, I suspect it didn’t seem like an artistic medium because it wasn’t apparent that there were a lot of choices to be made; you just set up the camera and start the exposure. But over time people realized that there were a vast number of things you could do with cameras, and the artistry lies in the many choices that a photographer makes. It might not always be easy to articulate what the choices are, but when you compare an amateur’s photos to a professional’s, you can see the difference. So then the question becomes: Is there a similar opportunity to make a vast number of choices using a text-to-image generator? I think the answer is no. An artist—whether working digitally or with paint—implicitly makes far more decisions during the process of making a painting than would fit into a text prompt of a few hundred words.

We can imagine a text-to-image generator that, over the course of many sessions, lets you enter tens of thousands of words into its text box to enable extremely fine-grained control over the image you’re producing; this would be something analogous to Photoshop with a purely textual interface. I’d say that a person could use such a program and still deserve to be called an artist. The film director Bennett Miller has used DALL-E 2 to generate some very striking images that have been exhibited at the Gagosian gallery; to create them, he crafted detailed text prompts and then instructed DALL-E to revise and manipulate the generated images again and again. He generated more than a hundred thousand images to arrive at the twenty images in the exhibit. But he has said that he hasn’t been able to obtain comparable results on later releases of DALL-E. I suspect this might be because Miller was using DALL-E for something it’s not intended to do; it’s as if he hacked Microsoft Paint to make it behave like Photoshop, but as soon as a new version of Paint was released, his hacks stopped working. OpenAI probably isn’t trying to build a product to serve users like Miller, because a product that requires a user to work for months to create an image isn’t appealing to a wide audience. The company wants to offer a product that generates images with little effort.

It’s harder to imagine a program that, over many sessions, helps you write a good novel. This hypothetical writing program might require you to enter a hundred thousand words of prompts in order for it to generate an entirely different hundred thousand words that make up the novel you’re envisioning. It’s not clear to me what such a program would look like. Theoretically, if such a program existed, the user could perhaps deserve to be called the author. But, again, I don’t think companies like OpenAI want to create versions of ChatGPT that require just as much effort from users as writing a novel from scratch. The selling point of generative A.I. is that these programs generate vastly more than you put into them, and that is precisely what prevents them from being effective tools for artists.

The companies promoting generative-A.I. programs claim that they will unleash creativity. In essence, they are saying that art can be all inspiration and no perspiration—but these things cannot be easily separated. I’m not saying that art has to involve tedium. What I’m saying is that art requires making choices at every scale; the countless small-scale choices made during implementation are just as important to the final product as the few large-scale choices made during the conception. It is a mistake to equate “large-scale” with “important” when it comes to the choices made when creating art; the interrelationship between the large scale and the small scale is where the artistry lies.

Believing that inspiration outweighs everything else is, I suspect, a sign that someone is unfamiliar with the medium. I contend that this is true even if one’s goal is to create entertainment rather than high art. People often underestimate the effort required to entertain; a thriller novel may not live up to Kafka’s ideal of a book—an “axe for the frozen sea within us”—but it can still be as finely crafted as a Swiss watch. And an effective thriller is more than its premise or its plot. I doubt you could replace every sentence in a thriller with one that is semantically equivalent and have the resulting novel be as entertaining. This means that its sentences—and the small-scale choices they represent—help to determine the thriller’s effectiveness.

Many novelists have had the experience of being approached by someone convinced that they have a great idea for a novel, which they are willing to share in exchange for a fifty-fifty split of the proceeds. Such a person inadvertently reveals that they think formulating sentences is a nuisance rather than a fundamental part of storytelling in prose. Generative A.I. appeals to people who think they can express themselves in a medium without actually working in that medium. But the creators of traditional novels, paintings, and films are drawn to those art forms because they see the unique expressive potential that each medium affords. It is their eagerness to take full advantage of those potentialities that makes their work satisfying, whether as entertainment or as art.

Of course, most pieces of writing, whether articles or reports or e-mails, do not come with the expectation that they embody thousands of choices. In such cases, is there any harm in automating the task? Let me offer another generalization: any writing that deserves your attention as a reader is the result of effort expended by the person who wrote it. Effort during the writing process doesn’t guarantee the end product is worth reading, but worthwhile work cannot be made without it. The type of attention you pay when reading a personal e-mail is different from the type you pay when reading a business report, but in both cases it is only warranted when the writer put some thought into it.

Recently, Google aired a commercial during the Paris Olympics for Gemini, its competitor to OpenAI’s GPT-4. The ad shows a father using Gemini to compose a fan letter, which his daughter will send to an Olympic athlete who inspires her. Google pulled the commercial after widespread backlash from viewers; a media professor called it “one of the most disturbing commercials I’ve ever seen.” It’s notable that people reacted this way, even though artistic creativity wasn’t the attribute being supplanted. No one expects a child’s fan letter to an athlete to be extraordinary; if the young girl had written the letter herself, it would likely have been indistinguishable from countless others. The significance of a child’s fan letter—both to the child who writes it and to the athlete who receives it—comes from its being heartfelt rather than from its being eloquent.

Many of us have sent store-bought greeting cards, knowing that it will be clear to the recipient that we didn’t compose the words ourselves. We don’t copy the words from a Hallmark card in our own handwriting, because that would feel dishonest. The programmer Simon Willison has described the training for large language models as “money laundering for copyrighted data,” which I find a useful way to think about the appeal of generative-A.I. programs: they let you engage in something like plagiarism, but there’s no guilt associated with it because it’s not clear even to you that you’re copying.

Some have claimed that large language models are not laundering the texts they’re trained on but, rather, learning from them, in the same way that human writers learn from the books they’ve read. But a large language model is not a writer; it’s not even a user of language. Language is, by definition, a system of communication, and it requires an intention to communicate. Your phone’s auto-complete may offer good suggestions or bad ones, but in neither case is it trying to say anything to you or the person you’re texting. The fact that ChatGPT can generate coherent sentences invites us to imagine that it understands language in a way that your phone’s auto-complete does not, but it has no more intention to communicate.

It is very easy to get ChatGPT to emit a series of words such as “I am happy to see you.” There are many things we don’t understand about how large language models work, but one thing we can be sure of is that ChatGPT is not happy to see you. A dog can communicate that it is happy to see you, and so can a prelinguistic child, even though both lack the capability to use words. ChatGPT feels nothing and desires nothing, and this lack of intention is why ChatGPT is not actually using language. What makes the words “I’m happy to see you” a linguistic utterance is not that the sequence of text tokens that it is made up of is well formed; what makes it a linguistic utterance is the intention to communicate something.

Because language comes so easily to us, it’s easy to forget that it lies on top of these other experiences of subjective feeling and of wanting to communicate that feeling. We’re tempted to project those experiences onto a large language model when it emits coherent sentences, but to do so is to fall prey to mimicry; it’s the same phenomenon as when butterflies evolve large dark spots on their wings that can fool birds into thinking they’re predators with big eyes. There is a context in which the dark spots are sufficient; birds are less likely to eat a butterfly that has them, and the butterfly doesn’t really care why it’s not being eaten, as long as it gets to live. But there is a big difference between a butterfly and a predator that poses a threat to a bird.