This is a post from [4]Robin Sloan’s lab blog & notebook.

[7]Is it okay?

February 11, 2025

[8]Macbeth Consulting the Witches, 1825, Eugène Delacroix

How do you make a language model? Goes like this: erect a trellis of code, then allow the real program to grow, its development guided by a grueling training process, fueled by reams of text, mostly scraped from the internet.

Now. I want to take a moment to think together about a question with no remaining practical importance, but persistent moral urgency: Is that okay?

The question doesn’t have any practical importance because the AI companies — and not only the companies, but the enthusiasts, all over the world — are going to keep doing what they’re doing, no matter what.

The question does still have moral urgency because, at its heart, it’s a question about the things people all share together: the hows and the whys of humanity’s common inheritance. There’s hardly anything bigger.

And, even if the companies and the enthusiasts rampage ahead, there are still plenty of us who have to make personal decisions about this stuff every day. You gotta take care of your own soul, and I’m writing this because I want to clarify mine.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

A few ground rules.

First, if you (you engineer, you AI acolyte!) think the answer is obviously “yes, it’s okay”, or if you (you journalist, you media executive!) think the answer is obviously “no, it’s not okay”, then I will suggest that you are not thinking with sufficient sensitivity and imagination about something truly new on Earth. Nothing here is obvious.

Second, I’d like to proceed by depriving each side of its best weapon.

On the side of “yes, it’s okay”, I will insist that the analogy to human learning is not admissible.
“Don’t people read things, and learn from them, and produce new work?” Yes, but speed and scale always influence our judgments about safety and permissibility, and the speed and scale of machine learning is off the charts. No human, no matter how well-read, could ever field requests from a million other people, all at once, forever.

On the side of “no, it’s not okay”, I will set aside any arguments grounded in copyright law. Not because they are irrelevant, but because … well, I think modern copyright is flawed, so a victory on those grounds would be thin, a bit sad. Instead, I’ll defer to deeper precedents: the intuitions and aspirations that gave rise to copyright in the first place. To promote the Progress of Science and useful Arts, remember?

I hope partisans of both sides will agree this is a fair swap. Put down your weapons, and let’s think together.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

I want to go carefully, step by step — yet I want to do so with brevity. Language models produce so … many … WORDS, and they seem to coax just as many out of their critics. Logorrhea begets logorrhea. We can do better.

I’ll begin with my sense of what language models are doing. Here it is: language models collate and precipitate all the diverse reasons for writing, across a huge swath of human activity and aspiration.

Start to enumerate those reasons: to inform, to persuade, to sell this stupid alarm clock, to dump the CUSTOMERS table into a CSV file … and you realize it’s a vast field of desire and action, impossible to hold in your head. The language models have many heads.

In this formulation, language models are not merely trained on human writing. They are the writing: all those reasons, granted the ability to speak for themselves.
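A toy illustration of that claim, mine and not anything from a real lab: even the smallest conceivable “model”, a character-bigram table in plain Python, is grown entirely from its training text. Nothing like a transformer, of course, but the dependence on the corpus is the same in kind, if not in degree.

```python
# Toy sketch (hypothetical, for illustration only): the "trellis" is a few
# lines of code; everything the model can ever say precipitates out of the
# text it was trained on.
import random
from collections import defaultdict

def train(corpus: str) -> dict:
    """Count which character follows which: this table IS the whole 'model'."""
    model = defaultdict(list)
    for a, b in zip(corpus, corpus[1:]):
        model[a].append(b)
    return model

def generate(model: dict, seed: str, length: int = 40) -> str:
    """Extend the seed by sampling followers; every emitted character
    was put into the table by the training text."""
    out = seed
    for _ in range(length):
        followers = model.get(out[-1])
        if not followers:
            break
        out += random.choice(followers)
    return out

model = train("the model is the writing, the writing is the model")
print(generate(model, "the"))
```

Scale the table up by twelve orders of magnitude or so, swap counting for gradient descent, and the moral situation is unchanged: the code is the vessel, the writing is the cargo.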
I imagine the PyTorch code as a mech suit, with squishy language strapped in tight …

To make this work — you already know this, but I want to underscore it — only a truly rich trove of writing suffices. Train a language model on all of Shakespeare’s works and you won’t get anything useful, just a brittle Shakespeare imitator. In fact, the only trove known to produce noteworthy capabilities is: the entire internet, or close enough. The whole extant commons of human writing. From here on out, for brevity, we’ll call it Everything.

This is what makes these language models new: there has never, in human history, been a way to operationalize Everything. There’s never been anything close.

Just as, above, I set copyright aside, I want also to set aside fair use and the public domain. Again, not because they are irrelevant, but because those intuitions and frameworks all assume we are talking about using some part of the commons — not all of it. I mean: ALL of it!

If language models worked like cartoon villains, slurping up Everything and tainting it with techno-ooze, our judgment would be easy. But of course, digitization is trickier than that: the airy touch of the copy complicates the scenario.

The language model reads Everything, and leaves Everything untouched — yet suddenly this new thing exists, with strange and formidable powers.

Is that okay?

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

As we begin to feel our way across truly new terrain, we can inquire: how much of the value of these models comes from Everything?

If the fraction was just one percent, or even ten, then we wouldn’t have much more to say. But the fraction is, for sure, larger than that.

What goes into a language model? Data and compute. For the foundation models like Claude, data means: Everything. Compute combines two pursuits: 1. software: the trellises and applications that support the development and deployment of these models, and 2.
hardware: the vast sultry data centers, stocked with chips, that give them room to run.

There’s a lot of value in those pursuits; I don’t take either for granted, or the labor they require. The experience you get using a model like Claude depends on an ingenious scaffolding. [9]Truly!

At the same time: I believe anyone who works on these models has to concede that the trellises and the chips, without data, are empty vessels. Inert.

Reasonable people can disagree about how the value breaks down. While I believe the relative value of Everything in this mix is something close to 90%, I’m willing to concede a 50/50 split.

And here is the important thing: there is no substitute.

You’ve probably heard about the race to generate novel training data, and all the interesting effects such data can have. It is sometimes lost in those discussions that these sophisticated new curricula can only be provided to a language model already trained on Everything. That training is what allows it to make sense of the new material.

Also, it is often the case — not always, but often — that the novel training data is generated by … a language model … which has itself been trained on … you guessed it. It’s Everything, all the way down.

Would it be possible to commission a fresh body of work, Everything’s equal in scale and diversity, without any of the encumbrances of the commons? If you could do it, and you trained a clean-room model on that writing alone, I concede that my question would be moot. (There would be other questions! Just not this one.)

Certainly, with as much money as the AI companies have now, you’d expect they might try. We know they are already paying to produce new content, lots of it, across all sorts of business and technical domains. But this still wouldn’t match the depth and richness of Everything.
I have a hypothesis, which naturally might be wrong: that it is precisely the naivete of Everything, the fact that its writing was actually produced for all those different reasons, that makes it so valuable. Composing a fake corporate email, knowing it will be used to train a language model, you’re not doing nothing, but you’re not doing the same thing as the real email-writer. Your document doesn’t have the same … what? The same grain. The same umami.

Maybe one of these companies will spend ten billion dollars to commission a whole new internet’s worth of text and prove me wrong. However, I think there are information-theoretic reasons to believe the results of such a project would disappoint them.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

So! Understanding that these models are reliant on Everything, and derive a large fraction of their value from it, one judgment becomes clear:

If their primary application is to produce writing and other media that crowds out human composition, human production: no, it’s not okay.

For me, this is intuitively, almost viscerally, obvious. Here is the ultimate act of pulling the ladder up behind you, a giant “fuck you” to every human who ever wanted to accomplish anything, who matched desire to action, in writing, part of Everything. Here is a technology founded in the commons, working to undermine it. Immanuel Kant would like a word.

Fine. But what if that isn’t the primary application? What if language models, by collating and precipitating all the diverse reasons for writing, become flexible general-purpose reasoners, and most of their “output” is never actually read by anyone, instead running silent like the electricity in your walls?

It’s possible that language models could go on broadening and deepening in this way, and eventually become valuable [10]aids to science and technology, [11]to medicine and more.
This is tricky — it’s so, so tricky — because the claim is both (1) true, and (2) convenient. One wishes it wasn’t so convenient.

Can’t these companies simply promise, with every passing year, that AI super science is just around the corner … and meanwhile, wreck every creative industry, flood the internet with garbage, grow rich on the value of Everything? Let us cook — while culture fades into a sort of oatmeal sludge.

They can do that! They probably will. And the claim might still be true.

If super science is a possibility — if, say, Claude 13 can help deliver cures to a host of diseases — then, you know what? Yes, it is okay, all of it. I’m not sure what kind of person could insist that the maintenance of a media status quo trumps the eradication of, say, most cancers. Couldn’t be me. Fine, wreck the arts as we know them. We’ll invent new ones.

(I know that seems awfully consequentialist. Would I sacrifice anything, or everything, for super science? No. But art and media can find new forms. That’s what they do.)

Obviously, this scenario is especially appealing if the super science, like Everything at its foundation, flows out into the commons. It should.

So — is super science really on the menu? We don’t have any way of knowing; not yet. Things will be clearer in a few years, I think. There will either be real undeniable glimmers, reported by scientists putting language models to work, or there will still only be visions.

For my part, I think the chance of super science is below fifty percent, owing mostly to the friction of the real physical world, which the language models have, so far, avoided. But, I also think the chance is above ten percent, so, I remain curious.

It’s not unreasonable to find this wager suspicious, but if you do, I might ask: is there any possible-but-unproven technology that you think is worth pursuing even at the cost of itchy uncertainty in the present? If the answer is “yes, just not this one”: fair enough. If the answer is “no”: aha!
I see you’ve answered the question at the top of this page for yourself already.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Where does this leave us? I suppose it’s not surprising, in the end:

If an AI application delivers some profound public good, or even if it might, it’s probably okay that its value is rooted in this unprecedented operationalization of the commons.

If an AI application simply replicates Everything, it’s probably not okay.

I’ll sketch out my current opinions more specifically:

I think the image generation models, trained on the Everything of pictures, are: probably not okay. They don’t do anything except make more images. They pee in the pool.

I think the foundation models like Claude are: probably okay. If it seemed, a couple of years ago, that they were going to be used mainly to barf out text, that impression has faded. It’s clear their applications are diverse, and often have more to do with processes than end products.

The case of translation is compelling. If language models are, indeed, the Babel fish, they might justify the operationalization of the commons even without super science.

I think the case of code is especially clear, and, for me, basically settled. That’s both (1) because of where code sits in the creative process, as an intermediate product, the thing that makes the thing, and (2) because the commons of open-source code has carried the expectation of rich and surprising reuse for decades. I think this application has, in fact, already passed the threshold of “profound public good”: opening up programming to whole new groups of people.

But, again, it’s important to say: the code only works because of Everything. Take that data away, train a model using GitHub alone, and you’ll get a far less useful tool.
Maybe (it turns out) I’m less interested in litigating my foundational question and more interested in simply insisting on the overwhelming, irreplaceable contribution of this great central treasure: all of us, writing, for every conceivable reason; desire and action, impossible to hold in your head.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Did we make progress here? I think so.

It’s possible my question, at the outset, seemed broad. In fact, it’s fairly narrow, about this core mechanism, the operationalization of the commons: whether I can live with it, or not.

One extreme: if these machines churn through all media, and then, in their deployment, blow away any prospect for a healthy market for human-made media, I’d say, no, that’s not what we want from technology, or from our future.

Another extreme: if these machines churn through all media, and then, in their deployment, discover several superconductors and cure all cancers, I’d say, okay … we’re good.

What if they do both? Well, it would be a bummer for media, but on balance I’d take it. There will always be ways for artists to get out ahead again. More on that in another post.

I also think there are some potential policy remedies that would even out the allocation of value here — although, these days, imagining interesting policy is a sort of fantastical entertainment. Even so, I’ll post about those later, too.

In this discussion, I set copyright and fair use aside. I should say, however, that I’m not at all interested in clearing the air for AI companies, legally. They’ve chosen to plunge ahead into new terrain — so let them enjoy the fog of war, Civ-style. Let them cook!
References:

[1] https://www.robinsloan.com/lab/
[2] https://www.robinsloan.com/about/
[3] https://www.robinsloan.com/moonbound/
[4] https://www.robinsloan.com/
[5] https://www.robinsloan.com/lab/
[6] https://www.robinsloan.com/about/
[7] https://www.robinsloan.com/lab/is-it-okay/
[8] https://www.clevelandart.org/art/1962.109?utm_source=Robin_Sloan_sent_me
[9] https://www.youtube.com/watch?v=ugvHCXCOmm4#t=9780
[10] https://research.google/blog/accelerating-scientific-breakthroughs-with-an-ai-co-scientist/?utm_source=Robin_Sloan_sent_me
[11] https://darioamodei.com/machines-of-loving-grace?utm_source=Robin_Sloan_sent_me
[12] https://www.robinsloan.com/lab/
[13] https://www.robinsloan.com/about?utm_source=Robin_Sloan_sent_me
[16] https://www.robinsloan.com/colophon/