Hi you,
On this Dr. Martin Luther King Jr. Day, I find myself reaching for his wisdom as a salve for these difficult moments, and reflecting on how the moral clarity we ascribe to the civil rights movement was not received in its time with anything like universal embrace. I’m also thinking today of Gen Z, our hard conversations about Israel and Palestine, and the general narrative I keep encountering among older people that the kids just don’t understand the real world and need to grow up. I hope we can acknowledge that mass killing of all kinds is unacceptable. I hope we can remember, as King reminded us, that “all life is interrelated.” That “we are all caught in an inescapable network of mutuality, tied into a single garment of destiny. Whatever affects one directly, affects all indirectly.”
If you’re in need of some light-filled ways to face dark times, take a look at the work of Rev. angel Kyodo williams, whom I recently had the pleasure of meeting; or this episode of Krista Tippett’s On Being podcast with biomimicry evangelist Janine Benyus; or this ridiculous Australian ad satirizing the generation gap. Oh, and if you have a flight on United Airlines coming up, turn on that in-flight entertainment to see my PBS series America Outdoors, which is now in the system!
Now for some thoughts about technology and crediting sources in your work. No, it’s not about Bill Ackman’s war on everybody, but about The New York Times suing OpenAI…
| A.I. & Its Times |
| On OpenAI eating the internet, the lawsuit heard ‘round the world, and a blueprint for detente with the media and entertainment business. |
| As you probably already know, The New York Times recently sued OpenAI and Microsoft, alleging that their generative A.I. tools—specifically ChatGPT and Bing Chat, now Copilot—have infringed its copyrights by summarizing and in some cases even offering verbatim regurgitations of Times articles. (Why we’ve settled on “regurgitation” versus “reproduction” or “replication” is beyond me—it makes the entire field sound disgusting and triggers my own gag reflex.) My colleague Eriq Gardner has the details of the suit, but in short, the Times wants “billions of dollars in statutory and actual damages,” although they would presumably settle for a fraction of that amount, plus a healthy royalty. Wouldn’t we all?
The Times is also seeking the “destruction” of the A.I. models trained on its content, a dramatic phrase that I can only hope involves Times staffers smashing OpenAI servers with baseball bats like that scene from Office Space, live-streamed as a benefit for subscribers. Now that’s something I would pay to see!
Anyway, the suit has been on my mind for a number of obvious reasons. Of course, I’m an intellectual property generator myself, which is a pretty lifeless way of saying I’ve published a book, produced podcasts, video series, and television shows, and delivered countless live talks and performances. I’m writing this article that you’re reading right now. And as a creator, I’ve had many reservations about how my creative output might be scraped and repurposed by large language models, although I haven’t sued any A.I. companies—at least not yet.
In general, I’m not opposed to my work contributing to a greater corpus of knowledge. I post on the internet, and I want my words and ideas to reach people. But I don’t like the notion of these models mischaracterizing my work, or replicating it without permission, attribution, or compensation. And I really don’t like the idea of a company monetizing my creative and intellectual contributions to the public good for private gain. Yes, yes, there are fair use exceptions and all that, but I didn’t begin publishing to the internet all those years ago with the expectation that I would be providing training data for a platform that might seek to replace me.
On the flip side, I’ve been happily experimenting with several generative A.I. systems and have found real value in them. I’ve used them to help with recipes and research, brainstorming and editing, synthesizing and remixing. I’m complicit in a system I critique. It’s like capitalism: neither all good nor all bad but complicated and sometimes confusing. So I’m glad the Times has filed its suit, and I hope it helps to clarify the rules, rights, and responsibilities that govern these emerging technologies.
Last year, around the launch of GPT-4, I published an essay outlining my early observations and concerns. Since then, of course, OpenAI has gone through a half-dozen crises—including the whiplash-inducing ouster and reinstatement of Sam Altman—and mesmerizing technological leaps forward. And now, my latest thoughts on where we stand as I update my own mental model of how the A.I. revolution plays out. |
| How Much Is Our Data Worth? |
| At the moment, generative A.I. models are essentially useless without access to human-generated information and data sets, which allow them to train their neural networks and to do the seemingly magical things they do. What is that data worth, and how would we calculate its economic value?
The Times, for instance, claims in its suit that its archives were the single largest proprietary data set ingested by ChatGPT. There’s no question that this library offers immense value for OpenAI, even if the two parties disagree about the precise figure to ascribe to that value. Alas, the calculus becomes even harder when you consider, say, image-generation platforms like Midjourney or Stable Diffusion, which have been sued for allegedly infringing the rights of “millions of artists” by scraping billions of images from the web “without the consent of the original artists.” (The judge overseeing that case has asked the plaintiffs to refile with more specifics, while allowing the central allegation to stand; he also encouraged all creators to formally register copyrights for their work, since it makes legal proceedings much smoother.) The more data sources in the A.I. soup, the harder it is to reverse engineer a new work’s origins.
These are novel challenges for creators, publishers, and lawyers. Of course, we’ve grappled with similar legal questions in the past with the advent of the printing press, radio, television, etcetera, all of which were ultimately resolved via licensing and royalties. Digital media disruptors like Napster and YouTube also upended pre-existing business models, after considerable soul-searching and expensive litigation. But in the end, they and their business descendants came to a legal (although not particularly profitable) arrangement with artists.
Today’s A.I. conundrum, however, is different in several critical ways. While the Times cites a few examples of ChatGPT regurgitating near-verbatim copy from its articles, these systems aren’t really designed for that use. Most users aren’t going to ask ChatGPT for complete, exact replicas of copyrighted material. If they were, we’ve already developed systems to track copyrighted material on digital platforms through digital fingerprinting, for example. (YouTube has this in place when I use licensed music in my videos, and the rights holders get a share of any ad money I’m making.) Instead, these models “learn” from that copyrighted work and produce something that can have the same core ideas, information, and even styles as those original works. It’s not a copy, but the effect is often the same: the creation of a potential substitute with negative economic consequences for the original creators.
That complicates our ability to track the multiplicity of inputs and attribute their value. It’s not comparable to, say, radio, wherein SoundExchange can collect and distribute digital performance royalties from radio stations to artists and record labels. In that example, stations essentially pay a share of their revenues into a fund, which is distributed to rights holders and artists based on how much their songs are played. Those plays are counted based on third party and self-reported monitoring systems, which simplify the process of attribution and compensation. Tracking the share of Times content that contributed to any particular LLM response is a whole lot trickier. |
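To make the radio comparison concrete, here is a minimal sketch of a pro rata royalty pool, assuming a simple "share of plays equals share of the fund" rule. The function name, the holder names, and the numbers are all hypothetical; this is not SoundExchange's actual formula, just the general shape of the arrangement described above.

```python
def split_royalty_pool(pool_cents, play_counts):
    """Split a royalty pool in proportion to play counts.

    pool_cents: total fund to distribute, in whole cents (hypothetical input).
    play_counts: dict mapping rights holder -> number of monitored plays.
    Returns a dict mapping rights holder -> payout in whole cents.
    """
    total_plays = sum(play_counts.values())
    if total_plays == 0:
        # Nothing was played, so nothing is paid out.
        return {holder: 0 for holder in play_counts}
    # Floor division keeps payouts in whole cents; any leftover cents
    # simply stay in the pool in this sketch.
    return {
        holder: pool_cents * plays // total_plays
        for holder, plays in play_counts.items()
    }


# Hypothetical example: a $1,000 pool and three rights holders.
payouts = split_royalty_pool(100_000, {"artist_a": 600, "artist_b": 300, "artist_c": 100})
print(payouts)
```

The arithmetic is trivial once plays can be counted. The hard part with an LLM is that there is no equivalent of a "play" to count for each response.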
| One way to test the value of Times data to OpenAI would be to simply remove it from the model. (Easier said than done, I’m sure.) A.I. researchers have a way to approximate this method through something called an “ablation study.” Essentially, you create a version of a model without access to a source of data and compare its results to those of the model that uses that data. The difference in their outputs can then be attributed to that single source of data. We’d all have a clearer answer to the question of value then, at least qualitatively, and we might be surprised at how much or little any single data source is worth in the end. In this case, for instance, OpenAI now has licensing arrangements in place with the Associated Press and Axel Springer, the parent company of Politico and Business Insider, among others. How much would GPT-4 suffer if it relied on those publications, but not the Times?
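For the curious, the logic of an ablation study can be sketched in a few lines. This is a toy under loud assumptions: the "model" here is nothing more than the set of words it has seen, and the source names, texts, and scoring rule are all invented for illustration. A real study would retrain an actual model, which is vastly more expensive.

```python
def train(sources):
    """Toy 'model': the set of words seen across all training sources."""
    vocab = set()
    for text in sources.values():
        vocab.update(text.lower().split())
    return vocab


def score(model, eval_questions):
    """Fraction of words in the evaluation set that the toy model has seen."""
    words = [w for q in eval_questions for w in q.lower().split()]
    return sum(w in model for w in words) / len(words)


def ablation_delta(sources, held_out, eval_questions):
    """Score with every source, minus score with one source removed."""
    full = score(train(sources), eval_questions)
    ablated_sources = {k: v for k, v in sources.items() if k != held_out}
    ablated = score(train(ablated_sources), eval_questions)
    return full - ablated


# Hypothetical training sources and evaluation prompts.
sources = {
    "times": "senate passes budget bill",
    "wire": "storm hits coast",
    "blog": "budget recipes for dinner",
}
eval_questions = ["senate budget", "storm coast"]

for name in sources:
    print(name, ablation_delta(sources, name, eval_questions))
```

In this toy, removing a source that merely repeats what others already cover costs the model nothing, which is exactly the kind of surprise a real ablation might surface.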
An economist friend of mine recently suggested another possibility: auctions, wherein owners allow generative A.I. companies to bid for the rights to use their data for training. The data owners could offer combinations of data, limit the rights so they are exclusive to the winner, or be allowed to sell the same data to multiple buyers. (One can imagine features like the ability to set a minimum “reserve price” below which you won’t sell, too.) I doubt you could implement real-time dynamic pricing—as there is in the programmatic ad market, for instance—since LLMs aren’t constantly accessing the data source with each user prompt. But for A.I. companies that want to limit their legal risk, an auction could be a good way to set pricing in a multibuyer, multiseller world, with some schedule for rerunning the auction over time to ensure the pricing remains relevant.
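Here is a minimal sketch of what one round of such an auction might look like, assuming a sealed-bid, highest-bid-wins format with a reserve price. That format is my assumption, not something my economist friend specified, and the bidder names and amounts are hypothetical.

```python
def run_auction(bids, reserve_price):
    """Run one sealed-bid round for a data license.

    bids: dict mapping bidder -> offered amount (hypothetical units).
    reserve_price: minimum amount the data owner will accept.
    Returns (winner, price), or None if no bid meets the reserve.
    """
    qualifying = {bidder: amount for bidder, amount in bids.items() if amount >= reserve_price}
    if not qualifying:
        # The owner keeps the data rather than sell below the reserve.
        return None
    # Highest bid wins and pays what it bid; ties go to the first bidder listed.
    winner = max(qualifying, key=qualifying.get)
    return winner, qualifying[winner]


# Hypothetical bids from three A.I. companies, with a reserve of 6.
print(run_auction({"lab_a": 5, "lab_b": 8, "lab_c": 7}, reserve_price=6))
```

Rerunning the same function on a schedule, with fresh bids each time, would be one simple way to keep the price current.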
Auctions or not, I’m more convinced than ever of one of my 2024 predictions: that we’re about to enter a data gold rush. With the appropriate financial and legal framework in place, the generative A.I. era could be a boon for creators and struggling publishers looking for new ways to monetize their work. It could be particularly lucrative for companies with access to large amounts of specialized human knowledge, such as textbook publishers and scientific research outfits, but also businesses like Quora or even Reddit, which implemented new A.P.I. pricing last year specifically to get paid for LLM training. Now feels like the right time to sort this out, whether through deals, lawsuits, or legislation, as is happening around the world. |
| Inspiration vs. Imitation |
| Of course, regardless of a likely future settlement between the Times and OpenAI, or between other content creators and A.I. platforms, there remains the broader legal and philosophical question of whether the process of “training” LLMs is comparable to human learning or more like machine mimicry. If OpenAI can establish that machine learning is similar to human learning, for the purposes of ingesting intellectual property at least (another digestive metaphor for you!), then they will also have gotten a step closer to extending human rights to machines, which itself may help along their path to creating artificial superintelligence. And if they own this superintelligence, they can own the foundational model for almost all business and creative activity in the future.
The resolution of this evolving debate will be hugely consequential: So far, courts have ruled that works created solely by generative A.I. systems can’t be copyrighted, but with human involvement they may be. This part of the battle is just beginning, and it’s going to challenge us to answer very tough and occasionally weird questions.
Is the output of a generative model equivalent to authorship? How is what ChatGPT or Midjourney does different from what humans do when we absorb prior art and knowledge and transform it via our own creations? Can we just call this entire LLM universe a giant plagiarism machine? It’s easy to fall into the trap of philosophizing without grounding these discussions in the real world with real consequences. The New York Times lawsuit gives us one high-attention opportunity to do just that. Let’s assume that these machine learning systems really learn and really transform work and create, and they do that at a speed and scale we never could have imagined for ourselves. If the consequences of that involve massive value destruction, job displacement, misinformation, and other forms of destabilization, should we do it, and who should decide?
Right now, however, the primary question is about money. OpenAI, Microsoft, and most of the other operators of large language models don’t want to have to pay for access to this underlying knowledge. And I don’t blame them. I don’t generally offer money for goods and services that don’t have a price tag. On the other hand, of course, they are fighting over access to raw materials, which do confer value. And it’s hard to address this quandary without realizing that the Sulzberger family, which controls the Times, probably doesn’t want to make the same mistake they did 25 or so years ago. After all, the economics of the news business would be very different if large publishers had not let the first generation of search engines crawl their archives.
There’s one more idea I want to put on the table. I think part of what complicates this moment is the relatively private and secretive nature of the LLM businesses operating today. The artists, authors, newspapers, and others who find that their material was used to train A.I. models all discover it after the fact. There’s no meaningful participation by I.P. owners or the public in the creation and deployment of what could be the most disruptive technology ever. And the operators of these models haven’t made commitments to public access to the models themselves or their benefits. That all encourages suspicion—as does the recent, and still bewildering, civil war at OpenAI, which seemed predicated on internal divisions about managing the balance of commercial and humanitarian interests.
There’s another way to do this that involves the stakeholders participating in the creation, deployment, and benefits of these systems, by design. This brief by the Data & Society Research Institute is inspiring in how it urges us to expand our imagination of how this all plays out. Citing examples from environmental impact, transportation, and science, the researchers lay out a number of ways the public can be involved in shaping, not just reacting to, innovation. That participation can come in the form of surveys, citizen assemblies, feedback sessions, and more, but any version of it requires commitment and participation by those doing the building.
As for how we handle the benefits of these A.I. systems, which are ultimately going to be built on the full breadth of human knowledge, perhaps we can turn to an existing playbook: taxes. As individuals and companies, we benefit from public investment and common resources like roads, education, even the rule of law. A.I. companies building on public knowledge may ultimately have to commit to some form of public return on value, a reinvestment in that common knowledge, beyond their mere existence. What that looks like should be up to all of us. |
| That’s all for today. If you have feedback on this or any of my pieces, or want to get in touch, hit me up via baratunde@puck.news. |
Puck is published by Heat Media LLC. 227 W 17th St New York, NY 10011.