Authors’ pirated books used to train Generative AI

by Suswati Basu

The convergence of artificial intelligence and creative works has paved the way for innovative developments and transformative applications, but it hasn’t come without its share of ethical and legal dilemmas. Recently, a growing concern has come to the forefront: the use of copyrighted books to train generative AI models, which have the potential to replicate and generate text that closely mimics human language patterns.

Read: Amazon halts AI-generated books impersonating author Jane Friedman

Authors such as Stephen King, Zadie Smith, and Michael Pollan, among many others, have found their works fuelling, without their knowledge, the expansion of AI systems capable of producing human-like answers and content. King has since responded to the claims in an op-ed in The Atlantic, where he said his entire body of work “could fit on one thumb drive.”

“Because the capacity of computer memory is so large—everything I ever wrote could fit on one thumb drive, a fact that never ceases to blow my mind—these programmers can dump thousands of books into state-of-the-art digital blenders. Including, it seems, mine. The real question is whether you get a sum that’s greater than the parts, when you pour back out.”

Stephen King

We look into the revelations made by writer and programmer Alex Reisner in The Atlantic regarding the uncharted territory of authors’ pirated books being harnessed to power generative AI.

Behind closed doors: the secretive world of AI training

Generative AI systems like ChatGPT are designed to understand and generate human-like text. To achieve this, they require vast amounts of textual data to learn from. According to Reisner, while some training text comes from sources like Wikipedia and online articles, the high-quality input needed to produce sophisticated AI responses often comes from books.

Read: Prosecraft AI takes down site after being accused of using authors’ works

The lawsuit filed by writers Sarah Silverman, Richard Kadrey, and Christopher Golden against Meta has brought to light the issue of AI companies using copyrighted books to train large language models.

The disturbing reality: pirated books as AI inputs

The heart of the matter lies in the revelation that upwards of 170,000 copyrighted books, including works by notable authors such as Stephen King and Zadie Smith, have been used to train AI models like LLaMA, GPT-J, and BloombergGPT. The dataset known as “Books3” contains a wide range of fiction and nonfiction books from both prominent and lesser-known publishers.

Just this week, TorrentFreak reported that Danish anti-piracy group Rights Alliance had the prominent “Books3” dataset taken down. A takedown notice sent on behalf of publishers prompted “The Eye” to remove the 37GB dataset of nearly 200,000 books, which it had hosted for several years, although copies continue to show up elsewhere. The dataset was apparently first released in 2020.

Authors’ pirated books are reportedly being used to train generative AI models. Twitter screenshot shows initial date of release.

The dataset’s widespread use, not limited to a single AI model, raises questions about ethical practices and the potential consequences of utilising pirated content for commercial purposes.

Which authors’ books are affected by generative AI?

Out of the 170,000 titles in the dataset, roughly one-third are fiction, while two-thirds are nonfiction, from both big and small publishers. Notable publishers include:

  • Over 30,000 titles from Penguin Random House and its imprints.
  • 14,000 titles from HarperCollins.
  • 7,000 titles from Macmillan.
  • 1,800 titles from Oxford University Press.
  • 600 titles from Verso.

The collection comprises both fiction and nonfiction works by well-known authors such as:

  • Elena Ferrante and Rachel Cusk.
  • At least nine books by Haruki Murakami.
  • Five books by Jennifer Egan.
  • Seven books by Jonathan Franzen.
  • Nine books by bell hooks.
  • Five books by David Grann.
  • 33 books by Margaret Atwood.

The dataset also includes some questionable entries, such as:

  • 102 pulp novels by L. Ron Hubbard.
  • 90 books by the Young Earth creationist pastor John F. MacArthur.
  • Multiple works of aliens-built-the-pyramids pseudo-history by Erich von Däniken.

Accessing The Pile: unmasking the contents of Books3

The investigation conducted by Reisner revealed the massive dataset called “the Pile,” which contains the Books3 dataset as well as material from various sources, such as YouTube subtitles and European Parliament documents. The extensive and diverse range of texts underscores the complexity of AI systems, which focus on word relationships rather than specific subject matter. The Pile’s contents, extracted and analysed by Reisner, exposed the scale and variety of pirated works used to train AI models, leading to ethical concerns about the origins and legality of the training data. Not only has Meta alluded to using The Pile for training its in-house AI according to Gizmodo, but Google was slapped with a lawsuit for ‘secretly stealing’ data to train Bard.

Reisner reported that a Meta spokesperson declined to comment on the company’s use of Books3, while Bloomberg did not respond to emails requesting comment. In addition, Stella Biderman, EleutherAI’s executive director, did not dispute that the company used Books3 in GPT-J’s training data.

The fair-use debate: copyright, creativity, and control

The debate surrounding the fair use of copyrighted material for training AI models is complex. While some argue that AI-generated content constitutes transformative works that enrich culture without harming the market for original works, others stress the need for authors to retain control over their creations. The blurred line between transformative works and unauthorised use complicates legal arguments, as AI companies maintain their proprietary stance, insisting on control over their models’ outputs and usage.

Clashing cultures: open-source ideals vs. copyright protection

The generative AI community’s affinity for open-source ideals collides with the traditional publishing world’s commitment to copyright protection. Open-source developers, fuelled by a desire to share and modify software freely, sometimes overlook the necessity of more restrictive licences for creative endeavours that require investment and control over reproduction and distribution.

Read: AI open letter: authors including Margaret Atwood urge companies to honour copyright

This cultural clash becomes particularly evident as AI systems are powered by pirated content, undermining authors’ rights and the financial rewards they deserve.

Navigating the nexus of AI and copyright

The use of copyrighted books to train generative AI models raises intricate questions about ethics, copyright, and the future of creative works. As AI technologies continue to advance, the issue of pirated content powering AI systems underscores the need for a thoughtful and balanced approach. Bridging the gap between the open-source ethos of AI development and the rights of content creators will require a delicate equilibrium to ensure that technological progress doesn’t come at the expense of authors’ intellectual property. As the conversation unfolds, a deeper understanding of these complex dynamics will be crucial in shaping the ethical landscape of AI-powered creativity. In the meantime, there’s definitely a showdown approaching between the tech and publishing industries.
