In the sphere of publishing and technology, another storm of controversy has erupted, with a staggering 183,000 books at its epicentre. Freelance writer Alex Reisner shone a light upon a dataset christened “Books3,” encompassing over 191,000 books. According to The Atlantic, these literary works were illicitly used for the training of generative AI systems by corporate giants such as Meta and Bloomberg, setting in motion a wave of legal disputes. Distinguished authors, including Sarah Silverman, Michael Chabon, and Paul Tremblay, have sought legal recourse, alleging copyright infringement against Meta for deploying their cherished creations in AI training.
Read: Authors’ pirated books used to train Generative AI
Reisner’s exposé has resonated with authors, as numerous writers have come to realise that their work, imagination, and research were unwittingly harnessed to build the very machines that may one day eclipse their human counterparts. The architects of these AI systems stand on the brink of reaping substantial profits, whilst the creators of the original content languish in a state of uncertainty.
Generative-AI training and the role of illicitly sourced books
In response to these damning allegations, Meta, the tech behemoth ensnared in these lawsuits, has refrained from providing unequivocal responses regarding the use of pirated books to instruct their generative-AI product known as LLaMA. Instead, they alluded to a recent court filing, asserting that neither LLaMA nor its output bears substantial resemblance to the authors’ books.
Books, Reisner says, wield pivotal influence in nurturing generative-AI proficiency. “Their long, thematically consistent paragraphs provide information about how to construct long, thematically consistent paragraphs—something that’s essential to creating the illusion of intelligence,” he adds. Consequently, tech titans routinely deploy vast collections of books without authorisation, procurement, or licensing, while Meta’s legal counsel contends that neither the model nor its outputs substantially mimic existing literature. Human language is vast and flexible; it can be moulded into countless shapes in ways that AI cannot yet match on its own.
Read: More authors including Michael Chabon sue OpenAI and Meta over copyright due to training
In the training regimen, Reisner explains, a “generative-AI system essentially builds a giant map of English words—the distance between two words correlates with how often they appear near each other in the training text.” The end product, a large language model, yields more plausible responses for subjects that feature prominently in its training materials. The lack of transparency concerning the origins of these training data raises concerns, given that a system predominantly schooled in Western literature might falter when interrogated about works from the global South and East.
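Reisner’s “map of words” can be illustrated with a toy co-occurrence count. This is a deliberately simplified sketch, not Meta’s actual pipeline: real systems learn dense vector embeddings over billions of tokens, and the corpus, window size, and function names below are invented for illustration.

```python
from collections import Counter

def cooccurrence(corpus, window=4):
    """Count how often each unordered pair of words appears within
    `window` tokens of each other."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                counts[tuple(sorted((w, v)))] += 1
    return counts

corpus = [
    "the author wrote the novel",
    "the author signed the novel",
    "the model was trained on the novel",
]
counts = cooccurrence(corpus)

# "author" and "novel" co-occur more often than "author" and "model",
# so an embedding built from these counts would place "author" nearer
# to "novel" than to "model".
print(counts[("author", "novel")], counts[("author", "model")])  # 2 0
```

The same principle, scaled up, is why a model answers more plausibly about subjects that dominate its training data: those regions of the “map” are simply denser.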

This reservoir of books, predominantly drawn from the English-speaking Western world, spans various genres, with 236 entries belonging to the bard himself, William Shakespeare, occupying the zenith. It encompasses the works of esteemed authors such as Agatha Christie, James Patterson, Danielle Steel, and R. L. Stine. Surprisingly, it also comprises 108 entries attributed to L. Ron Hubbard, the founder of Scientology. The sheer volume of an author’s text in the dataset significantly influences a model’s behaviour, so a model trained on “Books3” would exhibit a conspicuous bias towards Shakespeare over Nobel laureate Alice Munro. Academic publishers like Routledge and Oxford University Press make substantial contributions, and a staggering 1,141 Lonely Planet guides enrich the dataset.

Read: AI open letter: authors including Margaret Atwood urge companies to honour copyright
The predicament confronting authors within the realm of generative AI may transcend the scope of copyright law, underscoring the secretive and non-consensual nature of AI training practices. How these programmes are developed remains a mystery to most, even as they pose existential threats to writers. Within “Books3,” books are stored as colossal, unattributed text blocks, so identifying authors and titles requires meticulous ISBN extraction and database searches – a task that Reisner successfully accomplished for 183,000 of the 191,000 titles.
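Reisner’s exact pipeline is not public, but the ISBN-recovery step he describes can be sketched roughly as follows. The regex, sample text, and function names are illustrative assumptions, and the subsequent metadata lookup against a book database is omitted.

```python
import re

# ISBN-13 candidates: "978" or "979" followed by ten more digits,
# optionally separated by hyphens or spaces.
ISBN13_RE = re.compile(r"\b97[89][-\s]?(?:\d[-\s]?){9}\d\b")

def is_valid_isbn13(candidate):
    """Validate the ISBN-13 check digit (alternating 1/3 weights, mod 10)."""
    digits = [int(c) for c in re.sub(r"[^0-9]", "", candidate)]
    if len(digits) != 13:
        return False
    return sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits)) % 10 == 0

def extract_isbns(text):
    """Return the validly checksummed ISBN-13 strings found in a text block."""
    return [m.group(0) for m in ISBN13_RE.finditer(text) if is_valid_isbn13(m.group(0))]

block = "unattributed text ISBN 978-0-306-40615-7 more text 978-0-306-40615-8 end"
print(extract_isbns(block))  # ['978-0-306-40615-7'] - the second fails its check digit
```

Filtering on the check digit keeps random thirteen-digit strings from masquerading as books; the recovered ISBNs would then be looked up to attribute each block to an author and title.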
“It is terrifying to think that my novel, which tackles imperialism and racism and homophobia, could be used by AI systems completely untethered from ethical concerns. Not just terrifying, sickening. My words in their mouth. Just vile.”
Damian Barr, You Will Be Safe Here Author
Upon the release of the database by The Atlantic, author Damian Barr found that his work had indeed been used, adding, “This is theft, pure and simple.”
Transparency and ethical concerns: the enigma of AI training data
The issue of bias in AI systems has been extensively documented: skewed training data can produce detrimental outcomes, as facial-recognition systems have shown. “Books3,” however, offers a distinctive vantage point – what mix of books would constitute an impartial dataset? The distribution of viewpoints within historical narratives presents additional complexities. When algorithms rather than human judgement curate and filter knowledge, passing it through the biases baked into their data, it becomes difficult to discern what is actually true.
Read: Stephen Fry raises alarm over AI identity theft using Harry Potter at CogX
As AI chatbots progressively supplant traditional search engines, the tech industry’s capacity to mould information access and manipulate perspectives escalates exponentially. While the internet democratised access to information by eradicating geographical constraints, AI chatbots reintroduce gatekeepers – inscrutable, unaccountable entities susceptible to “hallucinations” and potentially lacking proper source citation.
In a recent legal manoeuvre, Meta sought to downplay the significance of “Books3” within LLaMA’s training data, emphasising its diminutive proportion. However, the core concern remains unaddressed – to what extent does LLaMA depend on these texts to generate summaries and responses? Current algorithms enshroud this process in obscurity.
Read: Supergroup of authors including George R.R. Martin sue OpenAI
Despite offering an initial glimpse into this world, “Books3” represents but a fraction of the enigmatic universe of training data. The majority remains concealed behind securely locked doors, inviting further scrutiny and raising disconcerting questions about the trajectory of AI’s impact on our world.