Unauthorised AI training: 183,000 books incite legal clashes

by Suswati Basu

In the sphere of publishing and technology, another storm of controversy has erupted, with a staggering 183,000 books at its epicentre. Freelance writer Alex Reisner shone a light upon a dataset christened “Books3,” encompassing over 191,000 books. According to The Atlantic, these literary works were illicitly used for the training of generative AI systems by corporate giants such as Meta and Bloomberg, setting in motion a wave of legal disputes. Distinguished authors, including Sarah Silverman, Michael Chabon, and Paul Tremblay, have taken legal recourse, alleging copyright infringement against Meta for deploying their cherished creations in AI training.

Read: Authors’ pirated books used to train Generative AI

Reisner’s exposé has resonated with authors, many of whom have come to realise that their work, imagination, and research were unwittingly harnessed to build the very machines that may one day eclipse their human counterparts. The architects of these AI systems stand on the brink of reaping substantial profits, whilst the originators of the underlying content languish in uncertainty.

Generative-AI training and the role of illicitly sourced books

In response to these damning allegations, Meta, the tech behemoth ensnared in these lawsuits, has refrained from giving an unequivocal answer on whether pirated books were used to train its generative-AI model, LLaMA. Instead, it pointed to a recent court filing asserting that neither LLaMA nor its output bears substantial resemblance to the authors’ books.

Books, Reisner says, wield pivotal influence in nurturing generative-AI proficiency. “Their long, thematically consistent paragraphs provide information about how to construct long, thematically consistent paragraphs—something that’s essential to creating the illusion of intelligence,” he adds. Consequently, tech titans routinely deploy vast collections of books without authorisation, purchase, or licensing, while Meta’s legal counsel contends that neither its generative AI’s outputs nor the model itself substantially mimics existing literature. Human language is vast and flexible, and can be used and moulded into shapes that AI does not yet have the capacity to create by itself.

Read: More authors including Michael Chabon sue OpenAI and Meta over copyright due to training

In describing the training regimen, Reisner explains that a “generative-AI system essentially builds a giant map of English words—the distance between two words correlates with how often they appear near each other in the training text.” The end product, a large language model, yields more plausible responses on subjects that feature prominently in its training materials. The lack of transparency concerning the origins of these training data raises concerns, given that a system schooled predominantly in Western literature might falter when asked about works from the global South and East.
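To make Reisner’s “map of words” idea concrete, here is a minimal, purely illustrative Python sketch: it counts how often words appear near one another in a toy corpus, a crude stand-in for the statistical relationships a real large language model distils from millions of book pages. The corpus, window size, and function name are assumptions made for demonstration, not anything drawn from an actual training pipeline.

```python
from collections import Counter

# Toy corpus standing in for book text in a training set (illustrative only).
corpus = [
    "the king spoke to the queen in the great hall",
    "the queen answered the king with a long and thematically consistent speech",
]

WINDOW = 4  # how many following words count as "near each other"

def cooccurrence_counts(sentences, window=WINDOW):
    """Count how often pairs of words appear within `window` words of each other."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            for neighbour in words[i + 1 : i + 1 + window]:
                pair = tuple(sorted((word, neighbour)))
                counts[pair] += 1
    return counts

# Pairs that co-occur often would sit "close together" on such a map; real
# models turn these statistics into dense vectors rather than raw counts.
print(cooccurrence_counts(corpus).most_common(5))
```

A subject that barely appears in the corpus barely registers on the map, which is why a model trained overwhelmingly on Western books is likelier to stumble over literature from elsewhere.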

Number of books listed by author in the AI-training database. Credit: Suswati Basu / The Atlantic / Alex Reisner

This reservoir of books, drawn predominantly from the English-speaking Western world, spans various genres, with the Bard himself, William Shakespeare, occupying the zenith at 236 entries. It encompasses the works of esteemed authors such as Agatha Christie, James Patterson, Danielle Steel, and R. L. Stine. Surprisingly, it also comprises 108 entries attributed to L. Ron Hubbard, the founder of Scientology. Document size – in effect, the character count – significantly influences a model’s behaviour, with models trained on “Books3” exhibiting a conspicuous bias towards Shakespeare over Nobel laureate Alice Munro. Academic publishers like Routledge and Oxford University Press make substantial contributions, and a staggering 1,141 Lonely Planet guides enrich the dataset.

Alex Reisner says the amount of text by each author shapes how an AI-training model behaves; William Shakespeare tops the list at 124 million. Credit: Suswati Basu / The Atlantic / Alex Reisner

Read: AI open letter: authors including Margaret Atwood urge companies to honour copyright

The predicament confronting authors in the realm of generative AI may transcend the scope of copyright law, underscoring the secretive and non-consensual nature of AI training practices. How these programmes are developed remains a mystery to most, even as these endeavours pose existential threats. Within “Books3,” books are stored as colossal, unattributed text blocks, necessitating meticulous ISBN extraction and database searches to unveil their authors and titles – a task that Reisner accomplished for around 183,000 of the dataset’s roughly 191,000 titles.
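Reisner’s own pipeline is not published here, but the general approach described above – pulling ISBN-like strings out of raw text blocks and looking them up against a bibliographic database – can be sketched roughly as follows. The regular expression and the lookup_isbn placeholder are illustrative assumptions, not his actual code.

```python
import re

# Loose pattern for ISBN-10/ISBN-13 strings embedded in raw text (illustrative;
# a real pipeline would also validate check digits and lengths).
ISBN_PATTERN = re.compile(r"\b(?:97[89][- ]?)?\d{1,5}[- ]?\d+[- ]?\d+[- ]?[\dX]\b")

def extract_isbns(text_block: str) -> list[str]:
    """Pull candidate ISBNs from an unattributed block of book text."""
    return [m.group().replace("-", "").replace(" ", "")
            for m in ISBN_PATTERN.finditer(text_block)]

def lookup_isbn(isbn: str) -> dict:
    """Hypothetical stand-in for querying a bibliographic catalogue
    to recover the author and title behind an ISBN."""
    raise NotImplementedError("Replace with a real catalogue lookup.")

sample = "...ISBN 978-0-14-312774-1 Printed in the United States..."
print(extract_isbns(sample))  # e.g. ['9780143127741']
```

Matching the recovered ISBNs against a catalogue is what allows anonymous text blocks to be turned into the author-by-author tallies discussed above.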

“It is terrifying to think that my novel, which tackles imperialism and racism and homophobia, could be used by AI systems completely untethered from ethical concerns. Not just terrifying, sickening. My words in their mouth. Just vile.”

Damian Barr, author of You Will Be Safe Here

Upon the release of the database on The Atlantic, author Damian Barr found that his work had indeed been used, adding “This is theft, pure and simple.”

Transparency and ethical concerns: the enigma of AI training data

The issue of bias in AI systems has been extensively documented, with disproportionate training data capable of producing detrimental outcomes in areas such as facial recognition. “Books3,” however, offers a distinctive vantage point: what mix of books would constitute an impartial dataset? The distribution of viewpoints within historical narratives presents additional complexities. When algorithms, rather than human perspective, curate and sift knowledge through the filter of our own biases, it becomes difficult to discern what is actually true.

Read: Stephen Fry raises alarm over AI identity theft using Harry Potter at CogX

As AI chatbots progressively supplant traditional search engines, the tech industry’s capacity to mould information access and manipulate perspectives escalates exponentially. While the internet democratised access to information by eradicating geographical constraints, AI chatbots reintroduce gatekeepers – inscrutable, unaccountable entities susceptible to “hallucinations” and potentially lacking proper source citation.

In a recent legal manoeuvre, Meta sought to downplay the significance of “Books3” within LLaMA’s training data, emphasising its diminutive proportion. However, the core concern remains unaddressed – to what extent does LLaMA depend on these texts to generate summaries and responses? Current algorithms enshroud this process in obscurity.

Read: Supergroup of authors including George R.R. Martin sue OpenAI

Though “Books3” offers an initial glimpse into generative-AI training data, it represents but a fraction of that enigmatic universe. The majority remains concealed behind securely locked doors, beckoning further scrutiny and raising disconcerting questions about the trajectory of AI’s impact on our world.
