How Anthropic turned physical books into data to train Claude

Anna NoxCorp

21 hours ago

The race to build more capable artificial intelligence models is also a race to access better data. In Anthropic’s case, that search reportedly took an unusually physical form: buying printed books, dismantling them, scanning them page by page, and turning them into training material for Claude.

Court documents disclosed in a class-action lawsuit against the company detail the so-called Project Panama, an operation conceived in 2024 to feed its models with higher-quality texts than much of what is available on the internet. The premise was simple but controversial: if books remain one of the most refined forms of human writing, they are also a valuable source for teaching AI to write better.

The issue was not only technical. It was legal, reputational, and cultural. Anthropic was not simply collecting digital data available online. According to the cited documents, the company reportedly bought used books, cut them with hydraulic machinery, and then scanned them with professional high-speed equipment. Afterward, the dismantled physical copies were sent for recycling.

The image summarizes a central tension in generative artificial intelligence: to produce synthetic language at scale, models need to absorb enormous amounts of human language. But when that language comes from copyrighted works, innovation begins to move into a zone of conflict.

WHAT PROJECT PANAMA WAS

Project Panama was, according to the revealed documents, an operation designed to convert physical books into training data. Anthropic reportedly relied mainly on the second-hand market to acquire large volumes of books, with the goal of building a high-quality literary and editorial corpus for Claude.

The company reportedly began by buying books from The Strand, a historic bookstore in New York, and later turned to specialized retailers such as Better World Books in the United States and World of Books in the United Kingdom. The Washington Post estimated that between 500,000 and 2 million books were purchased over a period of approximately six months.
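
To put that estimate in perspective, a back-of-the-envelope calculation gives the implied daily volume. The sketch below is illustrative only: it assumes that "approximately six months" means 182 days and that books were handled at a steady pace, neither of which is stated in the reporting.

```python
# Rough arithmetic on the reported estimate (500,000 to 2 million books
# over about six months). DAYS = 182 and a constant pace are assumptions.
LOW_ESTIMATE = 500_000
HIGH_ESTIMATE = 2_000_000
DAYS = 182  # assumed length of "approximately six months"


def books_per_day(total_books: int, days: int = DAYS) -> float:
    """Average number of books handled per day at a constant pace."""
    return total_books / days


low_rate = books_per_day(LOW_ESTIMATE)
high_rate = books_per_day(HIGH_ESTIMATE)
print(f"{low_rate:,.0f} to {high_rate:,.0f} books per day")
```

Under those assumptions, the estimate works out to roughly 2,700 to 11,000 books per day, which helps explain why the operation reportedly relied on industrial cutting and high-speed scanning equipment.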

The logic behind the project differed from that of datasets scraped from the internet. Instead of depending on disorganized, repetitive, or low-quality web content, Anthropic sought edited, published, and culturally validated texts. In other words: books written, reviewed, and distributed within a traditional publishing industry.

The paradox is clear. To train a technology presented as part of the future of knowledge, the company reportedly used a reverse assembly line: destroying physical cultural objects to extract digital information from them.

A STRATEGY TO IMPROVE MODEL QUALITY

The decision to use books is not surprising from a technical perspective. Language models learn patterns of structure, argumentation, style, vocabulary, and coherence from the texts they process. A book usually offers narrative, conceptual, and editorial density that is difficult to find in short or fragmented internet posts.

For a model like Claude, that can be especially valuable. An AI trained on better texts can produce more organized responses, sustain complex ideas for longer, and replicate more sophisticated writing styles. The goal would not be only to respond faster, but to respond with greater clarity, depth, and consistency.

However, data quality does not remove the underlying question: who has the right to turn a human work into training material for a commercial artificial intelligence system?

THE LEGAL CONFLICT BEHIND THE BOOKS

The case against Anthropic was not focused only on physical books. The documents also revealed that the company reportedly used materials from pirated digital libraries. According to the information released, in 2021 Ben Mann, co-founder of Anthropic, downloaded millions of books from LibGen, a well-known unauthorized library. The following year, he also reportedly praised Pirate Library Mirror, a site that openly acknowledged infringing copyright laws in several countries.

This difference is key. The purchase of used physical books opened a possible legal defense based on the first-sale doctrine, a principle that allows someone who buys a copy to dispose of it without requiring additional permission from the rights holder. But that logic does not apply in the same way to the massive download of pirated books.

According to the published details, the use of purchased and destroyed books was considered legal in the context of the case, while the use of pirated books did not receive the same protection. The lawsuit ended with an out-of-court settlement of approximately $1.5 billion.

The figure matters not only because of its size, but because of the message it sends to the sector. AI companies may argue that they need large volumes of data to innovate, but that argument does not erase legal boundaries or tensions with authors, publishers, and rights holders.

WHY THE CASE MATTERS FOR THE ENTIRE AI INDUSTRY

Anthropic is not the only company facing questions about copyright and model training. The generative AI industry grew around a complex premise: the broader and more diverse the dataset, the more capable the system becomes. But many of those data sources come from content created by people who never gave explicit consent for their work to train commercial products.

For years, much of the public debate focused on the internet: web pages, repositories, forums, articles, digitized books, and open or semi-open databases. Project Panama adds another layer: even when content is not downloaded from an illegal source, turning it into training data can still raise ethical and economic questions.

The operation also shows that AI companies are willing to invest significant sums to obtain better-quality data. This points toward a stage in which access to reliable, authorized, and specialized corpora may become as important a competitive advantage as computing power or technical talent.

QUALITY DATA BECOMES INFRASTRUCTURE

In the early years of generative AI, public discussion focused on model size and chip power. But the Anthropic case is a reminder that data quality remains a fundamental piece of the equation. A model does not learn in a vacuum: it learns from texts, images, conversations, documents, and records produced by entire societies.

That is why the debate is no longer simply whether an AI can write well. The more important question is what materials were used to make that possible, under what permissions, with what compensation, and with what level of transparency.

Project Panama exposes an uncomfortable reality for the sector: companies need human content to train systems capable of competing with humans in cognitive tasks. That dependence forces a much more serious discussion about licensing models, traceability, compensation, and data governance.

A NEW FRONTIER FOR COPYRIGHT

Copyright was designed to protect works in a world where copying and distribution had physical, commercial, and logistical costs. Generative AI changes that balance. Now, a work may not be directly reproduced, but it can be absorbed by a system that learns patterns from it and then generates new content.

That difference sits at the center of the conflict. Technology companies often argue that model training constitutes transformative use. Authors and publishers, on the other hand, argue that their works are being used to create products that may compete with the original creative labor.

The Anthropic case does not fully resolve that tension, but it does serve as a warning: AI training can no longer be treated as an invisible operation. The origin of data is becoming a public, legal, and strategic matter.

For companies in the sector, this marks the start of a new era. Competitive advantage will depend not only on having the most advanced model, but also on building a defensible data chain. In an increasingly regulated market, the legitimacy of training may become as important as model performance.

THE REPUTATIONAL COST OF TRAINING ON HUMAN WORKS

Project Panama also raises a reputational question. Anthropic has publicly positioned itself as a company focused on safety, alignment, and responsible AI development. For that reason, the revelation of a secret operation to destroy books and turn them into data may be especially sensitive.

From a business perspective, the decision can be understood as a way to obtain higher-quality data without relying exclusively on unauthorized digital sources. From a cultural perspective, however, the image of hundreds of thousands, and possibly millions, of books being cut and scanned to train an AI is difficult to separate from a broader question: what material and symbolic value is assigned to human creation in the model economy?

This is not about romanticizing paper or rejecting technological progress. Book digitization has existed for decades and has made it possible to preserve, search, and distribute knowledge in ways that were previously impossible. The difference lies in the destination: here, the texts are not digitized for human readers, but to train commercial systems capable of producing language at scale.

That change in purpose is what makes the case so relevant. Generative AI does not only consume culture; it also competes within the same cultural ecosystem that feeds it. Without clear rules, innovation risks advancing on top of accumulated legal conflicts.

WHAT MAY CHANGE AFTER THIS CASE

The Anthropic case may accelerate a discussion the industry can no longer avoid. AI companies will need to show more clearly what data they use, how they obtain it, and under which legal or contractual criteria they incorporate it into their models.

This could create room for new agreements with publishers, authors, libraries, universities, and media organizations. It could also push forward compensation models for the use of protected works, dataset audits, and stronger traceability systems.
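
As a purely illustrative sketch of what a dataset audit might check, the example below records how each work was obtained and flags entries with no documented source or legal basis. Every name in it (`ProvenanceRecord`, `Acquisition`, `audit`, the sample records) is hypothetical and does not describe any company's actual system.

```python
from dataclasses import dataclass
from enum import Enum


class Acquisition(Enum):
    """Hypothetical categories for how a work entered the dataset."""
    PURCHASED_PRINT = "purchased_print"
    LICENSED = "licensed"
    PUBLIC_DOMAIN = "public_domain"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class ProvenanceRecord:
    work_id: str              # hypothetical internal identifier
    title: str
    acquisition: Acquisition
    legal_basis: str          # free-text reference; empty if undocumented


def audit(records):
    """Return records that lack a documented source or legal basis."""
    return [
        r for r in records
        if r.acquisition is Acquisition.UNKNOWN or not r.legal_basis.strip()
    ]


# Fictitious sample data, for illustration only.
records = [
    ProvenanceRecord("w1", "Example Title A", Acquisition.PURCHASED_PRINT,
                     "purchase receipt (placeholder reference)"),
    ProvenanceRecord("w2", "Example Title B", Acquisition.UNKNOWN, ""),
    ProvenanceRecord("w3", "Example Title C", Acquisition.LICENSED, ""),
]

flagged = audit(records)
print([r.work_id for r in flagged])  # ['w2', 'w3']
```

The design choice here is deliberately simple: a record is flagged when either its origin or its legal basis is missing, so gaps in documentation surface before a dataset is used rather than after a dispute arises.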

For model developers, the question will no longer be only how to get more data, but how to obtain data that can withstand scrutiny from regulators, courts, users, and business partners. In a market where trust is part of the product, opacity becomes a cost.

For creators, the case reinforces the need to debate fairer conditions. AI can expand productivity, accelerate tasks, and open new forms of creation, but its development depends on a knowledge base built by people. Ignoring that relationship weakens the legitimacy of the entire ecosystem.

NOXCORP’S VISION

The Anthropic case shows that artificial intelligence does not advance only through technical capability. It also advances through decisions about data, permissions, incentives, and trust.

For companies working with AI, the challenge is not only to create more efficient models. It is to build systems that can explain where they learn from, what limits they respect, and how they integrate responsibly into human work.

Automation needs data. But it also needs legitimacy.

The future of AI should not depend on extracting value from human production without conversation, rules, or transparency. It should rely on models where technology amplifies capabilities, respects rights, and enables new forms of collaboration between people and intelligent systems.

ABOUT NOXCORP

NoxCorp is a company focused on artificial intelligence systems that optimize human work and coordinate collaboration between AI agents and people, relying on humans for tasks that AI still cannot fully perform.

By Anna NoxCorp

Twitter: @NoxCorpIA

LinkedIn: Nox Corp IA