A.I. Is Coming for the Past, Too – Truth, Politics, and Democracy

New York Times, January 28, 2024

By Jacob N. Shapiro and Chris Mattmann

Mr. Shapiro is the managing director of the Empirical Studies of Conflict Project. Mr. Mattmann is the director of the Information Retrieval and Data Science Group at the University of Southern California.

We don’t have to imagine a world where deepfakes can so believably imitate the voices of politicians that they can be used to gin up scandals that could sway elections. It’s already here. Fortunately, there are numerous reasons for optimism about society’s ability to identify fake media and maintain a shared understanding of current events.

While we have reason to believe the future may be safe, we worry that the past is not.

History can be a powerful tool for manipulation and malfeasance. The same generative A.I. that can fake current events can also fake past ones. While new content may be secured through built-in systems, there is a world of content out there that has not been watermarked. (Watermarking adds imperceptible information to a digital file so that its provenance can be traced.) Once watermarking at creation becomes widespread and people learn to distrust content that is not watermarked, everything produced before that point can be much more easily called into question.

And this will create a treasure trove of opportunities for backstopping false claims with generated documents, from photos placing historical figures in compromising situations, to altering individual stories in historical newspapers, to changing names on deeds of title. While all of these techniques have been used before, countering them is much harder when the cost of creating near-perfect fakes has been radically reduced.

This forecast is based on history. There are many examples of how economic and political powers manipulated the historical record to their own ends. Stalin purged disloyal comrades from history by executing them and then altering photographic records to make it appear as if they never existed. Slovenia, on becoming an independent country in 1992, erased over 18,000 people from the registry of residents — mainly members of the Roma minority and other ethnic non-Slovenes. In many cases, the government destroyed their physical records, leading to their loss of homes, pensions and access to other services, according to a 2003 report by the Council of Europe Commissioner for Human Rights.

False documents are a key part of many efforts to rewrite the historical record. The infamous Protocols of the Elders of Zion, first published in a Russian newspaper in 1903, purported to be meeting minutes from a Jewish conspiracy to control the world. First discredited in August 1921 as a forgery plagiarized from multiple unrelated sources, the Protocols featured prominently in Nazi propaganda and have long been used to justify antisemitic violence, including a citation in Article 32 of Hamas’s 1988 founding covenant.

In 1924 the Zinoviev Letter, said to be a secret communiqué from the head of the Communist International in Moscow to the Communist Party of Great Britain to mobilize support for normalizing relations with the Soviet Union, was published by The Daily Mail four days before a general election. The resulting scandal may have cost Labour the election. The letter’s origin has never been proved, but its authenticity was questioned at the time, and an official investigation in the 1990s concluded that it was most likely the work of White Russians — a conservative political faction led at the time by Russian émigrés opposed to the Communist government.

Decades later Operation Infektion, a Soviet disinformation campaign, used forged documents to spread the idea that the United States had invented H.I.V., the virus that causes AIDS, as a biological weapon. And in 2004 CBS News withdrew a controversial story questioning the earlier service of George W. Bush, then the president, in the Texas Air National Guard, because it could not authenticate the underlying documents, which were later discredited as forgeries. As it becomes easier to generate historical disinformation and as the sheer volume of digital fakes explodes, it will become possible to reshape history, or at least to call our current understanding of it into question.

The prospects of political actors using generative A.I. to effectively reshape history — not to mention fraudsters creating spurious legal documents and transaction records — are frightening. Fortunately, a path forward has been laid by the same companies that created the risk.

In indexing a large share of the world’s digital media to train their models, the A.I. companies have effectively created systems and databases that will soon contain all of humankind’s digitally recorded content or at least a meaningful approximation of it. They could start work today to record watermarked versions of these primary documents, which include newspaper archives and a wide range of other sources, so that subsequent forgeries are instantly detectable.

Such work faces some barriers. Google’s digital libraries project, an effort to scan millions of the world’s library books and make them readily accessible online, ran into intellectual property limits, which left the archive unable to serve its intended purpose of making these texts searchable by anyone with an internet connection. Those same intellectual property concerns are causing creators and companies to fret about both the training data provided to generative A.I. and the implications of using it to generate content.

Given this fraught history, including Google’s failed investment in its digital libraries project, who will step up and pay for a similar massive effort that would create immutable versions of historical data? Both government and industry have strong incentives to do so, and many of the intellectual property concerns around providing a searchable online archive do not apply to creating watermarked and time-stamped versions of documents, because those versions need not be made publicly available to serve their purpose. One can compare a claimed document to the recorded archive by using a mathematical transformation of the document known as a hash, the same technique the Global Internet Forum to Counter Terrorism uses to help companies screen for known terrorist content.
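For readers curious about how such a comparison could work in practice, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any system named in this article: the `HashArchive` class, its method names and the sample newspaper text are all hypothetical, and SHA-256 is chosen only because it is a widely available hash function. Note that the archive stores only fingerprints and the time each was recorded, never the documents themselves, which is why the archive need not be public to be useful.

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional


def fingerprint(document: bytes) -> str:
    """Return the SHA-256 digest of a document as a hex string."""
    return hashlib.sha256(document).hexdigest()


class HashArchive:
    """Hypothetical archive that keeps fingerprints, not the documents."""

    def __init__(self) -> None:
        # Maps a fingerprint to the time it was first recorded.
        self._records: dict[str, datetime] = {}

    def register(self, document: bytes, recorded_at: datetime) -> str:
        """Record a document's fingerprint; keep the earliest timestamp."""
        digest = fingerprint(document)
        self._records.setdefault(digest, recorded_at)
        return digest

    def lookup(self, claimed: bytes) -> Optional[datetime]:
        """Return when a byte-identical document was recorded, or None."""
        return self._records.get(fingerprint(claimed))


# Illustrative data only: an invented headline and an altered copy of it.
archive = HashArchive()
original = b"Front page, 4 July 1950: City council approves new bridge."
forged = b"Front page, 4 July 1950: City council rejects new bridge."
archive.register(original, datetime(2024, 1, 28, tzinfo=timezone.utc))

print(archive.lookup(original) is not None)  # the recorded document matches
print(archive.lookup(forged) is not None)    # the altered copy does not
```

A cryptographic hash like this one matches only byte-for-byte identical files, so a rescan or a recompressed image of a genuine document would also fail the check. A working archive would probably need hashes that tolerate such harmless changes, which is a design question this sketch leaves open.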

Aside from creating an important public good and protecting citizens from the dangers posed by manipulation of historical narratives, creating verified records of historical documents can be valuable for the large A.I. companies. New research suggests that when A.I. models are trained on A.I.-generated data, their performance quickly degrades. Thus separating what is actually part of the historical record from newly created “facts” may be critical.

Preserving the past will also mean preserving the training data, the associated tools that operate on it and even the environment that the tools were run in. Vint Cerf, an early internet pioneer, has called this type of record “digital vellum,” and we need it to secure the information environment.

Such a vellum will be a powerful tool. It can help companies to build better models by enabling them to analyze what data to include to get the best content and help regulators to audit bias and harmful content in the models. Tech giants are already conducting similar efforts to record the new content their models are creating — in part because they need to train their models on human-generated text and the data produced after the adoption of large language models may be tainted with generated content.

The time has come to extend this effort back in time as well, before our politics, too, become severely distorted by generated history.