Thursday, January 05, 2017

History is written and revised by the winners -- can the Internet Archive change that?

Kremvax during the Soviet coup attempt
I was naively optimistic in the early days of the Internet, assuming that it would enhance democracy while providing "big data" for historians. My first taste of that came during the Soviet coup attempt of 1991, when I worked with colleagues to create an archive of the network traffic into, out of and within the Soviet Union. That traffic flowed through a computer called "Kremvax," operated by RELCOM, a Russian software company.

The content of that archive was not generated by the government or the establishment media -- it was citizen journalism, the collective work of independent observers and participants stored on a server at a university. What could go wrong with that?

Mumbai terrorist attack
The advent of the Web and Wikipedia fed my optimism. For example, when terrorists attacked various locations in Mumbai, India in 2008, citizen journalists inside and outside the hotels that were under attack began posting accounts. The Wikipedia topic began with two sentences:
The 28 November 2008 Mumbai terrorist attacks were a series of attacks by terrorists in Mumbai, India. 25 are injured and 2 killed.
In less than 22 hours, 242 people had edited the page 942 times, expanding it to 4,780 words organized into six major headings with five subheadings. (Today it is over 130,000 bytes, revisions continue and it is still viewed over 2,000 times per month.) What could go wrong with that?

The Arab Spring
The 2011 Arab Spring was also seen as a demonstration of the power of the Internet as a democratic tool and repository of history. What could go wrong with that?

What went wrong

The problem is that the Internet turned out to be a tool of governments and terrorists as well as citizens. Furthermore, historical archives can disappear or, worse yet, be changed to reflect the view of the "winner."

Our Soviet Coup archive was set up on a server at the State University of New York, Oswego, by professor Dave Bozack. What will happen to it when he retires?

If someone tried to delete or significantly alter the Wikipedia page on the Mumbai attack, they might be thwarted by one of the volunteers who have signed up as "page watchers" -- people who are notified whenever a page they are watching is edited. We saw a reassuring demonstration of the rapid correction of vandalism in a podcast by Jon Udell. That was cool, but does it scale? Volunteers burn out. The page on the Mumbai attacks has 358 page watchers, but only 32 have visited the page after recent edits.

Even if a Wikipedia page remains intact, links to references and supporting material will eventually break -- "link rot." If our Soviet Coup archive disappears after Dave's retirement, all the links to it will break.

By the time of the Arab Spring, we were well aware of our earlier naivete -- the Internet was already being used for terrorism and government cyberwar, and the dream of providing raw data for future historians and political scientists was fading.

The Internet Archive

Soviet coup archive from Internet Archive
I was slow to understand the fragility of the Internet, but others saw it early -- most importantly, Brewster Kahle, who, in 1996, established the Internet Archive to cache Web pages and preserve them against deletion or modification. They have been at it for 20 years now and have a massive online repository of books, music, software, educational material, and, of course, Web sites, including our Soviet Coup archive. As shown here, it has been archived 50 times since October 3, 2002 and it will be online long after Dave retires -- as long as the Internet Archive is online.
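
For readers who want to check whether a page threatened by link rot already has a copy in the Internet Archive, here is a minimal Python sketch. It assumes the Archive's public Wayback Machine availability endpoint and its JSON reply format; the function names are my own, purely for illustration, and this is not how the Archive itself does its work.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public Wayback Machine availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url, timestamp=None):
    """Build the query URL; timestamp is an optional YYYYMMDD string."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the archived URL from a JSON reply, or None if there is none."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_archived_copy(page_url, timestamp=None):
    """Ask the Archive (over the network) for the closest saved copy."""
    with urlopen(availability_query(page_url, timestamp), timeout=10) as reply:
        return closest_snapshot(reply.read().decode("utf-8"))
```

Calling find_archived_copy with the address of a broken link returns the address of the nearest archived snapshot, or None if the page was never captured.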

Kahle understands that saving static Web sites like the Soviet Coup archive only captures part of what is happening online today. Since the late 1990s, we have been able to add programs to Web sites, turning them into interactive services. Accordingly, he has recently begun archiving virtual machine versions of interactive government services and databases.

Kahle is understandably concerned by the election of Donald Trump, who has demonstrated a keen ability to exploit the Internet and a disregard for truth. In response, he is raising money to create a backup copy of the Internet Archive in Canada and working to archive US Government Web sites and services.

The Internet is inconceivably large and growing exponentially. There is no way the Internet Archive can capture all of it, but it is the leading Internet-preservation organization today. Kahle and his staff will continue their work and will inspire and collaborate with other, more specialized efforts like that of climate scientists who are working to preserve government climate-science research results, data and services.

For more on the Internet Archive check out the following PBS News Hour segment (9m 12s):


You can read the transcript here.

I'd also recommend listening to this short (5m 14s) podcast interview of Brewster Kahle. He describes the End of Term project -- a collaborative effort to record US government (.gov and .mil) Web sites and services when a new administration takes over. He describes deletions and modifications from 2008 and 2012 and feels a special urgency today for obvious reasons.

You can read a transcript of the interview here.

-----
Update 1/6/2017

The Internet Archive has launched the Trump Archive with 700+ televised speeches, interviews, debates, and other news broadcasts. Mention by a fact-checking site was the "signal" used for inclusion of a video, and links to the fact-check documents are included in a companion spreadsheet. I hope they use speech recognition to produce searchable transcripts as well.

Too bad we did not have Trump and Clinton archives during the campaign -- I hope we will have similar, timely archives in the future. One can even imagine similar archives for state and local campaigns if a crowd-sourcing system were developed.


-----
Update 1/7/2017

There is an annotated PowerPoint presentation on citizen journalism here. I use it in teaching an Internet literacy class and there is a note on my PowerPoint presentation style here.