Following on from last week’s introductory blog on web archiving, this post takes a broader look at the technical and collecting environment of web archiving, as well as a brief look at its history.

The World Wide Web was pioneered in the late 1980s to help share information more efficiently and effectively. Needless to say, this new system proved to be a hit, leading to its global rollout in the early 1990s. It didn’t take long for observers to recognise that there was probably a lot of content on the Web worth saving for posterity (particularly given its vulnerability to change), but how?

View of the first website, available at http://info.cern.ch/hypertext/WWW/TheProject.html

Today, online content is perhaps even more susceptible to change and loss: a typical webpage has an average ‘lifespan’ of just 44 to 75 days. Without action to capture content before it changes, we may end up with large gaps in history as recorded by the Web, gaps that some researchers are already grappling with. If we want to understand how our modern society ticks, we need a way to capture and save as much of this content as possible. To do this, we need web archives.

But what exactly do we mean by ‘web archiving’?

Web archiving can be defined as the process of capturing content that has been made available via the World Wide Web, preserving that content in a web archive, and making it accessible to users.

The most scalable way to do web archiving is with web crawler software. Crawlers are instructed to visit a selected website, or ‘seed’, on a certain date, and to explore this seed via its hyperlinks, copying content as they go. The copied content is termed a ‘snapshot’. Each snapshot is usually quality assured and then preserved within WARC (Web ARChive) files, a format standardised as ISO 28500:2009. A WARC file constitutes an archival record of the snapshot captured at that point in time, and has the major advantage of enabling archivists to package together multiple related files from a website snapshot and preserve them long-term as a single entity.
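To make the crawl-and-store idea concrete, here is a minimal sketch in Python using the open-source requests and warcio libraries (both would need to be installed; the seed URL is purely illustrative). It fetches a single seed page and writes the HTTP response into a gzipped WARC file. It is a toy rather than a real crawler: production tools such as Heritrix add link discovery, politeness rules, deduplication and much more.

```python
# A minimal sketch, not a production crawler: fetch one 'seed' URL and store
# the HTTP response as a record in a gzipped WARC file using warcio.
import requests
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

seed = 'http://example.com/'  # illustrative seed URL

with open('snapshot.warc.gz', 'wb') as output:
    writer = WARCWriter(output, gzip=True)

    # Fetch the seed page; 'identity' avoids storing a compressed payload
    resp = requests.get(seed, headers={'Accept-Encoding': 'identity'}, stream=True)

    # Preserve the HTTP response headers alongside the payload
    http_headers = StatusAndHeaders('200 OK', resp.raw.headers.items(),
                                    protocol='HTTP/1.0')

    # Write a WARC 'response' record, the archival unit described above
    record = writer.create_warc_record(seed, 'response',
                                       payload=resp.raw,
                                       http_headers=http_headers)
    writer.write_record(record)
```

A real crawl would then parse the captured page for further hyperlinks and repeat the process, building up the ‘snapshot’ described above.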

To ensure users are aware they are viewing archived content (and not a live site), captured pages are clearly identified by a banner and a rewritten URL.
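As a purely illustrative example of the convention many Wayback-style replay tools use, the rewritten URL typically prefixes the original address with the archive’s host and a capture timestamp. The helper and values below are hypothetical, not an actual NRS address:

```python
def replay_url(archive_host: str, timestamp: str, original_url: str) -> str:
    """Build a Wayback-style replay URL: <archive host>/<timestamp>/<original URL>."""
    return f"https://{archive_host}/{timestamp}/{original_url}"

# A hypothetical capture of the Audit Scotland homepage on 6 June 2017
print(replay_url("web.archive.org/web", "20170606000000",
                 "http://www.audit-scotland.gov.uk/"))
# https://web.archive.org/web/20170606000000/http://www.audit-scotland.gov.uk/
```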

View of the Audit Scotland website, as captured by NRS on 6th June 2017. Note the page banner and rewritten URL showing users they are in the web archive.

The Internet Archive and the National Library of Australia were the first organisations to archive web content in 1996, and the web archiving sector now reaches across the globe.

The Internet Archive, based in a former Christian Science Church, San Francisco. Taken from https://en.wikipedia.org/wiki/Internet_Archive#/media/File:Christian_science_church122908_02.jpg

As well as the core preservation argument, many libraries recognise the merit of archiving websites as part of their drive to record our cultural memory, whilst many archives regard official government websites as part and parcel of a nation’s public record.

This collecting pattern is illustrated in the UK, where the British Library and the other legal deposit libraries collect websites for their cultural value through the UK Web Archive, while The National Archives and NRS preserve official government websites as part of the public record.

Web archiving does have its fair share of technical challenges. For instance, dynamically generated content (e.g. webpages produced in response to a user search), JavaScript-driven features such as drop-down menus, and streamed media such as embedded YouTube videos are all notoriously challenging for a crawler to capture.
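To see why, consider the simple link extractor sketched below (a toy using the requests and BeautifulSoup libraries, both assumed to be installed). It can only discover links present in the HTML the server sends, so anything a browser would normally add afterwards via JavaScript, a search form or an embedded player is invisible to it, and therefore missing from the capture.

```python
# Toy illustration: a crawler that parses only the static HTML of a page
# will never see links or media that JavaScript injects after page load.
import requests
from bs4 import BeautifulSoup

html = requests.get("http://example.com/").text
soup = BeautifulSoup(html, "html.parser")

# Only anchors present in the raw HTML are found here.
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)
```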

This situation leads many web archives to devote effort to quality assuring content once it has been captured, and assessing whether any remedial actions may be possible. Crawling technology continues to be improved, often as part of international collaborations.

Given these challenges, it’s sobering to recognise that the perfect web archive, full of content that is complete and fully operational, simply does not exist. However, web archives remain pragmatic in the face of this, constantly re-evaluating methods, processes and strategies, never losing sight of their core goal: to preserve a representative, high-fidelity record of the Web.

In our next blog, we will explore how the NRS Web Continuity Service fits into all of this, what web continuity is, and how our work actively supports the Scottish Government’s commitment to openness and accountability.
