Following on from last week’s introductory blog on web archiving, this post takes a broader look at the technical and collecting environment of web archiving, as well as a brief look at its history.

The World Wide Web was pioneered in the late 1980s to help share information more efficiently and effectively. Needless to say, this new system proved to be a hit, leading to its global rollout in the early 1990s. It didn’t take long for observers to recognise that there was probably a lot of content on the Web worth saving for posterity (particularly given its vulnerability to change), but how?

View of the first website, available at http://info.cern.ch/hypertext/WWW/TheProject.html

Today, online content is perhaps even more susceptible to change and loss: a typical webpage has an average ‘lifespan’ of just 44 to 75 days. Without action to capture content before it changes, we may end up with large gaps in history as recorded by the Web, gaps that some researchers are already grappling with. If we want to understand how our modern society ticks, we need a way to capture and save as much of this content as possible. To do this, we need web archives.

But what exactly do we mean by ‘web archiving’?

Web archiving can be defined as the process of capturing content that has been made available via the World Wide Web, preserving that content in a web archive, and making it accessible to users.

The most scalable way to do web archiving is with web crawler software. Crawlers are instructed to visit a selected website, or ‘seed’, on a certain date, and to explore this seed via its hyperlinks, copying content as they go. The copied content is termed a ‘snapshot’. Each snapshot is usually quality assured and then preserved within WARC (Web ARChive) files, a format standardised as ISO 28500:2009. A WARC file constitutes an archival record of the snapshot captured at that point in time, and has the major advantage of enabling archivists to package together multiple related files from a website snapshot and preserve them long-term as a single entity.
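To make the crawl-and-store idea concrete, here is a minimal sketch in Python using the open-source requests and warcio libraries (both would need to be installed; the seed URL is purely illustrative). It fetches a single seed page and writes the HTTP response into a gzipped WARC file. It is a toy rather than a real crawler: production tools such as Heritrix add link discovery, politeness rules, deduplication and much more.

```python
# A minimal sketch, not a production crawler: fetch one 'seed' URL and store
# the HTTP response as a record in a gzipped WARC file using warcio.
import requests
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

seed = 'http://example.com/'  # illustrative seed URL

with open('snapshot.warc.gz', 'wb') as output:
    writer = WARCWriter(output, gzip=True)

    # Fetch the seed page; 'identity' avoids storing a compressed payload
    resp = requests.get(seed, headers={'Accept-Encoding': 'identity'}, stream=True)

    # Preserve the HTTP response headers alongside the payload
    http_headers = StatusAndHeaders('200 OK', resp.raw.headers.items(),
                                    protocol='HTTP/1.0')

    # Write a WARC 'response' record, the archival unit described above
    record = writer.create_warc_record(seed, 'response',
                                       payload=resp.raw,
                                       http_headers=http_headers)
    writer.write_record(record)
```

A real crawl would then parse the captured page for further hyperlinks and repeat the process, building up the ‘snapshot’ described above.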

To ensure users are aware they are viewing archived content (and not a live site), captured pages are clearly identified by a banner and a rewritten URL.
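As a purely illustrative example of the convention many Wayback-style replay tools use, the rewritten URL typically prefixes the original address with the archive’s host and a capture timestamp. The helper and values below are hypothetical, not an actual NRS address:

```python
def replay_url(archive_host: str, timestamp: str, original_url: str) -> str:
    """Build a Wayback-style replay URL: <archive host>/<timestamp>/<original URL>."""
    return f"https://{archive_host}/{timestamp}/{original_url}"

# A hypothetical capture of the Audit Scotland homepage on 6 June 2017
print(replay_url("web.archive.org/web", "20170606000000",
                 "http://www.audit-scotland.gov.uk/"))
# https://web.archive.org/web/20170606000000/http://www.audit-scotland.gov.uk/
```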

View of the Audit Scotland website, as captured by NRS on 6th June 2017. Note the page banner and rewritten URL showing users they are in the web archive.

The Internet Archive and the National Library of Australia were the first organisations to archive web content in 1996, and the web archiving sector now reaches across the globe.

The Internet Archive, based in a former Christian Science Church, San Francisco. Taken from https://en.wikipedia.org/wiki/Internet_Archive#/media/File:Christian_science_church122908_02.jpg

As well as the core preservation argument, many libraries recognise the merit of archiving websites as part of their drive to record our cultural memory, whilst many archives regard official government websites as part and parcel of a nation’s public record.

This collecting pattern is illustrated in the UK, where the British Library and the other legal deposit libraries collect websites for their cultural value through the UK Web Archive, while The National Archives and NRS preserve official government websites as part of the public record.

Web archiving does have its fair share of technical challenges. For instance, dynamically generated content (e.g. webpages produced in response to a user search), JavaScript-driven features such as drop-down menus, and streamed media such as embedded YouTube videos are all notoriously challenging for a crawler to capture.
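To see why, consider the simple link extractor sketched below (a toy using the requests and BeautifulSoup libraries, both assumed to be installed). It can only discover links present in the HTML the server sends, so anything a browser would normally add afterwards via JavaScript, a search form or an embedded player is invisible to it, and therefore missing from the capture.

```python
# Toy illustration: a crawler that parses only the static HTML of a page
# will never see links or media that JavaScript injects after page load.
import requests
from bs4 import BeautifulSoup

html = requests.get("http://example.com/").text
soup = BeautifulSoup(html, "html.parser")

# Only anchors present in the raw HTML are found here.
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)
```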

This situation leads many web archives to devote effort to quality assuring content once it has been captured, and assessing whether any remedial actions may be possible. Crawling technology continues to be improved, often as part of international collaborations.

Given these challenges, it’s sobering to recognise that the perfect web archive, full of content that is complete and fully operational, simply does not exist. However, web archives remain pragmatic in the face of this, constantly re-evaluating methods, processes and strategies, never losing sight of their core goal: to preserve a representative, high-fidelity record of the Web.

In our next blog, we will explore how the NRS Web Continuity Service fits into all of this, what web continuity is, and how our work actively supports the Scottish Government’s commitment to openness and accountability.
