Web archiving

Web archiving refers to the collection and permanent preservation of online publications, undertaken so that the public and researchers can look back at the web of the past. The result of the process is called a web archive.

The largest international facility for web archiving is the Internet Archive in San Francisco (USA), which sees itself as the archive of the entire World Wide Web. State archives and libraries in many countries are working to preserve the online records originating in their area.

From 1987 onwards, the German archive laws defined the archiving of digital documents as a mandatory task of the state archives, but the implementation of this mandate is only just beginning. In 2006 the DNBG (Law on the German National Library) was passed, which extends the mandate of the German National Library to include the archiving of websites. The federal states plan to amend their legal deposit laws accordingly, or have already done so.

Archiving Targets

The aim of web archiving is to systematically map a defined section of the web presences available on the Internet. For this purpose, an overarching collection policy, a selection process and the frequency of archiving must be clarified in advance.

An archived website should be preserved in the long term with all of its multimedia functions (HTML code, stylesheets, JavaScript, images and video). Metadata such as provenance, time of capture, MIME type and size of the data are used for the later description, use and preservation of the material. The metadata ensure the authenticity and integrity of the digital archive material.
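In practice such captures, together with their metadata, are commonly stored in the WARC container format mentioned in the references. As a minimal sketch, assuming the third-party Python library warcio and a hypothetical file name, the following reads the descriptive metadata that accompany each captured response:

    from warcio.archiveiterator import ArchiveIterator

    # Iterate over the records of a (hypothetical) WARC file and print the
    # descriptive metadata that accompany each captured response.
    with open('example.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                print(
                    record.rec_headers.get_header('WARC-Target-URI'),   # provenance
                    record.rec_headers.get_header('WARC-Date'),         # time of capture
                    record.http_headers.get_header('Content-Type'),     # MIME type
                    record.rec_headers.get_header('Content-Length'),    # size of the record
                )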

After ingest, technical and legal measures must be taken to guarantee permanent public accessibility and to prevent subsequent changes to the archived material.

Terminology

Original resource
An original resource is a resource that exists, or used to exist, on the live web and for which access to an earlier state is required.
Memento
A memento of an original resource is a resource that encapsulates the state of that original resource at a specific point in time.
TimeGate
A TimeGate is a resource that, for a given date and time, selects the memento that most closely matches that datetime (see the sketch after this list).
TimeMap
A TimeMap is a resource that returns a list of all mementos that have ever been created for the original resource.
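A minimal sketch of how a client uses a TimeGate, following the HTTP framework specified in RFC 7089 (cited below); Python's requests library is used, and the Wayback Machine TimeGate URL is an assumption:

    import requests

    # Ask a Memento TimeGate for the memento of http://example.com/ that is
    # closest to the requested datetime. The TimeGate URL of the Internet
    # Archive's Wayback Machine is assumed here.
    timegate = 'https://web.archive.org/web/http://example.com/'
    response = requests.get(
        timegate,
        headers={'Accept-Datetime': 'Thu, 31 May 2007 20:35:00 GMT'},
        allow_redirects=False,
    )

    # A TimeGate typically answers with a redirect to the selected memento
    # and advertises the associated TimeMap in its Link header.
    print(response.status_code)
    print(response.headers.get('Location'))
    print(response.links.get('timemap', {}).get('url'))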

Selection process

Unspecific
In this selection process, an entire domain is gradually written into an archive. Because of the large storage requirement, the procedure only works for smaller domains (netarkivet.dk).
Pick list
A list of institutions is determined in advance. The stability of the URLs associated with these institutions must be checked regularly (a sketch of such a check follows this list).
Use of access statistics
In the future, “intelligent” harvesting is conceivable which, based on access counts, archives those parts of the web (or a selection) that have particularly high access rates.
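As an illustration of the pick-list approach, the following sketch checks whether previously selected seed URLs are still reachable and still resolve to the same address; the seed list and the requests-based check are assumptions, not part of any particular archiving system:

    import requests

    # Hypothetical pick list of institutional seed URLs whose stability
    # should be checked regularly before each crawl.
    SEEDS = [
        'https://www.example-library.org/',
        'https://www.example-archive.org/',
    ]

    def check_seeds(seeds):
        """Report seeds that no longer respond or that now redirect elsewhere."""
        for url in seeds:
            try:
                response = requests.head(url, allow_redirects=True, timeout=10)
            except requests.RequestException as exc:
                print(f'{url} unreachable: {exc}')
                continue
            if response.status_code >= 400:
                print(f'{url} -> HTTP {response.status_code}')
            elif response.url.rstrip('/') != url.rstrip('/'):
                print(f'{url} now redirects to {response.url}')

    check_seeds(SEEDS)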

Acquisition methods

Remote harvesting

The most common archiving method is the use of a web crawler. A web crawler retrieves the content of a website in the same way a human user would and writes the results to an archive object. More precisely, this means recursively searching websites by following the links found on them, starting from a seed, which can be either a single website or a list of websites to be searched. Because of quantitative limits such as crawl duration or storage space, various restrictions (termination conditions) on depth, domain and the types of files to be archived are possible.
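A minimal sketch of such a crawler in Python, assuming the requests library; the function and parameter names are illustrative and do not correspond to any of the tools listed below. It follows links breadth-first and enforces termination conditions on depth, domain and file type:

    import urllib.parse
    from collections import deque
    from html.parser import HTMLParser

    import requests

    class LinkExtractor(HTMLParser):
        """Collect the href targets of all <a> tags on a page."""

        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href' and value:
                        self.links.append(value)

    def crawl(start_url, max_depth=2, allowed_domain=None):
        """Breadth-first crawl with termination conditions on depth, domain and file type."""
        seen = {start_url}
        frontier = deque([(start_url, 0)])    # FIFO frontier of (url, depth) pairs
        archive = {}                          # url -> captured HTML
        while frontier:
            url, depth = frontier.popleft()
            try:
                response = requests.get(url, timeout=10)
            except requests.RequestException:
                continue
            # Restriction on file types: only HTML responses are archived here.
            if 'text/html' not in response.headers.get('Content-Type', ''):
                continue
            archive[url] = response.text
            if depth >= max_depth:            # termination condition: depth
                continue
            parser = LinkExtractor()
            parser.feed(response.text)
            for link in parser.links:
                absolute = urllib.parse.urljoin(url, link)
                host = urllib.parse.urlparse(absolute).netloc
                # Termination condition: stay within the chosen domain.
                if allowed_domain and not host.endswith(allowed_domain):
                    continue
                if absolute not in seen:
                    seen.add(absolute)
                    frontier.append((absolute, depth + 1))
        return archive

    # Example: archive example.org two link levels deep.
    pages = crawl('https://example.org/', max_depth=2, allowed_domain='example.org')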

In larger projects, the evaluation of websites for URL ranking is of particular importance. In the course of a crawl, a large number of web addresses accumulate; they are processed either as a list in FIFO order or as a priority queue. In the latter case the websites can be pictured as a heap structure: each website forms its own heap, and every link to another website found on it forms a sub-heap that constitutes an element in the heap of the preceding website. This also has the advantage that, if the URL list overflows, the entries with the lowest priority are replaced by new ones first.
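A sketch of such a priority-ordered URL frontier, here built on a simple binary heap with an illustrative overflow policy; real crawlers such as Heritrix use considerably more elaborate frontier implementations:

    import heapq
    import itertools

    class PriorityFrontier:
        """URL frontier ordered by priority instead of FIFO.

        Lower numbers mean higher priority; a counter keeps insertion order
        stable for equal priorities. If the frontier overflows, the entry
        with the lowest priority is dropped first.
        """

        def __init__(self, max_size=100000):
            self.heap = []                    # entries are (priority, counter, url)
            self.counter = itertools.count()
            self.max_size = max_size

        def push(self, url, priority):
            heapq.heappush(self.heap, (priority, next(self.counter), url))
            if len(self.heap) > self.max_size:
                # Drop the numerically largest tuple, i.e. the worst priority.
                self.heap.remove(max(self.heap))
                heapq.heapify(self.heap)

        def pop(self):
            priority, _, url = heapq.heappop(self.heap)
            return url

    frontier = PriorityFrontier(max_size=3)
    frontier.push('https://example.org/', priority=1)
    frontier.push('https://example.org/news/', priority=2)
    print(frontier.pop())    # the highest-priority URL comes out first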

However, the original structure on the server can rarely be reproduced exactly in the archive. To rule out technical problems before mirroring, it is advisable to analyse the website in advance. Although this doubles the data traffic in most cases, it considerably shortens the time needed when errors occur.

Examples of web crawlers are:

  • Heritrix
  • HTTrack
  • Offline Explorer

Archiving the Hidden Web

The hidden web or deep web refers to databases that often hold the actual content of a website and are delivered only in response to a user's query. As a result, the web is constantly changing and appears to be of almost infinite size. An interface, usually based on XML, is required to take over these databases. The tools DeepArc (Bibliothèque nationale de France) and Xinq (National Library of Australia) were developed for this kind of access.
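DeepArc and Xinq themselves are not shown here; as a generic, hypothetical illustration of such an XML export interface, the following sketch serializes the rows of a database (table and column names invented) into XML records that a harvester could then collect:

    import sqlite3
    import xml.etree.ElementTree as ET

    # Hypothetical database behind a website whose content is delivered
    # only in response to user queries (table and column names invented).
    connection = sqlite3.connect('catalogue.db')
    rows = connection.execute('SELECT id, title, author FROM publications')

    # Serialize every row as an XML record so that the otherwise hidden
    # content can be handed over to a web archive in a single pass.
    root = ET.Element('records')
    for record_id, title, author in rows:
        record = ET.SubElement(root, 'record', id=str(record_id))
        ET.SubElement(record, 'title').text = title
        ET.SubElement(record, 'author').text = author

    ET.ElementTree(root).write('export.xml', encoding='utf-8', xml_declaration=True)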

Transactional archiving

This procedure is used to archive the results of a website's actual use. It is relevant for institutions that, for legal reasons, must be able to prove how their site was used. The prerequisite is that an additional program is installed on the web server; a sketch of such a component follows.
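A minimal sketch of such a server-side component, written here as a Python WSGI middleware that appends every delivered response to a log file; the log format and file name are illustrative, and a production system would more likely write a standardized container format such as WARC:

    import datetime

    class TransactionalArchiver:
        """WSGI middleware that records every response delivered by the server.

        Each transaction is appended to a log file together with the requested
        path and a timestamp, so that the operator can later prove what a user
        actually received.
        """

        def __init__(self, app, log_path='transactions.log'):
            self.app = app
            self.log_path = log_path

        def __call__(self, environ, start_response):
            body = b''.join(self.app(environ, start_response))
            timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
            path = environ.get('PATH_INFO', '/')
            with open(self.log_path, 'ab') as log:
                log.write(f'{timestamp} {path} {len(body)} bytes\n'.encode())
                log.write(body + b'\n')
            return [body]

    # Usage: wrap an existing WSGI application,
    # e.g. application = TransactionalArchiver(application)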

Web archiving in Germany

At the federal level, the German National Library (DNB) has had the statutory mandate for web archiving since 2006. Since 2012, websites have been archived thematically and on the occasion of certain events, i.e. selectively and not in full; for this the DNB works together with an external service provider. In addition, all .de domains were crawled once in 2014. The web archive is accessed mainly in the reading rooms.

In addition to the web archiving of the DNB, there are initiatives in various federal states, for example in Baden-Württemberg, Bavaria and Rhineland-Palatinate.

There are also other web archiving initiatives in Germany, for example by party-affiliated foundations, by SWR, by Deutsche Post, and by the biotechnology/pharmaceutical company AbbVie.

References

  1. Steffen Fritz: Rewriting History with WARC files (PDF). January 2016, archived from the original on November 9, 2017; accessed on November 9, 2017 (English).
  2. RFC 7089: HTTP Framework for Time-Based Access to Resource States – Memento.
  3. Memento Guide: Introduction. Retrieved October 5, 2018.
  4. Steffen Fritz: Practice report: Procedure for evaluating the archivability of web objects. In: ABI Technik, No. 2, 2015, pp. 117–120. doi:10.1515/abitech-2015-0015.
  5. Tobias Steinke: Archiving the German Internet? Between a selective approach and .de domain crawl. German National Library, June 26, 2014 (dnb.de [PDF]).
  6. Felix Geisler, Wiebke Dannehl, Christian Keitel, Stefan Wolf: On the status of web archiving in Baden-Württemberg. In: Bibliotheksdienst, Vol. 51, No. 6, June 1, 2017, ISSN 2194-9646, pp. 481–489, doi:10.1515/bd-2017-0051 (degruyter.com [accessed March 24, 2020]).
  7. Tobias Beinert: Web archiving at the Bavarian State Library. In: Bibliotheksdienst, Vol. 51, No. 6, June 1, 2017, ISSN 2194-9646, pp. 490–499, doi:10.1515/bd-2017-0052 (degruyter.com [accessed March 24, 2020]).
  8. Workflow web archiving in the long-term archiving of the Bayerische Staatsbibliothek | BABS. Retrieved March 24, 2020.
  9. Edoweb: Rhineland-Palatinate archive server for electronic documents and websites. Retrieved March 24, 2020.