Screen scraping

From Wikipedia, the free encyclopedia

The term screen scraping (literally "scraping data off the screen") broadly covers all techniques for reading text from computer screens. Today, however, the term is used almost exclusively in relation to web pages (hence web scraping or web harvesting). In this context, screen scraping refers specifically to the technologies used to obtain information from websites by extracting exactly the data that is required.

Areas of application

Search engines and web mining

Search engines use so-called crawlers to traverse the World Wide Web, analyze websites, and collect data such as web feeds or e-mail addresses. Screen scraping techniques are also used in web mining.

Replacement of web services

To make it easier for customers to retrieve and further process information from websites, the provider of the page content (the content provider) can offer the data not only as a (human-readable) website but also in a machine-readable format (e.g. XML). Specifically requested data could then be made available to the customer as a web service for automated further processing.

Often, however, the content provider has no interest in the mechanized retrieval of its data or the automated use of its service (especially with regard to special functions that should be reserved for real users), or setting up a web service would involve excessive costs and therefore be uneconomical. In such cases, screen scraping is frequently used to filter the desired data out of the website anyway.

Extended browsing

Screen scraping can be used to equip the browser with additional functions or to simplify previously cumbersome processes. For example, registration procedures for forums can be automated, or services of a website can be invoked from a browser toolbar without the user having to visit the website itself.

Bookmarklets represent a simple form of such screen scrapers.

Remixing

Remixing is a technique in which web content from different services is combined into a new service (see also mashup). If no open programming interfaces are available, screen scraping mechanisms have to be used here as well.

Abuse

Screen scraping techniques can, however, also be abused to copy the content of third-party websites against the provider's will and offer it on a separate server.

How it works

Screen scraping essentially consists of two steps:

  • Retrieving web pages
  • Extraction of the relevant data

Retrieving web pages

Static websites

In the ideal case, the data of interest is on a web page that can be accessed via a URL. All parameters required to retrieve the information are passed as URL parameters (query string, see GET request). In this simple case, the page is simply downloaded and the data is extracted with an appropriate mechanism.
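
For illustration, such a retrieval could look like the following minimal sketch in Python (standard library only; the URL and parameter names are assumptions):

   # Minimal sketch: fetch a static page whose parameters are passed in the query string.
   # The URL and parameter names are purely illustrative.
   import urllib.parse
   import urllib.request

   params = urllib.parse.urlencode({"category": "disks", "page": "1"})
   url = "http://www.example.com/listing?" + params   # GET request: parameters in the URL
   with urllib.request.urlopen(url) as response:
       html = response.read().decode("utf-8", errors="replace")
   # 'html' now contains the page source and can be handed to the extraction step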

Forms

In many cases the parameters are supplied by filling out a web form. They are then often transmitted not in the URL but in the message body (POST request).
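
A corresponding POST request could be sketched like this (again, the target URL and field names are assumptions):

   # Minimal sketch: submit a form via POST; the data= argument puts the
   # parameters into the message body instead of the URL.
   import urllib.parse
   import urllib.request

   form_data = urllib.parse.urlencode({"query": "floppy disk", "sort": "price"}).encode("utf-8")
   request = urllib.request.Request("http://www.example.com/search", data=form_data)
   with urllib.request.urlopen(request) as response:
       html = response.read().decode("utf-8", errors="replace")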

Personalized websites

Many websites contain personalized information. However, the Hypertext Transfer Protocol (HTTP) provides no native way of assigning requests to a specific person. To recognize a specific person, the server application has to use session concepts layered on top of HTTP. A frequently used option is to transmit session IDs in the URL or in cookies. These session concepts must be supported by a screen scraping application.
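
A minimal sketch of cookie-based session handling in Python might look like this (the URLs are purely illustrative):

   # Minimal sketch: keep a cookie-based session across requests.
   import http.cookiejar
   import urllib.request

   jar = http.cookiejar.CookieJar()
   opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

   # The first request typically sets a session cookie ...
   opener.open("http://www.example.com/login?user=demo")
   # ... which is then sent back automatically with every further request through the same opener.
   page = opener.open("http://www.example.com/personal/overview").read()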

Data extraction

A program for extracting data from web pages is also called a wrapper.

Once the page has been downloaded, the first question for extracting the data is whether the exact location of the data within the page is known (e.g. second table, third column).

If it is, several options are available for extracting the data. One is to interpret the downloaded pages as character strings and extract the desired data with regular expressions.
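
A minimal sketch of such a string-based extraction (the tag structure and class name are assumed for illustration):

   # Minimal sketch: treat the downloaded page as a string and pull out data with a regular expression.
   # 'html' is the page source obtained in the retrieval step.
   import re

   prices = re.findall(r'<td class="price">(.*?)</td>', html, re.DOTALL)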

If the website is XHTML-compliant, an XML parser can be used. There are numerous supporting technologies (SAX, DOM, XPath, XQuery) for accessing XML. Often, however, pages are delivered only as (possibly even invalid) HTML that does not conform to the XML standard. With a suitable parser it may nevertheless be possible to produce an XML-compliant document; alternatively, the HTML can be cleaned up with HTML Tidy before parsing. Some screen scrapers use a query language developed specifically for HTML.
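
For illustration, an XPath-based extraction might be sketched as follows, assuming the third-party lxml library is available (it also copes with non-XML-compliant markup); the XPath expression itself is only an example:

   # Minimal sketch: parse the page and address the data by its position
   # (here: third column of the second table). Requires the lxml library.
   import lxml.html

   document = lxml.html.fromstring(html)            # tolerant of broken, non-XHTML markup
   cells = document.xpath("//table[2]//tr/td[3]/text()")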

One quality criterion for extraction mechanisms is their robustness against changes to the structure of the website. This requires fault-tolerant extraction algorithms.

In many cases, however, the structure of the website is unknown (e.g. when crawlers are used). Data such as purchase price or time information must then be recognized and interpreted without any fixed specification of where it appears.
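
A very rough sketch of such pattern-based recognition (both patterns are simplified assumptions):

   # Minimal sketch: recognize price and date information without knowing the page structure.
   import re

   prices = re.findall(r"\d{1,3}(?:\.\d{3})*,\d{2}\s*(?:EUR|€)", html)
   dates = re.findall(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b", html)   # e.g. 09.10.2014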

Architecture

Centralized architecture

A screen scraper can be installed on a dedicated web server that retrieves the required data at regular intervals or on demand and offers it in processed form. This server-side approach can, however, lead to legal problems and can easily be blocked by the content provider simply by banning the server's IP address.

Distributed architecture

With the distributed approach, the information is retrieved directly by the client. Depending on the application, it is stored in a database, passed on to other applications, or displayed in processed form in the browser. The distributed architecture is not only more difficult to block, it also scales better.

Defense measures on the provider side

Many content providers have no interest in specific information being retrieved in isolation. One reason can be that the provider finances itself through advertisements, which can easily be filtered out by screen scraping. In addition, the content provider may have an interest in forcing the user through a certain navigation sequence. There are various strategies for safeguarding these interests.

Control of user behavior

The server uses session IDs to force the user into a specific navigation sequence. When the entry page of the website is called up, a temporarily valid session ID is generated. It is transmitted via the URL, hidden form fields, or cookies. If a user or a bot reaches the site through a deep link, they cannot present a valid session ID, and the server redirects them to the entry page. eBay, for example, uses this strategy to prevent deep links to auction lists. A specially programmed screen scraper can, however, first obtain a valid session ID and then download the desired data.

The following example shows a JavaScript-based screen scraper that circumvented the strategy used by eBay. It first downloaded the main page, extracted a valid URL with a regular expression (in this case the list of auctions in which floppy disks are offered) and opened it in the browser.

 function EbayScraper() {
    // Download the entry page; this also establishes a valid session context
    var req = new XMLHttpRequest();
    req.open('GET', 'http://computer.ebay.de', false);
    req.send(null);

    // Extract a currently valid listing URL from the page source with a regular expression
    var regex = new RegExp('http:\\/\\/computer\\.listings\\.ebay\\.de\\/Floppy-Zip-Streamer_Disketten_[a-zA-Z0-9]*');

    // Open the first matching URL (the list of floppy disk auctions) in the browser
    window.location = req.responseText.match(regex);
 }

In addition to repurposing session IDs in this way, there are other ways of checking user behavior:

  • Checking the referrer to block deep links
  • Checking whether elements embedded in the page (graphics, etc.) are downloaded promptly
  • Checking whether JavaScript elements are executed

All of these methods involve certain problems, however: referrer information is not mandatory, embedded elements may be supplied by a proxy or from the cache, or the user may simply have deactivated the display of graphics or the execution of JavaScript.
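
For illustration only, a server-side referrer check against deep links might be sketched as follows (host names and paths are assumptions; a real site would integrate such a check into its web framework):

   # Minimal sketch of a referrer check: requests without an acceptable Referer header
   # are redirected to the entry page instead of receiving the protected content.
   from http.server import BaseHTTPRequestHandler, HTTPServer

   class RefererCheckHandler(BaseHTTPRequestHandler):
       def do_GET(self):
           referer = self.headers.get("Referer", "")
           if not referer.startswith("http://www.example.com/"):
               self.send_response(302)
               self.send_header("Location", "http://www.example.com/")
               self.end_headers()
               return
           self.send_response(200)
           self.send_header("Content-Type", "text/html; charset=utf-8")
           self.end_headers()
           self.wfile.write(b"<html><body>protected listing</body></html>")

   # HTTPServer(("", 8080), RefererCheckHandler).serve_forever()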

Distinguishing between humans and bots

Before delivering the data, the server tries to recognize whether the client is a person or a bot. A commonly used method for this is the use of captchas. The client is given a task that is as simple as possible for humans but very difficult for a machine to solve. This can be an arithmetic problem or the typing of letters, where the difficulty for the machine often lies in recognizing the task in the first place. This can be achieved, for example, by transmitting the arithmetic task not as text but as an image.
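
A minimal sketch of rendering an arithmetic task as an image, assuming the third-party Pillow library (the task and image size are arbitrary):

   # Minimal sketch: render the arithmetic task as an image so it cannot be read as plain text.
   from PIL import Image, ImageDraw

   image = Image.new("RGB", (160, 50), "white")
   draw = ImageDraw.Draw(image)
   draw.text((20, 15), "17 + 4 = ?", fill="black")   # the expected answer (21) stays on the server
   image.save("captcha.png")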

Captchas are used by certain online services such as forums, wikis, download sites, and online networks, for example to prevent automatic registration, automatic harvesting of other users' profiles, and automatic downloads by bots. Sometimes a client only has to solve a captcha after a certain number of actions.

In theory, bots can be developed for any captcha that solve these tasks by means of optical character recognition (extracting the task from the image), so this protection can be circumvented. It is also possible to hand the subtask over to a human who solves the captcha for the machine. Both, however, mean considerable additional work for the bot operator.

Obfuscation

The information is offered in a form that is difficult or impossible for machines to read, for example as graphics, in Flash animations or in Java applets. However, usability often suffers as a result.

JavaScript can also be used to obfuscate the data. This method is used mainly against e-mail harvesters that collect e-mail addresses in order to send spam. The actual data is not transferred in the HTML code but is only written into the page by JavaScript. The data can also be transmitted in encrypted form and decrypted only when the page is displayed. With the help of an obfuscator, the program code can additionally be obscured in order to make the development of a screen scraper more difficult.

Simple example of obfuscating an e-mail address with JavaScript (without encryption):

  // The address is assembled only when the page is rendered, so it never appears
  // as a complete string in the delivered HTML source.
  function mail() {
     var name = "info";
     var domain = "example.com";
     var mailto = 'mailto:' + name + '@' + domain;
     document.write(mailto);
  }

Creation of screen scrapers

Depending on the complexity of the task, a screen scraper may have to be programmed from scratch. With the help of toolkits, however, screen scrapers can also be created without programming knowledge. There are various options for implementation, for example as a library, as a proxy server, or as a stand-alone program.

Applications

Piggy Bank was a Firefox extension developed by the Simile project at MIT. It could be used to link services from different providers and automatically recognized RDF resources offered on a website. These could be saved, managed, and combined with other services (such as geographic information with Google Maps). Piggy Bank is no longer offered. An alternative is Selenium, with which a web browser such as Firefox can be controlled programmatically.
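
A minimal sketch of such browser automation with Selenium's Python bindings (a third-party library that also requires the geckodriver executable; the URL and selector are assumptions):

   # Minimal sketch: drive Firefox programmatically and read data out of the rendered page.
   from selenium import webdriver
   from selenium.webdriver.common.by import By

   driver = webdriver.Firefox()
   try:
       driver.get("http://www.example.com/listing")
       titles = [element.text for element in driver.find_elements(By.CSS_SELECTOR, "h3")]
   finally:
       driver.quit()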

Another well-known Firefox extension is Greasemonkey. It allows the user to run their own JavaScript files in the browser, which can customize the appearance and behavior of the displayed page without requiring access to the actual website. This makes it possible, for example, to add functions to websites, correct display errors, integrate content from other websites, and perform recurring tasks automatically.

A9 from Amazon is an example of a centralized remix architecture. A9 can display search results from various web services such as Windows Live, Wikipedia, answers.com and many others in one window.

Programming libraries

Programmers often use scripting languages for bespoke screen scraping projects. For Python, for example, there is the Beautiful Soup library, which makes it easier to deal with real-world HTML. The domain-specific language redex (Regular Document Expressions) by Marcin Wojnarski is also based on Python; it was created specifically for web scraping and is intended to close the gap between the practical but fine-grained regular expressions and the powerful but very rigid XPath syntax.
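
A minimal Beautiful Soup sketch (third-party library; the tag and attribute names are assumptions, and 'html' is a previously downloaded page):

   # Minimal sketch: parse imperfect real-world HTML and extract links and table cells.
   from bs4 import BeautifulSoup

   soup = BeautifulSoup(html, "html.parser")
   links = [a.get("href") for a in soup.find_all("a")]
   prices = [td.get_text(strip=True) for td in soup.find_all("td", class_="price")]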

Legal problems

When scraping the websites of third-party providers, care must be taken to respect copyright, especially if the content is integrated into one's own offering. A legal gray area, on the other hand, is offering programs that enable client-side screen scraping. Some providers also explicitly prohibit the automated reading of data in their terms of use.

Another problem can be that the screen scraper hides information, such as advertising or legally relevant information such as disclaimers and warnings, or even automatically confirms the terms and conditions without the user seeing them.

See also

Literature

Web links

References

  1. Selenium website
  2. Homepage of the Python library Beautiful Soup
  3. Reference implementation of redex in the Python library nifty
  4. Mention of redex on the Global Open Access List on October 9, 2014
  5. StudiVZ terms of use (AGB), Section 5.4.3