Web crawler

A web crawler (also spider, searchbot, or robot) is a computer program that automatically searches the World Wide Web and analyzes websites. Web crawlers are mainly used by search engines to index websites. Other uses include collecting web feeds, e-mail addresses, or other information.

Web crawlers are a special kind of bot, i.e. a computer program that performs repetitive tasks largely autonomously.

History

The first web crawler was the World Wide Web Wanderer in 1993, which was designed to measure the growth of the Internet. In 1994, WebCrawler launched as the first publicly accessible WWW search engine with a full-text index; the name "web crawler" for such programs derives from it. Because the number of search engines grew rapidly, there is now a large number of different web crawlers. According to an estimate from 2002, they generated up to 40% of all Internet data traffic.

Technology

Structure of web crawlers

Like a person surfing the Internet, a web crawler reaches further URLs from a web page by following its hyperlinks. All addresses found are saved and visited one after the other, and the hyperlinks newly discovered on each page are added to the list of all URLs. In this way, all linked pages on the WWW that are not blocked for web crawlers can in theory be found. In practice, however, a selection is usually made, and at some point the process is stopped and started over. Depending on the task of the web crawler, the content of the pages found is evaluated and stored, for example by indexing, so that the collected data can be searched later.
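
The loop described above can be sketched in Python as a breadth-first traversal with a queue of pending URLs and a set of URLs already seen. This is only a minimal illustration, not a production crawler: the `crawl` function, the `fetch` callback, the `max_pages` limit, and the in-memory `PAGES` dictionary standing in for the web are all assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Visit pages breadth-first, starting at `seed`.

    `fetch(url)` is assumed to return the page's HTML, or None if the
    page is unavailable. Returns the URLs in the order they were visited.
    """
    frontier = deque([seed])   # URLs still to be visited
    seen = {seed}              # every URL ever added to the frontier
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


# Hypothetical four-page "web" used instead of real HTTP requests.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.com/b": '<a href="/c#top">C</a>',
    "http://example.com/c": "no links here",
}

order = crawl("http://example.com/", PAGES.get)
```

The `max_pages` limit corresponds to the point at which the process is stopped in practice; a real crawler would replace `PAGES.get` with an HTTP client and would store or index each page's content.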

Exclusion of web crawlers

Using the Robots Exclusion Standard, a website operator can tell a web crawler which pages to index and which not, by means of the robots.txt file and certain meta tags in the HTML header, provided the web crawler adheres to the protocol. To combat unwanted web crawlers there are also special websites, so-called tarpits, which feed the crawlers incorrect information and also slow them down considerably.
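
As an illustration of how a cooperating crawler might honour robots.txt, the following sketch uses `urllib.robotparser` from the Python standard library. The robots.txt content, the crawler names `ExampleCrawler` and `BadBot`, and the URLs are made up for the example; meta tags in the HTML header are not covered here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everyone is kept out of /private/,
# and the crawler calling itself "BadBot" is kept out entirely.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""


def load_rules(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# A well-behaved crawler asks before every request.
allowed_public = rules.can_fetch("ExampleCrawler", "http://example.com/index.html")
allowed_private = rules.can_fetch("ExampleCrawler", "http://example.com/private/data.html")
allowed_badbot = rules.can_fetch("BadBot", "http://example.com/index.html")
```

The check is purely voluntary: nothing in the protocol prevents a crawler from ignoring the answer, which is why measures such as tarpits exist.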

Problems

A large part of the Internet is not covered by web crawlers, and therefore not by public search engines either, because much of the content cannot be reached via simple links but only via search forms or access-restricted portals, for example. These areas are also referred to as the "deep web". In addition, the constant change of the web and the manipulation of content (cloaking) pose problems.

Types

Thematically focused web crawlers are known as focused crawlers or focused web crawlers. The focus of the web search is achieved by classifying each web page itself and by classifying the individual hyperlinks. In this way, the focused crawler finds the best path through the web and indexes only the areas of the web that are relevant to a topic or a domain. The main obstacles in the practical implementation of such web crawlers are sub-areas that are not linked and the training of the classifiers.

Web crawlers are also used for data mining and for studying the Internet (webometrics), and they do not necessarily have to be restricted to the WWW.
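
The two classification steps can be sketched as a best-first search: a page score decides whether a page is indexed, and a link score decides which link is followed next. The following Python sketch is a toy under stated assumptions; the word-overlap `relevance` function stands in for a trained classifier, and `focused_crawl`, its parameters, and the `PAGES` data are hypothetical.

```python
import heapq


def relevance(text, topic_words):
    """Toy classifier: fraction of the topic words that occur in the text."""
    words = set(text.lower().split())
    return sum(1 for w in topic_words if w in words) / len(topic_words)


def focused_crawl(seed, pages, topic_words, threshold=0.5, max_pages=10):
    """Best-first crawl over `pages`.

    `pages` maps a URL to (page_text, [(anchor_text, target_url), ...]).
    Returns the URLs of the pages judged relevant, in visiting order.
    """
    # heapq is a min-heap, so scores are negated to pop the best link first.
    frontier = [(-1.0, seed)]
    seen = {seed}
    indexed = []
    fetched = 0
    while frontier and fetched < max_pages:
        _neg_score, url = heapq.heappop(frontier)
        if url not in pages:
            continue
        fetched += 1
        text, links = pages[url]
        # Classification of the page itself: index only relevant pages.
        if relevance(text, topic_words) >= threshold:
            indexed.append(url)
        # Classification of each hyperlink: promising links come first.
        for anchor_text, target in links:
            if target not in seen:
                seen.add(target)
                score = relevance(anchor_text, topic_words)
                heapq.heappush(frontier, (-score, target))
    return indexed


# Hypothetical miniature web about two unrelated topics.
PAGES = {
    "seed": ("python crawler tutorial",
             [("cooking recipes", "food"), ("python crawler library", "lib")]),
    "food": ("pasta recipes", []),
    "lib": ("a crawler library for python",
            [("crawler python examples", "examples")]),
    "examples": ("python crawler examples", []),
}

result = focused_crawl("seed", PAGES, ["python", "crawler"])
```

With a budget of three pages (`max_pages=3`), the off-topic page `food` is never fetched at all, which is the saving a focused crawler aims for.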

E-mail harvesters are a special form of web crawler. The term refers to software that searches the Internet (WWW, Usenet, etc.) for e-mail addresses and "harvests" them. The collected addresses can then be sold. The result is usually, and especially in the case of spambots, unsolicited promotional e-mail (spam). For this reason, the once common practice of publishing e-mail addresses on websites as a contact option via a mailto: link is increasingly being abandoned; sometimes the addresses are made harder for bots to read by inserting spaces or words, so that a@example.com becomes a (at) example (dot) com. Most bots can nevertheless recognize such addresses. Another popular method is to embed the e-mail address in an image. The address is then not present as a character string in the page source, so a bot cannot find it as text. The disadvantage for users is that they cannot pass the address to their e-mail program with a single click and must instead transcribe it by hand. A far more serious drawback is that the page is no longer accessible: visually impaired people are excluded along with the bots.
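
Why the "(at)" and "(dot)" substitution offers little protection can be seen in a few lines of Python. The sketch below is an illustration only; the function names `obfuscate` and `deobfuscate` and the two regular expressions are assumptions for the example, not the method of any particular bot.

```python
import re


def obfuscate(address):
    """Rewrite an address the way a website operator might publish it."""
    return address.replace("@", " (at) ").replace(".", " (dot) ")


# Patterns for "(at)" / "[at]" and "(dot)" / "[dot]" with optional spaces.
_AT = re.compile(r"\s*[\(\[]\s*at\s*[\)\]]\s*", re.IGNORECASE)
_DOT = re.compile(r"\s*[\(\[]\s*dot\s*[\)\]]\s*", re.IGNORECASE)


def deobfuscate(text):
    """Undo the substitution with two regular-expression replacements."""
    return _DOT.sub(".", _AT.sub("@", text))


hidden = obfuscate("a@example.com")
recovered = deobfuscate(hidden)
```

Because the substitution follows a fixed pattern, reversing it takes no more effort than applying it, which is consistent with the observation that most bots recognize such addresses.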

Another purpose of web crawlers is to find copyrighted content on the Internet.

References

[1] X. Yuan, M. H. MacGregor, J. Harms: An efficient scheme to remove crawler traffic from the Internet. In: Proceedings of the Eleventh International Conference on Computer Communications and Networks, 2002.
[2] Sotiris Batsakis, Euripides G. M. Petrakis, Evangelos Milios: Improving the Performance of Focused Web Crawlers. April 9, 2012. (English)

Web links

The Web Robots Pages (English)
Ronny Harbich: Webcrawling - Developing the Web, 2008.
