Robots Exclusion Standard

According to the Robots Exclusion Standard protocol, a web crawler (robot) that finds a website first reads the file robots.txt (lower case) in the root directory of the domain. This file can be used to define whether and how the website may be visited by web crawlers, which gives website operators the opportunity to block selected areas of their site from (certain) search engines. The protocol is purely advisory and depends on the cooperation of the web crawler; such crawlers are called "friendly" web crawlers. Excluding parts of a website by means of the protocol does not guarantee secrecy; for that purpose, pages or subdirectories of a server must be protected by HTTP authentication, an access control list (ACL) or a similar mechanism. Some search engines still display URLs that the web crawler has found but is supposed to block on their search results pages, albeit without a description of the pages.
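As an illustration, a "friendly" client can check robots.txt before fetching a page using Python's standard library; the domain, user-agent name and URL below are placeholders, not taken from the article:

# Minimal sketch of a "friendly" client that honors robots.txt,
# using Python's standard library urllib.robotparser.
# Domain, user-agent name and URL are placeholder assumptions.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # robots.txt in the root directory of the domain
rp.read()                                      # fetches and parses the file over HTTP

user_agent = "Sidewinder"                      # name the crawler announces
url = "https://example.com/Temp/report.html"

if rp.can_fetch(user_agent, url):
    print("Crawling allowed:", url)
else:
    print("Blocked by robots.txt:", url)       # a friendly crawler skips this URL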

The protocol was developed in 1994 by an independent group, but it is now generally accepted and can be regarded as a quasi-standard. At the beginning of June 2008, Google, Microsoft and Yahoo publicly committed to some common features.

A binding prohibition of indexing is not achieved by using robots.txt, even though reputable web crawlers follow its instructions.

Structure

The robots.txt file is a text file in an easy-to-read format. Each line consists of two fields separated by a colon.

User-agent: Sidewinder
Disallow: /

The first line names, with User-agent, the web crawler to which the following rules apply. Any number of such blocks may exist. Web crawlers read the file from top to bottom and stop at the first block that refers to them. For each URL that is to be excluded there is a separate line with the Disallow command. Empty lines are only allowed above User-agent lines; they separate the blocks from one another. Single-line comments beginning with a hash sign (#) are possible at any point. They serve clarity and are ignored by the web crawler.
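The block selection described above can be sketched roughly in Python. The following is a simplified illustration of the matching idea only; a real parser handles many more details (wildcards, Allow rules, and so on):

# Simplified sketch of how a crawler picks "its" block in robots.txt.
# It only illustrates the block structure described above and is not
# a complete implementation of the protocol.
def select_block(robots_txt, agent_name):
    blocks = []                                   # list of (agents, rules) pairs
    agents, rules = [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()       # strip comments and whitespace
        if not line:
            continue                              # empty lines separate blocks
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if rules:                             # a new block starts
                blocks.append((agents, rules))
                agents, rules = [], []
            agents.append(value)
        elif field in ("disallow", "allow"):
            rules.append((field, value))
    if agents or rules:
        blocks.append((agents, rules))

    for agents, rules in blocks:                  # read top to bottom, first match wins
        if agent_name.lower() in (a.lower() for a in agents) or "*" in agents:
            return rules
    return []                                     # no block applies: nothing is excluded

Called with the two-line example above and the agent name "Sidewinder", the function returns [("disallow", "/")].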

Instruction | Description | Example | Function
User-agent: | Specification of the web crawler | User-agent: Sidewinder | Applies only to the web crawler named "Sidewinder".
            | | User-agent: * | Wildcard for the user agent; applies to all web crawlers.
Disallow: | Do not allow reading | Disallow: | No exclusion; the entire website may be crawled.
          | | Disallow: / | The entire website may not be crawled.
          | | Disallow: /Temp/ and Disallow: /default.html | The directory "Temp" and the file "default.html" may not be crawled.
          | | Disallow: /default | All files and directories whose names begin with "default" are not crawled, e.g. "default.html", "default.php", "default-page.html", "defaultfolder/". Blocking "default.html" also blocks e.g. "default.html.php" or "default.html/", even if such a constellation should be rare.
$ | End-of-line anchor (Googlebot, Yahoo! Slurp, msnbot only) | Disallow: /*.pdf$ | All PDF files are ignored.
? | Treat URLs containing a '?' (Googlebot only) | Disallow: /*? | All URLs containing a '?' are ignored.
  | | Allow: /*?$ | All URLs ending with a '?' are allowed.
Allow: | Allow reading (Ask.com, Googlebot, Yahoo! Slurp, msnbot only) | Disallow: / and Allow: /public/ | Only the "public" directory may be crawled, the rest may not.
Crawl-delay: | Crawl rate (msnbot, Yahoo! Slurp, Yandex only) | Crawl-delay: 120 | A new page may only be requested for reading every 120 seconds.
Sitemap: | URL of the sitemap (Googlebot, Yahoo! Slurp, msnbot, Ask.com only) | Sitemap: http://example.com/sitemap.xml | The sitemap according to the sitemap protocol is located at the given address.
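The wildcard and end-of-line extensions shown in the table ('*' and '$') are vendor extensions rather than part of the original standard. As a rough, non-normative picture of how a supporting crawler might interpret such a rule, it can be translated into a regular expression; the helper below is an illustrative assumption, not documented crawler behaviour:

# Rough sketch: interpret a rule such as "Disallow: /*.pdf$" as a pattern.
# This mirrors the table rows above and is an assumption, not a specification.
import re

def rule_to_regex(rule):
    pattern = re.escape(rule).replace(r"\*", ".*")   # '*' matches any sequence of characters
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"                 # '$' anchors the end of the URL
    return re.compile(pattern)

blocked_pdf = rule_to_regex("/*.pdf$")
print(bool(blocked_pdf.match("/files/report.pdf")))      # True: the URL ends in .pdf
print(bool(blocked_pdf.match("/files/report.pdf?x=1")))  # False: the URL continues after .pdf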

Examples

# robots.txt for example.com
# I exclude these web crawlers
User-agent: Sidewinder
Disallow: /

User-agent: Microsoft.URL.Control
Disallow: /

# These directories/files should not
# be crawled
User-agent: *
Disallow: /default.html
Disallow: /Temp/ # these contents will not be re-indexed by search engines; whether previously indexed contents are removed is undefined
Disallow: /Privat/Familie/Geburtstage.html # not secret, but should not be crawled by search engines.

The following directives prevent all web crawlers from accessing the entire website. Indexing of the content in the search engine is thereby excluded, but not the display of the URL and of information that does not come from the page itself but from external sources. This also applies if indexing is permitted again on individual pages, since the web crawlers never request those pages in the first place.

User-agent: *
Disallow: /

Another example:

robots.txt of the German-language Wikipedia

Alternatives

Meta information

Indexing web crawlers can also be turned away by meta elements in the HTML source code of a web page. Meta elements are likewise purely advisory, require the cooperation of "friendly" web crawlers and do not guarantee confidentiality. If the search robot should not include the web page in the search engine's index (noindex) or should not follow the hyperlinks on the page (nofollow), this can be noted in a meta element as follows:

<meta name="robots" content="noindex,nofollow" />

In HTML documents for which both should be allowed, the information can either be omitted or explicitly noted:

<meta name="robots" content="all" />

The syntax is hardly officially standardized; it is based on common practice and on acceptance by crawler developers.
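For context, such a meta element belongs in the head of the HTML document. A minimal, made-up page could look like this:

<!DOCTYPE html>
<html>
  <head>
    <title>Example page</title>
    <!-- keep this page out of the index and do not follow its links -->
    <meta name="robots" content="noindex,nofollow" />
  </head>
  <body>
    <p>Content that should not appear in search engine results.</p>
  </body>
</html>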

Well-known keywords
Encouragement | Prohibition | Hoped-for effect
all | - | Pay maximum attention.
index | noindex | (Do not) include this page in the index.
follow | nofollow | (Do not) follow links contained in the page.
archive | noarchive | (Do not) include the page in web archiving, or even remove existing archived versions.
- | noodp | ODP (dmoz): use the metadata of the current page instead of the ODP entry. Future uncertain due to the temporary suspension of the service.
- | noydir | Yahoo (AltaVista): use the metadata of the current page instead of an existing Yahoo entry. Obsolete, as the search engine was discontinued in 2013.

Instead of generally addressing all bots:

<meta name="robots" content="noindex,nofollow" />

you can also try to control certain bots:

<meta name="msnbot" content="nofollow" /> <!-- Microsoft -->
<meta name="GoogleBot" content="noindex" /> <!-- Google -->
<meta name="Slurp" content="noydir" /> <!-- Yahoo -->

ACAP

With ACAP 1.0 (Automated Content Access Protocol), an alternative to the Robots Exclusion Standard was created on November 30, 2007. However, it is not used by search engine operators and other service providers; Google has ruled out using ACAP in its current form.

humans.txt

The robots.txt file provides "robots" (in the form of software/web crawlers) with additional information about a website. Based on this, Google introduced the humans.txt file in 2011, which is intended to provide additional background information to the human visitors of a website. Other websites have since used this file as well, for example to name the programmers of the site or to describe the software used. Google itself uses the file for a brief self-presentation and for pointers to jobs at the company.
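The content of humans.txt is free form. A purely hypothetical example, loosely following the section layout suggested on humanstxt.org, might look like this:

/* TEAM */
  Developer: Jane Doe
  Contact: jane [at] example.com

/* SITE */
  Last update: 2024/01/01
  Software: Apache HTTP Server, PHP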

See also

Literature

  • Ian Peacock: Showing Robots the Door. What is Robots Exclusion Protocol? In: Ariadne, Issue 15, May 1998 (web version).

Web links
