Googlebot

from Wikipedia, the free encyclopedia

Googlebot is a web crawler operated by the US company Google LLC. The program automatically downloads World Wide Web content and feeds it to Google's own search engine.

Working method

There is usually a delay of a few days between the download of a new version of a file and the update of the search engine index with that version. How often Googlebot visits a page depends, among other things, on how many external links point to it and how high its PageRank is. In most cases, Googlebot accesses a given website on average no more than once every few seconds.

To keep the load on crawled websites as low as possible, every fetched page is first stored in a cache shared by all Google crawlers. If several crawlers request the same page within a certain period, the later requests can be served from the cache instead of from the website.
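The caching idea can be sketched in a few lines of Python. This is only an illustration of the principle, not Google's implementation: the class name `CrawlCache`, the `max_age` window, and the `fetch` and `clock` callbacks are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class CrawlCache:
    """Hypothetical shared cache: several crawlers reuse one download per URL."""

    def __init__(
        self,
        fetch: Callable[[str], bytes],
        max_age: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch        # function that actually downloads a URL
        self._max_age = max_age    # seconds a cached copy stays valid (assumed value)
        self._clock = clock        # injectable clock, which makes testing easier
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, url: str) -> bytes:
        now = self._clock()
        cached = self._entries.get(url)
        if cached is not None and now - cached[0] < self._max_age:
            return cached[1]       # served from the cache, the site is not contacted
        body = self._fetch(url)    # cache miss or stale copy: contact the site once
        self._entries[url] = (now, body)
        return body
```

Every crawler that shares one `CrawlCache` instance and calls `get()` for the same URL within `max_age` seconds triggers only a single download from the website.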

Googlebot respects the robots.txt file and the robots directives in HTML meta tags. Note that blocking CSS or JavaScript files can cause Googlebot to interpret a website incorrectly.
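How a robots.txt file affects a crawler can be tried out with Python's standard `urllib.robotparser` module. The rules and URLs below are made up for illustration, and the module's rule matching may differ in detail from Googlebot's own parser. The example also shows a blocked JavaScript directory, the situation the paragraph above warns about.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real site serves this file at /robots.txt.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/
Disallow: /assets/js/

User-agent: *
Disallow: /
"""


def allowed_for(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, `allowed_for("Googlebot", "http://example.com/index.html")` is true, while the same call for `http://example.com/assets/js/app.js` is false, so a crawler obeying the file would see the page without its scripts.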

Dynamic page content

So far, Googlebot has found it difficult or impossible to index page content that is only reachable through PHP sessions or request variables, because the bot usually knows neither the required variables nor their parameters. Google is currently working on extending the crawler so that it can also capture content that has so far remained hidden behind several AJAX requests. In the future, it should also be able to capture content that a website loads dynamically. Google also plans to have the crawler send POST requests to websites. The risk is that POST requests can unintentionally trigger user actions.

Identification

Depending on the task, Googlebot identifies itself with one of the following user agent strings:

Googlebot/2.1 (+http://www.google.com/bot.html)
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Googlebot-Image/1.0

Another Google crawler downloads pages in order to select suitable advertising for the Google AdSense program. It identifies itself as follows:

Mediapartners-Google/2.1
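A server-side check for these identifiers might look like the following sketch. The function name `claimed_crawler` and the token table are assumptions made for this example; the function only reports which crawler a request claims to be, because the header can be set to any value by the client.

```python
from typing import Optional

# Product tokens taken from the user agent strings listed above.
# Order matters: "Googlebot-Image" must be tested before the shorter "Googlebot".
_TOKENS = ("Googlebot-Image", "Mediapartners-Google", "Googlebot")


def claimed_crawler(user_agent: str) -> Optional[str]:
    """Return the Google crawler token found in a User-Agent header, if any."""
    lowered = user_agent.lower()
    for token in _TOKENS:
        if token.lower() in lowered:
            return token
    return None
```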

Verification

Some web users and crawlers send these identifiers to pose as Googlebot, hoping that the site operator will serve Googlebot particularly good or ad-free content.

To determine whether a visitor really is a Google crawler, Google recommends using the Domain Name System. First, the visitor's IP address is translated into a host name with a reverse DNS lookup; this name should end in googlebot.com. Then a regular (forward) DNS lookup on that host name checks whether it resolves back to the visitor's original IP address.
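The two lookups described above can be sketched with Python's standard `socket` module. The function name `is_google_crawler`, the injectable lookup parameters, and the `suffixes` tuple are assumptions made for this example. Note that `socket.gethostbyname_ex` only returns IPv4 addresses, so this sketch does not cover IPv6 visitors.

```python
import socket
from typing import Callable, List, Tuple


def _reverse(ip: str) -> str:
    # Reverse DNS lookup: IP address -> primary host name.
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> List[str]:
    # Forward DNS lookup: host name -> list of IPv4 addresses.
    return socket.gethostbyname_ex(host)[2]


def is_google_crawler(
    ip: str,
    reverse_lookup: Callable[[str], str] = _reverse,
    forward_lookup: Callable[[str], List[str]] = _forward,
    suffixes: Tuple[str, ...] = (".googlebot.com",),
) -> bool:
    """Check an IP with a reverse lookup followed by a confirming forward lookup."""
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        return False               # no reverse DNS entry for this address
    # The leading dot prevents names such as "evilgooglebot.com" from matching.
    if not host.endswith(suffixes):
        return False
    try:
        return ip in forward_lookup(host)
    except OSError:
        return False               # the claimed host name does not resolve
```

The forward lookup is what makes the check meaningful: whoever controls the reverse DNS zone of an IP address can make it return any name, but only Google can make a googlebot.com name resolve to that address.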
