Deep web


The Deep Web (also called the Hidden Web or Invisible Web) refers to the part of the World Wide Web that cannot be found with normal search engines. In contrast, the websites accessible via search engines are called the Clear Web, Visible Web, or Surface Web. The Deep Web consists largely of topic-specific databases (specialist databases) and websites. In short, it is content that is not freely accessible and/or content that is not indexed by search engines or that is not supposed to be indexed.

Types of the deep web

According to Sherman & Price (2001), five types of the Invisible Web can be distinguished: the Opaque Web, the Private Web, the Proprietary Web, the Invisible Web, and the Truly Invisible Web.

Opaque Web

The Opaque Web consists of web pages that could be indexed, but are currently not indexed for reasons of technical efficiency or cost-effectiveness (limited search depth, low frequency of visits).

Search engines do not consider all directory levels and subpages of a website. When capturing web pages, web crawlers move from one page to the next via links. Crawlers cannot navigate on their own: they can get lost in deep directory structures, fail to capture pages, and fail to find their way back to the start page. For this reason, search engines often consider at most five or six directory levels. Extensive, and therefore potentially relevant, documents may sit at lower levels of the hierarchy and cannot be found by search engines because of this limited indexing depth.
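This depth limit can be made concrete with a minimal sketch, assuming Python and a purely hypothetical start page (example.org): a crawler that follows links recursively simply gives up once it has descended a fixed number of levels, so documents nested more deeply remain part of the Opaque Web.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

MAX_DEPTH = 5  # roughly the five or six directory levels mentioned above


class LinkParser(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(url, depth=0, seen=None):
    """Follow links recursively, but give up beyond MAX_DEPTH levels."""
    seen = set() if seen is None else seen
    if depth > MAX_DEPTH or url in seen:
        return  # everything deeper than MAX_DEPTH remains unindexed
    seen.add(url)
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    except OSError:
        return  # unreachable pages are simply skipped
    print("  " * depth + url)
    parser = LinkParser()
    parser.feed(html)
    for link in parser.links:
        target = urljoin(url, link)
        if urlparse(target).netloc == urlparse(url).netloc:  # stay on the same host
            crawl(target, depth + 1, seen)


# crawl("https://example.org/")  # hypothetical start page
```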

In addition, there are file formats that can only be partially captured (for example PDF files: Google indexes only part of a PDF file and makes the content available as HTML).
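To index such a document at all, a search engine first has to convert it to plain text. A minimal sketch of that conversion step, assuming the third-party pypdf package and a hypothetical file report.pdf, might look like this:

```python
from pypdf import PdfReader  # third-party package: pip install pypdf

# Extract the plain text of a PDF so that it can be indexed like an HTML page.
reader = PdfReader("report.pdf")  # hypothetical document
text_pages = [page.extract_text() or "" for page in reader.pages]
plain_text = "\n".join(text_pages)

# A real indexer might truncate very large documents, which is one reason
# why only part of a PDF ends up in the index.
print(plain_text[:500])
```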

Indexing also depends on how frequently a website is re-indexed (daily, monthly), so constantly updated collections, such as online measurement data, are affected as well. The Opaque Web further includes websites without hyperlinks or navigation systems, i.e. unlinked websites and orphan pages.

Private Web

The Private Web describes web pages that could be indexed, but are not indexed because of access restrictions imposed by the webmaster.

These can be websites on an intranet (internal websites), but also password-protected content (registration with a password and login), access restricted to certain IP addresses, protection against indexing via the Robots Exclusion Standard, or protection against indexing via the meta tag values noindex, nofollow, and noimageindex in the page's source code.
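How a well-behaved crawler honours these restrictions can be sketched with Python's standard urllib.robotparser; the host name and paths below are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (Robots Exclusion Standard).
rp = RobotFileParser("https://example.org/robots.txt")
rp.read()

if rp.can_fetch("MyCrawler", "https://example.org/private/report.html"):
    print("robots.txt allows this page to be crawled")
else:
    print("robots.txt asks crawlers to stay away from this page")

# Independently of robots.txt, a downloaded page can still opt out of the index
# with a meta tag such as <meta name="robots" content="noindex, nofollow">,
# which the indexer has to honour after fetching the page.
```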

Proprietary Web

The Proprietary Web refers to web pages that could be indexed, but that are only available after accepting terms of use or entering a password (free of charge or for a fee).

Such websites can usually only be accessed after authentication (for example, web-based specialist databases).

Invisible Web

The Invisible Web comprises websites that could be indexed from a purely technical point of view, but are not indexed for commercial or strategic reasons, such as databases that can only be queried through a web form.

Truly Invisible Web

The Truly Invisible Web refers to web pages that cannot (yet) be indexed for technical reasons. These include database formats created before the WWW (on some hosts), documents that cannot be displayed directly in the browser, non-standard formats (e.g. Flash), and file formats that cannot be captured because of their complexity (graphic formats). There are also compressed data and websites that can only be operated through user navigation based on graphics (image maps) or scripts (frames).

Databases

Dynamically created database web pages

Web crawlers work almost exclusively with static websites and cannot reach many dynamic database-driven pages, since they can only reach deeper-lying pages through hyperlinks. Those dynamic pages can often only be reached by filling out an HTML form, which a crawler currently cannot do.
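The problem can be illustrated with a small sketch using only Python's standard library; the search endpoint and form fields are invented for illustration. The result page exists only as the response to a submitted form, so a crawler that merely follows hyperlinks never discovers it:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Submit the kind of query a human would type into the site's search form.
# Endpoint and field names are hypothetical.
form_data = urlencode({"author": "Bergman", "year": "2001"}).encode("ascii")

with urlopen("https://example.org/catalog/search", data=form_data) as response:
    result_page = response.read().decode("utf-8", errors="replace")

# The dynamically generated result has no stable URL a crawler could have followed.
print(result_page[:200])
```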

Cooperative database providers allow search engines to access the content of their databases directly, for example via mechanisms such as JDBC, in contrast to (normal) non-cooperative databases, which offer access only through a search form.
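The paragraph above mentions JDBC, a Java database API; as a rough, language-neutral analogy, the sketch below uses Python's built-in sqlite3 module and a hypothetical local database catalog.db to show what direct, cooperative access looks like compared with going through a search form:

```python
import sqlite3

# Stand-in for a cooperatively exposed specialist database.
conn = sqlite3.connect("catalog.db")  # hypothetical database file
rows = conn.execute(
    "SELECT title, year FROM publications WHERE topic = ?", ("deep web",)
)

for title, year in rows:
    # A cooperating provider lets the indexer read records directly,
    # instead of forcing it through the website's search form.
    print(title, year)

conn.close()
```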

Hosts and specialist databases

Hosts are commercial information providers that bundle specialist databases from different information producers behind a single interface. Some database providers (hosts) and database producers operate relational databases whose data cannot be accessed without a special access mechanism (retrieval language, retrieval tool). Web crawlers understand neither the structure nor the language required to read information from these databases. Many hosts have been active as online services since the 1970s and in some cases operate database systems that were created long before the WWW.

Examples of such databases: library catalogs (OPAC), stock exchange prices, timetables, legal texts, job exchanges, news, patents, telephone books, web shops, dictionaries.

Estimation of the amount of data

According to a study by BrightPlanet published in 2001, the deep web has the following properties:

  • The amount of data in the Deep Web is around 400 to 550 times greater than that of the Surface Web. The 60 largest websites in the Deep Web alone contain around 7,500 terabytes of information, 40 times more than the entire Surface Web.
  • There are reportedly more than 200,000 Deep Web sites.
  • According to the study, websites from the Deep Web receive on average 50% more hits per month and are linked to more often than websites from the Surface Web.
  • The Deep Web is also the fastest-growing category of new information on the web.
  • Nevertheless, the Deep Web is barely known to the Internet-searching public.
  • More than half of the Deep Web is located in topic-specific databases.

Since BrightPlanet sells a commercial search aid (DQM2), its size estimate, which may be greatly overestimated, must be viewed with great caution. Several data sets should be removed from BrightPlanet's estimate of the amount of data in the Deep Web:

  • Duplicates from overlapping library catalogs
  • National Climatic Data Center data collection (361 terabytes)
  • NASA data (296 terabytes)
  • further data collections (National Oceanographic Data Center & National Geophysical Data Center, Right-to-Know Network, Alexa, ...)

The number of such data sets suggests that the study overestimates the size of the Deep Web by a factor of ten. However, the information provider LexisNexis alone holds 4.6 billion records, more than half the number of records of search engine leader Google. The Deep Web is therefore certainly much larger than the Surface Web.

A 2003 study by the University of California, Berkeley determined the following values for the size of the Internet: Surface Web, 167 terabytes; Deep Web, 91,850 terabytes. That ratio of roughly 550:1 is consistent with the upper end of BrightPlanet's estimate. For comparison, the printed holdings of the Library of Congress in Washington, one of the largest libraries in the world, amount to 10 terabytes.

Overall, figures about the size of the Deep Web should not be taken too literally. Many websites do not enter a search engine's index on their own: a newly created private website is not visited immediately. However, such a page can be registered with a search engine, or its owner can wait until it is linked from other pages that crawlers have already indexed.

See also

Literature

Web links

Individual evidence

  1. ^ Chris Sherman, Gary Price: The Invisible Web: Uncovering Information Sources Search Engines Can't See. CyberAge Books, Medford, NJ 2001, ISBN 0-910965-51-X (English).
  2. ^ Michael K. Bergman: The Deep Web: Surfacing Hidden Value. In: The Journal of Electronic Publishing, Volume 7, 2001, No. 1.
  3. Internet Archive Wayback Machine (Memento of March 14, 2006, in the Internet Archive)
  4. Internet (Memento of the original from October 15, 2004, in the Internet Archive), sims.berkeley.edu