Google Scholar

from Wikipedia, the free encyclopedia

Google Scholar is a search engine operated by Google LLC for general literature searches across scholarly documents. It covers freely available documents on the open web as well as paid offerings. Hits usually show the full text or at least the bibliographic reference. Google Scholar analyzes and extracts the citations contained in the full texts and builds a citation analysis from them. In addition, the bibliographic details of these citations can be looked up through the service. In January 2018, the size of Google Scholar was estimated at around 389 million documents, which makes it the world's largest academic search engine.

Predecessor

Google Scholar builds on the experience Google gained with its other services in previous years, above all Google web search. The layout, the ease of use and the indexing of all resources in a single overall index were carried over to the scholarly search engine. With a few adjustments, PageRank could also be adopted for rating and sorting the sources.

The CrossRef project is regarded as the predecessor of Google Scholar. In addition to open access documents and self-archived documents, it indexed the full-text collections of numerous specialist publishers and societies. All of these materials were searchable through the familiar, simple Google search interface.

The aim of the project was to open up part of the deep web to the search engine, namely the fee-based publications of publishers and specialist societies that are accessible only after registration and login. A joint agreement between Google and the participating publishers served as the basis for this.

Range of functions

On November 18, 2004, Google launched the English beta version of Google Scholar; since April 21, 2006, the search service has also been available in German.

The documented literature focuses mainly on specialist journals. However, Google Scholar also provides the full text of other scholarly documents, or only the corresponding bibliographic data. This includes content from the free web, e.g. from private and institutional homepages, as well as open access publications and self-archived documents. Paid offerings from publishers and specialist societies are also indexed. Like its predecessor project CrossRef, Google Scholar thus opens up part of the deep web.

What sets Google Scholar apart is its full-text analysis and indexing. In specialist databases, by contrast, searches are limited to the bibliographic information, abstracts and keywords. Unlike in specialist databases, documents are not selected and evaluated by human experts but by algorithms that assess their scholarly character and determine their ranking in the hit list.

The results of a literature search are displayed sorted by relevance. A distinction is made between fee-based publisher offerings, free references (which do not always lead directly to the full text) and open access publications. The added value of the scholarly search engine lies in the ranking of documents and in the extraction and analysis of citations. It is also possible to forward search queries to WorldCat, and library users who work with Google Scholar can use the "library link".

Target group

According to its homepage, Google Scholar is aimed at academics. The target group therefore includes scientists, researchers, students, university lecturers, research assistants and doctoral students, as well as schoolchildren.

Search space

Google Scholar sees itself as a search service for general searches of scholarly literature. This primarily includes journal articles, books and technical reports, but also seminar papers and all kinds of student theses, PowerPoint presentations, abstracts, preprints and conference papers. Some of these documents are freely available on the web and some come from commercial providers. The range of full texts is significantly expanded by the integration of data from Google Books.

The commercial suppliers of the data are scientific publishers, specialist societies and professional associations with which Google has an agreement that allows its web crawlers to index their full-text documents. Only academic articles are considered, not textbooks or monographs. It is clear that Google defines "scientific" very broadly: in addition to peer-reviewed journal articles, presentation slides, student work from university publications and documents that private individuals post on their homepages are also indexed.

Functionality

Coverage

As explained above, the search space of this search engine includes scholarly documents of varying quality, some of them at different stages of the publication process. It therefore indexes not only quality-checked articles from scholarly journals, but also open access publications that have not always gone through peer review, as well as preprints and lecture materials. Google Scholar groups the different versions of a document. The publisher's version is displayed as the hit, and all other versions are collected below it under the link "all ... hits", which opens the list of all indexed versions.
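
The grouping of versions described above can be illustrated with a small sketch. It is not Google Scholar's actual method, which is not published; it simply assumes that versions can be matched by a normalized title and that each record carries a hypothetical "source" field. The function names and sample records are invented for illustration.

```python
import re
from collections import defaultdict


def normalize_title(title: str) -> str:
    """Lowercase the title and collapse punctuation and whitespace into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def group_versions(records):
    """Group records by normalized title and list publisher versions first.

    Assumption: records are dicts with hypothetical "title" and "source" keys.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[normalize_title(rec["title"])].append(rec)
    for versions in groups.values():
        # False sorts before True, so publisher copies move to the front.
        versions.sort(key=lambda r: r["source"] != "publisher")
    return dict(groups)


# Invented sample data: two versions of one article plus an unrelated document.
records = [
    {"title": "A Study of Black Holes", "source": "preprint"},
    {"title": "A study of black holes.", "source": "publisher"},
    {"title": "Search Engines Compared", "source": "repository"},
]
grouped = group_versions(records)
```

In this sketch the first entry of each group plays the role of the main hit, and the remaining entries correspond to the versions listed below it.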

Google Scholar analyzes and indexes documents in various formats, including HTML, PDF and PostScript; compressed files can also be processed. The range of documents available as full text has been significantly expanded through the integration of data from Google Books. However, less popular topics are insufficiently represented in Google Scholar, both in references and in full texts.

Indexing

Google Scholar extracts metadata such as title, author and year of publication. This is done automatically: the web crawler scans the documents, and an algorithm distinguishes the individual text segments on the basis of the document layout and classifies them as citation, author name, year of publication and so on. This extraction is difficult because the documents follow different standards, or none at all, and come in different formats. Accordingly, the recognized metadata are sometimes incorrect. This impairs the findability of the documents and all functions that Google Scholar builds on these data. It applies above all to institutional publication servers whose metadata do not match the scheme required by Google.

The extracted data are used for the citation results, for the ranking factor of the document and for the “cited by” function. They are also required for field-specific searches in the advanced search and for export to reference management programs.

Ranking

The ranking procedure uses the established methods of Google web search. Because the familiar Google technology runs in the background, the service offers the same search interface and the same processing speed. However, scholarly documents and their contents have special properties that make it necessary to adapt the principles and algorithms of PageRank.

The technology takes into account the full text of the document, the source in which it was published and, most importantly, how often it is cited in other articles, to name just a few of the factors. Since Google publishes hardly any information about the ranking process, one can only speculate about further popularity measures and their weighting. It is known only that frequently cited literature appears high in the hit list. Because recent documents would therefore rank lower than older ones, the weighting of the publication date was adjusted in favor of more recent documents.

Citation extraction

For the automatic extraction and analysis of citations, Google draws on its experience with link analysis and on insights from the search engine CiteSeer. Through autonomous citation indexing, references are extracted from the full texts and verified. As a result, Google Scholar also lists works that lie outside its own coverage, mainly books.

In some cases, Google Scholar is seen as a competitor to the costly citation databases Science Citation Index (SCI) and Scopus, because it offers free citation analysis and covers more open access journals than these databases. Google Scholar thus has some advantages over the fee-based offerings.

Like the automatic extraction of metadata, the automatic recognition of citations is prone to errors. This sometimes leads to redundant, incomplete or incorrect entries in the Google Scholar index.

With the "similar articles" and "cited by" functions, Google Scholar offers ways to broaden a search. The label “citation” denotes documents that are referenced in other academic resources but are not included as full text in Google Scholar; only the extracted bibliographic data are shown to the user. The query can, however, be forwarded to WorldCat via the "Library Search" link. This catalog is used to find the nearest library that holds the title. Documents recognized as thematically related are listed via the "Similar Articles" link. This function, too, is based on full-text indexing and the subsequent automatic extraction and analysis of the data.
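
The relationship between extracted reference lists, the "cited by" function and citation-only entries can be sketched as follows. This is a simplified illustration, not Google Scholar's implementation: it assumes the references have already been extracted and resolved to identifiers, and all identifiers and names are invented.

```python
from collections import defaultdict


def build_cited_by(references):
    """Invert a mapping 'document -> works it cites' into 'work -> documents citing it'."""
    cited_by = defaultdict(set)
    for citing, cited_list in references.items():
        for cited in cited_list:
            cited_by[cited].add(citing)
    return cited_by


# Invented sample data: three indexed full texts and their extracted references.
references = {
    "paper_a": ["paper_c", "book_x"],
    "paper_b": ["paper_c"],
    "paper_c": ["book_x"],
}
indexed = set(references)  # documents available as full text in the index
cited_by = build_cited_by(references)

# Works that are cited but not indexed themselves are known only through their
# bibliographic data, which corresponds to the citation-only entries.
citation_only = sorted(set(cited_by) - indexed)
```

The sketch also shows why the citation count depends on coverage: a citing document that is missing from `references` contributes nothing to `cited_by`.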

System architecture

Hardware and infrastructure

Google uses the existing infrastructure of its data centers to offer the Google Scholar service. Google operates data centers around the world in which the huge volume of data is stored and search queries are processed. This distributed data storage is managed by the Bigtable database software.

Web crawler

The web crawlers follow links to freely available websites, which they search for scholarly documents. Thanks to the agreements with specialist societies and publishers, Google's crawlers can do this not only on the free web but also on the protected pages of the contractual partners. The crawlers extract the bibliographic data of the documents they find as well as the citations these contain, using special algorithms. As usual with Google, the results are not reviewed by humans. In contrast, other content providers such as hosts and libraries, which offer specialist databases, library catalogs and virtual subject libraries, create their metadata records entirely manually or semi-manually with the help of trainable indexing programs.

Link resolver

However, the crawlers are not given access to library data. The necessary data from cooperating libraries can only be accessed via a link resolver, which serves as the interface to the libraries' electronic offerings. This requires changes to the link resolver by its provider. Once these are made, Google Scholar can forward a library user from the hit list to the full text.

This makes it possible to obtain the required information about the licensed documents, such as the provider and the license period, as well as the link to the full text, from the library catalog. For this, an XML file is required on the library website; it is generated daily from the internal configuration files of the link resolver in use. It contains the title of the journal, its ISSN and information on the subscription period, consisting of the year, volume and issue number of the first and last licensed issue. The library can also add comments about gaps in its holdings or access restrictions. To help libraries create this file, Google Scholar provides a sample file.
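
To make the contents of such a holdings file more concrete, the following sketch writes the fields named above to XML using Python's standard library. The element and attribute names ("holdings", "journal", "first", "last" and so on) are hypothetical and do not reproduce the schema of Google's sample file; the subscription period in the sample data is invented as well.

```python
import xml.etree.ElementTree as ET


def holdings_xml(journals):
    """Serialize journal holdings to an XML string.

    Assumption: the element and attribute names are placeholders for
    illustration, not the schema actually required by Google Scholar.
    """
    root = ET.Element("holdings")
    for j in journals:
        item = ET.SubElement(root, "journal")
        ET.SubElement(item, "title").text = j["title"]
        ET.SubElement(item, "issn").text = j["issn"]
        # First and last licensed issue, each given as (year, volume, issue).
        for tag in ("first", "last"):
            year, volume, issue = j[tag]
            ET.SubElement(item, tag, year=str(year), volume=str(volume), issue=str(issue))
        # Optional free-text note on gaps in the holdings or access restrictions.
        if j.get("comment"):
            ET.SubElement(item, "comment").text = j["comment"]
    return ET.tostring(root, encoding="unicode")


# Invented sample record; only the journal title and ISSN are real.
xml_text = holdings_xml([
    {
        "title": "Scientometrics",
        "issn": "0138-9130",
        "first": (2005, 62, 1),
        "last": (2018, 117, 3),
        "comment": "Campus network only",
    }
])
```

A library would regenerate such a file daily from its link resolver configuration, as described above, rather than maintain it by hand.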

Hit display and search

Every search looks for matching documents and for all documents in which these are cited. Indexed publisher versions, if any, form the main hit. The value-added services described above are offered at the end of the display for each hit.

The hit list can be restricted further. The earliest year of publication can be set using a pull-down menu. A second menu offers the option of including citations in the hit list or of showing only hits that have at least an abstract. With this setting, hits without abstracts as well as citations can be excluded. Google Scholar does not, however, offer users any further sorting options. At this point it does offer an alerting service, which notifies a user by email about newly indexed documents that match the search query. The query entered is transferred to the "Notification query" field. After any necessary changes to the query and the entry of an email address, the alert is set up by clicking "Create alert".

Google Scholar offers a simple search, an advanced search, and a search with operators within the simple search. Certain settings can be made in advance for all of these: the language of the documents and of the user interface, and the number of hits per page. The home library for the library link function can also be selected in the settings. A further setting concerns reference management: under "Bibliography Manager", users can select the format in which they want to import data into their reference management software.

Simple search

In the simple search, several search terms can be entered one after another; they are automatically combined with "AND". A phrase search is possible by enclosing the search terms in quotation marks. When searching by author name, it does not matter whether the name is entered as “Last Name First Name” or “First Name Last Name”. However, to find all of an author's documents, the query must include the author's first name(s) both written out in full and abbreviated to initials. The names of several people can also be entered in the search field.
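
The need to search for both forms of an author's name can be sketched with a small helper that generates the two variants. The function name and the spacing of the initials are assumptions made for illustration; how Google Scholar itself normalizes names is not documented here.

```python
def author_name_variants(first_names, last_name):
    """Return the author's name with full first names and with initials only.

    Assumption: initials are written as single letters separated by spaces.
    """
    full = " ".join(first_names)
    initials = " ".join(name[0] for name in first_names)
    return [f"{full} {last_name}", f"{initials} {last_name}"]


variants = author_name_variants(["Stephen"], "Hawking")
two_first_names = author_name_variants(["Stephen", "William"], "Hawking")
```

Each variant would then be entered as its own phrase, so that documents indexed under either form of the name are found.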

Advanced search

The advanced search offers several input fields that make it easy to use the Boolean operators. In the field "with all words", the terms are automatically combined with AND and searched for in all fields of the database. A phrase search is possible in the field "with the exact phrase". Synonyms, near-synonyms or terms in other languages can be searched together using “any of the words”. Hits containing certain terms can be excluded using the “without the words” field, which corresponds to the operator "NOT".

The search can cover the entire full text or be restricted to the title of the article. Google Scholar does not support searching only in the metadata of an intellectually indexed document. Further restrictions are possible by year of publication or a period, and by the publication in which a document appeared, e.g. a specialist journal. Note, however, that not all indexed documents contain a year; such documents are not taken into account in a search restricted by date. It is also possible to search explicitly in the "Author" metadata only. The searches with different variants of the author's name described above are also necessary in the advanced search.

Command-based search

The restrictions described under "advanced search" can also be applied by entering the corresponding operators, either as symbols or as words in capital letters, in the input field of the simple search.

Terms are automatically combined with AND simply by entering them one after another. The operator "AND" or the plus sign forces the inclusion of letters, numbers and common words (stop words), which would otherwise be ignored in the search.

The minus sign or the word "NOT" excludes the term that follows it, so that documents containing this term are removed from the hit list. The third Boolean operator, "OR", can only be entered as a word. As described above, it can be used to cover synonyms, near-synonyms or translations of terms in a single search and thus to achieve broader thematic coverage with one query.

Further operators are “author”, “allintitle”, “filetype” and “site”. They restrict the search to the author or title metadata of a document, to the document format, or to the source, such as a URL. The operator "allinurl" known from Google web search is not supported by Google Scholar. Overall, Google Scholar offers fewer search options for scholarly research than specialist databases do. For example, the metadata created by specialist publishers and societies, such as abstracts and keywords, are not taken into account by the search engine.
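
How the advanced search fields map onto a single command-based query can be sketched as follows. The helper and its parameter names are hypothetical; it only assembles a query string from the operators described above and does not contact Google Scholar. Quoting the author name as a phrase is an assumption of this sketch.

```python
def build_query(all_words=(), phrase=None, any_words=(), without=(), author=None, site=None):
    """Assemble a command-based query string from advanced-search style inputs."""
    parts = list(all_words)  # terms listed one after another are ANDed implicitly
    if phrase:
        parts.append(f'"{phrase}"')  # phrase search via quotation marks
    if any_words:
        parts.append(" OR ".join(any_words))  # OR must be written as a word
    parts.extend(f"-{word}" for word in without)  # minus sign excludes a term
    if author:
        parts.append(f'author:"{author}"')  # restrict to the author metadata
    if site:
        parts.append(f"site:{site}")  # restrict to a source
    return " ".join(parts)


query = build_query(
    all_words=["citation"],
    phrase="search engine",
    any_words=["ranking", "relevance"],
    without=["patent"],
    author="S Hawking",
)
```

The resulting string is what a user would type into the simple search field instead of filling in the separate fields of the advanced search.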

Example

The Google Scholar services described above are demonstrated here with an example search. The author name Stephen Hawking is entered in the "simple search" field. The search returns 29,100 hits (as of July 2015). Only the first five pages of the hit list show thematically related documents. These are almost exclusively in English, which demonstrates the strong concentration of the index on the English-speaking world.

To the right of the simple search field is the link to the "advanced search", which offers various input fields for formulating the query more precisely. To search for publications by Stephen Hawking, his name must be entered as a phrase in the "Article written by" field. This search produced 790 hits. As described above, a command-based search is also available via the simple search field. Here the author operator must be used to search for author names. The search query is: author: Stephen Hawking. This also finds 790 hits, since the queries of the advanced and the command-based search are identical.

The ways of restricting the hit list have been outlined above. The structure of a short hit display is now described using a document from the hit list of the search presented:

[PDF] The big hit
S Hawking ... -2010 - buchliebling.com View
, quite different even from the picture we might have drawn a decade or two ago. Nevertheless, the first drafts of the new concept go back almost a hundred years. According to the traditional view of the universe moving ...
Cited by: 5 - Similar articles - HTML version - All 7 versions

First, Google Scholar shows the title of the hit, which links to the indexed document. Next, the bibliographic information extracted from the document is presented. As this example shows, the metadata can be so sparse that they are insufficient for citing the document in a scholarly paper. An excerpt from the full text is then offered to help assess the document. In the last line, Google Scholar offers the value-added services already presented.

Clicking "Cited by: 5" displays the publications that cite this work as a short hit list. The "Similar articles" link likewise takes the user to a hit list of documents that deal with the same topic. Since this hit is in PDF format, Google Scholar also offers to display it as HTML. Seven other versions of the document were recognized, which are grouped under the link "All 7 versions". Further value-added services are the library search and the library link. A search in WorldCat is offered if the hit is a printed work (usually a book). If the Google Scholar user is also a user of a library that cooperates with the search engine, the "library link" is also offered in the bottom line. As described above, the availability of a licensed electronic version of the article is checked and, where one exists, the hit is linked directly to the full text.

Criticism

Positive review

The appeal of searching for scholarly documents with Google Scholar lies in its ease of use, the clear presentation of hits and the processing speed. The presumably enormous size of the index, and thus of the search space covered, and the familiar quality of the ranking are essential to the success of the search engine. In addition, the search engine can be used intuitively; knowledge of thesauri, classifications or other controlled vocabularies is not necessary.

These characteristics have made Google Scholar an important and heavily used competitor of established academic search services. The cooperation with libraries and the link to WorldCat also contributed to this. It should be emphasized in this context that the academic search engine Bielefeld Academic Search Engine (BASE) integrates results from Google Scholar into its search results.

Google Scholar makes both full texts and bibliographic data accessible. Its importance lies in opening up parts of the Invisible Web. Through cooperation with publishers and others, documents are indexed that are hidden in databases and normally not accessible to web crawlers. Together with the indexing of free web content, the search engine can offer direct access to countless full texts or at least provide bibliographic records of them. For full texts that are subject to a fee, an abstract is available that can be used to assess the relevance of the document before paying the license fee. In addition, the records extend beyond Google Scholar's actual search space: the extraction of citations reveals works, with their bibliographic information, that are not available digitally.

Listing the citations can help to find thematically related documents on the Internet, because the citing sources can be browsed. The same applies to the “cited by” function, which immediately provides additional sources on a topic.

If the sources are not available digitally, Google Scholar often offers forwarding to WorldCat or the library link. This link is very useful for library users who work with Google Scholar.

The search engine is free of charge and, with its claim to index scholarly literature, competes with commercial database providers and full-text archives. Through its citation analysis of web citations, Google Scholar can be seen as an alternative to (not necessarily a competitor of) the established but expensive Science Citation Index and Scopus.

The interdisciplinary design of the search service increases the visibility of publications across disciplines. Google Scholar assesses the scholarly character of documents on the basis of their layout. The search engine indexes journals that are not covered by the Science Citation Index because of its strict selection criteria; this mainly applies to open access journals. This increases the visibility of journals and authors on the Internet and can be described as a “democratization” of the science system.

Indexing Internet resources with web crawlers has the advantage that only one index has to be queried during a search. This also makes it easier to update the data and is an advantage over meta search engines. In addition, the hits are shown immediately, regardless of which data provider they come from.

Negative review

Google Scholar's information policy is criticized. Users are not told which criteria underlie the assessment of scholarly character and the ranking. Only vague statements are made about the exact target group; in principle, the search engine is aimed at anyone looking for scholarly literature. It also remains unclear which databases are indexed. Nothing is disclosed about the depth of indexing or about possible gaps in the coverage of the full-text offerings of the cooperation partners. The size of the database and the update frequency also remain unknown.

It must also be viewed critically that Google Scholar treats student work and PowerPoint presentations as scholarly forms of publication. Mixing these documents with journal articles and their preprints means that the formal and subject quality of the hits varies. Students with no experience in literature searching find it particularly difficult to identify suitable, high-quality sources. In addition, including lecture materials and preprints creates the problem of duplicates and near-duplicates, since the different versions must be identified as belonging together and grouped under the most current version.

However, this assumes that the data is correctly recognized during indexing. The index data is extracted automatically from the full text using algorithms and used for all services. The only basis is the layout of the documents. If the data is read incorrectly in this process or is not classified in the correct category, the quality of all services offered will decrease.

But incorrectly indexed data are not the only problem. Since citation frequency is determined exclusively from the indexed sources, non-indexed documents cannot contribute to it, which distorts the picture. If the citations of a title are not included in the index, the title is ranked lower and appears further down the hit list, even though its content may fit very well. In addition, the reliability of the citation extraction and analysis mechanism is disputed because of its susceptibility to errors. The citation count determined by Google Scholar is not always correct and, as just mentioned, cannot include all citing works. The actual relevance of a hit therefore cannot be read from the citation count.

The bibliographic information for all document types is also very brief in the hit information. In addition, they are often incorrect in terms of form and content due to the indexing and extraction algorithms described. They hardly meet the demands of scientific work.

In addition, users depend heavily on the ranking of the hits, since Google Scholar offers no options for sorting them. Only citations, or documents published before a selected year, can be excluded. The lack of human control is problematic in this context: the algorithms alone determine which documents are indexed and which ranking value they receive.

Google Scholar's restriction to the indexing of full texts must be viewed critically. Keywords, notations or abstracts that accompany high-quality articles from specialist journals are not indexed and are therefore ignored entirely. Google Scholar thus misses an opportunity to increase search precision. Nor are the indexed documents processed further using stemming.

The search tools that Google Scholar offers are very limited. Restrictions are possible only by author, journal and year of publication, which does not meet the requirements of a scholarly subject search. In addition, a search restricted by date excludes sources without a publication date from the hit list. This restriction is therefore unsuitable for both precise and exhaustive searches. A further point of criticism is that neither truncation nor masking is possible.

Google Scholar also does not support several search restrictions that users of Google web search are used to, including the operators "allinurl" and "filetype". Besides the Boolean operators, Google Scholar supports only "allintitle", "site" and "author". In addition, thematic searches are possible only via keywords, which is insufficient. The multidisciplinary approach of Google Scholar is a further disadvantage. The German-language version of Google Scholar has no subject restrictions and can only be searched across all disciplines, whereas the English version offers seven general research areas for limiting the search space. Nor can the search be restricted by document quality; it would be useful to be able to limit the search to certain document types or to exclude types. The limited search options are also partly unreliable, since they rest exclusively on automatically selected, indexed and evaluated data.

In summary, the limited search options of Google Scholar cannot replace searching in specialist databases. Thesauri, classifications and abstracts offer good search options, especially for thematic searches, that Google Scholar does not use. The lack of truncation is also a clear disadvantage compared to specialist databases. Google Scholar should not be used for a literature search that aims at completeness or precision. It is, however, well suited to getting started on a topic and to locating full texts from bibliographic information.
