Bielefeld Academic Search Engine

Bielefeld Academic Search Engine (BASE)
Website: www.base-search.net
Description: Internet search engine
Registration: optional
Languages: Chinese (simplified), German, English, French, Spanish (Castilian), Polish, Modern Greek, Ukrainian
Owner: Bielefeld University Library
Content providers: mainly institutional repositories
Launched: June 24, 2004
Items: over 150 million
Status: online

BASE (Bielefeld Academic Search Engine) is a search engine for scholarly documents. It is operated by Bielefeld University Library and built on the open-source search software Solr/Lucene. BASE is developed continuously as a strategic project.

Target group and objectives

BASE is aimed primarily at researchers in universities and research institutions and at students. In developing BASE, the university library pursues the goal of building a reliable, high-quality search service for research and teaching based on search engine technology.

BASE aims to provide access to the content of scholarly document servers that is made freely available via OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) as part of the Open Access movement. The search engine is registered as an official OAI service provider and took part in the EU project DRIVER (Digital Repository Infrastructure Vision for European Research), which was completed in 2009.

Because its sources are selected intellectually (that is, by human review), BASE aims to deliver subject-qualified information together with extensive, high-quality metadata, and thereby to set itself apart from commercial search engines.

Development history

Chronology

June 2001: A new concept emerges from shortcomings identified in a metasearch environment, using the library portal "Digitale Bibliothek NRW" as an example: development of a non-commercial search engine for scholarly use
Feb. 2002 to Aug. 2002: Evaluation of search engine technology
Summer 2003: Start of technical implementation; development of a prototype (math demonstrator)
Oct. 2003: Announcement of the collaboration between Bielefeld University Library and the company FAST: start of a strategic partnership to test and promote enterprise search technologies; agreement on the use of the "FAST Data Search" system
Spring 2004 (March): Completion of the trial phase
June 2004: Launch of the Bielefeld Academic Search Engine
Aug. 2004: Integration of further sources (university publication servers, OAI sources, non-OAI-compatible sources); first indexing of full texts (electronic dissertations of the Ruhr University Bochum)
Aug. 2005: New options for search refinement (restriction to a data source), different sort orders for hits, search history of executed queries
Feb. 2006: Replacement of the single-server solution with a server farm (6 Linux machines)
March 2006: Integration of per-hit links to the scholarly search engine Google Scholar
June 2006: Start of participation in the EU project DRIVER (Digital Repository Infrastructure Vision for European Research)
May 2007: Search for similar word forms
July 2007: Over 100 German repositories in BASE; introduction of a public test area, BASE Lab
Oct. 2007: Multilingual search (Eurovoc thesaurus)
July 2008: Search results can be imported into reference management programs via Firefox browser extensions
Jan. 2009: Website relaunch; filtering by document type in the advanced search
Aug. 2010: More than 25 million documents in the BASE index
Feb. 2011: Preparation of the platform change from FAST to Lucene/Solr
May 2011: Release of the BASE index built with Lucene/Solr
Aug. 2011: More than 30 million documents from over 2,000 sources in the BASE index
Jan. 2012: Mobile version for smartphone users
Apr. 2012: Option to set up a personal login
July 2012: Separate search interface for document servers located in Germany
Aug. 2013: Open Access documents and sources are marked with corresponding symbols; more than 50 million documents from 2,700 sources in the BASE index
Nov. 2013: More than 3.3 million documents from CiteSeerX indexed for the first time
June 2014: More than 3,000 sources and 60 million documents in the index
Sep. 2014: Open Access documents are boosted in the relevance ranking (the boost can be switched off); on a test basis in BASE Lab from July 2014, in regular operation from September 23, 2014
Aug. 2015: Search filters for re-use (license) and access (e.g. Open Access) can be selected in the advanced search
Oct. 2016: More than 100 million documents in the search index for the first time
May 2017: Users can "claim" their own publications, i.e. link them with their ORCID

Content

Scientific internet sources

The content of BASE is multidisciplinary. Only scholarly sources are included. BASE aims to tap into "Internet sources of the 'invisible web' which are not indexed in commercial search engines or which are lost in their large numbers of hits".

Choice of sources and transparency

All sources searched are selected and checked intellectually. A list of sources makes the selection transparent. In addition to the indexed sources, more than 1,000 further sources with over 30 million documents have been harvested but are, for various reasons, not suitable for indexing.

Timeliness and scope

The index is updated daily; the contents of individual document servers are updated weekly.

149,820,832 documents from 7,188 sources are currently searchable via BASE. The number of documents and sources has grown steadily since BASE went into production, and the index continues to be expanded. Operators of repositories that do not appear in the list of sources are asked to contact the BASE team.

Country coverage and languages

Sources by country

In total, there are sources from 132 countries in the index. The countries with more than 100 indexed sources (repositories) are:

Country Sources Documents
United States 1,112 46,783,311
Indonesia 927 1,953,101
Japan 565 2,647,278
Germany 406 23,498,271
United Kingdom 342 5,309,052
Brazil 321 2,542,618
Spain 264 5,766,729
France 201 11,469,711
Russia 192 2,289,342
Canada 182 1,785,127
Italy 171 5,285,122
India 157 632,818
Ukraine 136 818,900
Colombia 133 599,628
Peru 121 287,250
Turkey 116 939,118
Poland 114 3,622,417

Sources by continent

European sources are the most numerous, followed by those from Asia, North America, South America, Australia and Africa.

Continent Sources Documents
Europe 2,756 73,949,899
Asia 1,972 9,672,137
North America 1,405 49,128,728
South America 784 4,633,903
Australia / Oceania 113 3,287,447
Africa 112 862,943
International / not assigned 46 8,285,775

All figures as of July 25, 2019.
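As a quick consistency check, the per-continent figures can be summed and compared with the overall totals given under "Timeliness and scope" (7,188 sources and 149,820,832 documents). The following Python sketch does only that arithmetic; the numbers are copied from the continent table and nothing is retrieved from BASE itself.

```python
# Per-continent figures copied from the continent table (sources, documents).
continents = {
    "Europe": (2756, 73_949_899),
    "Asia": (1972, 9_672_137),
    "North America": (1405, 49_128_728),
    "South America": (784, 4_633_903),
    "Australia / Oceania": (113, 3_287_447),
    "Africa": (112, 862_943),
    "International / not assigned": (46, 8_285_775),
}

# Sum both columns across all continents.
total_sources = sum(sources for sources, _ in continents.values())
total_documents = sum(documents for _, documents in continents.values())

print(total_sources)    # 7188
print(total_documents)  # 149820832

# Share of all documents contributed by each continent.
for name, (_, documents) in continents.items():
    print(f"{name}: {documents / total_documents:.1%} of documents")
```

Both sums match the totals quoted in the text, so the continent table and the overall figures describe the same state of the index.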

Documents by language

The most common languages, each with more than 250,000 indexed documents, are:

Language Documents
English 62,893,004
French 8,884,514
Spanish 7,045,578
German 5,406,771
Portuguese 3,305,663
Polish 2,679,494
Italian 2,217,286
Japanese 2,070,289
Russian 1,499,516
Chinese 1,465,775
Latin 790,911
Norwegian 785,804
Dutch 717,558
Bahasa Indonesia 707,180
Ukrainian 569,819
Turkish 551,805
Modern Greek 462,023
Catalan 456,335
Czech 438,857
Swedish 430,280
Finnish 426,815
Hungarian 412,687
Danish 409,035
Croatian 299,451

About 1/3 of all sources are not assigned to any language.

Access to the indexed documents

BASE does not index Open Access material exclusively; it also lists documents whose full text is not freely accessible. A hit list can be restricted to documents clearly classified as Open Access. At present, only 45% of the indexed documents can be unambiguously marked as Open Access by BASE, although the actual share of freely accessible documents is around 60%. The labeling of Open Access at document level is to be expanded. Since July 2014, Open Access documents have received a boost factor in the relevance ranking, so they tend to appear higher in the result list. This function can be switched off.
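The effect of such a boost factor can be illustrated with a small Python sketch. It is not BASE's ranking code: the `Hit` class, the scores and the factor `OA_BOOST = 1.2` are hypothetical values chosen for illustration, since the article does not state the actual factor or scoring formula.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    title: str
    base_score: float  # relevance score before any boost (made-up values)
    open_access: bool  # True only if the document is clearly marked Open Access


OA_BOOST = 1.2  # hypothetical factor; the real value is not given in the article


def rank(hits, boost_open_access=True):
    """Sort hits by score, optionally multiplying Open Access hits by OA_BOOST."""
    def score(hit):
        if boost_open_access and hit.open_access:
            return hit.base_score * OA_BOOST
        return hit.base_score
    return sorted(hits, key=score, reverse=True)


hits = [
    Hit("Paywalled article", 1.0, False),
    Hit("Repository preprint", 0.9, True),
]

# With the boost, the preprint scores 0.9 * 1.2 = 1.08 and moves to the top.
print([h.title for h in rank(hits)])
# With the boost switched off, the original order is kept.
print([h.title for h in rank(hits, boost_open_access=False)])
```

A multiplicative boost only shifts documents whose scores are already close, which fits the description that Open Access documents "tend to" appear higher rather than always ranking first.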

Functions

User interface and navigation

The accessible user interface of BASE is simple and clear. The search interface is available in Chinese (simplified script), German, English, French, Greek, Polish, Spanish (Castilian) and Ukrainian. Information about BASE is available in German and English.

The start page offers a search of the BASE index (standard search). From there, users reach the other functions of BASE: advanced search, help, browsing and search history, as well as the mobile version. These options sit in a header bar that is identical on all search pages, so switching between functions is easy. Below the search box are links to, among others, About BASE (general information about the search portal), the BASE blog and the Twitter channel.

Research functionality

Standard search: Deliberately modeled on Google's successful approach, the standard search presents the user with a single search field for free-text search. Using a syntax explained in the help pages, searches for individual terms can be restricted to individual metadata fields. Wildcards can be used in search terms for right truncation.

In addition, the standard search offers the option of automatically expanding the search terms to other word forms (lemmatization).

Advanced search: The advanced search lets users enter search terms for the following metadata fields: entire document, title, author, keywords, DOI, (part of) the URL, and publisher. Searching the entire document corresponds to the standard search. The individual metadata fields can be combined; they are automatically linked with the Boolean operator AND. Within a single field, search terms can be combined with various Boolean operators using a special syntax documented in the help pages.

In addition, the search can be restricted by origin of the sources (certain countries or continents), by year or period of publication, by document type (e.g. books, articles, dissertations, videos) and by re-use license (Creative Commons, public domain, software licenses such as GPL). The number of titles displayed per page of the hit list can also be set (10, 20, 30, 50 or 100).
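The way the advanced search combines fields can be sketched in a few lines of Python. This is an illustration under assumptions: the field names `title`, `author` and `publisher` and the Lucene-style `field:(terms)` notation are hypothetical stand-ins, not BASE's documented syntax, which is described in its help pages.

```python
def build_query(fields):
    """Join non-empty field restrictions with AND, in a Lucene-like notation.

    `fields` maps a (hypothetical) metadata field name to the terms entered
    for it; terms inside one field may carry their own Boolean operators.
    """
    clauses = []
    for field, terms in fields.items():
        terms = terms.strip()
        if not terms:
            continue  # empty form fields do not restrict the search
        clauses.append(f"{field}:({terms})")
    return " AND ".join(clauses)


query = build_query({
    "title": "open access OR repository",
    "author": "Lossau",
    "publisher": "",
})
print(query)  # title:(open access OR repository) AND author:(Lossau)
```

Keeping each field's terms in parentheses means an OR inside one field cannot leak into the AND that links the fields.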

Results display: The search results are shown in a list that is sorted by relevance by default. Relevance is determined by various criteria; for example, it makes a difference whether the search term occurs in the title or only elsewhere. The default ranking can be changed to a user-defined sort by author, title or year of publication, in either ascending or descending order.

Each search result contains, where available, extensive, high-quality metadata (for example, in addition to title and author: keywords, publisher, source, language, abstract, URL). Each hit also includes:

  • a link to the original document (metadata or electronic full text),
  • a link to a new search query for the author,
  • a link to the data provider,
  • a link to a search query in Google Scholar (searching for the title in Google Scholar can reveal linked citations or different versions of the work),
  • a link for export by e-mail and to reference management programs,
  • a link to add the hit as a favorite in the personal profile (with login).

If the hit list is too long, it can be narrowed by author, keyword, Dewey Decimal Classification, year of publication, source, language, document type, access (open access / unknown) or re-use (license). Only one option can be selected from the drop-down menus at a time.

In addition, the search queries of the current session are shown in a search history, from which they can be run again at any time. With a personal login, search queries can also be saved permanently. Searches can furthermore be subscribed to as an RSS or Atom web feed, and search results can be sent by e-mail or saved. A personal login is also required for the latter.

A new search can be triggered directly from the hit list by changing the current search query.

Browsing

In addition to search, BASE offers browsing by Dewey Decimal Classification (DDC), document type, re-use / license, and access. The DDC class of a document is determined in two ways: some data sources already supply DDC numbers, which are carried over directly into browsing; in addition, documents are classified automatically within BASE. The technology used for this was developed as part of the DFG-funded project "Automatic enrichment of OAI metadata".

Discontinued projects

BASE DE

A separate search interface allowed searching specifically in sources whose document servers are located in Germany. It was intended to provide a national index of OAI metadata. This so-called "Germany View" comprised around 6,300,000 documents from over 250 sources.

BASE Lab

With BASE Lab, BASE offered a public test area in which new functions could be tried out. The following functions first appeared there:

  • Use of computational linguistic processes for the automatic classification of OAI metadata within the framework of the DFG project "Automatic enrichment of OAI metadata with the aid of computational linguistic processes and development of services for the content-oriented networking of repositories".
  • Development of a service for the provision of aggregated and normalized OAI metadata
  • Expansion of the labeling of open access documents
  • Higher weighting of open access documents

Technical basics

Search engine technology

The technical basis is the search engine software Solr together with VuFind. It enables:

  • the use of linguistic methods to optimize search queries (e.g. lemmatization, decomposition of compounds, permutations)
    Search terms are expanded to other word forms (plural, genitive) by means of automatic language detection and dictionaries.
  • relevance ranking of search results
    Relevance is determined by an algorithm contained in the software.
  • subsequent narrowing of the hit list by certain criteria (author, keyword, year of publication, source, language and document type).

Integration of the data sources

The data is integrated into the search engine via different interfaces:

  • As a rule: OAI harvesting
    Metadata from selected OAI document servers is integrated via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).
  • In special cases: web crawlers
    Content from scholarly websites is collected by a dedicated web crawler. The full-text data gathered in this way is analyzed for the metadata it contains.

The data, most of which is collected in Dublin Core format, is very heterogeneous and therefore has to undergo extensive normalization before indexing.
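The harvesting and normalization steps described above can be sketched in Python. This is a minimal illustration, not BASE's pipeline: the repository address `https://repository.example.org/oai`, the sample record and the small `LANGUAGE_CODES` mapping are made up for the example. Only the request arguments `verb=ListRecords` and `metadataPrefix=oai_dc` and the two XML namespaces come from the OAI-PMH and Dublin Core specifications. No network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

DC = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core element namespace


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


# Made-up Dublin Core record with typical irregularities:
# stray whitespace, a full date instead of a year, an upper-case language code.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>  An Example   Thesis </dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:date>2011-05-18</dc:date>
  <dc:language>ENG</dc:language>
</oai_dc:dc>"""

# Hypothetical mapping of language codes onto two-letter codes.
LANGUAGE_CODES = {"eng": "en", "en": "en", "ger": "de", "deu": "de", "de": "de"}


def normalize(record_xml):
    """Reduce one Dublin Core record to a uniform dictionary."""
    root = ET.fromstring(record_xml)

    def values(tag):
        # Collapse runs of whitespace and skip empty elements.
        return [" ".join(e.text.split()) for e in root.iter(DC + tag) if e.text]

    dates = values("date")
    year = int(dates[0][:4]) if dates and dates[0][:4].isdigit() else None
    languages = [LANGUAGE_CODES.get(v.lower(), v.lower()) for v in values("language")]
    return {
        "title": values("title"),
        "creator": values("creator"),
        "year": year,
        "language": languages,
    }


print(list_records_url("https://repository.example.org/oai"))
print(normalize(SAMPLE))
```

Real repositories differ in many more ways (multiple date formats, free-text language names, missing fields), which is why normalization across thousands of sources is far more involved than this sketch suggests.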

Interfaces to third-party providers

BASE enables a direct search for individual titles in Google Scholar through links in the hit lists. If BASE is used on site in a library, links in the Google Scholar hit lists can lead to the full text offered by the library. This requires corresponding configuration on the library's side.

Interfaces for the re-use of BASE services and data

BASE offers several programming interfaces:

  • The search or HTTP interface is a REST API for direct searches in the BASE index via Solr. Use is free of charge for non-commercial projects and only requires the registration of a fixed IP address.
  • The OAI-PMH API gives project partners and selected non-commercial projects access to up-to-date normalized BASE data (or thematic subsets).
  • An HTML form can be embedded as a search box on one's own website to search BASE without any programming effort.
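A request to a search interface of this kind typically amounts to an HTTP GET with the query encoded in the URL. The Python sketch below only builds such a URL. The endpoint `https://base-api.example.org/search` and the parameter names `query`, `hits` and `offset` are placeholders for illustration, not the documented BASE API; the real endpoint and parameters are provided with the interface documentation after registration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Placeholder endpoint, NOT the real BASE address.
API_ENDPOINT = "https://base-api.example.org/search"


def search_url(query, hits=10, offset=0):
    """Build a GET URL for a search request (parameter names are assumptions)."""
    if hits < 1 or offset < 0:
        raise ValueError("hits must be positive and offset non-negative")
    params = {"query": query, "hits": hits, "offset": offset}
    return API_ENDPOINT + "?" + urlencode(params)


url = search_url("open access repositories", hits=20)
print(url)

# Decode the URL again to confirm that the query survives URL encoding.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])  # open access repositories
```

Because access is tied to a registered fixed IP address, a client would normally run on a server with a stable address rather than in a user's browser.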

Re-use by third parties

Integration in specialist portals

BASE is integrated into the metasearch of several German subject portals. The subject portal paedagogik.de, Germanistik on the net, ilissAfrica, vifabio, the virtual subject library medien buehne film and Livivo (ZB MED) either integrate the full BASE index or filter the search query by a selection of repositories that match the respective subject. Because BASE harvests not only classic university publication servers but also platforms with digitized photos, maps and other source materials, it also opens a route to primary research data and virtual research environments.

Use by open access services

BASE is a primary source for the web service dissem.in, which helps authors discover which of their own publications are (still) hidden behind a paywall even though the authors are permitted to offer them for free download.

In a similar way, the web-based altmetrics service Impactstory uses BASE to check whether a freely available version of an article exists, in the sense of the green road to open access.

The alternative DOI resolvers doai.io and oadoi.org use BASE to find freely available versions (e.g. preprints / eprints) of articles that are otherwise available only for a fee or with a campus license.

The browser plug-in Unpaywall uses BASE data to display a link to a legal, free version of the same work (if available) when a user encounters a paywall on a scholarly article.

Use by Discovery Services

The EBSCO Discovery Service (EDS) has integrated the data collected and processed by BASE into its service since December 2015.

Use by other search engines

BASE is a source that is active by default in the non-commercial German metasearch engine MetaGer and, since mid-2016, is also available in the metasearch engines etools.ch (optional) and Searx (in the Science tab). BASE can also be searched through the bibliographic metasearch Karlsruhe Virtual Catalog.

Comparable offers

Services similar to BASE are the British CORE (COnnecting REpositories) and OAIster, originally developed at the University of Michigan and now part of OCLC. Both are much smaller. Comparable commercial search engines with a scholarly focus, but lower metadata quality, are Google Scholar and Microsoft Academic Search.
