Nutch

Basic data

Developer: Apache Software Foundation
Current version: 2.4 (October 11, 2019)
Operating system: Platform-independent
Programming language: Java
Category: Crawler, parser, and search engine
License: Apache License
Available in German: No
Website: nutch.apache.org

Nutch is a Java framework for Internet search engines. The software is open source and is developed within the Apache Software Foundation under the Apache License. Nutch is based on, among other things, Lucene (stemming, indexing, etc.), Solr (web functionality), and Hadoop (scaling).

Nutch can search arbitrarily large amounts of data. Thanks to its plug-in architecture, it can be adapted to company-specific needs, for example to support additional document formats.
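The idea of supporting additional document formats through plug-ins can be illustrated with a minimal Java sketch. This is not Nutch's actual plug-in API: the names `DocumentParser` and `ParserRegistry`, and the choice of MIME types as lookup keys, are assumptions made only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class ParserPluginSketch {

    /** Hypothetical plug-in contract: turns the raw bytes of one format into plain text. */
    interface DocumentParser {
        String parse(byte[] content);
    }

    /** Hypothetical registry that maps a MIME type to the plug-in handling it. */
    static final class ParserRegistry {
        private final Map<String, DocumentParser> parsers = new HashMap<>();

        void register(String mimeType, DocumentParser parser) {
            parsers.put(mimeType, parser);
        }

        /** Returns the extracted text, or empty if no plug-in is registered for the type. */
        Optional<String> parse(String mimeType, byte[] content) {
            DocumentParser parser = parsers.get(mimeType);
            return parser == null ? Optional.empty() : Optional.of(parser.parse(content));
        }
    }

    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();

        // Each new format is added by registering one more plug-in; the core stays unchanged.
        registry.register("text/plain", bytes -> new String(bytes, StandardCharsets.UTF_8));
        registry.register("text/csv",
                bytes -> new String(bytes, StandardCharsets.UTF_8).replace(',', ' '));

        byte[] csv = "apache,nutch,crawler".getBytes(StandardCharsets.UTF_8);
        System.out.println(registry.parse("text/csv", csv).orElse("(no parser)"));
        System.out.println(registry.parse("application/pdf", new byte[0]).orElse("(no parser)"));
    }
}
```

In this sketch the core code only looks parsers up by type, so handling a company-specific format means registering one more implementation rather than changing the core.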

The German Federal Office of Consumer Protection and Food Safety operated the Nutch-based “consumer search engine” Clewwa. The Wikia Search search engine was also based on Nutch technology.

Nutch is currently maintained in two version lines:

  • 1.x: A ready-made crawler that allows very fine-grained configuration and relies on Apache Hadoop data structures; it is well suited to batch processing.
  • 2.x: Offered as an alternative to 1.x. The main difference is the storage layer, which has been abstracted and uses Apache Gora to map objects to persistent storage. This increases flexibility in what can be stored (e.g. status, content, links, processed text) and in how it is stored, for example in NoSQL solutions.
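The storage abstraction described for 2.x can be sketched roughly as follows. This is an illustrative sketch only and does not use the Apache Gora API: `PageRecord`, `PageStore`, `InMemoryPageStore`, and `markFetched` are hypothetical names, and the in-memory map merely stands in for a real backend such as a NoSQL store.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class StorageSketch {

    /** Hypothetical per-URL record; the fields mirror two of the examples in the text. */
    static final class PageRecord {
        final String status;
        final String content;

        PageRecord(String status, String content) {
            this.status = status;
            this.content = content;
        }
    }

    /** Hypothetical storage contract that hides the concrete backend. */
    interface PageStore {
        void put(String url, PageRecord page);
        Optional<PageRecord> get(String url);
    }

    /** Stand-in backend; a NoSQL-backed class could implement the same interface. */
    static final class InMemoryPageStore implements PageStore {
        private final Map<String, PageRecord> rows = new HashMap<>();

        @Override
        public void put(String url, PageRecord page) {
            rows.put(url, page);
        }

        @Override
        public Optional<PageRecord> get(String url) {
            return Optional.ofNullable(rows.get(url));
        }
    }

    /** Crawler-side code depends only on the interface, not on the backend. */
    static void markFetched(PageStore store, String url, String content) {
        store.put(url, new PageRecord("fetched", content));
    }

    public static void main(String[] args) {
        PageStore store = new InMemoryPageStore();
        markFetched(store, "https://example.org/", "<html>...</html>");
        System.out.println(store.get("https://example.org/").map(p -> p.status).orElse("unknown"));
    }
}
```

Because `markFetched` sees only the `PageStore` interface, swapping the backend in this sketch would not require touching the crawling logic, which is the kind of flexibility the 2.x line aims for.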
