Web mining

Web mining is the application of data mining techniques to the (partially) automated extraction of information from the Internet, especially the World Wide Web. It adopts procedures and methods from information retrieval, machine learning, statistics, pattern recognition and data mining. Three objects of investigation can be distinguished:

The content (web content mining) - for example using information retrieval methods.
The link structure (web structure mining) - for example using webometrics methods. Web structure mining makes use of so-called hubs: good hubs point to many valuable pages, and valuable pages are pointed to by many hubs.
User behavior (web usage mining) - for example through the analysis of log files.

Types of web mining

Web usage mining tries to identify regularities in the use of websites or web resources. To do so, it processes and analyzes all secondary data that arises from the user's interaction with a web resource. Analyzing the customer journey, for example, is also part of web usage mining.
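As an illustration of analyzing log files, the following minimal sketch counts page hits and page-to-page transitions per client. It assumes server logs in Common Log Format; the log lines, the regular expression and the function names (parse_line, mine_usage) are hypothetical and chosen only for this example, and treating each host as one user is a simplification.

```python
import re
from collections import Counter, defaultdict

# Hypothetical access-log lines in Common Log Format (illustrative data only).
LOG_LINES = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2024:10:00:05 +0000] "GET /products HTTP/1.1" 200 1024',
    '10.0.0.2 - - [01/Jan/2024:10:00:07 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2024:10:00:09 +0000] "GET /cart HTTP/1.1" 200 256',
    '10.0.0.2 - - [01/Jan/2024:10:00:12 +0000] "GET /products HTTP/1.1" 200 1024',
]

# host, identity, user, [timestamp], "METHOD path protocol", status, size
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*" (\d{3}) \S+$'
)


def parse_line(line):
    """Return (host, path) for a well-formed log line, else None."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    host, _method, path, _status = match.groups()
    return host, path


def mine_usage(lines):
    """Count hits per page and transitions between consecutive pages per host."""
    page_hits = Counter()
    pages_by_host = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip lines that do not match the assumed format
        host, path = parsed
        page_hits[path] += 1
        pages_by_host[host].append(path)

    transitions = Counter()
    for pages in pages_by_host.values():
        # Pair each page with the next page requested by the same host.
        transitions.update(zip(pages, pages[1:]))
    return page_hits, transitions


page_hits, transitions = mine_usage(LOG_LINES)
print(page_hits.most_common())
print(transitions.most_common())
```

The transition counts give a very rough picture of typical navigation paths. A real analysis would usually also split requests into sessions by time and filter out requests from crawlers.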

Web structure mining tries to identify the link structure underlying a website or domain. A model is built from the topology of the website's references (hyperlinks), optionally with a description of those links. This can be useful for categorizing and ranking a website, and it allows conclusions to be drawn about similarities between websites and their relationships to one another. For example, content-rich websites (so-called authorities) and overview-like websites (so-called hubs) can be identified for a specific topic (see HITS algorithm).
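The mutual reinforcement between hubs and authorities can be sketched with a simple iterative computation in the spirit of the HITS algorithm. This is a minimal sketch, not a reference implementation: the link graph, the page names and the fixed iteration count are assumptions made for the example.

```python
import math

# Hypothetical link graph: page -> list of pages it links to.
LINKS = {
    "hub1": ["a", "b"],
    "hub2": ["a", "b", "c"],
    "a": [],
    "b": [],
    "c": ["a"],
}


def normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all scores are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Return (hub, authority) score dictionaries for the given link graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of all pages linking to it.
        auth = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        auth = normalize(auth)
        # Hub score: sum of the authority scores of all pages it links to.
        hub = {page: sum(auth[t] for t in links.get(page, [])) for page in pages}
        hub = normalize(hub)
    return hub, auth


hub, auth = hits(LINKS)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```

In this toy graph, "hub2" links to the most pages and "a" is linked to by the most pages, so they receive the highest hub and authority scores respectively.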

Web content mining deals with the detection of regularities in the content of a web resource and is an application area for text mining. The data on the web consists of unstructured data such as text documents, semi-structured data such as HTML documents, and more structured data such as tables or dynamically generated HTML pages. The content of a website comprises various types of data, such as text, images, audio, video, metadata and hyperlinks. Mining several of these data types together is known as "multimedia data mining" and can be understood as part of web content mining. However, the content of the web consists mainly of unstructured text, so text mining can be understood both as a form of web content mining and as the broader research area it draws on. The methods used are general data mining methods; statistical and computational-linguistic processes first transform the texts into a form suitable for data mining.
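The transformation of web text into a form suitable for data mining can be illustrated with a small sketch that strips the markup from an HTML page and builds a term-frequency table. The sample page, the tiny stop-word list and the function names are assumptions for this example; a realistic pipeline would use proper linguistic preprocessing such as stemming or lemmatization.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


# Hypothetical, deliberately tiny stop-word list.
STOPWORDS = frozenset({"the", "a", "of", "and", "is"})


def html_to_terms(html):
    """Extract lowercase alphabetic tokens from the visible text of a page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    return re.findall(r"[a-z]+", text)


def term_frequencies(html, stopwords=STOPWORDS):
    """Count how often each non-stop-word term occurs in the page text."""
    return Counter(t for t in html_to_terms(html) if t not in stopwords)


# Hypothetical sample page.
PAGE = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Web mining</h1>"
    "<p>Web mining is the mining of web data.</p></body></html>"
)

tf = term_frequencies(PAGE)
print(tf.most_common())
```

The resulting term counts are a simple vector representation of the page, which general data mining methods such as clustering or classification could then take as input.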

Literature

Raymond Kosala, Hendrik Blockeel: Web Mining Research: A Survey. In: SIGKDD Explorations. 2, No. 1, 2000, pages 1-10.
Marc Ehrig, Jens Hartmann, Christoph Schmitz: Ontology-based web mining. (PDF; 255 kB) In: Peter Dadam (ed.): Informatik 2004. Informatik connects. Contributions to the 34th annual conference of the Society for Computer Science. Köllen, Bonn 2004, ISBN 3-88579-380-6, pages 187-193.
Frank Bensberg: Web log mining as a marketing research tool. Gabler, Wiesbaden 2001, ISBN 3-8244-7309-7.
Markus Leibold: Web log mining in PR controlling. VDM, Munich 2006, ISBN 978-3-86550-392-3.

Web links

http://www.cs.umbc.edu/~kolari1/Mining/webmining.html - collection of links to scientific articles
http://www.mindup.de/html/web-mining.html - review article
YALE (Yet Another Learning Environment; today: RapidMiner) - free open-source software for knowledge discovery and data mining, including web mining, text mining and machine learning. Together with the likewise free WordVectorTool, YALE offers a complete free software environment for numerous web mining and text mining tasks.
Idea Web Miner - free tool for web content mining, including web log mining, web patent mining and web news mining
