Log file analysis


In log file analysis, the log files of a computer system are examined over a certain period of time according to defined criteria. Modern computer systems log a wide variety of events; log files can be found on every web server, every database and every firewall, for example. Depending on the type, content and scope of the logged data, various conclusions can be drawn from them.

Evaluation of web server log files

The statistics obtained make it possible to optimize the structure of the website. They form the basis for usability analyses or provide information about the success of a marketing campaign. To some extent, an analysis of the log files can thus be used for web controlling (web analytics).

Some possible questions

  • What are the user's IP address and host name?
  • Which browser did the user use?
  • Which page contained the link that brought the user to the site?
  • Which search engine and which keywords did the user use?
  • How long did the user stay on the site?
  • How many pages did the user view?
  • On which page did the user leave the website?
  • Which additional modules (plug-ins) has the user installed?
  • Which operating system does the user use?
  • Which websites did employee Mustermann visit during working hours? (the works council usually has to be involved here)
  • Where is the user located?

These questions can be answered primarily by evaluating the so-called communication metadata (traffic data).
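
To illustrate, the following minimal Python sketch parses one line of the widely used Apache/nginx "combined" log format and extracts the fields the questions above refer to (IP address, referrer, user agent, and so on). The sample line and field names are illustrative, not taken from any particular installation.

  import re

  # One line of the Apache/nginx "combined" log format:
  # IP, identity, user, timestamp, request, status, size, referrer, user agent.
  LINE = re.compile(
      r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
      r'"(?P<method>\S+) (?P<path>\S+) \S+" '
      r'(?P<status>\d{3}) (?P<size>\S+) '
      r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"')

  def parse_line(line):
      """Return the fields of one log line as a dict, or None if it does not match."""
      match = LINE.match(line)
      return match.groupdict() if match else None

  sample = ('203.0.113.7 - - [10/Oct/2023:13:55:36 +0200] '
            '"GET /index.html HTTP/1.1" 200 2326 '
            '"https://www.example.com/search?q=log+file+analysis" '
            '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"')

  entry = parse_line(sample)
  print(entry["ip"], entry["referrer"], entry["agent"])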

Analysis problems

The main problem in analysing web server log files is that HTTP is a stateless protocol. Every request a client makes for a web page (or for each individual graphic embedded in it, etc.) is an independent action from the web server's point of view. When a user clicks through a website, the web server therefore has no knowledge that the same user has already requested another page just before.

In order to still enable stateful HTTP, dynamically generated web pages sometimes assign a so-called session ID when the page is first requested, which the client then sends along with every subsequent request. This can be done with a cookie or with an additional parameter appended to each URI; a cookie, however, is not visible in the log file and requires separate programming for the log file analysis. If a cookie can be set (which depends on the client), a returning visitor can also be recognized later, provided the cookie has not been changed or deleted in the meantime. Otherwise, only statistical statements can be made about the (probable) return of a visitor. This is possible, for example, by combining the same IP address, screen resolution, matching plug-ins and so on, but this method is not precise. There are, however, studies on techniques for recognizing individual computers by the individual skew of their system clock.

Another way to identify a user in HTTP is via the IP address. However, the same address can belong to many different users if they use a proxy server, network address translation or the like. IP addresses should therefore be used with great caution, as an IP address cannot be equated with a user.
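
A minimal sketch of the purely statistical approach described above: requests are grouped by IP address and user agent and split into sessions after 30 minutes of inactivity. The field names assume entries parsed as in the previous sketch, the 30-minute timeout is a common but arbitrary choice, and, as noted, proxies and NAT make the result imprecise.

  from datetime import timedelta

  SESSION_TIMEOUT = timedelta(minutes=30)  # common, but arbitrary, inactivity limit

  def sessionize(entries):
      """Group parsed log entries into sessions keyed by (IP address, user agent).

      The entries must be sorted by time and carry 'ip', 'agent' and a
      datetime object under 'time'. Behind proxies or NAT this heuristic
      inevitably merges requests from different users.
      """
      sessions = {}   # (ip, agent) -> list of sessions, each a list of entries
      last_seen = {}  # (ip, agent) -> time of the previous request
      for entry in entries:
          key = (entry["ip"], entry["agent"])
          if key not in sessions or entry["time"] - last_seen[key] > SESSION_TIMEOUT:
              sessions.setdefault(key, []).append([])  # start a new session
          sessions[key][-1].append(entry)
          last_seen[key] = entry["time"]
      return sessions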

Often, however, the operator of a website does not have access to the web server's log file, so a statistical evaluation is frequently attempted using so-called counting pixels (tracking pixels) instead. For this purpose, small, invisible (1 × 1 pixel, transparent) images are embedded in the website and served from a web server whose log file can be evaluated.

Extended information, such as the screen resolution or a list of installed browser plug-ins, is often desired as well but is not contained in a log file. This information is therefore usually collected with a client-side scripting language and likewise logged separately via counting pixels.
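
As an illustration, the following sketch implements such a counting pixel with nothing but the Python standard library: it serves a transparent 1 × 1 GIF and logs each hit, including any extra query parameters (e.g. the screen resolution) appended by a client-side script. The path /px.gif and the port are hypothetical choices, not a standard.

  import base64
  from http.server import BaseHTTPRequestHandler, HTTPServer
  from urllib.parse import parse_qs, urlparse

  # A transparent 1 x 1 pixel GIF, base64-encoded.
  PIXEL = base64.b64decode(
      "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")

  class PixelHandler(BaseHTTPRequestHandler):
      def do_GET(self):
          url = urlparse(self.path)
          if url.path != "/px.gif":
              self.send_error(404)
              return
          # Extra data (e.g. screen resolution) arrives as query parameters
          # appended by a client-side script and is logged here together
          # with the requesting IP address.
          extras = parse_qs(url.query)
          self.log_message("pixel hit from %s extras=%s",
                           self.client_address[0], extras)
          self.send_response(200)
          self.send_header("Content-Type", "image/gif")
          self.send_header("Content-Length", str(len(PIXEL)))
          self.end_headers()
          self.wfile.write(PIXEL)

  if __name__ == "__main__":
      HTTPServer(("", 8080), PixelHandler).serve_forever()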

Correlation of log files

Beyond the evaluation of individual files, the most demanding task is the correlation of different log files, especially for error analysis. It is essential that all systems involved attach a time stamp to their log entries and that the clocks of these systems are closely synchronized. Using the Network Time Protocol (NTP) is recommended for this.

An example of correlating log files and their entries would be combining firewall and router log files with accounting data from a system that has been compromised by a cracker.
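
A small sketch of such a correlation, assuming the individual log files have already been parsed into (timestamp, source, message) tuples: the entries are merged into one timeline and the events close to a suspicious point in time are pulled out. The sample firewall and router entries are invented for illustration.

  import heapq
  from datetime import datetime, timedelta

  def merge_logs(*logs):
      """Merge lists of (timestamp, source, message) tuples into one timeline.

      Assumes all systems log with synchronized clocks (e.g. via NTP) and in
      the same time zone.
      """
      return list(heapq.merge(*[sorted(log) for log in logs]))

  def events_near(timeline, moment, window=timedelta(seconds=5)):
      """Return all events within +/- window of a suspicious point in time."""
      return [event for event in timeline if abs(event[0] - moment) <= window]

  firewall = [(datetime(2023, 10, 10, 13, 55, 36), "firewall",
               "DROP tcp 203.0.113.7:4444 -> 10.0.0.5:22")]
  router = [(datetime(2023, 10, 10, 13, 55, 35), "router",
             "new flow 203.0.113.7 -> 10.0.0.5")]

  timeline = merge_logs(firewall, router)
  print(events_near(timeline, datetime(2023, 10, 10, 13, 55, 36)))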

In addition to pure log analysis, a newer branch of software has emerged: security information and event management, SIEM for short. These systems usually take a different approach to log analysis. The differences between SIEM and pure log analysis are:

  • The logs are "normalized": they are broken down into individual pieces of information and stored in a database. SIEM systems know the exact syntax of the individual log generators or device families and can correlate and deduplicate the resulting alarms. The raw data is thus already transformed into structured information.
  • The logs are combined with other data in terms of time or location. For this purpose, further log data sources and other systems from the FCAPS areas (mostly fault management), WMI events, SNMP traps, information from the Active Directory, and NetFlow/sFlow data can be combined and correlated.
  • Thanks to the correlation of all data sources, baselines for normal operation can be determined in the SIEM system, and alarms can be raised very early in the event of deviations.
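
The normalization and deduplication steps from the first point can be sketched as follows; the two parser rules (for iptables and sshd messages) and the sample lines are simplified illustrations, not the rule set of any actual SIEM product.

  import re
  from collections import defaultdict

  # Illustrative parser rules: each known log source gets its own pattern
  # that breaks a raw line into the same normalized fields.
  PARSERS = {
      "iptables": re.compile(r"SRC=(?P<src_ip>\S+).*DPT=(?P<port>\d+)"),
      "sshd": re.compile(r"Failed password .* from (?P<src_ip>\S+)"),
  }

  def normalize(source, raw_line):
      """Turn a raw log line into a flat dict of fields ("normalization")."""
      match = PARSERS[source].search(raw_line)
      return {"source": source, "raw": raw_line, **match.groupdict()} if match else None

  def deduplicate(events):
      """Collapse repeated events from the same source IP into one counted alarm."""
      counts = defaultdict(int)
      for event in events:
          if event:
              counts[(event["source"], event["src_ip"])] += 1
      return dict(counts)

  events = [
      normalize("sshd", "Failed password for root from 203.0.113.7 port 52011 ssh2"),
      normalize("sshd", "Failed password for admin from 203.0.113.7 port 52013 ssh2"),
      normalize("iptables", "IN=eth0 SRC=203.0.113.7 DST=10.0.0.5 PROTO=TCP DPT=22"),
  ]
  print(deduplicate(events))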

In classic log analysis, the specialist who interprets the logs sits in front of the PC; with a SIEM, the manufacturer is expected to provide the corresponding functions and know-how within the software.

Admissibility of the analysis in Germany

In the opinion of the German supervisory authorities, personal data of a user may be collected and used without consent only insofar as this is necessary to enable the use of telemedia and to bill for it. Analysing usage behaviour on the basis of complete IP addresses (including geolocation) is permitted only with conscious, unambiguous consent, because such data can be related to individual persons.
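
One common technical consequence is to shorten IP addresses before analysis so that they can no longer be related to an individual. The following sketch truncates the last octet of IPv4 addresses (or everything beyond the /48 prefix of IPv6 addresses); whether such truncation is legally sufficient has to be assessed for each specific case.

  import ipaddress

  def anonymize_ip(address):
      """Shorten an IP address so that it no longer points to an individual.

      For IPv4 the last octet is removed (/24), for IPv6 everything beyond
      the /48 prefix; whether this is legally sufficient must be assessed
      for the specific case.
      """
      ip = ipaddress.ip_address(address)
      prefix = 24 if ip.version == 4 else 48
      network = ipaddress.ip_network(f"{address}/{prefix}", strict=False)
      return str(network.network_address)

  print(anonymize_ip("203.0.113.7"))     # -> 203.0.113.0
  print(anonymize_ip("2001:db8::1234"))  # -> 2001:db8::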

Programs for log file analysis

There are a number of programs that help analyze log files. The following lists some of them:

Free programs (open source)

Free programs (freeware)

  • HTTP LogStat
  • Funnel Web Analyzer
  • PrimaWebtools
  • Web Entry Miner WEM
  • Xlogan

Commercial programs

  • aconon web controlling
  • APAGO
  • CounterLabs
  • EXAConsult MBIS
  • Intares-MQS web mining
  • LFApro
  • LogDrill
  • Mescalero
  • NetMind
  • NetTracker
  • Piwik PRO
  • RapidEngines (acquired by SevOne in 2014)
  • SAS web analytics
  • ShopStat
  • Sawmill Analytics
  • SmarterStats
  • W3 Statistics (free version available)
  • Urchin software
  • WiredMinds
  • WebReflow (free version available)
  • WebSpy
  • WebTrends
  • Xlogan Pro

Product directories

  • in German on web-analytics.org
  • in English by Terry Lund

Literature

  • Frank Bensberg: Web Log Mining as an Instrument of Marketing Research: A Systematic Approach for Internet-Based Markets. Wiesbaden 2001, ISBN 3-8244-7309-7.
  • R. Kimball, R. Merz: The Data Webhouse Toolkit. New York et al. 2000, ISBN 0-471-37680-9.
  • C. Lutzky, M.-H. Teichmann: Log files in market research: design options for analysis purposes. In: Yearbook of Sales and Consumption Research. Vol. 48, 2002, pp. 295-317.
  • B. Masand, M. Spiliopoulou: Web Usage Analysis and User Profiling. Berlin et al. 2000, ISBN 3-540-67818-2.

References

  1. Data protection-compliant design of analysis procedures for measuring the reach of Internet services. (Memento of May 23, 2012 in the Internet Archive) November 26/27, 2009; see also Data Protection Officer: Logging of IP addresses is not permitted (added February 19, 2010).