Classification (statistics)

Classification, or division into classes, refers in statistics to the grouping of feature values or statistical series into separate groups, classes, or size classes. Each element of the population under study is assigned to exactly one class according to its value on the variable in question. Classification helps when an (observed) random variable takes too many different values to be processed or presented in a practical way. Data are also processed this way when the collected values can only be regarded as approximations of the true values, or when (quasi-)continuous variables are to be examined with methods for discrete variables.

All values of a class lie between the lower and upper class limits; the difference between the upper and lower class limits is the class width. The class midpoint serves as the representative value of a class in further analysis. The class frequency, or occupation number, is the number of elements contained in the class.

Class and classification

Classes are disjoint (i.e. non-overlapping), contiguous intervals of characteristic values, each bounded and uniquely defined by a lower and an upper class limit.

A classification combines identical or similar characteristic values into a group or class. Since it is often neither possible nor useful in statistical studies to collect or process every individual (distinct) value or realization of the random variables examined, classification can give a better overview of the data. This applies in particular to continuous or quasi-continuous features and to features with a very large number of distinct values.

The disadvantage of classification is the loss of information: once only the classes are considered, the individual observations are "lost", and only representative values such as the number of observations in a class or the class midpoint remain available for further analysis. Within a class, the observations should be spread as evenly as possible over the possible values, i.e. they should not cluster in a small part of the class, so that the class and its width are representative of the observations it contains.

Class boundary

A class limit is the value of a metrically scaled (random) variable that bounds a class from above or below. A class $j$ $(j = 1, \dots, k)$ is defined by two class limits, the lower class limit $x_j^u$ and the upper class limit $x_j^o$, whereby the upper class limit of the $j$-th class coincides with the lower class limit of the $(j+1)$-th class, i.e.

$$x_j^o = x_{j+1}^u, \quad j = 1, \ldots, k-1.$$

The class boundaries can be assigned to the classes in two ways. Either the lower class limit $x_j^u$ belongs to class $j$ and the upper class limit $x_j^o$ to class $j+1$, or the lower class limit $x_j^u$ belongs to class $j-1$ and the upper class limit $x_j^o$ to class $j$, i.e.

$$x_j^u \leq x < x_j^o$$

or

$$x_j^u < x \leq x_j^o, \quad j = 1, \ldots, k.$$

The following example illustrates the two alternatives for classifying (j = 1 to 4):

Designation	Alternative 1	Alternative 2
Class 1	< 100	≤ 100
Class 2	≥ 100 to < 120	> 100 to ≤ 120
Class 3	≥ 120 to < 150	> 120 to ≤ 150
Class 4	≥ 150	> 150

An observation value $x_i$ $(i = 1, \dots, n)$, or an examined statistical unit, is therefore assigned to class $j$ if $x_j^u \leq x_i < x_j^o$ or, under the second alternative, $x_j^u < x_i \leq x_j^o$ holds $(j = 1, \ldots, k)$.

For class 2 in the table, this can be formulated as follows:

Alternative 1: The value is at least 100 and is below 120.
Alternative 2: The value is above 100 and does not exceed 120.
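
The two assignment rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the list `LIMITS` are hypothetical, and the standard-library `bisect` module is one of several possible ways to look up the class; only the limits 100, 120 and 150 come from the example table.

```python
from bisect import bisect_left, bisect_right

# Inner class limits from the example table; the outer classes are open-ended.
LIMITS = [100, 120, 150]

def class_alternative_1(x, limits=LIMITS):
    """Alternative 1: lower limit <= x < upper limit (lower limit belongs to the class)."""
    return bisect_right(limits, x) + 1

def class_alternative_2(x, limits=LIMITS):
    """Alternative 2: lower limit < x <= upper limit (upper limit belongs to the class)."""
    return bisect_left(limits, x) + 1

for value in (99, 100, 119.5, 120, 150, 151):
    print(value, class_alternative_1(value), class_alternative_2(value))
```

The two functions differ only for values that lie exactly on a class limit: the value 100 falls into class 2 under Alternative 1 and into class 1 under Alternative 2.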

Class width

The class width is the difference between the upper and lower class limits.

$$\Delta x_j = x_j^o - x_j^u, \quad j = 1, \ldots, k$$

In the example above, the following class widths result:

Designation	Class width
Class 1	undefined
Class 2	20
Class 3	30
Class 4	undefined

The classes of a feature may have different widths. The optimal number of classes, or the class width, depends on the specific situation (data, goals). Some rules of thumb for choosing the number of classes, or alternatively the class width, can be found in the article on the histogram. The Jenks-Caspall algorithm provides a method for automatic classification.

Class midpoint

After classification, the class midpoint $x_j$ can be used as the representative value of class $j$ in further analysis. If the elements of a class are distributed symmetrically over the values contained in that class, it can be determined as the arithmetic mean of the lower and upper class limits:

$$x_j = \frac{x_j^u + x_j^o}{2}, \quad j = 1, \ldots, k$$

In the example above, the following class midpoints result:

Designation	Class midpoint
Class 1	undefined
Class 2	110
Class 3	135
Class 4	undefined
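
As a small sketch, the class widths and class midpoints of the example can be computed as follows. Representing an open-ended side of a class by `None` is an assumption made for this illustration, as are the names `CLASSES`, `class_width` and `class_midpoint`.

```python
# (lower limit, upper limit) per class; None marks an open-ended side (assumed encoding).
CLASSES = [(None, 100), (100, 120), (120, 150), (150, None)]

def class_width(lower, upper):
    """Upper minus lower class limit, or None if the class is open-ended."""
    if lower is None or upper is None:
        return None
    return upper - lower

def class_midpoint(lower, upper):
    """Arithmetic mean of the two class limits, or None if the class is open-ended."""
    if lower is None or upper is None:
        return None
    return (lower + upper) / 2

widths = [class_width(lower, upper) for lower, upper in CLASSES]
midpoints = [class_midpoint(lower, upper) for lower, upper in CLASSES]

print(widths)     # [None, 20, 30, None]
print(midpoints)  # [None, 110.0, 135.0, None]
```

Open-ended classes have neither a width nor a midpoint, so any analysis that relies on these values has to close such classes with an explicitly chosen limit first.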

Frequency density

As an example, consider the metric, continuous characteristic “net annual income” for a well-defined population of people. Since the number of people decreases with increasing income, the upper income brackets are usually chosen wider than the middle and lower ones so that the presentation remains clear.

However, if a characteristic is divided into classes of different widths, the (absolute or relative) class frequency says little unless the class width is also given. The frequency density is therefore calculated to make the classes comparable. In a histogram, it is the height of the column whose width is the class width and whose area represents the class frequency. The frequency density of a class is the ratio of the absolute or relative frequency of the class to the corresponding class width.

The frequency density for $x_j^u \leq X < x_j^o$ results as follows:

$$\widehat{h}(x_j) = \frac{h(x_j)}{x_j^o - x_j^u}$$

with the absolute frequency $h(x_j)$ of class $j$, or

$$\widehat{f}(x_j) = \frac{f(x_j)}{x_j^o - x_j^u}$$

with the relative frequency $f(x_j)$ of class $j$.

Representation of classified variables

A frequency table offers a systematic and clear way to present a classified continuous random variable.

Feature classes $x_j^u \leq X < x_j^o$	absolute frequency $h(x_j)$	relative frequency $f(x_j)$
$x_1^u \leq X < x_1^o$	$h(x_1)$	$f(x_1)$
$x_2^u \leq X < x_2^o$	$h(x_2)$	$f(x_2)$
$\vdots$	$\vdots$	$\vdots$
$x_j^u \leq X < x_j^o$	$h(x_j)$	$f(x_j)$
$\vdots$	$\vdots$	$\vdots$
$x_k^u \leq X < x_k^o$	$h(x_k)$	$f(x_k)$
Total	$n$	1

where $n$ is the number of objects examined. Cross tables can be used to display multidimensional frequency distributions. Classified variables can be represented graphically by a histogram, a column or bar chart, a stick diagram or, if there are very few classes, a pie chart.

Location parameters

Since a classification provides only intervals and no exact values, only intervals can be determined for the location parameters as well. The number of cars per thousand inhabitants in European countries serves as an example.

Class no.	Number of cars per 1000	Number of countries	Frequency density
1	over 0 to 200	5	0.025
2	over 200 to 300	6	0.06
3	over 300 to 400	6	0.06
4	over 400 to 500	9	0.09
5	over 500 to 700	6	0.03
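
The frequency density column can be reproduced by dividing the number of countries in each class by the class width. The following is a minimal sketch: the class data are taken from the table, while the function name `frequency_densities` and its `relative` flag are assumptions introduced for this illustration.

```python
# (lower limit, upper limit, number of countries), copied from the table above.
CLASSES = [(0, 200, 5), (200, 300, 6), (300, 400, 6), (400, 500, 9), (500, 700, 6)]

def frequency_densities(classes, relative=False):
    """Class frequency (absolute or relative) divided by class width, per class."""
    total = sum(count for _, _, count in classes)
    densities = []
    for lower, upper, count in classes:
        frequency = count / total if relative else count
        densities.append(frequency / (upper - lower))
    return densities

absolute_density = frequency_densities(CLASSES)
relative_density = frequency_densities(CLASSES, relative=True)

for (lower, upper, _), h_hat, f_hat in zip(CLASSES, absolute_density, relative_density):
    print(f"over {lower} to {upper}: {h_hat:.3f} (absolute), {f_hat:.6f} (relative)")
```

The table uses the absolute variant. With the relative variant, the densities multiplied by the class widths sum to 1, which is the property a histogram drawn with relative frequencies relies on.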

Arithmetic mean

Lower limit: (5 x 0 + 6 x 200 + 6 x 300 + 9 x 400 + 6 x 500) / 32 = 300

Upper limit: (5 x 200 + 6 x 300 + 6 x 400 + 9 x 500 + 6 x 700) / 32 = 434.375

So: 300 < arithmetic mean ≤ 434.375.

Equivalently: arithmetic mean = 367.1875, with a maximum error of ±67.1875.
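
The bounds on the arithmetic mean can be checked with a short calculation. This is only a sketch; the variable names are assumptions, and the class data are those of the car example.

```python
# (lower limit, upper limit, number of countries) from the car example.
CLASSES = [(0, 200, 5), (200, 300, 6), (300, 400, 6), (400, 500, 9), (500, 700, 6)]

n = sum(count for _, _, count in CLASSES)

# Every value replaced by its lower or upper class limit, respectively.
mean_lower = sum(lower * count for lower, _, count in CLASSES) / n
mean_upper = sum(upper * count for _, upper, count in CLASSES) / n

# Center of the interval and the largest possible deviation from it.
center = (mean_lower + mean_upper) / 2
max_error = (mean_upper - mean_lower) / 2

print(n, mean_lower, mean_upper, center, max_error)
# 32 300.0 434.375 367.1875 67.1875
```

The center of the interval coincides with the mean computed from the class midpoints, because each class midpoint is itself the average of its two class limits.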

Quartiles

The 1st quartile is in the 2nd class, so: 200 < 1st quartile ≤ 300.

The 2nd quartile (the median) is in the 3rd class, so: 300 < 2nd quartile ≤ 400.

The 3rd quartile is in the 4th class, so: 400 < 3rd quartile ≤ 500.
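
One way to find the class containing a quartile is to look at the cumulative class frequencies. The sketch below assumes that the position p · n does not fall exactly on the boundary between two classes (it does not for the quartiles of this example); the function name `quantile_class` is hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate

# Number of countries per class from the car example.
COUNTS = [5, 6, 6, 9, 6]

cumulative = list(accumulate(COUNTS))  # [5, 11, 17, 26, 32]
n = cumulative[-1]

def quantile_class(p):
    """1-based number of the class in which the cumulative frequency first reaches p * n."""
    return bisect_left(cumulative, p * n) + 1

for p in (0.25, 0.5, 0.75):
    print(p, quantile_class(p))
```

For n = 32 the quartiles lie between the 8th and 9th, the 16th and 17th, and the 24th and 25th ordered values, all of which fall inside classes 2, 3 and 4 respectively.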

Mode

Since the actual distribution of the values is not known, it cannot be determined which values occur most frequently; all that can be said is: 0 < mode ≤ 700.

Modal class

The modal class is the class with the highest frequency density, i.e. the 4th class with the frequency density 0.09.
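
Selecting the modal class amounts to taking the class with the largest frequency density, as in this brief sketch (variable names are assumptions; if several classes shared the maximum, `max` would return the first of them).

```python
# (lower limit, upper limit, number of countries) from the car example.
CLASSES = [(0, 200, 5), (200, 300, 6), (300, 400, 6), (400, 500, 9), (500, 700, 6)]

# Absolute frequency density: class frequency divided by class width.
densities = [count / (upper - lower) for lower, upper, count in CLASSES]

modal_index = max(range(len(CLASSES)), key=lambda j: densities[j])
modal_class_number = modal_index + 1

print(modal_class_number, round(densities[modal_index], 3))  # 4 0.09
```

Comparing densities rather than raw frequencies matters whenever the classes have different widths, since a wide class can collect many observations without being densely occupied.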

Note: For illustration, a frequency distribution is often considered under the following additional assumptions:

the values within each class are evenly distributed, i.e. neighboring values are separated by the distance class width / frequency = 1 / frequency density
the values within each class are symmetric about the class midpoint.

Under these assumptions, exact values for the location parameters can be determined by more detailed analysis and geometric considerations (e.g. applying the intercept theorem). Alternatively, the two assumptions define a unique original list.

In the example, this yields the following unique original list:

Unique original list for the example
Class no.	Number of cars per 1000	Number of countries	Unique original list
1	over 0 to 200	5	20; 60; 100; 140; 180
2	over 200 to 300	6	208.33; 225; 241.67; 258.33; 275; 291.67
3	over 300 to 400	6	308.33; 325; 341.67; 358.33; 375; 391.67
4	over 400 to 500	9	405.56; 416.67; 427.78; 438.89; 450; 461.11; 472.22; 483.33; 494.44
5	over 500 to 700	6	516.67; 550; 583.33; 616.67; 650; 683.33
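
Under the two assumptions (even spacing and symmetry about the class midpoint), the values of a class with limits $x_j^u$, $x_j^o$ and frequency $h$ sit at $x_j^u + (i + 0.5) \cdot (x_j^o - x_j^u) / h$ for $i = 0, \ldots, h - 1$. The following sketch reconstructs the list this way; the function name `original_list` is an assumption made for this illustration.

```python
# (lower limit, upper limit, number of countries) from the car example.
CLASSES = [(0, 200, 5), (200, 300, 6), (300, 400, 6), (400, 500, 9), (500, 700, 6)]

def original_list(lower, upper, count):
    """Evenly spaced values, symmetric about the class midpoint."""
    step = (upper - lower) / count  # class width / frequency = 1 / frequency density
    return [lower + (i + 0.5) * step for i in range(count)]

for number, (lower, upper, count) in enumerate(CLASSES, start=1):
    values = [round(v, 2) for v in original_list(lower, upper, count)]
    print(number, values)
```

Placing the first value half a step above the lower limit and the last value half a step below the upper limit is what makes the list symmetric about the class midpoint.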

The following values then result from this list

Arithmetic mean = (5 x 100 + 6 x 250 + 6 x 350 + 9 x 450 + 6 x 600) / 32 = 367.1875
1st quartile = (241.67 + 258.33) / 2 = 250
2nd quartile = median = (375 + 391.67) / 2 = 383.33
3rd quartile = (472.22 + 483.33) / 2 = 477.78
Every value is a mode, since each value occurs exactly once.
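
These figures can be recomputed from the reconstructed list. The sketch below rebuilds the list so that it is self-contained; the helper `quartile` is hypothetical and follows the convention used in the calculations above, averaging the two ordered values around position p · n (it assumes that p · n is a whole number, as it is for n = 32).

```python
from statistics import mean

# (lower limit, upper limit, number of countries) from the car example.
CLASSES = [(0, 200, 5), (200, 300, 6), (300, 400, 6), (400, 500, 9), (500, 700, 6)]

# Unique original list: evenly spaced values, symmetric about each class midpoint.
values = [
    lower + (i + 0.5) * (upper - lower) / count
    for lower, upper, count in CLASSES
    for i in range(count)
]

def quartile(sorted_values, p):
    """Average of the values at positions p*n and p*n + 1 (1-based); p*n must be whole."""
    k = round(p * len(sorted_values))
    return (sorted_values[k - 1] + sorted_values[k]) / 2

print(round(mean(values), 4))            # 367.1875
print(round(quartile(values, 0.25), 2))  # 250.0
print(round(quartile(values, 0.50), 2))  # 383.33
print(round(quartile(values, 0.75), 2))  # 477.78
```

Other quantile conventions, such as those offered by `statistics.quantiles`, interpolate differently and would give slightly different quartiles for the same list.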

Dispersion parameters can then also be calculated from such a unique original list.

References

Günter Bamberg, Franz Baur, Michael Krapp: Statistics. 14th edition. Oldenbourg, 2008, p. 14.

Source: Statistics: Classification of a metric feature with many different characteristics (Wikibooks)
