# Random sample

A random sample (also: probability sample, random selection) is a sample that is drawn from the population with the aid of a special selection procedure. With such a random selection procedure, every element of the population has a specifiable probability (greater than zero) of being included in the sample. Strictly speaking, the methods of inductive statistics may only be applied to random samples.

## Mathematical definition

A sample is first and foremost a subset of a population. Additional conditions are set for a random sample:

• The elements are randomly drawn from the population and
• the probability with which an element is drawn from the population can be specified.

Furthermore, a distinction is made between an unrestricted and a simple random sample:

Unrestricted random sample
• Each element of the population has the same chance of being included in the sample.
Simple random sample
• Each element of the population has the same probability of being sampled and
• the draws from the population take place independently of one another.

An unrestricted random sample is obtained, for example, by drawing without replacement, and a simple random sample by drawing with replacement.

### Examples

#### Literary Digest disaster

The Literary Digest disaster of 1936 shows what can happen when the sample is not drawn at random from the population: a skewed sample led to a completely wrong election forecast.

#### Election survey

A survey of voters as they leave the voting booth regarding their voting behavior is an unrestricted random sample of the voters (provided no respondent refuses to answer). However, it is not an (unrestricted) random sample of those entitled to vote.

#### Bag check

The retail industry repeatedly complains that theft of goods by its own employees causes great losses. Larger supermarkets therefore carry out bag checks when employees leave. Since a complete check of all employees would be too time-consuming (and would probably also have to be paid as working time), employees walk past a lamp when leaving through the staff exit. Computer-controlled, it shows either a green light (employee is not checked) or a red light (employee is checked). This selection is then a simple random selection.

### Random Sampling in Mathematical Statistics

In mathematical statistics, random samples are the basis for drawing conclusions from the sample about characteristics of the population. A concrete sample $x_1, \dotsc, x_n$ is then viewed as a realization of the random variables $X_1, \dotsc, X_n$. These random variables are known as sample variables and indicate the probability with which a certain element of the population can be drawn in the $i$-th drawing using a certain selection procedure. If a simple random sample was drawn, it can be shown that the sample variables $X_i$ are independent and identically distributed (abbreviated i.i.d., from the English *independent and identically distributed*). That is, the distribution type and the distribution parameters of all sample variables equal the distribution in the population (*identically distributed*), and because the drawings are independent, the sample variables are also independent of one another (*independent*). Many problems in inductive statistics assume that the sample variables are i.i.d.

### Dependent and independent samples

In analyses with more than one sample, a distinction must be made between dependent and independent samples. Instead of *dependent samples*, one also speaks of *connected* or *paired samples*.

Dependent samples usually occur with repeated measurements on the same examination object. For example, the first sample consists of people before treatment with a particular drug, and the second sample of the same people after treatment; i.e., the elements of two (or more) samples can be assigned to each other in pairs.

With independent samples, there is no relationship between the elements of the samples. This is the case, for example, when the elements of the samples come from different populations (say, the first sample consists of women and the second of men) or when people are randomly divided into two or more groups.

Formally, this means for the sample variables $X_{ij}$ (with $i$ the $i$-th test object and $j$ the $j$-th measurement):

• For independent samples: all sample variables $X_{ij}$ are independent of one another.
• For dependent samples: the sample variables $X_{11}, \dotsc, X_{n1}$ of the first sample are independent of each other, but there is a dependency between the sample variables $X_{i1}, \dotsc, X_{ip}$, since they are collected on the same examination object $i$.

## Single-stage random samples

A pure (also: simple) or unrestricted random sample can be described using an urn model. For this purpose, a fictitious vessel is filled with balls, which are then drawn at random: drawing with replacement results in a simple random sample, drawing without replacement results in an unrestricted random sample. Using an urn model, various random experiments, such as a lottery drawing, can be simulated.
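The urn model can be sketched directly with the standard library; this is a minimal illustration of the distinction drawn above, with a 49-ball lottery urn as an assumed example:

```python
import random

def simple_random_sample(urn: list, n: int, rng: random.Random) -> list:
    """Drawing WITH replacement: a simple random sample.
    Each draw is independent of the others and has identical
    probabilities, so repeats are possible."""
    return [rng.choice(urn) for _ in range(n)]

def unrestricted_random_sample(urn: list, n: int, rng: random.Random) -> list:
    """Drawing WITHOUT replacement: an unrestricted random sample.
    Every element still has an equal chance of being included,
    but the draws are no longer independent."""
    return rng.sample(urn, n)

rng = random.Random(0)
urn = list(range(1, 50))                         # e.g. 49 lottery balls
print(simple_random_sample(urn, 6, rng))         # repeats possible
print(unrestricted_random_sample(urn, 6, rng))   # all six distinct
```

The two functions differ exactly in the independence condition from the mathematical definition: `rng.sample` removes each drawn ball from the urn, so later draws depend on earlier ones.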

### Sample size

The sample size is the number of elements drawn from the population that is required in order to determine statistical parameters with a specified accuracy by means of estimation. In practice, however, the sample size is often determined by standards or empirical values. For a simple random sample, the statistical parameters generally become better the larger the sample size (see e.g. the table in this section). However, if the draws from the population cannot be carried out independently of one another (e.g. in the case of time series or stochastic processes), increasing the sample size can actually worsen the statistical parameters (e.g. the variance); see Smit's paradox.

If $\theta$ is the unknown parameter in the population, then an estimator $\Theta = \Theta(X_1, \dotsc, X_n)$ is constructed as a function of the sample variables $X_1, \dotsc, X_n$. The expected value of the random variable $\Theta$ is usually $\operatorname{E}(\Theta) = \theta$, and the following applies:

$P(\theta - e \leq \Theta \leq \theta + e) = 1 - \alpha$,

where $\hat{\theta}$ is a point estimate of the unknown parameter, $e$ the absolute error, and $1 - \alpha$ the probability that a realization of $\Theta$ falls within the central fluctuation interval. The absolute error equals $e = c_{1-\alpha/2} \sqrt{\operatorname{Var}(\Theta)}$, so

$P\left(\theta - c_{1-\alpha/2} \sqrt{\operatorname{Var}(\Theta)} \leq \Theta \leq \theta + c_{1-\alpha/2} \sqrt{\operatorname{Var}(\Theta)}\right) = 1 - \alpha$,

where $c_{1-\alpha/2}$ mostly depends on the distribution type of $\Theta$, and $\operatorname{Var}(\Theta) \propto \tfrac{1}{n}$ applies to the variance. The following table gives an estimate of the sample size for the unknown mean value $\mu$ or the unknown proportional value $\pi$.

| Unknown parameter | Condition | $c_{1-\alpha/2}$ | $\sqrt{\operatorname{Var}(\Theta)}$ | Estimate of sample size |
|---|---|---|---|---|
| $\mu$ | $X_i \sim N(\mu; \sigma)$ and $\sigma$ known | $z_{1-\alpha/2}$ | $\sigma / \sqrt{n}$ | $n \geq \frac{z_{1-\alpha/2}^2 \sigma^2}{e^2}$ |
| $\mu$ | $X_i \sim N(\mu; \sigma)$ and $\sigma$ unknown | $t_{n-1; 1-\alpha/2}$ | $s / \sqrt{n}$ | $n \geq \frac{t_{n-1; 1-\alpha/2}^2 s^2}{e^2}$ |
| $\mu$ | $X_i \sim (\mu; \sigma)$ and $n > 30$ | $z_{1-\alpha/2}$ | $s / \sqrt{n}$ | $n \geq \frac{z_{1-\alpha/2}^2 s^2}{e^2}$ |
| $\pi$ | $np(1-p) \geq 9$ | $z_{1-\alpha/2}$ | $\sqrt{p(1-p)/n}$ | $n \geq \frac{z_{1-\alpha/2}^2}{4e^2} \geq \frac{z_{1-\alpha/2}^2 p(1-p)}{e^2}$ |

### Example (election)

One party scored 6% in a poll shortly before the election. How large does a voter survey on election day have to be so that the true percentage can be determined with a certainty of $1 - \alpha = 95\,\%$ and an accuracy of $e = 1\,\%$?

$n \geq \frac{1.96^2}{4 \cdot 0.01^2} = 9604$

or, more precisely,

$n \geq \frac{1.96^2 \cdot 0.06 \cdot 0.94}{0.01^2} \approx 2167$.

That is, the more precise estimate of the sample size for the proportional value shows that only 2167 voters need to be interviewed in order to obtain the election result with an accuracy of 1%. The graphic on the right shows which sample sizes are necessary for a certain estimated proportion and a given certainty.

### Example (material testing)

In materials testing, a sample size of 10 per 1000 parts produced is quite common. It depends, among other things, on the safety relevance of the component or material. In destructive tests such as tensile tests, an attempt is made to keep the testing effort, and thus the sample, as small as possible. In non-destructive testing (for example, in image-processing systems for completeness checks), a 100% inspection is often performed so that production faults can be detected as quickly as possible.

## Multi-stage random selection (also: complex random selection)

The following selection procedures are particularly important, the last two being referred to as two-stage selection procedures :

• Stratified random sample ( stratified sample ): The elements are classified into groups (strata) according to a certain characteristic. The aim of the grouping is to create strata that are as homogeneous as possible with respect to the characteristic under study. A random sample is then drawn from within each of these groups. Both a pure random sample and a weighted procedure can be used as the selection procedure within the strata.
• Cluster sampling ( cluster sample ): Unlike the strata in the stratified sample, the groups formed here should be as heterogeneous as possible internally, while the groups as a whole should be as similar to one another as possible (like the school classes of a school). First, a (relatively small) random sample of the groups is drawn. Then all elements contained in the drawn groups are included in the sample. A classic example is surveying entire city blocks or school classes: first, the school classes to be interviewed are randomly selected; then all students in those classes are interviewed. The so-called cluster effect occurs in cluster sampling; it is greater the more homogeneous the elements within the groups and the more heterogeneous the groups are.
• Staged random sample ( staged sample ): Staging is often preferred over stratification for reasons of cost and time savings. Staging is also recommended when no list of all cases (objects of investigation, characteristic carriers, etc.) of the population exists, so that a simple random sample cannot be drawn (e.g. in an investigation based on texts: since not all texts are electronically recorded or available, accessing the respective archives is costly; this can be avoided by staging). In essence, the staging procedure builds on stratification:
1. Classification criteria (characteristics) are determined,
2. the population is divided according to these characteristics into mutually exclusive subpopulations (primary units),
3. a random selection of the subpopulations is made and limited to a certain number of primary units that are examined; the remaining subpopulations are ignored,
4. the random sample of the characteristic carriers (objects, individuals, cases) is then drawn from the randomly selected primary units. For example, an institute wants to ask 500 people about their consumer behavior. In step 2, the population is divided, e.g. on the basis of geographic features, into East, North, South and West Germany. In step 3, it is determined that consumer behavior in East and South German supermarkets (secondary units) is the focus of the study, so 250 people (tertiary units) are interviewed in each of the two regions.
5. The subsamples (of the two regions examined) are then combined to form an overall sample.
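The staging steps above can be sketched as follows. The region names and counts mirror the hypothetical consumer survey in the text; the population lists themselves are invented placeholder data:

```python
import random

def staged_sample(population: dict[str, list], chosen_strata: list[str],
                  n_per_stratum: int, rng: random.Random) -> dict[str, list]:
    """Two-stage selection: restrict to the chosen primary units
    (here: regions), then draw a random sample within each of them."""
    return {stratum: rng.sample(population[stratum], n_per_stratum)
            for stratum in chosen_strata}

rng = random.Random(0)
# Step 2: population divided into mutually exclusive primary units
population = {region: [f"{region}-person-{i}" for i in range(1000)]
              for region in ("East", "North", "South", "West")}
# Steps 3-4: only East and South are examined, 250 interviews each;
# North and West are ignored entirely
sample = staged_sample(population, ["East", "South"], 250, rng)
# Step 5: combine the subsamples into an overall sample
overall = sample["East"] + sample["South"]
print(len(overall))  # 500
```

Note how the ignored primary units (North and West) never contribute to the overall sample, which is exactly what distinguishes staging from stratification, where every stratum is sampled.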


## Random drawing problems

In practical research (especially in the field of social sciences) a “real” random sample can only very rarely be selected. There are several reasons for this:

1. Populations are statistically understood as a set in the mathematical sense. This means it must be clearly defined which characteristic carriers belong to the population and which do not, which requires a clear delimitation in terms of time, space and the characteristic. This often does not succeed because the population is not known at all (e.g. not all persons who currently have depression in Germany are known) or because it changes over time (e.g. due to births and deaths).
2. Due to ethical and data protection concerns, a list of the entire population (e.g. all people in Germany or a certain city) cannot be accessed and selected from.
3. Not all persons drawn from a register are willing to take part in an investigation (e.g. telephone survey). In addition, it can be assumed that participants differ from non-participants in certain characteristics (social status, level of education, etc.).

In practice, therefore, an ad hoc sample is often used, i.e. those persons are surveyed who voluntarily declare themselves willing to take part in an investigation. It must therefore be checked whether the survey population ( frame population ; the population actually surveyed) corresponds to the target population (the population for which the statements of the study are to apply).

## Literature

• Joachim Behnke, Nina Baur, Nathalie Behnke : Empirical Methods in Political Science (= UTB 2695 Basic Course Political Science ). Schöningh et al., Paderborn et al. 2006, ISBN 3-506-99002-0 .
• Markus Pospeschill: Empirical Methods in Psychology . Volume 4010. UTB, Munich 2013, ISBN 978-3-8252-4010-3 .
• Jürgen Bortz, Nicola Döring: Research methods and evaluation for human and social scientists. 4th edition. Springer, Heidelberg 2006, ISBN 3-540-33305-3 .

## Individual evidence

1. Literary Digest Disaster. Market research wiki, accessed February 12, 2011 .
2. Theft costs trade billions. Der Tagesspiegel , November 14, 2007, accessed on February 12, 2011 .
3. Bernd Rönz, Hans G. Strohe (Ed.): Lexicon Statistics . Gabler Wirtschaft, Wiesbaden 1994, ISBN 3-409-19952-7 , p. 412 .
4. Jürgen Janssen, Wilfried Laatz: Statistical data analysis with SPSS for Windows. An application-oriented introduction to the basic system and the Exact Tests module . 6th, revised and expanded edition. Springer, Berlin et al. 2007, ISBN 978-3-540-72977-8 , p. 353 .
5. See: Hans-Friedrich Eckey, Reinhold Kosfeld, Matthias Türck: Probability Theory and Inductive Statistics. Basics - methods - examples. Gabler, Wiesbaden 2005, ISBN 3-8349-0043-5 , p. 185.