Statistics

from Wikipedia, the free encyclopedia

Statistics “is the study of methods for dealing with quantitative information” (data). It is a way of “establishing a systematic connection between experience (empiricism) and theory”. Statistics is understood as the collection of certain methods for analyzing empirical data. An older term for statistics was collective research. Statistics is used as an auxiliary science by all empirical disciplines and natural sciences, such as medicine (medical statistics), psychology (psychometrics), political science, sociology, economics (econometrics), biology (biometrics), chemistry (chemometrics) and physics. Statistics thus provides the theoretical basis of all empirical research. Because the amount of data is growing rapidly in all disciplines, statistics and the data analysis derived from it are also gaining in importance. At the same time, statistics is a branch of pure mathematics. The goal of pure mathematical statistics is to prove general statements using the methods of pure mathematics. It draws on results from the basic mathematical disciplines of analysis and linear algebra.

Etymology

The word statistics comes from the Latin statisticum (“concerning the state”) and the Italian statista (“statesman” or “politician”), which in turn derives from the Greek στατίζω (“to classify”). The German term Statistik, introduced by Gottfried Achenwall in 1749, originally denoted the “study of data about the state”. In the 19th century, the Scot John Sinclair was the first to use the word in its current meaning of the general collection and analysis of data.

Introduction

Statistics is regarded on the one hand as an independent mathematical discipline concerned with collecting, analyzing, interpreting and presenting data, and on the other hand as a sub-area of mathematics, in particular of stochastics.

Statistics is divided into the following three sub-areas:

  • Descriptive statistics (also empirical statistics): existing data are suitably described, processed and summarized. Its methods condense quantitative data into tables, graphs and key figures. For some institutions, such as official statistics or the Socio-Economic Panel (SOEP), compiling such statistics is the main task.
  • Inductive statistics (also mathematical statistics, inferential statistics or statistical inference): properties of a population are inferred from the data of a sample. Probability theory provides the basis for the required estimation and testing procedures.
  • Exploratory statistics (also hypothesis-generating statistics, analytical statistics or data mining): methodologically, this is an intermediate form of the two preceding sub-areas, but as a form of application it is increasingly gaining independent significance. Using descriptive procedures and inductive test methods, it systematically searches for possible relationships (or differences) between data in existing data sets and at the same time aims to assess their strength and the reliability of the results. The results found in this way can be understood as hypotheses that count as statistically confirmed only after inductive test procedures building on them, with appropriate (prospective) test planning, have confirmed them.

The difference between descriptive and exploratory statistics is also apparent from the questions they ask:

  • Descriptive statistics: how can you describe a distribution of a characteristic?
  • Exploratory statistics: what is remarkable or unusual about a distribution of a characteristic?

History

Modern statistics emerged from several historical (data-analytical) developments that merged into today's statistics over the course of the 19th and 20th centuries. In particular, the division of statistics into descriptive and inferential statistics reflects this historical development.

Official statistics

The beginnings of official statistics date back to well before the birth of Christ. The first official statistics were censuses (probably first in Egypt around 2700 BC, then during the Xia dynasty around 2000 BC and in the city of Mari in Mesopotamia around 1700 BC). In ancient Greece, at least in Athens, there were civil registers, registers of population movements, import lists of dutiable goods (such as grain imports) and property registers. Roman censuses recorded the citizens and their property.

In Germany, the first census took place in Nuremberg in 1449. The city administration wanted to record the population and supplies in order to decide whether refugees from the Margrave War could still be admitted to the city. The French statesman Colbert began extensive (official) statistical surveys in 1665 by establishing trade statistics.

In Prussia, population statistics (births, marriages and deaths) were compiled from 1683 on the order of Elector Friedrich Wilhelm and were expanded over time: in 1719 to household inventories and municipal finances, and in 1778 to livestock, sowing, grain prices, flax and tobacco cultivation, factories, smelting works and mines, shipping and trade. Other German states and cities followed suit, for example Bavaria in 1771 with the Dachsberg population description. Since the establishment of the Statistical Office of the German Empire in 1872, official statistics have been kept for Germany as a whole. In Austria, the first census was carried out in 1753 under Maria Theresa.

By 1870 modern statistical authorities existed in most of the large countries in Europe. At the conferences of the Statistical Congress (1853–1878), quality standards were formulated which most states used.

Unlike the results of official statistics today, the statistics produced at that time were not published and were treated as state secrets.

University statistics

Independently of official statistics, so-called university statistics developed, a now rarely used term for descriptive studies of states and countries. The compilation by the Italian Sansovino (1562) is a first listing of the forms of government of twenty states. Similar works were produced by the Italian Botero (1589), the Frenchman d'Avity (1616) and the Dutchman de Laet (1624–1640). The main representative of university statistics in Germany was the statistician Achenwall.

Official statistics served the administration and supported government or administrative decisions. University statistics were intended more as a general source of information for statesmen and initially contained only textual descriptions. These covered the form of government, legal provisions and individual facts, in short “state peculiarities” in the sense of things worthy of note. Tables were added only later, as in the work of Büsching. The university statisticians, however, did not carry out surveys themselves; they obtained access to official statistics, which they processed and published.

The 19th century brought refinements to observation practices, their institutional consolidation and the idea of objectification. At the end of the 19th century, the term “population” came into increasing use. By 1890, a fully developed mathematical statistics was available. From the middle of the century, Adolphe Quetelet studied social figures in terms of averages, correlations and regularities, and invented the “statistical average citizen” (l'homme moyen).

Political arithmetic

It was the political arithmeticians who first began to search for laws in the data. This had its origins in the increasingly popular tontines, a kind of pension insurance. The Englishman Graunt analyzed birth and death lists in 1660, looking for general laws governing the sex ratio, the ratio of deaths to births, and death rates. The English statistician and economist Petty applied this type of analysis to economic data. The main representative of political arithmetic in Germany was the statistician Süßmilch with his 1741 work The Divine Order in the Circumstances of the Human Race, Demonstrated from Its Birth, Death and Reproduction.

These types of statistics also influenced philosophical questions, such as the existence of individual free will. Quetelet found that the number of marriages in Belgian cities deviated less from the average than the number of deaths, even though the time of marriage is subject to free will and the time of death (usually) is not.

Probability theory

Modern probability theory emerged from considerations of games of chance. The correspondence between Pascal and Fermat in 1654 is regarded as its hour of birth. Its modern foundation was completed in 1933 with the publication of Kolmogorov's textbook Foundations of the Theory of Probability.

Steps in the practical implementation of statistics

A statistical study is always carried out in the interplay of statistical-mathematical methodology and theoretical specialist knowledge. It can be roughly divided into five steps:

Planning

In the planning phase (also called the definition phase), the research questions (the problem and objective of the investigation and their theoretical justification) must be clearly defined. To answer them, the following must be decided:

A statistical investigation is rarely a straight sequence of the five steps; it usually alternates between the phases, depending on the data, the analysis results and theoretical considerations. An important sub-area is statistical experimental design, which usually also includes sample size planning (e.g. in clinical studies). If the sample size is too small, the study may not have enough power to demonstrate the relationship under investigation. As a rule, studies with larger samples also have more power. Statistical methods make it possible to calculate the required sample size exactly when a t-test is used (a test that checks whether the means of two samples differ in a statistically significant way).
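
As an illustration of sample size planning, the following minimal Python sketch approximates the number of cases per group needed to compare two means. It is not the exact t-test calculation: it uses the common normal approximation n ≈ 2 · ((z_(1-α/2) + z_(1-β)) / d)^2. The standardized effect size d, the significance level and the target power are assumptions the researcher has to supply, and the function name sample_size_per_group is purely illustrative.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate cases per group for a two-sided comparison of two means.

    effect_size is the assumed standardized mean difference (Cohen's d).
    Uses the normal approximation, not the exact t-distribution.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # quantile for the two-sided test
    z_beta = std_normal.inv_cdf(power)           # quantile for the target power
    n = 2 * ((z_alpha + z_beta) / effect_size) ** 2
    return math.ceil(n)                          # round up to whole cases


if __name__ == "__main__":
    for d in (0.25, 0.5, 1.0):  # hypothetical effect sizes
        print(d, sample_size_per_group(d))
```

Under these assumptions a medium effect (d = 0.5) requires about 63 cases per group, and halving the effect size roughly quadruples the requirement. An exact calculation based on the t-distribution yields a slightly larger number.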

Data collection

Once the type of survey has been determined, the corresponding steps follow.

Primary statistical survey

The researcher collects their own data, for example through a survey. The data collection procedure must therefore be defined, for example by the ADM design, and the data must be collected in accordance with these specifications.

Secondary statistical survey

The researcher uses individual-level data that were collected by others, for example by a statistical office. This saves work because the researcher does not collect the data personally. Often, however, the variables surveyed do not exactly match the research question or the desired operationalization.

Tertiary statistical survey

The researcher uses only data, aggregated to a statistical spatial reference unit, that were collected and published by others.

Furthermore, a distinction is made between randomized data and purely observational data (from which quasi-randomized data can be created by computer simulation, e.g. by propensity score matching).

Processing

The processing phase includes coding the data, data cleansing (plausibility checks and correction, outliers, missing values) and any (statistically or substantively) necessary transformations of the collected variables.

Processing also includes imputation methods for missing values, that is, methods that fill in missing values using a model that has to be justified. Extreme caution is required here; imputation methods have by now become a research field of their own.
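
A minimal sketch of the simplest such model, mean imputation, is shown below in Python. It is meant only to make the idea concrete: the function name impute_mean and the use of None to mark missing values are assumptions of this example. Mean imputation is known to understate the variability of the data, which is one reason for the caution urged here.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries.

    Returns the completed list and a parallel list of flags so that
    imputed values remain identifiable in later analysis steps.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = mean(observed)
    imputed = [fill if v is None else v for v in values]
    flags = [v is None for v in values]
    return imputed, flags


if __name__ == "__main__":
    data = [1.0, None, 3.0]  # hypothetical variable with one missing value
    print(impute_mean(data))
```

Keeping the flags alongside the data lets a later analysis check how sensitive its results are to the imputed values.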

Conventions and symbols document the results of careful processing. The statistics office of the city of Bern, for example, works according to the following rules:

| Symbol | Meaning |
| --- | --- |
| - | Dash: nothing occurs (value exactly zero). A dash is also used where the conceptual requirements for an entry are missing; in calculations the symbol can be replaced by zero. |
| 0, 0.0 | A quantity smaller than half of the smallest unit used. |
| () | Empty brackets: no figure is given for data protection reasons. |
| ... | Depending on context, three dots mean: figure not known, not meaningful, not listed for statistical reasons, or not applicable. |
| 1, 2 | A superscript number refers to a footnote. |
| r | A superscript r marks a value that has been corrected relative to an earlier one (“restated”). |
| g | A superscript g marks estimated data. |
| / | A slash between two years marks the associated values as mean values. |
| - | A hyphen between two years marks the associated values as a sum. |
| Σ | Any differences between the total and the sum of the individual values or subtotals are due to rounding differences. |

Analysis

In the analysis phase, the methods of exploratory, descriptive and inductive statistics are applied to the data (key figures, graphics and tests). Because of the volume of data, some of it collected automatically, and the increasingly complex evaluation procedures (such as bootstrap methods), analysis is hardly possible without suitable statistical software (such as R).
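
To indicate why such procedures depend on software, here is a minimal Python sketch of a percentile bootstrap confidence interval. It is an illustration under stated assumptions, not a reference implementation: the function name bootstrap_ci, the default of 2000 resamples, the fixed seed and the sample values are all chosen for this example.

```python
import random
from statistics import mean


def bootstrap_ci(data, stat=mean, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for a statistic.

    Draws n_resamples samples of the same size as data, with replacement,
    and returns the empirical quantiles of the recomputed statistic.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    lower_idx = int((1 - level) / 2 * n_resamples)
    upper_idx = int((1 + level) / 2 * n_resamples) - 1
    return estimates[lower_idx], estimates[upper_idx]


if __name__ == "__main__":
    sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.9]  # hypothetical measurements
    print(bootstrap_ci(sample))
```

Even this small example recomputes the statistic thousands of times, which is impractical by hand.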

Interpretation

The results of the statistical analysis must of course be interpreted in light of the respective subject area. Of great interdisciplinary importance, however, is the conversion of numbers into language: the precise verbal rendering of the results in a way that meets scientific criteria. Without reference back to the hypotheses and questions posed in the course of the scientific process, the statistical analysis remains irrelevant. Most weaknesses of a statistical analysis become visible in the written evaluation. Too often only the bare numbers are presented, and too little attention is paid to securing the results in clear language. A convincing statistical evaluation embeds the results in continuous text: it states their relevance, traces the steps from the question to the statistical method, presents the results in a structured way as its centerpiece and, last but not least, links them to the larger scientific context, with awareness of possible weaknesses of the analysis. Only reference and cross-reference to other scientifically obtained and valid study results contributes to progress in knowledge.

Information content and evaluation

Statistics are a representation of collected data. Depending on how the data were obtained, the information they contain amounts to a usable result. If real and objective procedures are abandoned, however, incorrect conclusions can also be drawn from statistics. Statistics can be used, for example, to determine how large the proportion of fare dodgers on trains or the average income of the population in a particular place might be. However, relationships should not be inferred from statistically linked data alone.

When dealing with statistics, it is always important to check the entire body of data for relevance and for the relationships of the individual pieces of information to one another and to their context. Even with a suitable interpretation of the data, false conclusions can arise if one relationship or another is left out or placed in the wrong context. Statistics are therefore required to be “objective” (independent of the statistician's point of view), “reliable”, “valid” (applicable across contexts), “significant” (meaningful) and “relevant” (important).

Schools of thought

Textbooks sometimes give the impression that there is only one, continuously evolving statistical model. There is little controversy in descriptive statistics; in inductive statistics, however, there are different schools of thought that analyze, evaluate and numerically compute a problem in different ways. Little-known approaches are

Inductive statistics is dominated by classical inference, Bayesian inference and statistical decision theory.

The following table shows some differences between the types of inference:

| | Classical inference | Bayesian inference | Statistical decision theory |
| --- | --- | --- | --- |
| Concept of inference used | objectivist, cognitivist, frequentist | subjectivist, cognitivist, non-frequentist | subjectivist, decisionist, non-frequentist |
| Information used (earlier: prior information → now: sample data → later: consequences of action) | sample data only | additionally prior information | additionally consequences of action |
| Information processing | sample and likelihood functions | additionally prior distributions for the prior information and a posterior distribution obtained with Bayes' formula | additionally a loss function for the consequences of action |
| Methods used | point and interval estimation and test procedures based on sampling distributions | point and interval estimation and test procedures based on posterior distributions | construction of decision functions |
| Method assessment | The unknown parameter is fixed, and probability statements concern only the estimate. | The unknown parameter is stochastic, and probability statements also concern the parameter itself. | |
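
The contrast in the table can be made concrete with a small, hypothetical example: estimating an unknown success probability from k successes in n trials. The Python sketch below is only an illustration under stated assumptions (a uniform Beta(1, 1) prior, squared-error loss, and function names chosen for this example).

```python
def classical_estimate(successes, trials):
    """Classical inference: the parameter is fixed; k/n uses sample data only."""
    return successes / trials


def beta_posterior(successes, trials, prior_a=1.0, prior_b=1.0):
    """Bayesian inference: update an assumed Beta(prior_a, prior_b) prior.

    For binomial data, Bayes' formula yields a Beta posterior whose
    parameters are the prior parameters plus the observed counts.
    """
    return prior_a + successes, prior_b + (trials - successes)


def bayes_decision_squared_loss(post_a, post_b):
    """Decision theory: under squared-error loss the posterior mean
    is the action that minimizes the expected loss."""
    return post_a / (post_a + post_b)


if __name__ == "__main__":
    k, n = 7, 10  # hypothetical sample: 7 successes in 10 trials
    a, b = beta_posterior(k, n)
    print(classical_estimate(k, n), (a, b), bayes_decision_squared_loss(a, b))
```

The classical estimate depends on the sample alone, the posterior combines the sample with the assumed prior, and the decision-theoretic step adds a loss function. A different prior or loss function would change the last two results.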

Application

Statistics was originally developed for official statistics and for the analysis of games of chance. Many disciplines needed to test theories and decide between them “objectively”, a task for which mathematics and the rules of statistics are suited. The application of statistical methods in the individual disciplines has given rise to sub-areas of their own.

  • Official statistics is the entirety of the statistics compiled by official institutions, in particular the statistical offices.
  • Business statistics denotes, on the one hand, the description and review of a company's internal processes with the help of statistical methods and, on the other hand, external statistics covering businesses as a whole.
  • Population statistics is the study of the systematic recording, presentation and interpretation of the demographic situation and its development with the help of statistical methods (see also Demography).
  • Biostatistics (also: biometrics) deals with questions that arise in medical research and other fields of research concerned with living beings.
  • Chemometrics is the chemical sub-discipline that applies mathematical and statistical methods to optimally plan, develop, select or evaluate chemical processes and experiments.
  • Data mining and machine learning use statistical and probabilistic models that capture patterns in data by means of computational algorithms.
  • Demography or population science is a scientific discipline that statistically deals with the development of populations and their structures.
  • Epidemiology is the scientific discipline that deals with the causes and consequences as well as the spread of health-related conditions and events in populations.
  • Education uses statistical techniques to describe and understand large student populations (e.g. PISA).
  • Financial statistics focuses on three topics: the empirical analysis of financial time series, their modeling, and agent-based modeling of simulated and real markets.
  • Geostatistics refers to certain stochastic methods for characterizing and estimating spatially correlated georeferenced data.
  • Municipal statistics creates small-scale primary, secondary and tertiary statistics for statistical spatial reference units for municipal planning and decisions.
  • Econometrics is a branch of economics that brings together economic theory as well as mathematical methods and statistical data in order to empirically check economic theoretical models and to analyze economic phenomena quantitatively.
  • Operations research is a branch of applied mathematics that deals with the optimization of certain processes or procedures, including statistical methods.
  • Quantitative linguistics uses statistical methods to investigate language acquisition, language change, and the use and structure of languages.
  • Population ecology is a branch of ecology that deals with the composition, dynamics and interaction of biological populations. Traditionally, population ecology is divided into statistical population description and population dynamics. An essential part of the latter is the interaction of populations through competition and predator-prey relationships.
  • Psychometrics is the field of psychology that deals generally with the theory and method of psychological measurement. It is a compilation of (specific) mathematical and statistical models and methods. These were developed to summarize and describe the empirical data obtained in psychological research and to draw conclusions from them. Above all, they serve to build psychological models, that is, mathematical-statistical (psychometric) models of various cognitive functions and personality domains, derived and formalized from the corresponding underlying theories.
  • Six Sigma is a method from quality management whose core element is the description, measurement, analysis, improvement and monitoring of business processes using statistical means.
  • Sports statistics present past sporting achievements and are used to analyze them and to predict future achievements. They are the basis for sports betting.
  • Statistical mechanics (also: statistical thermodynamics) was originally an application area of mechanics. The state of a physical system is no longer characterized by the exact course over time of the position and momentum of the individual particles, but by the probability of finding such microscopic states. The field thus covers the (theoretical and experimental) analysis of numerous fundamental properties of systems of many particles (atoms, molecules).
  • Statistical physics deals with the description of natural phenomena in which a large number of sub-systems (or particles) are involved, but only statements about the totality are of interest, or in principle only incomplete information about the detailed behavior of the sub-systems is available. It is a physical discipline whose mathematical basis consists of theorems from probability theory and asymptotic statistics together with a few physical hypotheses.
  • Environmental statistics is concerned with the collection of environmental data and the analysis of ecosystems, their pressures and reactions, with the help of statistical methods.
  • Actuarial science applies mathematical and statistical methods to measure risk in insurance and banking.
  • Economic statistics is the study of the systematic recording, presentation and interpretation of economic facts with the help of statistical methods.

Software

R is an open-source statistical software package.

The development of computers since the second half of the 20th century has had a major impact on statistics. Early statistical models were almost always linear models. Increasing computing capacity and the development of suitable numerical algorithms sparked greater interest in non-linear models such as artificial neural networks and led to the development of complex statistical models, for example generalized linear models or multi-level models.

Because statistical software is available to individuals, users can display data and carry out a large number of calculations themselves. These range from the calculation of measures of location (such as mean, median and mode) and measures of dispersion (such as standard deviation, variance and range) to complex statistical models. As a rule, data can also be displayed in a variety of charts, such as box plots and stem-and-leaf plots. Visualization programs can be used for specialized graphics.
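
As a small illustration of such calculations, the following Python sketch computes the measures of location and dispersion named here using the standard library's statistics module. The function name describe and the sample values are assumptions of this example; variance and standard deviation are the sample versions (divisor n - 1).

```python
import statistics


def describe(data):
    """Return basic measures of location and dispersion for a list of numbers."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.mode(data),          # most frequent value
        "stdev": statistics.stdev(data),        # sample standard deviation
        "variance": statistics.variance(data),  # sample variance (n - 1)
        "range": max(data) - min(data),
    }


if __name__ == "__main__":
    values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations
    for name, value in describe(values).items():
        print(name, value)
```

For population rather than sample measures, the same module offers pvariance and pstdev.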

The increase in computing power has also made computer-intensive methods based on resampling techniques (permutation tests, bootstrap methods) increasingly popular. Bayesian statistics, too, has become feasible through Gibbs sampling.
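
A permutation test shows how such resampling methods trade mathematical derivation for computation. The Python sketch below is a minimal Monte Carlo version for the difference of two group means. The function name permutation_test, the number of permutations, the fixed seed and the example data are assumptions of this illustration.

```python
import random
from statistics import mean


def permutation_test(a, b, n_permutations=5000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means.

    Repeatedly shuffles the pooled observations, splits them into groups
    of the original sizes, and counts how often the absolute difference
    in means is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    group_a = [1, 2, 3, 4, 5, 6]        # hypothetical measurements
    group_b = [11, 12, 13, 14, 15, 16]
    print(permutation_test(group_a, group_b))
```

The test makes no distributional assumption beyond exchangeability of the observations under the null hypothesis, at the price of thousands of recomputations.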

