Biostatistics

from Wikipedia, the free encyclopedia

Biostatistics is a branch of statistics. It deals with questions that arise in medical research (which is why it is also known as medical statistics) and in other research areas concerned with living things (e.g. agricultural experiments, statistical genetics).

Its tasks include the planning and conduct of studies as well as the analysis of the resulting data with statistical methods. The term biometrics is often used synonymously with biostatistics.

Modern biostatistics

In recent years, statistics has become increasingly important in the life sciences. This is due to the emergence of various high-throughput methods (such as next-generation sequencing, microarrays at the RNA/DNA (i.e. gene) level, and mass spectrometry at the protein level). These technologies generate enormous amounts of raw data that can only be analyzed using biostatistical methods. This new approach is also known as systems biology.

The methods used to evaluate these data are quite complex. On the methodological side they include, among others, statistical machine learning, e.g. artificial neural networks, support vector machines and principal component analysis. Classical statistical concepts such as regression and correlation also play a role as the basis for these methods. Robust statistics are needed to evaluate such data, that is, statistical methods that are not sensitive to outliers (measured values that are far too high or too low because of random events). Gene expression data contain a large number of outliers: even a dust particle on a microarray can seriously affect the measurements.
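The effect of a single outlier on classical versus robust summaries can be illustrated with a minimal sketch. The intensity values are invented for illustration, and the median and median absolute deviation (MAD) are only two of many possible robust estimators.

```python
import statistics

def mad(values):
    """Median absolute deviation: a robust measure of spread."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])

# Hypothetical expression intensities of one gene on eight arrays.
# The last value mimics an artifact such as a dust particle on the chip.
intensities = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 55.0]

mean = statistics.mean(intensities)      # pulled upward by the outlier
sd = statistics.stdev(intensities)       # inflated by the outlier
median = statistics.median(intensities)  # stays with the bulk of the data
robust_spread = mad(intensities)         # stays small

print(f"mean={mean:.2f} sd={sd:.2f} median={median:.2f} MAD={robust_spread:.2f}")
```

The mean and standard deviation are dominated by the one aberrant array, whereas the median and MAD still describe the seven consistent measurements.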

The random forest method of Leo Breiman is also becoming increasingly important, in particular because, in contrast to support vector machines (SVMs), for example, it is comparatively easy to interpret. The method generates randomized decision trees, each of which can be read as a sequence of explicit decision rules. This can be used, for example, to support clinical decisions statistically and to give them a formal, quantitative justification. The method is also used in clinical decision support systems. Another advantage of random forests over SVMs, besides interpretability, is the shorter computing time. The training time of a random forest increases linearly with the number of trees. A test example is evaluated on each tree independently, so the evaluation can be parallelized.
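The ensemble idea can be sketched in a few lines of plain Python. This is a deliberately simplified stand-in, not Breiman's algorithm: each "tree" is a single-split stump trained on a bootstrap sample with one randomly chosen feature, whereas a real random forest grows full trees and samples candidate features at every split. The patient data and all function names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """Fit a one-split tree: choose the threshold and leaf labels with fewest errors."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                (left if row[feature] <= threshold else right) != label
                for row, label in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left, right)
    return best[1:]  # (feature, threshold, left_label, right_label)

def fit_forest(rows, labels, n_trees, rng):
    """Train n_trees stumps; the cost grows linearly with n_trees."""
    forest = []
    n, n_features = len(rows), len(rows[0])
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feature = rng.randrange(n_features)         # random feature choice
        forest.append(
            fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        )
    return forest

def predict_tree(tree, row):
    """Apply one readable rule: 'if feature <= threshold then left else right'."""
    feature, threshold, left, right = tree
    return left if row[feature] <= threshold else right

def predict_forest(forest, row):
    """Majority vote; each tree is evaluated independently of the others."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical data: two biomarker values per patient, 0 = healthy, 1 = sick.
patients = [(1.0, 2.1), (1.4, 1.8), (0.9, 2.4), (1.2, 2.0),
            (5.1, 6.2), (4.8, 5.9), (5.5, 6.4), (5.0, 6.0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(patients, labels, n_trees=25, rng=random.Random(0))
low_guess = predict_forest(forest, (0.8, 1.7))
high_guess = predict_forest(forest, (5.6, 6.5))
print(low_guess, high_guess)
```

Because `predict_tree` needs nothing but its own tree and the test example, the loop inside `predict_forest` could be distributed over several workers without changing the result.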

In general, the extremely large biological data sets are high-dimensional and redundant. This means that much of the collected information is not relevant to the classification at hand (for example, of sick versus healthy individuals). Because of multicollinearity, the information carried by one predictor may also already be contained in another; the two predictors are then highly correlated. To reduce the data set without losing essential information, dimension reduction techniques (for example the principal component analysis mentioned above) are used.
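Redundancy between two predictors can be made visible with the Pearson correlation coefficient. The following sketch uses a simple correlation filter, which is a much cruder technique than principal component analysis; the gene names, values and the cutoff of 0.95 are assumptions chosen for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical predictors: gene_b is roughly twice gene_a, gene_c is unrelated.
predictors = {
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "gene_b": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],
    "gene_c": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
}

cutoff = 0.95  # assumed threshold for "highly correlated"
kept = []
for name, values in predictors.items():
    # Keep a predictor only if it is not nearly collinear with one already kept.
    if all(abs(pearson(values, predictors[k])) < cutoff for k in kept):
        kept.append(name)

print(kept)
```

Here `gene_b` is dropped because almost all of its information is already contained in `gene_a`.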

Classical statistical methods such as linear or logistic regression and linear discriminant analysis are often unsuitable for high-dimensional data, i.e. data in which the number of observations n is smaller than the number of predictors p (n < p). These methods were developed for low-dimensional data (n > p). Applying linear regression with all predictors to a high-dimensional data set can even yield a very high coefficient of determination although the model has little predictive power. Such results should be interpreted with caution.
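The misleading coefficient of determination can be reproduced with pure noise. In the sketch below, a linear model with an intercept and n - 1 random predictors is fitted to n observations of an unrelated outcome; because there are as many coefficients as observations, the fit is exact. All data are simulated, and the small Gaussian elimination solver is only there to keep the example self-contained.

```python
import random

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def r_squared(y, fitted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))
    ss_tot = sum((obs - mean_y) ** 2 for obs in y)
    return 1 - ss_res / ss_tot

def simulate(n, rng):
    """Intercept plus n - 1 noise predictors, and an outcome unrelated to them."""
    X = [[1.0] + [rng.gauss(0, 1) for _ in range(n - 1)] for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]
    return X, y

def predict(X, beta):
    return [sum(b * v for b, v in zip(beta, row)) for row in X]

rng = random.Random(1)
n = 6
X_train, y_train = simulate(n, rng)
beta = solve(X_train, y_train)  # as many coefficients as observations
train_r2 = r_squared(y_train, predict(X_train, beta))

# Fresh data from the same noise process show how little the model predicts.
X_new, y_new = simulate(n, rng)
new_r2 = r_squared(y_new, predict(X_new, beta))

print(f"R^2 on training data: {train_r2:.3f}, on new data: {new_r2:.3f}")
```

The training value is essentially 1 even though the predictors carry no information about the outcome, which is why a high coefficient of determination alone says little when n is not clearly larger than p.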

More recently, attempts have also been made to incorporate knowledge of gene regulatory networks and biochemical signaling cascades into the analysis (gene set enrichment analysis). Several bioinformatics tools exist for this purpose (including GSEA, Gene Set Enrichment Analysis, from the Broad Institute). The idea is that it is often more sensible to consider the perturbation of entire sets of genes together (e.g. signaling cascades such as the JAK-STAT signaling pathway) than to examine the perturbation of individual genes. This approach also draws on existing research on biological signaling cascades. It makes the analysis more robust as well, because a single false-positive gene is more likely than an entire false-positive signaling cascade. In addition, the perturbation of a signaling cascade that is found may already have been described in the literature.

Mendelian randomization is a non-experimental approach for determining causal relationships using DNA sequences.
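The reasoning about whole gene sets can be illustrated with an over-representation test based on the hypergeometric distribution. This is a simpler relative of the approach and not the running-sum statistic implemented in the GSEA tool; the gene counts below are invented.

```python
from math import comb

def enrichment_p_value(overlap, set_size, n_selected, n_total):
    """P(X >= overlap) when n_selected genes are drawn at random from n_total,
    of which set_size belong to the gene set (hypergeometric upper tail)."""
    upper = min(set_size, n_selected)
    favorable = sum(
        comb(set_size, i) * comb(n_total - set_size, n_selected - i)
        for i in range(overlap, upper + 1)
    )
    return favorable / comb(n_total, n_selected)

# Hypothetical experiment: 1000 genes measured, 50 called differentially
# expressed, and 6 of those fall into a pathway of 20 genes.
n_total, n_selected, set_size, overlap = 1000, 50, 20, 6

expected_overlap = n_selected * set_size / n_total  # overlap expected by chance
p_value = enrichment_p_value(overlap, set_size, n_selected, n_total)

print(f"expected overlap {expected_overlap:.1f}, observed {overlap}, p = {p_value:.2e}")
```

One gene from the pathway among the selected genes would be unremarkable, since that is the overlap expected by chance; six at once is unlikely to be a coincidence, which mirrors the robustness argument above.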

Clinical studies

Biostatistics is also used in clinical studies. Such studies examine the effectiveness of drugs, medical devices or treatment methods within the framework of evidence-based medicine. Biostatistics helps with optimal study planning from the very beginning of a clinical study. For example, the required sample size must be calculated. Ideally the study is also double-blind (i.e. neither the investigator nor the patient knows whether the patient receives placebo or the drug). Modern statistical methods can be used to determine which patients benefit most from which therapy, or whether a therapy makes sense at all. Using the technique of statistical matching, a quasi-randomized study can be constructed from non-randomized observational data.
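A sample size calculation of the kind mentioned above can be sketched for the simplest case: comparing the means of two equally sized groups with a two-sided test. The sketch uses the common normal approximation with an assumed known standard deviation; the effect size, standard deviation, significance level and power are hypothetical planning values.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(delta, sd, alpha=0.05, power=0.80):
    """Patients per group for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * ((z_(1 - alpha/2) + z_power) * sd / delta)^2,
    where delta is the smallest difference in means worth detecting.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)

# Hypothetical planning values: detect a difference of 5 units when the
# standard deviation is 10, at a 5 % significance level with 80 % power.
n_per_group = sample_size_per_group(delta=5.0, sd=10.0)
print(n_per_group)  # 63
```

Halving the difference to be detected roughly quadruples the required number of patients, which is why the expected effect has to be specified before the study begins.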

Nutritional research

Biostatistical methods are also used in nutritional research to investigate the health effects of particular foods. Typical questions are "Is a certain food related to the development of a certain disease?" or "Does the consumption of a certain food have a positive effect on a certain disease?". In Germany, the German Institute of Human Nutrition (Deutsches Institut für Ernährungsforschung) conducts research in this area.

Preventive medicine

Preventive medicine is the branch of medicine concerned with preventing diseases before they arise. Here, too, biostatistics is used to find out how diseases can be prevented.

Literature

  • Wolfgang Köhler, Gabriel Schachtel, Peter Voleske: Biostatistics. An introduction for biologists and agronomists. 3rd updated and expanded edition. Springer, Berlin 2002, ISBN 978-3540429470.
  • Christel Weiß: Basic knowledge of medical statistics. 5th edition. Springer, Berlin 2010, ISBN 978-3-642-11336-9.
  • Hedderich, Sachs: Applied Statistics. 14th edition. Springer, Berlin.
