Network inference (systems biology)

Network inference (also network reconstruction; inference from Latin inferre, "conclusion") refers to the identification or reconstruction of a network model of a real system from measured data and prior knowledge. In systems biology, network inference refers to the identification of biological networks, in particular gene regulatory networks, from measured biomedical or molecular biological data, in particular gene expression data, together with prior molecular biological knowledge. In hardware and software engineering, the corresponding activity is called reverse engineering; this term is also used figuratively for network inference in systems biology.

Biological network

The properties and behavior of many systems can be represented and simulated using network models. A network consists of components (nodes) that are connected to one another by edges. In systems biology, nodes represent in particular genes, proteins, metabolites, cells, tissues, organs, organisms or species. The edges represent molecular biological and biochemical processes (e.g. transcription, translation, enzymatically catalyzed reactions), interactions (e.g. protein-protein interactions), metabolic processes, information flows or trophic relationships in food chains. For example, a gene regulatory network (GRN) consists of nodes that represent genes and edges that connect these genes. In simplified form, the edges stand for the processes of gene expression that lead to the synthesis of proteins with a gene regulatory or catalytic function (transcription factors, repressors, inducers, or enzymes that catalyze the synthesis of metabolites which in turn act on signal transduction and thus influence the expression of genes).
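The node-and-edge description above can be made concrete with a very small data structure. The following is only a sketch under stated assumptions: the gene names, the edge signs and the helper functions `nodes_of` and `regulators_of` are hypothetical and chosen purely for illustration.

```python
# Hypothetical toy gene regulatory network (names and signs are illustrative).
# Each directed edge (regulator, target) carries a sign:
# +1 for activation, -1 for repression.
edges = {
    ("geneA", "geneB"): +1,
    ("geneA", "geneC"): -1,
    ("geneB", "geneC"): +1,
}


def nodes_of(edge_dict):
    """Collect every node that appears as regulator or target of an edge."""
    found = set()
    for regulator, target in edge_dict:
        found.update((regulator, target))
    return found


def regulators_of(edge_dict, target):
    """Return the regulators of one target gene together with the edge signs."""
    return {reg: sign for (reg, tgt), sign in edge_dict.items() if tgt == target}


nodes = nodes_of(edges)
print(sorted(nodes))
print(regulators_of(edges, "geneC"))
```

A dictionary keyed by (regulator, target) pairs keeps the graph directed and makes it easy to attach a weight or sign to each edge.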

Network inference as a solution to an optimization problem

Network inference is understood as the solution to an optimization problem in which the behavior of the network model and the measured data are brought into the closest possible agreement under given constraints. To quantify the agreement, various measures exist for the distance between the measured data and the values obtained by simulating the network model. In dynamic systems such as a GRN, the response of the biological system (e.g. an organ or organism) to an external perturbation (e.g. a temperature jump, an infection or the administration of an active substance) is measured and compared with the simulated response of the GRN, i.e. the distance between measurement and simulation is determined. If the responses to several different perturbations are included in the network inference instead of the response to just one, more complex networks can be identified.

The constraints of the optimization problem are determined, among other things, by the existing prior knowledge about the network. If the prior knowledge is uncertain or itself implies an optimization task (e.g. "the number of active edges should be as small as possible"), it can also be incorporated into the objective function (evaluation function), for example as an additive term alongside the distance between the behavior of the network model and the behavior of the system.
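An objective function of this kind, a data-fit term plus an additive term that rewards few active edges, might be sketched as follows. The squared-error distance, the function name `objective` and the weighting parameter `penalty` are assumptions made for illustration, not a prescribed formulation.

```python
def objective(measured, simulated, edge_weights, penalty=0.1):
    """Score a candidate network model; smaller values are better.

    measured, simulated: equally long sequences of numbers
    edge_weights: one weight per candidate edge, 0 meaning "edge absent"
    penalty: assumed trade-off between data fit and sparsity
    """
    if len(measured) != len(simulated):
        raise ValueError("measured and simulated data must have equal length")
    # Distance between measurement and simulation (sum of squared errors).
    fit = sum((m - s) ** 2 for m, s in zip(measured, simulated))
    # Prior knowledge "as few active edges as possible" as an additive term.
    active_edges = sum(1 for w in edge_weights if w != 0)
    return fit + penalty * active_edges


score = objective([1.0, 2.0], [1.0, 1.0], [0.5, 0.0, -0.2], penalty=0.1)
print(score)
```

With a larger `penalty`, a model with fewer edges can score better even if it fits the data slightly worse.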

The various algorithms of network inference differ with regard to

  • the use and preprocessing of measurement data,
  • the network type and the methodology of the model simulation,
  • the way of reducing the complexity of the network model,
  • the use of prior knowledge and hypotheses.

Measurement data

The measured data should reflect the system behavior with the highest possible information content. For dynamic systems such as a GRN, the response of the biological system (e.g. an organ or organism) to an external perturbation is therefore measured and compared with the simulated response of the GRN model, i.e. the distance between the two time series is determined, for example as a Euclidean distance or with the Manhattan metric. The type of perturbation, the number of experimental replicates where required, and the number and placement of measurements (e.g. the measurement time points) can be determined by optimal statistical design of experiments. The choice of measurement method is determined primarily by the system and the available resources. For the inference of GRNs, various methods of gene expression analysis are available, e.g. RNA-Seq.
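The two distance measures named above can be computed as in the following sketch. The time series values are made up for illustration, and both series are assumed to be sampled at the same time points.

```python
import math


def euclidean_distance(x, y):
    """Square root of the summed squared differences between two series."""
    if len(x) != len(y):
        raise ValueError("time series must have the same number of points")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan_distance(x, y):
    """Sum of the absolute differences between two series."""
    if len(x) != len(y):
        raise ValueError("time series must have the same number of points")
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical measured and simulated expression values at four time points.
measured = [0.0, 1.0, 3.0, 2.0]
simulated = [0.0, 1.0, 0.0, 6.0]

print(euclidean_distance(measured, simulated))  # 5.0
print(manhattan_distance(measured, simulated))  # 7.0
```

The Euclidean distance weights large deviations more strongly than the Manhattan metric, which can matter when a few time points contain outliers.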

Network types

A network is a model of a real system. Networks are often visualized as graphs and can be analyzed using methods of graph theory, so that metric properties of the graph, for example the number of cliques, can be included in the objective function of the optimization problem described above. With regard to the edges, a distinction is made between directed and undirected graphs, depending on whether the connections or relationships between the nodes have a (preferred) direction. A further distinction is made between weighted and unweighted edges, depending on whether values, e.g. real numbers for reaction rates, can be assigned to the edges.

Different network types are distinguished by the way in which nodes and edges are represented mathematically. In the simplest case, the network can be described using Boolean algebra. In such a Boolean network, the value 1 represents, for example, an expressed gene and the value 0 a non-expressed ("silent") gene. In an extension of this approach, probabilities are used instead of two-valued logic. A Bayesian network is typically formulated in this way: each node is assigned the probability that a gene is expressed, and each edge from gene A to gene B is assigned the conditional probability that gene B is upregulated if A is upregulated. Thirdly, the relative or absolute amount of the transcript (expression intensity), i.e. the amount of mRNA resulting from the transcription of a gene, can be quantified by a real number.
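A Boolean network of the kind described above can be simulated in a few lines. The three update rules below are purely hypothetical and serve only to show the mechanics of a synchronous update, in which all genes switch at the same time.

```python
def step(state):
    """Apply one synchronous update to a three-gene Boolean network.

    Assumed, illustrative rules:
      A is expressed unless C is expressed (C represses A)
      B follows A                           (A activates B)
      C needs both A and B                  (A AND B activate C)
    """
    a, b, c = state
    return (int(not c), a, int(a and b))


def trajectory(state, n_steps):
    """Return the list of states visited, starting with the initial state."""
    states = [state]
    for _ in range(n_steps):
        state = step(state)
        states.append(state)
    return states


for s in trajectory((1, 0, 0), 5):
    print(s)
```

Starting from (1, 0, 0), this toy network returns to its initial state after five updates, so it has a cyclic attractor rather than a fixed point.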

The values representing a node (Boolean value, probability, real number) can be either constant or variable over time, so that a distinction is made between static and dynamic networks. Edges of a network are often described not just by numbers but by mathematical functions of varying complexity. If the nodes of a dynamic network are represented by real numbers, (ordinary) differential equations or difference equations are often used for the mathematical representation of the edges.
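As an illustration of a dynamic network with real-valued nodes, the sketch below simulates a linear system dx/dt = W·x + u with the explicit Euler method, which turns the differential equation into a difference equation. The interaction matrix `W`, the constant input `u` (standing for an external perturbation) and the step size are assumed values, not taken from any real data set.

```python
def simulate_linear(W, x0, u, dt, n_steps):
    """Integrate dx/dt = W x + u with explicit Euler steps.

    W[i][j] is the assumed influence of gene j on gene i,
    u[i] is a constant external input acting on gene i.
    Returns the list of states, including the initial state.
    """
    n = len(x0)
    x = list(x0)
    states = [list(x)]
    for _ in range(n_steps):
        dx = [sum(W[i][j] * x[j] for j in range(n)) + u[i] for i in range(n)]
        x = [x[i] + dt * dx[i] for i in range(n)]
        states.append(list(x))
    return states


# Hypothetical two-gene cascade: the input drives gene 0, gene 0 activates
# gene 1, and both transcripts are degraded (negative diagonal entries).
W = [[-1.0, 0.0],
     [1.0, -1.0]]
u = [1.0, 0.0]

states = simulate_linear(W, x0=[0.0, 0.0], u=u, dt=0.1, n_steps=200)
print(states[-1])
```

For these values both genes approach the steady state 1.0, which is the solution of W·x + u = 0.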

Reduction of complexity

The complexity of a network depends on the network type, the number of nodes and edges, and the mathematical functions that describe the edges. Biological systems, with their thousands of genes, proteins, metabolites, cells and so on, are high-dimensional. The relationships between these components are non-linear and dynamic. Biological networks are therefore typically highly complex. The inference, i.e. the reconstruction of complex networks from existing data and available knowledge, is not only computationally expensive (NP-hard); such networks are also often not uniquely identifiable. For genome-wide gene regulatory networks, this problem arises from the complexity of the networks on the one hand and from the limited number and quality (measurement errors) of the measurement data and the incompleteness of the available prior knowledge on the other. To make such non-identifiable networks identifiable, either the number and quality of the measurement data must be increased or the complexity of the network model must be reduced. Since the number and quality of the measurement data are limited by the resources and techniques available in practice, reducing the complexity is the decisive task of network inference in systems biology. The complexity can be reduced in different ways:

  1. Reduction in the number of nodes
  2. Reduction in the number of edges
  3. Simplification of the functions that the edges represent

These simplifications conflict with the holistic claim of systems biology.

Regarding 1. Systems biology claims to examine a biological system in its entirety. Any reduction in the number of nodes, i.e. of the components involved (genes, proteins, metabolites, etc.), is based on hypotheses or on deliberate restrictions (or, where they exist, on recognized laws). A typical restriction is that only differentially expressed genes are considered as nodes. Furthermore, genes that are expressed or regulated in the same way (co-regulated genes) are combined into groups by means of cluster analysis, or into modules by means of prior knowledge of gene function and regulation. The groups or modules then form the nodes of the network.

Regarding 2. Various hypotheses have been used in systems biology to reduce the number of edges admitted to network inference. According to one of these hypotheses, gene regulatory networks are sparsely connected (sparse). Minimizing the number of edges is then taken into account as an additional criterion in network inference.

Regarding 3. The simplest function for describing edges is binary. Such Boolean networks achieve the greatest possible reduction in complexity for a given number of nodes and edges. The difficulty then lies in mapping the usually real-valued measured values, e.g. gene expression intensities, onto these two values. A somewhat less drastic simplification of the functions that represent the edges is to replace systems of non-linear differential equations with non-linear, time-variable functions by linear differential equations or, to simplify even further, by difference equations, which are then converted into an algebraic system of equations.
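The mapping of real-valued measurements onto two Boolean values mentioned above requires a threshold. The sketch below uses the median of each gene's own expression profile as that threshold; this is only one of several possible choices and is an assumption made here, as are the example values.

```python
import statistics


def binarize(profile, threshold=None):
    """Map real-valued expression intensities to 0 (off) or 1 (on).

    If no threshold is given, the median of the profile is used,
    which is an assumed, illustrative default.
    """
    if threshold is None:
        threshold = statistics.median(profile)
    return [1 if value > threshold else 0 for value in profile]


# Hypothetical expression intensities of one gene at four time points.
profile = [0.2, 1.5, 3.1, 0.9]
print(binarize(profile))        # median is 1.2 -> [0, 1, 1, 0]
print(binarize(profile, 2.0))   # fixed threshold -> [0, 0, 1, 0]
```

The result depends strongly on the threshold, which is exactly the information loss that makes this simplification problematic.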

Prior knowledge for network inference

For biological systems without a reduction in complexity, especially for genome-wide GRNs, the task of network inference is not only ill-conditioned but also underdetermined, i.e. the amount of experimental data is too small for a unique identification of the network structure and parameters. Since the number and quality of the measurement data cannot be increased at will, among other things because of financial constraints, taking prior knowledge into account plays a decisive role alongside the reduction in complexity, which is problematic from the point of view of systems biology. The prior knowledge relates both to the aggregation of nodes (into clusters or modules, see above) and to the edges, i.e. to existing knowledge about the relationships between the nodes. In the simplest case, it is factual or hypothetical knowledge about the absence of a connection. With thousands of nodes in a GRN, prior knowledge about millions of edges is required. The amount of such knowledge in the specialist literature is steadily increasing, but to be used in numerical algorithms the knowledge must be machine-readable, e.g. retrievable from databases.

For example, prior knowledge about transcription factors and other regulatory proteins as well as their DNA binding sites was extracted from several databases (Gene Ontology, oPOSSUM, JASPAR, TRANSFAC, PathwayStudio) for the inference of a GRN of liver cells. While databases of protein-protein interactions have reached an advanced level for some biological species, databases of gene-protein-gene relationships, for example the high-quality, manually curated database TRANSFAC, are for almost all species either very incomplete or contain uncertain information in the form of automatically generated, hypothetical entries that have not been experimentally validated. This is mainly because the gene-protein-gene relationships mediated by gene expression (transcription, RNA processing, translation, protein maturation) and by protein-DNA interactions at transcription factor binding sites are themselves complex, dynamic and non-linear. With each successful and reliable inference of a GRN, however, the knowledge that can be used for subsequent network inference with new measurement data increases.

Validation of the network models

Since both the measurement data and the prior knowledge are often subject to errors and uncertainties, and since a network model depicts the properties of a real system only incompletely, the inferred network model must be validated. A distinction is made here between internal and external validity. Internal validity can be determined without further experimental effort from the given measurement data and prior knowledge using a resampling method, e.g. cross-validation.
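The resampling idea behind cross-validation can be sketched as follows: the available samples are split into k folds, the model is inferred from k-1 folds and evaluated on the fold that was held out. The generator `k_fold_indices` and its interleaved fold assignment are hypothetical choices for illustration; the inference and evaluation steps themselves are left out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k-fold cross-validation.

    Samples are assigned to folds in an interleaved fashion
    (0, k, 2k, ... form the first fold, and so on).
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train, test


splits = list(k_fold_indices(n_samples=6, k=3))
for train, test in splits:
    # A network would be inferred from `train` and scored on `test` here.
    print("train:", train, "test:", test)
```

Each sample is used for testing exactly once, so the spread of the k test scores gives an impression of how stable the inferred model is.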

The decisive criterion for the quality of an inferred network model is its ability to generalize, i.e. its prediction quality for the system under changed (experimental) conditions. This is tested by simulating the network model to predict the behavior under changed conditions; these conditions are then realized experimentally, and the newly measured data are compared with the predicted system behavior. Because the reduction in complexity needed to make a network model suitable for inference is unavoidable but justified only hypothetically, and because of possible measurement errors and uncertainties in the prior knowledge used, the conclusions obtained with bioinformatic methods are themselves only hypotheses. These hypotheses are valuable for the focused and thus resource-saving planning of experiments that serve to verify them.

As a measure of validity, the area under the curve (AUC) of the receiver operating characteristic (ROC) curve is used, for example.
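For network inference, the AUC is typically computed over candidate edges: each edge receives a confidence score from the algorithm and a label stating whether it is present in a reference network. The following sketch computes the AUC as the fraction of (true edge, false edge) pairs that are ranked correctly, with ties counted as one half. The labels and scores are made-up examples.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for edges present in the reference network, 0 otherwise
    scores: confidence assigned to each edge by the inference method
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative edge")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0   # true edge ranked above false edge
            elif p == n:
                correct += 0.5   # tie counts as half
    return correct / (len(positives) * len(negatives))


# Hypothetical evaluation of five candidate edges.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of true and false edges.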

Examples of algorithms for inferring gene regulation networks

The many different network inference algorithms can be grouped into the following categories; different algorithms can also be used in parallel, in combination or to complement one another:

  • REVEAL and other algorithms for Boolean networks
  • Statistical methods such as LASSO ( Least Absolute Shrinkage and Selection Operator ) and LARS ( Least-Angle Regression )
  • Bayesian networks like ScanBMA
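To illustrate the LASSO category in the list above, the sketch below regresses the expression of one target gene on candidate regulators and minimizes (1/2n)·||y - Xw||² + alpha·||w||₁ by cyclic coordinate descent. Regulators with non-zero coefficients would then be interpreted as incoming edges. This is a didactic, pure-Python sketch with made-up data and an assumed value of `alpha`; it is not the implementation used by any of the cited tools.

```python
def soft_threshold(value, alpha):
    """Shrink a value towards zero by alpha (the LASSO proximal step)."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_sweeps=100):
    """Minimize (1/2n)*||y - Xw||^2 + alpha*||w||_1 by coordinate descent.

    X: list of samples, each a list of regulator expression values
    y: expression values of the target gene, one per sample
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of regulator j with the residual that ignores j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    return w


# Hypothetical data: regulator 0 explains the target, regulator 1 barely does.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.1, 2.0, 0.1]
w = lasso_coordinate_descent(X, y, alpha=0.1)
print(w)
```

The weak regulator receives a coefficient of exactly zero, which is how the L1 penalty enforces the sparsity hypothesis.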

The suitability of an algorithm depends on the model type, the available measurement data, the available prior knowledge, the complexity of the system (in particular the number of network nodes) and, above all, the objective of the network inference. Since 2006, the international project Dialogue for Reverse Engineering Assessments and Methods (DREAM) has identified the best-performing network inference algorithms on the basis of given data from a system that is known only to the organizers. One finding of this project is that aggregating the predictions of several network models calculated with different network inference algorithms improves the quality and robustness of the predictions. It was also found that LASSO methods are best suited for genome-wide network inference, provided they are well configured and the measurement data and prior knowledge are available in sufficient quantity and quality, a prerequisite that is met for the bacterium Escherichia coli. Boolean networks are particularly suitable for modeling steady-state situations on the basis of gene knockout data and for identifying signaling pathways.
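The aggregation of predictions from several algorithms can be sketched with a simple average-rank scheme: every method ranks all candidate edges by its own confidence score, and the edges are then ordered by their mean rank. Averaging ranks is assumed here as one straightforward way to combine methods whose scores are not on a common scale; the edge names and scores are invented.

```python
def ranks(scores):
    """Map each edge to its rank (1 = most confident) within one method."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {edge: position for position, edge in enumerate(ordered, start=1)}


def aggregate(predictions):
    """Order edges by their mean rank over all methods (best first).

    predictions: list of dicts, one per method, mapping edge -> score.
    All methods are assumed to score the same set of edges.
    """
    per_method = [ranks(p) for p in predictions]
    edges = per_method[0].keys()
    mean_rank = {
        edge: sum(r[edge] for r in per_method) / len(per_method)
        for edge in edges
    }
    return sorted(mean_rank, key=mean_rank.get)


# Hypothetical confidence scores of three methods for three candidate edges.
predictions = [
    {"A->B": 0.9, "A->C": 0.2, "B->C": 0.6},
    {"A->B": 0.5, "A->C": 0.1, "B->C": 0.8},
    {"A->B": 0.7, "A->C": 0.6, "B->C": 0.3},
]
consensus = aggregate(predictions)
print(consensus)
```

No single method has to be right about every edge; an edge only needs to be ranked consistently high across methods to end up near the top of the consensus list.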

Literature

  • M. Bansal, V. Belcastro, A. Ambesi-Impiombato, D. di Bernardo: How to infer gene networks from expression profiles. In: Molecular Systems Biology. Vol. 3, 2007, p. 78, doi:10.1038/msb4100120.
  • M. Hecker, S. Lambeck, S. Toepfer, E. van Someren, R. Guthke: Gene regulatory network inference: data integration in dynamic models - A review. In: BioSystems. Vol. 96, 2009, pp. 86-103, doi:10.1016/j.biosystems.2008.12.004.
  • T. Ideker, N. J. Krogan: Differential network biology. In: Molecular Systems Biology. Vol. 8, 2012, p. 565, doi:10.1038/msb.2011.99, PMID 22252388.
  • S. R. Maetschke, P. B. Madhamshettiwar, M. J. Davis, M. A. Ragan: Supervised, semi-supervised and unsupervised inference of gene regulatory networks. In: Briefings in Bioinformatics. Vol. 15, 2014, pp. 195-211, doi:10.1093/bib/bbt034.
  • P. Meyer, T. Cokelaer, D. Chandran, K. H. Kim, P. R. Loh, G. Tucker, M. Lipson, B. Berger, C. Kreutz, A. Raue, B. Steiert, J. Timmer, E. Bilal, H. M. Sauro, G. Stolovitzky, J. Saez-Rodriguez: Network topology and parameter estimation: from experimental design methods to gene regulatory network kinetics using a community based approach. In: BMC Systems Biology. Vol. 8, 2014, p. 13, doi:10.1186/1752-0509-8-13.
  • S. Hill, L. Heiser, T. Cokelaer, et al.: Inferring causal molecular networks: empirical assessment through a community-based effort. In: Nature Methods. Vol. 13, 2016, pp. 310-318, doi:10.1038/nmeth.3773.
  • M. M. Saint-Antoine, A. Singh: Network inference in systems biology: recent developments, challenges, and applications. In: Current Opinion in Biotechnology. Vol. 63, 2020, pp. 89-98, doi:10.1016/j.copbio.2019.12.002.
