Biological network inference

Many types of biological networks exist. Few such networks are known in anything approaching their complete structure, even in the simplest bacteria. Still less is known about the parameters governing the behavior of such networks over time, about how the networks at different levels in a cell interact, and about how to predict the complete state description of a eukaryotic cell or bacterial organism at a given point in the future. Systems biology (ref WP art), in this sense, is still in its infancy. Prediction is the subject of dynamic modeling (ref WP article). This article focuses on a necessary prerequisite to dynamic modeling of a network: inference of the topology, that is, prediction of the "wiring diagram" of the network. More specifically, it focuses on inference of biological network structure from the growing sets of high-throughput expression data for genes, proteins, and metabolites.

Briefly, methods that use high-throughput data to infer regulatory networks search for patterns of partial correlation or conditional probability that indicate causal influence (Spirtes et al., 2000). Such patterns found in the high-throughput data, possibly combined with supplemental data on the genes or proteins in the proposed networks or with other information on the organism, form the basis upon which these algorithms work. The algorithms can be used to infer the topology of any network in which a change in the state of one node can affect the state of other nodes.
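To make the notion of partial correlation concrete, the following is a minimal Python sketch, not taken from any of the cited algorithms. It assumes three profiles measured over the same samples and uses the standard first-order formula r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The function names and the toy values are hypothetical and chosen only for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy data (assumption): z is a common regulator driving both x and y.
z = [-1.0, -1.0, 1.0, 1.0]
x = [-0.5, -1.5, 1.5, 0.5]   # z plus independent variation
y = [-0.5, -1.5, 0.5, 1.5]   # z plus different independent variation

raw = pearson(x, y)               # high: x and y move together
conditioned = partial_corr(x, y, z)  # close to zero once z is accounted for
```

In this toy case x and y are strongly correlated only because both follow z; conditioning on z removes the association, which is the kind of pattern an inference algorithm would read as "no direct edge between x and y".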

In a topological sense, a network is a set of nodes and a set of directed or undirected edges between the nodes. Biological networks currently under study using such computational inference methods include:

1) Transcriptional regulatory networks. Genes are the nodes and the edges are directed. A gene serves as the source of a direct regulatory edge to a target gene by producing an RNA or protein molecule that functions as a transcriptional activator or inhibitor of the target gene. If the gene is an activator, then it is the source of a positive regulatory connection; if an inhibitor, then it is the source of a negative regulatory connection. Computational algorithms used to infer the topology take as primary input the data from a set of microarray runs measuring the mRNA expression levels of the genes under consideration for inclusion in the network.

As of 2007, the great bulk of high-throughput data being fed into correlation-based algorithms comes from microarray experiments, and such analysis is the most fruitful point of biological application for these algorithms. (This is reflected in the reference list at bottom, where almost all bioinformatic algorithm references are directed toward use of microarray data.) Clustering or some form of statistical classification is typically employed to perform an initial organization of the high-throughput mRNA expression values derived from microarray experiments. The question then arises: how can the clustering or classification results be connected to the underlying biology? Such results can be useful for pattern classification, for example to classify subtypes of cancer or to predict differential responses to a drug (pharmacogenomics). But to understand the relationships between the genes, that is, to define more precisely the influence of each gene on the others, the scientist typically attempts to reconstruct the transcriptional regulatory network. This can be done by combining the clustering results with background literature or information in public databases. It can also be done by applying a correlation-based inference algorithm, as discussed below, an approach that has had increasing success as the available microarray data sets have grown (Faith et al., 2007; Hayete et al., 2007).

2) Signal transduction networks (very important in the biology of cancer). Proteins are the nodes and the edges are directed. Primary input into the inference algorithm would be data from a set of experiments measuring protein activation / inactivation (e.g., phosphorylation [WP ref] / dephosphorylation [WP ref]) across a set of proteins.

3) Metabolite networks. Metabolites are the nodes and the edges are directed. Primary input into an algorithm would be data from a set of experiments measuring metabolite levels.

4) Intraspecies or interspecies communication networks in microbial communities. Nodes are excreted organic compounds and the edges are directed. Input into an inference algorithm is data from a set of experiments measuring levels of excreted molecules.

Protein-protein interaction networks are also under very active study. However, reconstruction of these networks does not use correlation-based inference in the sense discussed for the networks already described (interaction does not necessarily imply a change in protein state), and a description of such interaction network reconstruction is left to other articles.
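The node-and-edge descriptions above can be sketched as a small data structure. The following Python example is one possible representation of a transcriptional regulatory network as a signed, directed graph. The class name, method names, and gene names are hypothetical and are not drawn from any tool mentioned in this article.

```python
from collections import defaultdict


class RegulatoryNetwork:
    """Directed graph whose edges carry a sign: +1 activation, -1 inhibition."""

    def __init__(self):
        # source gene -> {target gene: sign}
        self.targets = defaultdict(dict)

    def add_edge(self, source, target, sign):
        if sign not in (1, -1):
            raise ValueError("sign must be +1 (activator) or -1 (inhibitor)")
        self.targets[source][target] = sign

    def regulators_of(self, target):
        """Return {source: sign} for every gene with an edge into target."""
        return {
            source: edges[target]
            for source, edges in self.targets.items()
            if target in edges
        }


# Hypothetical example: geneA activates geneB, geneC inhibits geneB.
net = RegulatoryNetwork()
net.add_edge("geneA", "geneB", +1)
net.add_edge("geneC", "geneB", -1)
incoming = net.regulators_of("geneB")
```

The same structure would serve for signal transduction or metabolite networks by changing what the node labels stand for; an undirected network could be stored by adding each edge in both directions.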

Correlation-based Inference Algorithms
1) from classical statistics - STUB  - baseline: Pearson correlation
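As a placeholder illustration of the Pearson baseline, the sketch below builds an undirected network by connecting every pair of genes whose expression profiles have an absolute Pearson correlation above a cutoff. The cutoff of 0.8, the function names, and the toy expression values are assumptions made for the example, not recommendations.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_network(expression, threshold=0.8):
    """Return undirected edges (gene1, gene2, r) with |r| >= threshold.

    expression maps a gene name to its list of expression values,
    one value per microarray run, in the same run order for every gene.
    """
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges


# Toy data (assumption): geneB tracks geneA, geneC is unrelated to both.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [1.0, -1.0, -1.0, 1.0],
}
edges = correlation_network(expression)
```

A network built this way has no edge directions and cannot separate direct from indirect influence, which is why it serves only as a baseline for the methods listed next.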

2) from information theory - STUB - concept of mutual information - ARACNE algorithm - CLR algorithm
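Pending a fuller treatment, the following sketch illustrates the two ideas named in this stub. The first function computes mutual information between two profiles that have already been discretized. The second applies a background correction in the spirit of CLR: each pairwise value is converted to a z-score against the distribution of values for each of the two genes, negative z-scores are clipped to zero, and the two are combined as sqrt(z_i^2 + z_j^2). This is a simplified reading of the published method, and the function names, the treatment of zero variance, and the input format are assumptions of this example.

```python
import math
import statistics
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discretized profiles."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


def clr_scores(mi):
    """CLR-style scores from a symmetric matrix of pairwise MI values.

    mi is a list of lists; the diagonal is ignored.
    """
    n = len(mi)
    means, sds = [], []
    for i in range(n):
        background = [mi[i][j] for j in range(n) if j != i]
        means.append(statistics.mean(background))
        sds.append(statistics.pstdev(background))

    def z(i, j):
        # z-score of MI(i, j) against gene i's background, clipped at zero.
        if sds[i] == 0:
            return 0.0  # assumption: a flat background gives no evidence
        return max(0.0, (mi[i][j] - means[i]) / sds[i])

    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            combined = math.sqrt(z(i, j) ** 2 + z(j, i) ** 2)
            scores[i][j] = scores[j][i] = combined
    return scores
```

In use, one would discretize each expression profile, fill a matrix with `mutual_information` for every gene pair, pass it to `clr_scores`, and keep the pairs whose score exceeds a chosen cutoff. The point of the correction is that a pair is judged against each gene's own background rather than against one global threshold.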

3) from graphical probabilistic models - STUB - Bayesian network structure learning - K2 algorithm (needs a node ordering) - BANJO toolkit
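To show why K2 needs a node ordering, here is a compact sketch of a K2-style greedy search over discrete data. It assumes the Cooper-Herskovits scoring function in log form and lets each node choose parents only from the nodes that precede it in the supplied ordering. The function names, the `max_parents` default, and the toy data are assumptions of this example; it is not the BANJO implementation.

```python
import math
from collections import Counter


def k2_log_score(data, child, parents):
    """Log Cooper-Herskovits score of `child` given a list of parents."""
    r = len(set(data[child]))      # number of observed child states
    n_samples = len(data[child])
    config_counts = Counter()      # N_ij: samples per parent configuration
    joint_counts = Counter()       # N_ijk: split further by child state
    for s in range(n_samples):
        config = tuple(data[p][s] for p in parents)
        config_counts[config] += 1
        joint_counts[(config, data[child][s])] += 1
    score = 0.0
    for n_ij in config_counts.values():
        score += math.lgamma(r) - math.lgamma(n_ij + r)
    for n_ijk in joint_counts.values():
        score += math.lgamma(n_ijk + 1)
    return score


def k2_search(data, ordering, max_parents=2):
    """Greedy parent selection; candidates are restricted by the ordering."""
    parents = {node: [] for node in ordering}
    for idx, node in enumerate(ordering):
        best = k2_log_score(data, node, parents[node])
        candidates = list(ordering[:idx])  # only earlier nodes may be parents
        while candidates and len(parents[node]) < max_parents:
            top_score, top = max(
                (k2_log_score(data, node, parents[node] + [c]), c)
                for c in candidates
            )
            if top_score <= best:
                break                      # no candidate improves the score
            parents[node].append(top)
            candidates.remove(top)
            best = top_score
    return parents


# Toy data (assumption): B copies A, C is unrelated to both.
data = {
    "A": [0, 0, 0, 0, 1, 1, 1, 1],
    "B": [0, 0, 0, 0, 1, 1, 1, 1],
    "C": [0, 1, 0, 1, 0, 1, 0, 1],
}
structure = k2_search(data, ["A", "B", "C"])
```

With the ordering A, B, C the search makes A the parent of B; with the ordering B, A, C the same data would instead make B the parent of A. The data alone cannot distinguish the two directions here, so the ordering supplies that information.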

DREAM project - stub

Platforms for network inference - STUB - geWorkbench, Columbia - SEBINI

Visualization of inferred network - STUB - Cytoscape tool

Expansion of inferred network using public databases - data integration - STUB - CABIN tool

Keywords
bioinformatics, gene regulatory network, metabolic network modeling, network analysis, computational systems biology, systems biology, Bayesian network, mutual information, protein-protein interaction prediction