Protein-protein interaction prediction

Protein-protein interaction prediction is a field combining bioinformatics and structural biology in an attempt to identify and catalog interactions between pairs or groups of proteins. Understanding protein-protein interactions is important in investigating intracellular signaling pathways. Experimentally, interactions between pairs of proteins are inferred from yeast two-hybrid systems, from affinity purification/mass spectrometry assays, or from protein microarrays. In parallel to the experimental determination of the interactome, computational methods are being developed.

Methods
Proteins that interact are more likely to co-evolve, so interactions between pairs of proteins can be inferred from their phylogenetic distances. It has also been observed in some cases that pairs of interacting proteins have fused orthologues in other organisms. In addition, a number of bound protein complexes have been structurally solved and can be used to identify the residues that mediate the interaction, so that similar motifs can be located in other organisms.

Phylogenetic profiling
This method uses a sequence search tool such as BLAST to find homologues of a pair of proteins, then builds multiple sequence alignments with an alignment tool such as Clustal. From these alignments, a phylogenetic distance matrix is calculated for each protein in the hypothesized interacting pair. If the two matrices are sufficiently similar (as measured by their Pearson correlation coefficient), the proteins are deemed likely to interact.
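
The matrix comparison can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes both distance matrices are already computed over the same species in the same order, and the matrices, the function names and the 0.8 cutoff are hypothetical.

```python
from math import sqrt


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def matrix_correlation(dist_a, dist_b):
    """Correlate two symmetric distance matrices over the same species."""
    if len(dist_a) != len(dist_b):
        raise ValueError("matrices must cover the same species")
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))


# Hypothetical distances for two protein families across four species.
dist_a = [
    [0.0, 0.2, 0.5, 0.9],
    [0.2, 0.0, 0.4, 0.8],
    [0.5, 0.4, 0.0, 0.6],
    [0.9, 0.8, 0.6, 0.0],
]
dist_b = [
    [0.0, 0.3, 0.6, 1.0],
    [0.3, 0.0, 0.5, 0.7],
    [0.6, 0.5, 0.0, 0.7],
    [1.0, 0.7, 0.7, 0.0],
]

r = matrix_correlation(dist_a, dist_b)
likely_to_interact = r >= 0.8  # assumed cutoff, chosen only for illustration
```

Only the entries above the diagonal are compared, because a distance matrix is symmetric and its diagonal is zero.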

Identification of homologous interacting pairs
This method searches a database of known complex structures for homologues of the two query sequences that form a complex. Domains are identified by sequence searches against domain databases such as Pfam using BLAST. If more than one complex of Pfam domains is identified, the query sequences are aligned to the closest identified homologues of known structure using the hidden Markov model tool HMMER. The alignments are then analysed to check whether the contact residues of the known complex are conserved.
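
The final conservation check might look like the sketch below. It assumes a pairwise alignment is already available as two gapped strings and that the contact positions are 0-based indices into the ungapped template sequence. Scoring by exact residue identity is a simplification chosen for illustration, and the sequences and the function name are hypothetical.

```python
def contact_conservation(template_aln, query_aln, contact_positions):
    """Fraction of template contact residues that are identical in the query.

    template_aln, query_aln: aligned sequences of equal length, '-' for gaps.
    contact_positions: 0-based indices into the ungapped template sequence.
    """
    if len(template_aln) != len(query_aln):
        raise ValueError("aligned sequences must have equal length")
    contacts = set(contact_positions)
    template_index = -1
    seen = 0
    conserved = 0
    for t_res, q_res in zip(template_aln, query_aln):
        if t_res == "-":
            continue  # insertion in the query; no template residue here
        template_index += 1
        if template_index in contacts:
            seen += 1
            if q_res == t_res:
                conserved += 1
    if seen == 0:
        raise ValueError("no contact positions fall inside the template")
    return conserved / seen


# Hypothetical alignment: template residues 1, 3 and 4 contact the partner.
score = contact_conservation("AC-DEFG", "ACQDKFG", [1, 3, 4])
```

Here two of the three contact residues (C and F) are conserved, while the third (E) is replaced by K in the query.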

Identification of structural patterns
A third method builds a library of known protein-protein interfaces from the Protein Data Bank (PDB), where an interface is defined as a pair of polypeptide fragments whose interatomic distances fall below a threshold slightly larger than the van der Waals radii of the atoms involved. The interfaces in the library are then clustered based on structural alignment, and redundant entries are eliminated. Residues that occur at a given position with high frequency (generally >50%) are considered hotspots. This library is then used to identify potential interactions between pairs of targets, provided that they have a known structure (i.e. are present in the PDB).
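
The hotspot rule can be illustrated with a short sketch. It assumes the interface fragments in one cluster have already been structurally aligned into equal-length strings. The fragments and the function name are hypothetical, and the 50% cutoff is taken from the text above.

```python
from collections import Counter


def find_hotspots(aligned_interfaces, min_frequency=0.5):
    """Return {position: (residue, frequency)} for high-frequency residues.

    aligned_interfaces: equal-length strings from one interface cluster,
    with '-' marking gaps. Gaps count against a residue's frequency.
    """
    if not aligned_interfaces:
        raise ValueError("need at least one aligned interface")
    length = len(aligned_interfaces[0])
    if any(len(seq) != length for seq in aligned_interfaces):
        raise ValueError("all aligned interfaces must have equal length")
    hotspots = {}
    for pos in range(length):
        column = [seq[pos] for seq in aligned_interfaces if seq[pos] != "-"]
        if not column:
            continue
        residue, count = Counter(column).most_common(1)[0]
        frequency = count / len(aligned_interfaces)
        if frequency > min_frequency:
            hotspots[pos] = (residue, frequency)
    return hotspots


# Hypothetical cluster of four aligned interface fragments.
cluster = ["WKLD", "WRAE", "WQLD", "YKGD"]
hotspots = find_hotspots(cluster)
```

In this toy cluster, positions 0 (W) and 3 (D) each reach 75% and are reported, while positions 1 and 2 sit at exactly 50% and are not.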

Bayesian network modelling
Bayesian methods integrate data from a wide variety of sources, including both experimental results and prior computational predictions, and use these features to assess the likelihood that a particular potential protein interaction is a true positive result. These methods are useful because experimental procedures, particularly the yeast two-hybrid experiments, are extremely noisy and produce many false positives, while the previously mentioned computational methods can only provide circumstantial evidence that a particular pair of proteins might interact.
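
One simple way to combine evidence in this spirit is a naive Bayes update, in which each source contributes a likelihood ratio and the sources are assumed to be conditionally independent. That independence assumption is a simplification made here for illustration (a full Bayesian network can model dependencies between sources), and the prior and likelihood ratios below are hypothetical.

```python
def posterior_probability(prior, likelihood_ratios):
    """Combine a prior with independent likelihood ratios (naive Bayes).

    prior: prior probability that a random protein pair interacts.
    likelihood_ratios: for each evidence source,
        P(evidence | interacting) / P(evidence | not interacting).
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        if ratio < 0:
            raise ValueError("likelihood ratios must be non-negative")
        odds *= ratio
    return odds / (1.0 + odds)


# Hypothetical evidence for one candidate pair: a yeast two-hybrid hit,
# correlated phylogenetic distances, and a homologous complex.
p = posterior_probability(prior=0.001, likelihood_ratios=[50.0, 10.0, 4.0])
```

No single source is decisive in this example, but together they raise a prior of 0.1% to a posterior of about 67%.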

Relationship to docking methods
The field of protein-protein interaction prediction is closely related to the field of protein-protein docking, which attempts to use geometric and steric considerations to fit two proteins of known structure into a bound complex. Docking is a useful mode of inquiry when both proteins in the pair have known structures and are known (or at least strongly suspected) to interact. Because so many proteins lack experimentally determined structures, however, sequence-based interaction prediction methods are especially useful in conjunction with experimental studies of an organism's interactome.

Servers

 * InterProSurf
 * ADVICE
 * FastContact
 * InterPreTS
 * PRISM
 * PIP
 * SPPIDER
 * cons-PPISP

Dynamics Method

 * Simple brute-force approach: the Dynamics Method performs protein-protein interaction prediction (PPIP) using the same rules as the real system, simulating the dynamics of every force on every atom in two proteins of interest to predict first their folding and then their interaction. It then does the same for every potential protein pair in the genome.
 * Advantages and disadvantages
 * Hypothetically accurate.
 * Impossible in practice because of the massive computational requirements.

Folding and Docking

 * The unworkable Dynamics Method can be broken up into two smaller sub-problems, Folding and Docking, to avoid or minimise computation of dynamics:


 * The most effective Folding Prediction Method predicts folded protein structures in a reasonable amount of computational time by using statistical substitution followed by tweaking. Statistical substitution folds a short stretch of amino acid residues into the configuration that has been statistically dominant in previously observed structures. Tweaking is similar to heating the structure, in that it introduces small random changes and keeps those that lead to the lowest energy states.
 * Advantages and disadvantages
 * Reasonable results for individual predictions.
 * Accuracy improves as more folding conformations are verified.
 * Too slow to run on a genome-wide scale.
 * Not accurate with atypical structures.
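
The tweaking step described above can be sketched as a greedy random search. Everything in this example is an assumption made for illustration: the conformation is reduced to a list of angles, the quadratic energy function is a toy stand-in for a real force field, and only downhill moves are accepted.

```python
import random


def tweak(conformation, energy, steps=2000, max_step=0.1, seed=0):
    """Greedy random search: keep small random changes that lower the energy.

    conformation: list of floats (for example, torsion angles in radians).
    energy: callable mapping a conformation to a number; lower is better.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    best = list(conformation)
    best_energy = energy(best)
    for _ in range(steps):
        candidate = list(best)
        i = rng.randrange(len(candidate))
        candidate[i] += rng.uniform(-max_step, max_step)
        candidate_energy = energy(candidate)
        if candidate_energy < best_energy:
            best, best_energy = candidate, candidate_energy
    return best, best_energy


# Toy energy: squared deviation from a hypothetical low-energy conformation.
TARGET = [1.0, -0.5, 0.3]


def toy_energy(angles):
    return sum((a - t) ** 2 for a, t in zip(angles, TARGET))


start = [0.0, 0.0, 0.0]
start_energy = toy_energy(start)
final, final_energy = tweak(start, toy_energy)
```

Because this sketch never accepts an uphill move, it can become trapped in a local minimum on a rugged energy surface. Temperature-based schemes such as simulated annealing occasionally accept worse states to escape such traps.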


 * Once protein folding has been successfully modeled, protein docking is the next logical step. To simplify the dynamics of docking, binary docking methods find potentially active sites on a single folded protein structure and match them to active sites on a second protein using pattern recognition software or geometric hashing algorithms. Conserved domains are observed [52] and used to imply potential binding partners, because surface complementarity between interacting protein sites is high.
 * Advantages and disadvantages
 * Multiple protein dockings are also being accurately predicted.
 * Relies on folding information that is not available for much of the genome.
 * Too slow for a genome-wide tool.
 * Low reliability.

Sequence Method

 * The Sequence Method is an attempt to avoid the modelling of folding and docking altogether by using direct pattern recognition of the binding sequences.
 * Advantages and disadvantages
 * Fast enough to be used as a genome-wide tool.
 * Oversimplification is possible.
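
As a minimal sketch of direct pattern recognition on sequence, the example below scans for the proline-rich PxxP motif recognised by SH3 domains. The choice of motif, the example sequence and the function name are illustrative assumptions, and a motif match alone does not establish that two proteins interact.

```python
import re

# PxxP: proline, any two residues, proline. The lookahead allows
# overlapping occurrences to be reported.
PXXP = re.compile(r"(?=(P..P))")


def find_motif(sequence, pattern=PXXP):
    """Return (start, matched_text) for every occurrence, overlaps included."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]


# Hypothetical sequence fragment.
hits = find_motif("MAPPLPPRSA")
```

The fragment yields two overlapping matches, PPLP at position 2 and PLPP at position 3. A scan like this is fast enough to run across a whole genome, but short motifs also occur by chance, which is one way the oversimplification noted above shows up.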

Graph Learning Method

 * The Graph Learning Method improves on the Sequence Method and its problems by training a computer to learn which attributes are important for PPIP from patterns in observed interactions. It then uses these attribute patterns for PPIP.
 * Advantages and disadvantages
 * Fast genome-wide tool.
 * Good reliability

Vector Learning Method

 * The Vector Learning Method is an alternative to the Graph Learning Method and is currently competing for the title of most efficient method. Both machine learning methods are probably of equal potential. A training set is mapped to an n-dimensional space in which successful combinations of residues occupy identifiable regions. Each element of the pattern, or residue attribute, is mapped to a separate dimension (“vectorization”). Unlike ordinary two-dimensional (latitude and longitude) city maps, protein pattern maps are most effective when using more than 20 dimensions. If a potential protein pair lies within the region identified as successful, an interaction is predicted.
 * Advantages and disadvantages
 * Fast genome-wide tool.
 * Good reliability
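
The vectorization idea can be sketched with a deliberately simple classifier. The assumptions here are substantial: each protein is encoded only by its amino acid composition (20 dimensions per protein, 40 per pair), a nearest-centroid rule stands in for whatever learner a real method would use, and the training pairs are invented toy sequences.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """20-dimensional amino acid composition of one sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]


def pair_vector(seq_a, seq_b):
    """40-dimensional feature vector for an ordered protein pair."""
    return composition(seq_a) + composition(seq_b)


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def train(positive_pairs, negative_pairs):
    """Summarise each class by the centroid of its pair vectors."""
    pos = centroid([pair_vector(a, b) for a, b in positive_pairs])
    neg = centroid([pair_vector(a, b) for a, b in negative_pairs])
    return pos, neg


def predict(model, seq_a, seq_b):
    """Predict an interaction if the pair lies closer to the positive centroid."""
    pos, neg = model
    vector = pair_vector(seq_a, seq_b)
    return squared_distance(vector, pos) < squared_distance(vector, neg)


# Toy training data: basic fragments paired with acidic ones "interact".
positives = [("KKKRKR", "DDEEDE"), ("RRKKRK", "EEDDEE")]
negatives = [("KKRRKK", "RKRKRK"), ("DEDEDE", "EEDDED")]
model = train(positives, negatives)

basic_with_acidic = predict(model, "KRKRKK", "EDEDDD")
acidic_with_acidic = predict(model, "DDDEEE", "DEDEDE")
```

With this toy model the basic-acidic query is predicted to interact and the acidic-acidic query is not. Because the pair vector is ordered, swapping the two sequences changes the features, so a practical encoding would need to treat (A, B) and (B, A) consistently.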

Evolutionary Method

 * Because a large amount of work has been done on interactomes, the Evolutionary Method is becoming a practical shortcut. It uses data from PPIP or from experimentally verified interaction maps to infer protein interactions in evolutionarily related organisms.
 * Advantages and disadvantages
 * Relies on interactomes.
 * Fast genome-wide tool.
 * Good reliability.
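
The transfer of interactions between related organisms can be sketched as a lookup through an orthologue table. The sketch assumes the orthologue mapping has been computed elsewhere, and all protein identifiers and the function name are hypothetical.

```python
def transfer_interactions(source_interactions, orthologs):
    """Predict target-organism interactions from source-organism ones.

    source_interactions: iterable of (protein_a, protein_b) pairs.
    orthologs: dict mapping a source protein to its target orthologues.
    Returns a set of sorted tuples so that (a, b) and (b, a) coincide.
    """
    predicted = set()
    for prot_a, prot_b in source_interactions:
        for target_a in orthologs.get(prot_a, ()):
            for target_b in orthologs.get(prot_b, ()):
                predicted.add(tuple(sorted((target_a, target_b))))
    return predicted


# Hypothetical yeast interactions and yeast-to-human orthologues.
yeast_interactions = [("yA", "yB"), ("yB", "yC")]
yeast_to_human = {"yA": ["hA1", "hA2"], "yB": ["hB"]}  # yC has no orthologue

predicted_human = transfer_interactions(yeast_interactions, yeast_to_human)
```

The yA-yB interaction maps onto two human pairs because yA has two orthologues, while yB-yC is dropped because yC has none. This illustrates how the method depends on the coverage of existing interactomes and orthologue assignments.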

Validation
Predictions must be validated experimentally. However, all experimental methods are costly and carry unavoidable error, producing false negatives (FN) and false positives (FP). Choosing and understanding the best methods of verification is therefore very important.
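
False positives and false negatives are usually summarised as precision and recall against a reference set. The sketch below assumes such a reference set is available and treats it as ground truth, even though, as noted above, the experimental reference itself contains errors. The protein names are hypothetical.

```python
def normalise(pairs):
    """Treat interactions as unordered: (a, b) equals (b, a)."""
    return {tuple(sorted(pair)) for pair in pairs}


def evaluate(predicted_pairs, reference_pairs):
    """Count TP, FP and FN and derive precision and recall."""
    predicted = normalise(predicted_pairs)
    reference = normalise(reference_pairs)
    tp = len(predicted & reference)  # predicted and confirmed
    fp = len(predicted - reference)  # predicted but not confirmed
    fn = len(reference - predicted)  # confirmed but missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if reference else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}


# Hypothetical predictions and experimentally supported interactions.
predicted = [("A", "B"), ("C", "A"), ("B", "D")]
reference = [("A", "B"), ("D", "B"), ("C", "D")]
metrics = evaluate(predicted, reference)
```

In this example two of the three predictions are confirmed, one is a false positive (A-C) and one reference interaction is missed (C-D), giving a precision and a recall of 2/3 each.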

Significant results
Many new drugs and biological insights are developed starting with PPIP before moving on to experimental methods, saving time and millions of dollars in the process.

PPIP produces results that need biological verification and further exploration before the results can be used to cure diseases with new drugs or understanding. The results are used heavily as a starting point for biological research where most of the metabolic pathway of interest is unknown.

Interpreting the results of PPIP can be problematic because of the volume of data generated, so the data is often organised in a hierarchical manner, or as an interactome. Two approaches work best. The first is to display only one or two interaction links deep of a hierarchy at a time. The second is to make the highly interactive (hub) proteins the roots of the interaction trees, creating groupings of functionally and spatially related proteins.
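
Both display strategies can be sketched on a small interaction graph. The network below is hypothetical, and ranking hubs by their number of interaction partners is an assumption made for illustration.

```python
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (a, b) interactions."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def hubs(adjacency, count=1):
    """The `count` proteins with the most partners (ties broken by name)."""
    ranked = sorted(adjacency, key=lambda p: (-len(adjacency[p]), p))
    return ranked[:count]


def neighbourhood(adjacency, root, max_depth=2):
    """Breadth-first search from root, limited to max_depth links.

    Returns {protein: depth} for every protein within the limit.
    """
    depths = {root: 0}
    queue = deque([root])
    while queue:
        protein = queue.popleft()
        if depths[protein] == max_depth:
            continue
        for partner in sorted(adjacency.get(protein, ())):
            if partner not in depths:
                depths[partner] = depths[protein] + 1
                queue.append(partner)
    return depths


# Hypothetical predicted interactions.
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P4", "P5"), ("P5", "P6")]
adjacency = build_adjacency(edges)
root = hubs(adjacency)[0]
view = neighbourhood(adjacency, root, max_depth=2)
```

P1 has the most partners and becomes the root. The two-link view then contains P1 through P5 and leaves out P6, which lies three links away.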

The main goal of proteomics is to predict the structures, interactions and functions of the proteins. Specific function is only found through interactions. The prediction of protein-protein interactions is of vital interest in proteomics.