PPI-Detect

The prediction of the likelihood of protein-protein interactions (PPI) is a challenging task, especially if amino acid sequences are the only information available. We use a novel procedure for general-purpose numerical encoding of polypeptides, which transforms pairs of amino acid sequences into a machine-learning-friendly vector, whose elements represent numerical descriptors of groups of residues in the proteins. PPI-Detect is a Support-Vector-Machine model for predicting protein-protein and protein-peptide interactions.

How to cite PPI-Detect:

S. Romero‐Molina, Y. B. Ruiz‐Blanco, M. Harms, J. Münch, E. Sanchez‐Garcia. PPI‐Detect: A support vector machine model for sequence‐based prediction of protein–protein interactions. J. Comput. Chem. 2019, 1‐10. DOI: 10.1002/jcc.25780

Input of the PPI-Detect web server:

To run PPI-Detect, at least two sequences in FASTA format must be provided separately, as two sets (Sequences A and Sequences B).

For example:

Sequences A

You can either "Enter a sequence(s)" or "Upload a file" with the lines:

>PA
... sequence PA...

Sequences B

You must provide a file with the sequences to combine, here PB and PC:

>PB
.... sequence PB...
>PC
.... sequence PC...

The interaction likelihood is then computed for every pair that combines one sequence from set A with one sequence from set B:

>PA and >PB
>PA and >PC
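The pairing described above can be sketched in a few lines of Python. This is only an illustration of the combinatorial step, not the server's own code: the helper names read_fasta and all_pairs are hypothetical, and the short sequences are placeholders.

```python
from itertools import product


def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def all_pairs(set_a, set_b):
    """Return every (name_a, name_b) combination between the two sets."""
    return [(a[0], b[0]) for a, b in product(set_a, set_b)]


# Placeholder sequences, for illustration only.
sequences_a = read_fasta(">PA\nMKT\n")
sequences_b = read_fasta(">PB\nGAV\n>PC\nLLK\nWW\n")
pairs = all_pairs(sequences_a, sequences_b)
print(pairs)
```

With one sequence in set A and two in set B, the sketch yields the two pairs listed above; in general, n sequences in A and m in B give n × m pairs.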

Output of the PPI-Detect web server:

A table with the following information for each protein-protein pair:

Instance: Name of the instance (protein-protein pair)
Prediction: The predicted class (Interaction or Not interaction)
Score: The predicted probability that the interaction occurs

The table also reports whether the predicted case falls within the applicability domain (AD) of the PPI model:

AD 1st-99th: The case is Out of the AD when at least one descriptor value lies outside the range defined by the 1st and 99th percentiles of the training data.
AD 100th: The case is Out of the AD when at least one descriptor value lies outside the full range (minimum to maximum) of the training data for the PPI model.
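The two AD criteria above amount to per-descriptor range checks. The following Python sketch shows one way to express them, under stated assumptions: the function names are hypothetical, the toy training data is invented for illustration, the "In" label is assumed as the counterpart of "Out", and the percentile convention (here the "inclusive" method of statistics.quantiles) may differ from the one used by the server.

```python
from statistics import quantiles


def descriptor_ranges(training_rows):
    """For each descriptor column, return (p1, p99, minimum, maximum)."""
    ranges = []
    for column in zip(*training_rows):
        # n=100 yields 99 cut points: the 1st ... 99th percentiles.
        cuts = quantiles(column, n=100, method="inclusive")
        ranges.append((cuts[0], cuts[-1], min(column), max(column)))
    return ranges


def applicability_domain(query, ranges):
    """Return the (AD 1st-99th, AD 100th) labels for one descriptor vector."""
    # A case is Out as soon as a single descriptor leaves the range.
    in_1_99 = all(p1 <= v <= p99 for v, (p1, p99, _, _) in zip(query, ranges))
    in_full = all(lo <= v <= hi for v, (_, _, lo, hi) in zip(query, ranges))
    return ("In" if in_1_99 else "Out", "In" if in_full else "Out")


# Toy training data (assumption): 101 cases with 2 descriptors each,
# spanning 0..100 and 0..200.
training = [(float(i), float(2 * i)) for i in range(101)]
ranges = descriptor_ranges(training)

print(applicability_domain((50.0, 100.0), ranges))
print(applicability_domain((99.5, 100.0), ranges))
print(applicability_domain((150.0, 100.0), ranges))
```

The second query illustrates why the two columns can differ: a value beyond the 99th percentile but still below the training maximum is Out under the first criterion only.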

Example files:

sequences_A.fasta

>PF00189
IKRGIEFNKSYKGIIKNIISNAFKSRCLGLKIAIQGRINGNVMTRKQIFFHGKLPLQKFSANIKYSSGTALTIHGCIGIKVWL

sequences_B.fasta

>PF00163
KRKHSKFKFDRRIGENLWNNPRSSVIACSNPPGQHGAKIKTKTSDFCVRMIAKQKLKFYYSNLTESKLRKLYKKALRYGGNASHNLVRLLE
>PF01599
CLSHFNVDKDGNVQILKKVCPTCGPEIFMSSHAEGFFCSKCFST
>PF01246
LKEGTCIFSGHDVPKGSGLIKVTNDTRSFVFKNQKVLKLVERKINPKDIAWTQASRILHKKGEKKT
>PF00281
NVFNIQKLDRIVINIGINSAIHDPKQILLCLTALELITTQKPVIYRSKKSIAAFKVR
>PF01479
MRLDEYVHHNGYTESRSKAQDIILAGCVFVNGVKVTSKAQKIKDTDKI

Download example files

Example output:

The server shows the following table, which summarizes all the information provided in the output files, plus a link to download them:

#	Instance	Prediction	Score	AD 1st & 99th	AD 100th
0	PF00189PF00163	Interaction	0.578	Out	Out
1	PF01248PF01599	Not interaction	0.158	Out	Out
2	PF01248PF01246	Not interaction	0.183	Out	Out
3	PF00163PF00281	Not interaction	0.181	Out	Out
4	PF00163PF01479	Not interaction	0.379	Out	Out

Notes:

The 5th and 6th columns indicate whether the protein pair is within the applicability domain (AD) of the model.

PPI-Detect was built with a nonredundant benchmarking dataset of PPIs gathered from three comprehensive, curated and publicly available databases. These databases contain information about pairs of protein domains with proven interactions (3did and iPfam), and domain pairs that are very unlikely to be involved in an interaction (Negatome 2.0).

We split the dataset into training and test sets. The interacting domains are the positive cases and the noninteracting domains are the negative cases.

Training: This subset includes 3491 pairs (1613 positive and 1878 negative). download

Testing: This subset includes 836 pairs of domains (309 positive and 527 negative).

To estimate the performance of the final model, we grouped the test data by degrees of difficulty:

Very hard subset. It gathers pairs in which neither domain is present in the training data. It contains 103 domain pairs (57 positive and 46 negative). download
Mid-hard subset. It comprises domain pairs where only one of the domains is present in the training data. It contains 307 domain pairs (102 positive and 205 negative). download
Easy subset. It comprises pairs where both domains are present in the training data. It includes 426 domain pairs (150 positive and 276 negative). download
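The grouping rule above depends only on how many domains of a test pair also occur in the training set. A minimal Python sketch of that rule follows; the function name difficulty and the domain identifiers are hypothetical and serve only as illustration, not as the procedure actually used to build the subsets.

```python
def difficulty(pair, training_domains):
    """Classify a test pair by how many of its domains occur in training."""
    seen = sum(1 for domain in pair if domain in training_domains)
    return {0: "very hard", 1: "mid-hard", 2: "easy"}[seen]


# Hypothetical domain identifiers, for illustration only.
training_pairs = [("D1", "D2"), ("D2", "D3")]
training_domains = {domain for pair in training_pairs for domain in pair}

test_pairs = [("D1", "D3"), ("D1", "D9"), ("D8", "D9")]
labels = [difficulty(pair, training_domains) for pair in test_pairs]
print(labels)
```

Note that a pair counts as easy when both of its domains were seen during training, even if that exact pair was not part of the training set.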

The files contain only the pairs of domains; to obtain the sequences, click here.
