Target identification and validation are crucial in drug research and development (R&D). Beyond disease relevance, successful drug targets generally share features that distinguish them from non-target proteins. Systematically summarizing these features and using them to predict computationally whether a protein is suitable as a drug target (i.e. target prediction) can greatly improve the efficiency and success rate of target selection.
Owing to the history of the pharmaceutical industry, previous target prediction studies focused either on traditional, dominant small-molecule drugs or on drugs irrespective of type. In fact, the target properties of different drug types differ significantly, as recognized by several studies [J Drug Target. 2009, 17(7):524-32; Nat Biotechnol. 2012, 30(4):317-20; Genome Med. 2014, 6(7):57] and confirmed by our systematic analysis and comparison of the targets of different drug types (see our future paper). It is therefore necessary to perform target prediction separately for each drug type. So far the community has lacked a target prediction method and tool designed specifically for protein and peptide drugs, which after decades of development have grown into a major drug class on the market and continue to develop rapidly.
POPPIT (Predictor Of Protein, PeptIde, small-molecule drugs’ therapeutic Targets) is the first web server for genome-wide target prediction tailored to different drug types (protein, peptide and small-molecule drugs) based on their respective features. It also provides extensive annotations (>60 data fields) for the potential targets and their relevance to various diseases (through the Disease relevance analysis function of POPPIT), with the aim of improving the efficiency and success rate of target selection for different types of drugs.
1) Target prediction respectively for protein, peptide and small-molecule drugs
2) Abundant annotations for potential targets, providing clues for in-depth functional validation of candidate targets
3) Disease relevance analysis of candidate targets. Please refer to Q10 for more information.
4) Disease2Target: starting with a disease of interest, target prediction for proteins relevant to that disease. Please refer to Q11 for more information.
POPPIT supports two simple workflows. In the first, users start from the homepage to predict targets and can then check the diseases associated with the predicted potential targets (using the Disease relevance analysis function). In the second, users start from a disease of interest on the "Disease2Target" page and then perform target prediction for proteins relevant to that disease.
Therapeutic targets are the proteins (or other biomolecules) through interaction with which a drug exerts its therapeutic effects, excluding side-effect targets and other binding proteins without pharmacological efficacy.
Our systematic analysis and comparison show that they differ significantly from one another in multiple aspects, including amino acid (AA) composition and physicochemical properties, structure, evolution, network topology and, especially, function. See our future paper for more information.
To establish the target prediction models, first the golden standard positive (GSP) and negative (GSN) datasets were determined respectively for protein, peptide and small-molecule drugs.
The construction of GSP set (i.e. known therapeutic target list of approved protein/peptide/small-molecule drugs):
We collected therapeutic targets of approved drugs (excluding withdrawn drugs) from the TTD (version: 5.1.02), GtoPdb (downloaded on 03/19/2017) and DrugBank (downloaded on 07/26/2015) databases and from Ref. [Annu Rev Pharmacol Toxicol. 2014, 54:9-26]. For GtoPdb, we considered only its strictly defined “primary” targets; for DrugBank, only drug targets with known pharmacological action were used. All targets of “group” type, such as protein complexes, were excluded. Drug types (protein, peptide and small-molecule drugs) were distinguished using the annotations provided by the corresponding resources or by manual curation; peptides were separated from proteins on the basis of size and are arbitrarily defined as molecules containing fewer than 50 amino acids. Ultimately, 132, 55 and 634 human therapeutic target proteins, uniformly represented by Swiss-Prot accession number (SP AC), were obtained for protein, peptide and small-molecule drugs respectively (these data can be downloaded from the Download page of POPPIT).
For the GSN sets, given the difficulty of obtaining an experimentally verified negative dataset, we adopted the following scheme. First, we removed as many known targets of protein (/peptide/small-molecule) drugs as possible from the whole human proteome. Besides the corresponding GSP set, the excluded targets also included the non-“primary” targets of approved protein (/peptide/small-molecule) drugs from GtoPdb (downloaded on 03/19/2017), the targets of trial protein (/peptide/small-molecule) drugs from Ref. [Annu Rev Pharmacol Toxicol. 2014, 54:9-26], all targets (including those without known pharmacological action) of all protein (/peptide/small-molecule) drugs (including experimental ones) from DrugBank (downloaded on 07/26/2015), and targets of “group” type. Then 200 (/100/1000) proteins randomly picked from the remaining human proteome constituted the GSN set for the targets of protein (/peptide/small-molecule) drugs.
We used a naïve Bayes model to integrate multiple features and construct target prediction models separately for protein, peptide and small-molecule drugs. Ultimately, Model_8_protein, which integrates 8 features (“transcriptional factor”, “signaling molecule”, “signal peptide”, “transmembrane region”, DER, “betweenness centrality_PPI”, “pathway number” and “indegree_TF”), is used for target prediction for protein drugs; Model_4_peptide, which integrates DER, “signal peptide”, “transmembrane region” and “signaling molecule”, is used for peptide drugs; and Model_9_small, which integrates “basic”, “aromatic”, “enzyme”, “signaling molecule”, DER, “pathway number”, “indegree_TF”, “ion channel” and “reaction number”, is used for small-molecule drugs. Please see our future paper for the details.
Both 10-fold cross-validation and tests based on multiple independent test schemes indicate the satisfactory performance of Model_8_protein, Model_4_peptide and Model_9_small. Please see our future paper for the details.
A protein’s target prediction score for a given drug type is produced by the corresponding model described above (Model_8_protein, Model_4_peptide or Model_9_small). These three prediction models are constructed using naïve Bayes, so the prediction score is simply the combined likelihood ratio (LR) of multiple pieces of supporting evidence. Generally, a score (i.e. combined LR) >= 1 indicates that there is evidence supporting the protein as a target for that drug type.
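As an illustration of how such a score can arise, the sketch below assumes the usual naïve Bayes form in which the per-feature LRs are treated as conditionally independent and multiplied together. The function names and the example LR values are hypothetical and are not taken from POPPIT.

```python
from math import prod


def combined_lr(feature_lrs):
    """Multiply per-feature likelihood ratios into one combined LR.

    Assumes conditional independence of the features (naive Bayes).
    """
    return prod(feature_lrs)


def has_supporting_evidence(score, threshold=1.0):
    """Apply the 'score >= 1' rule of thumb described in the text."""
    return score >= threshold


# Hypothetical per-feature LRs for one protein (illustrative values only).
example_lrs = [2.0, 1.5, 0.5, 4.0]
score = combined_lr(example_lrs)
print(score, has_supporting_evidence(score))
```

A feature with LR below 1 (0.5 here) lowers the combined score, while features with LR above 1 raise it.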
The whole human genome is ranked in order of decreasing prediction score (i.e. combined LR) for a given drug type. Rank indicates the protein’s position in this genome-wide ranking. Proteins with the same score share the same rank: if n proteins with the same score share rank i, the next protein’s rank is i+n.
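The tie rule above corresponds to what is often called competition ranking (1, 2, 2, 4, ...). A minimal sketch, with a hypothetical function name and made-up scores:

```python
def competition_rank(scores):
    """Return {protein: rank}; tied scores share a rank and the next rank skips ahead."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    prev_score = None
    prev_rank = 0
    for position, (protein, score) in enumerate(ordered, start=1):
        if score != prev_score:
            # A new score value takes its 1-based position as its rank.
            prev_rank = position
            prev_score = score
        ranks[protein] = prev_rank
    return ranks


# Illustrative scores only: B and C tie at rank 2, so D gets rank 4.
print(competition_rank({"A": 9.0, "B": 5.0, "C": 5.0, "D": 1.2}))
```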
The Disease relevance analysis function returns the diseases relevant to the query protein.
The relevance between a protein and a disease is quantified here by their distance in the PPI network, defined as the minimum shortest path length between the protein and the disease’s known related genes. Generally, the shorter the distance, the stronger the relevance (Nat Biotechnol. 2007, 25(10):1119-26). The PPI network used here is the one we integrated. Known gene-disease associations (distance=0) were taken from three databases: OMIM (Online Mendelian Inheritance in Man), PheGenI (Phenotype-Genotype Integrator) (downloaded on 10/21/2016) and CTD (The Comparative Toxicogenomics Database) (version: Jul. 6, 2017).
The OMIM database is a comprehensive collection covering all known diseases with a genetic component.
CTD curated gene–disease associations are extracted from the published literature by CTD curators or are derived from the OMIM database.
PheGenI provides gene-disease associations from GWAS.
See our future paper for more details.
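The network distance used in Disease relevance analysis can be sketched with a breadth-first search over an unweighted PPI graph. The adjacency-dictionary representation, the function name and the toy network below are assumptions made for illustration; they do not describe how POPPIT stores or queries its network.

```python
from collections import deque


def disease_distance(ppi, protein, disease_genes):
    """Return (distance, path) from protein to its nearest disease gene.

    ppi maps each protein to an iterable of its interactors (undirected,
    unweighted). Returns (None, None) if no disease gene is reachable.
    """
    targets = set(disease_genes)
    if protein in targets:
        return 0, [protein]  # known gene-disease association
    parent = {protein: None}
    queue = deque([protein])
    while queue:
        node = queue.popleft()
        for neighbor in ppi.get(node, ()):
            if neighbor in parent:
                continue
            parent[neighbor] = node
            if neighbor in targets:
                # Walk back through the parents to rebuild the path.
                path = [neighbor]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                path.reverse()
                return len(path) - 1, path
            queue.append(neighbor)
    return None, None


# Toy network with made-up identifiers.
toy_ppi = {
    "P1": ["P2", "P3"],
    "P2": ["P1", "G1"],
    "P3": ["P1"],
    "G1": ["P2", "G2"],
    "G2": ["G1"],
}
print(disease_distance(toy_ppi, "P1", {"G1", "G2"}))
```

Because breadth-first search visits proteins in order of increasing distance, the first disease gene it reaches is at the minimum shortest path length.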
Disease2Target is designed mainly for target estimation of proteins relevant to a disease of interest. That is, with Disease2Target users can start from a disease of interest and perform target prediction.
Disease2Target can also be used independently of POPPIT for relevance analysis between a protein and a disease based on the PPI network.
The two functions:
1) When your query is a disease, its associated proteins together with their target estimation results will be returned.
2) When your query is a protein, its associated diseases will be returned.
Principle of Target2Disease:
The relevance between a protein and a disease is quantified by their distance in the PPI network, defined as the minimum shortest path length between the protein and the disease’s known related genes (Nat Biotechnol. 2007, 25(10):1119-26). The PPI network used here is the one we integrated. Known gene-disease associations (distance=0) were taken from three databases: OMIM (Online Mendelian Inheritance in Man), PheGenI (Phenotype-Genotype Integrator) (downloaded on 10/21/2016) and CTD (The Comparative Toxicogenomics Database) (version: Jul. 6, 2017).
Here we divide the target prediction results into three confidence grades (high/median/low). A target prediction score < 1 corresponds to the "low" confidence grade. Among the predictions with score >= 1, a protein ranked in the top 50% of all proteins with score >= 1 (genome-wide, in order of decreasing score) is considered "high" confidence, and the remaining proteins with score >= 1 are "median" confidence. The boundary between the high and median confidence grades is combined_LR = 13 for protein drugs, 6 for peptide drugs and 8 for small-molecule drugs.
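The grading rule can be written as a small lookup using the cut-offs quoted above. Whether a score exactly equal to the boundary counts as "high" is not stated, so the inclusive comparison below is an assumption, as are the function and dictionary names.

```python
# Boundaries between "high" and "median" confidence, as quoted in the text.
HIGH_CUTOFF = {"protein": 13, "peptide": 6, "small-molecule": 8}


def confidence_grade(score, drug_type):
    """Map a combined LR to a confidence grade for the given drug type."""
    if score < 1:
        return "low"
    # Assumption: a score equal to the boundary is treated as "high".
    if score >= HIGH_CUTOFF[drug_type]:
        return "high"
    return "median"


print(confidence_grade(7.0, "protein"))  # below 13, so "median"
print(confidence_grade(7.0, "peptide"))  # at or above 6, so "high"
```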
"Known" indicates whether or not the protein belongs to the golden standard positive (GSP) dataset, i.e. whether or not the protein is a known successful therapeutic target of an approved drug of a certain type. Please see Q5 for more details.
Known targets of approved protein drugs (GSP)
Known targets of approved peptide drugs (GSP)
Known targets of approved small-molecule drugs (GSP)
Our GSP for protein/peptide/small-molecule drugs is constructed by integrating therapeutic targets of approved protein/peptide/small-molecule drugs from 4 sources including Ref. [Annu Rev Pharmacol Toxicol. 2014, 54:9-26], GtoPdb, TTD and DrugBank (v2015-07-26). If the protein belongs to the GSP set of a certain type of drug, this data field gives its source information. Please see Q5 for more information.
From Ref. *
From DrugBank (v2015-07-26)
If the protein is a known successful therapeutic target of an approved drug of a certain type, this data field gives the target’s corresponding approved drugs of that type together with their indications or drug IDs provided by the corresponding source. Please see Q5 for more information.
Known targets of approved and trial protein drugs
Known targets of approved and trial peptide drugs
Known targets of approved and trial small-molecule drugs
Ref. [Annu Rev Pharmacol Toxicol. 2014, 54:9-26] provides a list of therapeutic targets of approved and trial drugs. If the protein belongs to this list, this data field gives the corresponding approved or trial drugs together with their indications.
Known targets of approved protein drugs
Known targets of approved peptide drugs
Known targets of approved small-molecule drugs
If this protein is a known therapeutic target of an approved protein/peptide/small-molecule drug collected by DrugBank (v2017-06-21), its corresponding approved drug name and DrugBank drug ID will be given.
The likelihood ratio (LR) of feature f is defined as the ratio of the probability of observing feature f in the GSP set to that in the GSN set. It measures the predictive ability of feature f.
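For a binary feature, the LR can be estimated from simple counts in the two reference sets. The sketch below uses the GSP and GSN sizes for protein drugs (132 and 200) with made-up feature counts; the function name is hypothetical, and how POPPIT handles features never seen in the GSN set is not stated.

```python
def likelihood_ratio(gsp_with_feature, gsp_total, gsn_with_feature, gsn_total):
    """Estimate LR = P(feature | GSP) / P(feature | GSN) from counts."""
    p_pos = gsp_with_feature / gsp_total
    p_neg = gsn_with_feature / gsn_total
    if p_neg == 0:
        # Without smoothing the ratio is undefined in this case.
        raise ValueError("feature never observed in the GSN set")
    return p_pos / p_neg


# Made-up counts: 66 of 132 GSP proteins and 20 of 200 GSN proteins
# carry the feature, giving an LR of about 5.
print(likelihood_ratio(66, 132, 20, 200))
```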
Combined likelihood ratio (LR), i.e. target prediction score of the protein for a certain type of drug, given by the target prediction models of that type of drug.
The whole human genome is ranked in order of decreasing prediction score for a given drug type. Rank indicates the protein’s position in this genome-wide ranking. Proteins with the same score share the same rank: if n proteins with the same score share rank i, the next protein’s rank is i+n.
Indicates whether or not the protein contains a signal peptide. A signal peptide is a short peptide present at the N-terminus of proteins that are targeted to the endoplasmic reticulum and eventually destined to be secreted, extracellular, periplasmic, etc.
Indicates whether or not the protein contains a transmembrane region. Signal peptides and transmembrane regions of human proteins were both parsed from the related annotations of SwissProt (downloaded on 02/11/2016).
PEST motif number
A PEST region is a peptide sequence enriched in proline (P), glutamic acid (E), serine (S) and threonine (T), and is invariably found in proteins with a short half-life and thus hypothesized to serve as proteolytic signals [Trends Biochem Sci. 1996, 21(7):267-71]. Here we adopted EMBOSS epestfind program (using default parameters) to count the number of PEST motifs in a protein, and only “potential” PEST motifs were included.
Intrinsically disordered proteins are those that lack a fixed or ordered 3-D structure. The disorder score of a protein is computed as the ratio of the total length of its disordered regions to its total length. FoldIndex (with default parameters) was used to predict the intrinsic disorder of a protein, and to reduce the false positive rate only disordered regions with a length of at least 30 were considered.
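Given predicted disordered regions, the disorder score reduces to a short calculation. The sketch below assumes the regions are supplied as non-overlapping, 1-based inclusive (start, end) pairs; that input format and the function name are assumptions and do not describe FoldIndex output.

```python
def disorder_score(sequence_length, disordered_regions, min_length=30):
    """Fraction of the sequence covered by disordered regions of at least min_length.

    disordered_regions: non-overlapping (start, end) pairs, 1-based inclusive.
    """
    kept = 0
    for start, end in disordered_regions:
        region_length = end - start + 1
        if region_length >= min_length:
            kept += region_length
    return kept / sequence_length


# Illustrative protein of length 400: the 21-residue region is ignored,
# leaving 40 + 60 = 100 disordered positions.
print(disorder_score(400, [(1, 40), (100, 120), (300, 359)]))
```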
List of domains the protein contains. Pfam domain assignments of human proteins were parsed from SwissProt (downloaded on 03/23/2016).
Domain number (including domain repeats)
Domain number (including domain repeats) that the protein contains.
List of GO terms the protein is annotated with. Gene-GO term association data were derived from the Gene Ontology Consortium (version: 3/16/2016).
Considering that the targets of successful protein/peptide/small-molecule drugs may tend to contain particular domains, we also predict the targets of protein/peptide/small-molecule drugs by identifying domains enriched in the corresponding GSP set. The domain enrichment ratio (DER) measures the degree of enrichment of a domain in the GSP set. DER is calculated as the ratio of the probability of observing the domain in the GSP set to that in the whole proteome. If a protein contains multiple enriched domains, the largest DER is used, and this data field gives that DER together with the corresponding domain.
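The DER calculation can be sketched as follows. The input format (protein-to-domain-set dictionaries), the function names, the toy data and the choice of DER > 1 as the criterion for an "enriched" domain are all assumptions made for illustration; the actual enrichment criterion is not stated here.

```python
def der_table(gsp, proteome):
    """Return {domain: DER} for every domain seen in the GSP set.

    gsp and proteome map protein IDs to the set of domains they contain.
    """
    def frequency(dataset):
        counts = {}
        for domains in dataset.values():
            for domain in set(domains):
                counts[domain] = counts.get(domain, 0) + 1
        return {d: c / len(dataset) for d, c in counts.items()}

    f_gsp = frequency(gsp)
    f_all = frequency(proteome)
    return {d: f_gsp[d] / f_all[d] for d in f_gsp if d in f_all}


def best_der(domains, table, min_der=1.0):
    """Return (DER, domain) for the protein's most enriched domain, or None."""
    enriched = [(table[d], d) for d in domains if table.get(d, 0) > min_der]
    return max(enriched) if enriched else None


# Toy data with made-up protein IDs.
proteome = {"P1": {"7tm_1"}, "P2": {"7tm_1", "SH2"}, "P3": {"SH2"}, "P4": {"Pkinase"}}
gsp = {"P1": {"7tm_1"}, "P2": {"7tm_1", "SH2"}}
table = der_table(gsp, proteome)
print(best_der({"7tm_1", "SH2"}, table))
```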
G protein-coupled receptor(GPCR)
Nuclear hormone receptors(NHR)
Indicates whether or not the protein is a GPCR, ion channel, …… The list of human transcriptional factors (TFs) was from Lambert et al.’s paper [Cell 172: 650-665], involving 1639 genes. Housekeeping genes are those detected in all tissues; they were obtained from RNA-sequencing data in 32 tissues from Uhlén et al.’s paper [Science. 2015, 347(6220): 1260419], involving 8874 genes. The list of signaling molecules was from the signal transduction network described below. G protein-coupled receptors (GPCRs) and kinases were from UniProt (http://www.uniprot.org/docs/7tmrlist and http://www.uniprot.org/docs/pkinfam) (Release: 2017_10 of 25-Oct-2017), ion channels and nuclear hormone receptors (NHR) were from the HUGO Gene Nomenclature Committee (HGNC) database (downloaded in Jul. 2017), and transporters were from the Human Transporter Database (HTD) (version: 2014-01-01).
If the protein is an enzyme, its EC number is given here. The list of human enzyme-coding genes was parsed from SwissProt (downloaded on 02/11/2016).
The tissue specificity score (TSPS) is adopted to measure the degree of tissue specific expression of a gene. Here we used RNA-sequencing data from 32 tissues provided by Uhlén et al. to compute the TSPS [Science. 2015, 347(6220):1260419]. Please refer to [Cell. 2010, 140(5): 744-52] for its formula.
Indicates the number of reactions the protein participates in. Human gene-reaction associations were from Recon 2 (version: 11.05.2015) [Metabolomics. 2016, 12: 109].
Degree in integrated PPI network
Interactors in integrated PPI network
Self-interacting protein in integrated PPI network
Betweenness centrality in integrated PPI network
Degree in PPI network_2015Science
Betweenness centrality in PPI network_2015Science
Degree in signal transduction network
Betweenness centrality in signal transduction network
Topological indexes of the protein in the PPI network and signal transduction network. The PPI network_2015Science is from [Science. 2015, 347(6224):1257601]. The integrated human PPI network was obtained by integrating experimental PPIs of “direct interaction” type from Database of Interacting Proteins (DIP) (version: 4/30/2016), Molecular INTeraction database (MINT) (downloaded on 7/17/2016), IntAct (version: 2016-07-06) and Biological General Repository for Interaction Datasets (BioGRID) (version: 3.4.138), containing 69357 PPIs between 12558 proteins. The human signal transduction network was provided by Cui et al. (version 6) [Mol Syst Biol. 2007, 3: 152].
Indegree in TF-target network
Indegree in the transcriptional factor (TF) –target gene interaction network, i.e. the number of TFs regulating the (target) gene/protein. The transcriptional regulation network was from [BMC Bioinformatics. 2016, 17 Suppl 5: 181].
TF list targeting the protein
Transcriptional factor list regulating the (target) protein.
Outdegree in TF-target network
Outdegree in the transcriptional factor (TF) – target gene interaction network, i.e. the number of target genes regulated by the protein (i.e. TF).
Target gene list of the protein
Target gene list regulated by the protein (i.e. TF)
Pathway number (KEGG)
Pathway list (KEGG)
Number of pathways the protein participates in, and the corresponding pathway list. Biological pathway data were from KEGG (downloaded on 7/13/2016).
SwissProt subcellular location
Subcellular location information of the protein, provided by the SwissProt database.
The data on evolutionary rates and original ages of human proteins are from our previous work (see Ref. [BMC Evol Biol. 2011, 11:133] for more details).
We used Cratio to measure gene polymorphism, computed as described by Yao et al. [Genome Res. 2008, 18(2):206-13]. See our future paper for more information.
Grand average of hydropathicity (GRAVY)
The grand average of hydropathy (GRAVY) of a protein is computed as the sum of the hydropathy values [J. Mol. Biol. 1982, 157: 105-132] of all its AAs, divided by the length of its AA sequence. The GRAVY value is calculated by the ProtParam program.
A protein whose instability index is smaller than 40 is predicted to be stable. The instability index was proposed by Guruprasad et al. [Protein Eng. 1990, 4: 155-161]. The instability index is calculated by the ProtParam program downloaded from the Comprehensive Perl Archive Network (CPAN).
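The GRAVY definition above translates directly into code. The sketch below is a standalone illustration using the Kyte-Doolittle hydropathy scale from the cited 1982 paper; it is not the ProtParam implementation, and the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Sum of hydropathy values divided by sequence length.

    Raises KeyError for non-standard residue codes and
    ZeroDivisionError for an empty sequence.
    """
    seq = sequence.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


# Positive values indicate a more hydrophobic sequence overall.
print(gravy("MKTAYIAKQR"))
```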
The percentage of aromatic/basic amino acids (AAs) in the protein’s sequence.
“Distance” is the distance between a protein and a disease in the PPI network, which is the smallest one among the shortest path lengths between the protein and the disease’s known related genes in the PPI network, and “Path” is the corresponding path of “Distance”.
Disease related genes
Known disease-related genes from OMIM/CTD/PheGenI.