Conceived and designed the experiments: DSL OE ST. Performed the experiments: DSL OE. Analyzed the data: DSL OE ST. Wrote the paper: DSL OE ST.
Current address: HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Medical College of Cornell University, New York, New York, United States of America
The increasing ability to generate large-scale, quantitative proteomic data has brought with it the challenge of analyzing such data to discover the sequence elements that underlie systems-level protein behavior. Here we show that short, linear protein motifs can be efficiently recovered from proteome-scale datasets such as sub-cellular localization, molecular function, half-life, and protein abundance data using an information theoretic approach. Using this approach, we have identified many known protein motifs, such as phosphorylation sites and localization signals, and discovered a large number of candidate elements. We estimate that ∼80% of these are novel predictions in that they do not match a known motif in both sequence and biological context, suggesting that post-translational regulation of protein behavior is still largely unexplored. These predicted motifs, many of which display preferential association with specific biological pathways and non-random positioning in the linear protein sequence, provide focused hypotheses for experimental validation.
Short amino acid sequences, typically 3 to 10 amino acids in length, play important functional roles in determining protein behavior
Computational approaches have been developed to discover protein motifs and have led to fundamental observations related to the sequence determinants of protein behavior
Here we describe a new
Similar to ongoing experimental and computational efforts to decode the regulatory genome
FIRE-pro seeks to identify protein motifs whose pattern of presence and absence across all amino acid sequences is highly informative about the behavior profile for the corresponding proteins. The algorithm takes as input a user-specified protein behavior profile listing a quantitative measurement or discrete attribute of every protein (e.g., half-life or nuclear localization). Presented is a schematic example using discrete localization data. Here, knowing whether the motif is present or absent in the amino acid sequence provides significant information regarding the behavior of the protein. For each candidate motif (e.g., “KRK”), FIRE-pro calculates the correlation between the motif profile and the protein behavior profile using mutual information. Motifs that maximize the mutual information are ultimately selected for further characterization.
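As an illustration of the scoring step described above, the following minimal sketch computes the mutual information between a binary motif profile and a discrete behavior profile. It is not the FIRE-pro implementation: the function name `mutual_information`, the use of bits as the unit, and the toy profiles are assumptions made for this example.

```python
import math
from collections import Counter

def mutual_information(motif_profile, behavior_profile):
    """Mutual information (in bits) between two equal-length discrete profiles."""
    if len(motif_profile) != len(behavior_profile):
        raise ValueError("profiles must have the same length")
    n = len(motif_profile)
    joint = Counter(zip(motif_profile, behavior_profile))  # joint counts
    count_motif = Counter(motif_profile)                   # marginal counts
    count_behavior = Counter(behavior_profile)
    mi = 0.0
    for (m, b), c in joint.items():
        # p(m,b) * log2( p(m,b) / (p(m) * p(b)) ), written in terms of counts
        mi += (c / n) * math.log2(c * n / (count_motif[m] * count_behavior[b]))
    return mi

# Toy data, invented for illustration: one entry per protein.
# motif_present: 1 if the candidate motif occurs in the sequence, else 0.
# nuclear:       1 if the protein is annotated as nuclear, else 0.
motif_present = [1, 1, 1, 0, 0, 0, 0, 0]
nuclear       = [1, 1, 1, 0, 0, 0, 0, 1]
mi_bits = mutual_information(motif_present, nuclear)
```

Because the sum runs over all observed (motif, behavior) pairs, the same function accepts behavior profiles with more than two classes, and it rewards depletion of a motif in a class as well as enrichment.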
Motifs are defined as fixed-length patterns using a degenerate code of amino acids. For example, a motif may be defined as “L.[RK]”; in this motif, only “L” is allowed at the first motif position, any amino acid is allowed at the second position (“.”, equivalent to “x” in some representations), and either “R” or “K” is allowed at the third position. Given a motif, the
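The degenerate motif notation used here coincides with standard regular-expression syntax, so a motif profile can be sketched with Python's `re` module. The helper names `motif_profile` and `motif_positions` and the example sequences are hypothetical and serve only to illustrate the notation; they are not taken from FIRE-pro.

```python
import re

def motif_profile(motif, sequences):
    """Return 1 for each sequence containing the motif at least once, else 0."""
    # "." (any amino acid) and "[RK]" (R or K) already carry the intended
    # meaning in regular-expression syntax, so the motif is used as-is.
    pattern = re.compile(motif)
    return [1 if pattern.search(seq) else 0 for seq in sequences]

def motif_positions(motif, sequence):
    """Return start positions of all motif instances, including overlapping ones."""
    # A zero-width lookahead lets the scan report overlapping instances.
    lookahead = re.compile(f"(?=(?:{motif}))")
    return [m.start() for m in lookahead.finditer(sequence)]

# Invented sequences, for illustration only.
sequences = ["MALKRT", "MLAGQ", "GGLSKA", "MKRL"]
profile = motif_profile("L.[RK]", sequences)   # presence/absence per sequence
positions = motif_positions("L.[RK]", "LLKR")  # overlapping instances
```

In this toy input, the first and third sequences contain an instance ("LKR" and "LSK"), whereas the second and fourth do not.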
Informative motifs are discovered via a
To aid in the interpretation of motif predictions, our framework also includes post-processing steps intended to assess statistical significance, minimize false positives, and determine the biological significance and functional roles of the predicted motifs. Motif significance is calculated through non-parametric randomization tests in which the protein behavior profile is shuffled and the mutual information is calculated between this shuffled profile and the motif profile. This procedure is repeated 10,000 times by default and a motif is deemed significant if the mutual information between its motif profile and the unshuffled protein behavior profile is greater than all 10,000 randomized information values. Biological significance of motifs is explored by analyzing the set of proteins containing the motif and the positions of predicted motif instances to determine GO enrichment, overlap with protein domains, and possible motif-motif interactions. A detailed description of the FIRE-pro framework can be found in
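The randomization test described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the FIRE-pro code: the function names, the fixed random seed, and the toy profiles are hypothetical, and only the shuffling scheme, the default of 10,000 repetitions, and the "greater than all randomized values" criterion come from the text.

```python
import math
import random
from collections import Counter

def mutual_information(x, y):
    """Mutual information (in bits) between two equal-length discrete profiles."""
    n = len(x)
    joint, cx, cy = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * math.log2(c * n / (cx[a] * cy[b]))
               for (a, b), c in joint.items())

def randomization_test(motif_profile, behavior_profile, n_shuffles=10_000, seed=0):
    """Return (observed MI, significant?) using a shuffled-profile null.

    The motif is called significant only if the observed mutual information
    is strictly greater than every value obtained from a shuffled profile.
    """
    rng = random.Random(seed)            # seed is an assumption, for repeatability
    observed = mutual_information(motif_profile, behavior_profile)
    shuffled = list(behavior_profile)    # copy so the caller's data is untouched
    n_at_least_observed = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)            # break the motif/behavior association
        if mutual_information(motif_profile, shuffled) >= observed:
            n_at_least_observed += 1
    return observed, n_at_least_observed == 0

# Toy profiles, invented for illustration: 40 proteins, motif tracks behavior.
motif = [1] * 20 + [0] * 20
behavior = [1] * 20 + [0] * 20
observed_mi, is_significant = randomization_test(motif, behavior, n_shuffles=1000)
```

Shuffling the behavior profile preserves both marginal distributions, so the null values reflect only the loss of association between motif and behavior.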
We used FIRE-pro to discover motifs involved in a broad range of biological processes and functions. To this end, we compiled and analyzed >600 experimental protein datasets from
Our analyses revealed a total of ∼6,900 protein motifs with an average of 11 motifs per protein dataset (the full catalogue of motifs can be found in
| Dataset | Motif | | Known motif | Known motif description | | Domain (p-value) | Domain overlap score | GO enrichment (p-value) |
|---|---|---|---|---|---|---|---|---|
| CLB2: B-type cyclin | SP.[RK] | 312 | SP.[RK] | CDK kinase substrate | Y | Pkinase (1e-04) | −3.5 | cell cycle (1e-16) |
| PTK2: Putative S/T kinase | RR.[SHP] | 122 | RR.S | PKA kinase substrate | - | | | phosphotransferase activity (0.01) |
| GO: nuclear part | [KRN]KR[KSR] | 99 | K[KR].[KR] | Nuclear localization | | Bromodomain (0.001) | −1.1 | nuclear lumen (1e-91) |
| TPK1: cAMP-dependent kinase | R[RK].S | 96 | R[KER].S | PKA kinase substrate | | | | |
| LSB3: C-terminal SH3 domain | [PQ]P..P[PTM]R | 92 | P..P | SH3 general ligand | | | | actin cytoskeleton biogenesis (1e-05) |
| GO: membrane | L[LAF]G | 89 | LLG | Beta2-Integrin binding | | Mito_carr (1e-06) | 0.3 | intrinsic to membrane (1e-67) |
| GO: transcription | N[NTP]N[NAP] | 77 | NNNN | Poly-asparagine | Y | Zn_clus (0.001) | −0.7 | transcription (1e-10) |
| RSP5: Ubiquitin-protein ligase | PP.Y | 76 | PP.Y | LIG_WW_1 | | | | |
| CLB2: B-type cyclin | L..SP | 74 | SP | ERK1,2 Kinase substrate | | Pkinase (0.001) | −1.4 | bud neck (1e-06) |
| RIM11: kinase | [GSQ]S..[ANV]SP | 72 | [ST]...[ST]P | RIM11 Kinase substrate | | | | |
| GO: transcription | Q[QNH]Q | 68 | QQQ | Poly-glutamine | | zf-C2H2 (1e-11) | −0.9 | transcription (1e-14) |
| GO: membrane-enclosed lumen | K[KRE][REH]K | 67 | KR | CLV_PCSK_PC1ET2_1 | Y | | | nuclear lumen (1e-10) |
| GO: nucleus | LK | 67 | F.F.LK...K.R | Phosphatidylserine binding | | WD40 (1e-07) | −0.4 | nuclear lumen (1e-19) |
| GO: cellular morphogenesis | [STL]S..[SAD]S | 66 | S..[ST] | Casein kinase I phos. site | | Pkinase (0.01) | −4.6 | cellular morphogenesis (1e-15) |
| Localization: actin | PPP.[PHY] | 63 | PPP | Polyproline | Y | SH3_1 (1e-04) | −0.7 | actin cortical patch (1e-14) |
| GO: cell cycle | [SYI]S...S | 54 | S...S | WD40 binding | | Pkinase (1e-04) | −4.8 | cell cycle (10) |
| PPH22: phosphatase subunit | SP.[GD]R[LYN] | 52 | SP | ERK1,2 Kinase substrate | | Proteasome (1e-08) | −3.7 | proteasome core complex (1e-10) |
| CDC15: MEN kinase | S..[PWH]S | 30 | S...S | WD40 binding | | Pkinase (1e-18) | −2 | protein kinase activity (1e-14) |
| SMT3: SUMO family protein | A[DVA]A | 66 | [LV]IA[DE][PA] | Caveolin pattern | | | | carboxylic acid metabolism (1e-07) |
| YCK1: membrane casein kinase | S.[SEV]D | 65 | HSTSDD | BCKDC kinase | | | | |
| Plasmodium expression cluster | K..Y[ISH] | 47 | Y[LI] | SH2 ligand for PLCgamma1 | Y | Rifin_STEVOR (0.01) | −5.3 | |
| PRE2: 20S proteasome subunit | VEYA | 46 | VIYAAPF | Abl kinase substrate | Y | Proteasome (1e-09) | −3.8 | proteasome core complex (1e-11) |
| PPH22: phosphatase subunit | [TIV][FH]SP | 36 | SP | ERK substrate | Y | Proteasome (1e-12) | −4.5 | proteasome core complex (1e-16) |
| PPH22: phosphatase subunit | EY.[LS]E[AS] | 36 | [DE]Y | EGFR kinase substrate | Y | Proteasome (1e-10) | −4.1 | proteasome core complex (1e-09) |
| HTZ1: Histone | [GVH]G[KYQ]G | 32 | GGQ | N-methylation in E. coli | Y | Histone (1e-05) | −2.5 | nuclear chromatin (1e-06) |
| PAB1: Poly(A) binding | G.[PRT]G | 31 | IQ.RG.RG | Binding on Calmodulin | | RRM_1 (0.001) | −4.1 | RNA metabolism (1e-09) |
| Localization: periphery (S. pombe) | T..[PSL]N | 30 | T..[SA] | FHA of KAPP binding | | Pkinase (1e-04) | −2 | barrier septum (1e-54) |
| Plasmodium expression cluster | R.[GSA]R | 29 | [AG]R | Protease matriptase site | | DEAD (1e-13) | −2.9 | ATP-dependent helicase activity (1e-12) |
| ARC1: tRNA binding | S[DQP]S | 28 | R.S.S.P | 14-3-3 bindings | | Pkinase (1e-14) | −3.9 | protein kinase activity (1e-13) |
| HHT1: histone | KP..[KFV][KHA] | 28 | KP..[QK] | LIG_SH3_4 | | Histone (0.01) | −2.8 | chromatin architecture (1e-07) |
| PPI clusters | SP[STN] | 24 | SP | ERK substrate | | | | interphase (1e-06) |
| Localization clusters (Huh, 2003) | P..[PSE]P | 21 | P.[ST]PP | ERK substrate | Y | PX (1e-05) | −0.3 | cell cortex part (1e-24) |
| Localization multiclass (Huh, 2003) | T..[SFL]T | 11 | T..[SA] | FHA of KAPP binding | Y | | | nuclear pore (1e-29) |
| Localization clusters (Huh, 2003) | TG.G[KLW][TFY] | 11 | TGY | ERK6/SAPK3 activation sites | | Helicase_C (1e-10) | −1.1 | RNA helicase activity (1e-11) |
| GO: nuclear part | DE[EDK][ED] | 131 | | | Y | | | nuclear lumen (1e-09) |
| Ubiquitin-conjugates (Peng, 2003) | L..[LDS]A | 125 | | | Y | IBN_N (1e-05) | −0.4 | Golgi apparatus (1e-08) |
| GO: membrane | I[FIW]..V | 70 | | | | Adaptin_N (0.001) | 0.6 | transporter activity (1e-40) |
| GO: ribosome biogenesis | E[EDK]..E[EKD] | 67 | | | | WD40 (0.01) | −2.3 | cytoplasm organization (1e-12) |
| YAP1: Basic leucine zipper | QQ..M[QIV][QTA] | 66 | | | | | | RNA polymerase II TF activity (1e-06) |
| NOP2: RNA methyltransferase | R[GST].[DQF]IP | 56 | | | Y | DEAD (1e-05) | −1.1 | ribosome biogenesis (1e-08) |
| GO: DNA-dependent transcription | N.D[DST] | 52 | | | | zf-C2H2 (1e-06) | −1.5 | transcription, DNA-dependent (1e-23) |
| GO: transcription | N.D[DST] | 52 | | | | zf-C2H2 (1e-06) | −1.5 | transcription, DNA-dependent (1e-23) |
| SMT3: SUMO family protein | V.[DKG]A | 47 | | | Y | | | carboxylic acid metabolism (1e-04) |
| POB3: Nucleosome maintenance | [GH]S..KA[SI] | 33 | | | | Histone (0.01) | −1.6 | chromatin architecture (0.001) |
| UBP15: Ubiquitin-specific protease | A.[TSL]S | 28 | | | | Pkinase (0.001) | −2.1 | protein kinase activity (0.001) |
| PRE2: 20S proteasome subunit | Q[VID]E | 26 | | | | Proteasome (1e-08) | −4.8 | proteasome complex (1e-19) |
| Half-life (Belle, 2006) | R.[RSY]S | 25 | | | | | | reg. of cellular physiological process (1e-04) |
| PPI clusters | GGL[FTL][GEP] | 13 | | | | | | snRNP protein import into nucleus (1e-07) |
Known: matches previously identified; Semi-novel: matches sequence but has distinct biological context; Novel: no match.
Select (a) known, (b) semi-novel, and (c) novel motifs discovered by FIRE-pro. Known motifs match previously identified motifs in the literature in both sequence and biological context. Semi-novel motifs match previously identified motifs in sequence but not in biological context. Novel motifs do not match any previously identified motif. Motifs presented here were selected based on a combination of criteria including high mutual information and
Consistent with the central role of phosphorylation in protein signaling networks
FIRE-pro discovered over six motifs that are highly informative of interaction with Cdc28 (
(A) P-value heatmap of motifs found in Cdc28-interacting proteins. Columns correspond to classes of proteins and rows correspond to predicted motifs. The yellow color-map indicates over-representation of a motif in a given class; significant over-representation (p<0.05 after Bonferroni correction) is highlighted using red frames. Similarly, the blue color-map and blue frames indicate under-representation. For each motif, we indicate 1) position-weight matrix (PWM) representation, 2) mutual information (MI) value, 3)
A global analysis of fifty-seven kinase interaction datasets in
Altogether, these results indicate that FIRE-pro efficiently re-discovers known functional sites that mediate post-translational regulation even among noisy, proteome-scale data sets, but also produces many candidate novel protein regulatory elements that may have important roles in regulating protein behavior.
Automated comparison
We hypothesized that, in some cases, conserved elements of protein domains may lead to the detection of informative motifs, referred to here as domain signatures. This situation may occur when similarly behaving proteins contain the same protein domain. We devised a strategy to assign p-values and domain overlap scores to indicate the extent to which a motif co-occurs and overlaps with a known protein domain more than would be expected by chance (see
Across all analyzed profiles, we found that 2,596 motifs (37%) co-occur with a domain. Of those motifs, 1,531 (22% of all motifs) have positive domain overlap scores and can be considered domain signatures (
Across all profiles, 15% of discovered motifs are associated with a domain but lie near the domain rather than in the domain itself. This includes the cyclin-dependent kinase substrate “SP.[RK]” and the motif “V..[TSP]P”, which are consistently located near kinase domains. For example, in 85% (11/13) of Cdc28-interacting proteins containing a protein kinase domain and the “SP.[RK]” motif, the motif lies nearby rather than within the domain. This type of motif may impart a functional site to proteins with a common domain and may regulate domain function and specificity, perhaps by mediating interactions with other proteins.
Many motifs, especially localization signals, tend to be positioned at the N- or C-termini of proteins
FIRE-pro also determines motif pairs that co-occur within the same proteins and co-localize in the primary protein sequences (
Overall, our protein motif dataset contains ∼1,500 interacting motif pairs involving ∼2,000 individual motifs, indicating that ∼25% of the motifs are involved in motif-motif interactions. While some interactions represent neighboring domain signatures, others may mediate co-regulation of protein binding or post-translational modification. One example of potential co-regulation is a cluster of three co-localizing motifs found in proteins that interact with the ribosomal subunit protein Rps17b. Of the 63 proteins that contain the motifs “AR..[AR]”, “K.[RAK]A”, or “G[KMI]K[VAG]”, over half contain at least two of the three motifs. This observation suggests that interaction with Rps17b is mediated by several, possibly redundant, protein motifs; alternatively, it may indicate that additional proteins cooperate with Rps17b to regulate its targets.
Many protein motif analyses involve comparing two classes of proteins, e.g., Cdc28-interacting proteins vs. proteins that do not interact with Cdc28. However, many protein behaviors involve more than two protein groups. This is the case for protein localization, where proteins can be localized to many distinct compartments, e.g., nucleus, ER, Golgi, cytoplasm, membrane, and mitochondria. While each of these behaviors can in principle be analyzed independently, analyzing them simultaneously can be useful because the same protein motif can be associated with multiple protein groups; weak but consistent enrichment across multiple groups then yields higher and more significant mutual information. Due to its use of mutual information, the FIRE-pro framework can naturally process multiple protein groups simultaneously. Moreover, combined heatmaps that show motif over- or under-representation across all groups make protein motif function easier to interpret. As part of our global analysis, we applied FIRE-pro to a sub-cellular localization dataset obtained from ∼4,000 GFP-tagged proteins in
Application of FIRE-pro to the resulting multi-class localization partition revealed sixteen motifs (
The data
An advantage of FIRE-pro over existing methods is its ability to discover motifs associated with quantitative protein measurements. This feature stems from the capacity of FIRE-pro to find motifs that are informative of multiple protein groups, as quantitative protein measurements are first discretized (i.e., split into bins containing similar measurement values) prior to estimating the mutual information. The discretization process used in FIRE-pro is the same as the one used in FIRE
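The discretization step can be illustrated with a short sketch. The text states only that measurements are split into bins containing similar values; the equal-frequency (rank-based) binning used below, the function name `discretize_equal_frequency`, and the toy half-life values are assumptions made for this example and may differ from the scheme used in FIRE and FIRE-pro.

```python
def discretize_equal_frequency(values, n_bins):
    """Map each measurement to a bin index in 0..n_bins-1.

    Proteins are ranked by their measurement and the ranks are cut into
    n_bins groups of (nearly) equal size, so each bin holds proteins with
    similar values. Tied values may fall into adjacent bins.
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices by increasing value
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // n                   # rank -> bin index
    return bins

# Invented half-life values (arbitrary units), one per protein.
half_life = [5.0, 1.0, 3.0, 9.0, 7.0, 2.0]
half_life_bins = discretize_equal_frequency(half_life, n_bins=3)
```

The resulting bin indices form a discrete, multi-class behavior profile, which is the form of input that a mutual-information calculation over protein groups requires.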
Half-life data for ∼3,750 yeast proteins
It has previously been shown that functional instances of protein motifs tend to lie in intrinsically disordered regions of protein sequence
To benchmark FIRE-pro's performance, we carried out a comparative study of existing methods (Motif-x
A critical feature of our approach is that it returns very few motifs when given randomized input. To illustrate this, we randomly shuffled the protein behavior profiles of the five datasets mentioned above and applied FIRE-pro to the shuffled data with the same parameters as the original run. The number of motifs found per randomized dataset ranged from 0–3 with an average of 1.2 motifs as compared to a range of 3–17 and an average of 9 motifs per real dataset (
As the amount of available proteomic and genomic data expands, biologists increasingly rely on computational methods to extract key features and general principles from the data. We have therefore designed an approach to protein motif discovery that is capable of harnessing the information found in large-scale datasets, such as protein abundance, gene expression, localization, and post-translational modification. FIRE-pro facilitates the discovery of short sequence motifs informative of the global behavior of proteins. The use of mutual information provides a universal framework that can be applied to any type of biological data, be it discrete or quantitative, and the algorithm can be applied to proteome-scale data from any organism, including humans. The algorithm can simultaneously discover over- and under-represented motifs and has no requirement for an explicit background model. Given the increasingly quantitative nature of proteomics experiments, we believe that FIRE-pro is a valuable tool capable of revealing sequence elements that determine diverse protein behavior.
The systematic application of our approach to a set of ∼650 proteomic datasets revealed several novel insights into post-translational regulatory networks. We discovered that many of the strongest motifs tend to be under-represented in specific groups of proteins just as they are over-represented in coherent groups of proteins in which the motif is thought to play a functional role. Context-dependent avoidance of specific motifs may represent a crucial constraint for the evolution of protein sequences and be an important parameter in successful design of custom proteins. It was also intriguing to discover a number of potential phosphorylation motifs informative of protein-protein interactions. While these motifs need to be further characterized and experimentally tested, the abundance of known and putative phosphorylation sites is not surprising as eukaryotic genomes contain hundreds of kinases that exert a profound influence on cellular activity. However, relatively little is known about substrate specificity for most of these kinases, and we anticipate that our framework and results will shed light on the structure of phosphorylation networks. Our study also underscores the fact that functional motifs tend to have a variety of non-random features including gene functional enrichment, position biases in linear sequence, relationships with protein domains, co-occurrence with other motifs, and associations with regions of protein disorder. The comprehensive understanding of these features is important because it provides information regarding some of the mechanisms underlying post-translational regulation. A natural extension of our work is the systematic integration of these distinct features, e.g., using probabilistic weighting, in order to enable the recognition of functional protein motif instances and to facilitate the prediction of post-translational regulation directly from primary protein sequences.
In summary, FIRE-pro is an approach to protein motif-finding suited for the proteomic era. Rather than finding motifs over-represented in a set of proteins relative to a background set, the algorithm seeks to discover motifs informative of measurements or behaviors associated with each protein. In addition to presenting an approach for motif-finding in large-scale data, we have presented a number of examples of known and novel predictions of protein motifs uncovered by FIRE-pro. In the future, such information could form the basis for a library of protein motifs to be used in synthetic biology, i.e., to engineer protein behaviors such as half-life, localization, and interaction partners. It is our hope that computational tools such as FIRE-pro will help advance the body of biological and biomedical knowledge and perhaps yield new organizing principles about post-translational regulation of protein function. To this end, the source code, datasets, and results are freely available at
Supplementary methods and text
(0.24 MB PDF)
Mitochondrial localization motifs. (A) P-value heatmap of motifs enriched in mitochondrial-localized proteins. Columns correspond to classes of proteins and rows correspond to predicted motifs. The yellow/blue color-map indicates over/under-representation of a motif in a given group. (B) Position bias of a mitochondrial motif corresponding to the "RxxS" consensus sequence for the N-terminal mitochondrial signal peptide cleavage site. A histogram of normalized motif positions in mitochondrial proteins ("Enriched") reveals that the motif is highly enriched in the N-terminus relative to non-mitochondrial proteins ("Other").
(0.08 MB PDF)
Protein localization profile. (A) Sub-cellular localization data from
(0.10 MB PDF)
Enrichment analysis table for motifs associated with sub-cellular localization (see
(0.09 MB PDF)
GO analysis of protein half-life and enrichment analysis of half-life motifs. (A) iPAGE analysis of quantitative half-life data in
(0.10 MB PDF)
GO analysis of quantitative protein abundance data in
(0.06 MB PDF)
Analysis of quantitative protein abundance data in
(0.08 MB PDF)
GO analysis of clustered PPI network profile. iPAGE analysis of clustered protein-protein interaction data from BioGRID. Clusters include groups enriched for biological processes such as RNA processing and protein sumoylation, cellular components such as nucleoplasm and proteasomal complex, and molecular functions such as transferase activity. The columns represent protein clusters and correspond to the columns in
(0.06 MB PDF)
Multiclass analysis of clustered protein interaction network. (A) Protein-protein interaction data from BioGRID was clustered using the MCL graph-clustering algorithm and cluster indices were used as input to FIRE-pro. Ten motifs are found in the protein interaction clusters compared to zero motifs found in a control analysis of genetic-interaction cluster data. A log p-value matrix shows a number of known and unknown motifs involved in various modules of the network. (B) Enrichment analysis of all proteins containing each motif.
(0.09 MB PDF)
Protein disorder analysis. Distributions of disorder scores for all 3-mers, 4-mers, and FIRE-pro motifs. Disordered regions of the
(0.10 MB PDF)
Phosphorylation motifs found amongst kinase and phosphatase interactors
(0.05 MB PDF)
Domain signature motifs
(0.07 MB PDF)
Categorization into known, semi-novel, and novel motifs
(0.06 MB PDF)
Motif-discovery algorithms used in comparison
(0.05 MB PDF)
Summary of data sets used in algorithm comparison
(0.04 MB PDF)
Results of algorithm comparison
(0.06 MB PDF)
A catalogue of all motifs discovered by FIRE-pro
(1.65 MB XLS)
We thank Hani Goodarzi, Alison Hottes and other members of the Tavazoie lab for their helpful comments. We are also grateful to Bambi Tsui for making FIRE-pro available through a web interface.