Deciphering the diverse interactions between proteins is a major step toward understanding the molecular mechanisms of the cell. Many of these interactions are mediated by the binding of protein domains to short peptide sequences called linear motifs. Data on the sequence specificities of peptide-binding domains are sparse, and wet-lab biologists usually resort to basic pattern searches to identify new putative binding sites for experimental follow-up. Most motifs, however, have poor specificity and thus occur in numerous proteins by chance alone. Subsequent filtering or prioritization of the matches is therefore crucial when scanning a full proteome with a pattern to identify new binding sites. The phosphorylation predictor NetworKIN showed that motif-based predictions can be improved dramatically by adding contextual data in the form of a protein-protein association network.
DoReMi (Domain-aided Regular Expression Mining) allows users to search the complete human proteome with a user-specified sequence motif and to prioritize the resulting matches based on their network context in the STRING database. First, the user provides either a regular expression describing the motif or a set of aligned binding sites from which a position-specific scoring matrix (PSSM) is calculated. Second, the user specifies the domain or domains from the Pfam database with which the motif is believed to interact. DoReMi then identifies all matches to the motif in the human proteome and scores each match based on 1) how well it fits the motif and 2) how strongly the context network associates the matched protein with a protein containing one of the specified domains.
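The scan-and-prioritize idea can be illustrated with a minimal Python sketch. It is not DoReMi's implementation: the function name `scan_and_rank`, the toy sequences, the protein identifiers, and the association scores are all hypothetical, and the network is assumed to be a simple lookup from unordered protein pairs to a score between 0 and 1. The sketch covers only the regular-expression case, where every match fits the motif equally well, so the matches are ranked by network context alone; how a PSSM fit score would be combined with the context score is left open here.

```python
import re
from typing import Dict, FrozenSet, List, Set, Tuple

Hit = Tuple[str, int, str, float]  # (protein, start, matched text, context score)


def scan_and_rank(
    pattern: str,
    proteome: Dict[str, str],
    domain_proteins: Set[str],
    network: Dict[FrozenSet[str], float],
) -> List[Hit]:
    """Find all motif matches and rank them by network context (illustrative only)."""
    # Wrapping the pattern in a lookahead also reports overlapping matches.
    motif = re.compile(f"(?=({pattern}))")
    hits: List[Hit] = []
    for protein, sequence in proteome.items():
        # Context score (assumption): the strongest association between the
        # matched protein and any other protein carrying a specified domain.
        context = max(
            (
                network.get(frozenset((protein, partner)), 0.0)
                for partner in domain_proteins
                if partner != protein
            ),
            default=0.0,
        )
        for match in motif.finditer(sequence):
            hits.append((protein, match.start(), match.group(1), context))
    # Highest context score first; ties broken by protein name and position.
    hits.sort(key=lambda hit: (-hit[3], hit[0], hit[1]))
    return hits


if __name__ == "__main__":
    # Toy data, entirely made up for illustration.
    toy_proteome = {"A": "MKPAAPLLPSSP", "B": "GGPTTPGG", "C": "MMMM"}
    toy_domain_proteins = {"SH3A"}
    toy_network = {
        frozenset(("A", "SH3A")): 0.4,
        frozenset(("B", "SH3A")): 0.9,
    }
    for hit in scan_and_rank("P..P", toy_proteome, toy_domain_proteins, toy_network):
        print(hit)
```

In this toy example the single match in protein B ranks above the three overlapping matches in protein A, because B is more strongly associated with the domain-containing protein, even though all four matches fit the pattern equally well.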