Background
Phage is short for bacteriophage, a virus that infects bacteria. Many phages, such as M13 and fd, are good expression vectors. In 1985, George P. Smith displayed foreign peptides on the virion surface by inserting foreign DNA fragments into gene III of a filamentous phage [1]. He also demonstrated that foreign peptides in fusion proteins on the virion surface were in an immunologically accessible form, and that specific fusion phages could be enriched and isolated from a phage library of random inserts in a fusion-phage vector by one or more rounds of affinity selection. This technology was later termed phage display or biopanning. Since that pioneering work, phage display technology has been developed, refined and improved by many scientists from various fields, and its applications have extended from epitope mapping [2], antibody engineering [3] and organ targeting [4] to new material and new energy studies [5, 6].
Usually, the substance used to screen a phage library is termed the target, and the natural partner of the target is called the template. A peptide that mimics the binding site on the template and is capable of binding to the target is defined as a mimotope, a term first introduced by Mario Geysen et al [7]. As mimics of binding sites, mimotopes have been widely used in mapping epitopes [8], identifying drug targets [9] and inferring protein interaction networks [10, 11]. Mimotopes have also shown potential in the development of new diagnostics [12], therapeutics [13] and vaccines [14].
Biopanning results, however, contain not only mimotopes but also all kinds of target-unrelated peptides (TUPs) [15]. Phage display data are therefore usually a mixture of mimotopes (desired signal) and target-unrelated peptides (unwanted noise). Target-unrelated peptides can be divided into two categories [16]. The first category is the selection-related TUP (SrTUP). Although unable to bind to the target site, SrTUPs react with contaminants or other components of the screening system and thereby enter the biopanning results. The second category is the propagation-related TUP (PrTUP). PrTUPs appear in the output of biopanning because the phages bearing them have a higher infection rate or a faster secretion rate [17, 18]. Phages with a growth advantage not only add noise but also decrease the library diversity and lead to a loss of useful mimotopes. Simulations and experiments have shown that subtle differences in growth rate yield drastic differences in clone abundance after rounds of amplification [19]. Thus, propagation-related TUPs may even dominate the biopanning results. As TUPs are unrelated to the target, they are not proper candidates for developing diagnostics, therapeutics and vaccines. They also interfere with the prediction of protein interaction sites if they are mistaken for mimotopes. Unfortunately, such mistakes are made from time to time [20]. Changing experimental conditions and improving experimental methods can decrease TUPs. For example, increasing the stringency of panning may reduce TUPs; subtractive procedures may decrease selection-related TUPs; and amplification in isolated compartments can mitigate the growth advantage of propagation-related TUPs [21]. However, TUPs cannot be completely eliminated by experimental means, because they arise from the screening and amplification steps of the experiment itself. Therefore, excluding TUPs from biopanning results with computational tools has become an alternative and more convenient choice.
SAROTUP History
In 2010, we developed the first version of the program SAROTUP [20]. The name of this online tool is an acronym for "Scanner And Reporter Of Target-Unrelated Peptides". Version 1.0 was based on only the 23 TUP motifs known at the time: 12 motifs specific for the capturing agents, five motifs specific for the constant region of antibodies, three motifs specific for the screening solid phase, two motifs specific for contaminants in the target sample, and one confirmed PrTUP. Since then, SAROTUP has been developed into a suite of tools for detecting TUPs and for other purposes. Besides the TUP motif-based tool, three tools based on the BDB database [22, 23, 24] were integrated into SAROTUP version 2.0. In the current version, version 3.0, two predictors based on machine learning methods have been added. SAROTUP now has six tools in total: TUPScan, MimoScan, MimoSearch, MimoBlast, PhD7Faster and SABinder. The last two tools are collected into a directory page called TUPredict. Additionally, the tools in the SAROTUP suite have been redeveloped in C++. Accordingly, SAROTUP version 3.0 is also distributed as open source software with a graphical user interface (GUI), which runs natively on Windows, Linux and Mac OS X. The GUI version is used in the same way as the web service. In the following sections, we give additional help information that is not embedded or addressed on the web interface of each tool.
TUPScan is capable of finding peptides with known TUP motifs in your query sequences. In version 3.0, the TUP motifs or confirmed TUPs embedded in the TUPScan script are updated to 850 items, including one biotin-binding motif, one albumin-binding motif, one Protein A-binding motif, one lipid A-binding motif, three plastic-binding motifs, four metal ion-binding motifs, five immunoglobulin Fc region-binding motifs, five streptavidin-binding motifs, six unrelated antibody-binding motifs, 25 peptides with various targets, 776 propagation-related TUPs, and 22 selection-related TUPs. These documented peptides and TUP motifs are compiled from published reviews [15, 16], research articles [17, 18, 25, 26, 27], and database articles [22, 23, 24].
TUPredict is a directory page that collects the predictors for target-unrelated peptides developed by our lab using machine learning methods. For each predictor, different types of features were used to construct a number of predictive models, and only the model with the best performance was developed into the corresponding web service. Due to the limited data available, only three predictors, i.e. PhD7Faster 2.0, PSBinder and SABinder, are currently available.
The PhD7Faster 2.0 tool predicts whether phages from the Ph.D.-7 library bearing the input peptides might grow faster. This tool was developed with a Support Vector Machine (SVM) using pseudo amino acid composition (PseAAC) and tripeptide composition as features. PhD7Faster 2.0 reports the predicted probability for each input peptide.
The training datasets were generated from a research article published by Derda et al [28]. They sequenced millions of phages from the naïve and amplified Ph.D.-7 phage display libraries after a single round of bacterial amplification. By comparing the abundance of each peptide before and after amplification using the Bioconductor package edgeR, they identified 770 peptides with a significantly higher growth rate, which were collected into the positive training dataset. The negative dataset was composed of those peptides with a copy number of one in the amplified Ph.D.-7 phage display library. Sequences in the negative dataset that also appeared in the positive dataset were eliminated. Both the positive and negative datasets were then processed as follows: (i) peptide sequences containing ambiguous residues (such as "X", "B" and "Z") were excluded; (ii) sequences within a Hamming distance of 2 (h = 2) of another sequence were removed, where the Hamming distance between two strings of equal length is the minimum number of substitutions required to change one string into the other. Finally, 749 peptides were retained in the positive dataset. We randomly selected 749 peptides from the remaining negative dataset. Therefore, the benchmark dataset was composed of 749 fast-propagating peptides and 749 regular-growing peptides.
Two types of features, PseAAC and tripeptide composition, were used to represent each peptide sequence. The fselect.py program in the LIBSVM 3.23 software was applied to evaluate the significance of each feature to the classification system. As a result, every feature has an F-score, and a greater F-score implies greater significance of the corresponding feature. We ranked all features by F-score in descending order. The incremental feature selection (IFS) strategy was then used to determine the optimal feature subset, i.e. the subset that produces the maximal accuracy. Feature selection was conducted as follows: (i) the accuracy of the first feature subset, which included only the feature with the largest F-score, was evaluated; (ii) the accuracy of the second feature subset, generated by appending the feature with the second largest F-score, was evaluated; (iii) the second step was repeated, moving from larger to smaller F-scores, until all candidate features had been added. The feature subset with the highest accuracy was finally selected.
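A minimal Python sketch of F-score ranking followed by IFS is given below. It assumes the F-score definition commonly used with LIBSVM's fselect.py: between-class separation of the feature means divided by the sum of the within-class sample variances. The accuracy evaluator is a hypothetical nearest-centroid stand-in for the cross-validated SVM accuracy of the real workflow, so that the example runs with the standard library only. All names and the toy data are illustrative.

```python
from statistics import mean, variance


def f_score(pos, neg):
    """F-score of one feature from its values in the two classes."""
    m_all = mean(pos + neg)
    between = (mean(pos) - m_all) ** 2 + (mean(neg) - m_all) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1)
    return between / within if within > 0 else 0.0


def rank_features(X_pos, X_neg):
    """Feature indices sorted by F-score, largest first."""
    n = len(X_pos[0])
    scores = [f_score([r[j] for r in X_pos], [r[j] for r in X_neg])
              for j in range(n)]
    return sorted(range(n), key=lambda j: scores[j], reverse=True)


def incremental_feature_selection(ranked, evaluate):
    """Grow the subset one ranked feature at a time; keep the best one.

    Ties are resolved in favour of the smaller subset (an assumption).
    """
    best_subset, best_acc = [], float("-inf")
    for k in range(1, len(ranked) + 1):
        acc = evaluate(ranked[:k])
        if acc > best_acc:
            best_subset, best_acc = ranked[:k], acc
    return best_subset, best_acc


def centroid_accuracy(X_pos, X_neg, subset):
    """Stand-in evaluator: nearest-centroid accuracy on the given features."""
    c_pos = [mean(r[j] for r in X_pos) for j in subset]
    c_neg = [mean(r[j] for r in X_neg) for j in subset]

    def dist(row, c):
        return sum((row[j] - cj) ** 2 for j, cj in zip(subset, c))

    correct = sum(dist(r, c_pos) < dist(r, c_neg) for r in X_pos)
    correct += sum(dist(r, c_neg) < dist(r, c_pos) for r in X_neg)
    return correct / (len(X_pos) + len(X_neg))


# Toy data: only feature 0 separates the two classes.
X_pos = [[0.9, 0.2, 0.5], [0.8, 0.1, 0.4], [1.0, 0.3, 0.6]]
X_neg = [[0.1, 0.2, 0.5], [0.2, 0.3, 0.4], [0.0, 0.1, 0.6]]
ranked = rank_features(X_pos, X_neg)
best_subset, best_acc = incremental_feature_selection(
    ranked, lambda s: centroid_accuracy(X_pos, X_neg, s))
print(ranked, best_subset, best_acc)
```

To mirror the procedure described above more closely, `evaluate` would be replaced by a function that trains an SVM on the chosen features and returns its cross-validated accuracy.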
The optimal feature subset with 644 features was determined through feature selection against 8027 features including 8000 tripeptide features and 27 PseAAC features, which was then employed to train the SVM-based model. The results from ten-fold cross-validation showed that the accuracy of the predictive model was 81.84% with 0.64 MCC, 84.51% sensitivity and 79.17% specificity. The area under ROC curve (AUC) was approximately 0.90. The permutation test resulted in a p-value of less than 0.001.
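The reported measures are related through the confusion matrix, as the short Python sketch below shows. The counts are not given in the text: they are the integer counts that reproduce the reported sensitivity and specificity on a 749 + 749 dataset, under the assumption that the cross-validation results are pooled into a single confusion matrix.

```python
from math import sqrt


def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and MCC from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "mcc": mcc}


# Inferred counts (assumption): 633 of 749 positives and 593 of 749
# negatives classified correctly.
m = binary_metrics(tp=633, fn=116, tn=593, fp=156)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these counts the sketch gives a sensitivity of 84.51%, a specificity of 79.17%, an accuracy of 81.84% and an MCC of about 0.64, in agreement with the values above.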
The SABinder tool is capable of predicting peptides that can bind to streptavidin. This tool was developed on the basis of an ensemble model trained by SVM using optimized dipeptide composition (ODPC). The datasets were built through the following steps:
Streptavidin-binding peptides:
- Collect all peptides obtained from completely random library in the MimoDB Database version 4.0 that can bind to streptavidin.
- Delete the terminal cysteine (C) of peptides if they are from cysteine-restricted library.
- Remove the redundant peptides.
- Exclude sequences harboring ambiguous residues ("X", "B" and "Z") or non-alpha characters.
Non-streptavidin-binding peptides:
- Collect all peptides obtained from completely random library in the MimoDB Database version 4.0.
- Delete the terminal cysteine (C) of peptides if they are from cysteine-restricted library.
- Remove the redundant peptides.
- Exclude sequences harboring ambiguous residues ("X", "B" and "Z") or non-alpha characters.
- Remove the sequences which are same with that of the positive dataset.
After the above procedures, 199 streptavidin-binding peptides and 15,266 non-streptavidin-binding peptides were obtained. The negative samples greatly outnumbered the positive samples. Therefore, a down-sampling strategy was adopted to address this imbalance: 199 peptides were randomly picked from the negative samples. To diminish random errors, this procedure was repeated ten times. The single positive dataset of 199 peptides was then paired with each of the ten negative sub-datasets. As a consequence, ten pairs of sub-datasets were generated, each made up of 199 peptides with specific affinity to streptavidin and 199 peptides without affinity to streptavidin. Four types of features, amino acid composition (AAC), optimized amino acid composition (OAAC), dipeptide composition (DPC) and optimized dipeptide composition (ODPC), were used to encode each peptide sequence. The 5-fold cross-validation results show that the ODPC feature achieves the best performance, with an accuracy of 89.2% and an MCC of 0.79.
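The cleaning steps and the repeated down-sampling can be sketched in Python as follows. This is a sketch under stated assumptions rather than the code used to build SABinder: "terminal cysteine" is taken to mean one cysteine at each end, the random seed and function names are invented for the example, and the demo sequences are arbitrary.

```python
import random

AMBIGUOUS = set("XBZ")  # ambiguous residue codes named in the text


def clean(peptides, cysteine_restricted=False):
    """Strip flanking cysteines, drop invalid peptides, remove duplicates."""
    seen, out = set(), []
    for pep in peptides:
        pep = pep.strip().upper()
        if cysteine_restricted:
            # Assumption: one flanking cysteine at each end is removed.
            if pep.startswith("C"):
                pep = pep[1:]
            if pep.endswith("C"):
                pep = pep[:-1]
        if not pep or not pep.isalpha() or AMBIGUOUS & set(pep):
            continue  # empty, non-alpha or ambiguous
        if pep not in seen:
            seen.add(pep)
            out.append(pep)
    return out


def balanced_pairs(positives, negatives, n_pairs=10, seed=0):
    """Pair the positive set with n_pairs random negative subsets."""
    rng = random.Random(seed)
    pos_set = set(positives)
    pool = [p for p in negatives if p not in pos_set]
    return [(list(positives), rng.sample(pool, len(positives)))
            for _ in range(n_pairs)]


# Arbitrary demo data, used only to exercise the functions.
pos = clean(["CHPQGPPC", "CHPQGPPC", "AXA", "WHK-R"], cysteine_restricted=True)
neg = clean(["SILPVTR", "GETRAPL", "HPQGPP", "KLWVIPQ", "TTYSRFP"])
pairs = balanced_pairs(pos, neg, n_pairs=10, seed=1)
print(pos, len(pairs))
```

Each pair holds the same positive set and an independently drawn negative subset of equal size, so a model trained on each pair can later be combined into an ensemble.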
The PSBinder tool predicts peptides that can bind to polystyrene surfaces. This tool was developed on the basis of an ensemble model trained by SVM using optimized dipeptide composition (ODPC).
We compared each sequence in the negative dataset with the sequences in the positive dataset and deleted from the negative dataset any sequence that also appeared in the positive dataset. For both the positive and negative datasets, the cysteine residues at both ends of circular peptides were deleted, and peptide sequences harboring ambiguous residues ("B", "J", "O", "U", "X" and "Z") or non-alpha characters were excluded. Eventually, we obtained 104 peptides for the positive dataset and 104 peptides for the negative dataset. To keep the prediction model from overfitting because of high sequence similarity in the positive dataset, we used the cd-hit software to keep peptide sequence similarity below 80%.
After the above procedures, we collected 104 positive peptide sequences; the negative dataset consisted of 104 peptide sequences with the same lengths and sources as those in the positive dataset. The 5-fold cross-validation results show that the ODPC feature achieves the best performance, with an accuracy of 86.54% and an MCC of 0.73.
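Both SABinder and PSBinder encode peptides by dipeptide composition. A minimal Python sketch of the plain dipeptide composition (DPC) is shown below, assuming the usual definition: the frequency of each of the 400 ordered residue pairs among the adjacent pairs of the peptide. The "optimized" variant (ODPC) is not specified in this document and is therefore not reproduced. The function name and the demo peptide are illustrative.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def dipeptide_composition(peptide):
    """Return a 400-dimensional vector of adjacent-pair frequencies."""
    total = len(peptide) - 1  # number of adjacent pairs
    if total < 1:
        raise ValueError("peptide must have at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        counts[peptide[i:i + 2]] += 1  # KeyError on non-standard residues
    return [counts[d] / total for d in DIPEPTIDES]


vec = dipeptide_composition("HPQHPQ")  # arbitrary demo peptide
print(len(vec), vec[DIPEPTIDES.index("HP")])
```

The demo peptide has five adjacent pairs (HP, PQ, QH, HP, PQ), so the entry for "HP" is 2/5.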
MimoScan finds peptides in the BDB database that match your query patterns. Patterns of either TUPs or mimotopes are accepted. For example, one or more motifs are often derived from biopanning results. You can submit these patterns to MimoScan and check how specific your patterns are. The PROSITE Perl module is used to convert each query pattern to a regular expression, which the MimoScan script then uses to match each peptide in the BDB database. Thus, your query patterns must be written in PROSITE format [29]. The following pattern syntax is taken from the official documentation page of PROSITE:
- The standard IUPAC one-letter codes for the amino acids are used in PROSITE.
- The symbol 'x' is used for a position where any amino acid is accepted.
- Ambiguities are indicated by listing the acceptable amino acids for a given position, between square brackets '[ ]'. For example: [ALT] stands for Ala or Leu or Thr.
- Ambiguities are also indicated by listing between a pair of curly brackets '{ }' the amino acids that are not accepted at a given position. For example: {AM} stands for any amino acid except Ala and Met.
- Each element in a pattern is separated from its neighbor by a '-'.
- Repetition of an element of the pattern can be indicated by following that element with a numerical value or, if it is a gap ('x'), by a numerical range between parentheses.
- Examples:
- x(3) corresponds to x-x-x
- x(2,4) corresponds to x-x or x-x-x or x-x-x-x
- A(3) corresponds to A-A-A
- Note: You can only use a range with 'x', i.e. A(2,4) is not a valid pattern element.
- When a pattern is restricted to either the N- or C-terminal of a sequence, that pattern either starts with a '<' symbol or respectively ends with a '>' symbol. In some rare cases (e.g. PS00267 or PS00539), '>' can also occur inside square brackets for the C-terminal element. 'F-[GSTV]-P-R-L-[G>]' means that either 'F-[GSTV]-P-R-L-G' or 'F-[GSTV]-P-R-L>' are considered.
The following extended syntax, which is allowed for ScanProsite, is also supported by MimoScan.
- Examples:
- [AC]-x-V-x(4)-{ED}
This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
- <A-x-[ST](2)-x(0,1)-V
This pattern, which must be in the N-terminal of the sequence ('<'), is translated as: Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val
- <{C}*>
This pattern describes all sequences which do not contain any Cysteines.
- IIRIFHLRNI
This pattern describes all sequences which contain the subsequence 'IIRIFHLRNI'.
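To illustrate how such patterns map onto regular expressions, here is a small Python sketch. It is not the PROSITE Perl module that MimoScan uses; it is an independent, simplified converter that covers only the syntax elements listed above, and the function names are invented for the example.

```python
import re

# One pattern element, an optional repetition "(n)" or "(n,m)", an optional "*".
TOKEN = re.compile(
    r"(?P<elem>\[[A-Z>]+\]|\{[A-Z]+\}|[A-Zx])"
    r"(?:\((?P<rep>\d+(?:,\d+)?)\))?"
    r"(?P<star>\*)?")


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern to a Python regular expression."""
    p = pattern.replace(" ", "").rstrip(".")
    prefix = suffix = ""
    if p.startswith("<"):          # anchored at the N-terminus
        prefix, p = "^", p[1:]
    if p.endswith(">"):            # anchored at the C-terminus
        suffix, p = "$", p[:-1]
    parts, pos = [], 0
    while pos < len(p):
        if p[pos] == "-":          # element separator
            pos += 1
            continue
        m = TOKEN.match(p, pos)
        if not m:
            raise ValueError(f"cannot parse {p[pos:]!r}")
        elem, rep = m.group("elem"), m.group("rep")
        if elem == "x":
            rx = "."                                  # any residue
        elif elem.startswith("{"):
            rx = "[^" + elem[1:-1] + "]"              # excluded residues
        elif elem.startswith("[") and ">" in elem:
            residues = elem[1:-1].replace(">", "")
            rx = "(?:[" + residues + "]|$)" if residues else "$"
        else:
            rx = elem                                 # residue or [set]
        if rep:
            if "," in rep and elem != "x":
                raise ValueError("a range is only valid with 'x'")
            rx += "{" + rep + "}"
        if m.group("star"):
            rx += "*"
        parts.append(rx)
        pos = m.end()
    return prefix + "".join(parts) + suffix


def scan(pattern, peptides):
    """Return the peptides that match the pattern anywhere in the sequence."""
    rx = re.compile(prosite_to_regex(pattern))
    return [pep for pep in peptides if rx.search(pep)]


print(prosite_to_regex("[AC]-x-V-x(4)-{ED}"))
print(scan("<{C}*>", ["ACDEF", "AKDEF"]))
```

For the first example above, the sketch produces the expression `[AC].V.{4}[^ED]`, and the cysteine-free pattern keeps only the second demo sequence.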
MimoSearch finds peptides in the BDB database that are identical to your query sequences. MimoScan and MimoBlast can also find peptides in the BDB database that are identical to query sequences. However, MimoSearch is the best choice for finding peptides that bind to various targets, because the target for each peptide is explicitly displayed in its result table. In contrast, you cannot get the target information directly from the result tables of MimoScan and MimoBlast; you must click the Mimoset ID number linked to the BDB database to find it. The difference arises because (1) both MimoScan and MimoBlast use the sequence file produced from the BDB database, which lacks target names, whereas (2) MimoSearch uses SQL queries against the relational BDB database, which has all the information needed. However, MimoScan and MimoBlast have their own advantages. For example, MimoScan finds all sequences containing your query peptide, and MimoBlast can list all similar sequences besides the identical ones.
Powered by BLASTP 2.2.31+ [30] and the BDB database, MimoBlast finds peptides in the BDB database that are similar to your query sequences. The parameters of the BLASTP program are rather complicated, and you can visit the official help page of BLAST to learn more. According to our tests, the expect value is one of the most important parameters affecting the BLAST result. To simplify MimoBlast, only two parameters are displayed on its interface for users to modify. One is the expect value, with a default of 10. The other is the max results (max target sequences), which is set to 300 by default in MimoBlast. This parameter determines the maximum number of aligned sequences to display, although the actual number of alignments may be greater than this number. By default, MimoBlast uses the following combination of parameters: expect value 10, word size 3, filtering of low-complexity regions on, scoring matrix BLOSUM62, gap open 11, gap extend 1 and max target sequences 300. Frequently, however, no significant matches to the database are found for a short peptide of 15 residues or fewer under these settings. The usual reasons are that the expect value is set too stringently and the word size is set too high. If no hits are found with the default parameter combination, you can turn on the preset parameters for short peptide alignment. This combination is specially optimized for short, nearly exact matches: expect value 20000, word size 2, filtering of low-complexity regions off, scoring matrix PAM30, composition-based statistics off, gap open 9, gap extend 1 and max target sequences 300.
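For users who run BLAST+ locally, the two parameter combinations above correspond to standard `blastp` command-line options. The Python sketch below only assembles the argument list; it does not show how MimoBlast itself invokes BLASTP, which is not documented here. The query file and database names are placeholders, and the function name is invented for the example.

```python
def blastp_args(query, db, short_peptide=False, evalue=None,
                max_target_seqs=300):
    """Build a blastp argument list for the default or short-peptide preset."""
    if short_peptide:
        # Preset for short, nearly exact matches, as described above.
        params = {"evalue": 20000, "word_size": 2, "seg": "no",
                  "matrix": "PAM30", "comp_based_stats": 0,
                  "gapopen": 9, "gapextend": 1}
    else:
        # Default combination, as described above.
        params = {"evalue": 10, "word_size": 3, "seg": "yes",
                  "matrix": "BLOSUM62", "gapopen": 11, "gapextend": 1}
    if evalue is not None:
        params["evalue"] = evalue  # user-supplied expect value
    params["max_target_seqs"] = max_target_seqs
    args = ["blastp", "-query", query, "-db", db]
    for name, value in params.items():
        args += [f"-{name}", str(value)]
    return args


# Placeholder file and database names (assumptions for illustration).
default_cmd = blastp_args("query.fasta", "bdb_peptides")
short_cmd = blastp_args("query.fasta", "bdb_peptides", short_peptide=True)
print(" ".join(default_cmd))
print(" ".join(short_cmd))
```

The resulting list can be passed to `subprocess.run` on a machine where BLAST+ is installed and a protein database has been built with `makeblastdb`.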
References
- Smith GP: Filamentous fusion phage: novel expression vectors that display cloned antigens on the virion surface. Science 1985, 228(4705): 1315-1317.
- Scott JK, Smith GP: Searching for peptide ligands with an epitope library. Science 1990, 249(4967): 386-390.
- McCafferty J, Griffiths AD, Winter G, Chiswell DJ: Phage antibodies: filamentous phage displaying antibody variable domains. Nature 1990, 348(6301): 552-554.
- Pasqualini R, Ruoslahti E: Organ targeting in vivo using phage display peptide libraries. Nature 1996, 380(6572): 364-366.
- Lee YJ, Yi H, Kim WJ, Kang K, Yun DS, Strano MS, Ceder G, Belcher AM: Fabricating genetically engineered high-power lithium-ion batteries using multiple virus genes. Science 2009, 324(5930): 1051-1055.
- Nam YS, Magyar AP, Lee D, Kim JW, Yun DS, Park H, Pollom TS, Jr., Weitz DA, Belcher AM: Biologically templated photocatalytic nanostructures for sustained light-driven water oxidation. Nat Nanotechnol 2010, 5(5): 340-344.
- Geysen HM, Rodda SJ, Mason TJ: A priori delineation of a peptide which mimics a discontinuous antigenic determinant. Mol Immunol 1986, 23(7): 709-715.
- Smith GP, Petrenko VA: Phage Display. Chem Rev 1997, 97(2): 391-410.
- Rodi DJ, Janes RW, Sanganee HJ, Holton RA, Wallace BA, Makowski L: Screening of a library of phage-displayed peptides identifies human bcl-2 as a taxol-binding protein. J Mol Biol 1999, 285(1): 197-203.
- Tong AH, Drees B, Nardelli G, Bader GD, Brannetti B, Castagnoli L, Evangelista M, Ferracuti S, Nelson B, Paoluzi S et al: A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules. Science 2002, 295(5553): 321-324.
- Thom G, Cockroft AC, Buchanan AG, Candotti CJ, Cohen ES, Lowne D, Monk P, Shorrock-Hart CP, Jermutus L, Minter RR: Probing a protein-protein interaction by in vitro evolution. Proc Natl Acad Sci U S A 2006, 103(20): 7619-7624.
- Deutscher SL: Phage display in molecular imaging and diagnosis of cancer. Chem Rev 2010, 110(5): 3196-3211.
- Macdougall IC, Rossert J, Casadevall N, Stead RB, Duliege AM, Froissart M, Eckardt KU: A peptide-based erythropoietin-receptor agonist for pure red-cell aplasia. N Engl J Med 2009, 361(19): 1848-1855.
- Knittelfelder R, Riemer AB, Jensen-Jarolim E: Mimotope vaccination - from allergy to cancer. Expert Opin Biol Ther 2009, 9(4): 493-506.
- Menendez A, Scott JK: The nature of target-unrelated peptides recovered in the screening of phage-displayed random peptide libraries with antibodies. Anal Biochem 2005, 336(2): 145-157.
- Vodnik M, Zager U, Strukelj B, Lunder M: Phage display: selecting straws instead of a needle from a haystack. Molecules 2011, 16(1): 790-817.
- Brammer LA, Bolduc B, Kass JL, Felice KM, Noren CJ, Hall MF: A target-unrelated peptide in an M13 phage display library traced to an advantageous mutation in the gene II ribosome-binding site. Anal Biochem 2008, 373(1): 88-98.
- Thomas WD, Golomb M, Smith GP: Corruption of phage display libraries by target-unrelated clones: diagnosis and countermeasures. Anal Biochem 2010, 407(2): 237-240.
- Derda R, Tang SK, Li SC, Ng S, Matochko W, Jafari MR: Diversity of phage-displayed libraries of peptides during panning and amplification. Molecules 2011, 16(2): 1776-1803.
- Huang J, Ru B, Li S, Lin H, Guo FB: SAROTUP: scanner and reporter of target-unrelated peptides. J Biomed Biotechnol 2010, 2010: 101932.
- Derda R, Tang SK, Whitesides GM: Uniform amplification of phage with different growth characteristics in individual compartments consisting of monodisperse droplets. Angew Chem Int Ed Engl 2010, 49(31): 5301-5304.
- Ru B, Huang J, Dai P, Li S, Xia Z, Ding H, Lin H, Guo F, Wang X: MimoDB: a New Repository for Mimotope Data Derived from Phage Display Technology. Molecules 2010, 15(11): 8279-8288.
- Huang J, Ru B, Zhu P, Nie F, Yang J, Wang X, Dai P, Lin H, Guo FB, Rao N: MimoDB 2.0: a mimotope database and beyond. Nucleic Acids Res 2012, 40(Database issue): D271-D277.
- He B, Chai G, Duan Y, Yan Z, Qiu L, Zhang H, Liu Z, He Q, Han K, Ru B, Guo FB, Ding H, Lin H, Wang X, Rao N, Zhou P, Huang J: BDB: biopanning data bank. Nucleic Acids Res 2016, 44(Database issue): D1127-D1132.
- Vodnik M, Strukelj B, Lunder M: HWGMWSY, an unanticipated polystyrene binding peptide from random phage display libraries. Anal Biochem 2012, 424(2): 83-86.
- Nguyen KT, Adamkiewicz MA, Hebert LE, Zygiel EM, Boyle HR, Martone CM, Melendez-Rios CB, Noren KA, Noren CJ, Hall MF: Identification and characterization of mutant clones with enhanced propagation rates from phage-displayed peptide libraries. Anal Biochem 2014, 462: 35-43.
- Matochko WL, Cory Li S, Tang SK, Derda R: Prospective identification of parasitic sequences in phage display screens. Nucleic Acids Res 2014, 42(2): 1784-1798.
- Shao J, Xu D, Tsai SN, Wang Y, Ngai SM: Computational identification of protein methylation sites through bi-profile Bayes feature extraction. PLoS ONE 2009, 4(3): e4920.
- Sigrist CJ, Cerutti L, Hulo N, Gattiker A, Falquet L, Pagni M, Bairoch A, Bucher P: PROSITE: a documented database using patterns and profiles as motif descriptors. Brief Bioinform 2002, 3(3): 265-274.
- Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL: BLAST+: architecture and applications. BMC Bioinformatics 2009, 10: 421.