Tools

4.1 Introduction

Phage is short for bacteriophage, a virus that infects bacteria. Many phages, such as M13 and fd, are good expression vectors. In 1985, George P. Smith displayed foreign peptides on the virion surface by inserting foreign DNA fragments into gene III of a filamentous phage ^[1]. He also demonstrated that foreign peptides in fusion proteins on the virion surface were immunologically accessible, and that specific fusion phages could be enriched and isolated from a phage library of random inserts in a fusion-phage vector by one or more rounds of affinity selection. This technology was later termed phage display or biopanning. Since that pioneering work, phage display has been developed, refined and improved by many scientists from various fields, and its applications have extended from epitope mapping ^[2], antibody engineering ^[3] and organ targeting ^[4] to new materials and new energy studies ^[5,6].

The substance used to screen a phage library is usually termed the target, and the natural partner of the target is called the template. A peptide that mimics the binding site on the template and can bind to the target is defined as a mimotope, a term first introduced by Mario Geysen et al. ^[7]. As mimics of binding sites, mimotopes have been widely used in mapping epitopes ^[8], identifying drug targets ^[9] and inferring protein interaction networks ^[10,11]. Mimotopes have also shown potential in the development of new diagnostics ^[12], therapeutics ^[13] and vaccines ^[14].

Biopanning results, however, contain not only mimotopes but also all kinds of target-unrelated peptides (TUPs) ^[15]. Phage display data are therefore usually a mixture of mimotopes (the desired signal) and TUPs (unwanted noise). TUPs can be divided into two categories ^[16]. The first is the selection-related TUP (SrTUP). Although unable to bind to the target site, SrTUPs can react with contaminants or other components of the screening system and thus sneak into the biopanning results. The second is the propagation-related TUP (PrTUP). PrTUPs creep into the output of biopanning because they have a higher infection rate or a faster secretion rate ^[17,18]. Phages with a growth advantage not only add noise but also decrease library diversity and lead to a loss of useful mimotopes. Simulations and experiments have shown that subtle differences in growth rate yield drastic differences in clone abundance after several rounds of amplification ^[19]. Thus, PrTUPs may even dominate the biopanning results. Because TUPs are unrelated to the target, they are not proper candidates for developing diagnostics, therapeutics or vaccines. They also interfere with the prediction of protein interaction sites if they are mistaken for mimotopes. Unfortunately, such mistakes are made from time to time ^[20]. Changing experimental conditions and improving experimental methods can reduce TUPs. For example, increasing the stringency of panning may reduce TUPs, subtractive procedures may decrease SrTUPs, and amplification in isolated compartments can mitigate the growth advantage of PrTUPs ^[21]. However, TUPs cannot be completely eliminated experimentally, because they arise from the screening system and the amplification step themselves. Excluding TUPs from biopanning results with computational tools has therefore become an alternative and more convenient choice.

4.2 MimoSearch

MimoSearch finds peptides in the BDB database that are identical to your query sequences. Simple Search, Advanced Search, MimoScan and MimoBlast can also find peptides identical to query sequences, but Simple Search and Advanced Search accept only one peptide per search. MimoSearch is the best choice for finding out which targets a set of peptides bind to, because the target of each peptide is displayed explicitly in its result table. In contrast, the result tables of Simple Search, Advanced Search, MimoScan and MimoBlast do not show the target directly; you must click the BiopanningDataSet ID, which links to the BDB database, to find it. MimoSearch differs from MimoScan and MimoBlast for two reasons: (1) MimoScan and MimoBlast both use a sequence file produced from the BDB database, which lacks target names; (2) MimoSearch runs SQL queries against the relational BDB database, which holds all the information needed. MimoScan and MimoBlast have their own advantages, however. For example, MimoScan finds all sequences containing your query peptide, and MimoBlast lists similar sequences in addition to identical ones.
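The kind of SQL lookup described above can be sketched as follows. This is only an illustration: the real BDB schema is not given here, so the two tables (`dataset`, `peptide`), their columns, and the toy rows are all hypothetical, and an in-memory SQLite database stands in for the relational BDB database.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with a HYPOTHETICAL schema and toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dataset (dataset_id INTEGER PRIMARY KEY, target TEXT);
        CREATE TABLE peptide (sequence TEXT, dataset_id INTEGER);
        INSERT INTO dataset VALUES (1, 'toy target A'), (2, 'toy target B');
        INSERT INTO peptide VALUES ('ACDEFGH', 1), ('ACDEFGH', 2), ('KLMNPQR', 2);
        """
    )
    return conn


def mimo_search(conn, queries):
    """Return (sequence, dataset_id, target) rows for exact sequence matches."""
    queries = list(queries)
    if not queries:
        return []
    placeholders = ",".join("?" for _ in queries)
    sql = (
        "SELECT p.sequence, d.dataset_id, d.target "
        "FROM peptide AS p JOIN dataset AS d ON d.dataset_id = p.dataset_id "
        f"WHERE p.sequence IN ({placeholders}) "
        "ORDER BY p.sequence, d.dataset_id"
    )
    return conn.execute(sql, queries).fetchall()
```

Because the join pulls the target from the same query, several peptides can be looked up at once and each hit arrives with its target, which is the behavior that distinguishes MimoSearch from the file-based tools.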

4.3 MimoBlast

Powered by BLASTP 2.2.24+ ^[22] and the BDB database, MimoBlast finds peptides in the BDB database that are similar to your query sequences. The parameters of the BLASTP program are rather complicated; see the official BLAST help page to learn more. According to our tests, the expect value is one of the most important parameters affecting the BLAST result. To keep MimoBlast simple, only two parameters can be modified on its web interface. One is the expect value, which defaults to 10. The other is max results (max target sequences), which defaults to 300. This parameter sets the maximum number of aligned sequences to display, although the actual number of alignments may be greater. By default, MimoBlast uses the following combination of parameters: expect value 10, word size 3, low-complexity filter on, scoring matrix BLOSUM62, gap open 11, gap extend 1 and max target sequences 300. With a peptide of 15 residues or fewer, however, these settings frequently find no significant matches in the database, usually because the expect value is too stringent and the word size is too large for such a short query. If the default combination returns no hits, you can turn on the preset parameters for short peptide alignment. This combination is optimized for short, nearly exact matches: expect value 20000, word size 2, low-complexity filter off, scoring matrix PAM30, composition-based statistics off, gap open 9, gap extend 1 and max target sequences 300.
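For readers who want to reproduce the two parameter combinations locally, the sketch below maps them onto BLAST+ `blastp` command-line options. How MimoBlast itself invokes BLASTP is not documented here, so this mapping is an assumption; the file name `query.fasta` and database name `bdb_peptides` in the usage note are hypothetical.

```python
# Settings taken from the text; the mapping to blastp option names is an assumption.
DEFAULT_PARAMS = {
    "evalue": "10",
    "word_size": "3",
    "seg": "yes",              # low-complexity filter on
    "matrix": "BLOSUM62",
    "gapopen": "11",
    "gapextend": "1",
    "max_target_seqs": "300",
}

SHORT_PEPTIDE_PARAMS = {
    "evalue": "20000",
    "word_size": "2",
    "seg": "no",               # low-complexity filter off
    "matrix": "PAM30",
    "comp_based_stats": "0",   # composition-based statistics off
    "gapopen": "9",
    "gapextend": "1",
    "max_target_seqs": "300",
}


def blastp_command(query_fasta, db, short_peptide=False):
    """Build (but do not run) a blastp argument list for one preset."""
    params = SHORT_PEPTIDE_PARAMS if short_peptide else DEFAULT_PARAMS
    cmd = ["blastp", "-query", query_fasta, "-db", db]
    for flag, value in params.items():
        cmd += [f"-{flag}", value]
    return cmd
```

Calling `blastp_command("query.fasta", "bdb_peptides", short_peptide=True)` returns a list that could be passed to `subprocess.run` on a machine where BLAST+ and a formatted peptide database are available.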

4.4 MimoScan

MimoScan finds peptides in the BDB database that match your query patterns. Patterns of either TUPs or mimotopes are acceptable. For example, one or more motifs are often derived from biopanning results; you can submit these patterns to MimoScan and check how specific your patterns are. The PROSITE Perl module converts each query pattern to a regular expression, which the MimoScan script then matches against each peptide in the BDB database. Your query patterns must therefore be written in PROSITE format ^[23]. The following pattern syntax is taken from the official PROSITE documentation:

1. The standard IUPAC one-letter codes for the amino acids are used in PROSITE.

2. The symbol 'x' is used for a position where any amino acid is accepted.

3. Ambiguities are indicated by listing the acceptable amino acids for a given position, between square brackets '[ ]'. For example: [ALT] stands for Ala or Leu or Thr.

4. Ambiguities are also indicated by listing between a pair of curly brackets '{ }' the amino acids that are not accepted at a given position. For example: {AM} stands for any amino acid except Ala and Met.

5. Each element in a pattern is separated from its neighbor by a '-'.

6. Repetition of an element of the pattern can be indicated by following that element with a numerical value or, if it is a gap ('x'), by a numerical range between parentheses.

Examples:

x(3) corresponds to x-x-x

x(2,4) corresponds to x-x or x-x-x or x-x-x-x

A(3) corresponds to A-A-A

Note: You can only use a range with 'x', i.e. A(2,4) is not a valid pattern element.

7. When a pattern is restricted to either the N- or C-terminal of a sequence, that pattern either starts with a '<' symbol or respectively ends with a '>' symbol. In some rare cases (e.g. PS00267 or PS00539), '>' can also occur inside square brackets for the C-terminal element. 'F-[GSTV]-P-R-L-[G>]' means that either 'F-[GSTV]-P-R-L-G' or 'F-[GSTV]-P-R-L>' are considered.

The following extended syntax, which is allowed by ScanProsite, is also accepted by MimoScan.

Examples:

[AC]-x-V-x(4)-{ED}

This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}

< A-x-[ST](2)-x(0,1)-V

This pattern, which must be in the N-terminal of the sequence ('<'), is translated as: Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val

<{C}*>

This pattern describes all sequences which do not contain any Cysteines.

IIRIFHLRNI

This pattern describes all sequences which contain the subsequence 'IIRIFHLRNI'.
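The conversion from a PROSITE pattern to a regular expression can be illustrated with a short Python sketch. MimoScan itself uses the PROSITE Perl module, so the function below is not its implementation; it is a simplified stand-in that covers only the syntax rules and examples listed above, and the function name is hypothetical.

```python
import re

# One pattern element (residue, x, [..] or {..}) with an optional repetition.
_ELEMENT = re.compile(r"(\[[A-Z>]+\]|\{[A-Z]+\}|[A-Zx])(\(\d+(?:,\d+)?\)|\*)?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    p = pattern.replace(" ", "").rstrip(".")
    parts = []
    if p.startswith("<"):          # pattern anchored at the N-terminus
        parts.append("^")
        p = p[1:]
    anchored_end = p.endswith(">")  # pattern anchored at the C-terminus
    if anchored_end:
        p = p[:-1]
    pos = 0
    while pos < len(p):
        if p[pos] == "-":          # element separator
            pos += 1
            continue
        m = _ELEMENT.match(p, pos)
        if m is None:
            raise ValueError(f"cannot parse {pattern!r} at position {pos}")
        element, repeat = m.groups()
        if element == "x":
            rx = "."
        elif element.startswith("{"):
            rx = "[^" + element[1:-1] + "]"
        elif element.startswith("[") and ">" in element:
            letters = element[1:-1].replace(">", "")
            rx = "(?:[" + letters + "]|$)" if letters else "$"
        else:
            rx = element
        if repeat == "*":
            rx += "*"
        elif repeat:
            rx += "{" + repeat[1:-1] + "}"
        parts.append(rx)
        pos = m.end()
    if anchored_end:
        parts.append("$")
    return "".join(parts)
```

A converted pattern can then be applied to each peptide with `re.search(prosite_to_regex(pattern), peptide)`; a pattern without `<` or `>` matches anywhere in the sequence, which is why a plain subsequence works as a query.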

4.5 TUPScan

TUPScan finds known TUP motifs in your query sequences. In version 2.0, the set of TUP motifs and confirmed TUPs embedded in the TUPScan script has been updated to 61 items: one biotin-binding motif, one albumin-binding motif, one Protein A-binding motif, three plastic-binding motifs, four metal ion-binding motifs, five immunoglobulin Fc region-binding motifs, five streptavidin-binding motifs, six unrelated antibody-binding motifs, five peptides with various targets, nine propagation-related TUPs and 21 selection-related TUPs. These documented peptides and TUP motifs were compiled from recent reviews ^[15,16], research articles ^[17,18] and database articles ^[24,25].

4.6 TUPredict

TUPredict is a directory page that collects predictors of target-unrelated peptides developed by our lab using machine learning methods. For each predictor, several types of features were used to construct predictive models, and only the model with the best performance was developed into the corresponding web service. Because data are limited, only three predictors, PhD7Faster, SABinder and PSBinder, are currently available.

4.6.1 PhD7Faster

The PhD7Faster tool predicts whether phages from the Ph.D.-7 library bearing the input peptides might grow faster. It is built on ten models trained by Support Vector Machine (SVM) with reduced amino acid pair composition (RAAPC) features. For each input peptide, PhD7Faster therefore reports the number of votes and the average probability over the ten models.
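How the two reported numbers could be derived from ten per-model probabilities is sketched below. The decision threshold of 0.5 and the function name are assumptions; the text does not state how a single model's probability is turned into a vote.

```python
def summarize_votes(probabilities, threshold=0.5):
    """Return (number of models voting positive, mean probability).

    `threshold` is an ASSUMED cut-off for counting a model's vote.
    """
    votes = sum(1 for p in probabilities if p >= threshold)
    average = sum(probabilities) / len(probabilities)
    return votes, average
```

With ten models, the vote count ranges from 0 to 10, and the average probability gives a finer-grained view when the vote is close.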

The training data sets were generated from a research article by 't Hoen et al. ^[26], who sequenced millions of phages of the Ph.D.-7 library after a single round of bacterial amplification. The results showed that 62% of unique sequences appeared only once, whereas the remaining 38% were found two or more times. Based on a Poisson distribution, the probability that a peptide appears 15 times in a naïve Ph.D.-7 library is only 5.6 × 10^-50. Peptides appearing 15 or more times were therefore considered to have a growth advantage and were taken as the positive data set. The negative data set was composed of the peptides appearing only once. However, the two data sets are extremely unbalanced: 160 peptides with a growth advantage and 2,308,778 peptides without. We used a down-sampling strategy to deal with this imbalance: 160 peptides were chosen randomly from the negative data set, and this was repeated ten times to reduce random error. Ten pairs of sub-datasets were thus obtained, each containing 160 peptides with a growth advantage and 160 peptides without.
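The down-sampling step can be sketched in a few lines of Python. Details the text leaves open are assumptions here: sampling is done without replacement within each draw, the ten draws are independent of each other, and the seed and function name are illustrative only.

```python
import random


def balanced_subsets(positives, negatives, n_repeats=10, seed=0):
    """Pair the full positive set with n_repeats random negative samples.

    Each negative sample has the same size as the positive set and is drawn
    without replacement (an assumption; the source does not specify this).
    """
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_repeats):
        sampled_negatives = rng.sample(negatives, len(positives))
        subsets.append((list(positives), sampled_negatives))
    return subsets
```

Training one model per pair and then combining the ten models is what makes the final prediction less sensitive to which negatives happened to be drawn.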

Six types of features, AAC, RAAC, AAPC, RAAPC, BC and BFE, were used to represent individual peptide sequences. AAC and AAPC are short for amino acid composition and amino acid pair composition; RAAC and RAAPC are reduced AAC and reduced AAPC. To select a good reduced feature set, we (i) calculated the accuracy of each feature and (ii) added the features one at a time, in descending order of accuracy, to an initially empty feature combination, calculating the accuracy of each combination. The combination with the best accuracy was selected. With the binary code (BC) feature, each amino acid is represented by 20 variables, all zeros except for the one characterizing the given amino acid. Thus, a 140-dimensional vector was used to encode a 7-mer peptide. The Bayes Feature Extraction (BFE) approach ^[27] relies on two profiles: a positive and a negative position-specific profile. These profiles were generated by calculating the frequency of each amino acid at each position of the peptide sequences in the positive data set and the control data set, respectively. A 7-mer peptide was therefore encoded by a 14-dimensional vector containing information on its amino acids in the positive and negative spaces.
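Three of these encodings are simple enough to sketch directly. The code below assumes that AAPC counts adjacent residue pairs (dipeptides) and normalizes by the number of pairs, which the text does not spell out; the reduced variants and BFE are omitted because the selected features and the profiles are not given here. Function names are illustrative.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def aac(peptide):
    """Amino acid composition: 20 residue frequencies."""
    return [peptide.count(a) / len(peptide) for a in AMINO_ACIDS]


def aapc(peptide):
    """Amino acid pair composition: 400 adjacent-pair frequencies (assumed)."""
    pairs = [peptide[i:i + 2] for i in range(len(peptide) - 1)]
    return [pairs.count(p) / len(pairs) for p in PAIRS]


def binary_code(peptide):
    """Binary code: one 20-dimensional one-hot block per residue."""
    vector = []
    for residue in peptide:
        vector.extend(1 if residue == a else 0 for a in AMINO_ACIDS)
    return vector
```

For a 7-mer such as `"ACDEFGH"`, `binary_code` yields 7 × 20 = 140 values, matching the dimensionality stated above.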

For each type of feature, we calculated the performance measures of the ten models built on the ten pairs of sub-datasets and compared their average values. The 5-fold cross-validation results show that RAAPC features achieve the best performance, with an accuracy of 79.34% and a Matthews correlation coefficient (MCC) of 0.589.
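For reference, accuracy and MCC can both be computed from the four cells of a confusion matrix using their standard definitions. The sketch below is generic; the counts in the usage note are made-up numbers for illustration, not results from PhD7Faster.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 if the denominator is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

For example, with 40 true positives, 40 true negatives, 10 false positives and 10 false negatives, the accuracy is 0.8 and the MCC is 0.6. Unlike accuracy, MCC stays near zero for a predictor that always outputs the same class, which is why both measures are reported.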

4.6.2 SABinder

The SABinder tool predicts peptides that can bind to streptavidin. It is built on one model trained by SVM using RAAPC (reduced amino acid pair composition) features. The data sets were built through the following steps:

Streptavidin-binding peptides:

1. Collect all streptavidin-binding peptides obtained from completely random libraries in the BDB database version 4.0.

2. Delete the terminal cysteines (C) of peptides that come from a cysteine-restricted library.

3. Remove the redundant peptides.

4. Exclude sequences harboring ambiguous residues or non-alpha characters.

Non-streptavidin-binding peptides:

1. Collect all peptides obtained from completely random libraries in the BDB database version 4.0.

2. Delete the terminal cysteines (C) of peptides that come from a cysteine-restricted library.

3. Remove the redundant peptides.

4. Exclude sequences harboring ambiguous residues or non-alpha characters.

5. Remove the sequences that are identical to sequences in the positive data set.

After these procedures, 199 streptavidin-binding peptides and 15,266 non-streptavidin-binding peptides were obtained. Down-sampling was used to deal with the imbalance: 199 peptides were randomly picked from the non-streptavidin-binding peptides as the negative data set. Four types of features, AAC (amino acid composition), RAAC (reduced AAC), AAPC (amino acid pair composition) and RAAPC (reduced AAPC), were used to encode individual peptide sequences. The 5-fold cross-validation results show that RAAPC features achieve the best performance, with an accuracy of 89.2% and an MCC of 0.79.
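The filtering steps listed for the two data sets can be sketched as a small Python pipeline. The input format (pairs of sequence and a flag marking cysteine-restricted libraries), the function names, and the reading of "terminal cysteine" as one flanking cysteine at each end are assumptions made for illustration.

```python
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def strip_terminal_cysteines(sequence):
    """Drop one flanking C from each end (assumed meaning of 'terminal cysteine')."""
    if len(sequence) >= 2 and sequence[0] == "C" and sequence[-1] == "C":
        return sequence[1:-1]
    return sequence


def clean(records):
    """records: iterable of (sequence, is_cysteine_restricted) pairs."""
    seen = set()
    cleaned = []
    for sequence, restricted in records:
        sequence = sequence.strip().upper()
        if restricted:
            sequence = strip_terminal_cysteines(sequence)
        # Exclude empty strings, ambiguous residues and non-alpha characters.
        if not sequence or not set(sequence) <= STANDARD_RESIDUES:
            continue
        if sequence in seen:  # remove redundant peptides
            continue
        seen.add(sequence)
        cleaned.append(sequence)
    return cleaned


def build_datasets(binder_records, all_records):
    """Return (positives, negatives) with positives removed from negatives."""
    positives = clean(binder_records)
    positive_set = set(positives)
    negatives = [s for s in clean(all_records) if s not in positive_set]
    return positives, negatives
```

The negative list produced this way is the large pool from which an equally sized random sample is then drawn.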

4.6.3 PSBinder

The PSBinder tool predicts peptides that can bind to a polystyrene surface. It is built on an ensemble model trained by SVM using optimized dipeptide composition (ODPC) features.

We compared each sequence in the negative dataset with those in the positive dataset and deleted from the negative dataset any sequence that also appeared in the positive dataset. For both datasets, we deleted the cysteines at both ends of circular peptides and excluded peptide sequences harboring ambiguous residues ("B", "J", "O", "U", "X" and "Z") or non-alpha characters. Eventually we obtained 104 peptides each for the positive and negative datasets. To prevent the prediction model from overfitting because of high sequence similarity in the positive dataset, we used the cd-hit software to keep peptide sequence similarity below 80%.

After the above procedures, the positive data set contained 104 peptide sequences, and the negative data set contained 104 peptide sequences with the same lengths and sources as the positive ones. The 5-fold cross-validation results show that the ODPC feature achieves the best performance, with an accuracy of 86.54% and an MCC of 0.73.

References

Smith GP: Filamentous fusion phage: novel expression vectors that display cloned antigens on the virion surface. Science 1985, 228(4705): 1315-1317.
Scott JK, Smith GP: Searching for peptide ligands with an epitope library. Science 1990, 249(4967): 386-390.
McCafferty J, Griffiths AD, Winter G, Chiswell DJ: Phage antibodies: filamentous phage displaying antibody variable domains. Nature 1990, 348(6301): 552-554.
Pasqualini R, Ruoslahti E: Organ targeting in vivo using phage display peptide libraries. Nature 1996, 380(6572): 364-366.
Lee YJ, Yi H, Kim WJ, Kang K, Yun DS, Strano MS, Ceder G, Belcher AM: Fabricating genetically engineered high-power lithium-ion batteries using multiple virus genes. Science 2009, 324(5930): 1051-1055.
Nam YS, Magyar AP, Lee D, Kim JW, Yun DS, Park H, Pollom TS, Jr., Weitz DA, Belcher AM: Biologically templated photocatalytic nanostructures for sustained light-driven water oxidation. Nat Nanotechnol 2010, 5(5): 340-344.
Geysen HM, Rodda SJ, Mason TJ: A priori delineation of a peptide which mimics a discontinuous antigenic determinant. Mol Immunol 1986, 23(7): 709-715.
Smith GP, Petrenko VA: Phage Display. Chem Rev 1997, 97(2): 391-410.
Rodi DJ, Janes RW, Sanganee HJ, Holton RA, Wallace BA, Makowski L: Screening of a library of phage-displayed peptides identifies human bcl-2 as a taxol-binding protein. J Mol Biol 1999, 285(1): 197-203.
Tong AH, Drees B, Nardelli G, Bader GD, Brannetti B, Castagnoli L, Evangelista M, Ferracuti S, Nelson B, Paoluzi S et al: A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules. Science 2002, 295(5553): 321-324.
Thom G, Cockroft AC, Buchanan AG, Candotti CJ, Cohen ES, Lowne D, Monk P, Shorrock-Hart CP, Jermutus L, Minter RR: Probing a protein-protein interaction by in vitro evolution. Proc Natl Acad Sci U S A 2006, 103(20): 7619-7624.
Deutscher SL: Phage display in molecular imaging and diagnosis of cancer. Chem Rev 2010, 110(5): 3196-3211.
Macdougall IC, Rossert J, Casadevall N, Stead RB, Duliege AM, Froissart M, Eckardt KU: A peptide-based erythropoietin-receptor agonist for pure red-cell aplasia. N Engl J Med 2009, 361(19): 1848-1855.
Knittelfelder R, Riemer AB, Jensen-Jarolim E: Mimotope vaccination - from allergy to cancer. Expert Opin Biol Ther 2009, 9(4): 493-506.
Menendez A, Scott JK: The nature of target-unrelated peptides recovered in the screening of phage-displayed random peptide libraries with antibodies. Anal Biochem 2005, 336(2): 145-157.
Vodnik M, Zager U, Strukelj B, Lunder M: Phage display: selecting straws instead of a needle from a haystack. Molecules 2011, 16(1): 790-817.
Brammer LA, Bolduc B, Kass JL, Felice KM, Noren CJ, Hall MF: A target-unrelated peptide in an M13 phage display library traced to an advantageous mutation in the gene II ribosome-binding site. Anal Biochem 2008, 373(1): 88-98.
Thomas WD, Golomb M, Smith GP: Corruption of phage display libraries by target-unrelated clones: diagnosis and countermeasures. Anal Biochem 2010, 407(2): 237-240.
Derda R, Tang SK, Li SC, Ng S, Matochko W, Jafari MR: Diversity of phage-displayed libraries of peptides during panning and amplification. Molecules 2011, 16(2): 1776-1803.
Huang J, Ru B, Li S, Lin H, Guo FB: SAROTUP: scanner and reporter of target-unrelated peptides. J Biomed Biotechnol 2010, 2010: 101932.
Derda R, Tang SK, Whitesides GM: Uniform amplification of phage with different growth characteristics in individual compartments consisting of monodisperse droplets. Angew Chem Int Ed Engl 2010, 49(31): 5301-5304.
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL: BLAST+: architecture and applications. BMC Bioinformatics 2009, 10: 421.
Sigrist CJ, Cerutti L, Hulo N, Gattiker A, Falquet L, Pagni M, Bairoch A, Bucher P: PROSITE: a documented database using patterns and profiles as motif descriptors. Brief Bioinform 2002, 3(3): 265-274.
Ru B, Huang J, Dai P, Li S, Xia Z, Ding H, Lin H, Guo F, Wang X: MimoDB: a New Repository for Mimotope Data Derived from Phage Display Technology. Molecules 2010, 15(11): 8279-8288.
Huang J, Ru B, Zhu P, Nie F, Yang J, Wang X, Dai P, Lin H, Guo FB, Rao N: MimoDB 2.0: a mimotope database and beyond. Nucleic Acids Res 2012, 40(Database issue): D271-D277.
't Hoen PA, Jirka SM, Ten Broeke BR, Schultes EA, Aguilera B, Pang KH, Heemskerk H, Aartsma-Rus A, van Ommen GJ, den Dunnen JT: Phage display screening without repetitious selection rounds. Anal Biochem 2012, 421(2): 622-631.
Shao J, Xu D, Tsai SN, Wang Y, Ngai SM: Computational identification of protein methylation sites through bi-profile Bayes feature extraction. PLoS ONE 2009, 4(3): e4920.