CASPredict

 

Support Vector Machine (SVM)

The support vector machine (SVM) is a powerful classification method that has been widely used for protein prediction. Its basic principle is to map the input vectors into a high-dimensional space through a kernel function and to construct the separating hyperplane with the largest margin, which is then used to classify observations. Empirical studies have shown that the radial basis function (RBF) kernel often gives better prediction performance than the linear, polynomial, and sigmoid kernels. In this work, we used the LIBSVM 3.24 toolkit to build the SVM-based classification model.
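For reference, the RBF kernel has the standard form K(x, y) = exp(-gamma * ||x - y||^2). The following is a minimal sketch in plain Python; the function name and the sample vectors are illustrative only and are not taken from CASPredict or LIBSVM.

```python
import math

def rbf_kernel(x, y, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Identical vectors give a similarity of 1.0; the value decays towards 0
# as the vectors move apart. The numbers below are made-up examples.
same = rbf_kernel([0.1, 0.2], [0.1, 0.2], gamma=0.5)
far = rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.5)
```

A larger gamma makes the kernel more local, which is why gamma is tuned together with the penalty parameter during model construction.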


SVM model construction

We built the final SVM model using the training dataset (155 Cas proteins and 155 non-Cas proteins) and the 167 optimal dipeptide features, which makes full use of the dataset. The F-score of each feature was calculated on the training dataset, and a grid search on the same dataset was used to find the best number of features, the penalty parameter C (error factor c), and the RBF kernel parameter gamma.
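The feature ranking step can be sketched as follows. This is an illustration under two assumptions that the text above does not confirm: that the dipeptide features are the 400 normalized dipeptide frequencies of a sequence, and that the F-score follows the definition commonly used for feature selection with LIBSVM (between-class separation divided by the sum of within-class sample variances). All function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 features

def dipeptide_composition(seq):
    """Return the 400 dipeptide frequencies of a protein sequence (assumed encoding)."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs containing non-standard residues are skipped
            counts[pair] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]

def f_score(pos_values, neg_values):
    """F-score of one feature; needs at least two values per class."""
    n_pos, n_neg = len(pos_values), len(neg_values)
    mean_pos = sum(pos_values) / n_pos
    mean_neg = sum(neg_values) / n_neg
    mean_all = (sum(pos_values) + sum(neg_values)) / (n_pos + n_neg)
    var_pos = sum((v - mean_pos) ** 2 for v in pos_values) / (n_pos - 1)
    var_neg = sum((v - mean_neg) ** 2 for v in neg_values) / (n_neg - 1)
    numerator = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    denominator = var_pos + var_neg
    # A constant feature has zero variance; scoring it 0.0 is a choice made here.
    return numerator / denominator if denominator else 0.0

def rank_features(pos_vectors, neg_vectors):
    """Return feature indices sorted by decreasing F-score."""
    n_features = len(pos_vectors[0])
    scores = [
        f_score([v[j] for v in pos_vectors], [v[j] for v in neg_vectors])
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)
```

With such a ranking, a grid search could evaluate the top-k features for each candidate pair of C and gamma and keep the combination with the best cross-validation result.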


The hmmbuild program

The hmmbuild program constructs profile hidden Markov models (HMMs) from one or more multiple sequence alignments.


The hmmscan search algorithm

The hmmscan program searches protein sequences against collections of protein profiles. It uses each query sequence to search the target database of profiles and outputs a ranked list of the profiles with the most significant matches to the sequence. CASPredict uses the hmmscan homology search algorithm from HMMER 3.1, downloaded from http://hmmer.org/, to search against the Cas protein family hidden Markov models (HMMs) deposited in the TIGRFAMs and Pfam protein family databases.
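As an illustration of the ranked output, the sketch below parses the per-target table that hmmscan writes with its --tblout option and sorts the hits by full-sequence E-value. The sample lines, model names, and accessions are made up for the example, and the helper names are hypothetical; how CASPredict itself reads the hmmscan output is not described here.

```python
# Made-up example of hmmscan --tblout content (not real search results).
SAMPLE_TBLOUT = """\
# target name  accession  query name  accession  E-value  score  bias  ...
demo_model_B   DEMO00002  query1      -          3.4e-05  22.1   0.0   5.0e-05  21.6  0.0  1.2  1  0  0  1  1  1  1  Hypothetical demo family B
demo_model_A   DEMO00001  query1      -          1.2e-30  105.3  0.1   2.0e-30  104.6 0.1  1.0  1  0  0  1  1  1  1  Hypothetical demo family A
"""

def parse_tblout(text):
    """Parse hmmscan --tblout text into a list of hit dictionaries."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        # 18 whitespace-delimited columns, then a free-text description.
        fields = line.split(None, 18)
        hits.append({
            "model": fields[0],
            "accession": fields[1],
            "query": fields[2],
            "evalue": float(fields[4]),  # full-sequence E-value
            "score": float(fields[5]),   # full-sequence bit score
            "description": fields[18].strip() if len(fields) > 18 else "",
        })
    return hits

def rank_hits(hits):
    """Most significant (lowest full-sequence E-value) first."""
    return sorted(hits, key=lambda h: h["evalue"])

ranked = rank_hits(parse_tblout(SAMPLE_TBLOUT))
```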


Cas protein family hidden Markov models

We first downloaded the multiple sequence alignments or seed alignments for protein families from the TIGRFAMs (ftp://ftp.jcvi.org/pub/data/TIGRFAMs/, version 15.0) and Pfam (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/, version 34.0) databases. We then searched the seed alignments with the keyword "CRISPR" and found 157 CRISPR-associated seed alignments in total: 101 from TIGRFAMs and 56 from Pfam. The file Cas-HMMs.zip contains the Cas protein family HMMs.
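One way to turn a set of seed alignments into a searchable HMM library is sketched below. It assumes one alignment file per family; hmmbuild and hmmpress are HMMER programs, while the file names, directory layout, and helper functions are hypothetical and do not describe the actual CASPredict build procedure. The functions only prepare the commands, so the sketch can be read without HMMER installed.

```python
from pathlib import Path

def hmmbuild_command(alignment_path, hmm_path):
    """Command line of the form: hmmbuild <hmmfile_out> <msafile>."""
    return ["hmmbuild", str(hmm_path), str(alignment_path)]

def plan_library_build(alignment_paths, out_dir, library_name="Cas.hmm"):
    """Plan one hmmbuild call per alignment plus a final hmmpress call."""
    out_dir = Path(out_dir)
    build_cmds = []
    hmm_files = []
    for aln in alignment_paths:
        hmm = out_dir / (Path(aln).stem + ".hmm")
        build_cmds.append(hmmbuild_command(aln, hmm))
        hmm_files.append(hmm)
    library = out_dir / library_name
    # hmmpress indexes the concatenated library so that hmmscan can use it.
    press_cmd = ["hmmpress", str(library)]
    return build_cmds, hmm_files, library, press_cmd

def concatenate_hmms(hmm_files, library):
    """An HMM library is a plain concatenation of individual HMM files."""
    with open(library, "w") as out:
        for hmm in hmm_files:
            out.write(Path(hmm).read_text())
```

Each planned command could then be executed with subprocess.run(cmd, check=True), with concatenate_hmms called after the hmmbuild runs and before hmmpress.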


Options for hmmscan

The thresholds defined in each profile HMM (gathering thresholds) are used to determine hit significance.

Filter: Bias composition filtering on
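On the command line, these options correspond to running hmmscan with --cut_ga, which applies each model's gathering thresholds, and leaving the bias filter at its default (on); --nobias would switch it off. The sketch below builds such a command. The file names and the helper functions are hypothetical, and the exact invocation used by the CASPredict server is an assumption.

```python
import subprocess

def hmmscan_command(hmm_library, query_fasta, tblout_path, use_bias_filter=True):
    """Build an hmmscan command that uses gathering thresholds (--cut_ga)."""
    cmd = ["hmmscan", "--cut_ga", "--tblout", str(tblout_path)]
    if not use_bias_filter:
        cmd.append("--nobias")  # the bias filter is on unless this flag is given
    cmd += [str(hmm_library), str(query_fasta)]
    return cmd

def run_hmmscan(hmm_library, query_fasta, tblout_path):
    """Run hmmscan; requires HMMER to be installed and the library to be pressed."""
    cmd = hmmscan_command(hmm_library, query_fasta, tblout_path)
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)

# Example command with hypothetical file names (nothing is executed here).
example = hmmscan_command("Cas.hmm", "query.fasta", "hits.tbl")
```

Note that --cut_ga only works when every model in the library carries gathering cut-offs.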


Glossary

Reporting Thresholds

The reporting thresholds control how many matches that fall below the inclusion threshold are still shown in the results (i.e. reported). Every target model (Cas protein family HMM) in the library is compared to the query, so reporting all matches could produce a very large output. However, it is often useful to view borderline matches, as they may reveal more distant but potentially informative similarities to the model.


Inclusion Thresholds

Inclusion (or significance) thresholds are stricter than reporting thresholds. Inclusion thresholds control which hits are considered reliable enough to be included in an output alignment or a subsequent search round. In hmmscan, which has neither alignment output nor iterative search steps, inclusion thresholds have little effect. They only affect which domains are marked as significant (!) or questionable (?) in the domain output.


Gathering Thresholds

Also called the gathering cut-off, the gathering threshold is composed of two bit scores, a sequence cut-off and a domain cut-off, which define the significance of a sequence and of a domain hit respectively. These are defined in the profile HMM and set both the significance and the reporting thresholds, so that no insignificant hits are reported. This threshold is the default setting for hmmscan on the HMMER web server; in the command-line program it is selected with the --cut_ga option. In CASPredict, the model-specific thresholding is set to the gathering threshold.
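The two-level cut-off can be illustrated with a small sketch: the full-sequence bit score is compared with the sequence cut-off, and each domain bit score with the domain cut-off. The function name and all scores below are made up; the real cut-offs are stored in each profile HMM.

```python
def passes_gathering(seq_score, domain_scores, ga_seq, ga_domain):
    """Apply a gathering threshold to one query/model pair.

    Returns (sequence_significant, significant_domain_scores).
    Domains are only considered when the sequence itself passes.
    """
    seq_ok = seq_score >= ga_seq
    domains = [s for s in domain_scores if s >= ga_domain] if seq_ok else []
    return seq_ok, domains

# Hypothetical model with sequence and domain cut-offs of 25.0 bits each.
strong = passes_gathering(30.0, [28.0, 12.0], ga_seq=25.0, ga_domain=25.0)
weak = passes_gathering(20.0, [20.0], ga_seq=25.0, ga_domain=25.0)
```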


Bias Composition

Turning off the bias composition filter can increase sensitivity, but at a high cost in speed, especially if the query has a biased residue composition (for example, a repetitive sequence region). Without the bias filter, too many sequences may pass the filter with biased queries, leading to slower than expected performance. The filter is therefore switched on in CASPredict.


E-value

An E-value (expectation value) is the number of hits that would be expected to score equal to or better than this hit by chance alone. A good E-value is much less than 1. For example, an E-value of 0.01 means that, on average, about 1 false positive would be expected in every 100 searches with different query sequences. An E-value around 1 is what we expect just by chance. E-values are widely used because the E-value alone is enough to judge the significance of a match, but note that they vary with the size of the target database.
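The dependence on database size follows from the usual relation E = P x Z, where P is the per-comparison P-value and Z is the number of targets searched. The sketch below shows the arithmetic; the function names and the P-value are illustrative, the figure of 157 models comes from the seed alignment count above, and the larger library size is hypothetical.

```python
def evalue_from_pvalue(p_value, n_targets):
    """E-value for a P-value in a search space of n_targets."""
    return p_value * n_targets

def rescale_evalue(evalue, old_n_targets, new_n_targets):
    """Express an E-value as if a different number of targets had been searched."""
    return evalue * new_n_targets / old_n_targets

# The same P-value is less significant in a larger search space.
e_small = evalue_from_pvalue(1e-5, 157)          # library of 157 models
e_large = rescale_evalue(e_small, 157, 20000)    # hypothetical larger library
```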


Profile HMM

Profile hidden Markov models (HMMs) turn a multiple sequence alignment into a position-specific scoring system that is suitable for searching databases for remotely homologous sequences.


Results

Figure 1. Result page

Number: the "Number" column shows the serial number of the query sequence;

Query: the "Query" column shows the content between the ">" character and the first space;

Length: the "Length" column shows the length of the query sequence;

Probability: the "Probability" column shows the probability that the query sequence is a Cas protein, as predicted by the SVM model;

Yes/No: the "Yes/No" column shows the prediction result. When the "Probability" is greater than or equal to the threshold "tp", the column shows "Yes"; otherwise it shows "No". When the column shows "Yes", CASPredict further uses hmmscan to search the query sequence against the Cas protein family HMMs. When the column shows "No", the "Model", "Description", and "E-value" columns are all filled with "NA";

Model: the "Model" column shows the name of the model with the best hit between the query sequence and the Cas protein family hidden Markov models (HMMs). For a query sequence matching more than one model, the model with the lowest full-sequence E-value is treated as the best hit. Users can click the model name to view the details of the conserved protein domain family. If the "Model" column shows "No matching records found", no model matches the sequence;

Description: the "Description" column shows the description of the model;

E-value: the "E-value" column reflects the statistical significance of the match to this sequence: the lower the E-value, the more significant the hit.
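The column rules above can be summarized in a short sketch. It is an illustration only: the function names, the hit dictionaries, the example threshold value for "tp", and the content of the "Description" and "E-value" columns when no model matches are assumptions, not the server's actual code.

```python
def query_id(header):
    """Text between '>' and the first space of a FASTA header line."""
    text = header[1:] if header.startswith(">") else header
    return text.split(" ", 1)[0].strip()

def result_row(number, header, sequence, probability, tp, hits):
    """Build one result-table row from the SVM probability and hmmscan hits.

    hits: list of dicts with keys "model", "description", "evalue"
    (full-sequence E-value); only consulted when the prediction is "Yes".
    """
    row = {
        "Number": number,
        "Query": query_id(header),
        "Length": len(sequence),
        "Probability": probability,
    }
    if probability < tp:
        row.update({"Yes/No": "No", "Model": "NA",
                    "Description": "NA", "E-value": "NA"})
        return row
    row["Yes/No"] = "Yes"
    if not hits:
        # Assumption: the remaining columns are shown as "NA" in this case.
        row.update({"Model": "No matching records found",
                    "Description": "NA", "E-value": "NA"})
        return row
    best = min(hits, key=lambda h: h["evalue"])  # lowest full-sequence E-value
    row.update({"Model": best["model"],
                "Description": best["description"],
                "E-value": best["evalue"]})
    return row

# Made-up example: two hypothetical models match the query.
demo_hits = [
    {"model": "demo_model_B", "description": "Demo family B", "evalue": 3.4e-05},
    {"model": "demo_model_A", "description": "Demo family A", "evalue": 1.2e-30},
]
row_yes = result_row(1, ">seq1 example protein", "MKTAYIAK", 0.93, 0.5, demo_hits)
row_no = result_row(2, ">seq2", "MKT", 0.12, 0.5, demo_hits)
```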