Tutorial Datasets

In the tutorial we discuss the use of different kernels for predicting splice sites (some background information can be found here). We used two different data sets derived from the same source dataset: the first consists of two real-valued features related to the GC content before and after the splice site, whereas the second consists of the sequences themselves. Both datasets are part of the easysvm package and are also available for download here:
Installing Easysvm

Easysvm heavily relies on the Shogun toolbox, which has to be installed before using Easysvm:
> wget http://shogun-toolbox.org/archives/shogun/releases/0.6/sources/shogun-0.6.3.tar.bz2
> tar xjf shogun-0.6.3.tar.bz2
> cd shogun-0.6.3/src
> ./configure --interface=python-modular --prefix=$HOME
> make install

Once Shogun is installed, also download and install Easysvm:
> wget http://www.raetschlab.org/projects/easysvm/easysvm-0.2.tar.gz
> tar xzf easysvm-0.2.tar.gz
> cd easysvm-0.2
> python setup.py install --prefix=$HOME

For all subsequent examples we assume that the working directory is the base directory of Easysvm (here easysvm-0.2) and that the Python path is set properly, e.g.
> export PYTHONPATH=$HOME/lib/python2.5/site-packages
Classification using Real-valued Features

In the first example we would like to reproduce one result given in the paper: classification based on the GC-content features using the linear kernel. For this we use 5-fold cross-validation (easysvm.py cv) to obtain unbiased predictions for the 5 parts of the data (with C=1 and the linear kernel). In a second step we estimate the generalization error (command easysvm.py eval):
> python scripts/easysvm.py cv 5 1 linear arff data/C_elegans_acc_gc.arff lin_gc.out
2 features, 2200 examples
Using 5-fold crossvalidation
> head -4 lin_gc.out
#example output split
0 -0.8740213 0
1 -0.9755172 2
2 -0.9060478 1
> python2.5 scripts/easysvm.py eval lin_gc.out arff data/C_elegans_acc_gc.arff lin_gc.perf
> tail -6 lin_gc.perf
Averages
number of positive examples = 40
number of negative examples = 400
Area under ROC curve = 91.3 %
Area under PRC curve = 55.8 %
accuracy (at threshold 0) = 90.9 %

The same experiment can be performed using the webservice, where one may either upload the data files or import an existing history in which the above analysis has been done with the Gaussian kernel.
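The idea behind the cross-validation step can be sketched in a few lines of Python. This is a hypothetical illustration, not easysvm's code: a toy threshold classifier on a single feature stands in for the SVM, and all names are made up for the example. The point is that every example is scored exactly once, by a model that never saw it during training, which is what makes the resulting predictions unbiased.

```python
# Hypothetical sketch of k-fold cross-validation as performed by
# "easysvm.py cv" (toy classifier, not easysvm's actual code).
import random

def kfold_indices(n, k, seed=0):
    """Shuffle example indices and partition them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_predict(xs, ys, k=5):
    """Train on k-1 folds, predict the held-out fold; every example
    is scored exactly once by a model that never saw it."""
    folds = kfold_indices(len(xs), k)
    preds = {}
    for split, test_idx in enumerate(folds):
        train_idx = [i for f in folds if f is not test_idx for i in f]
        # Toy "training": midpoint between the class means of the feature.
        pos = [xs[i] for i in train_idx if ys[i] == 1]
        neg = [xs[i] for i in train_idx if ys[i] == -1]
        threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        for i in test_idx:
            # Store (output, split), mirroring the columns of lin_gc.out.
            preds[i] = (xs[i] - threshold, split)
    return preds

# Toy data: one well-separated real-valued feature per example.
xs = [0.1, 0.2, 0.3, 0.4, 0.9, 1.0, 1.1, 1.2, 0.15, 1.05]
ys = [-1, -1, -1, -1, 1, 1, 1, 1, -1, 1]
preds = cross_val_predict(xs, ys, k=5)
print(len(preds))  # -> 10, one held-out prediction per example
```

The second column of lin_gc.out corresponds to the held-out output, the third to the fold on which the example was tested.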
To use a different kernel, replace "linear" in the first command with, for instance, "gauss 1" for the Gaussian kernel with width 1, or "poly true true" for the normalized inhomogeneous polynomial kernel of degree 3.
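For reference, the kernels named above can be written down in a few lines. This is a hedged sketch of the standard textbook formulas, not Shogun's implementation; the function names and the cosine-style normalization shown for the polynomial kernel are assumptions for illustration.

```python
# Sketch of the standard kernel formulas (not Shogun's code):
# linear, Gaussian with width w, and the (in)homogeneous polynomial
# kernel of degree d, with one common normalization.
import math

def linear(x, y):
    return sum(a * b for a, b in zip(x, y))

def gauss(x, y, width=1.0):
    """Gaussian kernel: exp(-||x - y||^2 / width)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / width)

def poly(x, y, degree=3, inhomogeneous=True):
    """(x.y + c)^d with c=1 (inhomogeneous) or c=0 (homogeneous)."""
    return (linear(x, y) + (1.0 if inhomogeneous else 0.0)) ** degree

def normalized(k, x, y, **kw):
    """Cosine normalization k(x,y) / sqrt(k(x,x) k(y,y))."""
    return k(x, y, **kw) / math.sqrt(k(x, x, **kw) * k(y, y, **kw))

x, y = [0.4, 0.6], [0.5, 0.5]
print(round(gauss(x, y, width=1.0), 4))            # -> 0.9802
print(normalized(poly, x, x, degree=3))            # -> 1.0 on identical inputs
```

A normalized kernel always returns 1.0 for identical inputs, which keeps examples of different norms comparable.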
The number reported in the paper (88.2%) differs slightly from the result obtained here. The reason is that random splits of the data were used, which differ from the splits used in the paper. To reproduce the numbers from the paper exactly, one may run the following script:
> cd data
> python ../splicesites/tutorial_example.py
> head results.txt
Kernel  Parameters   C        auROC
linear  scale=1.0    C=5.00   88.2%
poly    degree=3     C=10.00  91.4%
poly    degree=5     C=10.00  90.4%
gauss   width=100.0  C=5.00   87.9%
Classification of Sequences

In the second example we work directly with the sequences to be classified. Again, we use 5-fold cross-validation (easysvm.py cv) and in a second step estimate the generalization error (command easysvm.py eval), where we use the WD kernel of degree 10 (shift=0) and C=1:
> python scripts/easysvm.py cv 5 1 wd 10 0 arff data/C_elegans_acc_seq.arff wd_seq.out
2 features, 2200 examples
Using 5-fold crossvalidation
> head -4 wd_seq.out
#example output split
0 -0.3350671 0
1 -0.4158783 2
2 1.1674859 1
> python2.5 scripts/easysvm.py eval wd_seq.out arff data/C_elegans_acc_seq.arff wd_seq.perf
> tail -6 wd_seq.perf
Averages
number of positive examples = 40
number of negative examples = 400
Area under ROC curve = 98.8 %
Area under PRC curve = 86.9 %
accuracy (at threshold 0) = 96.4 %

The same experiment can be performed using the webservice, where one may either upload the data files or import an existing history in which the same analysis has been done.
To use a different kernel, replace "wd 10 0" in the first command with, for instance, "spec 5" for the Spectrum kernel of degree 5.
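To make the string kernels concrete, here is a small sketch of both. The spectrum kernel of degree k counts all k-mers in each sequence and takes the dot product of the count vectors; the WD kernel additionally rewards position-dependent matches of substrings of length 1 up to the degree. This is an illustrative reimplementation of the textbook formulas, with the usual WD weights beta_k = 2(d-k+1)/(d(d+1)) assumed, not Shogun's actual code.

```python
# Sketch of the spectrum and weighted degree (WD) string kernels
# (illustrative only, not Shogun's implementation).
from collections import Counter

def kmer_counts(seq, k):
    """Count all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(s1, s2, k=5):
    """Dot product of the two sequences' k-mer count vectors."""
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    return sum(c1[m] * c2[m] for m in c1)

def wd_kernel(s1, s2, degree=10):
    """Position-wise matching substrings of length 1..degree,
    weighted by beta_k = 2(degree - k + 1) / (degree * (degree + 1))."""
    length = min(len(s1), len(s2))
    total = 0.0
    for k in range(1, degree + 1):
        beta = 2.0 * (degree - k + 1) / (degree * (degree + 1))
        total += beta * sum(s1[i:i + k] == s2[i:i + k]
                            for i in range(length - k + 1))
    return total

print(spectrum_kernel("GATTACA", "GATTACA", k=3))  # -> 5 (five distinct 3-mers, each once)
print(spectrum_kernel("GATTACA", "CCCCCCC", k=3))  # -> 0 (no shared 3-mers)
```

Unlike the spectrum kernel, the WD kernel only counts a substring match when it occurs at the same position in both sequences, which is why it suits signals with fixed positional structure such as splice sites.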
Creating New Datasets

We also provide a python script (datagen.py) that allows one to generate artificial datasets, which are often very useful for studying the properties of the algorithms. Here are a few examples:
> python scripts/datagen.py motif arff gattaca 10 50 10-15 0.1 tttt 100 50 15 0.1 testmotif1.arff
> python scripts/datagen.py cloud 100 3 0.6 1.3 testcloud1.arff
> python scripts/datagen.py motif arff gattaca 100 50 10-15 0.1 tttt 1000 50 15 0.1 testmotif2.arff
> python scripts/datagen.py cloud 1000 3 0.6 1.3 testcloud2.arff
> python scripts/datagen.py motif fasta gattaca 10 50 10-15 0.1 testmotifpos.fasta
> python scripts/datagen.py motif fasta tttt 100 50 15 0.1 testmotifneg.fasta
> python scripts/datagen.py motif fasta gattaca 100 50 10-15 0.1 tm1.fasta
> python scripts/datagen.py motif fasta tttt 1000 50 15 0.1 tm2.fasta
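Conceptually, the "motif" mode plants a (possibly mutated) motif at a random position inside otherwise random sequences. The following is a hedged sketch of that idea; the helper names and the exact mutation scheme are assumptions for illustration and do not reproduce datagen.py exactly.

```python
# Hypothetical sketch of what "datagen.py motif" does conceptually:
# positives carry a motif planted at a random position within a range
# (each motif letter mutated with some probability), negatives are
# random DNA. Not datagen.py's actual code.
import random

rng = random.Random(42)
ALPHABET = "ACGT"

def random_seq(length):
    return "".join(rng.choice(ALPHABET) for _ in range(length))

def plant_motif(length, motif, pos_range, mutation_rate):
    """Embed a (possibly mutated) motif at a random start position."""
    seq = list(random_seq(length))
    start = rng.randint(*pos_range)
    for off, letter in enumerate(motif):
        if rng.random() < mutation_rate:
            letter = rng.choice(ALPHABET)  # mutate this motif letter
        seq[start + off] = letter
    return "".join(seq)

# Analogous to "gattaca 10 50 10-15 0.1 tttt 100 50 15 0.1":
# 10 positives of length 50 with GATTACA at positions 10-15,
# 100 random negatives of length 50.
positives = [plant_motif(50, "GATTACA", (10, 15), 0.1) for _ in range(10)]
negatives = [random_seq(50) for _ in range(100)]
print(sum("GATTACA" in s for s in positives), "of", len(positives),
      "positives contain the exact motif")
```

Because of the mutation rate, not every positive contains the motif verbatim, which makes the resulting classification task non-trivial.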