Tutorial: Support Vector Machines and Kernels for Computational Biology

Asa Ben-Hur, Cheng Soon Ong, Sören Sonnenburg, Bernhard Schölkopf, and Gunnar Rätsch

Preliminaries

Tutorial Datasets

In the tutorial we discuss the use of different kernels for predicting splice sites (some background information can be found here). We used two different data sets derived from the same source dataset: the first consists of two real-valued features related to the GC content before and after the splice site, whereas the second one consists of the sequences themselves. These two datasets are part of the easysvm package and are also available for download here:
  • GC-Content features (ARFF, CSV)
  • Sequences (ARFF, CSV, FASTA)
The following Easysvm examples use the data in ARFF format. Data in FASTA format is used analogously; the only difference is that positive and negative examples have to come in separate files.
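
To illustrate the separate-files convention for FASTA input, here is a minimal sketch in plain Python that reads one positive and one negative FASTA source and builds a list of sequences with +1/-1 labels. The function names read_fasta and load_labeled are hypothetical and are not part of Easysvm; the sketch only shows the idea and does not reproduce how Easysvm parses its input.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def load_labeled(pos_lines, neg_lines):
    """Combine positive and negative FASTA sources into sequences and labels."""
    pos = [seq for _, seq in read_fasta(pos_lines)]
    neg = [seq for _, seq in read_fasta(neg_lines)]
    return pos + neg, [1] * len(pos) + [-1] * len(neg)


if __name__ == "__main__":
    # Toy input; in practice pass open file objects instead of lists.
    positives = [">p1", "ACGT", "AC", ">p2", "GGTT"]
    negatives = [">n1", "TTTT"]
    seqs, labels = load_labeled(positives, negatives)
    print(seqs, labels)
```

Because the functions iterate over lines, they accept an open file object as well as a list of strings.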

Installing Easysvm

Easysvm relies heavily on the Shogun toolbox, which has to be installed first:
> wget http://shogun-toolbox.org/archives/shogun/releases/0.6/sources/shogun-0.6.3.tar.bz2
> tar xjf shogun-0.6.3.tar.bz2
> cd shogun-0.6.3/src
> ./configure --interface=python-modular --prefix=$HOME
> make install
Once Shogun is installed, also download and install Easysvm:
> wget http://www.fml.tuebingen.mpg.de/raetsch/projects/easysvm/easysvm-0.2.tar.gz
> tar xzf easysvm-0.2.tar.gz
> cd easysvm-0.2
> python setup.py install --prefix=$HOME

For all subsequent examples we always assume that the working directory is the base directory of Easysvm (here easysvm-0.2) and that the python path is properly set, e.g.
> export PYTHONPATH=$HOME/lib/python2.5/site-packages

Classification using Real-valued Features

In the first example we reproduce one result given in the paper: classification based on the GC-content features using the linear kernel. We first use 5-fold cross-validation (easysvm.py cv) with C=1 and the linear kernel to obtain unbiased predictions for the 5 parts of the data. In a second step we estimate the generalization error (easysvm.py eval):
> python scripts/easysvm.py cv 5 1 linear arff data/C_elegans_acc_gc.arff lin_gc.out
2 features, 2200 examples
Using 5-fold crossvalidation
> head -4 lin_gc.out
#example	output	split
0	-0.8740213	0
1	-0.9755172	2
2	-0.9060478	1
> python2.5 scripts/easysvm.py eval lin_gc.out arff data/C_elegans_acc_gc.arff lin_gc.perf
> tail -6 lin_gc.perf
Averages
   number of positive examples = 40
   number of negative examples = 400
   Area under ROC curve        = 91.3 %
   Area under PRC curve        = 55.8 %
   accuracy (at threshold 0)   = 90.9 %
The same experiment can be performed using the webservice, where one may either upload the data files or import an existing history where the above analysis has been done with the Gaussian kernel.
To use a different kernel, replace "linear" in the first command with, for instance, "gauss 1" for the Gaussian kernel with width 1 or "poly 3 true true" for the normalized inhomogeneous polynomial kernel of degree 3.
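
As a rough illustration of what the three kernels compute on real-valued feature vectors such as the two GC-content features, here is a minimal sketch in plain Python. It is not the Shogun implementation. In particular, the Gaussian kernel is written as exp(-||x-y||^2 / width), which is an assumption about the width convention (other libraries parameterize it as 2*sigma^2), and the normalization shown for the polynomial kernel, k(x,y)/sqrt(k(x,x) k(y,y)), is likewise an assumption about what "normalized" means here.

```python
import math


def dot(x, y):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    """k(x, y) = <x, y>."""
    return dot(x, y)


def gaussian_kernel(x, y, width=1.0):
    """k(x, y) = exp(-||x - y||^2 / width); the width convention is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / width)


def poly_kernel(x, y, degree=3, inhomogeneous=True, normalized=True):
    """k(x, y) = (<x, y> + c)^degree, with c = 1 if inhomogeneous else 0."""
    c = 1.0 if inhomogeneous else 0.0

    def raw(u, v):
        return (dot(u, v) + c) ** degree

    k = raw(x, y)
    if normalized:
        # Scale so that k(x, x) = 1 for every x (undefined if raw(x, x) is 0).
        k /= math.sqrt(raw(x, x) * raw(y, y))
    return k


if __name__ == "__main__":
    a, b = [0.4, 0.6], [0.5, 0.3]
    print(linear_kernel(a, b), gaussian_kernel(a, b), poly_kernel(a, b))
```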

The number reported in the paper (88.2%) differs from the result obtained above (91.3%). The reason is that random splits of the data were used here, which differ from the ones used in the paper. To reproduce the numbers exactly as in the paper, one may run the following script:

> cd data
> python ../splicesites/tutorial_example.py
> head results.txt
Kernel	Parameters	C	auROC
linear	scale=1.0	C=5.00	88.2%
poly	degree=3	C=10.00	91.4%
poly	degree=5	C=10.00	90.4%
gauss	width=100.0	C=5.00	87.9%
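
The auROC values above, like the "Area under ROC curve" reported by easysvm.py eval, summarize how well the SVM outputs rank positive examples above negative ones. The following minimal sketch computes this quantity, and the accuracy at threshold 0, from a list of outputs and a list of +1/-1 labels. It is written in plain Python for illustration and is not the Easysvm code; the function names are hypothetical, and the per-split averaging that the "Averages" section of the .perf file suggests is not included.

```python
def area_under_roc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y > 0]
    neg = [s for s, y in zip(scores, labels) if y <= 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    # Quadratic in the number of examples, which is fine for a few thousand.
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


def accuracy_at_zero(scores, labels):
    """Fraction of examples whose output sign agrees with the label."""
    hits = sum(1 for s, y in zip(scores, labels) if (s > 0) == (y > 0))
    return hits / float(len(labels))


if __name__ == "__main__":
    outputs = [0.9, 0.8, -0.2, -0.5]
    truth = [1, -1, 1, -1]
    print(area_under_roc(outputs, truth), accuracy_at_zero(outputs, truth))
```

The pairwise formulation equals the area under the ROC curve and makes the interpretation explicit: a value of 0.5 corresponds to random ranking, 1.0 to perfect separation.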

Classification of Sequences

In the second example we work directly with the sequences to be classified. Again, we use 5-fold cross-validation (easysvm.py cv) and then estimate the generalization error (easysvm.py eval), this time using the WD kernel of degree 10 (shift=0) and C=1:
> python scripts/easysvm.py cv 5 1 wd 10 0 arff data/C_elegans_acc_seq.arff wd_seq.out
2 features, 2200 examples
Using 5-fold crossvalidation
> head -4 wd_seq.out
#example        output  split
0       -0.3350671      0
1       -0.4158783      2
2       1.1674859       1
> python2.5 scripts/easysvm.py eval wd_seq.out arff data/C_elegans_acc_seq.arff wd_seq.perf
> tail -6 wd_seq.perf
Averages
   number of positive examples = 40
   number of negative examples = 400
   Area under ROC curve        = 98.8 %
   Area under PRC curve        = 86.9 %
   accuracy (at threshold 0)   = 96.4 % 
The same experiment can be performed using the webservice, where one may either upload the data files or import an existing history where the same analysis has been done.
To use a different kernel, replace "wd 10 0" in the first line for instance with "spec 5" for the Spectrum kernel of degree 5.
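
To make the two sequence kernels concrete, here is a minimal sketch in plain Python. The Spectrum kernel counts k-mers shared by two sequences regardless of position, whereas the WD (weighted degree) kernel with shift 0 counts k-mers of length 1 to d that match at the same position, weighted by beta_k = 2(d-k+1)/(d(d+1)). The code is an illustration only, not the Shogun implementation, and it omits any kernel normalization that Shogun may apply.

```python
from collections import Counter


def spectrum_kernel(x, y, k=5):
    """Inner product of the k-mer count vectors of x and y."""
    count_x = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    count_y = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(n * count_y[kmer] for kmer, n in count_x.items())


def wd_kernel(x, y, degree=10):
    """Weighted degree kernel without shifts for two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("the WD kernel requires sequences of equal length")
    length = len(x)
    total = 0.0
    for k in range(1, degree + 1):
        # Longer matches get smaller weights; the weights sum to 1 over k.
        beta = 2.0 * (degree - k + 1) / (degree * (degree + 1))
        matches = sum(1 for l in range(length - k + 1)
                      if x[l:l + k] == y[l:l + k])
        total += beta * matches
    return total


if __name__ == "__main__":
    s, t = "ACGTACGTAC", "ACGTTCGTAC"
    print(spectrum_kernel(s, t, k=3), wd_kernel(s, t, degree=3))
```

Since the WD kernel compares position by position, it suits signals at a fixed location such as splice sites, while the Spectrum kernel is position-independent.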

Creating New Datasets

We also provide a Python script (datagen.py) for generating artificial datasets, which are often useful for studying the properties of the algorithms. Here are a few examples:
> python scripts/datagen.py motif arff gattaca 10 50 10-15 0.1 tttt 100 50 15 0.1 testmotif1.arff 
> python scripts/datagen.py cloud 100 3 0.6 1.3 testcloud1.arff 
> python scripts/datagen.py motif arff gattaca 100 50 10-15 0.1 tttt 1000 50 15 0.1 testmotif2.arff 
> python scripts/datagen.py cloud 1000 3 0.6 1.3 testcloud2.arff

> python scripts/datagen.py motif fasta gattaca 10 50 10-15 0.1 testmotifpos.fasta
> python scripts/datagen.py motif fasta tttt 100 50 15 0.1 testmotifneg.fasta
> python scripts/datagen.py motif fasta gattaca 100 50 10-15 0.1 tm1.fasta
> python scripts/datagen.py motif fasta tttt 1000 50 15 0.1 tm2.fasta
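
The idea behind the motif datasets can be sketched in a few lines of plain Python: draw random DNA sequences and implant a motif at a random position within a given range, mutating each motif letter with a small probability. This sketch is not datagen.py itself. Reading the arguments "gattaca 10 50 10-15 0.1" as motif, number of sequences, sequence length, position range and mutation rate is an assumption, as are the function names and the lowercase alphabet.

```python
import random

ALPHABET = "acgt"


def motif_sequence(motif, length, pos_range, mutation_rate, rng):
    """Random sequence of the given length with the motif implanted in pos_range."""
    first, last = pos_range
    if first < 0 or last + len(motif) > length:
        raise ValueError("motif does not fit into the sequence at these positions")
    seq = [rng.choice(ALPHABET) for _ in range(length)]
    start = rng.randint(first, last)  # both ends inclusive
    for offset, base in enumerate(motif):
        # With probability mutation_rate, replace the motif letter by a random one.
        if rng.random() < mutation_rate:
            base = rng.choice(ALPHABET)
        seq[start + offset] = base
    return "".join(seq)


def motif_dataset(motif, count, length, pos_range, mutation_rate, seed=0):
    """Generate count sequences; the seed makes the dataset reproducible."""
    rng = random.Random(seed)
    return [motif_sequence(motif, length, pos_range, mutation_rate, rng)
            for _ in range(count)]


if __name__ == "__main__":
    positives = motif_dataset("gattaca", 10, 50, (10, 15), 0.1, seed=1)
    negatives = motif_dataset("tttt", 100, 50, (15, 15), 0.1, seed=2)
    print(positives[0])
    print(negatives[0])
```

Using a fixed seed is a deliberate choice: it lets the same artificial dataset be regenerated when comparing kernels.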

Copyright © 2008 A. Ben-Hur, C.S. Ong, S. Sonnenburg, B. Schölkopf, and G. Rätsch