Throughout this tutorial we use the example of splice site prediction for illustration. It is a problem arising in computational gene finding and concerns the recognition of splice sites, which mark the boundaries between exons and introns in eukaryotes. Introns are spliced out of precursor mRNAs (pre-mRNAs) after transcription.
The vast majority of splice sites are characterized by the presence of specific dimers on the intronic side of the splice site: GT for donor and AG for acceptor sites. Yet only about 0.1-1% of all GT and AG occurrences in the genome represent true splice sites. Detecting the presence of sequence motifs, such as the one found at acceptor sites in C. elegans, can reduce the number of false positives. Further improvement can be obtained using SVMs that integrate a variety of sequence features before and after the splice site. The problem of recognizing acceptor splice sites allows us to illustrate different properties of SVMs using different kernels (similar results can be obtained for donor splice sites).
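To make the notion of a candidate acceptor site concrete, the following minimal sketch lists every AG occurrence in a sequence and reports the position immediately downstream of the dimer, i.e. where the exon would start if the site were real. The function name `candidate_acceptor_sites` and the 0-based indexing are assumptions made for this illustration; they are not part of the easysvm package.

```python
def candidate_acceptor_sites(seq):
    """Return the 0-based index just after every AG dimer in seq.

    Hypothetical helper for illustration: each returned index marks a
    potential intron/exon boundary (AG lies on the intronic side).
    """
    seq = seq.upper()
    sites = []
    start = seq.find("AG")
    while start != -1:
        sites.append(start + 2)            # boundary follows the dimer
        start = seq.find("AG", start + 1)  # continue the scan
    return sites


if __name__ == "__main__":
    print(candidate_acceptor_sites("TTAGGCAGAGT"))
```

Most positions returned by such a scan are decoys, which is why a classifier is needed to separate them from true sites.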
In the first part of the tutorial we use real-valued features describing the sequence surrounding the splice site. For illustration purposes we use only two features: the GC content of the exon and of the intron flanking a potential acceptor site. These features are motivated by the fact that the GC content of exons is typically higher than that of introns. In the second part we show how to take advantage of the flanking pre-mRNA sequence itself, which leads to considerable performance improvements.
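The two GC-content features can be computed along the following lines. This is a sketch under stated assumptions: the window length of 50 nucleotides and the helper names `gc_content` and `gc_features` are hypothetical choices for illustration, and the tutorial's datasets may have been derived with different settings.

```python
def gc_content(seq):
    """Fraction of G and C nucleotides in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_features(seq, site, window=50):
    """Return (exon GC content, intron GC content) around an acceptor site.

    `site` is the 0-based index of the first exonic nucleotide, so the
    intron lies upstream of it and the exon downstream. `window` is an
    assumed flank length, not a value taken from the tutorial.
    """
    intron = seq[max(0, site - window):site]
    exon = seq[site:site + window]
    return gc_content(exon), gc_content(intron)


if __name__ == "__main__":
    print(gc_features("ATATAGGCGC", site=6, window=4))
```

Each candidate site is thus mapped to a point in a two-dimensional feature space, which makes it convenient to visualize the behaviour of an SVM.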
The data used in the numerical examples in the tutorial was generated by taking a random subset of 200 true splice sites and 2000 decoy sites from the first 100000 entries in the C. elegans acceptor splice site dataset available here. The original and derived datasets are part of the easysvm package and can also be downloaded here: