The industrial-scale approach often taken to generating biological data produces large data sets whose analysis requires HPC. This talk will give an overview of an analysis of gene splicing that we have performed using publicly available gene and transcript data.
The genes of complex organisms (eukaryotes) contain stretches of sequence, called introns, that are spliced out of the RNA transcript of a gene before the RNA is translated to give the protein product. In excess of 90% of human gene sequence is intronic. Such a mosaic gene structure allows more than one gene product to be generated from a single gene, through the adoption of alternative patterns of gene splicing. A detailed understanding of this phenomenon has become more important given the smaller-than-expected number of genes identified in the draft human genome.
Large data sets of gene and transcript sequences are available, and HPC allows the sequences in these data sets to be matched and aligned with spliced alignment. By constructing and analysing large numbers of these alignments, we can build up a picture of how the cell utilises alternative splicing and which regulatory factors determine the patterns of splicing.