Automated Sequence Classification on the basis of LOFT attributes




Comparative genomics techniques become considerably more successful as functionally equivalent sequences are classified more specifically. The most common way to obtain a rapid categorization is direct sequence comparison, for instance with BLAST. Under a given scoring scheme, the most similar sequences found in such a comparison are known as ‘best hits’, and when the ‘best hits’ are connected between several species they are grouped into so-called clusters of orthologous groups (COGs). Although general databases provide a COG classification for the proteins encoded by many sequenced genomes, this classification is often ambiguous: the members of a single COG are frequently not truly functionally equivalent (orthologous) but merely similar (homologous). The classification can be improved considerably when it is based on phylogeny rather than on sequence similarity scoring. Recently, a phylogeny-based classification tool, Levels of Orthology From Trees (LOFT), was developed at the CMBI (van der Heijden et al. 2007).



The project has two aims: i) to develop and implement a scheme that provides a unique LOFT classification for the proteins (and their domains) encoded by bacterial genomes, and ii) to devise and implement a method that keeps the classification up to date as new genomes are sequenced.



Ia) The project will initially focus on the proteins encoded by the genomes of the low-GC Gram-positive bacteria. All protein sequences will be extracted from the database maintained at NCBI. All-against-all BLAST comparisons will be performed and the sequences will be distributed into groups on the basis of the resulting scores. The groups of sequences will then be aligned with MUSCLE 3.6 (Edgar 2004). The relation between the cut-off used for grouping and the homogeneity of the resulting alignments (number of gaps, number of conserved positions) will be investigated. Homogeneous alignments will be used to create neighbor-joining (NJ) trees (using e.g. CLUSTAL W 1.83; Thompson et al. 1994), and LOFT will then be used to categorize the sequences. The generated data will be stored in a database.
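The grouping and homogeneity steps of Ia could be prototyped along the following lines. This is only a minimal sketch under stated assumptions: the proposal does not specify the grouping rule, so single-linkage grouping on a score cut-off is assumed here, and the hits are assumed to be available as (query, subject, score) tuples, for instance parsed from BLAST output. The function names `group_by_score` and `alignment_homogeneity`, the example scores and the example alignment are all hypothetical.

```python
from collections import defaultdict


def group_by_score(hits, cutoff):
    """Single-linkage grouping (an assumed rule): two sequences end up in the
    same group when a chain of hits scoring >= cutoff connects them."""
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for query, subject, score in hits:
        root_q, root_s = find(query), find(subject)
        if score >= cutoff and root_q != root_s:
            parent[root_q] = root_s  # merge the two groups

    groups = defaultdict(set)
    for seq_id in list(parent):
        groups[find(seq_id)].add(seq_id)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


def alignment_homogeneity(aligned):
    """Return (gap fraction, fraction of gap-free, fully conserved columns)
    for a list of equally long aligned sequences that use '-' for gaps."""
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("aligned sequences must have equal length")
    total_gaps = sum(row.count("-") for row in aligned)
    conserved = sum(
        1 for col in zip(*aligned) if "-" not in col and len(set(col)) == 1
    )
    return total_gaps / (length * len(aligned)), conserved / length


# Hypothetical hits: (query id, subject id, score)
hits = [
    ("A", "B", 250.0),
    ("B", "C", 180.0),
    ("C", "D", 40.0),
    ("D", "E", 300.0),
]
groups = group_by_score(hits, cutoff=100.0)

# Hypothetical three-sequence alignment
gap_fraction, conserved_fraction = alignment_homogeneity(["MK-LV", "MKALV", "MR-LV"])
```

Repeating the grouping over a range of cut-off values and recording the two homogeneity measures for each resulting alignment would give a first impression of how the cut-off and alignment homogeneity depend on each other.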

Ib) Because many proteins contain several functional domains, and these domains do not necessarily evolve at the same rate, the above classification will also be performed at the level of the individual functional domains. The corresponding domain models will be extracted from Pfam (Bateman et al. 2004), ProDom (Bru et al. 2005) and SMART (Schultz et al. 1998). Possible discrepancies between the domain classifications will be investigated.
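To classify at the domain level, each protein first has to be cut into its domain segments. The sketch below illustrates that step only. It assumes that domain matches are already available as (name, start, end) tuples with 1-based inclusive coordinates. The function name `extract_domain_segments`, the example sequence and the domain names are hypothetical and do not come from Pfam, ProDom or SMART.

```python
def extract_domain_segments(sequence, domain_hits):
    """Cut a protein sequence into per-domain subsequences.

    domain_hits: iterable of (domain_name, start, end) with 1-based,
    inclusive coordinates (an assumed convention).
    Returns (domain_name, start, end, subsequence) tuples ordered by start.
    """
    segments = []
    for name, start, end in sorted(domain_hits, key=lambda hit: hit[1]):
        if not 1 <= start <= end <= len(sequence):
            raise ValueError(f"domain {name} has coordinates outside the sequence")
        # Convert 1-based inclusive coordinates to a Python slice.
        segments.append((name, start, end, sequence[start - 1:end]))
    return segments


# Hypothetical protein and domain matches
protein = "MKTAYIAKQRQISFVKSHFSRQ"
domain_hits = [("domB", 12, 20), ("domA", 2, 9)]
segments = extract_domain_segments(protein, domain_hits)
```

The resulting segments could then be grouped, aligned and classified in the same way as the full-length sequences, which would make the domain-level and protein-level classifications directly comparable.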

II) Methods for expanding or updating the classification relatively quickly and reliably will be explored by adding the genomes of the non-Gram-positive bacteria.
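One simple candidate update strategy, given here purely as an illustration and not as the method the project will adopt, is to place each new sequence in the group of its best-scoring already classified hit when that score passes the cut-off, and to set the remaining sequences aside for a full regrouping. The function name `assign_new_sequences`, the data layout and the example values are assumptions.

```python
def assign_new_sequences(groups, best_hits, cutoff):
    """Add new sequences to existing groups via their best hit.

    groups: dict mapping group id -> set of classified sequence ids.
    best_hits: dict mapping new sequence id -> (classified sequence id, score),
               or None when no hit was found.
    Returns (updated copy of groups, list of new ids left unassigned).
    """
    member_to_group = {
        member: group_id for group_id, members in groups.items() for member in members
    }
    updated = {group_id: set(members) for group_id, members in groups.items()}
    unassigned = []
    for new_id, hit in best_hits.items():
        if hit is not None and hit[1] >= cutoff and hit[0] in member_to_group:
            updated[member_to_group[hit[0]]].add(new_id)
        else:
            unassigned.append(new_id)  # candidates for new groups
    return updated, unassigned


# Hypothetical existing classification and best hits for a new genome
groups = {"G1": {"A", "B"}, "G2": {"D"}}
best_hits = {"X": ("B", 220.0), "Y": ("D", 35.0), "Z": None}
updated, unassigned = assign_new_sequences(groups, best_hits, cutoff=100.0)
```

In such a scheme only the groups that gained members would need to be realigned and passed through tree building and LOFT again, which is where the gain in speed would come from. Whether that remains reliable is exactly what would have to be tested.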


Character of project and required expertise:

Creation of a tool/database. Scripting skills are essential.



Mark de Been and Christof Francke