Project 4. Protein knowledge building through comparative genomics and data integration


Key project members: Wilco Fleuren, Tim Hulsen, Raoul Frijters, Peter Groenen, Wynand Alkema 


Sponsor: Biorange (NBIC) and Schering-Plough


Project Goal

Develop in silico methods that aggregate data on protein function from multiple organisms and multiple sources in order to infer similarities and differences in protein function across species. The focus is on the functional annotation of multiple proteins together, in the context of the specific pathways or processes in which they participate.



It has been estimated that over 90% of candidate drugs do not reach the market due to unforeseen toxic effects or a lack of efficacy in clinical trials. One of the reasons for this high attrition rate is the difficulty of translating data on protein function from animal models such as rat and mouse to a human setting. Detecting critical differences between these model organisms and humans is therefore essential for choosing the correct animal models for testing candidate drugs and for correctly interpreting the data from these experiments in the light of human physiology.



  • Develop methods to quickly retrieve sets of genes that are related to each other, e.g. through participation in the same biological pathway or involvement in the same disease.
  • Develop methods to automatically retrieve annotations for these sets of genes from major annotation databases such as KEGG, Entrez Gene, PubMed, Gene Ontology and Gene Expression Omnibus (GEO).
  • Develop methods to quickly assess the conservation of the genes in the entire network by calculating a network conservation score. We believe that the annotation and conservation of networks as a whole better reflect the presence or absence of orthologous biology between human and model organisms than conservation of gene function studied on a gene-by-gene basis.
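
As an illustration of the last point, a network conservation score could be sketched as follows. The project's actual scoring scheme is not specified here, so this sketch assumes the simplest possible definition: the fraction of genes in a network that have at least one ortholog in a given model organism. The function name, the input format and the placeholder gene names are all hypothetical.

```python
from typing import Dict, Iterable, Set


def network_conservation_score(
    network_genes: Iterable[str],
    orthologs: Dict[str, Set[str]],
    species: str,
) -> float:
    """Fraction of genes in the network with an ortholog in `species`.

    This definition is an assumption made for illustration only; it is
    not necessarily the score used in the project.
    """
    genes = set(network_genes)
    if not genes:
        raise ValueError("network must contain at least one gene")
    # A gene counts as conserved if the species appears in its ortholog set.
    conserved = sum(1 for gene in genes if species in orthologs.get(gene, set()))
    return conserved / len(genes)


# Toy input with placeholder gene names (not real orthology data).
network = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
orthologs = {
    "GENE_A": {"mouse", "rat"},
    "GENE_B": {"mouse", "rat"},
    "GENE_C": {"mouse"},
    "GENE_D": set(),  # no ortholog in either model organism
}

mouse_score = network_conservation_score(network, orthologs, "mouse")
rat_score = network_conservation_score(network, orthologs, "rat")
print(f"mouse: {mouse_score:.2f}, rat: {rat_score:.2f}")
```

Comparing such a score between model organisms for the same network gives a first, coarse indication of which organism is more likely to share the biology of interest with human. A more realistic score would probably weight genes by their role in the network or by sequence similarity.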


Results to date

We have developed a method to automatically construct biological networks starting from drugs, diseases or genes of interest. In this context a pathway is a very broad concept, which may refer to any group of genes that have a physiological connection. To build these networks we use the CoPub system, defining biological networks as networks of genes that are strongly linked to each other in the literature.
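
The idea of a literature-based gene network can be sketched in a few lines. This is not the CoPub scoring method, which is not described here; the sketch simply assumes that two genes are linked when they are mentioned together in at least a minimum number of abstracts. The function name, the threshold and the placeholder gene names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def build_cooccurrence_network(
    abstract_gene_sets: Iterable[Set[str]],
    min_shared_abstracts: int = 2,
) -> Dict[Tuple[str, str], int]:
    """Return gene pairs that co-occur in at least `min_shared_abstracts` abstracts.

    Raw co-occurrence counting is a simplification chosen for illustration;
    a real system would normally correct for how often each gene is
    mentioned overall.
    """
    pair_counts: Counter = Counter()
    for genes in abstract_gene_sets:
        # Sort so that each pair is always stored in the same order.
        for gene_a, gene_b in combinations(sorted(set(genes)), 2):
            pair_counts[(gene_a, gene_b)] += 1
    return {
        pair: count
        for pair, count in pair_counts.items()
        if count >= min_shared_abstracts
    }


# Toy input: each set holds the genes mentioned in one abstract (placeholders).
abstracts = [
    {"GENE_A", "GENE_B"},
    {"GENE_A", "GENE_B", "GENE_C"},
    {"GENE_B", "GENE_C"},
    {"GENE_D"},
]

edges = build_cooccurrence_network(abstracts)
for (gene_a, gene_b), count in sorted(edges.items()):
    print(f"{gene_a} - {gene_b}: {count} shared abstracts")
```

In this toy example GENE_A and GENE_C share only one abstract, so that pair falls below the threshold and is left out of the network.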

In addition, we have developed methods to automatically map and visualize annotation data, gene expression data and phylogenetic data on these gene networks. The phylogenetic data were obtained from the Ensembl orthology pipeline using the software package PhyloPat, developed in the CDD group.

The above methods were successfully applied in a case study in which we searched for putative biomarkers for rheumatoid arthritis treatment. By combining literature mining and gene expression mapping we were able to identify putative biomarkers with orthologs in mouse and rat. One of the predictions was tested in clinical samples in a proteomics experiment. The data showed that the selected protein biomarker was significantly regulated in sera of rheumatoid arthritis patients.

We are currently applying our methods to selected pathways of the immune system and to genes that are targeted by various anti-inflammatory drugs.



2007 J.D. Holbrook, P. Sanseau. Drug discovery and computational evolutionary analysis. Drug Discovery Today 12 (19-20): 826-832.

2003 B. Searls. Pharmacophylogenomics: genes, evolution and drug targets. Nature Reviews Drug Discovery 2 (8): 613-623.

2006 T. Hulsen, J. de Vlieg, P.M.A. Groenen. PhyloPat: phylogenetic pattern analysis of eukaryotic genes. BMC Bioinformatics 7: 398.

2009 T. Hulsen, P.M.A. Groenen, J. de Vlieg, W. Alkema. PhyloPat: an updated version of the phylogenetic pattern database contains gene neighborhood. Nucleic Acids Res. (Database issue).