
Ebook: Essays in Bioinformatics

The rapidly growing volume of genomic data has had a profound impact on how bioinformatics is taught to biologists. In earlier days it was customary to lead students gradually through all the concepts and tools; more recently this approach has become less practicable because of the fast development of the field. The main body of this book consists of chapters summarizing the fundamental concepts of bioinformatics, based on the topics presented at a course held in Dubrovnik, Croatia, in 2003. The second part of the book contains application papers submitted by the students after the course.
When, as President of the Committee on International Co-operation of the Croatian Academy of Sciences and Arts, I visited the Royal Society and the British Academy in July 2000, I had in mind that the co-operation with the Royal Society be expanded as much as possible and, for the first time, that preparations be made for signing an Agreement on Co-operation with the British Academy. As usual, I could not help visiting Birkbeck College of the University of London, so well known to me since the time Professor J.D. Bernal was there. I owe that visit mostly to Professor Alan Mackay, FRS, to whom I am tied by many years of friendship. It was on that occasion, in a conversation with David Moss, Professor of Biomolecular Structures, and his co-worker Dr. Clare Sansom, that the idea was conceived to organize a postgraduate course in bioinformatics, a newly emerging interdisciplinary research area at the interface between the biological and computational sciences, primarily aimed at research students from Central and Eastern Europe.
During the visit to the Royal Society, Alan and I met Professor Brian Heap, Vice-President and Foreign Secretary of the Royal Society at that time, and his collaborators. Professor Brian Heap supported our efforts on the condition that the Royal Society and the Croatian Academy of Sciences and Arts acted as initiators, while Birkbeck College in London and the Faculty of Science in Zagreb took over the organization. However, this was not the only activity that was agreed upon.
In view of the Agreement on Co-operation concluded between the Royal Society and the Croatian Academy of Sciences and Arts, Dr. Clare Sansom visited Zagreb several times, as well as the International University Centre (IUC) in Dubrovnik, where the course was to be held. The realization of the course would hardly have been thinkable without her persistence and wish for success. Professor Sibila Jelaska, Department of Molecular Biology, Faculty of Science, Zagreb, and Professor David S. Moss, School of Crystallography, Birkbeck College, supervised the course as its co-directors. It is a special pleasure for me that Dr. Kristian Vlahovicek, a former research student of mine, also greatly contributed to the organization of the course.
The course aroused far more interest among young researchers than had been expected, so the number of participants had to be limited for practical reasons (lack of room and, above all, lack of computers at the IUC). Eight lecturers from five countries and 23 students from some ten countries took part in the course. The success was surprising: the students enjoyed the course and learnt a lot, and in the end rated it with an average score on the Good/Excellent boundary.
Last but not least, the organization of the course was facilitated by the financial support of NATO within the NATO Science Programme. The course was also sponsored by the Faculty of Science of the University of Zagreb and by PLIVA, Zagreb, the largest Croatian pharmaceutical company. For Croatian participants, generous financial support was obtained from the Ministry of Science and Technology of the Republic of Croatia. Gratitude is due to the International University Centre, to the organizers of the course, and to all lecturers and participants.
All students would like such advanced courses to be continued in future. Let us act according to their wishes.
Professor Emeritus Boris Kamenar, Zagreb, July 2004
The advent of modern bioinformatics is the result of a long succession of scientific discoveries and paradigm changes in chemistry and biology. This chapter provides an introduction to the pertinent events in these diverse fields.
The key problem of bioinformatics is the prediction of properties, such as structure or function, based on similarity. This chapter reviews the concepts and tools of similarity analysis used in various fields of bioinformatics.
The analysis of similarity is a fundamental task in comparing sequences, three-dimensional structures, genomes and molecular networks. This chapter reviews the common principles underlying these diverse applications.
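As a small illustration of what a similarity measure can look like for sequences, the following Python sketch computes two simple scores. It is not taken from the chapter: the function names `percent_identity` and `kmer_jaccard`, the gap symbol `-` and the default k-mer length are assumptions made for this example only.

```python
def percent_identity(a: str, b: str) -> float:
    """Fraction of identical residues between two pre-aligned sequences.

    Assumes both strings have the same length and use '-' for gaps.
    Columns where both sequences have a gap are ignored.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return matches / len(columns)


def kmer_jaccard(a: str, b: str, k: int = 3) -> float:
    """Alignment-free similarity: Jaccard index of the two k-mer sets."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = kmers_a | kmers_b
    if not union:
        return 0.0
    return len(kmers_a & kmers_b) / len(union)


# Toy sequences, invented for the example.
print(percent_identity("ACGT", "ACGA"))
print(kmer_jaccard("ACGT", "ACGA"))
```

The first score needs an alignment, while the second works on unaligned sequences; the same idea of scoring shared features carries over to structures, genomes and networks once suitable features are chosen.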
The GenBank sequence database is an annotated collection of all publicly available nucleotide sequences and their protein translations. This database is produced at the National Center for Biotechnology Information (NCBI) as part of an international collaboration with the European Molecular Biology Laboratory (EMBL) Data Library from the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ). GenBank and its collaborators receive sequences produced in laboratories throughout the world from more than 115,000 distinct organisms. GenBank continues to grow at an exponential rate, doubling every 10 months. Release 142, produced in June 2004, contained over 40.3 billion nucleotide bases in more than 35.5 million sequences. GenBank is built by direct submissions from individual laboratories, as well as from bulk submissions from large-scale sequencing centers. Direct submissions are made to GenBank using BankIt [http://www.ncbi.nlm.nih.gov/BankIt/], which is a Web-based form, or the stand-alone submission program, Sequin [http://www.ncbi.nlm.nih.gov/Sequin/index.html].
We describe some of the aspects of Swiss-Prot that make it unique, explain the developments we believe to be necessary for the database to continue to play its role as a focal point of protein knowledge, and provide advice pertinent to the development of high-quality knowledge resources on any aspect of the life sciences.
EMBOSS evolved from EGCG, a collection of programs written to extend the GCG package, originally written by the Genetics Computer Group of the University of Wisconsin. EMBOSS follows the general structure of GCG and sets out to reproduce and extend the functionality of GCG in an open source package. Currently, EMBOSS only runs on UNIX computers. The programs of EMBOSS can be run from the UNIX command line or from behind a number of Graphical User Interfaces (GUIs). EMBOSS offers a wide range of programs covering most aspects of sequence analysis. In addition, a number of well-established public domain programs have been engineered to follow the conventions of EMBOSS and then incorporated into the package. Software developers from many places across the world have written programs for the EMBOSS package. Such contributions are encouraged from the user community and training is offered to aspiring contributors.
Visualisation of local DNA conformation is a useful tool in interpreting and designing experiments at the molecular level. There are a number of methods whereby local curvature as well as other conformational parameters can be predicted. Calculation of these parameters on a genomic scale may help to clarify the role of these elements in genomic architecture.
The description of protein structure is based on a hierarchy of concepts, from the peptide bond to secondary structures, motifs and folds. The classification of protein structures is usually achieved by segregating mainly-alpha, mainly-beta, and mixed (alpha/beta and alpha+beta) structures. This chapter gives an overview of structural concepts as well as examples of how these are implemented in databases such as CATH, SCOP and FSSP.
The resources provided by NCBI for studying the three-dimensional (3D) structures of proteins center around two databases: the Molecular Modeling Database (MMDB), which provides structural information about individual proteins; and the Conserved Domain Database (CDD), which provides a directory of sequence and structure alignments representing conserved functional domains within proteins (CDs). Together, these two databases allow scientists to retrieve and view structures, find structurally similar proteins to a protein of interest, and identify conserved functional sites. To enable scientists to accomplish these tasks, NCBI has integrated MMDB and CDD into the Entrez retrieval system. In addition, structures can be found by BLAST, because sequences derived from MMDB structures have been included in the BLAST databases. Once a protein structure has been identified, the domains within the protein, as well as domain “neighbors” (i.e., those with similar structure) can be found. For novel data not yet included in Entrez, there are separate search services available. Protein structures can be visualized using Cn3D, an interactive 3D graphic modeling tool. Details of the structure, such as ligand-binding sites, can be scrutinized and highlighted. Cn3D can also display multiple sequence alignments based on sequence and/or structural similarity among related sequences, 3D domains, or members of a CDD family. Cn3D images and alignments can be manipulated easily and exported to other applications for presentation or further analysis.
Protein secondary structure prediction is believed to improve when different predictions are combined into a consensus secondary structure prediction. Ten different protein secondary structure prediction programs were compared and given weights by a feed-forward neural network. A dataset of approximately 6000 proteins was taken from the DSSP database and was used to train the neural network. The resulting weights indicate that the secondary structure prediction programs PHD and Predator performed better than the other methods. However, training of the neural network with a smaller but more stringently selected dataset did not support these results for the Predator program. The performance of the program PHD remained the same when the smaller dataset was used to train the neural network.
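The idea of a weighted consensus can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the neural network used in the chapter: the weights are fixed by hand rather than learned, the per-residue states are assumed to be the letters H (helix), E (strand) and C (coil), and the function name, the method label "Other" and all numbers are hypothetical.

```python
from __future__ import annotations

from collections import defaultdict


def consensus_prediction(predictions: dict[str, str],
                         weights: dict[str, float]) -> str:
    """Combine per-residue predictions by weighted voting.

    predictions maps a method name to its predicted state string;
    weights maps a method name to its (hypothetical) weight.
    Methods without an explicit weight count as 1.0.
    Ties are broken alphabetically by state letter.
    """
    lengths = {len(p) for p in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("need at least one prediction, all of equal length")
    n_residues = lengths.pop()

    consensus = []
    for i in range(n_residues):
        scores: dict[str, float] = defaultdict(float)
        for method, states in predictions.items():
            scores[states[i]] += weights.get(method, 1.0)
        # sorted() fixes the iteration order so ties resolve deterministically
        consensus.append(max(sorted(scores), key=scores.get))
    return "".join(consensus)


# Invented toy predictions and weights, for illustration only.
toy_predictions = {"PHD": "HHHCC", "Predator": "HHECC", "Other": "EEECC"}
toy_weights = {"PHD": 2.0, "Predator": 1.5, "Other": 1.0}
print(consensus_prediction(toy_predictions, toy_weights))
```

In the study summarized above the weights come out of network training; here they are simply inputs, which makes it easy to see how a heavily weighted method can be outvoted when the others agree.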
In this chapter, bioinformatics techniques are used to gain some insights into the structure and function of a largely uncharacterised protein family called SAND. From a phylogenomics analysis, we determine that SAND is a eukaryotic gene and show that a duplication event gave rise to two SAND genes in vertebrates. SAND was found to be absent from archaea and bacteria. From a phylogenetic analysis, we characterise a number of subfamilies. With the use of multiple sequence alignments, we highlight amino acids and sequence motifs conserved in SAND proteins, plus those invariant in subfamilies or taxonomical groups. In addition, we predict a secondary structure and solvent accessibility profile and carry out protein fold predictions for the SAND proteins.
Bioinformatics is a general approach underlying current paradigms in the pharmaceutical, agricultural and bio-industrial sectors. The parallel development of genomics, proteomics and informatics has resulted in a number of complex approaches and brought about profound changes within the R & D philosophy of the affected sectors. This chapter aims to provide an overview of how the scientific approach has changed in these three areas.
The β-spectrin family of proteins was analysed for amino acid replacements at aligned positions. The homologous and non-homologous positions were subjected to an analysis of the interrelations among the occurring residues and of the mechanism of variability, using the algorithm of genetic semihomology [6]. In total, 67 β-spectrin sequences were collected and 55 of them were subjected to a comparative analysis. After in-depth study of the global multiple alignment, a consensus sequence was constructed. It served as the basis for a detailed analysis of the genetic relations among all the amino acid residues occurring at the same positions of homologous sequences. This examination gives a detailed picture of the relations among the members of the β-spectrin family and makes it possible to follow the evolutionary paths along which the protein family arose, which provides the basis for further analytical studies of the β-spectrin family.
The major goal of bioinformatics is the analysis of sequence, structure and function relationships. In these studies, lab experiments and computational work must validate and consolidate each other, and the findings of each expedite the improvement of the other. This process requires experts who can work both at the lab bench and with computer applications. This chapter summarises a computer scientist's views on the diverse fields of bioinformatics.
The cytochrome c nitrite reductase (ccNir) isolated from the sulphate-reducing bacterium Desulfovibrio desulfuricans ATCC 27774 is a hetero-oligomeric complex composed of two subunits (61 kDa and 19 kDa), encoded by the genes nrfA and nrfH, respectively. We report the use of bioinformatic predictive models to assess the most relevant topological characteristics of ccNir, namely signal peptides and signal anchors. We used SignalP V2.0 (SignalP-HMM and SignalP-NN) in combination with TMHMM 2.0 to predict the presence and location of signal peptide cleavage sites, to discriminate between cleavable signal peptides and N-terminal transmembrane anchor segments, and to predict transmembrane helices.
Oxidative folding combines the formation of native disulfide bonds with the conformational folding that results in the native three-dimensional fold. Oxidative folding pathways can be described in terms of disulfide intermediate species (DIS) containing a varying number of disulfide bonds and free cysteine residues, which, unlike the majority of protein folding states, can also be isolated and experimentally studied. Each DIS corresponds to a family of folding states (conformations) that the given DIS can adopt in three dimensions. The oxidative folding space can be represented as a network of DIS states interconnected by disulfide interchange reactions that can either create/abolish or rearrange disulfide bridges. Such networks can be used to visualize folding pathways in terms of the experimentally observed intermediates. In a number of experimentally studied cases, the observed intermediates appear as part of contiguous oxidative folding pathways.
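A network of this kind can be sketched combinatorially. The Python example below is a minimal illustration under simplifying assumptions, not the chapter's own method: a DIS is modelled purely as a set of disulfide-bonded cysteine pairs, every pairing is treated as chemically possible, and the function names are hypothetical. Two states are linked by a create/abolish step when they differ by exactly one bond, and by a rearrangement step when one bond is swapped for another that shares a cysteine with it.

```python
from itertools import combinations


def disulfide_states(cysteines):
    """Yield every DIS as a frozenset of bonded pairs (i, j), i < j.

    cysteines is a tuple of cysteine labels in ascending order.
    The empty frozenset is the fully reduced species.
    """
    if not cysteines:
        yield frozenset()
        return
    first, rest = cysteines[0], cysteines[1:]
    # Case 1: the first cysteine stays free.
    yield from disulfide_states(rest)
    # Case 2: the first cysteine is bonded to one of the later ones.
    for idx, partner in enumerate(rest):
        remaining = rest[:idx] + rest[idx + 1:]
        for state in disulfide_states(remaining):
            yield state | {(first, partner)}


def folding_network(n_cysteines):
    """Return (states, redox_edges, rearrangement_edges)."""
    states = list(disulfide_states(tuple(range(1, n_cysteines + 1))))
    redox_edges, rearrangement_edges = [], []
    for s, t in combinations(states, 2):
        diff = s ^ t  # bonds present in exactly one of the two states
        if len(diff) == 1:
            redox_edges.append((s, t))  # one bond created or abolished
        elif len(diff) == 2 and len(s) == len(t):
            (a, b), (c, d) = diff
            if {a, b} & {c, d}:  # the swapped bonds share one cysteine
                rearrangement_edges.append((s, t))
    return states, redox_edges, rearrangement_edges


states, redox, rearr = folding_network(4)
print(len(states), len(redox), len(rearr))
```

Under these assumptions a protein with six cysteines has 76 such species, counting the fully reduced one, so the network grows quickly with the number of cysteines; experimentally observed intermediates would then be marked as a subset of the nodes.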
Genetic profiling using microsatellite markers provides a highly efficient method for characterizing and identifying grape varieties. This work describes the use of genetic markers, including simple sequence repeat markers, to discover the genetic relatedness of the American cultivar Zinfandel and autochthonous Croatian grape varieties (Vitis vinifera L.).
A cDNA encoding pectinmethylesterase (PME), an enzyme involved in cell wall softening of papaya fruit, was isolated. The structure of this cDNA and its expression during development and ripening of the fruit were analysed. Northern blotting was used to determine the expression of pectinmethylesterase genes during fruit development and ripening. PME is differentially expressed in the inner and outer mesocarp. The levels of PME activity increase gradually with maturation until day 7 of ripening. During ripening, pectinmethylesterase activity increases differentially from the outer mesocarp to the inner mesocarp. These values are similar in fruits ripened for 7 days, which corresponds to 70% ripening. After that ripening stage there are no significant differences between PME in the inner and outer mesocarp, and PME activity is reduced by about 10%. The phylogram generated from an alignment of the deduced amino acid sequence of PME and of 10 PME homologues from other plant species revealed that pectinmethylesterase from papaya fruit is more similar to tomato PME sequences than to the other PME sequences available. The amount of total RNA in the mature ripe fruit was double that in the green fruit. All the cDNAs were expressed at similar levels in the inner and outer mesocarp tissues during the different stages of fruit ripening. However, expression was highest at ripening stages 1, 3, 5 and 7, decreasing thereafter to lower levels. These results show that the increase in mRNA translation parallels the increase in PME activity until day 7 of ripening.
This work aimed to study some of the processes involved in organogenic nodule formation in Humulus lupulus var. Nugget. Organogenesis and in vitro somatic embryogenesis from differentiated plant cells are complex morphogenic processes involving physiological, biochemical, molecular and elemental changes in tissues and cells. These morphogenic processes play pivotal roles in plant biotechnology. Knowledge of the signals involved in their induction, formation and development will in future enable a controlled induction of morphogenesis.
The relationship between human genetic polymorphism and cancer susceptibility is increasingly important for risk assessment, early diagnosis and prevention of clinical disease and cancer. This work analyses single nucleotide polymorphisms (SNPs) in human xenobiotic and estrogen metabolising genes and suggests that combinations of polymorphic enzymes may be better predictors of cancer risk than polymorphisms in one or two genes alone.
The primary goal of computational molecular biology, like molecular biology itself, is to understand the meaning of the genomic information and how this information is expressed. Molecular systematics makes phylogenetic inferences from molecular data using computational methods. The systematics of Silene section Siphonomorpha Otth was approached from three different perspectives: the first analysing global relationships within the section, the second studying two pairs of taxa with problematic species boundaries, and the third using one of the species to study rarity at the ecological and genetic levels.