Development and testing of algorithmic solutions for problems in computational genomics and proteomics
This dissertation covers three subjects: (i) computational characterization of antigen (Ag)-antibody (Ab) interactions; (ii) a novel and effective algorithm to predict the epitope of a protein based on an antibody imprinting technique; and (iii) a comparison of existing de novo genome assembly algorithms targeted specifically at data generated by Illumina (Solexa) short-read sequencing technology, with suggestions for their improvement. The first part focuses on identifying, characterizing, and understanding the ways in which antibodies and antigens interact. We analyze epitope and paratope regions using a large dataset of Ag-Ab complex structures taken from the PDB. The epitope and paratope regions in our dataset are characterized in terms of their size, average amino acid residue composition, residue-residue pairing preferences, and residue dispersion. This analysis provides a more up-to-date picture of the Ag-Ab interface and new insights into the role of residue composition and distribution in Ag-Ab recognition. It also yields a refined substitution matrix optimized for the antibody imprinting technique, which is used to improve the effectiveness of the epitope prediction algorithms developed as the second focus of the thesis. The third and final part focuses on the de novo genome assembly problem. Genome assembly programs take the short reads generated by whole-genome shotgun sequencing and computationally reconstruct the genome. For this problem, we studied in depth the connections between read length, read type, repeat complexity, quality score, and coverage, and how these parameters improve or diminish the ability of assembly programs to assemble the sequence data.
These experiments give a better understanding of the impact of the above-mentioned parameters on the complexity of genome assembly and help establish the margins on these parameters within which the programs can assemble sequence data efficiently and accurately.