Theses and Dissertations at Montana State University (MSU)

Permanent URI for this collectionhttps://scholarworks.montana.edu/handle/1/733

Browse

Search Results

Now showing 1 - 3 of 3
  • Thumbnail Image
    Item
    String analysis and algorithms with genomic applications
    (Montana State University - Bozeman, College of Engineering, 2024) Liyana Ralalage, Adiesha Lakshan Liyanage; Chairperson, Graduate Committee: Binhai Zhu
    In biology, genome rearrangements are mutations that change the gene content of a genome or the arrangement of the genes on a genome. Understanding how genome rearrangements occur in a genome can help us to understand the evolutionary history of extant species, improve genetic engineering, and understand the basis of genetic diseases. In this dissertation, we explored four problems related to genome partitioning and tandem duplication and deletion rearrangement operations. Our interest was focused on determining how difficult it is to solve these problems and identifying efficient algorithms to solve them. The proposed problems were formulated as string problems and then analyzed using complexity theory. In the first chapter, we explored several variations of F -strip recovery problem called XSR-F and GSR-F and their complexity under different parameters. We proved that the XSR-F problem is hard to solve unless we restrict the allowed block sizes to one size. We provided a polynomial time algorithm for GSR-F under a fixed alphabet and fixed F . In the second and third chapters, we introduced two string problems named longest letter- duplicated subsequence (LLDS) and longest subsequence-repeated subsequence (LSRS)-- formulated as alternative problem formulations for the tandem-duplication distance problem that allow to extract information about segments of genes that may have undergone tandem duplication-- analyzed the complexity of their variations and devised efficient algorithms to solve them. We proved that constrained versions of LLDS and LSRS problems are NP- hard for parameter d > or = 4, while general versions were polynomially solvable which hints that any variations closer to the original tandem duplication distance problem are still hard to solve. In the final chapter, we delved into two heuristic algorithms designed to compute genomic distance between two mitochondrial genomes and a heuristic algorithm to predict ancestral gene order under the TDRL (tandem-duplication random loss) model. We improved the previously studied method developed for permutation strings by tweaking heuristic choices aimed at calculating the minimum distance between two genomes to apply to non-permutation strings. These heuristic algorithms were implemented and tested on a real-world mitochondrial genome data set.
  • Thumbnail Image
    Item
    Machine learning for pangenomics
    (Montana State University - Bozeman, College of Engineering, 2021) Manuweera, Buwani Sakya; Chairperson, Graduate Committee: Brendan Mumey; This is a manuscript style paper that includes co-authored chapters.
    Finding genotype-phenotype associations is an important task in biology. Most of the the existing reference-based methods introduce biases because they use a single genome from an individual as the reference sequence. So, these biases can lead to limitations in inferred genotype-phenotype associations. Advances in sequencing techniques have enabled access to a large number of sequenced genomes from multiple organisms from different species. These can be used to create a pangenome, which represents a collection of genetic information from multiple organisms. Using a pangenome can effectively reduce those limitation issues as it does not require a reference. Recently, machine learning techniques are emerging as effective methods for problems involving genomics and pangenomics data. Kernel methods are used as a part of machine learning models to compute similarities between instances. Kernels can map the given set of data into a different feature space that can help distinguish the data into corresponding classes. In this work, we develop supervised machine learning models using a set of features gathered using pangenomic graphs, and the effectiveness of those features is evaluated in predicting yeast phenotypes. We first evaluated the effectiveness of the features using a a traditional supervised machine learning model and, then compared it to novel custom kernels that incorporate the information from the pangenomic graphical structure. Experimental results using yeast phenotypes indicate that the developed machine learning models that use reference-free features and novel kernels outperform models based on traditional reference-based features. This work has implications for bioinformaticians and computational biologists working with pangenomes as well as computer scientists developing predictive models for genomic data.
  • Thumbnail Image
    Item
    Duplications and deletions in genomes: theory and applications
    (Montana State University - Bozeman, College of Engineering, 2022) Zou, Peng; Chairperson, Graduate Committee: Binhai Zhu
    In computational biology, duplications and deletions in genome rearrangements are important to understand an evolutionary process. In cancer genomics research, intra-tumor genetic heterogeneity is one of the central problems. Gene duplications and deletions are observed occurring rapidly in cancer during tumour formation. Hence, they are recognized as critical mutations of cancer evolution. Understanding these mutations are important to understand the origins of cancer cell diversity which could help with cancer prognostics as well as drug resistance explanation. In this dissertation, first, we prove that the tandem duplication distance problem is NP-complete, even if |sigma| > or = 4, settling a 16-year old open problem. And we obtain some positive results by showing that if one of the input sequences, S, is exemplar, then one can decide if S can be transformed into T using at most k tandem duplications in time 2 O (k 2) + poly(n). Motivated by computing duplication patterns in sequences, a new fundamental problem called the longest letter-duplicated subsequence (LLDS) is investigated. We investigate several variants of this problem. Due to fast mutations in cancer, genome rearrangements on copy number profiles are used more often than genome themselves. We explore the Minimum Copy Number Generation problem. We prove that it is NP-hard to even obtain a constant factor approximation. We also show that the corresponding parameterized version is W[1]-hard. These either improve the previous hardness result or solve an open problem. And we then give a polynomial algorithm for the Copy Number Profile Conforming problem. Finally, we investigate the pattern matching with 1-reversal distance problem. With the known results on Longest Common Extension queries, one can design an O(n+m) time algorithm for this problem. However, we find empirically that this algorithm is very slow for small m. We then design an algorithm based on the Karp-Rabin fingerprints which runs in an expected O(nm) time. The algorithms are implemented and tested on real bacterial sequence dataset. The empirical results shows that the shorter the pattern length is (i.e., when m < 200), the more substrings with 1-reversal distance the bacterial sequences have.
Copyright (c) 2002-2022, LYRASIS. All rights reserved.