Theses and Dissertations at Montana State University (MSU)
Permanent URI for this collection: https://scholarworks.montana.edu/handle/1/733
17 results
Item String analysis and algorithms with genomic applications (Montana State University - Bozeman, College of Engineering, 2024) Liyana Ralalage, Adiesha Lakshan Liyanage; Chairperson, Graduate Committee: Binhai Zhu
In biology, genome rearrangements are mutations that change the gene content of a genome or the arrangement of the genes on a genome. Understanding how genome rearrangements occur can help us understand the evolutionary history of extant species, improve genetic engineering, and understand the basis of genetic diseases. In this dissertation, we explored four problems related to genome partitioning and to tandem duplication and deletion rearrangement operations. Our interest was focused on determining how difficult these problems are to solve and on identifying efficient algorithms to solve them. The proposed problems were formulated as string problems and then analyzed using complexity theory. In the first chapter, we explored several variations of the F-strip recovery problem, called XSR-F and GSR-F, and their complexity under different parameters. We proved that the XSR-F problem is hard to solve unless we restrict the allowed block sizes to one size. We provided a polynomial-time algorithm for GSR-F under a fixed alphabet and fixed F. In the second and third chapters, we introduced two string problems, named longest letter-duplicated subsequence (LLDS) and longest subsequence-repeated subsequence (LSRS), formulated as alternatives to the tandem-duplication distance problem that allow information to be extracted about segments of genes that may have undergone tandem duplication. We analyzed the complexity of their variations and devised efficient algorithms to solve them.
We proved that constrained versions of the LLDS and LSRS problems are NP-hard for parameter d ≥ 4, while the general versions are polynomially solvable, which hints that any variations closer to the original tandem duplication distance problem are still hard to solve. In the final chapter, we delved into two heuristic algorithms designed to compute genomic distance between two mitochondrial genomes and a heuristic algorithm to predict ancestral gene order under the TDRL (tandem-duplication random loss) model. We improved the previously studied method developed for permutation strings by tweaking heuristic choices aimed at calculating the minimum distance between two genomes to apply to non-permutation strings. These heuristic algorithms were implemented and tested on a real-world mitochondrial genome data set.

Item An exploration of whole-genome comparative genomic strategies for polyploid crop genomes (Montana State University - Bozeman, The Graduate School, 2022) Reynolds, Gillian Lucy; Co-chairs, Graduate Committee: Brendan Mumey and Jennifer A. Lachowiec
Genome comparison for large and complex polyploid crop genomes is a highly complex venture, yet it is critical. Given a rising demand for food coupled with yield-impacting resource limitations and rapidly changing global climates, it has never been more important to characterise the underlying genetic variation which underpins traits of agronomic interest. In this work, the problem of polyploidy genome comparison is explored at three levels. The first chapter characterizes the sequence relationships that exist between, and within, polyploidy genomes. This is achieved by hijacking a metagenomic strategy for rapid, and efficient, genome sequence classification. The second chapter then utilizes the identified subgenome-specific k-mer profiles for recruitment of assembled contigs and scaffolds previously only recruitable via more resource intensive optical mapping strategies.
This makes a greater proportion of the assembled data usable for downstream variant analysis. The third chapter then zooms into the problem of how to identify variants from large-scale sequencing data while minimizing bias and computational costs. A critical assessment of modern variant calling for crop genomes is performed and an algorithm to further extend a new, resource efficient, approach for large scale comparative genomics is presented and critically evaluated. In all, the work presented herein takes a top-down journey from genome- and subgenome-level comparative genomics all the way to identifying base-pair resolution strategies that are capable of revealing the underlying sequences responsible for keeping the world fed.

Item Genomic composition of green algae grown in high alkaline conditions (Montana State University - Bozeman, College of Agriculture, 2023) Goemann, Calvin Lee Cicha; Chairperson, Graduate Committee: Blake Wiedenheft; This is a manuscript style paper that includes co-authored chapters.
Algae are responsible for 50% of global oxygen production and sequestration of CO2 from the atmosphere. Algal photosynthesis plays a critical role in all aquatic ecosystems, converting sunlight and CO2 into usable biomass. Algal growth and biomass production can be coopted to produce industrially relevant bioproducts like triacylglycerols (TAGs) that can be converted into biodiesel and provide a sustainable carbon-neutral alternative to fossil fuels. In high-stress environments, algae produce high levels of TAGs. Multiple stresses, including nitrogen limitation and high pH, impact algae physiology, but little is known about how algae shift their metabolism to produce TAGs in response to these stresses. This topic remains relatively unexplored due to the limited availability of complete algae genomes. Here we sequence and annotate the complete telomere-to-telomere genome of an alkali-tolerant green alga, Chlorella sp. SLA-04.
Genomic analysis supports a reclassification of Chlorophyta green algae and illuminates how SLA-04 adapts to diverse environmental conditions. Additionally, transcriptomic analysis revealed how Chlorella sp. SLA-04 rewires carbon metabolism in high alkaline and nutrient-deplete conditions to produce TAGs while minimizing photosynthetic oxidative stress. Together, we double the amount of publicly available telomere-to-telomere green algal genomes and use this resource to explore how algae respond to diverse environmental conditions in their native and industrial settings.

Item Alkaline microalgae from Yellowstone National Park: physiological and genomic characterization for biofuel production (Montana State University - Bozeman, College of Agriculture, 2021) Moll, Karen Margaret; Chairperson, Graduate Committee: Brent M. Peyton; This is a manuscript style paper that includes co-authored chapters.
Alternatives are needed to avoid future economic and environmental impacts from continued exploration, harvesting, transport, and combustion of conventional hydrocarbons resulting in a rise of atmospheric CO2. Microalgae, including diatoms, are eukaryotic photoautotrophs that can utilize inorganic carbon (e.g., CO2) as a carbon source and sunlight as an energy source, and many microalgae can store carbon and energy in the form of neutral lipids. In addition to accumulating useful precursors for biofuels and chemical feed-stocks, the use of autotrophic microorganisms can further contribute to reduced CO2 emissions through utilization of atmospheric CO2. Most microalgal biofuel research has focused on green algae. However, there are good reasons to consider diatoms for biofuel research. Diatoms are responsible for approximately 40% of marine primary productivity, are important in freshwater systems, and are known to assimilate 20% of global CO2.
Identification and implementation of factors that can contribute to rapid growth will minimize inputs and production costs, thus improving algal biofuel viability. Nine green algae strains that were isolated from Witch Creek, Yellowstone National Park, were compared to two culture collection strains (PC-3 and UTEX395) for growth rates, dry cell weights and lipid accumulation. The strains exhibiting the fastest growth rates were WC-5, WC-1 and WC-2b. The culture collection strain was the best biomass producer and WC-5 and UTEX395 were the most productive for lipid. Based on the growth rates and lipid content, the best strains for biodiesel production were WC-1 and WC-5. In addition to the green algae strains, the diatom strain RGd-1 has previously been found to accumulate 30-40% (w/w) triacylglycerol and 70-80% (w/w) fatty acid methyl esters that can be transesterified into biodiesel. RGd-1 was sequenced via Illumina 2x50 and PacBio RSII reads, and genome comparisons revealed that the RGd-1 genome is significantly divergent from other publicly available genome sequences. RGd-1 was found to have nearly complete metabolic pathways for fatty acid elongation using acetyl-CoA in the mitochondrion or malonyl-CoA in the cytoplasm. The ability to switch between two different starting substrates may confer an advantage for fatty acid and neutral lipid biosynthesis. Further, RGd-1 was found to use the glyoxylate shunt as part of its central carbon metabolism. This carbon conservation pathway may potentially explain why RGd-1 is able to produce high concentrations of lipids. Using Illumina MiSeq sequencing, it was possible to obtain thorough community analysis of bacteria associated with RGd-1 in culture. Nine primary taxa were identified and further research will elucidate their roles as potential phycosphere bacteria that may have specific functional roles that contribute to RGd-1 health.
With long-range PacBio reads, RGd-1 was found to have a potential bacterial symbiont, Brevundimonas sp.

Item Improving genomic resources for the study of invasiveness in Eurasian watermilofil (Myriophyllum spicatum) and their hybrids (Montana State University - Bozeman, College of Agriculture, 2021) Pashnick, Jeffrey John; Chairperson, Graduate Committee: Ryan Thum; This is a manuscript style paper that includes co-authored chapters.
Genomics has revolutionized the way biologists ask fundamental questions about evolution. The thousands to tens of thousands of molecular markers generated through modern genomics increase the likelihood of detecting traits associated with a phenotype of interest. While genomics provides ever increasing evidence detecting these traits, they must be developed in each new system. Myriophyllum spicatum L. (Eurasian watermilfoil, EWM) and their hybrids with native Myriophyllum sibiricum Komorov (northern watermilfoil, NWM) are heavily managed aquatic plants in the United States. Genotypes both within and across these taxa and their hybrids can differ in their growth and herbicide response, prompting interest in determining which specific genotypes and genes will respond best to specific control tactics. However, because genotypes are unable to be distinguished by morphology, distinguishing genotypes requires molecular markers. EWM, NWM, and their hybrid are hexaploid (2n=6x=42) and developing these molecular markers requires accurately genotyping in a hexaploid with unknown chromosomal inheritance. The first manuscript of this dissertation empirically tested the genotyping information obtained from three commonly used molecular marker types, AFLPs, microsatellites, and GBS data. We found that while GBS markers have the lowest error rate, all molecular marker types provide the same genotype information.
In the second chapter we used a mapping population, GBS data, and likelihood models to determine if watermilfoil was an allohexaploid, autohexaploid, or a mix between them. We found overwhelming evidence that watermilfoil is an allohexaploid across the genome. Finally, using the characteristics of each molecular marker type, the third chapter developed a cost-effective and information dense panel of microhaplotypes to genotype in watermilfoil. Microhaplotyping data can be shared across laboratories and promotes collaboration with weed managers by informing management with genetic information. Together, the work in this dissertation provides diploidized molecular markers and polyploid mode of inheritance to begin to connect genotype to herbicide response traits in watermilfoil.

Item Machine learning for pangenomics (Montana State University - Bozeman, College of Engineering, 2021) Manuweera, Buwani Sakya; Chairperson, Graduate Committee: Brendan Mumey; This is a manuscript style paper that includes co-authored chapters.
Finding genotype-phenotype associations is an important task in biology. Most of the existing reference-based methods introduce biases because they use a single genome from an individual as the reference sequence. These biases can limit the inferred genotype-phenotype associations. Advances in sequencing techniques have enabled access to a large number of sequenced genomes from multiple organisms from different species. These can be used to create a pangenome, which represents a collection of genetic information from multiple organisms. Using a pangenome can effectively reduce those limitations, as it does not require a reference. Recently, machine learning techniques have been emerging as effective methods for problems involving genomics and pangenomics data. Kernel methods are used as a part of machine learning models to compute similarities between instances.
Kernels can map the given set of data into a different feature space that can help distinguish the data into corresponding classes. In this work, we develop supervised machine learning models using a set of features gathered using pangenomic graphs, and the effectiveness of those features is evaluated in predicting yeast phenotypes. We first evaluated the effectiveness of the features using a traditional supervised machine learning model and then compared it to novel custom kernels that incorporate the information from the pangenomic graphical structure. Experimental results using yeast phenotypes indicate that the developed machine learning models that use reference-free features and novel kernels outperform models based on traditional reference-based features. This work has implications for bioinformaticians and computational biologists working with pangenomes as well as computer scientists developing predictive models for genomic data.

Item Duplications and deletions in genomes: theory and applications (Montana State University - Bozeman, College of Engineering, 2022) Zou, Peng; Chairperson, Graduate Committee: Binhai Zhu
In computational biology, duplications and deletions in genome rearrangements are important to understand an evolutionary process. In cancer genomics research, intra-tumor genetic heterogeneity is one of the central problems. Gene duplications and deletions are observed occurring rapidly in cancer during tumour formation. Hence, they are recognized as critical mutations of cancer evolution. Understanding these mutations is important to understand the origins of cancer cell diversity, which could help with cancer prognostics as well as drug resistance explanation. In this dissertation, first, we prove that the tandem duplication distance problem is NP-complete, even if |Σ| ≥ 4, settling a 16-year old open problem.
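To make the tandem duplication operation concrete, the following is a minimal illustrative sketch, not the dissertation's method: a tandem duplication copies a contiguous segment of a string and inserts the copy immediately after the original. The function names `tandem_duplications` and `within_k_duplications` are hypothetical, and the search is a deliberately naive brute force that takes exponential time, which is consistent with the problem being NP-complete in general.

```python
def tandem_duplications(s: str) -> set:
    """Return every string obtainable from s by exactly one tandem duplication."""
    results = set()
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            # Copy the segment s[i:j] and insert the copy right after it.
            results.add(s[:j] + s[i:j] + s[j:])
    return results


def within_k_duplications(s: str, t: str, k: int) -> bool:
    """Brute-force check (illustration only): can s be turned into t
    using at most k tandem duplications?"""
    frontier = {s}
    for _ in range(k):
        if t in frontier:
            return True
        # Every duplication lengthens the string, so anything longer
        # than t can never reach t and is pruned.
        frontier = {
            u
            for v in frontier
            for u in tandem_duplications(v)
            if len(u) <= len(t)
        }
    return t in frontier


if __name__ == "__main__":
    print(sorted(tandem_duplications("ab")))
    print(within_k_duplications("abc", "abcbc", 1))
```

For example, duplicating the segment "bc" in "abc" yields "abcbc", so one operation suffices for that pair, while reaching "aabcc" from "abc" needs two.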
We also obtain some positive results by showing that if one of the input sequences, S, is exemplar, then one can decide if S can be transformed into T using at most k tandem duplications in time 2^{O(k^2)} + poly(n). Motivated by computing duplication patterns in sequences, a new fundamental problem called the longest letter-duplicated subsequence (LLDS) is investigated. We investigate several variants of this problem. Due to fast mutations in cancer, genome rearrangements on copy number profiles are used more often than genomes themselves. We explore the Minimum Copy Number Generation problem. We prove that it is NP-hard to even obtain a constant factor approximation. We also show that the corresponding parameterized version is W[1]-hard. These either improve the previous hardness result or solve an open problem. We then give a polynomial algorithm for the Copy Number Profile Conforming problem. Finally, we investigate the pattern matching with 1-reversal distance problem. With the known results on Longest Common Extension queries, one can design an O(n+m) time algorithm for this problem. However, we find empirically that this algorithm is very slow for small m. We then design an algorithm based on the Karp-Rabin fingerprints which runs in an expected O(nm) time. The algorithms are implemented and tested on a real bacterial sequence dataset. The empirical results show that the shorter the pattern length is (i.e., when m < 200), the more substrings with 1-reversal distance the bacterial sequences have.

Item Population structure, gene flow, and genetic diversity of Rocky Mountain bighorn sheep informed by genomic analysis (Montana State University - Bozeman, College of Agriculture, 2020) Flesch, Elizabeth Pearl; Chairperson, Graduate Committee: Jennifer Thomson; Jay J. Rotella, Jennifer M. Thomson, Tabitha A. Graves and Robert A.
Garrott were co-authors of the article, 'Evaluating sample size to estimate genetic management metrics in the genomics era' in the journal 'Molecular ecology resources' which is contained within this dissertation.; Tabitha A. Graves, Jennifer M. Thomson, Kelly M. Proffitt, P.J. White, Thomas R. Stephenson and Robert A. Garrott were co-authors of the article, 'Evaluating wildlife translocations using genomics: a bighorn sheep case study' in the journal 'Ecology and evolution' which is contained within this dissertation.; Tabitha A. Graves, Jennifer M. Thomson, Kelly M. Proffitt and Robert A. Garrott were co-authors of the article, 'Genetic diversity of bighorn sheep population is associated with dispersal, augmentation, and bottlenecks' submitted to the journal 'Biological conservation' which is contained within this dissertation.
This dissertation evaluated the genomics of bighorn sheep (Ovis canadensis) herds across the Rocky Mountain West to determine optimal sample size for estimating kinship within and between populations (Chapter Two), to detect gene flow due to natural dispersal and translocations (Chapter Three), and to evaluate the correlation between genetic diversity and influences on population size (Chapter Four). To date, wildlife managers have moved many bighorn sheep across the Rocky Mountain West in an effort to provide new genetic diversity to isolated herds. However, little is known about the genetics of these herds and the real impacts of translocations. To learn how populations have been impacted by these management actions, we genotyped 511 bighorn sheep from multiple populations using a new cutting-edge genomic research technique, the Illumina Ovine High Density array, which contained about 24,000 to 30,000 single nucleotide polymorphisms informative for Rocky Mountain bighorn sheep. First, we determined that a sample size of 20 to 25 bighorn sheep was adequate for assessment of intra- and interpopulation kinship.
In addition, we concluded that a universal sample size rule for all wild populations or genetic marker types may not be able to sufficiently address the complexities that impact genomic kinship estimates. Secondly, we synthesized genomic evidence across multiple analyses to evaluate 24 different translocation events; we detected eight successful reintroductions and five successful augmentations. One native population founded most of the examined reintroduced herds, suggesting that environmental conditions did not need to match for populations to persist following reintroduction. Finally, we determined that influences on population size over time were correlated with genetic diversity. Gene flow variables, including unassisted connectivity and animals contributed in augmentations, were more important predictors than historic minimum population size and origin (i.e. native vs. reintroduced). This hypothesis-based research approach will give wildlife managers additional biological insight to help inform various management options for bighorn sheep restoration and conservation.

Item Rock powered life in the Samail ophiolite: an analog for early Earth (Montana State University - Bozeman, College of Agriculture, 2021) Fones, Elizabeth Marie; Chairperson, Graduate Committee: Eric Boyd; Daniel R. Colman, Emily A. Kraus, Daniel B. Nothaft, Saroj Poudel, Kaitlin R. Rempfert, John R. Spear, Alexis S. Templeton and Eric S. Boyd were co-authors of the article, 'Physiological adaptations to serpentinization in the Samail ophiolite, Oman' in the journal 'The International Society for Microbial Ecology journal' which is contained within this dissertation.; Daniel R. Colman, Emily A. Kraus, Ramunas Stepanauskas, Alexis S. Templeton, John R. Spear and Eric S.
Boyd were co-authors of the article, 'Diversification of methanogens into hyperalkaline serpentinizing environments through adaptations to minimize oxidant limitation' in the journal 'The International Society for Microbial Ecology journal' which is contained within this dissertation.; David W. Mogk, Alexis S. Templeton and Eric S. Boyd were co-authors of the article, 'Endolithic microbial carbon cycling activities in subsurface mafic and ultramafic igneous rock' which is contained within this dissertation.
Serpentinization is a geochemical process wherein the oxidation of Fe(II)-bearing minerals in ultramafic rock couples with the reduction of water to generate H2, which in turn can reduce inorganic carbon to biologically useful substrates such as carbon monoxide and formate. Serpentinization has been proposed to fuel a subsurface biosphere and may have promoted life's emergence on early Earth. However, highly reacted waters exhibit high pH and low concentrations of potential electron acceptors for microbial metabolism, including CO2. To characterize how serpentinization shapes the distribution and diversity of microbial life, direct cell counts, microcosm-based activity assays, and genomic inferences were performed on environmental rock and water samples from the Samail Ophiolite, Oman. Microbial communities were shaped by water type with cell densities and activities generally declining with increasing pH. However, cells inhabiting highly reacted waters exhibited adaptations enabling them to minimize stresses imposed by serpentinization, including preferentially assimilating carbon substrates for biomolecule synthesis rather than dissimilating them for energy generation, maintaining small genomes, and synthesizing proteins comprised of more reduced amino acids to minimize energetic costs and maximize protein stability in highly reducing waters. Two distinct lineages of a genus of methanogens, Methanobacterium, were recovered from subsurface waters.
One lineage was most abundant in high pH waters exhibiting millimolar concentrations of H2, yet lacked two key oxidative [NiFe]-hydrogenases whose functions were presumably replaced by formate dehydrogenases that oxidize formate to yield reductant and CO2. This allows cells to overcome CO2/oxidant limitation in high pH waters via a pathway that is unique among characterized Methanobacteria. Finally, gabbro cores from the Stillwater Mine (Montana, U.S.A.) were used to develop methods for detecting the activities of cells inhabiting mafic to ultramafic igneous rocks while controlling for potential contaminants. Optimized protocols were applied to rock cores from the Samail Ophiolite, where rates of biological formate and acetate metabolism were higher in rocks interfacing less reacted waters as compared with more extensively reacted waters, and in some cases may greatly exceed activities previously measured in fracture waters. This dissertation provides new insights into the distribution, activities, and adaptations exhibited by life in a modern serpentinizing environment.

Item Morphological adaptations facilitating attachment for archaeal viruses (Montana State University - Bozeman, College of Letters & Science, 2019) Hartman, Ross Alan; Chairperson, Graduate Committee: Mark J. Young
Little is known regarding the attachment and entry process for any archaeal virus. The virus capsid serves multiple biological functions, including protecting the viral genome during transit between host cells and facilitating attachment and entry of the viral genome to a new host cell. Virus attachment is conducted without expenditure of stored chemical energy, i.e. ATP hydrolysis. Instead, virus particles depend on diffusion for transportation and attachment from one host cell to another. This thesis examines the attachment process for two archaeal viruses. Sulfolobus turreted icosahedral virus (STIV) is well characterized for an archaeal virus.
Still, no information is available concerning STIV attachment or entry. The research presented here shows that STIV attaches to a host cell pilus. Furthermore, combining the previously determined atomic model for the virus with cryo-electron tomography, a pseudo-atomic model of the interaction between the host pilus and virus was determined. Based on these data, a model is proposed for the maturation of the virus capsid from a noninfectious to an infectious form, by dissociation of accessory proteins. Finally, a locus of genes is identified in the host cell, encoding proteins essential for viral infection, that are likely components of the pili structure recognized by STIV. The isolation of a new archaeal virus, Thermoproteus Piliferous Virus 1 (TSPV1), is also presented here. The TSPV1 virion has numerous fibrous extensions from the capsid, of varying length, that are the first observed for any virus. The 2-3 nm capsid fibers likely serve to extend the effective surface area of the virus, facilitating attachment to host cells. Characterization of this new virus was conducted, including genome sequencing and determination of the protein identity for the capsid fibers. The research presented here provides a substantial advancement in our knowledge of the attachment process for archaeal viruses. Attachment to host pili is now emerging as a common theme for archaeal viruses. Furthermore, the isolation of the new archaeal virus TSPV1 demonstrates a novel strategy to increase the probability of interaction between a virus and host cell.