THE VIRAL AUTHORS OF EVOLUTION FROM SEQUENCE TO STRUCTURE

by

William Stevens Henriques

A dissertation submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in

Microbiology and Immunology

MONTANA STATE UNIVERSITY
Bozeman, Montana

December 2024

©COPYRIGHT by William Stevens Henriques 2024
All Rights Reserved

DEDICATION

This dissertation is dedicated to E. coli and elephants, and all the biology in between.

ACKNOWLEDGEMENTS

Science is an adventure into the unknown that requires curiosity, boldness, and luck. But science is also coming back with a new kernel of knowledge, which requires grit, creativity, and clarity. I first want to thank my mentor Blake Wiedenheft for teaching me these lessons. Working with Blake has made me a better storyteller and scientist.

Science is also filled with failures and missteps. I thank my co-mentor Harmit Malik, who has reminded me that failures and missteps are part of the process, and the most important thing is to return to the lab day in and day out with both compassion and enthusiasm.

No one ventures into the unknowns of science alone. Mentors and colleagues offer a helping hand in the squalls of failures, tedium, and rejection. I would like to thank Anna, Artem, Andrew, Royce, and Janet for their mentorship, as well as Murat, Tanner, and Calvin for their bioinformatic companionship and Nate for many shared hours at the microscope. A special thanks to Dr. Martin Lawrence and Colin Gauvin at Montana State’s cryoEM Facility, the Research Cyberinfrastructure team and especially Coltran Hophan-Nichols, and my graduate committee for their encouragement and guidance.

Finally, I want to acknowledge my friends who kept me sane and singing, the wonderful crew at Blackbird kitchen who kept me connected to the Bozeman community, and my family, who was always on the other end of the phone during those long dark nights on Everest.
This work was funded in part by the National Institutes of Health (R35GM134867), the M.J. Murdock Charitable Trust, a young investigator award to Blake Wiedenheft from Amgen, and the Montana State University Agricultural Experimental Station (USDA NIFA). iv TABLE OF CONTENTS 1. INTRODUCTION .............................................................................................................. 1 Conserved Viral Defense Systems ...................................................................................... 2 Unique Viral Defense Systems ........................................................................................... 2 Abundance Leads to Co-option For Defense ...................................................................... 4 Domestication Beyond Defense .......................................................................................... 5 Sequence and Structure ....................................................................................................... 6 Dissertation Overview ........................................................................................................ 7 References ......................................................................................................................... 12 2. GENOME SEQUENCE, PHYLOGENETIC ANALYSIS, AND STRUCTURE BASED ANNOTATION REVEALS METABOLIC POTENTIAL OF CHLORELLA SP. SLA-04 ................................................................... 20 Contribution of Authors and Co-Authors ......................................................................... 20 Manuscript Information .................................................................................................... 22 Abstract ............................................................................................................................. 23 Introduction ....................................................................................................................... 
24 Methods............................................................................................................................. 25 Strain Information ................................................................................................. 25 Strain Cultivation Conditions ................................................................... 25 Axenic Isolation ........................................................................................ 26 Genome Sequencing and Assembly ...................................................................... 26 High Molecular Weight DNA Extraction .................................................. 26 RNA Extraction ......................................................................................... 27 DNA and RNA Sequencing ...................................................................... 28 Genome Assembly and Annotation ........................................................... 28 Comparative Analyses .......................................................................................... 30 Phylogenetic Analysis. .............................................................................. 30 Results ............................................................................................................................... 31 General Genome Characteristics ........................................................................... 31 Chlorella sp. SLA-04 Genome Assembly Statistics ................................. 31 Genome Completeness Analysis ............................................................... 31 Annotation of Chlorella Sp. SLA-04 ........................................................ 32 Annotation Using Protein 3D Structural Homology ................................. 33 Phylogenetic Analysis of Chlorella Sp. SLA-04....................................... 35 Genomic Analysis of Metabolic Pathways in Chlorella Sp. SLA-04 ................... 
35 Discussion ......................................................................................................................... 37 Conclusion ........................................................................................................................ 40 Data Access ....................................................................................................................... 40 Acknowledgements ........................................................................................................... 40 References ......................................................................................................................... 40 v TABLE OF CONTENTS CONTINUED Supplementary Figures ..................................................................................................... 48 Supplementary Tables ....................................................................................................... 52 3. CLARIFYING CRISPR: WHY REPEATS IDENTIFIED IN THE HUMAN GENOME SHOULD NOT BE CONSIDERED CRISPRS .............................................. 54 Contribution of Authors and Co-Authors ......................................................................... 54 Manuscript Information .................................................................................................... 55 Abstract ............................................................................................................................. 56 Introduction ....................................................................................................................... 56 Methods............................................................................................................................. 57 CRISPR Predictions .............................................................................................. 57 Identification of Cas Genes in the Human Genome ............................................. 
59 Comparison of Repeat Length Distribution of CRISPRs from Prokaryotes and HCRISPRs ................................................................................. 59 Results and Discussion ..................................................................................................... 60 References ......................................................................................................................... 65 Supplementary Figures ..................................................................................................... 68 Supplemental Information ................................................................................................ 69 4. THE DIVERSE EVOLUTIONARY HISTORIES OF DOMESTICATED METAVIRAL CAPSID GENES IN MAMMALS ........................................................... 73 Contribution of Authors and Co-Authors ......................................................................... 73 Manuscript Information .................................................................................................... 74 Abstract ............................................................................................................................. 75 Introduction ....................................................................................................................... 76 Results ............................................................................................................................... 80 Divergent Metaviral Capsid-derived Genes in the Human Genome .................... 80 Domesticated Metaviral Capsid Genes Experienced Distinct Evolutionary Retention Fates ................................................................................ 84 Domesticated Metaviral Capsid Genes are Retained under Purifying and Positive Selection ........................................................................... 
89 Domesticated Capsid Genes are Structurally Diverse .......................................... 93 Diverse N-terminal Structures in Domesticated Genes Derived from Metaviral Ancestors ..................................................................................... 98 A Novel Putative RNA-binding Domain in the PNMA Family ......................... 101 Discussion ....................................................................................................................... 105 Methods............................................................................................................................ 111 Building Capsid HMMs ....................................................................................... 111 Identifying Capsid Genes in Vertebrate Genomes ...............................................113 vi TABLE OF CONTENTS CONTINUED Matching Human Capsid-like ORF Sequences to Previously Annotated Genes ..................................................................................................114 Analysis of Domain Architecture and Genetic Context .......................................114 Phylogenetic Analysis of Capsid-like Sequences ................................................114 Identification of Mammalian Orthologs ..............................................................115 Analysis of Evolutionary Selective Constraints ..................................................119 Structure Predictions and Homology Modelling ................................................ 121 Phylogenetic Analysis of Putative RBD Sequences ........................................... 122 Expression Analysis ............................................................................................ 123 International Mouse Phenotyping Consortium ................................................... 123 Data and Resource Availability ....................................................................................... 
123 References ....................................................................................................................... 124 Supplementary Figures ................................................................................................... 136 Supplementary Tables ..................................................................................................... 153 5. THE RETROTRANSPOSON-DERIVED CAPSID GENES PNMA1 AND PNMA4 MAINTAIN REPRODUCTIVE CAPACITY ................................................... 159 Contribution of Authors and Co-Authors ....................................................................... 159 Manuscript Information .................................................................................................. 162 Abstract ........................................................................................................................... 163 Introduction ..................................................................................................................... 163 Results ............................................................................................................................. 165 Expression of PNMA1 and PNMA4 Declines with Age in Human Ovaries ................................................................................................................ 165 Male and Female Mice Lacking Pnma1 or Pnma4 Show Premature Loss of Fertility ................................................................................................... 168 Pnma1 and Pnma4 Mutants Acquire Abdominal Obesity but Appear Behaviorally Normal .............................................................................. 173 PNMA1 and PNMA4 Form Capsid-like Structures in Gonadal Tissue .................................................................................................................. 
175 Human Variation at the PNMA1 and PNMA4 Loci is Associated with Reproductive Defects .................................................................................. 178 Discussion ....................................................................................................................... 179 Methods........................................................................................................................... 183 Generation of Mutant Mouse Models ................................................................. 183 Mouse Husbandry ............................................................................................... 184 Mouse Genotyping .............................................................................................. 185 Fertility Assay ..................................................................................................... 186 Body and Gonad Weight ..................................................................................... 186 Immunofluorescence of Testis Sections .............................................................. 186 Epididymal Sperm Counts .................................................................................. 188 vii TABLE OF CONTENTS CONTINUED Periodic Acid-Schiff (PAS) Staining ................................................................... 188 Blood Collection, Sample Preparation, and Hormone Analysis ......................... 189 Ovarian Histology ............................................................................................... 189 Oocyte Collection and Maturation ...................................................................... 189 Mouse Oocyte Isolation, Culturing, Microinjection and High- Resolution Confocal Microscopy ....................................................................... 190 Plasmid Creation ................................................................................................. 
191 HEK293T Cell Culture ....................................................................................... 191 HEK293T Transfection ....................................................................................... 192 Cloning, Expression, and Purification of Recombinant PNMA1 and PNMA4 ........................................................................................................ 193 Negative Stain Transmission Electron Microscopy ............................................ 194 Velocity Gradient Ultracentrifugation................................................................. 195 Western Blotting ................................................................................................. 195 Quantitative PCR ................................................................................................ 197 Extracellular Protein Analysis ............................................................................. 198 Analysis of Testicular Capsid-Like PNMA4 Protein .......................................... 198 Lysate Preparation ................................................................................... 198 Double Sucrose Gradient ........................................................................ 199 Iodixanol Step Gradient .......................................................................... 199 Northern Blot Analysis ....................................................................................... 199 PNMA1-5 Expression Analysis in Human and Mouse Ovary ............................ 200 Analysis of Traits Associated with Human PNMA1 and PNMA4 Variation .............................................................................................................. 201 Statistical Analysis .............................................................................................. 201 Neurobehavioral Tests ......................................................................................... 
201 Open Field Test ....................................................................................... 202 Grip Strength Test ................................................................................... 202 Inverted Hang Test .................................................................................. 202 Y Maze Spontaneous Alternation ........................................................... 203 Fear Conditioning ................................................................................... 203 References ....................................................................................................................... 204 Supplementary Figures ................................................................................................... 212 Supplementary Tables ..................................................................................................... 224 6. A VIRALLY-ENCODED TRNA NEUTRALIZES THE PARIS ANTIVIRAL DEFENSE SYSTEM................................................................................ 228 Contribution of Authors and Co-Authors ....................................................................... 228 Manuscript Information .................................................................................................. 231 Abstract ........................................................................................................................... 232 Introduction ..................................................................................................................... 232 viii TABLE OF CONTENTS CONTINUED Results ............................................................................................................................. 234 PARIS Forms a Propeller-shaped Complex ........................................................ 234 AriA Sequesters AriB .......................................................................................... 
236 AriA Binds to Ocr and Releases AriB................................................................. 239 Activated PARIS Inhibits Translation ................................................................. 244 Triggers and a Suppressor of PARIS .................................................................. 245 Viral TRNA Resistant to AriB Cleavage............................................................. 246 Evolution of the PARIS Defense System ............................................................ 249 Discussion ....................................................................................................................... 250 Methods........................................................................................................................... 252 Bacterial Strains, Phages, and Plasmids ............................................................. 252 Expression and Purification of the PARIS Complex .......................................... 253 In Vivo Protein Pull-down ................................................................................... 254 Glutaraldehyde Cross-linking Assay................................................................... 255 In Vitro Protein Pull-down .................................................................................. 255 Cryo-EM Sample Preparation and Data Collection ............................................ 256 Cryo-EM Data Processing of Inactive PARIS .................................................... 257 Cryo-EM Data Processing of Ocr Pulldown ....................................................... 258 Model Building and Refinement ......................................................................... 258 ATPase Assays .................................................................................................... 259 Efficiency of Plaquing Assay .............................................................................. 
260 Liquid Culture Phage Infection ........................................................................... 261 Solid Media Toxicity Assay ................................................................................ 261 Liquid Culture Toxicity Assay ............................................................................ 262 Fluorescence Microscopy ................................................................................... 262 Extraction of Total DNA ..................................................................................... 263 TUNEL Assay ..................................................................................................... 263 Hi-C Procedure and Sequencing ......................................................................... 264 Metabolic Labeling ............................................................................................. 265 In Vitro Translation ............................................................................................. 266 Isolation and Sequencing of T5 Escaper Mutants ............................................... 267 Extraction of Total RNA ..................................................................................... 268 Preparation of Enriched Small RNA Fractions from the Total RNA Samples ............................................................................................................... 269 Cleavage Assay by AriB ..................................................................................... 270 Mapping of the AriB Cleavage Site by Primer Extension .................................. 270 Mapping the AriB Cleavage Site by Specific Reverse Transcription (RT) and Sanger Sequencing .............................................................................. 271 Northern Blot ...................................................................................................... 
272 Defense System Detection .................................................................................. 272 Domain Annotation ............................................................................................. 273 ix TABLE OF CONTENTS CONTINUED AriA and AriB Tree ............................................................................................. 273 Defense System ATPase AAA15/21 Tree ........................................................... 274 Statistics and Reproducibility ............................................................................. 274 References ....................................................................................................................... 274 Extended Data Figures .................................................................................................... 280 Extended Data Tables ...................................................................................................... 291 Supplementary Tables ..................................................................................................... 292 7. STRUCTURE REVEALS WHY GENOME FOLDING IS NECESSARY FOR SITE-SPECIFIC INTEGRATION OF FOREIGN DNA INTO CRISPR ARRAYS .......................................................................................................... 312 Contribution of Authors and Co-Authors ....................................................................... 312 Manuscript Information .................................................................................................. 314 Abstract ........................................................................................................................... 315 Introduction ..................................................................................................................... 315 Results ............................................................................................................................. 
318 Cryo-EM Structure of Type I-F CRISPR Integration Complex ......................... 318 Foreign DNA Constrains Cas2/3 Linker against Cas1 ....................................... 322 Cas2 Homodimers Recognize and Bend Inverted Repeat Sequences ............................................................................................................ 324 Cas2 Homodimer is Surrounded by Four DNA Helices ..................................... 326 IHF and the Leader Direct Integration into Diverse Sequences ......................... 326 5’ GT is Critical for Cas1-mediated Integration ................................................. 329 Discussion ....................................................................................................................... 330 Methods........................................................................................................................... 335 Nucleic Acid Preparation .................................................................................... 335 Cas1 and Cas2/3 Mutagenesis ............................................................................ 336 Protein Purification ............................................................................................. 336 In Vitro Integration Assays .................................................................................. 338 Assembly and Purification of I-F Integration Complex ..................................... 339 Cryo-EM Sample Preparation and Data Acquisition .......................................... 340 Cryo-EM Image Processing ................................................................................ 342 Model Building and Validation ........................................................................... 343 Cas1, Cas2/3, and Repeat Conservation Analysis .............................................. 344 Data Availability ................................................................................................. 
347 Extended Data Figures .................................................................................................... 347 References ....................................................................................................................... 358 Supplementary Tables ..................................................................................................... 366 x TABLE OF CONTENTS CONTINUED 8. STRUCTURES REVEAL HOW A CRISPR INTEGRASE CAPTURES, DELIVERS, AND INTEGRATES FOREIGN DNA INTO THE CRISPR ARRAY ........................................................................................................................... 374 Contribution of Authors and Co-Authors ....................................................................... 374 Manuscript Information .................................................................................................. 375 Abstract ........................................................................................................................... 376 Introduction ..................................................................................................................... 376 Results ............................................................................................................................. 378 Cas1-2/3 Forms a DNA Capture Complex ......................................................... 378 Structural Basis for Cas3 HD Nuclease and RecA Helicase Inhibition ............................................................................................................. 382 Cas3 Blocks Foreign DNA Integration ............................................................... 385 Foreign DNA Capture Exposes Genome Binding Sites ..................................... 386 Structure Precludes a Role for Cas3 in PAM Trimming ..................................... 390 Foreign DNA Binding Exacerbates Asymmetric Instability ............................... 
391 The PAM Changes the Stability of the Integration Complex ............................. 391 The CRISPR Repeat Channel Stretches and Distorts The Repeat ...................... 395 Discussion ....................................................................................................................... 396 Materials & Methods ...................................................................................................... 399 Conservation Analysis of Cas2/3 and Cas1 ........................................................ 399 Nucleic Acid Preparation .................................................................................... 400 Protein Purification ............................................................................................. 400 Assembly and Purification of the Delivery Complex ......................................... 402 Assembly and Purification of the Integration Complex ..................................... 402 CryoEM Sample Preparation and Data Acquisition – Cas1-2/3 Capture Complex ................................................................................................ 403 CryoEM Sample Preparation and Data Acquisition – Cas1-2/3 Delivery Complex ............................................................................................... 403 CryoEM Sample Preparation and Data Acquisition – Cas1-2/3 Integration Complex ........................................................................................... 404 Cryo-EM Image Processing ................................................................................ 405 Model Building and Validation ........................................................................... 405 References ....................................................................................................................... 406 Supplementary Figures ................................................................................................... 
414 Supplementary Tables ..................................................................................................... 424 9. CONCLUSION ............................................................................................................... 426 From Sequence to Structure ............................................................................................ 428 Structure Determines Function ....................................................................................... 429 Concluding Remarks ....................................................................................................... 430 xi TABLE OF CONTENTS CONTINUED References ....................................................................................................................... 431 CUMULATIVE REFERENCES CITED .............................................................................. 433 APPENDIX: RAPID CRYOEM STRUCTURE DETERMINATION FROM CELL-FREE TRANSCRIPTION AND TRANSLATION ................................................... 493 Abstract ........................................................................................................................... 494 Main Text ........................................................................................................................ 494 References ....................................................................................................................... 498 xii LIST OF TABLES Table Page 1. Supplementary Table 2.1. Genome statistics of Chlorella sp. SLA-04 genome. ............................................................................................................................. 52 2. Supplementary Table 2.2. Summary of repetitive elements identified by RepeatMasker using Dfam and Repbase as a combined reference database. ................... 53 3. Supplementary Table 2.3. Active transcription of lipid synthesis genes within the Chlorella sp. SLA-04 genome. 
4. Supplementary Table 4.1. Model statistics for custom capsid Hidden Markov Models (HMMs) built from multiple sequence alignments (MSA) of capsid amino acid sequences.
5. Supplementary Table 4.2. Gene names table.
6. Supplementary Table 4.3. Table of genomes used in this study
7. Supplementary Table 4.4. BLAST statistics for consensus Repbase metaviral sequences used as proxies for ancestral metaviruses
8. Supplementary Table 4.5. Phenotypes associated with knockouts of domesticated metaviral genes in mice.
9. Supplementary Table 5.1. Traits associated with PNMA1 and PNMA4 variants in the human population.
10. Extended Data Table 6.1. Cryo-EM data collection, processing, and model validation.
11. Supplementary Table 1. Bacterial strains, phages, and plasmids used in the study
12. Supplementary Table 2. Primers used for cloning
13. Supplementary Table 3. Nucleotide sequence of the probes used for Northern blot and in primer extension
14. Table 7.1. Cryo-EM data collection, refinement, and validation statistics
15. Supplementary Table 1.
Oligonucleotides used in this study.
16. Supplementary Table 8.1. Refinement statistics for the Cas1-2/3 complex
17. Supplementary Table 8.2. Refinement statistics for Cas1-2/3 bound to dsDNA fragment

LIST OF FIGURES
1. Figure 2.1. Complete genome of Chlorella sp. SLA-04.
2. Figure 2.2. Annotation using protein structural similarity improves functional understanding of SLA-04 proteome.
3. Figure 2.3. Phylogenomic analysis of green algae performed using orthologous genes indicates that SLA-04 is distinct from other Chlorella sorokiniana species.
4. Figure 2.4. Flow of carbon from fixation to lipid synthesis in Chlorella sp. SLA-04.
5. Supplementary Figure 2.1. DNA isolation and sequencing parameters for Chlorella sp. SLA-04.
6. Supplementary Figure 2.2. Phylogenetic tree of LTR retrotransposon capsid domains places SLA-04 retrotransposons among Copia/Pseudoviridae LTR retrotransposons.
7. Supplementary Figure 2.3. Phylogenomic analysis of green algae performed using 18S rDNA.
8. Supplementary Figure 2.4.
Presence of ATP citrate lyase and sphingolipid enzymes across algal genomes.
9. Figure 3.1. Comparison of architectural features in CRISPRs and hCRISPRs.
10. Figure 3.2. Repetitive elements in the human genome do not contain architectural features consistent with CRISPR loci
11. Supplementary Figure 3.1. Previously reported CRISPR-like elements in the mitochondrial genome of a plant (Vicia faba) and the genome of a giant mimivirus (APMV)
12. Figure 4.1. Full-length capsid-like ORFs in the human genome
13. Figure 4.2. Metaviral-derived capsid genes show distinct evolutionary trajectories across placental mammals.
14. Figure 4.3. RTL3 and RTL10 exhibit domain-specific and lineage-specific patterns of conservation.
15. Figure 4.4. Evolutionary rates of domesticated metaviral genes across placental mammals and primates.
16. Figure 4.5. Four independent metavirus domestication events include structurally distinct N-terminal domains.
17. Figure 4.6. Predicted RNA-binding domain in the PNMA family and related metaviruses
18. Supplementary Figure 4.1.
Constructing Hidden Markov Models for the capsid domain of LTR retrotransposons
19. Supplementary Figure 4.2. Phylogeny of ARC proteins
20. Supplementary Figure 4.3. Phylogeny of the putative RNA-binding domain of PNMA proteins.
21. Supplementary Figure 4.4. Phylogeny of PNMA protein capsid domains.
22. Supplementary Figure 4.5. PNMA6E/F has duplicated multiple times in distinct lineages.
23. Supplementary Figure 4.6. Phylogenetic tree for SIRH/RTL family (half capsid alignment).
24. Supplementary Figure 4.7. Phylogenetic tree for SIRH/RTL family (full capsid alignment).
25. Supplementary Figure 4.8. Conservation and divergence of the capsid domain in the SIRH/RTL family.
26. Supplementary Figure 4.9. Lack of structural conservation in the N-terminal domain of the SIRH/RTL family.
27. Supplementary Figure 4.10. The evolutionary origins of the putative RNA-binding domain of the PNMA family.
28. Supplementary Figure 4.11. Human expression of domesticated metaviral capsid genes.
29. Figure 5.1. PNMA1 and PNMA4 are expressed in human ovaries.
30. Figure 5.2. PNMA1 and PNMA4 are conserved among eutherians
31. Figure 5.3. Male mice lacking Pnma1 or Pnma4 prematurely lose reproductive capacity
32. Figure 5.4. Female mice lacking Pnma1 or Pnma4 have age-dependent reproductive defects.
33. Figure 5.5. Pnma1 and Pnma4 mutants gain abdominal fat with age.
34. Figure 5.6. PNMA proteins form capsid-like structures that can exit human cells
35. Figure 5.7. Model for PNMA1 and PNMA4 function.
36. Supplementary Figure 5.1. PNMA2, PNMA3, and PNMA5 expression analysis in human ovaries.
37. Supplementary Figure 5.2. DAZL has the capacity to post-transcriptionally regulate PNMA1 and PNMA4.
38. Supplementary Figure 5.3. Diagram of Pnma1 and Pnma4 loci on M. musculus chromosome 12
39. Supplementary Figure 5.4. Defective testicular and ovarian morphology in Pnma1 and Pnma4 mutants
40. Supplementary Figure 5.5.
Analysis of undifferentiated spermatogonial cells by PLZF staining
41. Supplementary Figure 5.6. Several ovarian features are not dramatically altered in Pnma1 and Pnma4 mutants.
42. Supplementary Figure 5.7. Oocytes that enter meiosis are largely unaffected in Pnma1 and Pnma4 mutants.
43. Supplementary Figure 5.8. Single-cell analysis of Pnma1 and Pnma4 expression in aging mouse ovaries
44. Supplementary Figure 5.9. Pnma1 and Pnma4 are expressed in subpopulations of ovarian cells
45. Supplementary Figure 5.10. Mice lacking Pnma1 or Pnma4 exhibit normal neurobehavioral and muscular traits.
46. Supplementary Figure 5.11. Size exclusion chromatography of recombinant PNMA1 and PNMA4.
47. Supplementary Figure 5.12. PNMA1 and PNMA4 expressed in human cells forms capsid-sized particles.
48. Figure 6.1. PARIS is a two-component system that assembles into a propeller-shaped supramolecular complex.
49. Figure 6.2. Interaction between the AriA ATPase and AriB essential for PARIS-mediated defense.
50. Figure 6.3.
AriA interacts with Ocr in an ATPase-dependent manner, leading to the release of AriB from the PARIS complex
51. Figure 6.4. The activation of PARIS leads to cell death and inhibition of translation
52. Figure 6.5. PARIS cleaves E. coli tRNALys and T5 encodes a non-cleavable tRNALys.
53. Figure 6.6. Phylogenies of PARIS and related ABC ATPase powered defense systems
54. Extended Data Figure 6.1. AlphaFold2 structural predictions of AriA and AriB
55. Extended Data Figure 6.2. Structural comparison of PARIS homologs.
56. Extended Data Figure 6.3. Workflow for structural determination of the PARIS complex
57. Extended Data Figure 6.4. AriB toxicity is dependent on association with AriA
58. Extended Data Figure 6.5. Purification of activated AriB.
59. Extended Data Figure 6.6. Structure determination of AriA purified using a Ocr-Strep pulldown.
60. Extended Data Figure 6.7. AriA ATPase activity.
61. Extended Data Figure 6.8. PARIS activation does not result in total RNA or DNA degradation
62. Extended Data Figure 6.9. PARIS activation results in DNA compactization akin to inhibition of translation with chloramphenicol.
63. Extended Data Figure 6.10. Identification of the T5 genomic region required for infection of PARIS+ cells.
64. Extended Data Figure 6.11. tRNALys cleavage product and phylogenetics of PARIS
65. Figure 7.1. Cryo-EM structure of the type I-F CRISPR integration complex
66. Figure 7.2. Foreign DNA constrains the Cas2/3 linker against conserved Cas1 residues.
67. Figure 7.3. The Cas2 homodimer simultaneously coordinates four dsDNA helices critical to CRISPR integration.
68. Figure 7.4. Sequence motifs in the leader and IHF proteins facilitate Cas1-2/3-based integration into diverse repeat sequences.
69. Figure 7.5. I-F CRISPR integration complex suggests a mechanism for primed acquisition and Cascade impact on integration.
70. Figure 7.6. DNA is a flexible scaffold that controls DNA mobilization.
71. Extended Data Figure 7.1.
Cryo-EM sample preparation, imaging and processing for type I-F integration complex
72. Extended Data Figure 7.2. Cas1-2/3 undergoes a large structural rearrangement during integration.
73. Extended Data Figure 7.3. Conservation analysis of Cas1-2/3 residues involved in DNA binding and integration.
74. Extended Data Figure 7.4. Cas1-2/3 predominantly recognizes IR motifs, CRISPR repeat and foreign DNA through non-sequence specific interactions.
75. Extended Data Figure 7.5. Purification of structure-guided mutants of Cas1-2/3
76. Extended Data Figure 7.6. Validation of Cas1-2/3 interactions with the foreign DNA and IR motifs.
77. Extended Data Figure 7.7. PAM blocks Cas-mediated integration of foreign DNA into CRISPR repeat.
78. Extended Data Figure 7.8. Control reactions for integration assay and generation of 32P-labelled ladder.
79. Extended Data Figure 7.9. Validation of Cas1-2/3 interactions with the repeat.
80. Figure 8.1. One side of the Cas1-2/3 integrase has a positively charged foreign DNA-binding channel.
81. Figure 8.2. The Cas3 active sites are blocked in the DNA capture complex.
82. Figure 8.3. Foreign DNA triggers a conformational rearrangement that exposes leader recognition sites and preserves Cas3 inhibition
83. Figure 8.4. The PAM blocks the second transesterification by altering complex stability.
84. Figure 8.5. The repeat channel distorts the CRISPR repeat.
85. Supplementary Figure 8.1. Purification and data processing of the Cas1-2/3 capture complex.
86. Supplementary Figure 8.2 Cas1 in complex with Cas2/3 preserves dimeric architecture.
87. Supplementary Figure 8.3. Conservation of the four-bladed propeller Cas1-2/3 architecture across type I-F systems
88. Supplementary Figure 8.4. The Cas2 C-terminus determines the extent of complex interactions
89. Supplementary Figure 8.5. Positively charged foreign DNA binding residues are conserved in the Cas1-2/3 complex
90. Supplementary Figure 8.6. A conserved domain inserted into the HD active site explains nuclease inhibition
91. Supplementary Figure 8.7.
Three interfaces clamp Cas3 in place, inhibiting both interference and integration interactions
92. Supplementary Figure 8.8. Purification and data processing of the Cas1-2/3 capture complex
93. Supplementary Figure 8.9. Purification and cryoEM data analysis of PAM-containing integration complex.
94. Supplementary Figure 8.10. Positively charged residues in the repeat channel are conserved in type I-F CRISPR systems.
95. Figure A.1. Sequence to structure in 24 hours
96. Figure A.2. TxTl cryoEM produces high resolution structures

ABSTRACT

Viruses are the most abundant biological entities on the planet. Like any parasite, these genetic parasites require a host. Inside a host cell, viruses hijack the transcription and translation machinery to replicate. In response to the ubiquitous threat of viral predation, host cells have evolved an astonishing diversity of mechanisms to stop, control, and even repurpose pieces of viruses. In this dissertation, I use bioinformatic, biochemical, and structural approaches to show that this ancient conflict shapes host evolution by constantly rewriting the host genome. Research conducted during this dissertation has spanned the domains of life with three major foci. First, I have used bioinformatics to identify genetic parasites and defense systems embedded within algal and human genomes. Second, I have conducted deep evolutionary analyses to identify viral genes that have been repurposed for host function.
And third, I have used cryogenic electron microscopy to determine the structure of viral defense systems in bacteria.

CHAPTER ONE

INTRODUCTION

Viruses are the most abundant biological entities on the planet1,2. As obligate intracellular genetic parasites, viruses require a host cell, which they invade and reprogram to produce more virus. Viral progeny exit the reprogrammed host cell and remain biologically inert until they encounter a new host, at which point the viral replication cycle begins again. Since the discovery of the tobacco mosaic virus in 1898, viruses have been found to infect cells from every domain of life3–5. This selfish reproductive strategy is the most successful on the planet: there are an estimated 10 viral particles for every cell on earth1,2,6.

Viruses influence life from the cellular level to the global level. An estimated 20-40% of the ocean’s biomass is lysed every day by viral infection, a stunning illustration of how selfish genetic elements run riot in the world1. However, these estimates incompletely capture the extent to which selfish genetic elements dominate the biosphere. Some genetic parasites, the virus-like elements called transposons and retrotransposons, embed in the host genome, replicate within the cell, and insert new copies directly into the genome of their original host. A surprising finding of the Human Genome Project was that approximately half of the human genome is made up of transposons, retrotransposons, and their decaying remnants7–9. Metagenomic studies indicate that selfish genetic elements may make up as much as 8% of bacterial genes10. In eukaryotes, selfish genetic elements can make up anywhere from 3% to 92% of the genome11,12. A survey of complete genomes and metatranscriptome data found that transposase genes are by far the most abundant genes in nature13. Transposases are used by selfish genetic elements to integrate selfish DNA
Between extracellular viruses and intracellular genome parasites, selfish genetic elements exert tremendous pressure on their cellular hosts. Conserved Viral Defense Systems In response to the ubiquitous threat of selfish genetic elements, host cells across all domains of life have evolved diverse defense systems. Some defense systems are conserved from bacteria to eukaryotes and interfere with critical and universal steps in viral replication14–16. The cGAS/CD-NTase and STING pathways sense viral infection and synthesize cyclic di-nucleotide secondary messengers to activate an anti-viral response17–19. The viperin systems alter ribonucleotide pools in response to viral infection, terminating RNA synthesis20,21. Activation of gasdermins upon viral infection triggers assembly of membrane-puncturing pores leading to cell death before the replication cycle can complete22,23. Argonaute proteins use short nucleic-acid guides to target and cleave selfish RNA or DNA24,25. Toll/Interleukin-1 receptor (TIR) domains associated with pathogen-associated molecular pattern receptors deplete cellular NAD+ upon sensing a viral molecular pattern, depleting cellular energy26,27. Defense mechanisms conserved across the domains of life emphasize the ubiquity of viral parasitism. They also underscore the truth in the saying coined by the inventor of the microscope, Antoni von Leeuwenhoek, and famously paraphrased by Jacques Monod and François Jacob: “what is true for bacteria is also true for elephants.”28 Unique Viral Defense Systems However, not everything that is true for bacteria is true for elephants. Some defense systems, such as the adaptive immune system of vertebrates, have no known homologues in 3 prokaryotes. Others, such as the CRISPR-Cas and restriction modification (RM) systems, have only been found in prokaryotes29,30. 
Studying these outliers has provided significant insights into the diversity of viral defense mechanisms and yielded tools that have revolutionized medicine and molecular biology. Early studies of bacterial immune systems revealed that bacteria possess a mechanism to distinguish between self and non-self31,32. Now called restriction-modification (RM) systems, these two-component systems generally encode a methyltransferase and a nuclease. Methylation of the genome creates a distinct signature of self. Invading nucleic acids from viruses are recognized and cleaved at specific target sites by nucleases, shutting down the viral infection33,34. In a clear example of a ‘molecular arms race’35–38, viruses have evolved mechanisms to shut down these first-line defense systems. For example, the “overcoming classical restriction” (Ocr) protein jams restriction-modification systems by mimicking viral DNA and blocking the active site of the restriction endonuclease39. By blocking nuclease activity, viruses carrying Ocr circumvent this first-line defense system.

CRISPR-Cas systems are prokaryote-specific immune systems that also distinguish between self and non-self. However, a key distinguishing feature of CRISPR-Cas systems is that they are both adaptable and heritable. In CRISPR-Cas systems, clustered regularly interspaced short palindromic repeats (CRISPRs) catalogue foreign DNA to be read out by CRISPR-associated (Cas) proteins, which target and degrade complementary foreign nucleic acid40,41. Though structurally diverse, CRISPR-Cas systems generate adaptive immunity through three universally conserved steps: biogenesis, interference, and adaptation (reviewed in 42–45). In CRISPRs, conserved direct repeat sequences flank unique virally-derived sequences called spacers46–48. During biogenesis, Cas proteins process the transcribed CRISPR array into individual guide RNAs that contain a single unique spacer sequence49,50.
These guide RNAs are loaded into RNA-guided endonucleases that identify and cleave foreign nucleic acid during interference51,52. In the adaptation stage, Cas1 and Cas2 assemble into a heterohexameric complex that binds fragments of foreign nucleic acid and inserts them as new spacers through a transposition-like mechanism53.

Abundance Leads to Co-option For Defense

The Cas1 protein that mediates foreign DNA insertion derives from the transposase of an ancient selfish genetic element called a casposon51,54,55. Cas1 represents a striking example of how selfish genes may be repurposed for the host as ‘guns for hire’37,51,55. In CRISPR immunity, Cas1 performs a very similar function to its transposase ancestor, inserting short fragments of foreign DNA into the CRISPR by transposition53,56–59. However, Cas1 no longer services the replication of a selfish genetic element, but rather functions to protect the cellular host from selfish threats by providing the CRISPR interference machinery with updated guide RNAs.

Intriguingly, adaptive immunity in eukaryotes also relies on a domesticated transposase. Vertebrates possess an adaptive immune system that is capable of generating molecular memories of pathogens60. However, most of vertebrate adaptive immunity is not heritable. Memories of infections are not transmitted from generation to generation, though they are preserved for months to years in the form of memory B cells. Rather, in each individual the V(D)J recombinase generates fragments of DNA that are ligated back together to create variable antigen recognition sites in antibodies, a mechanism that is quite similar to transposition54,61. Indeed, domestication of an ancient DNA transposon gave rise to this form of eukaryotic adaptive immunity. These two striking examples of transposases domesticated for defense in prokaryotes and eukaryotes call for a deeper analysis of parasitic genes embedded within host genomes to identify parasites repurposed for host function.
Domestication Beyond Defense

In mammalian genomes, the most common selfish elements are retrotransposons, which generate new copies using a ‘copy-and-paste’ mechanism62–65. Retrotransposons are classified into two broad groups: long-terminal-repeat (LTR) retrotransposons and non-long-terminal-repeat (non-LTR) retrotransposons, which include both autonomous LINE-1 retrotransposons and non-autonomous Alu and SVA elements that rely on LINE-1 replication machinery. LTR retrotransposons (which include endogenous retroviruses, ERVs) encode structural group-specific antigen (gag) genes and enzymatic polymerase (pol) genes, flanked by LTRs. Their ‘exogenous’ retroviral counterparts also encode an envelope protein (env)66,67, which mediates the membrane fusion needed for infection and can subvert immune defenses via an immunosuppressive domain68. Multiple retroviral envelope genes have been domesticated for retroviral defense in mice69,70, humans71, sheep72, cats73,74, and chickens75. However, env genes have also been domesticated for non-defense functions as a class of genes called syncytins. Syncytin proteins mediate membrane fusion to form multinucleated syncytial trophoblast cells in the placenta76,77. Syncytin gene domestication is a remarkable example of convergence, with at least seven independent events in different mammalian orders77,78, and in unusual lineages of lizard79 and fish80.

These env domestications illustrate two key points. First, viral genes can be substrates for novel host functions beyond their viral function. Second, homologous viral domains may be repurposed for completely different functions. Gag genes of LTR retrotransposons and retroviruses encode a polyprotein that includes capsid, matrix, and nucleocapsid domains that together package the viral genome during the virion assembly process65,81.
Like env, gag has been domesticated both for retroviral restriction in mice (e.g., the Fv1 gene82–84) and for other critical host functions85–89. At least four domesticated capsid genes have been shown to assemble into capsid-like structures90–94, which perform essential functions in host reproduction87 and neuronal function90,95. One of these gag-derived genes is vertebrate Arc (Activity-Regulated, Cytoskeletal-associated), which originated from an ancient LTR retrotransposon. ARC protein forms capsid-like structures and functions in the brain to regulate learning and memory as a signaling hub and messenger RNA shuttle in neurons90,95. An independent domestication event led to dArc1 in Drosophila species, which also forms capsid-like structures and packages mRNA for intercellular neuronal signaling92,96. Together with Cas1 and the V(D)J recombinase, Arc and dArc1 elegantly demonstrate how functional homologues from selfish genetic elements have been repurposed for similar host functions across massive evolutionary distances.

Sequence and Structure

Genetic conflict plays out in the ecosystem of genomes97. Two major technological advances have facilitated the study of these ecosystems by rendering previously invisible molecular processes visible. Genome sequences contain the set of instructions that build proteins, organelles, cells, and organisms. Fred Sanger changed the world when he developed his method for efficiently sequencing DNA98, because for the first time in human history, the genetic code was readable. This has had profound implications for evolutionary biology, as it is now possible to reconstruct the evolutionary history of entire kingdoms of life from DNA sequence. It has also revolutionized our understanding of genetic conflict. The first full-length genomes ever sequenced were the MS2 viral RNA genome and the bacteriophage ΦX174 genome98,99.
The explosion of sequenced genomes has led to a corresponding explosion in the number of known genetic parasites and defense systems against those parasites100. While genomic information provides the blueprint for a biological entity, structural techniques are critical for understanding the structure and function of the machines encoded by nucleic acid instructions. Here again, viruses have played a critical role in developing a new way of seeing. Because most viruses are too small to be resolved with visible light, the development of new microscopy techniques was necessary to understand the physical basis for viral infections. Some of the first electron micrographs ever recorded were images of viral infections of bacteria101, and the first three-dimensional reconstruction from electron microscopy data was the reconstruction of a T4 bacteriophage tail102.

Dissertation Overview

This dissertation examines both sequence and structure to understand the evolutionary outcomes of genetic conflict. The first three research chapters (Chapters 2–4) deal with the assembly and annotation of defense systems, genetic parasites, and domesticated genes in two complete, telomere-to-telomere eukaryotic genomes. The fourth (Chapter 5) begins to explore the functions beyond defense of domesticated genes in placental mammals. The final three chapters (Chapters 6–8) shift into the prokaryotic realm and use a combination of bioinformatics and cryogenic electron microscopy to study the ongoing molecular arms race in bacteria.

Chapter 2 presents a telomere-to-telomere genome sequence of the alga SLA-04, derived from long-read sequencing technology. Algae are a broad class of photosynthetic eukaryotes that are phylogenetically and physiologically diverse. Most of the phylogenetic diversity has been inferred from 18S rDNA sequencing since there are only a few complete genomes available in public databases. We used ultra-long-read Nanopore sequencing to determine a gapless, telomere-to-telomere complete genome sequence of Chlorella sp.
SLA-04, previously described as Chlorella sorokiniana SLA-04. Chlorella sp. SLA-04 is a green alga that grows to high cell density in a wide variety of environments: high and neutral pH, high and low alkalinity, and high and low salinity. SLA-04’s ability to grow in high pH and high alkalinity media without external CO2 supply is favorable for large-scale algal biomass production. Phylogenetic analysis performed using ribosomal DNA and conserved protein sequences consistently reveals that Chlorella sp. SLA-04 forms a distinct lineage from other strains of Chlorella sorokiniana. We complement traditional genome annotation methods with high-throughput structural predictions and demonstrate that this approach expands functional prediction of the SLA-04 proteome. Genomic analysis of the SLA-04 genome identifies the genes capable of utilizing TCA cycle intermediates to replenish cytosolic acetyl-CoA pools for lipid production. We also identify a complete metabolic pathway for sphingolipid anabolism that may allow SLA-04 to readily adapt to changing environmental conditions and facilitate robust cultivation in mass production systems. Collectively, this work clarifies the phylogeny of Chlorella sp. SLA-04 within Trebouxiophyceae and demonstrates how structural predictions can be used to improve annotation beyond sequence-based methods.

Chapter 3 presents an analysis of the first human telomere-to-telomere genome in search of CRISPR-Cas adaptive immune systems. CRISPR-Cas systems are found in about 40% of bacterial and 85% of archaeal genomes, but to date they have not been found in eukaryotic genomes. Recently, a paper published in Communications Biology reported the identification of 12,572 putative CRISPRs in the human genome, which the authors call “hCRISPR”.
We attempted to reproduce this analysis using the telomere-to-telomere human genome and show that repetitive elements identified as putative CRISPR loci contain neither the repeat-spacer-repeat architecture nor the cas genes characteristic of functional CRISPR systems.

Chapter 4 presents an in-depth evolutionary analysis of viral genes domesticated for host function in placental mammals. Using a complete human genome assembly and 25 additional vertebrate genomes, we re-analyzed the evolutionary trajectories and functional potential of capsid genes domesticated from Metaviridae, a lineage of retrovirus-like retrotransposons. Our study expands on previous analyses to unearth several new insights about the evolutionary histories of these ancient genes. We find that at least five independent domestication events occurred from diverse Metaviridae, giving rise to three universally retained single-copy genes evolving under purifying selection and two gene families unique to placental mammals with multiple members showing evidence of rapid evolution. In the SIRH/RTL family, we find diverse amino-terminal domains, widespread loss of protein-coding capacity in RTL10 despite its retention in several mammalian lineages, and differential utilization of an ancient programmed ribosomal frameshift in RTL3 between the domesticated capsid and protease domains. Our analyses also reveal that most members of the PNMA family in mammalian genomes encode a conserved putative amino-terminal RNA-binding domain, found both adjoining and independent of domesticated capsid domains. Our analyses lead to a significant correction of previous annotations of the essential CCDC8 gene. We show that this putative RNA-binding domain is also present in several extant Metaviridae, revealing a novel protein domain configuration in retrotransposons. Collectively, our study reveals the divergent outcomes of multiple domestication events from diverse Metaviridae in the common ancestor of placental mammals.
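To make the notion of lost protein-coding capacity in the Chapter 4 summary concrete, the following is a minimal sketch of a naive open-reading-frame check. It is not the method used in Chapter 4. The function names (`translate`, `coding_capacity`, `is_intact`) and the toy sequences are hypothetical illustrations, and the sketch assumes the standard genetic code and a single uninterrupted reading frame.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in frame 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def coding_capacity(dna):
    """Report simple features of a candidate coding sequence (hypothetical helper)."""
    protein = translate(dna)
    return {
        "length_multiple_of_3": len(dna) % 3 == 0,
        "starts_with_atg": dna.upper().startswith("ATG"),
        "internal_stops": protein[:-1].count("*"),  # stops before the final codon
        "ends_with_stop": protein.endswith("*"),
    }


def is_intact(dna):
    """True if the sequence looks like one uninterrupted open reading frame."""
    report = coding_capacity(dna)
    return (
        report["length_multiple_of_3"]
        and report["starts_with_atg"]
        and report["internal_stops"] == 0
        and report["ends_with_stop"]
    )


# Toy sequences for illustration only; these are not real gene sequences.
intact = "ATGGCTAAATGA"            # ATG GCT AAA TGA
premature_stop = "ATGGCTTAAAAATGA"  # ATG GCT TAA AAA TGA
frameshifted = "ATGGCAAATGA"        # one base deleted, length no longer a multiple of 3

for name, seq in [("intact", intact), ("premature_stop", premature_stop),
                  ("frameshifted", frameshifted)]:
    print(name, translate(seq), is_intact(seq))
```

A check this simple would wrongly flag a gene that relies on a programmed ribosomal frameshift, such as the one described for RTL3, as disrupted. Inferring real loss of coding capacity therefore requires alignment against intact orthologs rather than a single-frame scan.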
Chapter 5 demonstrates that two PNMA family genes, PNMA1 and PNMA4, support reproductive capacity during aging. Analysis of donated human ovaries shows that expression of both genes declines normally with age, while several PNMA1 and PNMA4 variants identified in genome-wide association studies are causally associated with low testosterone, altered puberty onset, or obesity. Six-week-old mice lacking either Pnma1 or Pnma4 are indistinguishable from wild-type littermates, but by six months the mutant mice become prematurely subfertile, with precipitous drops in sex hormone levels, gonadal atrophy, and abdominal obesity; overall they produce markedly fewer offspring than controls. These findings expand our understanding of factors that maintain human reproductive health and lend insight into the functional consequence of the domestication of retrotransposon-derived genes.

Chapter 6 explains how viruses compete for limited cellular resources by delivering defense mechanisms that protect the host from competing genetic parasites1. PARIS is a defense system, often encoded in viral genomes, that is composed of a 55 kDa ABC ATPase (AriA) and a 35 kDa TOPRIM nuclease (AriB)2. However, the mechanism by which AriA and AriB function in phage defense is unknown. Here we show that AriA and AriB assemble into a 425 kDa supramolecular immune complex. We use cryo-EM to determine the structure of this complex, which explains how six molecules of AriA assemble into a propeller-shaped scaffold that coordinates three subunits of AriB. ATP-dependent detection of foreign proteins triggers the release of AriB, which assembles into a homodimeric nuclease that blocks infection by cleaving host tRNALys. Phage T5 subverts PARIS immunity through expression of a tRNALys variant that is not cleaved by PARIS, and thereby restores viral infection. Collectively, these data explain how AriA functions as an ATP-dependent sensor that detects viral proteins and activates the AriB toxin.
PARIS is one of an emerging set of immune systems that form macromolecular complexes for the recognition of foreign proteins, rather than foreign nucleic acids3.

Chapter 7 describes how bacteria and archaea acquire resistance to viruses and plasmids by integrating fragments of foreign DNA into the first repeat of a CRISPR array. The mechanism of site-specific integration in diverse CRISPR systems remains poorly understood. Here, we determine a 560 kDa integration complex structure that explains how Pseudomonas aeruginosa Cas (Cas1-2/3) and non-Cas proteins (integration host factor) fold 150 base-pairs of host DNA into a U-shaped bend and a loop that protrude from Cas1-2/3 at right angles. The U-shaped bend traps foreign DNA on one face of the Cas1-2/3 integrase, while the loop places the first CRISPR repeat in the Cas1 active site. Both Cas3s rotate 100 degrees to expose DNA binding sites on either side of the Cas2 homodimer, each of which binds an inverted repeat motif in the leader. Leader sequence motifs direct Cas1-2/3-mediated integration to diverse repeat sequences that have a 5′-GT. Collectively, this work reveals new DNA binding surfaces on Cas2 that are critical for DNA folding and site-specific delivery of foreign DNA.

In Chapter 8, I expand on the work of Chapter 7 by purifying the Cas1-2/3 fusion of Pseudomonas aeruginosa and determining multiple cryoEM structures at distinct stages of CRISPR adaptation. Collectively, these structures reveal a prominent channel on one face of the integration complex that captures short fragments of foreign DNA. Foreign DNA binding triggers a series of conformational changes that reposition Cas3, exposing the Cas1 active sites and new DNA binding surfaces that are necessary for homing the DNA-bound integrase to the CRISPR locus. Upon docking to the CRISPR, the integrase catalyzes two sequential transesterification reactions that covalently link a discrete fragment of foreign dsDNA to the first repeat of the CRISPR locus.
Taken together, these cryoEM structures clarify how the Cas1-2/3 fusion proteins orchestrate foreign DNA capture, site-specific delivery, and integration of new DNA into the bacterial genome.

References

1. Suttle, C. A. Viruses in the sea. Nature 437, 356–361 (2005).
2. Hendrix, R. W., Smith, M. C. M., Burns, R. N., Ford, M. E. & Hatfull, G. F. Evolutionary relationships among diverse bacteriophages and prophages: All the world’s a phage. Proceedings of the National Academy of Sciences 96, 2192–2197 (1999).
3. Torsvik, T. & Dundas, I. D. Bacteriophage of Halobacterium salinarium. Nature 248, 680–681 (1974).
4. d’Herelle, F. Sur un microbe invisible antagoniste des bacilles dysentériques. Comptes rendus de l’Académie des sciences 165, 373–375 (1917).
5. Beijerinck, M. W. On a Contagium vivum fluidum causing the Spotdisease of the Tobacco-leaves. Koninklijke Nederlandse Akademie van Wetenschappen Proceedings Series B Physical Sciences 1, 170–176 (1898).
6. Suttle, C. A. The significance of viruses to mortality in aquatic microbial communities. Microb Ecol 28, 237–243 (1994).
7. Venter, J. C. et al. The Sequence of the Human Genome. Science 291, 1304–1351 (2001).
8. Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
9. Li, W.-H., Gu, Z., Wang, H. & Nekrutenko, A. Evolutionary analyses of the human genome. Nature 409, 847–849 (2001).
10. Brazelton, W. J. & Baross, J. A. Abundant transposases encoded by the metagenome of a hydrothermal chimney biofilm. The ISME Journal 3, 1420–1424 (2009).
11. Shao, C. et al. The enormous repetitive Antarctic krill genome reveals environmental adaptations and population insights. Cell 186, 1279-1294.e19 (2023).
12. Kim, J. M., Vanguri, S., Boeke, J. D., Gabriel, A. & Voytas, D. F. Transposable elements and genome organization: a comprehensive survey of retrotransposons revealed by the complete Saccharomyces cerevisiae genome sequence. Genome Res 8, 464–478 (1998).
13. Aziz, R.
K., Breitbart, M. & Edwards, R. A. Transposases are the most abundant, most ubiquitous genes in nature. Nucleic Acids Research 38, 4207–4217 (2010).
14. Culbertson, E. M. & Levin, T. C. Eukaryotic CD-NTase, STING, and viperin proteins evolved via domain shuffling, horizontal transfer, and ancient inheritance from prokaryotes. PLOS Biology 21, e3002436 (2023).
15. Wein, T. & Sorek, R. Bacterial origins of human cell-autonomous innate immune mechanisms. Nat Rev Immunol 22, 629–638 (2022).
16. Ledvina, H. E. & Whiteley, A. T. Conservation and similarity of bacterial and eukaryotic innate immunity. Nat Rev Microbiol 22, 420–434 (2024).
17. Whiteley, A. T. et al. Bacterial cGAS-like enzymes synthesize diverse nucleotide signals. Nature 567, 194–199 (2019).
18. Wu, J. & Chen, Z. J. Innate Immune Sensing and Signaling of Cytosolic Nucleic Acids. Annual Review of Immunology 32, 461–488 (2014).
19. Wu, J. et al. Cyclic GMP-AMP Is an Endogenous Second Messenger in Innate Immune Signaling by Cytosolic DNA. Science 339, 826–830 (2013).
20. Bernheim, A. et al. Prokaryotic viperins produce diverse antiviral molecules. Nature 589, 120–124 (2021).
21. Rivera-Serrano, E. E. et al. Viperin Reveals Its True Function. Annual Review of Virology 7, 421–446 (2020).
22. Broz, P., Pelegrín, P. & Shao, F. The gasdermins, a protein family executing cell death and inflammation. Nat Rev Immunol 20, 143–157 (2020).
23. Johnson, A. G. et al. Bacterial gasdermins reveal an ancient mechanism of cell death. Science 375, 221–225 (2022).
24. Zaremba, M. et al. Short prokaryotic Argonautes provide defence against incoming mobile genetic elements through NAD+ depletion. Nat Microbiol 7, 1857–1869 (2022).
25. Anzelon, T. A. et al. Structural basis for piRNA targeting. Nature 597, 285–289 (2021).
26. Ofir, G. et al. Antiviral activity of bacterial TIR domains via immune signalling molecules. Nature 600, 116–120 (2021).
27. Bayless, A. M. et al.
Plant and prokaryotic TIR domains generate distinct cyclic ADPR NADase products. Science Advances 9, eade8487 (2023).
28. Lane, N. The unseen world: reflections on Leeuwenhoek (1677) ‘Concerning little animals’. Philos Trans R Soc Lond B Biol Sci 370, 20140344 (2015).
29. Koonin, E. V. Evolution of RNA- and DNA-guided antivirus defense systems in prokaryotes and eukaryotes: common ancestry vs convergence. Biol Direct 12, 5 (2017).
30. Dimitriu, T., Szczelkun, M. D. & Westra, E. R. Evolutionary Ecology and Interplay of Prokaryotic Innate and Adaptive Immune Systems. Current Biology 30, R1189–R1202 (2020).
31. Arber, W., Hattman, S. & Dussoix, D. On the host-controlled modification of bacteriophage λ. Virology 21, 30–35 (1963).
32. Arber, W. Host specificity of DNA produced by Escherichia coli. Journal of Molecular Biology 11, 247–256 (1965).
33. Loenen, W. A. M., Dryden, D. T. F., Raleigh, E. A., Wilson, G. G. & Murray, N. E. Highlights of the DNA cutters: a short history of the restriction enzymes. Nucleic Acids Research 42, 3–19 (2014).
34. Bickle, T. A. & Krüger, D. H. Biology of DNA restriction. Microbiol Rev 57, 434–450 (1993).
35. Elde, N. C., Child, S. J., Geballe, A. P. & Malik, H. S. Protein kinase R reveals an evolutionary model for defeating viral mimicry. Nature 457, 485–489 (2009).
36. Elde, N. C. & Malik, H. S. The evolutionary conundrum of pathogen mimicry. Nature Reviews Microbiology 7, 787–797 (2009).
37. Koonin, E. V., Makarova, K. S., Wolf, Y. I. & Krupovic, M. Evolutionary entanglement of mobile genetic elements and host defence systems: guns for hire. Nature Reviews Genetics 21, 119–131 (2020).
38. Bernheim, A. & Sorek, R. The pan-immune system of bacteria: antiviral defence as a community resource. Nat Rev Microbiol 18, 113–119 (2020).
39. Krüger, D. H., Reuter, M., Hansen, S. & Schroeder, C. Influence of phage T3 and T7 gene functions on a type III (EcoP1) DNA restriction-modification system in vivo. Mol Gen Genet 185, 457–461 (1982).
40.
Lee, H. & Sashital, D. G. Creating memories: molecular mechanisms of CRISPR adaptation. Trends in Biochemical Sciences 47, 464–476 (2022).
41. Koonin, E. V. & Makarova, K. S. Origins and evolution of CRISPR-Cas systems. Philos Trans R Soc Lond B Biol Sci 374, 20180087 (2019).
42. Makarova, K. S. et al. Evolutionary classification of CRISPR–Cas systems: a burst of class 2 and derived variants. Nat Rev Microbiol 18, 67–83 (2020).
43. Jackson, S. A. et al. CRISPR-Cas: Adapting to change. Science 356, eaal5056 (2017).
44. Mohanraju, P. et al. Diverse evolutionary roots and mechanistic variations of the CRISPR-Cas systems. Science 353, aad5147 (2016).
45. Sorek, R., Lawrence, C. M. & Wiedenheft, B. CRISPR-Mediated Adaptive Immune Systems in Bacteria and Archaea. Annual Review of Biochemistry 82, 237–266 (2013).
46. Bolotin, A., Quinquis, B., Sorokin, A. & Ehrlich, S. D. Clustered regularly interspaced short palindrome repeats (CRISPRs) have spacers of extrachromosomal origin. Microbiology 151, 2551–2561 (2005).
47. Mojica, F. J. M., Díez-Villaseñor, C., Soria, E. & Juez, G. Biological significance of a family of regularly spaced repeats in the genomes of Archaea, Bacteria and mitochondria. Molecular Microbiology 36, 244–246 (2000).
48. Jansen, R., van Embden, J. D. A., Gaastra, W. & Schouls, L. M. Identification of genes that are associated with DNA repeats in prokaryotes. Molecular Microbiology 43, 1565–1575 (2002).
49. Mojica, F. J. M., Díez-Villaseñor, C., García-Martínez, J. & Almendros, C. Short motif sequences determine the targets of the prokaryotic CRISPR defence system. Microbiology 155, 733–740 (2009).
50. Brouns, S. J. J. et al. Small CRISPR RNAs Guide Antiviral Defense in Prokaryotes. Science 321, 960–964 (2008).
51. Krupovic, M., Makarova, K. S., Forterre, P., Prangishvili, D. & Koonin, E. V. Casposons: a new superfamily of self-synthesizing DNA transposons at the origin of prokaryotic CRISPR-Cas immunity. BMC Biology 12, 36 (2014).
52.
Haft, D. H., Selengut, J., Mongodin, E. F. & Nelson, K. E. A Guild of 45 CRISPR-Associated (Cas) Protein Families and Multiple CRISPR/Cas Subtypes Exist in Prokaryotic Genomes. PLOS Computational Biology 1, e60 (2005).
53. Nuñez, J. K., Lee, A. S. Y., Engelman, A. & Doudna, J. A. Integrase-mediated spacer acquisition during CRISPR–Cas adaptive immunity. Nature 519, 193–198 (2015).
54. Koonin, E. V. & Krupovic, M. Evolution of adaptive immunity from transposable elements combined with innate immune systems. Nat Rev Genet 16, 184–192 (2015).
55. Krupovic, M., Béguin, P. & Koonin, E. V. Casposons: mobile genetic elements that gave rise to the CRISPR-Cas adaptation machinery. Current Opinion in Microbiology 38, 36–43 (2017).
56. Wright, A. V. et al. Structures of the CRISPR genome integration complex. Science 357, 1113–1118 (2017).
57. Santiago-Frangos, A., Buyukyoruk, M., Wiegand, T., Krishna, P. & Wiedenheft, B. Distribution and phasing of sequence motifs that facilitate CRISPR adaptation. Current Biology 31, 3515-3524.e6 (2021).
58. Fagerlund, R. D. et al. Spacer capture and integration by a type I-F Cas1–Cas2-3 CRISPR adaptation complex. Proceedings of the National Academy of Sciences 114, E5122–E5128 (2017).
59. Wiegand, T. et al. Reproducible Antigen Recognition by the Type I-F CRISPR-Cas System. The CRISPR Journal 3, 378–387 (2020).
60. Flajnik, M. F. & Kasahara, M. Origin and evolution of the adaptive immune system: genetic events and selective pressures. Nat Rev Genet 11, 47–59 (2010).
61. Liu, C., Zhang, Y., Liu, C. C. & Schatz, D. G. Structural insights into the evolution of the RAG recombinase. Nat Rev Immunol 22, 353–370 (2022).
62. Krupovic, M. et al. Ortervirales: New Virus Order Unifying Five Families of Reverse-Transcribing Viruses. Journal of Virology 92, e00515-18 (2018).
63. Finnegan, D. J. Retrotransposons. Curr. Biol. 22, R432–R437 (2012).
64. Burns, K. H. & Boeke, J. D. Human Transposon Tectonics. Cell 149, 740–752 (2012).
65. Dodonova, S.
O., Prinz, S., Bilanchone, V., Sandmeyer, S. & Briggs, J. A. G. Structure of the Ty3/Gypsy retrotransposon capsid and the evolution of retroviruses. PNAS 116, 10048–10057 (2019).
66. Doolittle, R. F., Johnson, M. S. & McClure, M. A. Origins and Evolutionary Relationships of Retroviruses. The Quarterly Review of Biology 64, 1–30 (1989).
67. Hayward, A. Origin of the retroviruses: when, where, and how? Current Opinion in Virology 25, 23–27 (2017).
68. Ashkenazi, A., Faingold, O. & Shai, Y. HIV-1 fusion protein exerts complex immunosuppressive effects. Trends in Biochemical Sciences 38, 345–349 (2013).
69. Ikeda, H., Laigret, F., Martin, M. A. & Repaske, R. Characterization of a molecularly cloned retroviral sequence associated with Fv-4 resistance. J Virol 55, 768–777 (1985).
70. Taylor, G. M., Gao, Y. & Sanders, D. A. Fv-4: identification of the defect in Env and the mechanism of resistance to ecotropic murine leukemia virus. J Virol 75, 11244–11248 (2001).
71. Frank, J. A. et al. Evolution and antiviral activity of a human protein of retroviral origin. Science 378, 422–428 (2022).
72. Varela, M., Spencer, T. E., Palmarini, M. & Arnaud, F. Friendly Viruses. Annals of the New York Academy of Sciences 1178, 157–172 (2009).
73. Ito, J. et al. Refrex-1, a Soluble Restriction Factor against Feline Endogenous and Exogenous Retroviruses. Journal of Virology 87, 12029–12040 (2013).
74. Ito, J., Baba, T., Kawasaki, J. & Nishigaki, K. Ancestral Mutations Acquired in Refrex-1, a Restriction Factor against Feline Retroviruses, during its Cooption and Domestication. Journal of Virology 90, 1470–1485 (2016).
75. Robinson, H. L., Astrin, S. M., Senior, A. M. & Salazar, F. H. Host Susceptibility to endogenous viruses: defective, glycoprotein-expressing proviruses interfere with infections. Journal of Virology 40, 745–751 (1981).
76. Kim, F. J., Battini, J.-L., Manel, N. & Sitbon, M. Emergence of vertebrate retroviruses and envelope capture. Virology 318, 183–191 (2004).
77.
Dupressoir, A., Lavialle, C. & Heidmann, T. From ancestral infectious retroviruses to bona fide cellular genes: role of the captured syncytins in placentation. Placenta 33, 663–671 (2012).
78. Lavialle, C. et al. Paleovirology of ‘syncytins’, retroviral env genes exapted for a role in placentation. Philosophical Transactions of the Royal Society B: Biological Sciences 368, 20120507 (2013).
79. Cornelis, G. et al. An endogenous retroviral envelope syncytin and its cognate receptor identified in the viviparous placental Mabuya lizard. Proceedings of the National Academy of Sciences 114, E10991–E11000 (2017).
80. Henzy, J. E., Gifford, R. J., Kenaley, C. P. & Johnson, W. E. An Intact Retroviral Gene Conserved in Spiny-Rayed Fishes for over 100 My. Molecular Biology and Evolution 34, 634–639 (2017).
81. Olson, E. D. & Musier-Forsyth, K. Retroviral Gag protein - RNA interactions: Implications for specific genomic RNA packaging and virion assembly. Semin Cell Dev Biol 86, 129–139 (2019).
82. Bénit, L. et al. Cloning of a new murine endogenous retrovirus, MuERV-L, with strong similarity to the human HERV-L element and with a gag coding sequence closely related to the Fv1 restriction gene. J Virol 71, 5652–5657 (1997).
83. Yap, M. W., Colbeck, E., Ellis, S. A. & Stoye, J. P. Evolution of the Retroviral Restriction Gene Fv1: Inhibition of Non-MLV Retroviruses. PLOS Pathogens 10, e1003968 (2014).
84. Young, G. R., Yap, M. W., Michaux, J. R., Steppan, S. J. & Stoye, J. P. Evolutionary journey of the retroviral restriction gene Fv1. PNAS 115, 10130–10135 (2018).
85. Brandt, J., Veith, A.-M. & Volff, J.-N. A family of neofunctionalized Ty3/gypsy retrotransposon genes in mammalian genomes. Cytogenet. Genome Res. 110, 307–317 (2005).
86. Campillos, M., Doerks, T., Shah, P. K. & Bork, P. Computational characterization of multiple Gag-like human proteins. Trends in Genetics 22, 585–589 (2006).
87. Ono, R. et al.
Deletion of Peg10, an imprinted gene acquired from a retrotransposon, causes early embryonic lethality. Nat Genet 38, 101–106 (2006).
88. Sekita, Y. et al. Role of retrotransposon-derived imprinted gene, Rtl1, in the feto-maternal interface of mouse placenta. Nat Genet 40, 243–248 (2008).
89. Kokošar, J. & Kordiš, D. Genesis and regulatory wiring of retroelement-derived domesticated genes: a phylogenomic perspective. Mol Biol Evol 30, 1015–1031 (2013).
90. Pastuzyn, E. D. et al. The Neuronal Gene Arc Encodes a Repurposed Retrotransposon Gag Protein that Mediates Intercellular RNA Transfer. Cell 172, 275-288.e18 (2018).
91. Abed, M. et al. The Gag protein PEG10 binds to RNA and regulates trophoblast stem cell lineage specification. PLOS ONE 14, e0214110 (2019).
92. Erlendsson, S. et al. Structures of virus-like capsids formed by the Drosophila neuronal Arc proteins. Nat Neurosci 1–4 (2020) doi:10.1038/s41593-019-0569-y.
93. Segel, M. et al. Mammalian retrovirus-like protein PEG10 packages its own mRNA and can be pseudotyped for mRNA delivery. Science 373, 882–889 (2021).
94. Xu, J. et al. PNMA2 forms immunogenic non-enveloped virus-like capsids associated with paraneoplastic neurological syndrome. Cell 187, 831-845.e19 (2024).
95. Nikolaienko, O., Patil, S., Eriksen, M. S. & Bramham, C. R. Arc protein: a flexible hub for synaptic plasticity and cognition. Seminars in Cell & Developmental Biology 77, 33–42 (2018).
96. Ashley, J. et al. Retrovirus-like Gag Protein Arc1 Binds RNA and Traffics across Synaptic Boutons. Cell 172, 262-274.e11 (2018).
97. Brookfield, J. F. Y. The ecology of the genome — mobile DNA elements and their hosts. Nat Rev Genet 6, 128–136 (2005).
98. Sanger, F., Nicklen, S. & Coulson, A. R. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74, 5463–5467 (1977).
99. Fiers, W. et al. Complete nucleotide sequence of bacteriophage MS2 RNA: primary and secondary structure of the replicase gene. Nature 260, 500–507 (1976).
100. Doron, S. et al. Systematic discovery of antiphage defense systems in the microbial pangenome. Science 359, eaar4120 (2018).
101. Luria, S. E. & Anderson, T. F. The Identification and Characterization of Bacteriophages with the Electron Microscope. Proc Natl Acad Sci U S A 28, 127-130.1 (1942).
102. De Rosier, D. J. & Klug, A. Reconstruction of Three Dimensional Structures from Electron Micrographs. Nature 217, 130–134 (1968).

CHAPTER TWO

GENOME SEQUENCE, PHYLOGENETIC ANALYSIS, AND STRUCTURE BASED ANNOTATION REVEALS METABOLIC POTENTIAL OF CHLORELLA SP. SLA-04

Contribution of Authors and Co-Authors

Manuscript in Chapter 2

Author: Calvin L.C. Goemann
Contributions: Conceptualization, methodology, investigation & data collection, genomics & bioinformatics analysis, writing – original draft, writing – review & editing

Co-Author: Royce Wilkinson
Contributions: Conceptualization, methodology, investigation & data collection, writing – review & editing

Co-Author: William Henriques
Contributions: Genomics & bioinformatics analysis, writing – review & editing

Co-Author: Huyen Bui
Contributions: Genomics & bioinformatics analysis, writing – review & editing

Co-Author: Hannah M. Goemann
Contributions: Investigation & data collection, genomics & bioinformatics analysis, writing – review & editing

Co-Author: Ross P. Carlson
Contributions: Conceptualization, writing – review & editing

Co-Author: Sridhar Viamajala
Contributions: Conceptualization, writing – review & editing

Co-Author: Robin Gerlach
Contributions: Conceptualization, writing – review & editing

Co-Author: Blake Wiedenheft
Contributions: Conceptualization, methodology, writing – original draft, writing – review & editing

Manuscript Information

Calvin L. C. Goemann, Royce Wilkinson, William Henriques, Huyen Bui, Hannah M. Goemann, Ross P.
Carlson, Sridhar Viamajala, Robin Gerlach, Blake Wiedenheft

Algal Research

Status of Manuscript:
☐ Prepared for submission to a peer-reviewed journal
☐ Officially submitted to a peer-reviewed journal
☐ Accepted by a peer-reviewed journal
☑ Published in a peer-reviewed journal

Elsevier
Volume 69
January 2023
Submitted: 17 February 2022
Published online: 13 December 2022
https://doi.org/10.1016/j.algal.2022.102943

Abstract

Algae are a broad class of photosynthetic eukaryotes that are phylogenetically and physiologically diverse. Most of the phylogenetic diversity has been inferred from 18S rDNA sequencing since there are only a few complete genomes available in public databases. Here we use ultra-long-read Nanopore sequencing to determine a gapless, telomere-to-telomere complete genome sequence of Chlorella sp. SLA-04, previously described as Chlorella sorokiniana SLA-04. Chlorella sp. SLA-04 is a green alga that grows to high cell density in a wide variety of environments: high and neutral pH, high and low alkalinity, and high and low salinity. SLA-04’s ability to grow in high pH and high alkalinity media without external CO2 supply is favorable for large-scale algal biomass production. Phylogenetic analysis performed using ribosomal DNA and conserved protein sequences consistently reveals that Chlorella sp. SLA-04 forms a distinct lineage from other strains of Chlorella sorokiniana. We complement traditional genome annotation methods with high-throughput structural predictions and demonstrate that this approach expands functional prediction of the SLA-04 proteome. Genomic analysis of the SLA-04 genome identifies the genes capable of utilizing TCA cycle intermediates to replenish cytosolic acetyl-CoA pools for lipid production.
We also identify a complete metabolic pathway for sphingolipid anabolism that may allow SLA-04 to readily adapt to changing environmental conditions and facilitate robust cultivation in mass production systems. Collectively, this work clarifies the phylogeny of Chlorella sp. SLA-04 within Trebouxiophyceae and demonstrates how structural predictions can be used to improve annotation beyond sequence-based methods.

Introduction

Algae are diverse photosynthetic eukaryotes that assimilate six times more nitrogen than terrestrial plants and account for 45% of global oxygenic photosynthesis and 50% of total CO2 fixation1–4. The phylogenetic and physiological diversity of algae is mirrored by the diversity of aquatic ecosystems where algae are found (e.g., temperatures of 0–56 °C, pH 0–11.5, and salt concentrations of 0.02–3 M)5,6. However, it is currently unclear how selective pressures from disparate aquatic environments shape algal genotypes and corresponding phenotypes. Genomic analysis of algae across different environments is essential to understand how algae survive and adapt to changing conditions. As of September 2022, there are 170 green algae genomes in the National Center for Biotechnology Information (NCBI) database. Most of these genomes are highly fragmented and only 16 contain defined chromosome sequences. This resource limitation restricts efforts to confidently identify genotypes that drive functional diversity. Understanding the phylogenetic and functional diversity of algae will benefit from a broader collection of complete algal genomes that can be used for comparative analyses.

To date, most algal genomes have been sequenced using short-read sequencing platforms7,8. While these platforms are accurate, PCR bias and short read lengths frequently result