Scholarship & Research
Permanent URI for this community: https://scholarworks.montana.edu/handle/1/1
Search Results
3 results
Item: Exploring the feasibility of an automated biocuration pipeline for research domain criteria
Montana State University - Bozeman, College of Engineering, 2019
Anani, Mohammad; Chairperson, Graduate Committee: Indika Kahanda

Research on mental disorders has largely been based on manuals such as the ICD-10 (International Classification of Diseases) and the DSM-V (Diagnostic and Statistical Manual of Mental Disorders), which classify disorders by their signs and symptoms. However, this approach tends to overlook the underlying mechanisms of brain disorders and does not capture the heterogeneity of those conditions. The National Institute of Mental Health (NIMH) therefore introduced a new framework for mental illness research, the Research Domain Criteria (RDoC). RDoC draws on units of analysis ranging from genetics to neural circuits to support an accurate, multi-dimensional classification of mental illnesses. The framework is updated manually in periodic workshops, in which domain experts search the literature for relevant evidence. Given the large volume of relevant biomedical research, a method that automates the extraction of evidence from the biomedical literature to assist with curation of the RDoC matrix is key. In this thesis, we formulate three tasks that an automated biocuration pipeline for RDoC would require: (1) labeling biomedical articles with RDoC constructs, (2) retrieving brain research articles, and (3) extracting relevant data from these articles. We model the first task as a multilabel classification problem over the 26 RDoC constructs, using a gold-standard dataset of annotated PubMed abstracts and a variety of supervised classification algorithms. The second task identifies PubMed abstracts relevant to brain research, training a model on the data from the first task together with additional unlabeled abstracts. For the third task, we extract Problem, Intervention, Comparison, and Outcomes (PICO) elements and brain region mentions from a subset of the RDoC abstracts. To the best of our knowledge, this is the first study aimed at automated data extraction and retrieval of RDoC-related literature. The results of automating these tasks are promising: we obtain a highly accurate multilabel classification model, a good retrieval model, and an accurate brain region extraction model.
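The first task described above is a standard multilabel text-classification setup. The following is a minimal sketch of that setup, assuming placeholder abstracts, invented construct labels, and a generic TF-IDF plus one-vs-rest baseline; it is not the gold-standard dataset or the exact pipeline evaluated in the thesis.

```python
# Minimal multilabel-classification sketch: PubMed-style abstracts -> RDoC-style constructs.
# Abstracts, labels, and the model choice are illustrative placeholders only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical training data: each abstract may carry several construct labels.
abstracts = [
    "Amygdala activation during fear conditioning in adolescents ...",
    "Reward anticipation and striatal dopamine signalling ...",
    "Working memory load and prefrontal cortex recruitment ...",
]
labels = [["acute_threat"], ["reward_anticipation"], ["working_memory"]]

binarizer = MultiLabelBinarizer()        # maps label sets to a 0/1 indicator matrix
y = binarizer.fit_transform(labels)

# One binary classifier per construct over TF-IDF features -- a common baseline.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(abstracts, y)

new_abstract = ["Sustained attention deficits and threat responses in anxiety ..."]
predicted = binarizer.inverse_transform(model.predict(new_abstract))
print(predicted)  # predicted construct labels; output here is illustrative only
```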
Item: High-dimensional data indexing with applications
Montana State University - Bozeman, College of Engineering, 2015
Schuh, Michael Arthur; Chairperson, Graduate Committee: Rockford J. Ross; Rafal A. Angryk (co-chair)

The indexing of high-dimensional data remains a challenging task in an active and storied area of computer science research that touches many far-reaching applications. At the crossroads of databases and machine learning, modern data indexing enables information retrieval capabilities that would otherwise be impractical or nearly impossible to attain. One such retrieval task in our increasingly data-driven world is the k-nearest neighbor (k-NN) search, which returns the k items in a dataset most similar to a given query. While the k-NN concept was popularized in everyday use through the sorted (ranked) results of text-based search engines such as Google, multimedia applications are rapidly becoming the new frontier of this research.

This dissertation advances the state of high-dimensional data indexing with a novel index named ID* ("ID Star"). Based on extensive theoretical and empirical analyses, we discuss the challenges associated with high-dimensional data and identify several shortcomings of existing indexing approaches and methodologies. By further mitigating the negative effects of the curse of dimensionality, we push the boundary of effective k-NN retrieval to a higher number of dimensions over much larger volumes of data. As the foundation of the ID* index, we developed an open-source, extensible, distance-based indexing framework built on the basic concepts of the popular iDistance index, which uses an internal B+-tree for efficient one-dimensional data indexing. Through several new heuristic-guided algorithmic improvements and hybrid indexing extensions, we show that the ID* index can perform significantly better than several popular alternative indexing techniques over a wide variety of synthetic and real-world data. We also present applications of the ID* index through k-NN queries in Content-Based Image Retrieval (CBIR) systems and machine learning classification, with an emphasis on the NASA-sponsored interdisciplinary research goal of developing a CBIR system for large-scale solar image repositories. Since such applications rely on fast and effective k-NN queries over increasingly large-scale, high-dimensional datasets, an efficient data indexing strategy such as the ID* index is imperative.

Item: FP-Tree motivated system for information retrieval using an abstraction path-based inverted index
Montana State University - Bozeman, College of Engineering, 2009
McAllister, Richard Arthur; Chairperson, Graduate Committee: Rafal A. Angryk

Language ontologies provide an avenue for automated lexical analysis that can be used alongside existing information retrieval methods so that lexicographic relationships are considered in searches. This paper presents a method of information retrieval that uses WordNet, a database of lexical ontologies, to generate paths of abstraction via lexicographic relationships and uses these relationships as the basis for an inverted index for retrieving documents from an indexed corpus. We present this method as an entree to a line of research that uses lexical ontologies to perform graph analysis of documents and, through this process, improve the precision of existing information retrieval techniques.
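The indexing dissertation above builds on iDistance, whose core idea is to reduce each high-dimensional point to a single sortable key. The sketch below illustrates that keying idea only, under assumed reference points and a sorted list standing in for the B+-tree; it is not the ID* implementation or its heuristics.

```python
# iDistance-style one-dimensional keying for distance-based retrieval (conceptual sketch).
# Each point is assigned to its closest reference point; its key is
# partition_id * C + distance-to-reference, where C exceeds any possible distance,
# so an ordered structure (a B+-tree in practice, a sorted list here) supports range scans.
import bisect
import math

C = 10_000.0  # constant separating partitions in the 1-D key space

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_index(points, refs):
    """Return a list of (key, point) sorted by key -- our stand-in for a B+-tree."""
    index = []
    for p in points:
        i, d = min(enumerate(dist(p, r) for r in refs), key=lambda t: t[1])
        index.append((i * C + d, p))
    index.sort()
    return index

def range_query(index, refs, q, radius):
    """Collect points within `radius` of q by scanning one key range per partition."""
    keys = [k for k, _ in index]
    results = []
    for i, r in enumerate(refs):
        d_q = dist(q, r)
        # Triangle inequality: matching points in partition i have
        # distance-to-reference within [d_q - radius, d_q + radius].
        lo = bisect.bisect_left(keys, i * C + max(d_q - radius, 0.0))
        hi = bisect.bisect_right(keys, i * C + d_q + radius)
        results += [p for _, p in index[lo:hi] if dist(q, p) <= radius]
    return results

refs = [(0.0, 0.0), (10.0, 10.0)]                      # assumed reference points
points = [(1.0, 2.0), (9.5, 9.0), (2.0, 1.0)]
idx = build_index(points, refs)
print(range_query(idx, refs, q=(1.5, 1.5), radius=1.0))  # e.g. [(1.0, 2.0), (2.0, 1.0)]
```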
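For the abstraction path-based inverted index in the last item, the general notion can be sketched as follows: expand each document term into its WordNet hypernym path(s) and post the document under every synset on those paths, so queries can match at different levels of abstraction. This is a minimal illustration using NLTK's WordNet interface (requires the wordnet corpus via nltk.download("wordnet")); it does not reproduce the FP-Tree motivated structure developed in the thesis.

```python
# Abstraction-path inverted index sketch: synset name -> documents reachable through it.
from collections import defaultdict
from nltk.corpus import wordnet as wn

def abstraction_paths(word):
    """Hypernym paths (root-to-sense synset lists) for all noun senses of a word."""
    paths = []
    for synset in wn.synsets(word, pos=wn.NOUN):
        paths.extend(synset.hypernym_paths())
    return paths

def build_inverted_index(docs):
    """Post each document under every synset on its terms' abstraction paths."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            for path in abstraction_paths(term):
                for synset in path:
                    index[synset.name()].add(doc_id)
    return index

docs = {
    "d1": "the dog chased the cat",
    "d2": "a feline slept on the sofa",
}
index = build_inverted_index(docs)
# Both documents become reachable through shared abstractions such as 'carnivore.n.01'.
print(index.get("carnivore.n.01"))  # e.g. {'d1', 'd2'}
```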