Theses and Dissertations at Montana State University (MSU)
Permanent URI for this community: https://scholarworks.montana.edu/handle/1/732
6 results
Search Results
Item Large-scale automated human protein-phenotype relation extraction from biomedical literature (Montana State University - Bozeman, College of Engineering, 2020) Pourreza Shahri, Morteza; Chairperson, Graduate Committee: Indika Kahanda
Identifying protein-phenotype relations is of paramount importance for applications such as uncovering rare and complex diseases. Human Phenotype Ontology (HPO) is a recently introduced standardized vocabulary for describing disease-related phenotypic abnormalities in humans. While the official HPO knowledge base maintains known associations between human proteins and HPO terms, it is widely believed to be incomplete. However, due to the exponential growth of biomedical literature, timely manual curation is infeasible, creating a need for efficient and accurate computational tools for automated curation. In this work, we present HPcurator, a novel two-step framework for extracting relations between proteins and HPO terms from biomedical literature. First, we implement ProPheno, a comprehensive online dataset composed of human protein-phenotype co-mentions extracted from the entire set of biomedical articles. Subsequently, we show that these co-mentions are useful as a complementary source of input for a different, but highly related, task of automated protein-phenotype prediction. Next, we develop a supervised machine learning model called PPPred, which, to the best of our knowledge, is the first predictive model that can classify the validity of a given sentence-level protein-phenotype co-mention. Using a gold standard dataset composed of manually curated sentence co-mentions, we demonstrate that PPPred significantly outperforms several baseline methods. Finally, we propose SSEnet, a novel deep semi-supervised ensemble framework for relation extraction that combines deep learning, semi-supervised learning, and ensemble learning.
This framework is motivated by the fact that while the manual annotation of co-mentions is prohibitively expensive, we have access to millions of unlabeled co-mentions. We develop a prototype of HPcurator by instantiating SSEnet with ProPheno, self-learning, pre-trained language models, as well as convolutional and recurrent neural networks. This system can successfully output a ranked list of relevant sentences for a user-supplied protein-phenotype pair. Our experimental results indicate that this system provides state-of-the-art performance in human protein-HPO term relation extraction. The findings and the insight gained from this work have implications for biocurators, biologists, and the computer science community involved in developing biomedical text mining tools.
Item Event-triggered causality: a causality detection tool for big data (Montana State University - Bozeman, College of Engineering, 2018) Davis, Tyler Bruce; Chairperson, Graduate Committee: Ross K. Snider
Finding causal relationships in time series data is a well-known problem, and methods such as Granger causality or transfer entropy look for them in continuous data sources. However, when data contains discrete events and comes from multiple sources with varying data types, assumptions underlying these methods are often violated. We present a new method called Event Triggered Causality (ETC) that can determine causal relationships between observed events within time series data from very different sensors. The new causality metric takes a data mining approach where events in the data are first identified with data-dependent event detectors. The events are then clustered according to their spectral fingerprints and assessed for causality using both similarity and predictability measures. Event similarity is measured using distance metrics while temporal predictability is measured using a temporal entropy metric.
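One way to picture the temporal predictability idea is to measure how spread out the delays are between one event type and the next occurrence of another. The Python sketch below is an illustration only, not the ETC metric from the thesis: the function names, the fixed-width binning, and the use of Shannon entropy over delay bins are all assumptions made for the example.

```python
import math
from bisect import bisect_right
from collections import Counter


def delays_to_next(cause_times, effect_times):
    """Delay from each cause event to the first strictly later effect event.

    Hypothetical helper: cause events with no later effect event are skipped.
    """
    effects = sorted(effect_times)
    delays = []
    for t in cause_times:
        idx = bisect_right(effects, t)  # index of first effect time > t
        if idx < len(effects):
            delays.append(effects[idx] - t)
    return delays


def delay_entropy(delays, bin_width):
    """Shannon entropy (in bits) of the delays after fixed-width binning.

    Assumption for illustration: lower entropy is read as more predictable
    timing between the two event types.
    """
    if not delays:
        return 0.0
    counts = Counter(int(d // bin_width) for d in delays)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Under these assumptions, a low entropy means the second event type tends to follow the first after a consistent delay, while a high entropy means the timing is close to unpredictable at the chosen bin width.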
ETC is then extended to find successive causal links between events, called Directed Event Triggered Causality, which takes the form of a directed graph. We use these methods to analyze potential causal links in two different situations. The first is searching for causal links between marmoset vocal interactions and related movements. The second is between commands from a farmer, his sheep dog, and the movement of sheep. The construction of these metrics helps to expand the definition of event-based causality and provides a method to further understand complex systems such as social and behavioral interactions.
Item Exploring timeliness for accurate location recommendation on location-based social networks (Montana State University - Bozeman, College of Engineering, 2017) Xu, Yi; Chairperson, Graduate Committee: Qing Yang
An individual's location history in the real world implies his or her interests and behaviors. Accordingly, people who share similar location histories are likely to have common interests and behaviors. This thesis analyzes the Collaborative Filtering (CF) approach, which mines an individual's preferences from his or her geographic location history and recommends locations based on the similarities between the user and others. We find that a CF-based recommendation process can be summarized as a sequence of multiplications between a transition matrix and a visited-location matrix. The transition matrix is usually approximated by the user interest matrix, which reflects the similarity among users with regard to their interest in visiting different locations. The visited-location matrix provides the history of visited locations of all users that is currently available to the recommendation system. We find that recommendation results will converge if and only if the transition matrix remains unchanged; otherwise, the recommendations will be valid for only a certain period of time.
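The description of CF as repeated multiplication can be illustrated with a small Python sketch. It assumes, for illustration only, that the transition matrix is a row-normalized cosine-similarity matrix between users' visit vectors; the abstract does not specify this construction, and all function names below are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two visit vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def transition_matrix(visits):
    """Assumed construction: user-user cosine similarity, rows scaled to sum to 1."""
    sims = [[cosine(u, v) for v in visits] for u in visits]
    result = []
    for i, row in enumerate(sims):
        total = sum(row)
        if total == 0:
            # A user with no visits keeps their own (empty) scores.
            result.append([1.0 if j == i else 0.0 for j in range(len(row))])
        else:
            result.append([s / total for s in row])
    return result


def matmul(a, b):
    """Plain-Python matrix product of a (n x m) and b (m x k)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def propagate(transition, visits, steps):
    """Multiply the visited-location matrix by a fixed transition matrix `steps` times."""
    scores = visits
    for _ in range(steps):
        scores = matmul(transition, scores)
    return scores
```

For three users who each share exactly one location with every other user, such as `[[1, 1, 0], [1, 0, 1], [0, 1, 1]]`, the scores in this sketch settle toward a common value as long as the transition matrix is held fixed. If the matrix were recomputed between steps, the iterates would have no fixed limit to approach, which is one way to read the timeliness concern raised in the abstract.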
Based on our analysis, a novel location-based accurate recommendation (LAR) method is proposed, which considers the semantic meaning and category information of locations, as well as the timeliness of recommendation results, to make accurate recommendations. We evaluated the precision and recall rates of LAR using a large-scale real-world data set collected from Brightkite. Evaluation results confirm that LAR offers more accurate recommendations than state-of-the-art approaches.
Item Anomaly detection through spatio-temporal data mining, with application to near real-time outlying sensor identification (Montana State University - Bozeman, College of Engineering, 2017) Galarus, Douglas Edward; Chairperson, Graduate Committee: John Paxton; Rafal A. Angryk (co-chair)
There is a need for robust solutions to the challenges of near real-time spatio-temporal outlier and anomaly detection. In our dissertation, we define and demonstrate quality measures for evaluation and comparison of overlapping, real-time, spatio-temporal data providers and for assessment and optimization of data acquisition, system operation and data redistribution. Our measures are tested on real-world data and applications, and our results show the need and potential to develop our own mechanisms for outlier and anomaly detection. We then develop a representative, near real-time solution for the identification of outlying sensors that far outperforms state-of-the-art methods in terms of accuracy and is computationally efficient. Applying our method to a real-world meteorological data set, we identify numerous problematic sites that have not otherwise been flagged as bad. We identify sites for which metadata is incorrect. We identify observations that have been mislabeled by provider quality control processes. We also demonstrate that our method outperforms enhanced versions of state-of-the-art methods for assessment of accuracy using comparable or less computation time.
There are many quality-related problems with real data sets and, in the absence of an approach like ours, these problems may have largely gone unidentified. Our approach is novel for the simple but effective way that it accounts for spatial and temporal variation, and in that it addresses more than just accuracy. Collectively these contributions form an overarching data-mining framework and example that can be used and extended for data-mining method development, model building and evaluation of spatio-temporal outlier and anomaly detection processes.
Item Mining spatiotemporal co-occurrence patterns from massive data sets with evolving regions (Montana State University - Bozeman, College of Engineering, 2014) Ganesan Pillai, Karthik; Chairperson, Graduate Committee: John Paxton; Rafal A. Angryk (co-chair)
Due to the current rates of data acquisition, the growth of data volumes in nearly all domains of our lives is reaching historic proportions [5], [6], [7]. Spatiotemporal data mining has emerged in recent decades with the main goal of developing data-driven mechanisms for understanding the spatiotemporal characteristics and patterns occurring in massive repositories of data. This work focuses on discovering spatiotemporal co-occurrence patterns (STCOPs) from large data sets with evolving regions. Spatiotemporal co-occurrence patterns represent the subset of event types that occur together in both space and time. Major limitations of existing spatiotemporal data mining models and techniques include the following. First, they do not take into account continuously evolving spatiotemporal events that have polygon-like representations. Second, they do not investigate and provide sufficient interest measures for STCOP discovery. Third, computationally efficient and storage-efficient algorithms to discover STCOPs are missing.
These limitations of existing approaches represent important hurdles when analyzing massive spatiotemporal data sets in several application domains that generate big data, including solar physics, which is an application of our interdisciplinary research. In this work, we address these limitations by i) introducing the problem of mining STCOPs from data sets with extended (region-based) spatial representations that evolve over time, ii) developing a set of novel interest measures, and iii) providing a novel framework to model STCOPs. We also present and investigate three novel approaches to STCOP mining. We follow this investigation by applying our algorithm to perform a novel data-driven discovery of STCOPs from solar physics data.
Item FP-Tree motivated system for information retrieval using an abstraction path-based inverted index (Montana State University - Bozeman, College of Engineering, 2009) McAllister, Richard Arthur; Chairperson, Graduate Committee: Rafal A. Angryk
Language ontologies provide an avenue for automated lexical analysis that may be used in concert with existing information retrieval methods to allow lexicographic relationships to be considered in searches. This paper presents a method of information retrieval that uses WordNet, a database of lexical ontologies, to generate paths of abstraction via lexicographic relationships, and uses these relationships as the basis for an inverted index to be used in the retrieval of documents from an indexed corpus. We present this method as an entree to a line of research in using lexical ontologies to perform graph analysis of documents, and through this process improve the precision of existing information retrieval techniques.
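The idea of an inverted index keyed by abstraction paths can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the system described in the thesis: the `HYPERNYM` table is a tiny hypothetical stand-in for WordNet with one parent per word, no WordNet library is called, and the function names are invented for the example.

```python
from collections import defaultdict

# Hypothetical, hand-written hypernym table standing in for WordNet.
# Each word maps to a single more abstract parent; the table has no cycles.
HYPERNYM = {
    "beagle": "dog",
    "dog": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
}


def abstraction_path(word):
    """Path from a word up to its most abstract known ancestor."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def build_index(docs):
    """Map every suffix of each word's abstraction path to the documents using it.

    Indexing the suffixes means a document containing "beagle" is also
    reachable through the paths that start at "dog", "canine", and so on.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            path = abstraction_path(word)
            for depth in range(len(path)):
                index[tuple(path[depth:])].add(doc_id)
    return index


def search(index, term):
    """Return sorted ids of documents whose words fall under the term's path."""
    return sorted(index.get(tuple(abstraction_path(term)), set()))
```

With documents such as `{1: "the beagle barked", 2: "a cat slept"}`, a query for the more abstract term `mammal` matches both documents in this sketch, while a query for `dog` matches only the first.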