Large-scale automated human protein-phenotype relation extraction from biomedical literature

dc.contributor.advisor: Chairperson, Graduate Committee: Indika Kahanda
dc.contributor.author: Pourreza Shahri, Morteza
dc.date.accessioned: 2021-09-16T19:30:37Z
dc.date.available: 2021-09-16T19:30:37Z
dc.date.issued: 2020
dc.description.abstract: Identifying protein-phenotype relations is of paramount importance for applications such as uncovering rare and complex diseases. The Human Phenotype Ontology (HPO) is a recently introduced standardized vocabulary for describing disease-related phenotypic abnormalities in humans. While the official HPO knowledge base maintains known associations between human proteins and HPO terms, it is widely believed to be incomplete. However, due to the exponential growth of biomedical literature, timely manual curation is infeasible, creating a need for efficient and accurate computational tools for automated curation. In this work, we present HPcurator, a novel two-step framework for extracting relations between proteins and HPO terms from biomedical literature. First, we implement ProPheno, a comprehensive online dataset composed of human protein-phenotype co-mentions extracted from the entire set of biomedical articles. Subsequently, we show that these co-mentions are useful as a complementary source of input for a different, but highly related, task of automated protein-phenotype prediction. Next, we develop a supervised machine learning model called PPPred, which, to the best of our knowledge, is the first predictive model that can classify the validity of a given sentence-level protein-phenotype co-mention. Using a gold standard dataset composed of manually curated sentence co-mentions, we demonstrate that PPPred significantly outperforms several baseline methods. Finally, we propose SSEnet, a novel deep semi-supervised ensemble framework for relation extraction that combines deep learning, semi-supervised learning, and ensemble learning. This framework is motivated by the fact that while manual annotation of co-mentions is prohibitively expensive, we have access to millions of unlabeled co-mentions.
We develop a prototype of HPcurator by instantiating SSEnet with ProPheno, self-learning, pre-trained language models, and convolutional and recurrent neural networks. Given a user-supplied protein-phenotype pair, this system outputs a ranked list of relevant sentences. Our experimental results indicate that this system provides state-of-the-art performance in human protein-HPO term relation extraction. The findings and insights gained from this work have implications for biocurators, biologists, and the computer science community involved in developing biomedical text mining tools.
dc.identifier.uri: https://scholarworks.montana.edu/handle/1/15981
dc.language.iso: en
dc.publisher: Montana State University - Bozeman, College of Engineering
dc.rights.holder: Copyright 2020 by Morteza Pourreza Shahri
dc.subject.lcsh: Diseases
dc.subject.lcsh: Proteins
dc.subject.lcsh: Phenotype
dc.subject.lcsh: Data mining
dc.subject.lcsh: Machine learning
dc.subject.lcsh: Neural networks (Computer science)
dc.subject.lcsh: Predictive analytics
dc.title: Large-scale automated human protein-phenotype relation extraction from biomedical literature
dc.type: Dissertation
mus.data.thumbpage: 58
thesis.degree.committeemembers: Members, Graduate Committee: Brendan Mumey; Upulee Kanewala; Diane Bimczok
thesis.degree.department: Computing.
thesis.degree.genre: Dissertation
thesis.degree.name: PhD
thesis.format.extentfirstpage: 1
thesis.format.extentlastpage: 207

Files

Original bundle

Name: pourreza-shahri-large-scale-2020.pdf
Size: 2.74 MB
Format: Adobe Portable Document Format
Description: Large-scale automated human protein-phenotype relation extraction from biomedical literature (PDF)
