Exploring the feasibility of an automated biocuration pipeline for research domain criteria
Research on mental disorders has been largely based on manuals such as the ICD-10 (International Classification of Diseases) and the DSM-5 (Diagnostic and Statistical Manual of Mental Disorders), which classify disorders by their signs and symptoms. However, this approach tends to overlook the underlying mechanisms of brain disorders and does not capture the heterogeneity of those conditions. To address this, the National Institute of Mental Health (NIMH) introduced a new framework for mental illness research, the Research Domain Criteria (RDoC). RDoC is a research framework that draws on multiple units of analysis, such as genes and neural circuits, for accurate multi-dimensional classification of mental illnesses. The framework's units of analysis are updated manually in periodic workshops, where domain experts search the literature for relevant evidence. Given the large volume of relevant biomedical research, a method that automates the extraction of evidence from the biomedical literature is key to assisting curation of the RDoC matrix. In this thesis, we formulate three tasks that would be necessary for an automated biocuration pipeline for RDoC: 1) labeling biomedical articles with RDoC constructs, 2) retrieval of brain research articles, and 3) extraction of relevant data from these articles. We model the first task as a multilabel classification problem over 26 RDoC constructs, training various supervised classification algorithms on a gold-standard dataset of annotated PubMed abstracts. The second task identifies which general PubMed abstracts are relevant to brain research, using the data from the first task together with additional unlabeled abstracts to train a model. Finally, for the third task, we attempt to extract Problem, Intervention, Comparison, and Outcomes (PICO) elements and brain region mentions from a subset of the RDoC abstracts.
To the best of our knowledge, this is the first study aimed at automated data extraction and retrieval of RDoC-related literature. The results of automating these tasks are promising: we obtain a very accurate multilabel classification model, a good retrieval model, and an accurate brain region extraction model.
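
To make the multilabel formulation of the first task concrete, the following is a minimal sketch of one common approach, binary relevance, in which a separate yes/no classifier is trained for each construct. It is an illustration under stated assumptions and not the method used in the thesis, which does not name its algorithms here: the naive Bayes classifier, the two label names (`threat`, `memory`), the toy abstracts, and all function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenizer; a real pipeline would need proper preprocessing.
    return text.lower().split()


class BinaryNB:
    """Tiny multinomial naive Bayes for one yes/no label (Laplace smoothing)."""

    def fit(self, docs, ys):
        self.vocab = {w for d in docs for w in d}
        self.counts = {True: Counter(), False: Counter()}
        self.n_docs = {True: 0, False: 0}
        for d, y in zip(docs, ys):
            self.counts[y].update(d)
            self.n_docs[y] += 1
        return self

    def _log_score(self, doc, y):
        total = sum(self.counts[y].values()) + len(self.vocab)
        # Smoothed class prior plus smoothed word likelihoods.
        score = math.log((self.n_docs[y] + 1) / (sum(self.n_docs.values()) + 2))
        for w in doc:
            if w in self.vocab:  # ignore words never seen in training
                score += math.log((self.counts[y][w] + 1) / total)
        return score

    def predict(self, doc):
        return self._log_score(doc, True) > self._log_score(doc, False)


def fit_binary_relevance(texts, label_sets, labels):
    # One independent binary classifier per label.
    docs = [tokenize(t) for t in texts]
    return {lab: BinaryNB().fit(docs, [lab in s for s in label_sets])
            for lab in labels}


def predict_labels(models, text):
    doc = tokenize(text)
    return {lab for lab, model in models.items() if model.predict(doc)}


# Hypothetical toy data: each abstract may carry zero, one, or several labels.
texts = [
    "amygdala fear response to threat cues",
    "fear conditioning and amygdala activation",
    "prefrontal cortex working memory load",
    "working memory capacity and prefrontal activity",
    "fear disrupts working memory in prefrontal cortex",
]
label_sets = [{"threat"}, {"threat"}, {"memory"}, {"memory"}, {"threat", "memory"}]

models = fit_binary_relevance(texts, label_sets, ["threat", "memory"])
print(predict_labels(models, "amygdala fear conditioning"))
print(predict_labels(models, "prefrontal working memory"))
```

In a setting like the one described above, the same structure would apply with 26 labels instead of two, and each per-label classifier could be swapped for any supervised algorithm.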