
dc.contributor.advisor: Chairperson, Graduate Committee: Rafal A. Angryk (en)
dc.contributor.author: Hossain, Mahmud Shahriar (en)
dc.description.abstract: This thesis introduces a new document clustering technique based on frequent senses. The developed system, named GDClust (Graph-Based Document Clustering) [1], works with frequent senses rather than the frequent keywords used in traditional text mining techniques. GDClust represents text documents as hierarchical document-graphs and uses an Apriori paradigm to find the frequent subgraphs, which reflect frequent senses. The discovered frequent subgraphs are then used to generate accurate sense-based document clusters. We propose a novel multilevel Gaussian minimum support strategy for candidate subgraph generation. Additionally, we introduce another novel mechanism, called Subgraph-Extension mining, that reduces the number of candidates and the overhead imposed by the traditional Apriori-based candidate generation mechanism. GDClust uses an English-language thesaurus (WordNet [2]) to construct document-graphs and exploits graph-based data mining techniques for sense discovery and clustering. It is an automated system and requires minimal human interaction to perform clustering. (en)
dc.publisher: Montana State University - Bozeman, College of Engineering (en)
dc.subject.lcsh: Document clustering (en)
dc.subject.lcsh: Graphic methods (en)
dc.subject.lcsh: Graph theory (en)
dc.subject.lcsh: Computer programs (en)
dc.title: Apriori approach to graph-based clustering of text documents (en)
dc.rights.holder: Copyright 2008 by Mahmud Shahriar Hossain (en)
thesis.catalog.ckey: 1327464 (en)
Graduate Committee: John Paxton; Hunter Lloyd (en)
Science. (en)
