FP-growth approach for document clustering

dc.contributor.advisor: Chairperson, Graduate Committee: Rafal A. Angryk [en]
dc.contributor.author: Akbar, Monika [en]
dc.date.accessioned: 2013-06-25T18:40:28Z
dc.date.available: 2013-06-25T18:40:28Z
dc.date.issued: 2008 [en]
dc.description.abstract: Since the amount of text data stored in computer repositories is growing every day, we need more than ever a reliable way to group or categorize text documents. Most existing document clustering techniques use a group of keywords from each document to cluster the documents. In this thesis, we use a sense-based approach to cluster documents instead of relying only on the frequency of the keywords: we use the relationships between the keywords. The relationships are retrieved from the WordNet ontology and represented in the form of a graph. The document-graphs, which reflect the essence of the documents, are searched to find frequent subgraphs. To discover the frequent subgraphs, we use the Frequent Pattern Growth (FP-growth) approach, which was originally designed to discover frequent patterns. The common frequent subgraphs discovered by the FP-growth approach are later used to cluster the documents. The FP-growth approach requires the creation of an FP-tree. Mining an FP-tree created for a normal transaction database is easier than mining one built from large document-graphs, mostly because the itemsets in a transaction database are smaller than the edge lists of our document-graphs. The original FP-tree mining procedure is also easier because the items of a traditional transaction database are stand-alone entities with no direct connection to each other. In contrast, when we look for subgraphs in graphs, the items become related to each other through connectivity. The resulting computation cost makes the original FP-growth approach somewhat inefficient for text documents. We modify the FP-growth approach, making it possible to generate frequent subgraphs from the FP-tree. Later, we cluster documents using these subgraphs. [en]
dc.identifier.uri: https://scholarworks.montana.edu/handle/1/807 [en]
dc.language.iso: en [en]
dc.publisher: Montana State University - Bozeman, College of Engineering [en]
dc.rights.holder: Copyright 2008 by Monika Akbar [en]
dc.subject.lcsh: Document clustering [en]
dc.title: FP-growth approach for document clustering [en]
dc.type: Thesis [en]
thesis.catalog.ckey: 1320270 [en]
thesis.degree.committeemembers: Members, Graduate Committee: Rockford J. Ross; Hunter Lloyd [en]
thesis.degree.department: Computer Science. [en]
thesis.degree.genre: Thesis [en]
thesis.degree.name: MS [en]
thesis.format.extentfirstpage: 1 [en]
thesis.format.extentlastpage: 66 [en]
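
The abstract describes treating each document-graph as a transaction whose items are edges, and building an FP-tree over those transactions. The following is a minimal sketch of standard FP-tree construction under that reading, not the modified procedure developed in the thesis. The class and function names, the example edges, and the `min_support` value are all hypothetical and chosen only for illustration.

```python
from collections import Counter


class FPNode:
    """One node of an FP-tree: an item, its count, and its children."""

    def __init__(self, item=None, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_support):
    """Build an FP-tree from transactions (iterables of hashable items)."""
    # Pass 1: support = number of transactions that contain each item.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in support.items() if c >= min_support}

    root = FPNode()
    # Pass 2: insert each transaction with infrequent items dropped and the
    # rest ordered by descending support (ties broken by the item itself).
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root, frequent


# Hypothetical document-graphs, each reduced to its edge list (assumed data).
docs = [
    [("animal", "dog"), ("animal", "cat"), ("entity", "animal")],
    [("animal", "dog"), ("entity", "animal")],
    [("entity", "animal"), ("plant", "tree")],
]
tree, frequent = build_fp_tree(docs, min_support=2)
```

In this sketch the edges are still independent items, which is exactly the limitation the abstract points out: the tree records which edges co-occur across documents, but nothing in it guarantees that a set of co-occurring edges forms a connected subgraph.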

Files

Original bundle

Name: AkbarM0508.pdf
Size: 740.37 KB
Format: Adobe Portable Document Format