FP-growth approach for document clustering

Akbar, Monika

Files

AkbarM0508.pdf (740.37 KB)

Date

2008

Authors

Akbar, Monika

Publisher

Montana State University - Bozeman, College of Engineering

Abstract

Since the amount of text data stored in computer repositories is growing every day, we need more than ever a reliable way to group or categorize text documents. Most existing document clustering techniques use a group of keywords from each document to cluster the documents. In this thesis, we use a sense-based approach to cluster documents instead of relying only on the frequency of the keywords. We use relationships between the keywords to cluster the documents. The relationships are retrieved from the WordNet ontology and represented in the form of a graph. The document-graphs, which reflect the essence of the documents, are searched to find the frequent subgraphs. To discover the frequent subgraphs, we use the Frequent Pattern Growth (FP-growth) approach, which was originally designed to discover frequent patterns. The common frequent subgraphs discovered by the FP-growth approach are later used to cluster the documents. The FP-growth approach requires the creation of an FP-tree. Mining an FP-tree created for a normal transaction database is easier than mining one created for large document-graphs, mostly because the itemsets in a transaction database are smaller than the edge lists of our document-graphs. The original FP-tree mining procedure is also easier because the items of a traditional transaction database are stand-alone entities with no direct connection to each other. In contrast, when we look for subgraphs in graphs, the items (edges) become related to each other through connectivity. The computation cost makes the original FP-growth approach somewhat inefficient for text documents. We modify the FP-growth approach, making it possible to generate frequent subgraphs from the FP-tree. We then cluster documents using these subgraphs.
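
The FP-tree step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each document-graph is reduced to a list of edges (pairs of keyword nodes) and treats every edge as an item, then builds a standard FP-tree in the usual way. The names `FPNode`, `frequent_edges`, `build_fp_tree`, and the sample edges are hypothetical and are not taken from the thesis. This is the textbook FP-tree construction, not the modified procedure the thesis proposes, and it does not check that the edges on a path form a connected subgraph.

```python
from collections import Counter


class FPNode:
    """One node of the FP-tree: an edge (item) plus a count of documents."""

    def __init__(self, item, parent=None):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def frequent_edges(doc_graphs, min_support):
    """Count how many document-graphs contain each edge; keep the frequent ones."""
    support = Counter(
        edge for edges in doc_graphs.values() for edge in set(edges)
    )
    return {edge: n for edge, n in support.items() if n >= min_support}


def build_fp_tree(doc_graphs, min_support):
    """Insert each document's frequent edges into the tree, most frequent first."""
    freq = frequent_edges(doc_graphs, min_support)
    root = FPNode(None)
    for edges in doc_graphs.values():
        # Drop infrequent edges; order by descending support, ties by edge name.
        ordered = sorted(
            (e for e in set(edges) if e in freq),
            key=lambda e: (-freq[e], e),
        )
        node = root
        for edge in ordered:
            if edge not in node.children:
                node.children[edge] = FPNode(edge, node)
            node = node.children[edge]
            node.count += 1
    return root, freq


# Hypothetical document-graphs, each given as a list of (parent, child) edges.
doc_graphs = {
    "d1": [("animal", "dog"), ("animal", "cat")],
    "d2": [("animal", "dog"), ("vehicle", "car")],
    "d3": [("animal", "dog"), ("animal", "cat")],
}
root, freq = build_fp_tree(doc_graphs, min_support=2)
```

In this sketch, documents that share a prefix of frequent edges share a path in the tree, which is what makes the shared edge sets cheap to enumerate. Turning those edge sets into connected frequent subgraphs is the part that needs the extra connectivity handling the abstract describes.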

URI

https://scholarworks.montana.edu/xmlui/handle/1/807

Collections

Theses and Dissertations at Montana State University (MSU)
