High-dimensional data indexing with applications

Schuh, Michael Arthur

dc.contributor.advisor	Chairperson, Graduate Committee: Rockford J. Ross; Rafal A. Angryk (co-chair)	en
dc.contributor.author	Schuh, Michael Arthur	en
dc.date.accessioned	2016-01-03T17:34:10Z
dc.date.available	2016-01-03T17:34:10Z
dc.date.issued	2015	en
dc.description.abstract	The indexing of high-dimensional data remains a challenging task amidst an active and storied area of computer science research that impacts many far-reaching applications. At the crossroads of databases and machine learning, modern data indexing enables information retrieval capabilities that would otherwise be impractical or near impossible to attain and apply. One such useful retrieval task in our increasingly data-driven world is the k-nearest neighbor (k-NN) search, which returns the k most similar items in a dataset to the search query provided. While the k-NN concept was popularized in everyday use through the sorted (ranked) results of online text-based search engines like Google, multimedia applications are rapidly becoming the new frontier of research. This dissertation advances the current state of high-dimensional data indexing with the creation of a novel index named ID* ("ID Star"). Based on extensive theoretical and empirical analyses, we discuss important challenges associated with high-dimensional data and identify several shortcomings of existing indexing approaches and methodologies. By further mitigating the negative effects of the curse of dimensionality, we are able to push the boundary of effective k-NN retrieval to a higher number of dimensions over much larger volumes of data. As the foundation of the ID* index, we developed an open-source and extensible distance-based indexing framework predicated on the basic concepts of the popular iDistance index, which utilizes an internal B+-tree for efficient one-dimensional data indexing. Through the addition of several new heuristic-guided algorithmic improvements and hybrid indexing extensions, we show that our new ID* index can perform significantly better than several other popular alternative indexing techniques over a wide variety of synthetic and real-world data.
In addition, we present applications of our ID* index through the use of k-NN queries in Content-Based Image Retrieval (CBIR) systems and machine learning classification. An emphasis is placed on the NASA-sponsored interdisciplinary research goal of developing a CBIR system for large-scale solar image repositories. Since such applications rely on fast and effective k-NN queries over increasingly large-scale and high-dimensional datasets, it is imperative to utilize an efficient data indexing strategy such as the ID* index.	en
dc.identifier.uri	https://scholarworks.montana.edu/handle/1/9215	en
dc.language.iso	en	en
dc.publisher	Montana State University - Bozeman, College of Engineering	en
dc.rights.holder	Copyright 2015 by Michael Arthur Schuh	en
dc.subject.lcsh	Indexing	en
dc.subject.lcsh	Nearest neighbor analysis (Statistics)	en
dc.subject.lcsh	Information retrieval	en
dc.subject.lcsh	Content-based image retrieval	en
dc.title	High-dimensional data indexing with applications	en
dc.type	Dissertation	en
thesis.catalog.ckey	2898829	en
thesis.degree.committeemembers	Members, Graduate Committee: Petrus Martens; John Paxton; Lisa Davis	en
thesis.degree.department	Computer Science.	en
thesis.degree.genre	Dissertation	en
thesis.degree.name	PhD	en
thesis.format.extentfirstpage	1	en
thesis.format.extentlastpage	131	en

Files

Original bundle

Name: SchuhM0815.pdf
Size: 4.58 MB
Format: Adobe Portable Document Format

Collections

Theses and Dissertations at Montana State University (MSU)