Faculty of Computer Science Database Workgroup

QuEval

Reliable and Reproducible Evaluation of High-Dimensional Index Structures

Description

Multimedia data, or high-dimensional data in general, have been subject to research for more than two decades and gain momentum even more in the communication technology age. From a database point of view, the myriads of gigabyte of data pose the problem of managing this data. In this course, query processing is a challenging taks due to the high dimensionality of such data. In the past, dozens of index structures for high-dimensional data have been proposed and some of them are even standard-like references. However, it is still some kind of black magic to decide which index structure fits to a certain problem or outweighs another index structure. In particular, the following questions have to be answered to make a decision in favor or against a certain index structure:

  • When to use which index structure (e.g., regarding dimensionality)?
  • What kind of index is the best for a specific problem?
  • How do we adjust existing parameters of a chosen index structure to achieve the best results?
This is where QuEval, a framework for quantitative comparison and evaluation of high-dimensional index structures comes into play. QuEval is a Java-based framework that supports the comparison of index strucutres regarding certain characteristics such as dimensionality, accuracy, or performance. Currently, the framework contains six different index structures. However, a main focus of the framework is its extensibility and we encourage people to contribute to QuEval by providing more index structures or other interesting aspects for their comparison.

QuEval is under constant development. The following features are implemented:

  • A Data Generator that generates the actual data for the comparison, based on some user-specified properties.
  • A Query-Point Generator that randomly generates the query points for a comparison scenario.
  • Persistent storage of datasets and query points that allows for replication of certain case studies.
  • The HDI Tester that is the core component and responsible for the actual comparison of index structures.
  • Six index structures, integrated within the framework. For details see below.
  • Support for generating different kind of data such as equally distributed or clustered data.
  • A graphical user interface (GUI) supporting an intuitive usage of the framework.

Overview

Generally, our framework consists of three parts: Data Generator, Query-Point Generator and HDI Tester. In the following we describe in short each of these components and how they interact (which is also indicated by the Figure below).

Data Generator: This component is responsible for generating the data set that is used for a certain scenario. To this end, the user can specify certain characteristics of the data. In particular, we currently support the specification of dimensionality (i.e., number of dimensions), the amount of data (i.e., number of data points in the data set), the stochastic distribution of data, and the range of values for each dimension. Furthermore, it is possible to save the generated data set and thus reuse it for replicating or extending a former scenario.

Query-Point Generator: Analogously, this component is responsible for generating the query points used within a scenario. This can be done in two different ways: First, this component takes the generated data set as input and selects randomly a specified number of query points out of this data set. Consequently, the query points exhibit the same properties as the data points in the data set. Alternatively, the Query-Point Generator can generate query points independent of the underlying data set (but with the same properties). Furthermore, the generated query points can be saved and reused for replication.

HDI Tester: This is the core component of our framework since it performs the actual comparison of index structures. To this end, the user has to specify a scenario description, containing the following information:

  • A data set, generated by the Data Generator, or a point collection stored as comma separated values (CSV).
  • A set of query points randomly generated by the Query-Point Generator for exact match queries that not exists in the data set.
  • A set of points from the data set as further query points for exact match and knn queries.
  • A selection of indexes that are subject to comparison within the scenario.
  • The number of test rounds and amount of nearest neighbors k that shall be computed.
  • Whether the accuracy shall be determined or not.
The user can create this scenario description by means of a graphical user interface. To compare/evaluate the selected indexes, the HDI Tester determines the accuracy (only if chosen) and time efficiency. In the case that the framework computes the accuracy (currently based on the euclidean distance), it uses the sequential scan to determine the correct query response and compares this result to the query result of the evaluated index structures. Thehe results of a scenario are saved in a CSV file that can be deployed for detailed analyses of the experimental results.

Implemented Index Structures

Currently, the following index structures are available within the framework.
NameFurther information/original paperLink(s)
Pyramid technique Original paper: S.Berchthold et al., The pyramid-technique: towards breaking the curse of dimensionality, SIGMOD Records, 1998
Implementation based on: D.-H. Lee et al., An efficient technique for nearest-neighbor query processing on the SPY-TEC, Trans. Knowl. Data Eng., 2003
paper access

paper access
VA-File Original paper: R.Weber et al., An approximation-based data structure for similarity search, technical report, 1997 paper access
Prototype-based approach Original paper: E.C.Gonzalez et al., Effective proximity retrieval by ordering permutations, Trans. Pattern Analysis and Machine Intelligence, 2008 paper access
original implementation (in C#)
LSH-based approach Original paper: P.Indyk et al., Approximate nearest neighbors: towards removing the curse of dimensionality., ACM SAC, 1998
Implementation based on: M.Data et al., Locality-sensitive hashing scheme based on p-stable distributions, Int. Symp. on Comp. Geometry, 2004
paper access

paper access
R-Tree Original paper: A.Guttman, A dynamic index structure for spatial searching, SIGMOD, 1984
Original implementation from the GeoCraft open source framework; adjusted in order to add exact match functionality to the R-Tree
paper access
original implementation
k-d-Tree Original paper: J.L.Bentley, Multidimensional binary search trees used for associative searching, Comm. ACM, 1975
Original implementation from Simon Levy's software archive
paper access
original implementation

Integrating New Index Structures

One of our aims is to maintain an extensible framework so that other researcher can contribute to it and use it for their specific purposes. A central aspect of this extensibility is the possibility to add new index structures to the framework. To this end, an API exists that has to be implemented for each new index structure.
You can find a documentation (JavaDoc style) of this interface here: AIndexStructure -- Documentation

Visualizing Index Structures

Coming soon...

Download

The current version of QuEval can be downloaded as ZIP-compressed archive: QuEval.zip
After downloading the file, extract it to your file system. Afterwards, proceed as follows (where QUEVAL_PATH denotes the main directory of the extracted framework):

Windows

  1. Go to QUEVAL_PATH and execute the file start-framework.bat. The framework should start (see the screenshot below for an example).

Mac OS/Linux

  1. Open the terminal and type cd QUEVAL_PATH/Framework.
  2. Start the framework by executing the file framework.jar as follows: java -jar framework.jar
Screenshot of the QuEval GUI immediately after starting the application.

Source Code

QuEval is released under the EPL License. The source code will be available here until end of January 2012.

Example Data

Recently, we conducted a large-scale case study for the implemented index structures. We will make the data of this case study available until end of January 2012. Stay tuned ;)

Contact

QuEval is developed mainly at the University of Magdeburg, Germany. It is open source and we are currently working on the first release so that everybody who is interested can extend/improve it. For information about the project, technical questions and bug reports: please contact the development team via Martin Schäler.

Project members: Former project members:
  • Björn-Erik Aust
  • David Broneske
  • Denis Dietze
  • Maik Lampe
  • Tran Tuan Ngyuen