QuEval
Reliable and Reproducible Evaluation of High-Dimensional Index Structures
Description
Multimedia data, or high-dimensional data in general, have been subject to research for more than two decades and gain momentum even more in the communication technology age. From a database point of view, the myriads of gigabyte of data pose the problem of managing this data. In this course, query processing is a challenging taks due to the high dimensionality of such data. In the past, dozens of index structures for high-dimensional data have been proposed and some of them are even standard-like references. However, it is still some kind of black magic to decide which index structure fits to a certain problem or outweighs another index structure. In particular, the following questions have to be answered to make a decision in favor or against a certain index structure:
- When to use which index structure (e.g., regarding dimensionality)?
- What kind of index is the best for a specific problem?
- How do we adjust existing parameters of a chosen index structure to achieve the best results?
QuEval is under constant development. The following features are implemented:
- A Data Generator that generates the actual data for the comparison, based on some user-specified properties.
- A Query-Point Generator that randomly generates the query points for a comparison scenario.
- Persistent storage of datasets and query points that allows for replication of certain case studies.
- The HDI Tester that is the core component and responsible for the actual comparison of index structures.
- Six index structures, integrated within the framework. For details see below.
- Support for generating different kind of data such as equally distributed or clustered data.
- A graphical user interface (GUI) supporting an intuitive usage of the framework.
Overview
Generally, our framework consists of three parts: Data Generator, Query-Point Generator and HDI Tester. In the following we describe in short each of these components and how they interact (which is also indicated by the Figure below).
Data Generator: This component is responsible for generating the data set that is used for a certain scenario. To this end, the user can specify certain characteristics of the data. In particular, we currently support the specification of dimensionality (i.e., number of dimensions), the amount of data (i.e., number of data points in the data set), the stochastic distribution of data, and the range of values for each dimension. Furthermore, it is possible to save the generated data set and thus reuse it for replicating or extending a former scenario.
Query-Point Generator: Analogously, this component is responsible for generating the query points used within a scenario. This can be done in two different ways: First, this component takes the generated data set as input and selects randomly a specified number of query points out of this data set. Consequently, the query points exhibit the same properties as the data points in the data set. Alternatively, the Query-Point Generator can generate query points independent of the underlying data set (but with the same properties). Furthermore, the generated query points can be saved and reused for replication.
HDI Tester: This is the core component of our framework since it performs the actual comparison of index structures. To this end, the user has to specify a scenario description, containing the following information:
- A data set, generated by the Data Generator, or a point collection stored as comma separated values (CSV).
- A set of query points randomly generated by the Query-Point Generator for exact match queries that not exists in the data set.
- A set of points from the data set as further query points for exact match and knn queries.
- A selection of indexes that are subject to comparison within the scenario.
- The number of test rounds and amount of nearest neighbors k that shall be computed.
- Whether the accuracy shall be determined or not.
Implemented Index Structures
Currently, the following index structures are available within the framework.| Name | Further information/original paper | Link(s) |
|---|---|---|
| Pyramid technique | Original paper: S.Berchthold et al., The pyramid-technique: towards breaking the curse of dimensionality, SIGMOD Records, 1998 Implementation based on: D.-H. Lee et al., An efficient technique for nearest-neighbor query processing on the SPY-TEC, Trans. Knowl. Data Eng., 2003 |
paper access paper access |
| VA-File | Original paper: R.Weber et al., An approximation-based data structure for similarity search, technical report, 1997 | paper access |
| Prototype-based approach | Original paper: E.C.Gonzalez et al., Effective proximity retrieval by ordering permutations, Trans. Pattern Analysis and Machine Intelligence, 2008 | paper access original implementation (in C#) |
| LSH-based approach | Original paper: P.Indyk et al., Approximate nearest neighbors: towards removing the curse of dimensionality., ACM SAC, 1998 Implementation based on: M.Data et al., Locality-sensitive hashing scheme based on p-stable distributions, Int. Symp. on Comp. Geometry, 2004 | paper access paper access |
| R-Tree | Original paper: A.Guttman, A dynamic index structure for spatial searching, SIGMOD, 1984 Original implementation from the GeoCraft open source framework; adjusted in order to add exact match functionality to the R-Tree | paper access original implementation |
| k-d-Tree | Original paper: J.L.Bentley, Multidimensional binary search trees used for associative searching, Comm. ACM, 1975 Original implementation from Simon Levy's software archive | paper access original implementation |
Integrating New Index Structures
One of our aims is to maintain an extensible framework so that other researcher can contribute to it and use it for their specific purposes. A central aspect of this extensibility is the possibility to add new index structures to the framework. To this end, an API exists that has to be implemented for each new index structure. You can find a documentation (JavaDoc style) of this interface here: AIndexStructure -- DocumentationVisualizing Index Structures
Coming soon...Download
The current version of QuEval can be downloaded as ZIP-compressed archive: QuEval.zip After downloading the file, extract it to your file system. Afterwards, proceed as follows (where QUEVAL_PATH denotes the main directory of the extracted framework):Windows
- Go to QUEVAL_PATH and execute the file start-framework.bat. The framework should start (see the screenshot below for an example).
Mac OS/Linux
- Open the terminal and type cd QUEVAL_PATH/Framework.
- Start the framework by executing the file framework.jar as follows: java -jar framework.jar

- Screenshot of the QuEval GUI immediately after starting the application.
Source Code
QuEval is released under the EPL License. The source code will be available here until end of January 2012.Example Data
Recently, we conducted a large-scale case study for the implemented index structures. We will make the data of this case study available until end of January 2012. Stay tuned ;)Contact
QuEval is developed mainly at the University of Magdeburg, Germany. It is open source and we are currently working on the first release so that everybody who is interested can extend/improve it. For information about the project, technical questions and bug reports: please contact the development team via Martin Schäler.
Project members:- Alexander Grebhahn
- Tim Hering
- Christina Pielach
- Martin Schäer
- Reimar Schröter
- Sandro Schulze
- Björn-Erik Aust
- David Broneske
- Denis Dietze
- Maik Lampe
- Tran Tuan Ngyuen
