Introduction
TopX is a search engine for ranked retrieval of XML (and plain-text)
data, developed at the Max-Planck Institute for Informatics. TopX
supports a probabilistic-IR scoring model for full-text content
conditions and tag-term combinations, path conditions for all XPath
axes as exact or relaxable constraints, and ontology-based relaxation
of terms and tag names as similarity conditions for ranked retrieval.
To speed up top-k queries, TopX employs several techniques:
probabilistic models as efficient score predictors for a variant of the
threshold algorithm, judicious scheduling of sequential accesses for
scanning index lists and random accesses to compute full scores,
incremental merging of index lists for on-demand, self-tuning query
expansion, and a suite of specifically designed, precomputed indexes to
evaluate structural path conditions.
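
As a rough illustration of the threshold-algorithm idea mentioned above, the following is a minimal sketch of the classic textbook threshold algorithm over score-sorted index lists. It is not TopX code: it leaves out TopX's probabilistic score prediction and access scheduling, and the names ThresholdTopK, Posting, and topK are hypothetical. It assumes non-negative scores that are aggregated by summation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

/** Illustrative threshold-algorithm sketch; hypothetical names, not part of TopX. */
final class ThresholdTopK {

    /** One index-list entry: a document id and its score for one query condition. */
    static final class Posting {
        final String doc;
        final double score;

        Posting(String doc, double score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /**
     * Returns the ids of the k documents with the highest summed score, best first.
     * Every input list must be sorted by descending, non-negative score.
     */
    static List<String> topK(List<List<Posting>> lists, int k) {
        List<String> result = new ArrayList<>();
        if (k <= 0) {
            return result;
        }

        // Per-list lookup tables stand in for random accesses to an index.
        List<Map<String, Double>> lookup = new ArrayList<>();
        int maxDepth = 0;
        for (List<Posting> list : lists) {
            Map<String, Double> scores = new HashMap<>();
            for (Posting p : list) {
                scores.put(p.doc, p.score);
            }
            lookup.add(scores);
            maxDepth = Math.max(maxDepth, list.size());
        }

        // Min-heap holding the current top-k candidates, weakest candidate on top.
        PriorityQueue<Posting> best =
                new PriorityQueue<>(Comparator.comparingDouble((Posting p) -> p.score));
        Set<String> seen = new HashSet<>();

        for (int depth = 0; depth < maxDepth; depth++) {
            // Upper bound on the total score of any document not seen so far.
            double threshold = 0.0;
            for (List<Posting> list : lists) {
                if (depth >= list.size()) {
                    continue; // an exhausted list contributes 0 to the bound
                }
                Posting p = list.get(depth); // sequential access
                threshold += p.score;
                if (seen.add(p.doc)) {
                    double total = 0.0;
                    for (Map<String, Double> scores : lookup) {
                        total += scores.getOrDefault(p.doc, 0.0); // random access
                    }
                    best.add(new Posting(p.doc, total));
                    if (best.size() > k) {
                        best.poll(); // drop the weakest candidate
                    }
                }
            }
            // Stop once no unseen document can still enter the top k.
            if (best.size() == k && best.peek().score >= threshold) {
                break;
            }
        }

        while (!best.isEmpty()) {
            result.add(0, best.poll().doc); // the heap yields the weakest first
        }
        return result;
    }
}
```

For two lists [(d1, 0.9), (d2, 0.8), (d3, 0.1)] and [(d2, 0.7), (d3, 0.6), (d1, 0.2)], calling topK with k = 2 returns [d2, d1]. The scan stops as soon as the weakest of the current top-k candidates scores at least as high as the threshold, so the tails of long index lists need not be read.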
TopX has been stress-tested and experimentally evaluated on a variety
of datasets including the TREC Terabyte benchmark, the INEX XML
information retrieval benchmark, and an XML version of the Wikipedia
encyclopedia. TopX has also served as a reference engine for the INEX
2006 benchmarking initiative. It can be accessed for interactive
queries on various datasets.
The TopX software comprises two major parts:
- the TopX indexer and a suite of data import and indexing methods,
- the TopX search servlet and its underlying class libraries.
Both use a JDBC-compliant SQL database system as a backend. The current
implementation is based on Oracle 9i or 10g; other backend
systems may require configuration and minor code modifications. The
TopX servlet runs under the Tomcat servlet engine.
TopX has been developed by Martin Theobald in the Databases and
Information Systems Research Group (D5) at the Max-Planck Institute for
Informatics. More information about the models and algorithms in TopX
can be found at the homepage of the D5 group and especially in Martin
Theobald's dissertation.
If you use TopX in your scientific work, please cite it as:
Martin Theobald, Ralf Schenkel, Gerhard Weikum: An Efficient and
Versatile Query Engine for TopX Search. 31st International Conference
on Very Large Data Bases (VLDB), Trondheim, Norway, 2005.
The paper is available here (see here for its BibTeX entry).
Resources
See here for an installation guide, here for the main SourceForge page
of TopX, and here for a precompiled archive (including the source
files).