Medusa Query Engine
Medusa is the code name for a next generation streaming query engine for dotNetRDF.
The current prototype development work is ongoing on the 1.9 branch in the new dotNetRDF.Sparql.Core library.
Motivation
Our current Leviathan Engine is what is typically characterised as a block based engine, this means that it works by evaluating one chunk of the query in full before moving onto the next. While this has some advantages it also some key disadvantages, namely:
- Very memory inefficient for queries with large intermediate results
- Can't return any results until all results are generated
While the current version of Leviathan does have some limited support for lazy evaluation of certain queries this is by no means fantastic and is limited to a very small subset of queries.
A streaming engine by contrast is designed to evaluate the query as lazily as possible, in the .Net context this means it will be implemented almost entirely using LINQ. The advantages of a streaming engine are that it avoids the stated disadvantages of a block based engine. A streaming engine has much lower memory overhead because it only needs enough memory to calculate the next answer, this also means that it can start returning results as soon as it finds it's first results. Bear in mind that it does still have some disadvantages:
- May end up doing more work than a block based engine for some kinds of query
- May be harder to parallelize evaluation for some kinds of query
Note that since we will use LINQ to build the Medusa engine using PLINQ to parallelize things is an obvious option and one Leviathan already uses internally to great effect. However with a streaming engine we will have to be careful of when to parallelize because some things will not be amenable to parallelization e.g. ORDER BY
.
Design
The Medusa query engine will be implemented as a ground up rewrite of the existing query engine. Where appropriate it will port and re-use existing code but past experimentation has shown that the current API makes the design we want to follow unworkable. Therefore the decision has been taken to write the new engine mostly from scratch, this will likely be quicker than trying to bend the existing API to the new design.
The design borrows heavily from the design of the streaming Apache Jena ARQ engine where appropriate.
Results API
A new IQueryResult
API is introduced replacing the current poor design of returning Object
and relying on users casting it appropriately. This API provides properties for determining what kind of result it is and properties for accessing strongly typed results.
The SparqlResultSet
API is replaced with a new ITabularResults
API which embodies the notion that results may be streaming i.e. not cached into memory. Extensions of this API that allow for random access and caching into memory are also provided.
Query API
The existing immutable SparqlQuery
API is replaced with a new mutable IQuery
interface and a basic Query implementation. The intention is that a IQuery
is entirely mutable so it can be built by a parser or programmatically from code.
The existing GraphPattern
API is replaced with a new IElement
API based on the Element
API from ARQ. This is necessary to resolve several problems that have emerged over the years in how graph patterns are represented. One feature of this change is that ITriplePattern
and its associated APIs are removed entirely, things are either a IElement
or they are a Triple
contained within an appropriate IElement
instance.
Algebra API
The ISparqlAlgebra
API is replaced with a simplified IAlgebra
API. One of the big changes here is that the representation of the algebra is separated from the evaluation thereof. This makes it possible to build new/alternative query engines in the future based upon the same API without needing significant API rewrites. It also reduces some complexity with their being multiple implementations of certain core algebra concepts with different evaluation strategies.
The ISet
API is renamed to ISolution
to better describe it and has some additional methods added.
In the new design the evaluation strategy is managed separately by a IAlgebraExecutor
interface which will make it much easier to extend and improve specific parts of the evaluation engine as necessary. As already noted the Medusa engine will evaluate algebra by converting each operation into an appropriate IEnumerable<ISolution>
thus meaning that the query is evaluated lazily i.e. until the solution sequence is enumerated minimal real work is done.
A IAlgebraVisitor
interface is provided for accessing algebra in an easy fashion and a IAlgebraTransform
interface is provided for transforming algebra from one form to another e.g. in evaluation.
Join Strategies
One of the flexibilities that the new design affords us is the ability to selectively choose different join strategies rather being locked into the current choices of join strategy. A IJoinStrategy
API is provided which includes APIs for selecting the join strategy to use, implementations of the different join strategies and the ability to easily add new join strategies in future.
The current prototype has cross products, loop joins and two variations of hash join. There are also simple decorators that change joins into left/minus joins as necessary. In future we will probably also look at adding merge joins.
Expression Evaluation
The ISparqlExpression
interface is converted to the IExpression
interface and some additional methods added. Specific sub-interfaces for the different kinds of expression are explicitly provided. The existing implementations can be ported over with a certain amount of rewriting required to adapt them to the API changes.
The main change to the interface is the Evaluate()
method, currently its signature looks like the following:
IValuedNode Evaluate(SparqlEvaluationContext context, int bindingID);
We propose to change this to the following:
IValuedNode Evaluate(ISolution solution, IExpressionContext context);
Some form of context is still needed because some forms of expressions need to have state persisted through the life of the query (e.g. the NOW()
function). Also some functions need a reference back to the algebra processor (needed for EXISTS
and NOT EXISTS
).
Another change that will be made is late binding of extension functions. Extension functions will be represented by a generic placeholder expression that will late bind to the actual function at evaluation time. Thus the available custom expression factories will form part of the IExpressionContext
Status
The status of the Medusa prototype is as follows:
- Results API implemented
- IQuery API implemented
- IQuery to IAlgebra compilation mostly implemented
- Algebra API mostly implemented
- Algebra evaluation mostly implemented
- Minimal IExpression API
The following features are missing:
- Most expressions are not yet ported
- No SPARQL Query parser
- No SPARQL Updates
- No Group By, Property Path, Service and Property Function evaluation
This means that basic queries can be built and evaluated programmatically but you can't yet parse and run an arbitrary SPARQL query.
Also what is implemented is still in an early stage so APIs may evolve as the implementation continues. The testing is currently fairly limited.