dotNetRDF 3.0 Design
dotNetRDF 3.0 will start to implement much of the restructuring that Rob had originally planned for the 2.0 release but which was put on hold as more effort went into building out new features and supporting additional syntaxes.
The following are the original 2.0 design plans, kept here for reference; actual planning for 3.0 is taking place on the dotNetRDF 3.0 project board on GitHub.
TODO: This page should be updated as we get closer to tackling some of the issues described below.
API Modularization
Currently the Core API has become quite bloated and it would make sense to divide some features out into their own separate libraries. The rough proposal is as follows:
| Module | APIs |
|---|---|
| dotNetRDF.dll | Core data model, Configuration and Formatting APIs |
| dotNetRDF.IO.dll | Parsing and Writing APIs |
| dotNetRDF.IO.Json.dll | Parsing and Writing implementations that require Json.NET |
| dotNetRDF.IO.RdfA.dll | Parsing and Writing implementations that require HtmlAgilityPack to support RDFa |
| dotNetRDF.Sparql.dll | SPARQL Query and Update engine |
| dotNetRDF.Sparql.FullText.dll | dotNetRDF.Query.FullText renamed |
| dotNetRDF.Ontology.dll | Ontology API, likely expanded |
| dotNetRDF.Storage.dll | Storage API |
| dotNetRDF.Storage.Virtuoso.dll | dotNetRDF.Data.Virtuoso renamed |
| dotNetRDF.Web.dll | Web API, abstractions of current ASP.Net support |
| dotNetRDF.Web.Asp.dll | ASP.Net implementation of Web API |
API Refactors
There are a number of API refactors we wish to make as part of a 2.0 release and these are detailed here; they range from minor clean-up to fairly significant refactors.
Data Model Refactor
The core data model has become somewhat large and in places overly complex, with multiple ways of achieving the same thing, not to mention confusing and misleading interface names in many places. There are a variety of refactors we are proposing, all of which aim to simplify the API and in some cases reduce the memory footprint and the complexity of usage.
Nodes Refactor
I propose that we refactor the Nodes API as follows:
- Remove the tight coupling between a Node and the Graph it originated from
- Allow for free moving of Nodes between different Graphs
- Identify Blank Nodes in a way that allows the above
- Don't require Nodes themselves to do the work of creating Node values
- Move all Node classes to the Nodes namespace
- Consolidate nodes into a single interface
The first point means removing the `Graph` and `GraphUri` properties from the `INode` interface; doing so removes the coupling between a Node and its originating Graph. This has particular benefits for things like creating wrappers around factories and graphs, since it will eliminate some of the current issues we can encounter involving Graph reference mismatches.
Once this is done, point 2 becomes easy: provided we fix point 3, Nodes simply represent node values and can be freely copied between Graphs. As a result the existing `CopyNode()` methods can all be deprecated and ultimately removed.
Point 5 is primarily just a clean-up activity to better organize the code going forward.
Point 6 is likely the most significant: rather than having a separate interface for each Node type, all the relevant methods will be directly on `INode`. The contract will be that trying to access something that doesn't make sense for the relevant `NodeType` will result in a `NodeValueException`. Making this change avoids the need for lots of casting throughout the code base and should make code substantially easier to read and write.
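A rough sketch of what the consolidated interface might look like; the member names here are illustrative, not final, and `NodeType` and `NodeValueException` are assumed to be defined elsewhere:

```csharp
using System;

// Illustrative sketch of a consolidated INode; member names are not final.
public interface INode : IEquatable<INode>
{
    NodeType NodeType { get; }

    // Each member below is only meaningful for certain NodeTypes;
    // accessing one that doesn't apply throws a NodeValueException.
    Uri Uri { get; }           // URI nodes
    Guid AnonId { get; }       // Blank nodes
    string Value { get; }      // Literal nodes
    Uri DataType { get; }      // Literal nodes
    string Language { get; }   // Literal nodes
}
```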
Blank Node Identification
Point 3 is somewhat trickier; currently we use user-defined/auto-assigned string labels to identify blank nodes. This has proved to be a poor implementation decision, requiring tons of hacks and workarounds to have blank nodes be properly scoped and to avoid collisions. My initial thought was to identify blank nodes by a combination of two Guids: one identifying the Node and one identifying the Factory. This means that Blank Nodes wouldn't strictly meet point 1 because they would be tied to their factory; in practice this is likely not necessary.
Thus it will depend on whether we allow users to create Blank Nodes using an explicit Guid; if we do then we might need the Factory ID to ensure correct scoping. If we don't allow this we can likely get away with a single Guid, since it is for all intents and purposes a unique identifier and so will also provide inherent scoping. However, the current consensus in the community is that a blank node can span multiple graphs and be the same node; it is only the serialization of that node which is scoped to a specific graph (or, in the case of dataset formats, to the dataset).
One thing we may want to do in making this change is to still allow users to create blank nodes by human-readable label and simply map these consistently to Guids internally (within the scope of a factory). However, a better solution might be to push this responsibility onto parsers, which removes the need for graphs and node factories to track the mapping. Ultimately this will reduce memory usage within the system, as the mapping will then be temporary, living only for the life of a parser. There are techniques which can make this mapping extremely memory efficient, which Jena already uses, and we should look at adopting those in our parsers.
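A minimal sketch of the single-Guid scheme, assuming user-supplied Guids are not allowed (the class shape here is illustrative):

```csharp
using System;

// Sketch: a blank node's identity is a Guid, not a string label,
// so it scopes itself and can be shared freely across graphs.
public sealed class BlankNode
{
    public Guid AnonId { get; }

    public BlankNode() : this(Guid.NewGuid()) { }
    public BlankNode(Guid id) { AnonId = id; }

    public override bool Equals(object obj) =>
        obj is BlankNode b && b.AnonId == AnonId;

    public override int GetHashCode() => AnonId.GetHashCode();
}
```

Under this scheme a parser could keep a temporary `Dictionary<string, Guid>` mapping labels to Guids for the duration of a parse and discard it afterwards.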
Node Creation
Point 4 refers to the fact that internally some nodes are created by passing a Graph reference and having the node call back to the graph for data such as a new Blank Node ID or a resolved Prefixed Name. This again was a poor design decision, so we will remove the relevant constructors and instead require the Node Factory creating the Node to provide the Blank Node ID or resolved URI.
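A hypothetical sketch of the inverted responsibility, where the factory resolves everything before constructing the node; `UriNode` is assumed to take a finished `Uri` and the `ResolvePrefixedName` helper is purely illustrative:

```csharp
using System;

// Sketch: the factory mints Blank Node IDs and resolves prefixed
// names itself; nodes receive finished values and never call back
// into a graph.
public class NodeFactory
{
    public BlankNode CreateBlankNode()
    {
        return new BlankNode(Guid.NewGuid());
    }

    public UriNode CreateUriNode(string prefixedName)
    {
        // Resolution happens here (e.g. against the factory's
        // namespace map), not inside the node constructor.
        Uri resolved = ResolvePrefixedName(prefixedName);
        return new UriNode(resolved);
    }

    private Uri ResolvePrefixedName(string prefixedName)
    {
        // Placeholder: a real implementation would consult a
        // namespace map to expand e.g. "rdf:type" to a full URI.
        throw new NotImplementedException();
    }
}
```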
Triple Refactor
In light of the Nodes refactor we will make some similar changes to the Triple API:
- Remove the reference to the Graph on a Triple
- Remove the rarely used `Context` parameter on a Triple
This is intended to simplify the `Triple` class, reduce its memory footprint and get to a data model where it purely represents a triple. It also simplifies the constructor for `Triple` since we no longer need to validate that the Nodes originate from the same Graph. Removing the `Context` parameter only really affects the rarely used N3 function contexts, a feature we don't truly support anyway and which I would prefer to drop.
With the removal of the `Graph` property we need to introduce a proper `Quad` class to represent Quads. The `Quad` class will have essentially the same API as a Triple but with the addition of a `Graph` property which will return an `INode`.
Note: We wish to make the `Graph` property return an `INode` to support non-URI names for graphs. Though this goes beyond standard RDF, it is common across other APIs and we should aim to be as broad and extensible as possible.
Implementation-wise I intend to make `Quad` a standalone class rather than an extension of `Triple`, because we don't want to allow implicit casting of `Quad` to `Triple`. Users should always be aware that this is a lossy operation, so an `AsTriple()` method will be provided to do it explicitly. Conversely, `Triple` will likely provide `AsQuad(Uri graphUri)` and `AsQuad(INode graphNode)` methods. This decision also means we can implement `Quad` as a decorator over a `Triple`, allowing us to reduce the memory footprint further.
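A sketch of the decorator approach, assuming the refactored `Triple` exposes `Subject`/`Predicate`/`Object` properties:

```csharp
// Sketch: Quad wraps an existing Triple and adds only the graph
// name, so the triple's nodes are never duplicated.
public sealed class Quad
{
    private readonly Triple _t;

    public Quad(Triple t, INode graph)
    {
        _t = t;
        Graph = graph;
    }

    public INode Subject { get { return _t.Subject; } }
    public INode Predicate { get { return _t.Predicate; } }
    public INode Object { get { return _t.Object; } }
    public INode Graph { get; private set; }

    // Deliberately explicit and lossy: the graph name is discarded.
    public Triple AsTriple()
    {
        return _t;
    }
}
```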
IGraph Refactor
The Nodes refactor will result in a couple of minor changes to the `IGraph` API as it currently stands:
- There will no longer need to be a two-argument form of `Merge()` since Nodes don't carry a reference to their Graph URI that needs preserving
- `GetNextBlankNodeID()` becomes obsolete and will eventually be removed
- Remove all defunct `GetXNode()` methods
As a result of this and the Nodes refactor there will be some implementation benefits for `IGraph`. For example, `Merge()` becomes very simple: just `Assert()` the triples from one graph into the other, since there is no need to worry about mapping Blank Node IDs. This is particularly true since we will be explicitly allowing blank node sharing across graphs.
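As a sketch, a merge could then reduce to something like this (the method is shown out of context of its implementing class):

```csharp
// Sketch: with graph-independent Nodes there is no Blank Node ID
// remapping step - merging is just asserting the other graph's triples.
public void Merge(IGraph other)
{
    foreach (Triple t in other.Triples)
    {
        this.Assert(t);
    }
}
```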
The more significant changes we propose for the `IGraph` API are as follows:
- Remove the `BaseUri` property entirely; the current name is a misnomer and it is generally used as the graph name. It would be better to have no name associated with an `IGraph` directly because it is purely a set of triples; graph names would be a property of a storage implementation for sets of graphs.
- Add a `Quads` property to retrieve the Triples in `Quad` form
- Make the `Triples` property return an `IEnumerable<Triple>` instead of a `BaseTripleCollection`
- Simplify the set of `GetTriplesWithXY()` methods into a single `Find(INode s, INode p, INode o)` method where any/all arguments may be null to signify matching anything
- Add a convenience `RetractWhere()` method so that internally implementations can avoid materializing deletion candidates where possible
- Remove other defunct selection-related methods
- Add an `Edges` property which returns the `INode` instances used in the predicate position
- Rename the `Nodes` property to `Vertices` to more accurately reflect what is returned
The third change is the most significant but also the most important: it hides the internal implementation details of how triples are stored (more than they already are) and makes it much easier to implement Graphs that don't have any direct in-memory storage if desired.
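For illustration, a hypothetical use of the proposed `Find()` method might look like this, assuming `g` is an `IGraph` and `factory` a node factory in scope:

```csharp
INode rdfType = factory.CreateUriNode(new Uri("http://www.w3.org/1999/02/22-rdf-syntax-ns#type"));
INode person = factory.CreateUriNode(new Uri("http://example.org/Person"));

// All rdf:type triples (null acts as a wildcard):
IEnumerable<Triple> typed = g.Find(null, rdfType, null);

// All instances of ex:Person:
IEnumerable<Triple> people = g.Find(null, rdfType, person);

// Everything, equivalent to the Triples property:
IEnumerable<Triple> all = g.Find(null, null, null);
```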
ITripleStore/ISparqlDataset Refactor
The `ITripleStore` interface is both poorly named and a messy API, and it overlaps heavily with the `ISparqlDataset` API. I suggest consolidating the functionality of the two APIs (while removing irrelevant functionality) into a single `IGraphStore` API. The new name more accurately reflects the purpose and the API would be cleaner; excerpts of the proposed API are given below:
```csharp
using System;
using System.Collections.Generic;

interface IGraphStore
{
    IEnumerable<Uri> GraphUris { get; }
    IEnumerable<IGraph> Graphs { get; }
    IGraph this[Uri u] { get; }

    bool HasGraph(Uri u);
    bool Add(IGraph g);
    bool Add(IGraph g, Uri graphUri);
    bool AddTriple(Uri graphUri, Triple t);
    bool AddQuad(Quad q);
    bool Copy(Uri srcUri, Uri destUri);
    bool Move(Uri srcUri, Uri destUri);
    bool Remove(Uri u);

    // Get all Triples in the Store
    IEnumerable<Triple> Triples { get; }

    // Find matching triples
    IEnumerable<Triple> Find(INode subj, INode pred, INode obj);

    // Get all Quads in the Store
    IEnumerable<Quad> Quads { get; }

    // Find all matching quads
    IEnumerable<Quad> Find(INode graph, INode subj, INode pred, INode obj);

    // Is a Triple found anywhere in the Store
    bool ContainsTriple(Triple t);

    // Is the Triple contained in the given Graphs
    bool ContainsTriple(IEnumerable<Uri> graphUris, Triple t);

    // Does a Quad exist in the Store
    bool ContainsQuad(Quad q);
}
```
Note that while this interface outline is by no means complete, it deliberately does not include the active and default graph management portions of the `ISparqlDataset` API. It is proposed that the burden of tracking what constitutes those graphs is the job of the query engine; likely we will introduce custom algebra to cover this or manage it purely in the query processors.
The `Graphs` property will also no longer expose a `BaseGraphCollection`; that class will instead become purely the standard backing implementation for our in-memory implementations of this interface.
With this interface in place we will need to introduce new implementations covering our existing key scenarios:
- An in-memory implementation
- A storage-backed implementation
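A hypothetical usage of the proposed API; the `InMemoryGraphStore` class name is illustrative, standing in for whichever in-memory implementation is introduced:

```csharp
IGraphStore store = new InMemoryGraphStore(); // assumed in-memory implementation

IGraph g = new Graph();
store.Add(g, new Uri("http://example.org/graph1"));

// Enumerate the whole store at quad granularity:
foreach (Quad q in store.Quads)
{
    Console.WriteLine(q);
}

// Find across all graphs - nulls are wildcards, as in IGraph.Find():
IEnumerable<Quad> matches = store.Find(null, null, null, null);
```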
IO APIs Refactor
The Data Model refactors outlined above will facilitate some refactoring and streamlining of the existing IO APIs, particularly in the Parsing namespace. I propose to remove the `IStoreReader` interface and generalize the `IRdfReader` interface so that an RDF parser can parse into either an `IGraph` or an `IGraphStore`. This will mean adding overloads of the `Load()` method which take an `IGraphStore` instead.
Since there will no longer be a `BaseUri` property on graphs, it will also be useful to provide a Base URI to readers and writers for the purposes of parsing and serialization, so appropriate overloads will be required for this.
As a side effect of the above, the `IRdfHandler` interface will need to be expanded with a `HandleQuad()` method; parsers will call either `HandleTriple()` or `HandleQuad()` depending on the data they produce. This also allows handlers to process triples and quads differently if they so desire.
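A sketch of the expanded handler contract; the names follow the existing API but the exact shape shown here is not final:

```csharp
// Sketch: a handler receives whichever granularity the parser produces.
public interface IRdfHandler
{
    void StartRdf();
    void EndRdf(bool ok);

    // Called by triple-producing parsers (e.g. Turtle, RDF/XML);
    // returning false aborts parsing early.
    bool HandleTriple(Triple t);

    // Called by quad-producing parsers (e.g. TriG, NQuads).
    bool HandleQuad(Quad q);
}
```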
Formatting API Improvements
On the writing side we will introduce an `IQuadFormatter` interface which provides a `Format(Quad q)` method, analogous to the existing `ITripleFormatter` interface. Similar to the changes on the parsing side, we will replace the `IStoreWriter` interface with an expanded `IRdfWriter` interface.
We may also expand the formatting APIs to allow them to be passed a `TextWriter` directly in order to eliminate some of the overhead of generating and then writing a string. The existing overloads that return strings can be rewritten in terms of `StringWriter` to minimize code duplication.
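A sketch of the pattern using a hypothetical quad formatter; the string-returning overload is kept but rewritten over the streaming one:

```csharp
using System.IO;

public class ExampleQuadFormatter
{
    // Streaming overload: writes directly to the destination with
    // no intermediate string allocation.
    public void Format(Quad q, TextWriter output)
    {
        // Illustrative only - a real formatter emits proper syntax.
        output.Write("{0} {1} {2} {3} .", q.Subject, q.Predicate, q.Object, q.Graph);
    }

    // String overload: defined in terms of the streaming one to
    // avoid code duplication.
    public string Format(Quad q)
    {
        using (StringWriter writer = new StringWriter())
        {
            Format(q, writer);
            return writer.ToString();
        }
    }
}
```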
Query APIs Refactor
As a result of other refactors there will be some changes required in the Query API, most notably that it needs to reflect the fact that queries will now operate over an `IGraphStore` instance and must track the default/active graphs themselves.
The mechanism by which we will track the default/active graphs is yet to be determined.
Query Parser and AST
While the current query parser and AST give us full conformance with the SPARQL specification, their code is quite ugly and hacky in some respects. We propose to replace the parser with a new one, either hand-written or built using a parser generator with good .Net support such as SableCC. As part of this rewrite we would change the AST representation to align more closely with that used by ARQ, as that would allow us to eliminate some of the ugly code we currently have.
Query Results
Currently methods that can take any type of query return an `Object` and the user must cast to the appropriate `IGraph` or `SparqlResultSet`. This has proved to be a less than ideal decision, so I would like to introduce a general `QueryResult` class like so:
```csharp
class QueryResult
{
    bool IsResultSet { get; }
    bool IsGraph { get; }
    bool IsBoolean { get; }

    SparqlResultSet Results { get; }
    IGraph Graph { get; }
    bool BooleanResult { get; }
}
```
Doing this makes people's code simpler since they don't need to cast, and it also allows us to expand the result returns in the future to include additional information, e.g. execution time.
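For example, consuming code might look like this sketch, assuming a hypothetical `processor.ProcessQuery(query)` call that returns a `QueryResult`:

```csharp
QueryResult result = processor.ProcessQuery(query);

if (result.IsGraph)
{
    // CONSTRUCT/DESCRIBE queries produce a graph
    IGraph g = result.Graph;
}
else if (result.IsResultSet)
{
    // SELECT queries produce a result set
    foreach (SparqlResult r in result.Results)
    {
        Console.WriteLine(r);
    }
}
else if (result.IsBoolean)
{
    // ASK queries produce a boolean
    Console.WriteLine(result.BooleanResult);
}
```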
We also intend to abstract `SparqlResultSet` into an interface and refactor it somewhat to support true streaming of results, rather than having to buffer the entire result set into memory.
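One possible shape for such an abstraction, purely as a sketch (the interface name is illustrative):

```csharp
using System.Collections.Generic;

// Sketch: enumeration pulls results from the underlying source on
// demand instead of buffering the whole result set in memory.
public interface ISparqlResults : IEnumerable<SparqlResult>
{
    IEnumerable<string> Variables { get; }
}
```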
Query Processor Refactor
Currently `LeviathanQueryProcessor` is visitor-like in its implementation but most of the real implementation is hidden inside the algebra classes. I would like to move all implementation into the query processor itself. This makes the code easier to maintain, since it's all in one place, and much easier for people to override specific parts of the implementation.
Some more thoughts on how the query processors can be refactored to support a new streaming engine can be found in the [[Medusa Query Engine|DeveloperGuide-Architecture-Design-Medusa]] design document.
ISparqlExpression Refactor
Right now, to evaluate an expression you need to pass in the entire query context; realistically all you actually need is the `ISet` you are applying the expression to and possibly some limited context. Making this refactor has a couple of blockers right now:
- Some expressions like `EXISTS` have semantics somewhat at odds with most other functions
- We should implement `IEquatable<ISparqlExpression>` so that we can more easily perform expression manipulations
With equality in place it then becomes relatively easy to scan the expression tree to find aggregates and replace them with temporary variables (or their assigned variables) in the same way that ARQ does. In the spirit of the Query Processor Refactor, and to make this easier, we should also consider refactoring `IExpressionTransformer` into a visitor interface as part of these changes.
If we make this refactor it also paves the way for writing a true streaming engine that does not process the query in blocks like Leviathan does, but rather operates more in the manner of ARQ.
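As an illustration, a visitor-style replacement for `IExpressionTransformer` might look roughly like this; the expression types listed are examples, not an exhaustive set:

```csharp
// Sketch: one Visit overload per expression type; implementations
// return a (possibly rewritten) expression, enabling manipulations
// such as replacing aggregates with temporary variables.
public interface ISparqlExpressionVisitor
{
    ISparqlExpression Visit(VariableTerm variable);
    ISparqlExpression Visit(ConstantTerm constant);
    ISparqlExpression Visit(AggregateTerm aggregate);
}
```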
Optimizer Refactor
As with other proposed refactors, changing `IAlgebraOptimiser` to be a visitor would make it easier to use and extend going forward.
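A sketch of the visitor shape, with a few representative algebra operators shown rather than the full set:

```csharp
// Sketch: the optimiser walks the algebra tree and returns a
// (possibly rewritten) operator at each node.
public interface IAlgebraVisitor
{
    ISparqlAlgebra Visit(Bgp bgp);
    ISparqlAlgebra Visit(Filter filter);
    ISparqlAlgebra Visit(Join join);
}
```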