Full Text Querying with SPARQL
Full Text SPARQL is a non-standard extension to our SPARQL engine provided in the separate dotNetRDF.Query.FullText.dll library. This library is included in the standard distribution from the 0.6.0 release onwards.
It uses Lucene.Net to build and query full text indexes and allows you to leverage this capability directly from SPARQL queries. This document will guide you through the process of creating and querying a full text index.
General Usage
To use the library you'll need to add a reference to dotNetRDF.Query.FullText.dll to your project (or install it via NuGet). You should also ensure that Lucene.Net is included in your project, since it provides the actual indexing and query functionality. Using NuGet is the preferred way to install because it will sort out dependencies and framework versions for you.
The majority of the classes provided by this library can be found in the VDS.RDF.Query.FullText namespace. The only other class you'll typically need is the FullTextOptimiser, which is located in the VDS.RDF.Query.Optimisation namespace.
Creating an Index
Before you can perform full text queries you must first build an index from your RDF data. To do this you use an instance of the IFullTextIndexer interface. An indexer provides the means to index Triples, Graphs and Datasets, and builds an index which relates the full text of literal objects to one of the nodes of each Triple.
Currently the following implementations are available:
Indexer | Description |
---|---|
LuceneSubjectsIndexer | Relates the Subject of the Triple to the full text of the Literal Object |
LucenePredicatesIndexer | Relates the Predicate of the Triple to the full text of the Literal Object |
LuceneObjectsIndexer | Relates the Object of the Triple to its own full text |
So let's look at an example of building an index:
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Store;
using VDS.RDF;
using VDS.RDF.Query.FullText;
using VDS.RDF.Query.FullText.Indexing;
using VDS.RDF.Query.FullText.Indexing.Lucene;
using VDS.RDF.Query.FullText.Schema;

public class FullTextIndexingExample
{
    public static void Main(String[] args)
    {
        IFullTextIndexer indexer = null;
        try
        {
            //First get a Graph we want to Index
            Graph g = new Graph();
            g.LoadFromFile("example.ttl");

            //Then create an indexer and index the data
            //Lucene.Net 3.0.x analyzers require a version argument
            indexer = new LuceneSubjectsIndexer(FSDirectory.Open("example"), new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30), new DefaultIndexSchema());
            indexer.Index(g);
        }
        catch (Exception ex)
        {
            //Handle any errors that occurred during Indexing
        }
        finally
        {
            //Always dispose of your indexer when the index is built to ensure that indexed data is persisted to the index
            if (indexer != null) indexer.Dispose();
        }
    }
}
Note that when we created the indexer we passed in a Lucene.Net Directory and an Analyzer; you can use whatever implementations of these you like with our indexers. The DefaultIndexSchema is a schema that controls how the indexed data is stored in fields on the documents in the index. For most use cases you will only ever need this default implementation, but advanced users can implement their own.
Querying an Index
To query an index you use an IFullTextSearchProvider instance. Currently there is a single implementation, LuceneSearchProvider. The following example demonstrates its usage:
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Store;
using VDS.RDF;
using VDS.RDF.Query.FullText;
using VDS.RDF.Query.FullText.Search;
using VDS.RDF.Query.FullText.Search.Lucene;
using VDS.RDF.Query.FullText.Schema;
using VDS.RDF.Writing.Formatting;

public class FullTextSearchExample
{
    public static void Main(String[] args)
    {
        //This example assumes we've already created our index in a folder called example
        IFullTextSearchProvider provider = null;
        try
        {
            //Get a Lucene Search Provider
            //Lucene.Net 3.0.x analyzers require a version argument
            provider = new LuceneSearchProvider(Lucene.Net.Util.Version.LUCENE_30, FSDirectory.Open("example"), new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30), new DefaultIndexSchema());

            //Use it to make a search and print the results
            NTriplesFormatter formatter = new NTriplesFormatter();
            foreach (IFullTextSearchResult result in provider.Match("text"))
            {
                Console.WriteLine("Node: " + result.Node.ToString(formatter) + " - Score: " + result.Score.ToString());
            }
        }
        catch (Exception ex)
        {
            //Handle any exceptions that occur during querying
        }
        finally
        {
            //Always dispose of a search provider when done as not doing so may cause problems with other code accessing your index
            if (provider != null) provider.Dispose();
        }
    }
}
As with our previous example, the LuceneSearchProvider takes a Lucene.Net Directory and Analyzer plus an IFullTextIndexSchema.
Note: This constructor allows you to omit the Analyzer, the Schema, or both. Any omitted argument defaults to the Lucene.Net StandardAnalyzer or the DefaultIndexSchema respectively.
Full Text Querying with SPARQL
Now that you've seen how to build and query an index programmatically, let's look at how to make a full text query via SPARQL. To do this you need to create an instance of the FullTextOptimiser and attach it to your SPARQL queries as an Algebra Optimiser.
The following example shows how to do this:
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Store;
using VDS.RDF;
using VDS.RDF.Parsing;
using VDS.RDF.Query;
using VDS.RDF.Query.Datasets;
using VDS.RDF.Query.FullText;
using VDS.RDF.Query.FullText.Search;
using VDS.RDF.Query.FullText.Search.Lucene;
using VDS.RDF.Query.Optimisation;
using VDS.RDF.Writing.Formatting;

public class FullTextSparqlExample
{
    public static void Main(String[] args)
    {
        //This example assumes we've already created our index in a folder called example
        IFullTextSearchProvider provider = null;
        try
        {
            //Create our dataset
            InMemoryDataset dataset = new InMemoryDataset();
            //Assume we load it with data from somewhere...

            //Create and parse our query
            SparqlParameterizedString queryString = new SparqlParameterizedString();
            queryString.Namespaces.Add("pf", new Uri(FullTextHelper.FullTextMatchNamespace));
            queryString.CommandText = "SELECT * WHERE { ?match pf:textMatch 'text' }";
            SparqlQueryParser parser = new SparqlQueryParser();
            SparqlQuery query = parser.ParseFromString(queryString);

            //Get a Lucene Search Provider
            //For simplicity I've used the short constructor which assumes StandardAnalyzer and DefaultIndexSchema
            //The Lucene version is kept the same as in the search example above
            provider = new LuceneSearchProvider(Lucene.Net.Util.Version.LUCENE_30, FSDirectory.Open("example"));

            //Create the Full Text Optimiser and attach it to the query
            FullTextOptimiser optimiser = new FullTextOptimiser(provider);
            query.AlgebraOptimisers = new IAlgebraOptimiser[] { optimiser };

            //Now we can go ahead and run our query
            SparqlResultSet results = query.Evaluate(dataset) as SparqlResultSet;
            if (results != null)
            {
                NTriplesFormatter formatter = new NTriplesFormatter();
                foreach (SparqlResult result in results)
                {
                    Console.WriteLine(result.ToString(formatter));
                }
            }
        }
        catch (Exception ex)
        {
            //Handle any exceptions that occur during querying
        }
        finally
        {
            //Always dispose of a search provider when done as not doing so may cause problems with other code accessing your index
            if (provider != null) provider.Dispose();
        }
    }
}
Those of you familiar with LARQ will notice that the syntax for full text queries is identical to LARQ's.
So you can do things like get scores for matches:
# Get matches with scores
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT * WHERE { (?match ?score) pf:textMatch "text" . }
Or apply a limit on the results:
# Get up to 10 matches
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT * WHERE { ?match pf:textMatch ( "text" 10) . }
Or apply a score threshold to the results:
# Apply a Score Threshold of 0.75
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT * WHERE { ?match pf:textMatch ( "text" 0.75) . }
Note: You can apply a limit or a threshold on its own; a threshold must be a decimal/double while a limit must be an integer. If you wish to apply both, the threshold must appear first, followed by the limit. This is in line with the LARQ syntax for full text query.
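To make the ordering rule in the note concrete, here is a minimal sketch of a helper that assembles a pf:textMatch query with an optional threshold and limit. TextMatchQuery and Build are hypothetical names invented for this illustration and are not part of dotNetRDF; the helper only builds a query string using the prefix shown in the queries above, and the resulting string would still need to be parsed and run as in the earlier SPARQL example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper for illustration only; not part of dotNetRDF
public static class TextMatchQuery
{
    public static string Build(string searchText, double? threshold = null, int? limit = null)
    {
        // Escape backslashes and double quotes so the text is a valid SPARQL literal
        string literal = "\"" + searchText.Replace("\\", "\\\\").Replace("\"", "\\\"") + "\"";

        List<string> args = new List<string>();
        args.Add(literal);

        // The threshold always goes first and must contain a decimal point,
        // otherwise it would read as an integer and be taken for a limit
        if (threshold.HasValue) args.Add(threshold.Value.ToString("0.0###", CultureInfo.InvariantCulture));

        // The limit always goes last and must be an integer
        if (limit.HasValue) args.Add(limit.Value.ToString(CultureInfo.InvariantCulture));

        // A bare literal when there are no extra arguments, otherwise a list
        string obj = args.Count == 1 ? literal : "( " + string.Join(" ", args) + " )";

        return "PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>\n"
             + "SELECT * WHERE { ?match pf:textMatch " + obj + " . }";
    }

    public static void Main(String[] args)
    {
        // Threshold of 0.75 and at most 10 results
        Console.WriteLine(TextMatchQuery.Build("text", 0.75, 10));
    }
}
```

With a threshold of 0.75 and a limit of 10, the object of the generated pattern is the list ( "text" 0.75 10 ), with the threshold ahead of the limit.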
Use with SPARQL Endpoints
You can use Full Text Querying with SPARQL Endpoints by configuring it via the Configuration API. See Configuration API - Full Text Query for more details.
Keeping an index in sync with Datasets
If your dataset is mutable then you may wish to keep your full text index in sync with your dataset as it changes. To do this you can use the FullTextIndexedDataset, which is a decorator that can be applied over another ISparqlDataset and will automatically keep your index in sync with changes made to the dataset.
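As a rough sketch of how the decorator might be wired up, the example below wraps an InMemoryDataset and adds a graph through the wrapper so that it is indexed as it is added. It rests on assumptions that should be checked against the API documentation for your version: that FullTextIndexedDataset lives in the VDS.RDF.Query.Datasets namespace, and that it has a constructor taking the dataset to wrap and an IFullTextIndexer.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Store;
using VDS.RDF;
using VDS.RDF.Query.Datasets;
using VDS.RDF.Query.FullText.Indexing;
using VDS.RDF.Query.FullText.Indexing.Lucene;
using VDS.RDF.Query.FullText.Schema;

public class FullTextIndexedDatasetExample
{
    public static void Main(String[] args)
    {
        IFullTextIndexer indexer = null;
        try
        {
            //Create an indexer in the same way as in the indexing example
            indexer = new LuceneSubjectsIndexer(FSDirectory.Open("example"), new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30), new DefaultIndexSchema());

            //ASSUMPTION: the constructor takes the dataset to decorate and the indexer to keep up to date
            ISparqlDataset dataset = new FullTextIndexedDataset(new InMemoryDataset(), indexer);

            //Changes made through the decorator should be reflected in the index
            Graph g = new Graph();
            g.LoadFromFile("example.ttl");
            dataset.AddGraph(g);

            //Commit the pending changes to the dataset
            dataset.Flush();
        }
        catch (Exception ex)
        {
            //Handle any errors that occurred while loading or indexing
        }
        finally
        {
            //Dispose of the indexer so that indexed data is persisted to the index
            if (indexer != null) indexer.Dispose();
        }
    }
}
```

Only changes made through the decorator can be picked up, so any code that modifies the data should hold a reference to the FullTextIndexedDataset rather than to the dataset it wraps.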