
Managing implicit facts in PoolParty using RDFox

April 6, 2022
Semantics Amsterdam 2021
PoolParty Semantic Suite

Knowledge graphs provide the means to break data silos by integrating heterogeneous data sets using a common interface that describes the data with clear semantics. Knowledge graphs (KGs) typically consist of vocabularies, ontologies and instance data about a knowledge domain. For each of these building blocks, data is represented using Semantic Web standards such as RDF, SKOS, RDFS and OWL. One of the main characteristics of KGs is that ontologies can be used not only to semantically describe the data, but also for reasoning, i.e., to infer implicit facts from the explicit facts. These explicit ground facts can be either instance data or vocabulary data. Reasoning infers additional information that expands the data sets based on the ontology semantics of RDFS and OWL, which supports use cases like semantic search and recommendations.

Motivation

In the context of Semantic Web reasoning, which is based on RDFS and OWL semantics, inferences are made under the open world assumption. This means that implicit knowledge is only added to the data set by computing the closure over the explicit facts and axioms; facts are never retracted by RDFS and OWL reasoning. Instance data cannot be proven wrong by open world reasoning, and therefore the reasoner never deletes facts on the instance data side. However, the explicit facts may of course change over time when users delete data in the KG. This deletion has to be accounted for in a system that uses reasoning. Here we face the challenge that the deductive basis of RDFS and OWL does not directly support the retraction of facts. Therefore, we have to introduce a strategy to align fact deletion with open world reasoning.

When deleting data in combination with reasoning, we also have to consider update performance. One approach is to make minimal changes: update the data incrementally and avoid recomputing the closure for the whole KG. We also need a concise update semantics to keep the data consistent. The DRed (delete and rederive) algorithm achieves both.

Albin Ahmeti

Data & Knowledge Engineer at Semantic Web Company

Robert David

CTO at Semantic Web Company

Valerio Cocci

Senior Knowledge Engineer at Oxford Semantic Technologies

Yavor Nenov

Chief Science Officer at Oxford Semantic Technologies

DRed Incremental Updates

DRed [1-7] is an algorithm that maintains the materialisation of recursive views (defined in terms of a Datalog program) with negation and aggregation.

Knowledge graphs contain facts that are explicitly stored as well as implicit facts deduced from the explicit ones. Both kinds of facts are stored as triples in a triple store, typically in separate named graphs. This introduces new data management challenges when the data in a triple store is updated by inserting and deleting facts. Each such update operation on the explicit facts can trigger consequences for the implicit facts: we may have to insert a new implicit fact via derivation, or delete an implicit fact if it can no longer be inferred from the remaining explicit facts in the KG.

A number of techniques exist for dealing with the update of implicit facts in a KG. These techniques are called update semantics; see [1] for a more comprehensive list, their definitions and a detailed elaboration on the topic. One very well-known update semantics, introduced in the context of deductive databases, is called DRed (delete and rederive).

DRed is a technique for deleting implicit facts when they can no longer be inferred from the remaining explicit triples in the KG. In essence, DRed, in its most naive version, deletes the consequences of the removed triples (the triples they entail via rules), followed by a re-materialisation.

The DRed algorithm essentially computes [1]:

    1. Delete a superset of the derived (inferred) tuples using semi-naive evaluation, the so-called "overdeletion" (an overestimation);
    2. Re-insert deleted tuples that have an alternative derivation, i.e., rederive;
    3. Insert new tuples plus the corresponding derived tuples (using semi-naive evaluation).
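
These three steps can be sketched on a toy Datalog program. The following is a naive illustration (an assumed fact encoding, a plain fixpoint instead of semi-naive evaluation, and steps 2 and 3 folded into a single re-materialisation pass), not RDFox's implementation:

```python
# Toy DRed sketch over a transitive-closure program:
#   path(x,y) :- edge(x,y).
#   path(x,z) :- edge(x,y), path(y,z).

def instances(facts):
    """All (head, body) rule instantiations over the given facts."""
    out = []
    for f in facts:
        if f[0] == "edge":
            out.append((("path", f[1], f[2]), {f}))             # rule 1
            for g in facts:
                if g[0] == "path" and g[1] == f[2]:
                    out.append((("path", f[1], g[2]), {f, g}))  # rule 2
    return out

def materialise(facts):
    """Naive fixpoint: add derived facts until nothing new appears."""
    facts = set(facts)
    while True:
        new = {h for h, _ in instances(facts)} - facts
        if not new:
            return facts
        facts |= new

def dred_delete(explicit, removed):
    """Delete 'removed' explicit facts and maintain the materialisation."""
    old = materialise(explicit)
    # Step 1: overdelete every fact that has a derivation touching a deleted fact.
    deleted = set(removed)
    while True:
        more = {h for h, body in instances(old) if body & deleted} - deleted
        if not more:
            break
        deleted |= more
    # Steps 2 + 3: rederive and propagate, here collapsed into one
    # re-materialisation of the surviving facts.
    return materialise((set(explicit) - set(removed)) | (old - deleted))

edges = {("edge", "a", "b"), ("edge", "b", "c"), ("edge", "a", "c")}
# Deleting edge(a,c) overdeletes path(a,c), which is rederived via a->b->c.
after = dred_delete(edges, {("edge", "a", "c")})
```

Deleting edge(b,c) instead removes path(b,c) for good, since no alternative derivation remains, while path(a,c) survives via the direct edge.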

DRed works incrementally and removes only those implicit triples that can no longer be supported by explicit triples and ontology axioms. This makes it a very viable approach for dealing with implicit facts in a number of use cases. For instance, if one asserts an explicit triple, implicit triples are derived by entailment; if one then drops that same explicit triple in the next update, the previously inserted implicit triples are removed as well. In this way we preserve idempotency, i.e., the two update operations cancel out. By contrast, if we did not remove these implicit triples, they would be considered "side effects" as known in database theory (elaborated more precisely in the area of view updates): remnant facts that are hard to justify or potentially wrong.

Next, we discuss DRed as a technique for updating implicit facts. The RDFox triple store uses a variant of DRed to update and manage implicit triples, and it is very performant because it computes incrementally. First, we export the KG triples from PoolParty to RDFox; afterwards, we leverage this off-the-shelf functionality in RDFox to update implicit facts. With this combination, we open the door to new use cases.

RDFox PoolParty Integration

In our upcoming PoolParty 8.2 release, we have implemented an integration with the RDFox triple store. RDFox is a high-performance, scalable, in-memory triple store. It provides a SPARQL 1.1-compliant endpoint, an in-store SHACL processor that can be triggered by SPARQL operations, and a deductive rule engine based on a combination of RDF and Datalog that can be used to implement reasoning tasks.

PoolParty integrates with RDFox via the SPARQL endpoint, which means that all operations from PoolParty components are transformed into SPARQL queries or updates. This provides a Semantic Web standards-based integration with RDFox that is flexible while still leveraging the in-store features, like SHACL validation. Using this architecture, we can provide several functions to implement use cases:

    1. Manage the KG from PoolParty components and describe it using PoolParty ontologies;
    2. Each KG update operation in PoolParty is translated to a SPARQL update operation in RDFox;
    3. We leverage RDFox for managing and updating implicit facts as this functionality is provided out of the box, i.e., using the off-the-shelf feature of the triple store.
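
Step 2 above can be sketched as a small translation function. This is a hypothetical illustration (the function name, triple encoding and the omission of PREFIX declarations are all simplifying assumptions, not PoolParty's actual API):

```python
# Hypothetical sketch: render a triple-level change as one SPARQL 1.1 update
# request for the RDFox endpoint. PREFIX declarations are omitted for brevity.

def to_sparql_update(inserted=(), deleted=()):
    """Render inserted/deleted (s, p, o) triples as a SPARQL update string."""
    def block(keyword, triples):
        body = " .\n  ".join(f"{s} {p} {o}" for s, p, o in triples)
        return f"{keyword} DATA {{\n  {body} .\n}}"
    parts = []
    if deleted:
        parts.append(block("DELETE", deleted))   # deletions first
    if inserted:
        parts.append(block("INSERT", inserted))
    return " ;\n".join(parts)                    # ';' separates operations

update = to_sparql_update(
    deleted=[(":Ben", "a", ":Father")],
    inserted=[(":Ann", ":hasFather", ":Ben")],
)
```

The resulting string can be POSTed to the store's SPARQL update endpoint as a single request.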

For the use case described in this blog post, we have used PoolParty Suite components such as Ontology Management, the Extractor and UnifiedViews. Ontology Management allows an ontologist to define ontological statements (axioms) that describe taxonomy or instance data. For taxonomy data, this means the ontology adds expressivity beyond the SKOS model. The ontological statements can also be exported to an external triple store, where they can be used for reasoning, as in this use case with RDFox. PoolParty Extractor is a knowledge-graph-based text mining component that annotates text with concepts. UnifiedViews is a data processing and orchestration application for ETL, i.e., extract, transform and load operations over RDF.

RDFox DRed Example

In the following we describe the DRed approach by using a running example. Let us suppose we have vocabulary data in PoolParty using these facts:

# Prefix declarations (the default namespace is an illustrative placeholder):
@prefix :     <http://example.org/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .

:Ann a skos:Concept ; a :Child .
:Ben a skos:Concept ; a :Father .
:Ann :hasFather :Ben .
:Father a owl:Class .
:Parent a owl:Class .
:Child a owl:Class .

:hasFather a owl:ObjectProperty .
:hasParent a owl:ObjectProperty .
:hasChild a owl:ObjectProperty .

:hasFather rdfs:subPropertyOf :hasParent . 
:hasChild owl:inverseOf :hasParent . 
:Father rdfs:subClassOf :Parent .
:hasChild rdfs:domain :Parent . 
:hasChild rdfs:range :Child .

These axioms are translated into Datalog rules when imported into RDFox, and they look like this:

:hasParent[?x, ?y] :- :hasFather[?x, ?y] . (1)
:hasChild[?x, ?y] :- :hasParent[?y, ?x] .  (2)
:Parent[?x] :- :Father[?x] .               (3)
:Parent[?y] :- :hasChild[?y, ?x] .         (4)
:Child[?x] :- :hasChild[?y, ?x] .          (5)
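
The effect of these five rules can be sketched in Python, with triples as (s, p, o) tuples and "a" standing in for rdf:type. This is an illustration using a naive fixpoint, not RDFox's evaluation strategy:

```python
# Forward-chaining sketch of rules (1)-(5) over (s, p, o) triples;
# "a" stands in for rdf:type.

def apply_rules(facts):
    """One round of applying rules (1)-(5) to a set of triples."""
    derived = set()
    for (s, p, o) in facts:
        if p == ":hasFather":
            derived.add((s, ":hasParent", o))   # rule (1)
        if p == ":hasParent":
            derived.add((o, ":hasChild", s))    # rule (2)
        if p == "a" and o == ":Father":
            derived.add((s, "a", ":Parent"))    # rule (3)
        if p == ":hasChild":
            derived.add((s, "a", ":Parent"))    # rule (4)
            derived.add((o, "a", ":Child"))     # rule (5)
    return derived

def materialise(explicit):
    """Apply the rules to a fixpoint."""
    facts = set(explicit)
    while True:
        new = apply_rules(facts) - facts
        if not new:
            return facts
        facts |= new

explicit = {(":Ann", "a", ":Child"), (":Ben", "a", ":Father"),
            (":Ann", ":hasFather", ":Ben")}
implicit = materialise(explicit) - explicit
```

Running this yields exactly the implicit triples listed below.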

Then RDFox (like any other triple store that supports reasoning) derives new implicit facts based on those facts and rules, as shown in the following:

:Ann :hasParent :Ben .
:Ben a :Parent ; :hasChild :Ann .

Let us consider the following update operation:

DELETE DATA {:Ben a :Father}

By its definition, when such an explicit triple is removed, DRed also removes the triples inferred from it by applying the rules we mentioned above. Precisely, these are the steps that are computed:

DELETE DATA {:Ben a :Father}
DELETE DATA {:Ben a :Parent}  # overdeleted via rule (3)
INSERT DATA {:Ben a :Parent}  # rederived via rules (1), (2) and (4)

Note that the implicit triple :Ben a :Parent, despite being deleted, gets re-inserted in the rederive step. This is because the implicit triple is still supported by other non-deleted explicit triples via the rules (1), (2) and (4).

By contrast, let us take another example where no rule supports rederivation. Assume we have this update operation:

DELETE DATA { :Ann :hasFather :Ben . }

then the triples :Ann :hasParent :Ben and :Ben :hasChild :Ann will also be deleted, since no other triples remain from which they could be rederived.
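
A quick sanity check of both outcomes is to re-materialise from scratch after each deletion: DRed arrives at exactly this final state, only incrementally. The triple encoding below is an assumed simplification:

```python
# Naive check of the two delete examples: re-materialise after each delete.
# DRed reaches the same state incrementally instead of recomputing everything.

def materialise(explicit):
    """Fixpoint over rules (1)-(5); triples are (s, p, o), "a" = rdf:type."""
    facts = set(explicit)
    while True:
        new = set()
        for (s, p, o) in facts:
            if p == ":hasFather":
                new.add((s, ":hasParent", o))   # rule (1)
            if p == ":hasParent":
                new.add((o, ":hasChild", s))    # rule (2)
            if p == "a" and o == ":Father":
                new.add((s, "a", ":Parent"))    # rule (3)
            if p == ":hasChild":
                new.add((s, "a", ":Parent"))    # rule (4)
                new.add((o, "a", ":Child"))     # rule (5)
        if new <= facts:
            return facts
        facts |= new

explicit = {(":Ann", "a", ":Child"), (":Ben", "a", ":Father"),
            (":Ann", ":hasFather", ":Ben")}

# DELETE DATA { :Ben a :Father } -- :Ben a :Parent is still entailed:
after1 = materialise(explicit - {(":Ben", "a", ":Father")})
# DELETE DATA { :Ann :hasFather :Ben } -- hasParent/hasChild disappear:
after2 = materialise(explicit - {(":Ann", ":hasFather", ":Ben")})
```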

Use Case Scenario

A number of use cases can take advantage of this approach to managing implicit data. In essence, DRed ensures that inserts and deletes preserve idempotency, which is a very important postulate [8][1].

For our use case, we use RDFox as a triple store that can reason incrementally with implicit data at the scale of billions of triples, leveraging a variant of DRed. In this variant, RDFox reduces the number of over-deleted facts using a query [9]. The instance data comes from an ETL pipeline in UnifiedViews that transforms tabular data to RDF and pushes it to RDFox. The ontological axioms are exported from PoolParty to RDFox into the same datastore (the RDFox counterpart of an RDF4J repository). RDFox internally creates rules from the axioms as explained in the previous section and, together with the instance data, entails new implicit facts, computing the materialisation. Delete and insert operations are then computed by RDFox incrementally via the DRed algorithm outlined above.

In the UnifiedViews pipeline, we use a DPU (Data Processing Unit) that creates an extraction model from the (instance) data labels in RDFox and stores it in an index data structure (Solr or Elasticsearch) used by PoolParty. The extraction model is then used to extract and annotate text by finding the concepts whose labels match in the text. This can be combined with the parameters of the extract/annotate call, such as filtering concepts by a certain class, which we elaborate next. After the text is annotated, we get only the subset of concepts that have an rdf:type relationship to the custom class in PoolParty. A number of rdf:type relationships are also derived in RDFox via axioms such as rdfs:domain, rdfs:range and rdfs:subClassOf.

Assume we have the following text that we want to annotate using the extraction model:

“Ben is the father of Ann and every morning he sends her to kindergarten.”

If we annotate this text against the extraction model and pass the additional class parameter customClass=:Parent, then :Ben is returned as an answer.

This result is returned because, in the earlier delete operation, :Ben a :Parent was deleted but then rederived via other rules and re-inserted by the DRed algorithm. Without this incremental rederivation in RDFox, the triple would be missing and the extractor would not return :Ben.
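
The class-filtered annotation can be sketched as a toy (the function and parameter names are hypothetical; in the real setup, label matching uses the Solr/Elasticsearch index and the type facts, including the inferred ones, come from RDFox):

```python
# Toy sketch of class-filtered annotation. Names are illustrative, not the
# actual Extractor API; the (concept, class) pairs include inferred types.

def annotate(text, labels, types, custom_class=None):
    """Return concepts whose label occurs in the text, optionally filtered
    to those with an (explicit or inferred) rdf:type of custom_class."""
    hits = {concept for concept, label in labels.items() if label in text}
    if custom_class is not None:
        hits = {c for c in hits if (c, custom_class) in types}
    return hits

text = "Ben is the father of Ann and every morning he sends her to kindergarten."
labels = {":Ben": "Ben", ":Ann": "Ann"}
# Type facts after materialisation; (:Ben, :Parent) is the rederived one.
types = {(":Ben", ":Parent"), (":Ann", ":Child")}
matches = annotate(text, labels, types, custom_class=":Parent")
```

Both :Ben and :Ann match by label, but only :Ben survives the :Parent class filter.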

This use case demonstrates the capability of RDFox to reason with implicit data, which directly impacts the Extractor's results in PoolParty. In a high-performance scenario, where the Extractor is used with large data sets and frequent data changes, we benefit from RDFox's incremental reasoning.

References

[1] Albin Ahmeti, “Updates in the context of Ontology-based Data Management”, PhD thesis, 2020. Link to thesis: https://aic.ai.wu.ac.at/~polleres/supervised_theses/Albin_Ahmeti_PhD_2020.pdf 

[2] Ashish Gupta, Inderpal Singh Mumick, and V. S. Subrahmanian. 1993. Maintaining views incrementally. SIGMOD Rec. 22, 2 (June 1, 1993), 157–166. DOI: https://doi.org/10.1145/170036.170066

[3] Pan Hu, Boris Motik, and Ian Horrocks. Optimised maintenance of datalog materialisations. https://arxiv.org/pdf/1711.03987.pdf

[4] Stefano Ceri and Jennifer Widom. Deriving incremental production rules for deductive data. Inf. Syst., 19(6):467–490, 1994.

[5] Raphael Volz, Steffen Staab, and Boris Motik. Incrementally maintaining materializations of ontologies stored in logic databases. J. on Data Semantics, 2:1–34, 2005.

[6] Jakub Kotowski, François Bry, and Simon Brodt. Reasoning as axioms change – incremental view maintenance reconsidered. In 5th International Conference on Web Reasoning and Rule Systems (RR 2011), volume 6902 of LNCS, pages 139–154, Galway, Ireland, August 2011. Springer.

[7] Jacopo Urbani, Alessandro Margara, Ceriel J. H. Jacobs, Frank van Harmelen, and Henri E. Bal. Dynamite: Parallel materialization of dynamic rdf data. In International Semantic Web Conference (ISWC2013), volume 8218 of LNCS, pages 657–672. Springer, October 2013.

[8] Carlos E. Alchourron and David Makinson. On the logic of theory change: Contraction functions and their associated revision functions. Theoria, 48:14–37, 1982.

[9] Boris Motik, Yavor Nenov, Robert Piro, Ian Horrocks: Maintenance of datalog materialisations revisited. Artif. Intell. 269: 76-136 (2019) https://dblp.org/rec/journals/ai/MotikNPH19.html.

Learn more about Enterprise Knowledge Graphs and how to implement them in your organization