How does PoolParty Semantic Suite integrate with Graph Databases? – Part 1
PoolParty Semantic Suite is the most complete semantic middleware on the global market. The platform provides components for many different use case scenarios along the whole linked data life cycle. Use cases range from entity linking and semantic enrichment to automated data quality checks.
In this series of blog posts, we will explain how to integrate RDF graph databases with PoolParty Semantic Suite. We will highlight the graph database features that are fundamental to PoolParty's components and ensure a seamless workflow.
While PoolParty's components make use of different machine learning algorithms, all components are based on semantic web technologies using RDF graph databases for data persistence as well as data publishing. As a result, PoolParty's components have varying requirements regarding integration options and performance of underlying databases.
In this first blog post, we start with an overview of the integration architecture, followed by a description of components and their requirements regarding data storage. In the second blog post, we will also elaborate on other graph database solutions that are not integrated into PoolParty directly.
General Principles for the Integration of Graph Databases with PoolParty Semantic Suite
Semantic Web Company is the leading provider of graph-based knowledge technologies and also the vendor of PoolParty Semantic Suite. Semantic Web Company follows the principle of developing technologies based on open W3C standards.
Semantic Web Company also tries to keep the architecture of the PoolParty Semantic Suite as open as possible. This way we promise the lowest possible vendor lock-in effects for the customer.
As a consequence, we do not support any vendor-specific, proprietary graph database features. Instead, all integrations are based on W3C standards and the RDF4J library. This allows our customers to freely choose any underlying graph database that meets these conditions. Although this approach comes with some caveats, we still believe that an open architecture is key to providing the customer with long-term flexibility and adaptability to a variety of changing application scenarios.
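To illustrate what a standards-based integration means in practice, here is a minimal sketch of a query request following the W3C SPARQL 1.1 Protocol, built with the Python standard library only. The endpoint URL and repository name are hypothetical placeholders, and the sketch only builds the request without sending it. It is not how PoolParty itself talks to a store, since PoolParty uses the RDF4J library for that.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sparql_request(endpoint_url: str, query: str) -> Request:
    """Build a SPARQL 1.1 Protocol query request (POST with a URL-encoded form)."""
    body = urlencode({"query": query}).encode("utf-8")
    return Request(
        endpoint_url,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


# Hypothetical endpoint; any store exposing a standard SPARQL endpoint would do.
ENDPOINT = "http://localhost:8080/rdf4j-server/repositories/example"
QUERY = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"

request = build_sparql_request(ENDPOINT, QUERY)
print(request.get_method(), request.full_url)
```

Because the request relies only on the standard protocol, the same code would work unchanged against a different store by swapping the endpoint URL.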
PoolParty's Components and Graph Database Requirements
PoolParty Thesaurus Management
PoolParty Thesaurus Management includes many features centered around SKOS-based knowledge modeling, complemented by the use of ontologies. This component has two main requirements regarding graph databases. First, it has to do a lot of recursive loading of data combined with the storing of a small number of triples. Second, it does reasoning based on complex SPARQL queries, where performance is essential.
Import Assistant and Quality Validation
The Import Assistant is a feature of the Thesaurus Management component that controls the quality of data being imported to projects on an RDF level. It defines a set of constraints that have to be satisfied by the data to guarantee stable operations for the component. If violations are found, the user can correct the problems using repair options provided by the Assistant. Having resolved all problems, the data can safely be imported into the project. PoolParty uses a combination of SHACL and SPARQL to implement these constraints. Since SHACL can be transformed to SPARQL queries, this resolves to running the constraint checks on the SPARQL endpoint of the project. Some of these checks are quite complex, so the Import Assistant needs good performance for the query engine to run the validation within a reasonable amount of time. The same holds for quality validation that can be done within a project to detect modeling problems.
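As a rough illustration of a constraint expressed as a SPARQL query, the sketch below uses a made-up rule: every skos:Concept must have at least one skos:prefLabel. The rule, the query, and the helper function are assumptions for illustration and are not taken from PoolParty's actual constraint set. The Python function mirrors the query on a tiny in-memory list of triples so that the result can be inspected without a running endpoint.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Illustrative constraint (an assumption, not PoolParty's actual rule):
# report every skos:Concept that has no skos:prefLabel.
MISSING_PREF_LABEL_QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?concept WHERE {
  ?concept a skos:Concept .
  FILTER NOT EXISTS { ?concept skos:prefLabel ?label }
}
"""


def concepts_missing_pref_label(triples: List[Triple]) -> List[str]:
    """In-memory equivalent of the query above, using prefixed names as strings."""
    concepts = {s for s, p, o in triples if p == "rdf:type" and o == "skos:Concept"}
    labelled = {s for s, p, _ in triples if p == "skos:prefLabel"}
    return sorted(concepts - labelled)


# Toy import data: ex:Dog is a concept without a preferred label.
data = [
    ("ex:Cat", "rdf:type", "skos:Concept"),
    ("ex:Cat", "skos:prefLabel", "Cat"),
    ("ex:Dog", "rdf:type", "skos:Concept"),
]

violations = concepts_missing_pref_label(data)
print(violations)
```

Against a real project, the query string would be sent to the project's SPARQL endpoint instead, and each returned binding would be reported as a violation.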
Corpus Analysis is based on PoolParty’s natural language processing (NLP) component. The basis for corpus analysis is text corpora and knowledge models which reflect a knowledge domain. The analysis can be used to extract and identify different parts of text by making a semantic identification of concepts, named entities and shadow concepts calculated based on the context. While the analysis itself is not store-dependent, the results are stored as RDF in a graph database to be able to do analytics based on SPARQL queries. As a result, it requires a graph database that provides fast storage of large amounts of data in a transaction.
RDF4J works well with tens of millions of triples in a single repository, but its performance starts to drop at hundreds of millions of triples. The former is sufficient for most situations, since each project is stored in a dedicated repository and a thesaurus contains little data compared to statistical datasets. For example, MeSH, an extensive and commonly used taxonomy, has just under one million triples. The largest chunks of data, however, most frequently come from corpus analysis. Depending on the size and content of the corpus and on how completely the thesaurus covers the knowledge domain, the extraction data can grow to a significant volume. For example, analyzing a corpus of 5000 documents with an average length of one page can produce tens of millions of triples. Although each corpus has a dedicated repository, the analysis of tens of thousands of documents is unlikely to complete within the desired time. In this case, a third-party, scalable RDF store can be integrated so that PoolParty can store the corpus at a higher speed and throughput.
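The figures above can be turned into a back-of-the-envelope estimate. The sketch below makes two assumptions that are not stated in the text: it reads "tens of millions" as its lower bound of 10 million triples, and it assumes the triple count grows linearly with the number of documents. Real corpora will deviate from both.

```python
def triples_per_document(total_triples: int, documents: int) -> float:
    """Average number of triples produced per analyzed document."""
    return total_triples / documents


def estimated_triples(documents: int, per_document: float) -> int:
    """Linear extrapolation of the corpus size in triples (an assumption)."""
    return round(documents * per_document)


# Lower-bound reading of the example: 5000 documents, 10 million triples.
PER_DOC = triples_per_document(10_000_000, 5_000)

for corpus_size in (5_000, 20_000, 50_000):
    print(corpus_size, estimated_triples(corpus_size, PER_DOC))
```

Under these assumptions, each one-page document yields about 2,000 triples, so a corpus of 50,000 documents would already reach about 100 million triples, which is the range where RDF4J's performance starts to drop.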
PoolParty GraphEditor is a configurable ontology-based RDF editor that generates a custom user interface (UI) based on ontology elements. It also allows for working on a low abstraction level by using filtered search, tabular views and bulk operations on the RDF data. This component can run complex filter queries based on user-selected conditions for the ontology elements and therefore needs a performant SPARQL engine to work on large data sets.
PoolParty GraphSearch is a component providing faceted search and analytics. It can be configured to use a knowledge graph for annotating documents for semantic search purposes. It can also use various taxonomies and ontologies to search RDF data in graph databases. Furthermore, it provides visualizations for analytical purposes and a customizable recommender engine with plugins for different use case scenarios in order to provide similarity-based navigation and rules-based recommendations. GraphSearch's graph database functionality is based on SPARQL only. For faceted search and recommendations, the graph database should provide a fast SPARQL engine.
PoolParty UnifiedViews is an orchestrator for various ETL-like data integration scenarios. UnifiedViews is used to create pipelines that serve different data integration tasks. Each pipeline consists of Data Processing Units (DPUs) and the data flow among them. Each DPU solves a (sub-)task by encapsulating a code fragment while providing a GUI so that users without coding skills can configure it. For instance, one DPU receives files as input, retrieves their content and uses the PoolParty Extractor service to annotate the content and transform its semantics into RDF. This result is then passed on to the next DPU. Dozens of other DPUs exist that take RDF data and further transform or enrich it using SPARQL queries. By default, UnifiedViews uses an internal RDF4J store to perform these operations. However, a dedicated, performant working store is often needed for much bigger datasets (more than 10 million triples) and for better SPARQL query performance. Users can configure such a graph database in the UnifiedViews configuration.
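The pipeline idea can be sketched as a chain of functions in which each step consumes the output of the previous one. This is a toy model, not the UnifiedViews API: the function names, the tiny vocabulary, and the keyword matching that stands in for the PoolParty Extractor service are all hypothetical.

```python
from functools import reduce
from typing import Any, Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical vocabulary standing in for a thesaurus.
VOCABULARY = {"rdf": "ex:RDF", "sparql": "ex:SPARQL"}


def extract_mentions(files: Dict[str, str]) -> List[Triple]:
    """Toy stand-in for an extractor DPU: files in, annotation triples out."""
    triples = []
    for name, text in files.items():
        for term, concept in VOCABULARY.items():
            if term in text.lower():
                triples.append((name, "ex:mentions", concept))
    return triples


def add_concept_types(triples: List[Triple]) -> List[Triple]:
    """Toy enrichment DPU: type every mentioned concept as a skos:Concept."""
    concepts = sorted({o for _, p, o in triples if p == "ex:mentions"})
    return triples + [(c, "rdf:type", "skos:Concept") for c in concepts]


def run_pipeline(dpus: List[Callable[[Any], Any]], initial_input: Any) -> Any:
    """Feed the output of each DPU into the next one."""
    return reduce(lambda data, dpu: dpu(data), dpus, initial_input)


files = {
    "doc1.txt": "Graph databases store RDF.",
    "doc2.txt": "SPARQL queries RDF data.",
}
result = run_pipeline([extract_mentions, add_concept_types], files)
for triple in result:
    print(triple)
```

In this sketch the intermediate triples live in memory. In UnifiedViews, that role is played by the working store, which is why its capacity and query speed matter for large pipelines.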