Answering Provenance-Aware Queries on RDF Data Cubes under Memory Budgets

Abstract

The steadily growing popularity of semantic data on the Web and the support for aggregation queries in SPARQL 1.1 have spurred interest in Online Analytical Processing (OLAP) and multidimensional data (cubes) in RDF. Query processing in such settings is challenging because SPARQL OLAP queries tend to be complex: they usually contain many triple patterns with grouping and aggregation. Moreover, an important factor in query answering on web data is provenance, i.e., metadata that describes the origin and quality of the data. Some applications, e.g., in data analytics and access control, require augmenting the data with provenance metadata and running queries that impose constraints on this provenance. This task is called provenance-aware query answering. In this paper, we investigate the benefit of caching parts of an RDF cube augmented with provenance information when answering provenance-aware SPARQL queries. We propose provenance-aware caching (PAC), a caching approach based on a provenance-aware partitioning of RDF graphs and a benefit model optimized for RDF cubes and SPARQL queries with aggregation. Our results on real and synthetic data show that PAC outperforms the LRU (least recently used) caching strategy and the Jena TDB native caching in terms of hit rate and performance.
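
For readers unfamiliar with the LRU baseline and the hit-rate metric mentioned above, the following is a minimal sketch of a generic LRU cache that counts hits and misses. It is not taken from the PAC experimental software: the class name `LRUCache`, the fragment keys `f1` to `f3`, and the idea of counting capacity in number of fragments are all assumptions made for illustration only.

```python
from collections import OrderedDict


class LRUCache:
    """Generic fixed-capacity cache that evicts the least recently used key.

    Illustrative only: keys stand for hypothetical cube fragments, and the
    capacity is a number of fragments rather than a memory budget in bytes.
    """

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._entries = OrderedDict()  # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def access(self, key):
        """Record an access to `key`; return True on a hit, False on a miss."""
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self._entries[key] = True
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return False

    def hit_rate(self):
        """Fraction of accesses served from the cache (0.0 if no accesses)."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Hypothetical access trace over three fragments with room for two.
cache = LRUCache(capacity=2)
for fragment in ["f1", "f2", "f1", "f3", "f2", "f1"]:
    cache.access(fragment)
print(cache.hits, cache.misses, cache.hit_rate())
```

On this trace only the second access to `f1` is a hit, because recency alone decides what is evicted. A benefit-based policy such as the one the abstract describes ranks cached content by expected usefulness rather than by recency; the sketch does not attempt to reproduce that model.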

Downloads

Input data:
- SSB balanced datasets + prov. queries
- SSB unbalanced datasets + prov. queries
- SSB analytical queries
- SSB's cube structure (in TSV format, required by the experimental software)
- QBOAirbase datasets + prov. queries
- QBOAirbase-DK analytical queries
- QBOAirbase-GB analytical queries
- QBOAirbase's cube structure (in TSV format, required by the experimental software)
- Instructions on how to load the datasets into a Jena TDB graph
Software:
- PAC experimental software
- Instructions on how to run the experiments on the Jena TDB datasets
- The software uses configuration files for all the input arguments. Here are some configuration input files: pac cold setting, pac warm setting, lru warm setting, tdb setting, pac no-optimization, lru no-optimization, tdb no-optimization.
Experimental data (output by the experimental software in TSV format):
- Script to turn the experimental data into LaTeX charts (the script generates many more charts than those included in the paper)
- Instructions to run the script