AirBase is the European air quality dataset maintained by the EEA (European Environment Agency). The dataset is publicly available on the Web and contains air quality monitoring data for 40 European countries. The multidimensional nature of the data makes it a good fit for OLAP (Online Analytical Processing) systems. These systems are optimized for complex queries with aggregation and grouping on multidimensional data, which are common in data analysis. Moreover, linking the data to existing knowledge bases in the Semantic Web increases its value and allows for more sophisticated data analytics. Based on these observations, we introduce and describe QBOAirbase, a multidimensional provenance-augmented version of the Airbase dataset. QBOAirbase represents air pollution information as an RDF data cube, which has been linked to the YAGO and DBpedia knowledge bases. QBOAirbase is based on version 8 of the Airbase dataset.
Data cubes are a concept designed to model multidimensional data. They are the target of OLAP applications in data warehouses. In such scenarios the query engine is confronted with complex aggregation queries aimed at tasks such as reporting, analytics, and data mining. QBOAirbase is modeled as a data cube using semantic standards such as RDF, QB and QB4OLAP. The central concept in a data cube is the observation. An observation is a measurement, for example, the concentration of an air pollutant. In general an observation can have multiple measures; in QBOAirbase, each observation is associated with the measurement of a single pollutant. Observations are defined by their coordinates in a set of dimensions. For instance, in QBOAirbase the concentration of an air pollutant is measured at a location, at a certain time, and under a certain sensor configuration. Dimensions can have levels. For example, a pollutant is measured at a station, which is located in a city, which in turn belongs to a country. It follows that the location dimension consists of three levels: station -> city -> country. Each level has a many-to-one relationship with its successor. Finally, dimension levels can have attributes. For example, cities are described by attributes such as their population.
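The roll-up along the location dimension described above can be illustrated with a minimal Python sketch. It is not how QBOAirbase is implemented (the dataset is RDF, queried with SPARQL); the station codes, city assignments and values below are invented for illustration, and the sensor dimension is left out for brevity.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical many-to-one level mappings for the location dimension
# (station -> city -> country). Codes and assignments are invented.
STATION_TO_CITY = {"ST1": "Aarhus", "ST2": "Aarhus", "ST3": "Copenhagen"}
CITY_TO_COUNTRY = {"Aarhus": "Denmark", "Copenhagen": "Denmark"}

# Each observation has coordinates in the dimensions plus a single measure.
# The values are made up and do not come from the Airbase dataset.
OBSERVATIONS = [
    {"station": "ST1", "year": 2011, "component": "O3", "value": 40.0},
    {"station": "ST2", "year": 2011, "component": "O3", "value": 60.0},
    {"station": "ST3", "year": 2011, "component": "O3", "value": 80.0},
]


def roll_up(observations, level):
    """Average the measure per (location at `level`, year, component)."""
    groups = defaultdict(list)
    for obs in observations:
        city = STATION_TO_CITY[obs["station"]]
        # Resolve the observation's location at every level of the hierarchy.
        location_at = {
            "station": obs["station"],
            "city": city,
            "country": CITY_TO_COUNTRY[city],
        }
        key = (location_at[level], obs["year"], obs["component"])
        groups[key].append(obs["value"])
    return {key: mean(values) for key, values in groups.items()}


if __name__ == "__main__":
    print(roll_up(OBSERVATIONS, "city"))
    print(roll_up(OBSERVATIONS, "country"))
```

Because each level maps many-to-one onto its successor, every observation lands in exactly one group at each level, which is what makes the aggregation well defined.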
The description of QBOAirbase's cube structure is available in RDF using the QB4OLAP vocabulary. We also provide a visual representation of QBOAirbase's cube structure:
The following table lists the measure predicates of observations and the attributes of the different dimension levels:
Category | Predicates
---|---
Measures | s:SO2, s:SPM, s:PM10, s:BS, s:O3, s:NO2, s:NOX, s:CO, s:Pb, s:Hg, s:Cd, s:Ni, s:As, s:C6H6, s:PM2.5
Station | p:europeanCode, p:station, p:localCode, p:establishedDate, p:shutDownDate, p:type, p:ozoneClassification, p:areaType, p:ruralSubType, p:streetType, p:longitudeDegree, p:latitudeDegree, p:altitude, p:localAdministrativeUnitLevel1Code, p:localAdministrativeUnitLevel2Code, p:localAdministrativeUnitLevel2Name, p:localAdministrativeUnitLevel1Code, p:isEuropeanMonitoringEvaluationProgramme
City | p:city
Country | p:isoCode, p:country
Year | p:yearNum
Sensor | p:europeanCode, p:code, p:europeanGroupCode, p:statisticShortName, p:statisticName, p:startDate, p:endDate, p:automaticMeasurement, p:measurementTechnique, p:equipment, p:samplingPoint, p:samplingTime, p:calibrationMethod
Component | p:component, p:code, p:caption, p:europeanGroupCode, p:unit
QBOAirbase is also available via a SPARQL Endpoint. We provide some example OLAP queries so that users can "play" with the data: query 1 (result), query 2 (result), query 3 (result), query 4 (result), query 5 (result), query 6 (result), query 7 (result), query 8 (result), query 9 (result), query 10 (result).
We also provide the queries as well as their descriptions for download.
QBOAirbase is linked with the DBpedia and YAGO knowledge bases. The identifiers for countries, cities and air pollutants are linked to their YAGO and DBpedia counterparts by means of the owl:sameAs predicate. For example, the triple <air:country/Denmark, owl:sameAs, yago:Denmark> links the country Denmark, as defined in QBOAirbase, to the country Denmark as defined in the YAGO knowledge base.
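As a rough illustration of how such links can be followed, the Python sketch below stores triples as plain tuples and collects the identifiers connected to an entity through owl:sameAs. The first triple is the one quoted above; the dbpedia:Denmark identifier is an assumption added for illustration, and the rdf:type triple is there only to show that other predicates are skipped.

```python
OWL_SAME_AS = "owl:sameAs"

# Triples as (subject, predicate, object) tuples.
TRIPLES = [
    ("air:country/Denmark", OWL_SAME_AS, "yago:Denmark"),     # quoted in the text
    ("air:country/Denmark", OWL_SAME_AS, "dbpedia:Denmark"),  # hypothetical identifier
    ("air:country/Denmark", "rdf:type", "air:Country"),       # not a sameAs link
]


def same_as(entity, triples):
    """Return the identifiers linked to `entity` by owl:sameAs, in either direction."""
    linked = set()
    for subject, predicate, obj in triples:
        if predicate != OWL_SAME_AS:
            continue
        if subject == entity:
            linked.add(obj)
        elif obj == entity:
            linked.add(subject)
    return linked


if __name__ == "__main__":
    print(sorted(same_as("air:country/Denmark", TRIPLES)))
```

The lookup treats owl:sameAs as symmetric, so starting from the YAGO identifier leads back to the QBOAirbase one.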
The following federated query calculates, using the YAGO SPARQL endpoint, the ratio of the urban population in Greece that was exposed to more than 18 µg/m3 of O3 in 2011 (result).
QBOAirbase builds on the data in the Airbase dataset and makes use of the PROV ontology (PROV-O) to model provenance information. PROV-O is a W3C standard for modeling workflow provenance, that is, provenance about the sources and processes involved in publishing an RDF triple. In QBOAirbase we distinguish two types of triples: metadata and information triples.
Metadata triples are created by our data generation tool in order to, e.g., be compliant with the QB vocabulary. They include rdf:type statements for all the types of entities defined in the cube structure, e.g., <air:country/Denmark, rdf:type, air:Country>, as well as owl:sameAs statements such as <air:country/Denmark, owl:sameAs, yago:Denmark> (to link our dataset to YAGO).
Information triples contain the actual data that we extracted from the Airbase dataset, that is, information about the measurements of the observations such as the station that delivered the measurement, the year, the sensor configuration, etc. These correspond to all the triples with predicates listed in the table above.
Each triple in QBOAirbase is assigned an RDF resource of type prov:Entity, i.e., a provenance entity. That provenance entity represents the workflow provenance of the triple, and it is described using RDF and PROV-O. The figure below describes the workflow of a provenance entity assigned to an information triple (borders highlighted). Following the style of the PROV-O specification, round nodes in the figure represent provenance entities, rectangles with lines represent activities, and pentagons represent agents.
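The distinction between metadata and information triples, and the assignment of one provenance entity per triple, can be sketched in Python as follows. This is a simplified illustration and not the data generation tool itself: the predicate sets are small samples, the observation subject and its value are invented, and the `prov-entity/N` identifier scheme is hypothetical.

```python
# Predicates the text gives as examples of metadata triples.
METADATA_PREDICATES = {"rdf:type", "owl:sameAs"}
# A small sample of the measure and attribute predicates of information triples.
INFORMATION_PREDICATES = {"s:O3", "p:country", "p:yearNum", "p:altitude"}


def triple_kind(triple):
    """Classify a (subject, predicate, object) triple by its predicate."""
    predicate = triple[1]
    if predicate in METADATA_PREDICATES:
        return "metadata"
    if predicate in INFORMATION_PREDICATES:
        return "information"
    return "unknown"


def assign_provenance(triples):
    """Map each triple to its own provenance entity (hypothetical identifiers)."""
    return {t: f"prov-entity/{i}" for i, t in enumerate(triples, start=1)}


if __name__ == "__main__":
    triples = [
        ("air:country/Denmark", "rdf:type", "air:Country"),
        ("air:country/Denmark", "owl:sameAs", "yago:Denmark"),
        ("air:observation/1", "s:O3", "42.0"),  # invented subject and value
    ]
    provenance = assign_provenance(triples)
    for t in triples:
        print(triple_kind(t), provenance[t], t)
```

In the dataset itself each provenance entity is an RDF resource of type prov:Entity described with PROV-O; the dictionary here only stands in for that association.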
In the following we describe the meaning of some of the nodes in the figure.
QBOAirbase is based on the Airbase dataset (version 8) and is made available under the terms of the Open Data Commons Attribution License (v1.0).