Guided Multidimensional Analysis of RDF Data

Author
M. Hilal
Dissertation
PT2401 (2024)
1st Supervisor
Assoz. Univ.-Prof. Mag. Dr. Christoph Schütz
2nd Supervisor
o.Univ.-Prof. DI Dr. Michael Schrefl
Resources
Copy (send an email with PT2401 as the subject to dke.win@jku.at to obtain this copy)

Abstract (English)

A growing amount of data is published on the web using the Resource Description Framework (RDF) format. A majority of these data are sets of Linked Open Data (LOD) adhering to particular best practices. These data sets are rich in knowledge about different topics across various domains. Therefore, a variety of analytical questions can be answered over these data, which enables interested users to obtain useful information that may open interesting perspectives.

Data warehouses are databases designed to store and analyze data that are potentially obtained and integrated from various sources. Data warehouses allow business users to view and inspect aggregated summaries over enterprise data, which serves to improve decision-making.

The multidimensional (MD) model is the most prominent model to arrange the data stored in data warehouses. The MD model contains facts quantitatively described by measure values that are organized along hierarchically structured dimensions. Multidimensional queries against the MD model aggregate measure values along dimension hierarchies at various degrees of detail. Online Analytical Processing (OLAP) comprises a set of operators that serve to manipulate MD queries such that an MD query can be transformed into a new MD query, e.g., by changing the granularity.
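As an illustration of the concepts above, the following minimal Python sketch shows an MD query that aggregates a measure along a dimension hierarchy, and a roll-up operation that changes its granularity. The facts, the city-to-country hierarchy, and the names `md_query` and `roll_up` are hypothetical and chosen only for illustration; they are not taken from the dissertation.

```python
from collections import defaultdict

# Hypothetical facts: (city, year, measure value). Illustrative data only.
FACTS = [
    ("Linz", 2020, 10),
    ("Vienna", 2020, 20),
    ("Munich", 2020, 5),
    ("Linz", 2021, 7),
]

# Hypothetical hierarchy of a geography dimension: city rolls up to country.
CITY_TO_COUNTRY = {"Linz": "Austria", "Vienna": "Austria", "Munich": "Germany"}


def md_query(facts, granularity):
    """Sum the measure per (geography member, year) at the given granularity."""
    totals = defaultdict(int)
    for city, year, value in facts:
        # Map the city to the member at the requested level of the hierarchy.
        member = city if granularity == "city" else CITY_TO_COUNTRY[city]
        totals[(member, year)] += value
    return dict(totals)


def roll_up(granularity):
    """OLAP roll-up: move from the city level to the coarser country level."""
    return "country" if granularity == "city" else granularity


by_city = md_query(FACTS, "city")
by_country = md_query(FACTS, roll_up("city"))
```

Here `roll_up` does not touch the data; it transforms the query (its granularity), and the transformed query is then evaluated again, which mirrors the idea that OLAP operators turn one MD query into another.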

Typically, MD data analysis operates on enterprise-internal data. Nevertheless, data published over the web, especially RDF data, represent a valuable resource of knowledge that may answer many interesting analytical questions that cannot be otherwise answered. The Data Cube vocabulary (QB) and its extension, the QB4OLAP vocabulary, have been introduced in order to represent MD data over the web in RDF format (which we call statistical RDF data). Nevertheless, a major part of LOD sets as well as RDF data sets in general, such as DBpedia and Wikidata, are non-statistical, i.e., these data sets do not adhere to QB(4OLAP) since these data are not originally meant to be used for MD data analysis.

Conducting MD data analysis over external RDF data sets may be challenging, especially for average users who are typically unfamiliar with such data. In particular, average users typically do not have sufficient technical skills to perform MD analysis over these RDF data, e.g., experience with the RDF data model and its query language (SPARQL). Furthermore, average users typically lack appropriate analysis knowledge about the RDF data source to be explored, e.g., proper (sequences of) MD queries to follow in order to fulfill their information needs in particular analysis scenarios. To bridge that gap, we develop a comprehensive approach in this dissertation. In particular, the approach allows users with limited technical skills and domain knowledge to conduct MD data analysis guided by expert knowledge, using familiar and intuitive concepts, over both statistical and non-statistical LOD sets (and RDF data sources in general).

We introduce the enriched MD model to represent the schema and instances of the data to be analyzed in addition to business terms defined as predicates over these data. Whereas the enriched MD model can be used with statistical LOD, we extend it with mapping queries to become the superimposed enriched MD model to account for non-statistical LOD. Typical MD querying can be straightforwardly applied to statistical LOD. However, an adapted MD querying approach accompanies the superimposed enriched MD model for non-statistical LOD in order to account for potential irregularities that these data may exhibit with respect to statistical LOD.
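The idea of superimposing an MD view on non-statistical data can be sketched as follows. This is a simplified illustration under several assumptions, not the dissertation's method: plain Python functions stand in for the mapping queries, the triples and `ex:` names are invented, and skipping incomplete facts is only one conceivable way to handle irregularities such as missing measure values.

```python
# Hypothetical non-statistical triples (subject, predicate, object).
TRIPLES = [
    ("ex:film1", "rdf:type", "ex:Film"),
    ("ex:film1", "ex:country", "ex:Austria"),
    ("ex:film1", "ex:budget", 100),
    ("ex:film2", "rdf:type", "ex:Film"),
    ("ex:film2", "ex:country", "ex:Austria"),
    # ex:film2 has no budget: an irregularity compared to statistical data.
    ("ex:film3", "rdf:type", "ex:Film"),
    ("ex:film3", "ex:country", "ex:Germany"),
    ("ex:film3", "ex:budget", 50),
]


def objects(triples, subject, predicate):
    """Return all objects of the triples matching subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def map_facts(triples, fact_class, dimension_pred, measure_pred):
    """Stand-in for mapping queries: derive (fact, member, value) rows."""
    facts = [s for s, p, o in triples if p == "rdf:type" and o == fact_class]
    rows = []
    for fact in facts:
        members = objects(triples, fact, dimension_pred)
        values = objects(triples, fact, measure_pred)
        # Use None where the source data lack a dimension member or measure.
        rows.append((fact,
                     members[0] if members else None,
                     values[0] if values else None))
    return rows


def sum_by_dimension(rows):
    """Aggregate the measure per dimension member."""
    totals = {}
    for _, member, value in rows:
        if member is None or value is None:
            continue  # Assumption: incomplete facts are skipped.
        totals[member] = totals.get(member, 0) + value
    return totals


rows = map_facts(TRIPLES, "ex:Film", "ex:country", "ex:budget")
totals = sum_by_dimension(rows)
```

The mapping step keeps the source data untouched and only derives an MD view over them; the aggregation step then has to decide explicitly what to do with facts that do not fit the regular MD structure.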

On top of the enriched MD model, we introduce the concept of semantic web analysis graph (SWAG) that serves to explicitly model the analytical knowledge of experts familiar with MD data analysis over the LOD source at hand. Analysis situations are parameterized MD queries that constitute the nodes of a SWAG. Navigation steps are sets of parameterized OLAP operations that constitute the edges of a SWAG. Parameterization of SWAGs is an enabler for reuse and eliminates the need for “reinventing the wheel” in similar analysis scenarios; analysts can bind the parameters of a SWAG to values of their choice. The main building blocks of a SWAG are MD concepts as well as OLAP operations, which are fairly simple. Once the analysis knowledge of experts is explicitly represented as a SWAG, it can be leveraged to guide less experienced users. In particular, a dedicated RDF vocabulary serves to represent and publish SWAGs over the web; semantic web analysis graphs represented using this vocabulary can be deployed in SWAG-BI, which is our proof-of-concept prototype and represents an execution engine for SWAGs. SWAG-BI is a web application that exposes a user interface allowing users to instantiate and execute SWAGs. As a consequence, the user’s task in order to conduct MD data analysis boils down to choosing a suitable SWAG and providing values for its parameters. We have employed SWAG-BI in a user study in order to assess the subjective usability of the approach; the results from the user study show good overall usability.
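The structure of a SWAG described above can be sketched as a small graph data structure. This is a minimal sketch under assumptions: the class and field names, the `?`-prefixed parameter convention, and the example situations are hypothetical and do not reflect the dissertation's RDF vocabulary or the SWAG-BI implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AnalysisSituation:
    """Node of a SWAG: a parameterized MD query (simplified)."""
    name: str
    measure: str
    granularity: str
    condition: str  # May hold a parameter such as "?genre".

    def bind(self, bindings):
        """Return a copy with parameters replaced by the chosen values."""
        resolve = lambda value: bindings.get(value, value)
        return AnalysisSituation(self.name, resolve(self.measure),
                                 resolve(self.granularity),
                                 resolve(self.condition))


@dataclass(frozen=True)
class NavigationStep:
    """Edge of a SWAG: a set of OLAP operations between two situations."""
    source: str
    target: str
    operations: tuple


@dataclass
class Swag:
    situations: dict = field(default_factory=dict)
    steps: list = field(default_factory=list)

    def add_situation(self, situation):
        self.situations[situation.name] = situation

    def add_step(self, step):
        self.steps.append(step)

    def successors(self, name):
        """Names of the situations reachable in one navigation step."""
        return [step.target for step in self.steps if step.source == name]

    def instantiate(self, bindings):
        """Bind the parameters of every analysis situation."""
        return {name: s.bind(bindings) for name, s in self.situations.items()}


swag = Swag()
swag.add_situation(AnalysisSituation("overview", "budget", "country", "?genre"))
swag.add_situation(AnalysisSituation("detail", "budget", "city", "?genre"))
swag.add_step(NavigationStep("overview", "detail", ("drill_down(geography)",)))

# A user only picks the SWAG and supplies values for its parameters.
bound = swag.instantiate({"?genre": "genre = Drama"})
```

Binding returns new situations and leaves the SWAG itself unchanged, so the same graph can be reused with different parameter values in similar analysis scenarios.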