I am very pleased to announce that the above paper has been published in Springer's Data Mining and Knowledge Discovery journal. It presents a novel approach to analysing provenance information that combines provenance network metrics and machine learning; the main aim is to label data by analysing their provenance.
This is also the first time I have published online all the data and code used to produce a paper's results: https://github.com/trungdong/datasets-provanalytics-dmkd. There, you can find notebooks describing the data preparation and the analyses shown in the paper, along with some extra experiments that were not included due to space constraints.
What is it about?
Traditionally, data analytics involves analysing the data themselves to discover patterns, outliers, and insights. In the paper above, we instead analyse the provenance of the data, that is, the historical record describing the data's origins and what influenced their production.
Analysing provenance records, however, is not straightforward. Within the PROV data model that we use, there are three main provenance concepts: Entity, Activity, and Agent. In addition, there are 15+ possible provenance relations to link the three concepts, each of which has a specific meaning.
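To make this concrete, here is a minimal sketch of a toy provenance record using the three PROV concepts and a few of the relations. The identifiers (`ex:report`, `ex:alice`, etc.) and the plain-Python encoding are hypothetical and for illustration only; real PROV documents use standard serialisations such as PROV-N, PROV-JSON, or PROV-XML.

```python
# Hypothetical toy provenance record: each node is one of the three
# PROV concepts, and each relation links two of them with a specific meaning.
nodes = {
    "ex:report":   "Entity",    # a piece of data
    "ex:dataset":  "Entity",    # another piece of data
    "ex:drafting": "Activity",  # the process that produced the report
    "ex:alice":    "Agent",     # the person responsible
}

relations = [
    ("ex:report",   "wasGeneratedBy",    "ex:drafting"),
    ("ex:drafting", "used",              "ex:dataset"),
    ("ex:drafting", "wasAssociatedWith", "ex:alice"),
    ("ex:report",   "wasAttributedTo",   "ex:alice"),
]

for subj, rel, obj in relations:
    print(f"{subj} ({nodes[subj]}) --{rel}--> {obj} ({nodes[obj]})")
```

Even this tiny record mixes all three concepts and four different relation types, which hints at why provenance records resist straightforward, one-size-fits-all analysis.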
In this work, we represent provenance information as a graph and analyse the topological characteristics of that graph using network metrics, for example, measuring its size, its diameter, or the distances between certain elements. In doing so, we condense provenance information into a set of numeric values reflecting the topology of its graph-based representation.
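The idea can be sketched in a few lines of pure Python: treat provenance elements as nodes, relations as edges, and compute simple topological metrics. The example graph below is hypothetical, and the metrics shown (node count, edge count, diameter) are only a small illustrative subset of those one might extract.

```python
from collections import deque

# A toy provenance graph: nodes are PROV elements, each edge a PROV relation.
# (Hypothetical example data, not taken from the paper.)
edges = [
    ("report", "drafting"),   # wasGeneratedBy
    ("drafting", "dataset"),  # used
    ("drafting", "alice"),    # wasAssociatedWith
    ("report", "alice"),      # wasAttributedTo
    ("dataset", "bob"),       # wasAttributedTo
]

def network_metrics(edges):
    """Compute simple topological metrics of a provenance graph."""
    # Treat the graph as undirected for the distance-based metrics.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def bfs_distances(src):
        # Breadth-first search gives shortest-path distances from src.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        return dist

    # The diameter is the largest shortest-path distance in the graph.
    eccentricities = [max(bfs_distances(n).values()) for n in adj]
    return {
        "nodes": len(adj),
        "edges": len(edges),
        "diameter": max(eccentricities),
    }

print(network_metrics(edges))  # → {'nodes': 5, 'edges': 5, 'diameter': 3}
```

Whatever the domain the provenance came from, the output is a fixed-length vector of numbers, which is exactly the shape of input that standard machine learning algorithms expect.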
Having the numeric features from provenance information, we then apply off-the-shelf machine learning algorithms to build predictive models for properties of the data described in the provenance information. Using this approach, in three different applications, we could re-infer the owners of provenance documents, assess the quality of crowdsourced data from CollabMap, and detect instructions from chat messages in an alternate-reality game.
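As an illustration of this step, the sketch below labels new data from its provenance network metrics. The feature vectors, labels, and the choice of a nearest-neighbour classifier are all hypothetical stand-ins for the off-the-shelf algorithms and applications described in the paper.

```python
import math

# Hypothetical training examples: each row pairs a vector of provenance
# network metrics (node count, edge count, diameter) with a label
# describing a property of the underlying data.
training = [
    ((5, 5, 3), "trusted"),
    ((6, 7, 3), "trusted"),
    ((20, 40, 8), "uncertain"),
    ((25, 50, 9), "uncertain"),
]

def predict(features):
    """Label new data by its nearest training example (1-NN)."""
    return min(
        training,
        key=lambda example: math.dist(features, example[0]),
    )[1]

print(predict((7, 8, 4)))    # nearest to the small, shallow graphs
print(predict((22, 45, 8)))  # nearest to the large, deep graphs
```

Because the classifier only ever sees numeric network metrics, the same pipeline applies unchanged to provenance from any application.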
In brief, the approach aims to infer properties of data by analysing the network features of the data's provenance, hence the name Provenance Network Analytics. One nice thing about this approach is that, apart from the initial training examples, the predictive models can operate on provenance (represented using the PROV standards) without relying on domain-specific information. Therefore, it can serve as a generic data analytics tool in applications where provenance is captured or can be generated (from application logs, for example).
We describe the approach in detail in the paper. If you are interested in learning more, go read the paper at http://rdcu.be/G3Nz.