Provenance management for dynamic, distributed and dataflow environments

Provenance, the derivation history of data objects, records how, when, and by whom a piece of data was created and modified. It allows users to understand the context of derived data, estimate its quality, locate data of interest, and determine which datasets are affected by erroneous processes. Provenance therefore plays an important role in scientific experiments and business processes, supporting data quality control, audit trails, and regulatory compliance. Most previous work studies provenance only in closed, well-controlled environments (e.g., a workflow engine), so challenges remain for holistic provenance management in practical, open environments, where provenance can be distributed, dynamic, and diverse. For example, in the Energy Informatics domain, provenance is often collected from large-scale workflows that span disciplines and organizations, and is therefore usually stored in distributed repositories. However, there has been limited research on reconstructing and querying distributed provenance information. Meanwhile, recurrent and stream-processing workflows can generate fine-grained provenance whose size can exceed that of the original dataset, yet storage approaches for efficiently managing such metadata volumes have received little attention in the literature. Finally, legacy tools that lack automatic provenance collection are still widely used, so provenance must be annotated manually, which leaves it incomplete. In this thesis, using Energy Informatics as an exemplar domain, we design and develop algorithms and systems, motivated by real-world challenges, for managing provenance in dynamic, distributed, and dataflow environments.
In particular, we make the following contributions: (1) template-based algorithms that efficiently store provenance information for dynamic datasets, (2) algorithms for reconstructing and querying provenance graphs from distributed provenance repositories, and (3) semantics-based approaches for predicting the missing parts of incomplete provenance. We evaluate these contributions with use cases from the Energy Informatics domain, covering both Smart Oilfield and Smart Grid. The evaluation results demonstrate that our work achieves efficient and scalable provenance management. As future work, we also discuss key challenges and initial solutions for presenting provenance at different granularities based on its usage context.