Querying Provenance Information in Distributed Environments

The growing recognition of the importance of provenance in data-intensive and multidisciplinary domains is leading to its careful collection. One consequence is the proliferation of provenance repositories hosted by individual organizations or communities, with limited ability to reconstruct provenance, or to query for it and over it, across them. Community standards like the Open Provenance Model (OPM) allow uniform interpretation and exchange of provenance metadata but do not prescribe query or service specifications for accessing provenance. When data reuse and sharing across institutions is not accompanied by passing provenance at the time of data exchange, we need to track the provenance and query for it or over it across distributed provenance repositories. In this article, we present approaches for querying over distributed provenance information and formalize two common provenance query models: the provenance retrieval query and the provenance filter query. Our problem is motivated by Smart Oilfield applications in the energy informatics domain, and we evaluate the performance of our algorithms using synthetic workflows based on that domain.
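As a rough illustration of the two query models named above, the following minimal sketch assumes that a retrieval query returns the ancestors of a given data item and that a filter query selects data items whose provenance satisfies a condition. This reading, the repository contents, and all function names are assumptions made for illustration; they are not the formal definitions or algorithms of the article.

```python
from collections import deque

# Hypothetical repositories, each held by a different organization.
# Each maps an artifact to the artifacts it was directly derived from.
REPO_A = {"forecast.csv": ["model.bin", "sensor_clean.csv"]}
REPO_B = {"sensor_clean.csv": ["sensor_raw.csv"],
          "report.pdf": ["sensor_clean.csv"]}
REPOSITORIES = [REPO_A, REPO_B]


def direct_sources(artifact):
    """Ask every repository for the direct sources of an artifact."""
    sources = []
    for repo in REPOSITORIES:
        sources.extend(repo.get(artifact, []))
    return sources


def retrieve_provenance(artifact):
    """Retrieval query (assumed meaning): all ancestors of an artifact,
    gathered by a breadth-first walk across the repositories."""
    seen = set()
    queue = deque([artifact])
    while queue:
        current = queue.popleft()
        for src in direct_sources(current):
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return seen


def filter_by_provenance(candidates, predicate):
    """Filter query (assumed meaning): keep the candidates whose
    ancestor set satisfies the given predicate."""
    return [a for a in candidates if predicate(retrieve_provenance(a))]


if __name__ == "__main__":
    print(sorted(retrieve_provenance("forecast.csv")))
    print(filter_by_provenance(
        ["forecast.csv", "report.pdf", "model.bin"],
        lambda ancestors: "model.bin" in ancestors,
    ))
```

The sketch contacts every repository at each step of the walk, which is the naive baseline; it makes no attempt to reduce cross-repository traffic.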
