Brown Dog: Leveraging everything towards autocuration

We present Brown Dog, two highly extensible services that aim to leverage any existing pieces of code, libraries, services, or standalone software (past or present) to give users a simple-to-use, programmable means of automated aid in the curation and indexing of distributed collections of uncurated and/or unstructured data. Such collections, which encompass a large variety of data in addition to large amounts of data, pose a significant challenge within modern-day "Big Data" efforts. The two services, the Data Access Proxy (DAP) and the Data Tilling Service (DTS), focus on format conversions and content-based analysis/extraction, respectively. They wrap relevant conversion and extraction operations within arbitrary software, manage their deployment in an elastic manner, and manage job execution from behind a deliberately compact REST API. We describe the motivation and scientific drivers for such services, the constituent components that allow arbitrary software/code to be used and managed, and lastly an evaluation of the system's capabilities and scalability.
