Seeing the Earth in the Cloud: Processing one petabyte of satellite imagery in one day

The proliferation of transistors has increased the performance of computing systems by over a factor of a million in the past 30 years, and is also dramatically increasing the amount of data in existence, driving improvements in sensor, communication and storage technology. Multi-decadal Earth and planetary remote sensing global datasets at the petabyte (8×10^15 bits) scale are now available in commercial clouds (e.g., Google Earth Engine and Amazon NASA NEX), and new commercial satellite constellations are planning to generate petabytes of images per year, providing daily global coverage at a few meters per pixel. Cloud storage with adjacent high-bandwidth compute, combined with recent advances in machine learning for computer vision, is enabling understanding of the world at a scale and at a level of granularity never before feasible. We report here on a computation processing over a petabyte of compressed raw data from 2.8 quadrillion pixels (2.8 petapixels) acquired by the US Landsat and MODIS programs over the past 40 years. Using commodity cloud computing resources, we convert the imagery to a calibrated, georeferenced, multiresolution tiled format suited for machine-learning analysis. We believe ours is the first application to process, in less than a day, on generally available resources, over a petabyte of scientific image data. We report on work using this reprocessed dataset for experiments demonstrating country-scale food production monitoring, an indicator for famine early warning. We apply remote sensing science and machine learning algorithms to detect and classify agricultural crops and then estimate crop yields.
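To give a sense of scale for the headline figures, the short Python sketch below works out the average rates implied by processing one petabyte and 2.8 petapixels in one day. It is back-of-the-envelope arithmetic only, not a description of the pipeline itself. It assumes a decimal petabyte (10^15 bytes, matching the 8×10^15 bits quoted above) and a full 24-hour window; since the abstract states "over a petabyte" in "less than a day", the actual sustained rates would have been somewhat higher. The names `sustained_rate`, `PETABYTE_BYTES`, and `TOTAL_PIXELS` are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 s; assumed upper bound on the run time

PETABYTE_BYTES = 10**15          # decimal petabyte, i.e. 8 x 10^15 bits
TOTAL_PIXELS = 2.8e15            # 2.8 petapixels, as stated in the abstract


def sustained_rate(total, seconds=SECONDS_PER_DAY):
    """Average rate needed to get through `total` units in `seconds`."""
    return total / seconds


bytes_per_second = sustained_rate(PETABYTE_BYTES)
pixels_per_second = sustained_rate(TOTAL_PIXELS)

# Average compressed size per pixel implied by the two totals.
bytes_per_pixel = PETABYTE_BYTES / TOTAL_PIXELS

print(f"input rate:  {bytes_per_second / 1e9:.1f} GB/s")
print(f"pixel rate:  {pixels_per_second / 1e9:.1f} gigapixels/s")
print(f"compression: {bytes_per_pixel:.2f} bytes per pixel (average)")
```

Under these assumptions, the aggregate read rate is on the order of ten gigabytes per second. That is far beyond a single machine's storage bandwidth, which is consistent with the abstract's emphasis on cloud storage with adjacent high-bandwidth compute.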
