Cached Sufficient Statistics for Automated Mining and Discovery from Massive Data Sources

There are many massive databases in industry and science, and there are many ways that decision makers, scientists, and the public need to interact with these data sources. Wide-ranging statistics and machine learning algorithms likewise need to query databases, sometimes millions of times for a single inference. With millions or billions of records (e.g., biotechnology databases, inventory management systems, astrophysics sky surveys, corporate sales information, science lab data repositories), this can be intractable using current algorithms.

Our research laboratory and a start-up company, both jointly run by Andrew Moore and Jeff Schneider, are concerned with the fundamental computer science of making very advanced data analysis techniques computationally feasible for massive datasets. How can huge data sources (gigabytes up to terabytes) be analyzed automatically? There is no off-the-shelf technology for this. There are devastating computational and statistical difficulties; manual analysis of such data sources is passing from being merely tedious into a new, fundamentally impossible realm in which the data are simply too large for humans to assimilate. This situation is ironic given the large investment the US has made in gathering scientific data. The only alternative is automated discovery. It is our thesis that the emerging technology of cached sufficient statistics will be critical to developing automated discovery on massive data. A cached sufficient statistics representation is a
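To make the general idea concrete, the following minimal sketch (our illustration, not a description of any particular system proposed here) caches the counts of attribute-value conjunctions over categorical data in a single pass, so that later conjunctive counting queries, of the kind statistical and machine learning algorithms issue millions of times, can be answered from the cache rather than by rescanning the records. The function names and the max_order parameter are illustrative assumptions.

    # Minimal sketch of a cached-sufficient-statistics idea for categorical data:
    # precompute counts of all (attribute, value) conjunctions up to a fixed order,
    # then answer conjunctive count queries without touching the raw records again.
    from itertools import combinations
    from collections import Counter

    def build_count_cache(records, max_order=2):
        """One pass over the data; cache counts of every conjunction of up to
        max_order (attribute, value) pairs (an assumed, illustrative interface)."""
        cache = Counter()
        for row in records:
            items = sorted(row.items())
            for k in range(1, max_order + 1):
                for combo in combinations(items, k):
                    cache[combo] += 1
        return cache

    def query_count(cache, **conditions):
        """Number of records matching a conjunctive query, e.g.
        query_count(cache, color='red', size='small')."""
        key = tuple(sorted(conditions.items()))
        return cache.get(key, 0)

    if __name__ == "__main__":
        data = [
            {"color": "red", "size": "small"},
            {"color": "red", "size": "large"},
            {"color": "blue", "size": "small"},
        ]
        cache = build_count_cache(data, max_order=2)
        print(query_count(cache, color="red"))                # 2
        print(query_count(cache, color="red", size="small"))  # 1

The point of the sketch is only the trade-off it exposes: a modest amount of precomputation and memory buys answers to the counting queries that dominate many inference algorithms, without repeated scans of a massive data source.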