Our Oil & Gas customers tell us they have real challenges when it comes to analyzing large data volumes.
This challenge is particularly acute when working with geospatial data in their data science workflows.
The pain points we hear over and over again are:
- The amount of time and effort it takes to collect and transform hundreds of millions to tens of billions of rows of data for analysis
- Terrible GIS performance when data volumes become even modestly large (over a few hundred thousand data points)
- Inability to execute geospatial analysis functions on large data sets
The challenges listed above leave data science teams with a real gap: they cannot apply predictive intelligence across large data frames that carry location attributes.
We’ve heard people tell us their existing mapping tools are “cartoonish”, or that the best tools they have are really slow to work with.
These issues are readily avoidable. Today, you can load hundreds of millions to tens of billions of rows of data with location attributes and run cross-basin, cross-functional statistical and machine learning workflows across all of them - with sub-second response times.
In our own testing, for instance, we’ve performed decline curve analysis (DCA) forecasting on the entire North American well data set for the past 35 years. The application, pictured below, uses ~260 million rows of data, which is admittedly still a small dataset (for more information, see: Oil and Gas Production Analytics With OmniSci, Mitchell. 11Nov2019).
Figure 1. OmniSci Upstream Oil & Gas Demo
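For readers less familiar with DCA, here is a minimal sketch of the idea in Python. It uses the standard Arps decline equations and fits the simplest (exponential) case to synthetic data. It is not the OmniSci demo's implementation; the function names, the synthetic well, and the parameter values are all assumptions made for illustration.

```python
import numpy as np

def arps_rate(t, qi, di, b=0.0):
    """Arps decline curve: production rate at time t.

    qi = initial rate, di = initial decline rate per unit time,
    b = decline exponent (0 gives exponential decline,
    0 < b <= 1 gives hyperbolic through harmonic decline).
    """
    t = np.asarray(t, dtype=float)
    if b == 0.0:
        return qi * np.exp(-di * t)
    return qi / (1.0 + b * di * t) ** (1.0 / b)

def fit_exponential_decline(t, q):
    """Fit ln(q) = ln(qi) - di * t by least squares; returns (qi, di)."""
    t = np.asarray(t, dtype=float)
    q = np.asarray(q, dtype=float)
    slope, intercept = np.polyfit(t, np.log(q), 1)
    return float(np.exp(intercept)), float(-slope)

# Hypothetical well: 24 months of history, synthetic and noise-free.
months = np.arange(24)
observed = arps_rate(months, qi=1000.0, di=0.05)

# Recover the decline parameters, then forecast the next 12 months.
qi_fit, di_fit = fit_exponential_decline(months, observed)
forecast = arps_rate(np.arange(24, 36), qi_fit, di_fit)
```

At basin scale the same fit would be repeated per well across hundreds of millions of production rows, which is where the backend's throughput starts to dominate.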
This is just one application of next generation analytics technology in O&G. There are many.
Data volume matters because the sampling error of an estimate shrinks in inverse proportion to the square root of the sample size. In data science model training scenarios, that error results in model bias. The larger your sample size, the lower your margin of error, which in turn (at least in theory) mitigates large biases in your forward projections.
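The square-root relationship is easy to check numerically. The sketch below compares the textbook standard error of a sample mean, sigma / sqrt(n), against a small simulation. The function names, the normal distribution, and the chosen sigma are assumptions made for illustration, not part of any particular workflow.

```python
import numpy as np

def standard_error_of_mean(sigma, n):
    """Theoretical standard error of a sample mean: sigma / sqrt(n)."""
    return sigma / np.sqrt(n)

def empirical_standard_error(sigma, n, trials=2000, seed=0):
    """Spread of the sample mean across many simulated samples of size n."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(loc=0.0, scale=sigma, size=(trials, n))
    return float(samples.mean(axis=1).std(ddof=1))

sigma = 10.0  # assumed population standard deviation

# Theory: 100x more data cuts the standard error by a factor of 10.
se_small = standard_error_of_mean(sigma, 100)
se_large = standard_error_of_mean(sigma, 10_000)

# Simulation: should land close to the theoretical value for n = 100.
se_simulated = empirical_standard_error(sigma, 100)
```

Note the diminishing returns: each further 10x reduction in error needs 100x more rows, which is one reason data volume and backend speed have to scale together.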
Speed matters because it allows you to fail fast and iterate faster. If you achieve a 100x improvement in data ingest, data analytics, model training, and scenario execution, for instance, you can fit many more iterations into the time it takes to complete a single iteration today on high-latency backend technologies.
The fundamental technology underpinning the example presented above is software optimized for massively parallel computation on HPC-class equipment (e.g., GPU compute) and/or single- to multi-node CPU clusters.
Data visualization still matters, and it is an integral part of any data science workflow.