OmniSci and data streaming


#1

I would like to check about behavior of engine core.

Scenario
I am interested in real-time aggregations over large data sets that are continuously updated (mini-batches arriving every hour or every minute).

Imagine I load an initial batch of 1 billion rows into a few tables and execute an aggregation query for the first time (cold start); this will take some time to load the data into GPU VRAM.
The next step is that I receive an update of another 10 million rows and execute the same query again.

Question: does OmniSci automatically update the in-memory data at the moment the new mini-batch is loaded into the table, so that at query execution time I don’t need to wait for the new data? Or is the query itself the trigger that loads the new 10 million rows into VRAM (so there will be a delay in getting the result)?

Thank you for your answer on this.


#2

If you stream data into OmniSci, you will still need to run a query to get the data into memory. How ‘delayed’ this will be is really a function of how much data you are ingesting per mini-batch and the overall speed of moving from disk to CPU RAM to GPU RAM.
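The delay Randy mentions can be estimated with a short back-of-envelope calculation. Everything below is an assumption for illustration (batch shape, encoding, and bandwidth figures are made up, not measured OmniSci numbers):

```python
# Back-of-envelope estimate of the per-mini-batch load delay described above.
# All figures are illustrative assumptions, not measured OmniSci numbers.

rows = 10_000_000          # mini-batch size from the question
cols = 4                   # assumed number of columns touched by the query
bytes_per_value = 8        # assumed 64-bit encoding, no compression

batch_bytes = rows * cols * bytes_per_value          # 320 MB per mini-batch

disk_to_ram_gbps = 2.0     # assumed NVMe-class read bandwidth, GB/s
ram_to_vram_gbps = 12.0    # assumed PCIe 3.0 x16 effective bandwidth, GB/s

GB = 1e9
delay_s = (batch_bytes / (disk_to_ram_gbps * GB)
           + batch_bytes / (ram_to_vram_gbps * GB))
print(f"{batch_bytes / GB:.2f} GB per batch, "
      f"~{delay_s * 1000:.0f} ms extra on the first query after ingest")
```

Under these assumptions the disk-to-RAM hop dominates; with smaller mini-batches or fewer touched columns the first-query penalty shrinks proportionally.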


#3

Hi Randy, but in general it only loads the new data (for the latest epoch), right? If I have, say, two weeks of data in memory and new data for one hour arrives (much smaller than those two weeks), OmniSci does some incremental memory update.

I am asking whether it detects that those two weeks are already in memory from a previous query execution (it can be the same query or a different one; the important thing is that both queries use the same data structures).

Example:
Time of execution 2 queries at first epoch:

  1. select count(a), b from t group by b
  2. select count(c), b from t group by b
     For the second query I expect the OmniSci engine to understand that columns a and b are already in memory, so it will only load the missing column c (and check whether new data for columns a and b is available, doing an incremental update only).

Second epoch (a new batch of data has arrived):
I execute the same queries and expect that, again, only the delta will be uploaded into the existing VRAM structures.

Is my assumption correct?
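The assumed behavior can be modelled with a minimal Python sketch. All names here are hypothetical; this mimics per-column, per-epoch incremental loading, not OmniSci’s actual chunk manager:

```python
# Sketch of per-column, incremental loading as assumed in the example above.
# Hypothetical names; models the expected behavior only, not OmniSci internals.

class ColumnCache:
    def __init__(self):
        self.disk = {}  # column -> number of epochs written to disk
        self.vram = {}  # column -> number of epochs already in VRAM

    def ingest_epoch(self, columns):
        # Streaming ingest lands on disk only; nothing moves to VRAM here.
        for col in columns:
            self.disk[col] = self.disk.get(col, 0) + 1

    def query(self, columns):
        # Return the (column, epoch) pieces this query has to load:
        # only columns/epochs not already resident in VRAM.
        loads = []
        for col in columns:
            for epoch in range(self.vram.get(col, 0), self.disk.get(col, 0)):
                loads.append((col, epoch))
            self.vram[col] = self.disk.get(col, 0)
        return loads

cache = ColumnCache()
cache.ingest_epoch(["a", "b", "c"])   # first epoch lands on disk
q1 = cache.query(["a", "b"])          # query 1 loads columns a and b
q2 = cache.query(["c", "b"])          # query 2 loads only column c
cache.ingest_epoch(["a", "b", "c"])   # second epoch (mini-batch) arrives
q3 = cache.query(["a", "b"])          # only the epoch-1 delta is loaded
print(q1, q2, q3)
```

In this toy model q2 skips column b entirely and q3 loads only the new epoch of a and b, which matches the assumption in the question.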

Regarding CPU to GPU: I worked on the Alenka GPU prototype database (which Anton created), and it was able to beat enterprise database engines even on a single GPU (disk to GPU). There was some latency, especially since we had to iterate because the dataset was 10 GB and the GPU had only 2 GB of VRAM. The enormous speedup came from heavy compression of the data on disk, which resulted in faster throughput to VRAM while adding almost no decompression latency, since decompression is extremely fast on the GPU. I expect you also use something like this to limit the latencies, right?
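The effect described above can be illustrated with plain run-length encoding standing in for whatever codec an engine actually uses: fewer bytes cross the slow disk and PCIe links, and decompression on the GPU is cheap. This is a self-contained toy, not OmniSci’s compression scheme:

```python
# Toy illustration of why compressing data on disk speeds up disk -> VRAM:
# fewer bytes cross the slow links. Simple run-length encoding (RLE) stands
# in here for whatever codec a real engine uses.

def rle_encode(values):
    # Collapse consecutive repeats into [value, run_length] pairs.
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_decode(runs):
    # Expand [value, run_length] pairs back to the original sequence.
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

# A low-cardinality column (e.g. a status code) compresses extremely well,
# so the bytes actually shipped to VRAM shrink by orders of magnitude.
column = [1] * 500_000 + [2] * 300_000 + [3] * 200_000
runs = rle_encode(column)
assert rle_decode(runs) == column     # lossless round trip
print(f"{len(column):,} values -> {len(runs)} runs")
```

For a sorted or low-cardinality column like this one, a million values collapse to a handful of runs; real columnar codecs (dictionary, delta, bit-packing) exploit the same idea with more generality.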

Thank you very much again, Randy, for the response, and have a good day.


#4

Yes, we keep data ‘hot’ in GPU RAM, assuming that it fits. So as long as the table update happened soon enough after the first query that the original data was not kicked out of the cache, it will still be there, and OmniSci will just load the incremental new data.
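The “hot unless evicted” behavior can be sketched with a small LRU cache. The class and chunk names below are hypothetical; this models the caching policy described, not OmniSci’s actual buffer manager:

```python
# Sketch of the "hot in GPU RAM unless evicted" behavior described above.
# Chunks stay cached while they fit; under memory pressure the least-recently-
# used chunk would be dropped and re-loaded by a later query.
from collections import OrderedDict

class VramLru:
    def __init__(self, capacity_chunks):
        self.capacity = capacity_chunks
        self.cache = OrderedDict()          # chunk id -> cached data

    def touch(self, chunk_id, load):
        # Returns True if a disk -> VRAM transfer was needed for this chunk.
        if chunk_id in self.cache:
            self.cache.move_to_end(chunk_id)  # already hot: no transfer
            return False
        self.cache[chunk_id] = load()         # cold: pay the load cost
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used
        return True

vram = VramLru(capacity_chunks=2)
cold = vram.touch("weeks-1-2", lambda: "old data")    # cold load
delta = vram.touch("new-hour", lambda: "mini-batch")  # incremental load
hot = vram.touch("weeks-1-2", lambda: "old data")     # still hot, no reload
print(cold, delta, hot)
```

As long as the working set fits (here, two chunks), the old data survives the arrival of the mini-batch and only the delta pays a transfer; a larger working set would start triggering evictions and reloads.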

Yes, we use compression to the extent that it makes sense.