In-memory database and dealing with large amounts of data

Looking more closely at why OmniSci is fast, it seems there is a reliance on using an in-memory database.

Our database currently has 200 gigabytes of data and is stored entirely in RAM. However, we are constantly adding more data and will have well over a terabyte soon.

How will OmniSciDB handle very large datasets once they can no longer be stored entirely in RAM? Is there any optimization as far as caching common queries to limit read time from disk, or other features of that nature?

Hi @Dominic,

OmniSci is fast for several reasons, and one of them is that hot data is cached in system/GPU RAM; this lets the database exploit the full memory bandwidth available on the CPU or the GPU.

When you load data into the database, it is not stored directly in RAM but on disk (the exception is temporary tables, which reside in RAM), and data is pulled from disk into RAM only when it is needed, and only the parts that are needed.

As an example, let's say you have a table with 9 columns, a row size of 30 bytes, and nearly 110 million records containing the US domestic flights data from 1987 to 2009; keeping the entire table in RAM would need about 3150 MB, but, as said before, nothing is pre-loaded into RAM when the table is loaded.
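For reference, a table of that shape could be declared roughly like this. The column names and encodings below are an assumption for illustration only (the actual flights_o schema isn't shown here); they are just chosen so that 9 columns add up to about 30 bytes per row, and FRAGMENT_SIZE is simply the default spelled out explicitly:

-- hypothetical schema, for illustration only: 9 columns, ~30 bytes per row
CREATE TABLE flights_o (
  dep_timestamp TIMESTAMP ENCODING FIXED(32), -- 4 bytes
  arr_timestamp TIMESTAMP ENCODING FIXED(32), -- 4 bytes
  depdelay      SMALLINT,                     -- 2 bytes
  arrdelay      SMALLINT,                     -- 2 bytes
  carrier       TEXT ENCODING DICT(32),       -- 4 bytes
  origin        TEXT ENCODING DICT(32),       -- 4 bytes
  dest          TEXT ENCODING DICT(32),       -- 4 bytes
  distance      SMALLINT,                     -- 2 bytes
  flight_num    INTEGER                       -- 4 bytes
) WITH (FRAGMENT_SIZE=32000000);

Right after the data is loaded, nothing is resident in memory: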

omnisql> select count(*) from flights_o;
EXPR$0
110072756
omnisql> \memory_summary
OmniSci Server CPU Memory Summary:
            MAX            USE      ALLOCATED           FREE
    25606.60 MB        0.00 MB        0.00 MB        0.00 MB

OmniSci Server GPU Memory Summary:
[GPU]            MAX            USE      ALLOCATED           FREE
  [0]     3658.38 MB        0.00 MB     3658.38 MB     3658.38 MB

If we need to know the average departure delay by month for the first 6 months of 2008, we run a query like this:

omnisql> select extract(month from dep_timestamp),avg(depdelay) from flights_o where cast(dep_timestamp as date) between '2008-01-01' and '2008-06-30' group by 1;
EXPR$0|EXPR$1
1|11.66394319550027
2|13.70029445274699
3|12.4345848201434
4|8.091432675628795
5|7.855462141128088
6|13.69413707955152
omnisql> \memory_summary
OmniSci Server CPU Memory Summary:
            MAX            USE      ALLOCATED           FREE
    25606.60 MB       80.53 MB     4096.00 MB     4015.47 MB

OmniSci Server GPU Memory Summary:
[GPU]            MAX            USE      ALLOCATED           FREE
  [0]     3658.38 MB       80.53 MB     3658.38 MB     3577.85 MB

As you can see, the entire table hasn't been pulled from disk to RAM; in fact, just about 80 MB are needed to run this query. That's because the database is also a columnar one, so only the columns used in the query need to reside in RAM, and when the query filters, only the fragments that could contain matching data are read from disk.
This is possible because the system knows the minimum and maximum value each column has in every fragment, so the optimizer can prune the fragments that cannot contain the data requested by the query, saving time and RAM; without fragment pruning, the two columns used here would need 6 bytes * 110M rows, roughly 660 MB.

Looking at the distribution of the data in the table, and knowing that we created the table with a fragment size of 32 million rows (the default), we run this query:

omnisql> select abs(rowid/32000000),min(dep_timestamp),max(dep_timestamp),count(*) from  flights_o group by 1 ;
EXPR$0|EXPR$1|EXPR$2|EXPR$3
0|1987-10-01 00:12:00|1994-01-01 10:10:00|32000000
1|1993-01-01 03:02:00|2000-12-28 11:50:00|32000000
2|2000-01-01 00:50:00|2006-01-01 08:02:00|32000000
3|2005-03-01 00:57:00|2009-01-01 18:27:00|14072756

So we need just the last fragment, which contains only 14M records, because it is the only one that can contain flight data from 2008; hence the roughly 80 MB (14,072,756 rows * 6 bytes is about 80.5 MB, matching the \memory_summary above). Running the same query asking for data from 1994 would instead require the second fragment, so roughly double the memory.

If you know in advance that the data your users are likely to use is, say, the last 3 years, you can warm up the caches when you start the database, or leave the system free to read data from disk at runtime.
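There is no dedicated warm-up command, so warming the cache is just a matter of running, at startup, an ordinary query that touches the columns your users' dashboards need. A minimal sketch, assuming the hot data is the last 3 years of the hypothetical flights table above:

-- warm-up sketch (hypothetical columns and date range): touching the columns
-- pulls the relevant fragments from disk into CPU/GPU RAM once, so later
-- user queries start from the cache
select count(*), avg(depdelay), avg(arrdelay)
from flights_o
where cast(dep_timestamp as date) >= '2006-01-01';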

The only actual limit of the system is that all the data needed to run a SINGLE query has to fit in RAM; streaming data from disk isn't supported (with a combination of parameters and GPUs it is possible on some queries, but it's not officially supported right now).

In the end, you just need enough memory to fit the chunks of data containing the most-used columns and “partitions”.
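As a back-of-the-envelope sizing exercise (a sketch based on the hypothetical 6-byte hot columns and 3-year window above, not an official sizing tool), you can estimate that working set directly with a query:

-- rough estimate of the hot working set in MB:
-- rows in the hot window * bytes per row of the hot columns
select count(*) * 6 / (1024 * 1024) as hot_mb
from flights_o
where cast(dep_timestamp as date) >= '2006-01-01';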

I hope this answers your question; if not, feel free to ask for more.

Regards,
Candido
