I captured the nyc taxi drives and generating out of them another random data in different times to get as large as possible testing data set which fits into single node MapD on 8 V100 32GB VRAM (256GB in total). Any comments on my optimizations welcomed
Table design of merged records
ID - BIGINT - 8 bytes
POINT (2 float) 2x 4 bytes = 8 bytes (maybe overhead)
NOTE: I wasn’t able to find in documentation real data requirement for POINT data type:
So I estimate it as 2 floats
TIMESTAMP - 8 bytes
OBJECT TYPE (yellow, green, etc.) - TINYINT - 1 byte
OBJECT LABEL (some custom enumeration to mark object …) - TINYINT - 1 byte
TOTAL BYTES = 26 bytes
I understand there is some overhead, but I don’t count this at moment…
1 billion records estimates to ~24 GB.
I have in total 8x 32GB V100 Tesla = 256GB
So gross calculation shows I might be able to fit around 9-10 billions of records on my single node.
Now I am looking for optimization of storage.
Idea 1: TIMESTAMP
Use FIXED type.
TIMESTAMP ENCODING FIXED(32) - 4 bytes
After this optimization TOTAL = 22 bytes
As I plan to use Mapd as real time temporal buffer only (for lets say 1-7 days data) I can even reduce this resolution to use SMALLINT where 3 days will fit in. If I stay on minute resolution I get 4320 minutes in 3 day buffer in total.
Ater this optimization TOTAL = 20 bytes
Idea 2: Object ID
The BIGINT is pretty super big for my purpose:
Minimum value: -9,223,372,036,854,775,807; maximum value: 9,223,372,036,854,775,807
I think I can go with INTEGER as:
-2,147,483,647; maximum value: 2,147,483,647
is enough range for the object unique identifier
Afrter this optimization TOTAL = 16 Bytes
So it looks like 15GB for 1 billion of records.
So eventually I can fit around 17(even 15-16 would be good) billions of records at most.
Can someone comment on my observations/estimations?
Any comments on my numbers welcomed.
Thanks for looking so deeply into OmniSci.
Your numbers are correct assuming you used COMPRESSED POINTS, to make it generally a bit more consumable (so you can see the enumerations as strings) you may want to use TEXT ENCODING DICT(8) for your object types and object labels.
TEXT ENCODING DICT(8)
Your create table would be like
create table your_table (
pt GEOMETRY(POINT, 4326),
tm TIMESTAMP ENCODING FIXED (32),
obj_type TEXT ENCODING DICT (8),
obj_label TEXT ENCODING DICT(8)
When loading at this scale you may want to consider increasing your fragment size as well to reduce the number of fragments. I would recommend exploring a size of ~150M up from default of 32M
I think its crucial to keep data representation as small as possible due to limited VRAM.
Really thank you for your response, your result matches mine (I just wasn’t sure about tinyint vs text dict(8).
Ok, thx I am now looking at WITH available clauses:
Thanks again for this hint!
If you are using decimal, take care when you define the precision because in the latest releases of the database you can save a considerable amount of memory just choosing the right precision
mapdql> create table test_decimal(f1 decimal(3,2), f2 decimal(7,2), f3 decimal(10,2));
mapdql> \d test_decimal
CREATE TABLE test_decimal (
f1 DECIMAL(3,2) ENCODING FIXED(16),
f2 DECIMAL(7,2) ENCODING FIXED(32),
so f1 will use just 2 bytes, f2 4 and f3 8
Additionally I will prepare set of some basic Spatial queries and will execute them always after 0.5 bilions of new data arrives and monitor the VRAM and storage on disk to have better idea where is my limit.