Table design optimization for largest possible dataset


#1

Hi,

I captured the NYC taxi trips and am generating additional random data at different times from them, to get as large a testing data set as possible that still fits into a single-node MapD on 8x V100 32GB VRAM (256GB in total). Any comments on my optimizations are welcome.

Table design of merged records
ID - BIGINT - 8 bytes
POINT (2 floats) - 2x 4 bytes = 8 bytes (maybe some overhead)
NOTE: I wasn’t able to find the actual storage requirement for the POINT data type in the documentation:
https://www.omnisci.com/docs/latest/5_datatypes.html
So I estimate it as 2 floats.
TIMESTAMP - 8 bytes
OBJECT TYPE (yellow, green, etc.) - TINYINT - 1 byte
OBJECT LABEL (some custom enumeration to mark object …) - TINYINT - 1 byte

TOTAL BYTES = 26 bytes
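
As DDL, this unoptimized layout would be roughly the following sketch (table and column names are my own placeholders, and declaring the point as GEOMETRY(POINT, 4326) is my assumption):

create table trips_raw (
  id        BIGINT,                 -- 8 bytes
  pt        GEOMETRY(POINT, 4326),  -- estimated as 2x 4 bytes = 8 bytes
  ts        TIMESTAMP,              -- 8 bytes
  obj_type  TINYINT,                -- 1 byte
  obj_label TINYINT                 -- 1 byte
);                                  -- = 26 bytes per record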

I understand there is some overhead, but I’m not counting it at the moment…

1 billion records comes out to ~24 GB.

I have in total 8x Tesla V100 32GB = 256GB of VRAM.

So a rough calculation shows I might be able to fit around 9-10 billion records on my single node.
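
Spelled out, the arithmetic behind that estimate (rounded, treating GB as GiB):

26 bytes x 1,000,000,000 records = 26 GB ≈ 24.2 GiB per billion records
256 GiB VRAM / 24.2 GiB ≈ 10.6 billion records, before any overhead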

Now I am looking at optimizing storage.

Idea 1: TIMESTAMP
Use a FIXED encoding.

TIMESTAMP ENCODING FIXED(32) - 4 bytes

After this optimization TOTAL = 22 bytes

As I plan to use MapD only as a real-time temporal buffer (for, let’s say, 1-7 days of data), I can reduce the resolution even further and use a SMALLINT, which a 3-day window fits into: at minute resolution, a 3-day buffer holds 4320 minutes in total.

After this optimization TOTAL = 20 bytes
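
In DDL terms, the two variants of the time column would look something like this (column names are placeholders):

-- 4-byte variant, second resolution:
ts TIMESTAMP ENCODING FIXED(32)

-- 2-byte variant for a short rolling buffer, with the minute offset handled by the application:
-- 3 days = 3 x 24 x 60 = 4320 minutes, well inside the SMALLINT range
ts_min SMALLINT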

Idea 2: Object ID

BIGINT is far bigger than I need for my purpose:
Minimum value: -9,223,372,036,854,775,807; maximum value: 9,223,372,036,854,775,807

I think I can go with INTEGER, as:
Minimum value: -2,147,483,647; maximum value: 2,147,483,647

is enough range for the unique object identifier.

After this optimization TOTAL = 16 bytes
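
Putting both ideas together, a sketch of the 16-byte layout (placeholder names again; ts_min is the minute offset from Idea 1 rather than a real TIMESTAMP):

create table trips_optimized (
  id        INTEGER,                -- 4 bytes
  pt        GEOMETRY(POINT, 4326),  -- estimated as 2 floats = 8 bytes
  ts_min    SMALLINT,               -- 2 bytes, minutes since the start of the buffer window
  obj_type  TINYINT,                -- 1 byte
  obj_label TINYINT                 -- 1 byte
);                                  -- = 16 bytes per record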

Review
So it looks like ~15GB for 1 billion records.
So eventually I can fit around 17 billion records at most (even 15-16 billion would be good).
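
Same arithmetic as before, with the new row size:

16 bytes x 1,000,000,000 records = 16 GB ≈ 14.9 GiB per billion records
256 GiB VRAM / 14.9 GiB ≈ 17.2 billion records, before any overhead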

Can someone comment on my observations/estimations?

Any comments on my numbers are welcome.


#2

Hi,

Thanks for looking so deeply into OmniSci.

Your numbers are correct assuming you used COMPRESSED POINTS. To make it generally a bit more consumable (so you can see the enumerations as strings), you may want to use TEXT ENCODING DICT(8) for your object types and object labels.

Your create table would look something like:

create table your_table (
  ID INT,
  pt GEOMETRY(POINT, 4326),
  tm TIMESTAMP ENCODING FIXED(32),
  obj_type TEXT ENCODING DICT(8),
  obj_label TEXT ENCODING DICT(8)
);

When loading at this scale you may want to consider increasing your fragment size as well, to reduce the number of fragments. I would recommend exploring a size of ~150M, up from the default of 32M.
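
If it helps, a sketch of setting that through the table's WITH clause (same columns as above; the 150M value is just the starting point suggested here):

create table your_table (
  ID INT,
  pt GEOMETRY(POINT, 4326),
  tm TIMESTAMP ENCODING FIXED(32),
  obj_type TEXT ENCODING DICT(8),
  obj_label TEXT ENCODING DICT(8)
) WITH (fragment_size = 150000000);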

Regards


#3

Nice @dwayneberry.

I think it’s crucial to keep the data representation as small as possible due to the limited VRAM.

Thank you very much for your response; your result matches mine (I just wasn’t sure about TINYINT vs TEXT ENCODING DICT(8)).

OK, thanks, I am now looking at the available WITH clauses:
https://www.omnisci.com/docs/latest/5_tables.html

Thanks again for this hint!


#4

Additionally, I will prepare a set of basic spatial queries and execute them every time another 0.5 billion records of new data arrive, and monitor VRAM and on-disk storage to get a better idea of where my limit is.
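
For example, something along these lines (column names follow the create table sketch above; the reference point, distance threshold, and timestamp are placeholders, and the threshold is in degrees since pt is a 4326 geometry):

-- count recent objects near a reference point, grouped by type
SELECT obj_type, COUNT(*) AS cnt
FROM your_table
WHERE ST_Distance('POINT(-73.9857 40.7484)', pt) < 0.01
  AND tm >= TIMESTAMP '2019-01-01 00:00:00'
GROUP BY obj_type
ORDER BY cnt DESC;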


#5

Hi @archenroot,

If you are using DECIMAL, take care when you define the precision, because in the latest releases of the database you can save a considerable amount of memory just by choosing the right precision.

e.g.

mapdql> create table test_decimal(f1 decimal(3,2), f2 decimal(7,2), f3 decimal(10,2));
mapdql> \d test_decimal
CREATE TABLE test_decimal (
f1 DECIMAL(3,2) ENCODING FIXED(16),
f2 DECIMAL(7,2) ENCODING FIXED(32),
f3 DECIMAL(10,2))

so f1 will use just 2 bytes, f2 will use 4 bytes, and f3 will use 8 bytes.