Table design optimization for largest

  • 1.  Table design optimization for largest

    Posted 01-23-2019 04:32

    Hi,

    I captured the NYC taxi rides and generated additional random data from them at different times, to build the largest possible test data set that fits into a single-node MapD on 8x V100 with 32 GB VRAM each (256 GB in total). Any comments on my optimizations are welcome.

    Table design of merged records
    ID - BIGINT - 8 bytes
    POINT (2 floats): 2x 4 bytes = 8 bytes (maybe some overhead)
    NOTE: I wasn’t able to find the actual storage requirement for the POINT data type in the documentation:
    https://www.omnisci.com/docs/latest/5_datatypes.html
    so I estimate it as 2 floats.
    TIMESTAMP - 8 bytes
    OBJECT TYPE (yellow, green, etc.) - TINYINT - 1 byte
    OBJECT LABEL (some custom enumeration to mark object …) - TINYINT - 1 byte

    TOTAL BYTES = 26 bytes

    I understand there is some overhead, but I don’t count it for the moment. A rough DDL sketch of this baseline layout is below.
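
    A rough sketch of the baseline layout as a CREATE TABLE (table and column names are just illustrative; the point column is written with explicit 32-bit compression to match the 2x 4-byte estimate):

    create table taxi_points (
      id BIGINT,                                           -- object unique identifier, 8 bytes
      pt GEOMETRY(POINT, 4326) ENCODING COMPRESSED(32),    -- lon/lat point, ~2x 4 bytes
      tm TIMESTAMP,                                        -- 8 bytes before any FIXED encoding
      obj_type TINYINT,                                    -- yellow, green, etc., 1 byte
      obj_label TINYINT                                    -- custom enumeration, 1 byte
    );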

    1 billion records works out to roughly 24 GB.

    I have in total 8x 32GB V100 Tesla = 256 GB.

    So a rough calculation shows I might be able to fit around 9-10 billion records on my single node.

    Now I am looking for optimization of storage.

    Idea 1: TIMESTAMP
    Use the FIXED encoding.

    TIMESTAMP ENCODING FIXED(32) - 4 bytes

    After this optimization TOTAL = 22 bytes

    As I plan to use MapD only as a real-time temporal buffer (for, let’s say, 1-7 days of data), I can even reduce the resolution and use a SMALLINT, into which 3 days will fit. If I stay at minute resolution, a 3-day buffer holds 4320 minutes in total.

    After this optimization TOTAL = 20 bytes

    Idea 2: Object ID

    BIGINT is far larger than I need for my purpose:
    minimum value: -9,223,372,036,854,775,807; maximum value: 9,223,372,036,854,775,807

    I think I can go with INTEGER, as
    minimum value: -2,147,483,647; maximum value: 2,147,483,647

    is enough range for the object unique identifier.

    After this optimization TOTAL = 16 bytes

    Review
    So it looks like ~15 GB for 1 billion records.
    So eventually I can fit around 17 billion records at most (even 15-16 billion would be good).
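
    Putting both ideas together, the optimized layout would look roughly like this (again, names are illustrative; the point column keeps 32-bit compression, and the minute offset is computed on the client side relative to the buffer window start):

    create table taxi_points_opt (
      id INTEGER,                                          -- 4 bytes
      pt GEOMETRY(POINT, 4326) ENCODING COMPRESSED(32),    -- ~8 bytes
      tm_minute SMALLINT,                                  -- 2 bytes, minute offset within the buffer window
      obj_type TINYINT,                                    -- 1 byte
      obj_label TINYINT                                    -- 1 byte
    );                                                     -- ~16 bytes per record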

    Can someone comment on my observations/estimations?

    Any comments on my numbers welcomed.



  • 2.  RE: Table design optimization for largest

    Posted 01-23-2019 05:19

    Hi,

    Thanks for looking so deeply into OmniSci.

    Your numbers are correct assuming you used COMPRESSED POINTS. To make the table a bit more consumable (so you can see the enumerations as strings), you may want to use TEXT ENCODING DICT(8) for your object types and object labels.

    Your create table would look something like:

    create table your_table (
      ID INT,
      pt GEOMETRY(POINT, 4326),
      tm TIMESTAMP ENCODING FIXED(32),
      obj_type TEXT ENCODING DICT(8),
      obj_label TEXT ENCODING DICT(8)
    );
    

    When loading at this scale you may want to consider increasing your fragment size as well, to reduce the number of fragments. I would recommend exploring a size of ~150M, up from the default of 32M.
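
    For example, the fragment size is set through the WITH clause on CREATE TABLE (just a sketch; the exact value is something to experiment with):

    create table your_table (
      ID INT,
      pt GEOMETRY(POINT, 4326),
      tm TIMESTAMP ENCODING FIXED(32),
      obj_type TEXT ENCODING DICT(8),
      obj_label TEXT ENCODING DICT(8)
    ) WITH (fragment_size = 150000000);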

    Regards



  • 3.  RE: Table design optimization for largest

    Posted 01-23-2019 05:54

    Nice @dwayneberry.

    I think it’s crucial to keep the data representation as small as possible due to the limited VRAM.

    Thank you very much for your response; your result matches mine (I just wasn’t sure about TINYINT vs TEXT ENCODING DICT(8)).

    OK, thanks, I am now looking at the available WITH clauses:
    https://www.omnisci.com/docs/latest/5_tables.html

    Thanks again for this hint!



  • 4.  RE: Table design optimization for largest

    Posted 01-23-2019 06:22

    Hi @archenroot,

    If you are using DECIMAL, take care when you define the precision, because in the latest releases of the database you can save a considerable amount of memory just by choosing the right precision.

    e.g.

    mapdql> create table test_decimal(f1 decimal(3,2), f2 decimal(7,2), f3 decimal(10,2));
    mapdql> \d test_decimal
    CREATE TABLE test_decimal (
      f1 DECIMAL(3,2) ENCODING FIXED(16),
      f2 DECIMAL(7,2) ENCODING FIXED(32),
      f3 DECIMAL(10,2))

    so f1 will use just 2 bytes, f2 4 bytes, and f3 8 bytes.



  • 5.  RE: Table design optimization for largest

    Posted 01-23-2019 06:04

    Additionally, I will prepare a set of basic spatial queries and execute them every time another 0.5 billion records arrive, monitoring the VRAM and the storage on disk to get a better idea of where my limit is.
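
    Something along these lines, just as a rough sketch (table and column names follow the earlier example, the polygon coordinates are arbitrary, and the exact geospatial function support should be double-checked against the docs):

    -- count objects inside a bounding box over Manhattan, per object type
    SELECT obj_type, COUNT(*)
    FROM your_table
    WHERE ST_Contains(
            'POLYGON((-74.05 40.70, -73.90 40.70, -73.90 40.80, -74.05 40.80, -74.05 40.70))',
            pt)
    GROUP BY obj_type;

    -- VRAM usage can then be checked with nvidia-smi on the host,
    -- or with \memory_summary inside mapdql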