Hi,

I captured the NYC taxi trips and generated additional randomized data from them at different times, to build the largest possible test data set that fits into a single-node MapD instance on 8x V100 with 32 GB VRAM each (256 GB in total). Any comments on my optimizations are welcome.

Table design of merged records

ID - BIGINT - 8 bytes

POINT (2 floats) - 2x 4 bytes = 8 bytes (maybe some overhead)

**NOTE: I wasn’t able to find the actual storage requirement for the POINT data type in the documentation:**

**https://www.omnisci.com/docs/latest/5_datatypes.html**

**So I estimate it as 2 floats (8 bytes).**

TIMESTAMP - 8 bytes

OBJECT TYPE (yellow, green, etc.) - TINYINT - 1 byte

OBJECT LABEL (some custom enumeration to mark object …) - TINYINT - 1 byte

TOTAL BYTES = 26 bytes

I understand there is some overhead, but I am not counting it for the moment…
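For reference, here is a minimal sketch of the baseline DDL I have in mind (table and column names are just illustrative, and the compressed POINT encoding is my assumption for how to keep the point at ~8 bytes):

```sql
CREATE TABLE moving_objects (
  id           BIGINT,                                         -- 8 bytes
  pos          GEOMETRY(POINT, 4326) ENCODING COMPRESSED(32),  -- ~8 bytes (2x 4-byte compressed coords), my assumption
  ts           TIMESTAMP,                                      -- 8 bytes
  object_type  TINYINT,                                        -- 1 byte (yellow, green, ...)
  object_label TINYINT                                         -- 1 byte (custom enumeration)
);
```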

So 1 billion records works out to 26 bytes × 1 billion ≈ 26 GB (~24 GiB).

I have in total 8x Tesla V100 32 GB = 256 GB of VRAM.

So a rough calculation (256 GB / 26 bytes per row) shows I might be able to fit around 9-10 billion records on my single node.
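Spelled out (nothing OmniSci-specific here, any SQL shell that accepts a constant SELECT without FROM will do, otherwise it is calculator material):

```sql
-- Back-of-the-envelope capacity math, ignoring all overhead
SELECT
  8 + 8 + 8 + 1 + 1                      AS bytes_per_row,          -- 26
  26.0 * 1000000000 / (1024*1024*1024)   AS gib_per_billion_rows,   -- ~24.2
  256.0 * 1000000000 / 26                AS rows_in_256_gb;         -- ~9.8 billion
```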

Now I am looking at ways to optimize the storage.

Idea 1: TIMESTAMP

Use FIXED type.

TIMESTAMP ENCODING FIXED(32) - 4 bytes

After this optimization TOTAL = 22 bytes

As I plan to use MapD only as a real-time temporal buffer (for, let's say, 1-7 days of data), I can even reduce the resolution and store the time as a SMALLINT, into which 3 days fit easily: at minute resolution a 3-day buffer is only 4,320 minutes in total.

After this optimization TOTAL = 20 bytes
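In the CREATE TABLE sketch above, the time column would then become one of these two (the SMALLINT variant assumes I store minutes relative to the start of the buffer window and handle the offset in the application):

```sql
ts        TIMESTAMP ENCODING FIXED(32),  -- variant A: full timestamp semantics in 4 bytes
-- or
ts_minute SMALLINT                       -- variant B: minute offset within the rolling buffer, 2 bytes
                                         --            (3 days = 4,320 minutes, well inside SMALLINT range)
```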

Idea 2: Object ID

BIGINT is much bigger than I need for my purpose:

Minimum value: -9,223,372,036,854,775,807; maximum value: 9,223,372,036,854,775,807

I think I can go with INTEGER, as:

minimum value: -2,147,483,647; maximum value: 2,147,483,647

is enough range for a unique object identifier.

After this optimization TOTAL = 16 bytes
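Putting both ideas together, the final layout would be something like this (again, names are illustrative; 4 + 8 + 2 + 1 + 1 = 16 bytes per row with the SMALLINT time variant, or 18 bytes if I keep TIMESTAMP ENCODING FIXED(32)):

```sql
CREATE TABLE moving_objects_optimized (
  id           INTEGER,                                        -- 4 bytes instead of 8
  pos          GEOMETRY(POINT, 4326) ENCODING COMPRESSED(32),  -- ~8 bytes, my assumption
  ts_minute    SMALLINT,                                       -- 2 bytes, minute offset within the buffer
  object_type  TINYINT,                                        -- 1 byte
  object_label TINYINT                                         -- 1 byte
);
```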

Review

So it looks like 16 bytes × 1 billion ≈ 16 GB (~15 GiB) for 1 billion records.

So eventually I could fit around 17 billion records at most (even 15-16 billion would be good).

Can someone comment on my observations/estimations?

Any comments on my numbers are welcome.