I captured the NYC taxi rides and generated additional random data at different times from them, to build as large a test data set as possible that still fits into a single-node MapD running on 8x V100 with 32GB VRAM each (256GB in total). Any comments on my optimizations are welcome.
Table design of merged records
ID - BIGINT - 8 bytes
POINT (2 floats) - 2x 4 bytes = 8 bytes (maybe some overhead)
NOTE: I wasn’t able to find the actual storage requirement for the POINT data type in the documentation, so I estimate it as 2 floats.
TIMESTAMP - 8 bytes
OBJECT TYPE (yellow, green, etc.) - TINYINT - 1 byte
OBJECT LABEL (some custom enumeration to mark object …) - TINYINT - 1 byte
TOTAL BYTES = 26 bytes
I understand there is some overhead, but I am not counting it for the moment…
1 billion records works out to ~24 GB (26 bytes x 1 billion ≈ 26 GB, or about 24 GiB).
I have 8x 32GB Tesla V100 in total = 256GB of VRAM.
So a rough calculation shows I might be able to fit around 9-10 billion records on my single node.
Now I am looking at optimizing the storage.
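For reference, here is a rough sketch of the baseline table as I currently imagine it (the column names are just placeholders, and the POINT size is my own assumption since I could not confirm it in the docs):

```sql
-- Baseline sketch, ~26 bytes per row by my estimate (ignoring overhead)
CREATE TABLE merged_records (
  object_id    BIGINT,     -- 8 bytes
  location     POINT,      -- assumed 8 bytes (2x 4-byte coordinates)
  event_ts     TIMESTAMP,  -- 8 bytes
  object_type  TINYINT,    -- 1 byte (yellow, green, etc.)
  object_label TINYINT     -- 1 byte (custom enumeration)
);
-- 26 bytes x 10^9 rows ~= 24 GB per billion rows
-- 256 GB VRAM / ~24 GB ~= 9-10 billion rows
```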
Idea 1: TIMESTAMP
Use the FIXED encoding.
TIMESTAMP ENCODING FIXED(32) - 4 bytes
After this optimization TOTAL = 22 bytes
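If I read the fixed-encoding docs correctly, only the timestamp declaration changes, e.g. (same placeholder names as above):

```sql
-- 22-byte variant: timestamp stored with a 32-bit fixed encoding (second resolution)
CREATE TABLE merged_records_fixed_ts (
  object_id    BIGINT,                        -- 8 bytes
  location     POINT,                         -- assumed 8 bytes
  event_ts     TIMESTAMP ENCODING FIXED(32),  -- 4 bytes instead of 8
  object_type  TINYINT,                       -- 1 byte
  object_label TINYINT                        -- 1 byte
);
```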
As I plan to use MapD only as a real-time temporal buffer (for, let’s say, 1-7 days of data), I could even reduce the resolution and use a SMALLINT, into which 3 days fit: at minute resolution a 3-day buffer is 4,320 minutes in total.
After this optimization TOTAL = 20 bytes
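To make the SMALLINT variant concrete, the ingest step would have to turn the absolute timestamp into a minute offset from the start of the rolling buffer. Something like the query below is what I have in mind (the buffer-start epoch 1546300800 is just a made-up example value, and I have not verified this exact EXTRACT/CAST syntax in MapD, so treat it as a sketch):

```sql
-- Minute offset from a hypothetical buffer start (2019-01-01 00:00:00 UTC).
-- A 3-day buffer is at most 3 x 1440 = 4320 minutes, well inside SMALLINT range.
SELECT CAST((EXTRACT(EPOCH FROM event_ts) - 1546300800) / 60 AS SMALLINT) AS ts_minute
FROM merged_records;
```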
Idea 2: Object ID
BIGINT is far bigger than I need for my purpose:
Minimum value: -9,223,372,036,854,775,807; maximum value: 9,223,372,036,854,775,807
I think I can go with INTEGER:
Minimum value: -2,147,483,647; maximum value: 2,147,483,647
which is enough range for a unique object identifier.
After this optimization TOTAL = 16 bytes
So it looks like ~15GB for 1 billion records.
So eventually I could fit around 17 billion records at most (even 15-16 billion would be good).
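Putting it all together, the fully optimized table would look roughly like this (same caveats as above: placeholder names, my own byte estimates, overhead ignored):

```sql
-- All optimizations combined: ~16 bytes per row by my estimate
CREATE TABLE merged_records_optimized (
  object_id    INTEGER,    -- 4 bytes (was BIGINT, 8 bytes)
  location     POINT,      -- assumed 8 bytes
  ts_minute    SMALLINT,   -- 2 bytes, minutes since buffer start (computed at ingest)
  object_type  TINYINT,    -- 1 byte
  object_label TINYINT     -- 1 byte
);
-- 16 bytes x 10^9 rows ~= 15 GB per billion rows
-- 256 GB VRAM / ~15 GB ~= 17 billion rows as a ceiling
```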
Can someone comment on my observations/estimates?
Any comments on my numbers are welcome.