Obfuscation of data with the effect of saving space


#1

I have a task to benchmark OmniSci on a multi-GPU node. I am preparing the TPC-H benchmark and the dataset provided by Mark: https://tech.marksblogg.com/billion-nyc-taxi-rides-nvidia-tesla-mapd.html

But I also have another idea to test and would like to ask you about it here, as it's more related to the data layout within the core of the engine. I would like to obfuscate some data in the schema, which should lead to:

  • saving space -> storing more data on a single node (so more of the dataset fits into memory)
  • replacing strings with integers -> possibly even a speedup

Example:
column Country (String): United States, Great Britain
becomes Country (Integer): 1, 2

On large datasets this can give real savings. I will keep the mapping either on the middleware platform, which reads aggregated views and remaps values to a human-readable form, or it will be translated on the front-end directly.

Q1: Do you think this will help? Imagine I have a lot of enumeration-like columns in my data.

Q2: How does OmniSci work here: it uses GPU VRAM as long as the dataset fits, but what happens when it doesn't (for some query execution)? Will it use host RAM as a second tier (secondary fast storage)? I would like to test these cases and measure the slowdown, so the ratio can tell me whether it is worth buying more RAM or extending the cluster with a new GPU node… Any production-grade experiences are welcome. Thank you.


#2

This is already how OmniSci works; you do not need to pre-process your data. Just set your column type to TEXT ENCODING DICT(T), where T can be 8, 16, or 32 bits depending on the expected number of distinct values in your dictionary-encoded column.

https://www.omnisci.com/docs/latest/5_datatypes.html (see the section ‘Fixed Encoding’)
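As an illustration, here is a minimal sketch of such a schema (the table and column names are hypothetical; the DDL forms are the ones described in the linked docs):

```sql
-- Hypothetical table; the engine builds and maintains the
-- string dictionaries automatically.
CREATE TABLE trips (
  trip_id BIGINT,
  country TEXT ENCODING DICT(8),   -- few distinct values, 1 byte/row
  city    TEXT ENCODING DICT(16),  -- more distinct values, 2 bytes/row
  note    TEXT ENCODING DICT(32)   -- default 32-bit dictionary
);
```

Queries then filter, group, and join on the compact integer IDs internally, while still returning the original strings.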

OmniSci will intelligently move data between GPU RAM and CPU RAM, falling back to disk, for a given query. So as you say, we will try to keep the data in GPU RAM to the extent that we can, then fall back to CPU RAM on the same machine. For information about recommended hardware configurations, please see the following link:

https://www.omnisci.com/docs/latest/4_hardware_configuration_guide.html

In general, your point about buying more hardware is correct. As your workload exceeds your current capacity, adding more GPUs to a single node (ideally with additional CPU RAM as well) will increase your analytics capacity. If you can no longer scale a single node, our Enterprise Edition can also run in distributed mode for a multi-node GPU setup.


#3

Hi @archenroot,

Q1)
In the OmniSci database, the strings you plan to use for filtering, grouping, and joining have to be defined as dictionary-encoded text, so they are already stored on disk and in memory as numbers. If obfuscation itself is what you need, you should use a different technique, such as shuffling or an external dictionary.

So pre-mapping strings to integers yourself won't save memory or give a performance boost, but knowing the cardinality will help you choose the right bit size for the encoding. Using your example: for a column containing countries you could use an 8-bit encoding, so every row of that column uses just 1 byte and you are limited to 255 distinct values; for state names, a 16-bit encoding limited to 65,535 distinct values; and for addresses, a 32-bit one limited to 1 billion distinct values.
If two tables contain a column with the same data, it helps to define it as a shared dictionary, especially if you are going to join the tables on those columns.
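For example, a sketch of the shared-dictionary declaration (the table and column names are made up for illustration; the syntax follows the datatypes documentation linked below):

```sql
-- The first table owns the string dictionary.
CREATE TABLE customers (
  customer_id BIGINT,
  country     TEXT ENCODING DICT(8)
);

-- The second table reuses that dictionary, so a join on 'country'
-- compares integer IDs instead of translating strings.
CREATE TABLE orders (
  order_id BIGINT,
  country  TEXT,
  SHARED DICTIONARY (country) REFERENCES customers(country)
);
```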

You can find information about datatypes and encodings here:

http://docs.mapd.com/latest/5_datatypes.html

Take a look also at the fixed encodings for dates, timestamps, and numbers.
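For instance, a sketch of fixed-width encodings on non-text columns (hypothetical names; check the datatypes page above for the exact forms supported by your version):

```sql
-- Shrink 8-byte types down to 4 bytes when the value range allows it.
CREATE TABLE events (
  event_ts TIMESTAMP ENCODING FIXED(32),  -- 4 bytes instead of 8
  user_id  BIGINT ENCODING FIXED(32)      -- 4 bytes instead of 8
);
```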

Q2) The OmniSci database uses CPU and GPU memory as caches, and it caches only the columns needed by the queries you run. So if you have a table with 50 columns and your query uses just 4 of them, only those 4 columns will be read from disk (the first time), placed in CPU memory, and copied to GPU memory for execution.
If it's impossible to run a query on the GPU for whatever reason, execution can fall back to CPU, but the GPU won't use system RAM directly; that would severely limit performance because the PCIe bus is quite slow (~15 GB/sec), so performance-wise it wouldn't make sense, and it would break compatibility with the Kepler architecture.

The parameter needed to allow queries to fall back to CPU is

allow-cpu-retry=true

Here is a link to the configuration parameters:

http://docs.mapd.com/latest/4_configuration.html

If you want to better understand which factors affect server performance, you can set enable-debug-timer=true in your mapd.conf file; the server will then write more detailed timing information to mapd_server.INFO.
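Putting the two flags together, a minimal mapd.conf fragment might look like this (the file's location and your remaining settings depend on your installation):

```
# mapd.conf (fragment)
allow-cpu-retry = true     # let queries that cannot run on GPU fall back to CPU
enable-debug-timer = true  # write detailed timer lines to mapd_server.INFO
```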

As an example, take a simple query such as select UniqueCarrier,Origin,count() from flights group by 1,2 on a table with 120M+ rows, under four cache scenarios:

1) Freshly started server, so there is nothing in the caches

    I0104 18:19:19.341622 26475 measure.h:80] Timer end                           parse_to_ra                         parse_to_ra: 4644 elapsed 22 ms  --parse time
    I0104 18:19:19.470356 26475 measure.h:80] Timer end               execution_dispatch_comp                     executeWorkUnit: 1033 elapsed 91 ms --time to compile the plan generated
    I0104 18:19:19.633479 27549 measure.h:80] Timer end                           fetchChunks                         fetchChunks: 1683 elapsed 163 ms --time to load data from disk and populate caches
    I0104 18:19:19.777536 27549 measure.h:80] Timer end                          lauchGpuCode                       launchGpuCode:   97 elapsed 43 ms --time to execute the query on gpu
    I0104 18:19:19.781117 26475 measure.h:80] Timer end                          convert_rows                        convert_rows: 4095 elapsed 3 ms --time the cpu spent to decode text dictionary strings and do the final projection
    I0104 18:19:19.781141 26475 MapDHandler.cpp:757] sql_execute-COMPLETED Total: 361 (ms), Execution: 358 (ms)

2) Execution without data in GPU memory

    I0104 18:19:32.453187 26475 measure.h:80] Timer end                           parse_to_ra                         parse_to_ra: 4644 elapsed 11 ms  --parse time
    I0104 18:19:32.454771 26475 measure.h:80] Timer end               execution_dispatch_comp                     executeWorkUnit: 1033 elapsed 1 ms --the query doesn't need to compile 
    I0104 18:19:32.538339 27592 measure.h:80] Timer end                           fetchChunks                         fetchChunks: 1683 elapsed 83 ms  --time to populate gpu caches
    I0104 18:19:32.583942 27592 measure.h:80] Timer end                          lauchGpuCode                       launchGpuCode:   97 elapsed 45 ms --time to execute the query
    I0104 18:19:32.587343 26475 measure.h:80] Timer end                          convert_rows                        convert_rows: 4095 elapsed 3 ms 
    I0104 18:19:32.587360 26475 MapDHandler.cpp:757] sql_execute-COMPLETED Total: 145 (ms), Execution: 141 (ms)

3) Execution without data in CPU and GPU memory, but with a compiled query

    I0104 18:19:43.978327 26475 measure.h:80] Timer end                           parse_to_ra                         parse_to_ra: 4644 elapsed 11 ms
    I0104 18:19:43.979847 26475 measure.h:80] Timer end               execution_dispatch_comp                     executeWorkUnit: 1033 elapsed 1 ms  --the query doesn't need to compile 
    I0104 18:19:44.143931 27595 measure.h:80] Timer end                           fetchChunks                         fetchChunks: 1683 elapsed 164 ms  --time to load data from disk and populate caches
    I0104 18:19:44.189477 27595 measure.h:80] Timer end                          lauchGpuCode                       launchGpuCode:   97 elapsed 45 ms
    I0104 18:19:44.192921 26475 measure.h:80] Timer end                          convert_rows                        convert_rows: 4095 elapsed 3 ms
    I0104 18:19:44.192942 26475 MapDHandler.cpp:757] sql_execute-COMPLETED Total: 225 (ms), Execution: 222 (ms)

4) The caches are populated and the query is already compiled

    I0104 18:31:21.202143 26475 measure.h:80] Timer end                           parse_to_ra                         parse_to_ra: 4644 elapsed 9 ms
    I0104 18:31:21.203577 26475 measure.h:80] Timer end               execution_dispatch_comp                     executeWorkUnit: 1033 elapsed 1 ms
    I0104 18:31:21.203725 27689 measure.h:80] Timer end                           fetchChunks                         fetchChunks: 1683 elapsed 0 ms
    I0104 18:31:21.248863 27689 measure.h:80] Timer end                          lauchGpuCode                       launchGpuCode:   97 elapsed 45 ms
    I0104 18:31:21.252300 26475 measure.h:80] Timer end                          convert_rows                        convert_rows: 4095 elapsed 3 ms

It's better to add GPUs because, besides adding memory, you get more processing power, and therefore better performance on queries.


#4

Thank you guys for the detailed answers! I will go through them in more detail during benchmark preparation.


#5

Cannot wait for benchmark execution on DGX machines :slight_smile: At the moment I have only a single machine, but if I wanted to benchmark scalability as well by joining another machine into a cluster, would I need some kind of license key for the evaluation?


#6

If you want to try multi-node distributed mode as part of our Enterprise Edition, then yes, you will need to talk to our sales team: sales@omnisci.com