Dictionary is using large Chunk of Space, more than Tables

Hi,

Observed following in lab
[test@utility mapd_data]$ du -h --max-depth=1 | grep -i G
6.1G ./table_5_59
13G ./DB_5_DICT_10276
9.5G ./table_5_3456
26G ./DB_5_DICT_10247
27G ./DB_5_DICT_10249
16G ./table_5_3455
5.3G ./DB_5_DICT_10254
86G ./DB_5_DICT_10250
9.0G ./DB_5_DICT_10274
16G ./DB_5_DICT_10277
18G ./DB_5_DICT_10266
14G ./DB_5_DICT_10279
8.1G ./DB_5_DICT_10280
19G ./table_5_3457
42G ./DB_5_DICT_10245
315G .

Observed that Dictionary sizes are far bigger than Table size.

Please let us know following

  1. This dictionary is associated with which tables ?

  2. How we we can cleanup Dictionary ?

  3. Dictionary cleanup will impact anything ?

  4. we are doing lots of CTAS activity, will this can impact the large dictionary size ?

  5. any specific guideline for Dictionary ?

Regards,
sumit

Hi @sumit,

It is possible that a dictionary can be more significant than the underlying table because, in the table, it’s stored only the id that oìpoint to the string in the dictionary.

So if you have a table with two integers and one dictionary encoded string (32 bits as an example), each row will table 12 bytes, and with 1B, records will allocate 12GB of disk space. Assuming a high cardinality of 250M differents trying with an average length of 80 bytes, the dictionary will use 20GB.
Anyway, if you are doing a lot of CTAS, the dictionaries are shared between source and newly created tables.

To know which tables belong to a dictionary, you have to query the catalog, so connect to the internal catalog

we have those quite big dictionaries

|6,0G|DB_1_DICT_198|
|6,0G|DB_1_DICT_202|
|8,0M|DB_1_DICT_1|

the “easy” way is to connect to the catalog and query for the198 and 202 as id of the dictionaries

candido@zion-legion:/mapd_storage/data_test/mapd_catalogs$ sqlite3 omnisci
SQLite version 3.35.5 2021-04-19 18:32:05
Enter ".help" for usage hints.
sqlite> select * from mapd_dictionaries where dictid in(198,202);
198|orders_O_COMMENT_dict198|32|0|1|1
202|orders_ns_O_COMMENT_dict202|32|0|1|1

so from the internal name we know that the tables are called orders and orders_ns and the column is o_comment for both.

Hope this helps and answer your question

Regards,
Candido

One more query

  1. How i can CTAS as instruct for not share the Dictionary… i am expecting by doing this activity may be Dictionary size will reduce. will this activity has any other adverse impact ?

Regards,
sumit

Him

If you want to create the new table with a new dictionary, instead of sharing the old one between the two tables, you can use the use_shared_dictionaries property, this way.

CREATE now_table AS 
SELECT ... 
FROM old_table 
WITH (use_shared_dictionaries=false);

Generally it’s preferable to use shared dictionaries to save space and memory, but it depends by the data and how the tables are populated after the CTAS.

Regards,
Candido

thanks … we will do and update on change in value