How to speed up CTAS command?

Hi,

please let us know if any method is available to speed up the CTAS command. As delete command not free the OS Disk space and no option to drop past day records. Hence need to execute the CTAS command on daily basis.

regards,
sumit

Hi @Sumit,

you can speed up the duplication of data from a table to another (in your case we are talking about an exact copy) using IAS instead of CTAS sharing the dictionaries between tables.

as an example: copying the orders table of TPCH benchmark with a scale 100 (60m records for orders table) with the following ddl

CREATE TABLE orders (
O_ORDERKEY INTEGER NOT NULL,
O_CUSTKEY INTEGER NOT NULL,
O_ORDERSTATUS TEXT NOT NULL ENCODING DICT(8),
O_TOTALPRICE DECIMAL(18,2) NOT NULL,
O_ORDERDATE DATE NOT NULL ENCODING DAYS(16),
O_ORDERPRIORITY TEXT NOT NULL ENCODING DICT(32),
O_CLERK TEXT NOT NULL ENCODING DICT(32),
O_SHIPPRIORITY INTEGER NOT NULL,
O_COMMENT TEXT NOT NULL ENCODING DICT(32))

CTAS takes 88725ms, an IAS without sharing the dictionaries takes 86423ms, while an IAS on a table that shares the dictionaries takes just 14382 ms with a 6x speed-up.

That’s true with everything loaded into RAM on a fast storage (or enough ram to cache fs needed for the newly created table) and with a significant portion of the table consisting of dictionaries.

The DDLs with shared dictionary is the following:

CREATE TABLE orders_copy (
O_ORDERKEY INTEGER NOT NULL,
O_CUSTKEY INTEGER NOT NULL,
O_ORDERSTATUS TEXT NOT NULL ,
O_TOTALPRICE DECIMAL(18,2) NOT NULL,
O_ORDERDATE DATE NOT NULL ENCODING DAYS(16),
O_ORDERPRIORITY TEXT NOT NULL ,
O_CLERK TEXT NOT NULL,
O_SHIPPRIORITY INTEGER NOT NULL,
O_COMMENT TEXT NOT NULL,
SHARED DICTIONARY(O_ORDERSTATUS) references orders(O_ORDERSTATUS),
SHARED DICTIONARY(O_ORDERPRIORITY ) references orders(O_ORDERPRIORITY ),
SHARED DICTIONARY(O_CLERK ) references orders(O_CLERK ),
SHARED DICTIONARY(O_COMMENT  ) references orders(O_COMMENT )
)

and IAS syntax is:

insert into orders_copy select * from orders;

Hi @candido.dessanti

2 Queries

  1. SHARED DICTIONARY field is required only for columns when are defined as “Text” ?

  2. After executing “insert into orders_copy select * from orders;” … If i will drop Table “orders”, my full data will be available in “orders_copy” table and no other impact ?

Regards,
sumit

Hi @Sumit,

Only text datatypes are dictionary encoded, so yes, you can share just string encoded fields.

The strings are stored on a different file than the table data. It’s encoded as an integer (or smallint, tinyint depending the number of distinct values expected in the column), and it’s possible to share a dictionary between different tables.
Doing that you will save disk space and you will speed up the IAS because it’s likely you haven’t to decode and re-encode the string from the source table (I’m not sure of that).
The tradeoff using the shared dictionaries is that the strings are varying a lot between data (e.g., you have something like a preformatted datetime/timestamp stored as dictionary encoded strings), you won’t recover the space of unused ones. So before using this approach, checks if it’s worth the time saved.

Another idea could be to use IAS with shared dictionaries daily and doing a complete rebuild weekly or monthly.

I did some test and yes, you have the data into the new table and the dictionary will survive the drop because is used by the new table

Thanks you @candido.dessanti for detail explaination.

Please let us know how i can check for my any specific table which are dictionary are associated.

regards,
sumit

Hi @sumit,

I posted a shell some time ago, to check the size of data and the dictionaries; @sam.carroll enhanced it.
Just added the details of space used by each dictionary encoded columns

mapd@zion16:~$ ./tablesize.sh omnisci orders /opt/mapd_storage/data
f2fea1ae-8cdd-4e27-b3ed-76329d37b80b , omnisci , orders , 59999999 , 2178 , 2993 , 5171  , MB, 1562336763 , O_ORDERSTATUS (9 MB) O_ORDERDATE (9 MB) O_ORDERPRIORITY (9 MB) O_CLERK (9 MB) O_COMMENT (2957 MB)

filed are uuid, database name, table name, num rows, table size, dictionary size, total size, UOM, dateran, dictionary details

this is the script tablesize.zip (911 Byte)