TPC-H Data Loading

Hi, I’m a new user of omnisci.
My current target is to run the TPC-H benchmark on omnisci.

However, I have no idea how to load the TPC-H data into omnisci.
Is there any related script? Or any hint?

Currently, I plan to do it in a “stupid” way:

  1. prepare TPC-H tables on local machine.
  2. Load the tables into omnisci.
  3. Run the query on omnisci.
    I’m worried that this “stupid” way would miss the optimizations of the pymapd data format, and so on.

Hi @Eilowangfang,

I wouldn’t say you planned anything in a stupid way.

To run regular TPC-H queries on omnisci, you have to start the server with the watchdog disabled, i.e. with the parameter enable-watchdog set to false, and to be able to run all queries except one, you have to do some query rewriting.
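For example, a sketch of the startup command (the storage path is illustrative):

omnisci_server --data /var/lib/omnisci/data --enable-watchdog=false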

I can supply everything you need, so contact me by PM or by the email I supplied on GitHub.

Yes, thanks for your help.
An email has been sent to candido.dessanti@omnisci.com from wangfang722@outlook.com
Looking forward to your reply.


Thanks for your help with TPC-H benchmarking!
Now I can execute the TPC-H queries on omnisci.

Well. Sorry. I still need your help with:

  1. Is there any timer that indicates the end-to-end/elapsed time of one query execution?
    (I could not find any in the documentation)

  2. I’m using omnisci on GPU. I noticed that once I start the omnisci server, GPU memory is occupied, as shown in the figure, and it remains occupied until I stop the server.

Then there are three questions:

  • The data is loaded into GPU memory once when the omnisci server starts, and then the data resides in GPU memory all the time. Is this understanding correct?
  • If so, for each query execution, the PCIe time of transferring data from CPU to GPU is not included, and the running time is only GPU in-device computation. Is this understanding correct?
  • Now I want to measure a running time that includes the PCIe transfer time (as it is not fair to ignore data transfer). How can I achieve this measurement?

I have found the execution time in the log files in /mapd_log. The figure shows the execution time of TPC-H Q10 (repeated 5 times) from the log file.

I have two questions:

  1. Does the “execution_time_ms” include PCIe data transfer time?

  2. I repeated the query five times; the first execution is always much longer, while the subsequent ones are fast.
    The long first execution might be:
    A. due to PCIe data transfer for the first query; subsequent queries (with the same content) might not need the PCIe transfer.
    In this case, 2857ms includes PCIe transfer, and 154/149/142/139ms do not.
    B. due to GPU startup and initialization time at the first query; for each subsequent query, the PCIe transfer is still needed.
    In this case, all of 2857/154/149/142/139ms include PCIe transfer.

Which do you think is the actual reason?

The query performance looks great to me if the short execution times (e.g., 153/149/143/139ms) already include the PCIe transfer time.

Hi @Eilowangfang,

To your question of whether there is a timer that indicates the end-to-end time and so on: the answer is YES. To activate it, you have to start the server with the parameter enable-debug-timer set to true. It is listed in the docs, where it is described as:

Enable fine-grained query execution timers for debug. For debugging, logs verbose timing information for query execution (time to load data, time to compile code, and so on).

The info will be written into the logs at INFO level, so it will be available at [/mapd_storage]/data/mapd_logs/omnisci.INFO.
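For example, the flag can be passed at server startup from the command line (the storage path is illustrative):

omnisci_server --data /var/lib/omnisci/data --enable-debug-timer=true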

Running the TPC-H Q10 query after a server cold start, the output is this:

2020-12-18T16:26:00.744844 I 4773 1 measure.h:74 Timer start                         getExecutor                         getExecutor:  158
2020-12-18T16:26:00.744860 I 4773 1 measure.h:81 Timer end                           getExecutor                         getExecutor:  158 elapsed 0 ms
2020-12-18T16:26:00.744900 I 4773 1 QueryDispatchQueue.h:61 Dispatching query with 0 queries in the queue.
2020-12-18T16:26:00.744953 I 4773 7 QueryDispatchQueue.h:92 Worker 1 running the query and returning control. There are now 0 queries in the queue.
2020-12-18T16:26:00.744998 I 4773 7 measure.h:74 Timer start                         getExecutor                         getExecutor:  158
2020-12-18T16:26:00.745018 I 4773 7 measure.h:81 Timer end                           getExecutor                         getExecutor:  158 elapsed 0 ms
2020-12-18T16:26:00.745647 I 4773 7 measure.h:74 Timer start                  executeRelAlgQuery                  executeRelAlgQuery:  222
2020-12-18T16:26:00.745668 I 4773 7 measure.h:74 Timer start           executeRelAlgQueryNoRetry           executeRelAlgQueryNoRetry:  253
2020-12-18T16:26:00.943161 I 4773 7 Catalog.cpp:1485 Time to load Dictionary 3_28 was 197ms
2020-12-18T16:26:01.140584 I 4773 7 Catalog.cpp:1485 Time to load Dictionary 3_30 was 197ms
2020-12-18T16:26:01.313512 I 4773 7 Catalog.cpp:1485 Time to load Dictionary 3_26 was 172ms
2020-12-18T16:26:01.505229 I 4773 7 Catalog.cpp:1485 Time to load Dictionary 3_27 was 191ms
2020-12-18T16:26:01.505494 I 4773 7 Catalog.cpp:1485 Time to load Dictionary 3_22 was 0ms
2020-12-18T16:26:01.505611 I 4773 7 Catalog.cpp:1485 Time to load Dictionary 3_31 was 0ms
2020-12-18T16:26:01.505698 I 4773 7 measure.h:74 Timer start                    executeRelAlgSeq                    executeRelAlgSeq:  598
2020-12-18T16:26:01.505714 I 4773 7 measure.h:74 Timer start                   executeRelAlgStep                   executeRelAlgStep:  677
2020-12-18T16:26:01.505870 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(lineitem, ["L_ORDERKEY", "L_PARTKEY", "L_SUPPKEY", "L_LINENUMBER", "L_QUANTITY", "L_EXTENDEDPRICE", "L_DISCOUNT", "L_TAX", "L_RETURNFLAG", "L_LINESTATUS", "L_SHIPDATE", "L_COMMITDATE", "L_RECEIPTDATE", "L_SHIPINSTRUCT", "L_SHIPMODE", "L_COMMENT", "rowid"]) to nest level 0
2020-12-18T16:26:01.505891 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(orders, ["O_ORDERKEY", "O_CUSTKEY", "O_ORDERSTATUS", "O_TOTALPRICE", "O_ORDERDATE", "O_ORDERPRIORITY", "O_CLERK", "O_SHIPPRIORITY", "O_COMMENT", "rowid"]) to nest level 1
2020-12-18T16:26:01.505898 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(customer, ["C_CUSTKEY", "C_NAME", "C_ADDRESS", "C_NATIONKEY", "C_PHONE", "C_ACCTBAL", "C_MKTSEGMENT", "C_COMMENT", "rowid"]) to nest level 2
2020-12-18T16:26:01.505905 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(nation, ["N_NATIONKEY", "N_NAME", "N_REGIONKEY", "N_COMMENT", "rowid"]) to nest level 3
2020-12-18T16:26:01.505929 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(lineitem, ["L_ORDERKEY", "L_PARTKEY", "L_SUPPKEY", "L_LINENUMBER", "L_QUANTITY", "L_EXTENDEDPRICE", "L_DISCOUNT", "L_TAX", "L_RETURNFLAG", "L_LINESTATUS", "L_SHIPDATE", "L_COMMITDATE", "L_RECEIPTDATE", "L_SHIPINSTRUCT", "L_SHIPMODE", "L_COMMENT", "rowid"]) to nest level 0
2020-12-18T16:26:01.505947 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(orders, ["O_ORDERKEY", "O_CUSTKEY", "O_ORDERSTATUS", "O_TOTALPRICE", "O_ORDERDATE", "O_ORDERPRIORITY", "O_CLERK", "O_SHIPPRIORITY", "O_COMMENT", "rowid"]) to nest level 1
2020-12-18T16:26:01.505953 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(customer, ["C_CUSTKEY", "C_NAME", "C_ADDRESS", "C_NATIONKEY", "C_PHONE", "C_ACCTBAL", "C_MKTSEGMENT", "C_COMMENT", "rowid"]) to nest level 2
2020-12-18T16:26:01.505959 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(nation, ["N_NATIONKEY", "N_NAME", "N_REGIONKEY", "N_COMMENT", "rowid"]) to nest level 3
2020-12-18T16:26:01.506037 I 4773 7 measure.h:74 Timer start                     executeWorkUnit                     executeWorkUnit: 2769
2020-12-18T16:26:01.506046 I 4773 7 measure.h:74 Timer start                     get_table_infos                     get_table_infos:  263
2020-12-18T16:26:01.506053 I 4773 7 measure.h:81 Timer end                       get_table_infos                     get_table_infos:  263 elapsed 0 ms
2020-12-18T16:26:01.506113 I 4773 7 measure.h:74 Timer start                Exec_executeWorkUnit                 executeWorkUnitImpl: 1405
2020-12-18T16:26:01.506138 I 4773 7 measure.h:74 Timer start              query_step_compilation                 executeWorkUnitImpl: 1425
2020-12-18T16:26:01.506247 I 4773 7 measure.h:81 Timer end                query_step_compilation                 executeWorkUnitImpl: 1425 elapsed 0 ms
2020-12-18T16:26:01.506255 I 4773 7 measure.h:81 Timer end                  Exec_executeWorkUnit                 executeWorkUnitImpl: 1405 elapsed 0 ms
2020-12-18T16:26:01.506268 I 4773 7 measure.h:74 Timer start                     get_table_infos                     get_table_infos:  263
2020-12-18T16:26:01.506276 I 4773 7 measure.h:81 Timer end                       get_table_infos                     get_table_infos:  263 elapsed 0 ms
2020-12-18T16:26:01.506280 I 4773 7 measure.h:74 Timer start                Exec_executeWorkUnit                 executeWorkUnitImpl: 1405
2020-12-18T16:26:01.506289 I 4773 7 measure.h:74 Timer start              query_step_compilation                 executeWorkUnitImpl: 1425
2020-12-18T16:26:01.507720 I 4773 7 measure.h:74 Timer start                      buildJoinLoops                      buildJoinLoops:  266
2020-12-18T16:26:01.510955 I 4773 7 BufferMgr.cpp:304 ALLOCATION slab of 8388608 pages (4294967296B) created in 3 ms GPU_MGR:0
2020-12-18T16:26:01.510986 I 4773 7 BufferMgr.cpp:304 ALLOCATION slab of 8388608 pages (4294967296B) created in 0 ms CPU_MGR:0
2020-12-18T16:26:01.584934 I 4773 7 BufferMgr.cpp:304 ALLOCATION slab of 8388608 pages (4294967296B) created in 3 ms GPU_MGR:1
2020-12-18T16:26:01.996789 I 4773 7 measure.h:81 Timer end                        buildJoinLoops                      buildJoinLoops:  266 elapsed 489 ms
2020-12-18T16:26:01.998070 I 4773 7 NativeCodegen.cpp:283 module does not require linking against libdevice
2020-12-18T16:26:01.998105 I 4773 7 NativeCodegen.cpp:283 module does not require linking against libdevice
2020-12-18T16:26:02.083164 I 4773 7 measure.h:81 Timer end                query_step_compilation                 executeWorkUnitImpl: 1425 elapsed 576 ms
2020-12-18T16:26:02.083338 I 4773 14 measure.h:74 Timer start                          kernel_run                                 run:   92
2020-12-18T16:26:02.083397 I 4773 14 measure.h:74 Timer start                         fetchChunks                         fetchChunks: 2342
2020-12-18T16:26:02.083890 I 4773 15 measure.h:74 Timer start                          kernel_run                                 run:   92
2020-12-18T16:26:02.083909 I 4773 15 measure.h:74 Timer start                         fetchChunks                         fetchChunks: 2342
2020-12-18T16:26:02.634163 I 4773 14 measure.h:81 Timer end                           fetchChunks                         fetchChunks: 2342 elapsed 550 ms
2020-12-18T16:26:02.634241 I 4773 14 measure.h:74 Timer start           executePlanWithoutGroupBy           executePlanWithoutGroupBy: 2693
2020-12-18T16:26:02.634276 I 4773 14 measure.h:74 Timer start                        lauchGpuCode                       launchGpuCode:  200
2020-12-18T16:26:02.727827 I 4773 15 measure.h:81 Timer end                           fetchChunks                         fetchChunks: 2342 elapsed 643 ms
2020-12-18T16:26:02.727893 I 4773 15 measure.h:74 Timer start           executePlanWithoutGroupBy           executePlanWithoutGroupBy: 2693
2020-12-18T16:26:02.727910 I 4773 15 measure.h:74 Timer start                        lauchGpuCode                       launchGpuCode:  200
2020-12-18T16:26:02.771668 I 4773 15 measure.h:81 Timer end                          lauchGpuCode                       launchGpuCode:  200 elapsed 43 ms
2020-12-18T16:26:02.771713 I 4773 15 measure.h:81 Timer end             executePlanWithoutGroupBy           executePlanWithoutGroupBy: 2693 elapsed 43 ms
2020-12-18T16:26:02.771734 I 4773 15 measure.h:81 Timer end                            kernel_run                                 run:   92 elapsed 687 ms
2020-12-18T16:26:02.775700 I 4773 14 measure.h:81 Timer end                          lauchGpuCode                       launchGpuCode:  200 elapsed 141 ms
2020-12-18T16:26:02.775716 I 4773 14 measure.h:81 Timer end             executePlanWithoutGroupBy           executePlanWithoutGroupBy: 2693 elapsed 141 ms
2020-12-18T16:26:02.775736 I 4773 14 measure.h:81 Timer end                            kernel_run                                 run:   92 elapsed 692 ms
2020-12-18T16:26:02.779040 I 4773 7 measure.h:81 Timer end                  Exec_executeWorkUnit                 executeWorkUnitImpl: 1405 elapsed 1272 ms
2020-12-18T16:26:02.779554 I 4773 7 measure.h:74 Timer start                Exec_executeWorkUnit                 executeWorkUnitImpl: 1405
2020-12-18T16:26:02.779595 I 4773 7 measure.h:74 Timer start              query_step_compilation                 executeWorkUnitImpl: 1425
2020-12-18T16:26:02.781268 I 4773 7 measure.h:74 Timer start                      buildJoinLoops                      buildJoinLoops:  266
2020-12-18T16:26:02.790142 I 4773 7 measure.h:81 Timer end                        buildJoinLoops                      buildJoinLoops:  266 elapsed 8 ms
2020-12-18T16:26:02.791306 I 4773 7 NativeCodegen.cpp:283 module does not require linking against libdevice
2020-12-18T16:26:02.791374 I 4773 7 NativeCodegen.cpp:283 module does not require linking against libdevice
2020-12-18T16:26:02.879975 I 4773 7 measure.h:81 Timer end                query_step_compilation                 executeWorkUnitImpl: 1425 elapsed 100 ms
2020-12-18T16:26:02.880165 I 4773 23 measure.h:74 Timer start                          kernel_run                                 run:   92
2020-12-18T16:26:02.880199 I 4773 23 measure.h:74 Timer start                         fetchChunks                         fetchChunks: 2342
2020-12-18T16:26:02.880148 I 4773 22 measure.h:74 Timer start                          kernel_run                                 run:   92
2020-12-18T16:26:02.880376 I 4773 22 measure.h:74 Timer start                         fetchChunks                         fetchChunks: 2342
2020-12-18T16:26:02.956186 I 4773 23 measure.h:81 Timer end                           fetchChunks                         fetchChunks: 2342 elapsed 76 ms
2020-12-18T16:26:02.956326 I 4773 23 measure.h:74 Timer start              executePlanWithGroupBy              executePlanWithGroupBy: 2896
2020-12-18T16:26:02.956351 I 4773 23 measure.h:74 Timer start                        lauchGpuCode                       launchGpuCode:  200
2020-12-18T16:26:03.024190 I 4773 22 measure.h:81 Timer end                           fetchChunks                         fetchChunks: 2342 elapsed 143 ms
2020-12-18T16:26:03.024304 I 4773 22 measure.h:74 Timer start              executePlanWithGroupBy              executePlanWithGroupBy: 2896
2020-12-18T16:26:03.024330 I 4773 22 measure.h:74 Timer start                        lauchGpuCode                       launchGpuCode:  200
2020-12-18T16:26:03.115051 I 4773 23 measure.h:81 Timer end                          lauchGpuCode                       launchGpuCode:  200 elapsed 158 ms
2020-12-18T16:26:03.116177 I 4773 22 measure.h:81 Timer end                          lauchGpuCode                       launchGpuCode:  200 elapsed 91 ms
2020-12-18T16:26:03.119799 I 4773 23 measure.h:81 Timer end                executePlanWithGroupBy              executePlanWithGroupBy: 2896 elapsed 163 ms
2020-12-18T16:26:03.119840 I 4773 23 measure.h:81 Timer end                            kernel_run                                 run:   92 elapsed 239 ms
2020-12-18T16:26:03.119846 I 4773 22 measure.h:81 Timer end                executePlanWithGroupBy              executePlanWithGroupBy: 2896 elapsed 95 ms
2020-12-18T16:26:03.119875 I 4773 22 measure.h:81 Timer end                            kernel_run                                 run:   92 elapsed 239 ms
2020-12-18T16:26:03.217164 I 4773 7 measure.h:81 Timer end                  Exec_executeWorkUnit                 executeWorkUnitImpl: 1405 elapsed 437 ms
2020-12-18T16:26:03.217273 I 4773 7 measure.h:81 Timer end                       executeWorkUnit                     executeWorkUnit: 2769 elapsed 1711 ms
2020-12-18T16:26:03.359704 I 4773 7 measure.h:81 Timer end                     executeRelAlgStep                   executeRelAlgStep:  677 elapsed 1854 ms
2020-12-18T16:26:03.359741 I 4773 7 measure.h:81 Timer end                      executeRelAlgSeq                    executeRelAlgSeq:  598 elapsed 1854 ms
2020-12-18T16:26:03.359769 I 4773 7 measure.h:81 Timer end             executeRelAlgQueryNoRetry           executeRelAlgQueryNoRetry:  253 elapsed 2614 ms
2020-12-18T16:26:03.359778 I 4773 7 measure.h:81 Timer end                    executeRelAlgQuery                  executeRelAlgQuery:  222 elapsed 2614 ms

The most relevant info is:

2020-12-18T16:26:02.083164 I 4773 7 measure.h:81 Timer end query_step_compilation executeWorkUnitImpl: 1425 elapsed 576 ms

The system didn’t know the query, so it had to parse, optimize, and generate the LLVM code for each step of the query (you can get the LLVM code with the explain [your_statement] command).

To build the code, the system took 576ms (if you run the same query with different parameters, the old code is reused).
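For example, from omnisql (the query text is illustrative):

omnisql> explain SELECT COUNT(*) FROM lineitem;

This prints the generated LLVM IR instead of executing the query.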

Then, for the time to load the data from disk into system RAM and GPU RAM, you have to look at fetchChunks:

2020-12-18T16:26:02.634163 I 4773 14 measure.h:81 Timer end                           fetchChunks                         fetchChunks: 2342 elapsed **550** ms 
2020-12-18T16:26:02.727827 I 4773 15 measure.h:81 Timer end                           fetchChunks                         fetchChunks: 2342 elapsed **643** ms

Those threads run in parallel, so the entire operation likely lasted 643ms. I have a disk subsystem capable of 10GB/sec, so the reads from disk are really fast, and the bottleneck is the CPU power and the PCIe bus. You also get other data, like the time the query spent on the GPU or the CPU, depending on the device the executor chooses for your query.

You will also get a summary in a more readable format at the end of the log. Something like this:

2614ms total duration for executeRelAlgQuery
  2614ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:254
    760ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:255
    1854ms start(760ms) executeRelAlgSeq RelAlgExecutor.cpp:599
      1853ms start(760ms) executeRelAlgStep RelAlgExecutor.cpp:678
        1853ms start(760ms) executeSort RelAlgExecutor.cpp:2446
          1711ms start(760ms) executeWorkUnit RelAlgExecutor.cpp:2770
            0ms start(760ms) compileWorkUnit NativeCodegen.cpp:2409
            576ms start(760ms) compileWorkUnit NativeCodegen.cpp:2409
              474ms start(762ms) getInstance HashJoin.cpp:480
                474ms start(762ms) reify JoinHashTable.cpp:487
                    New thread(8)
                      214ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
                    End thread(8)
                    New thread(9)
                      374ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
                    End thread(9)
              14ms start(1236ms) getInstance HashJoin.cpp:480
                14ms start(1236ms) reify JoinHashTable.cpp:487
                    New thread(10)
                      2ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
                    End thread(10)
                    New thread(11)
                      0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
                    End thread(11)
              0ms start(1250ms) getInstance HashJoin.cpp:480
                0ms start(1250ms) reify JoinHashTable.cpp:487
                    New thread(12)
                      0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
                    End thread(12)
                    New thread(13)
                      0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
                    End thread(13)
              New thread(14)
                0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:91
                550ms start(0ms) fetchChunks Execute.cpp:2341
                0ms start(550ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:703
                141ms start(550ms) executePlanWithoutGroupBy Execute.cpp:2694
                  141ms start(550ms) launchGpuCode QueryExecutionContext.cpp:199
              End thread(14)
              New thread(15)
                0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:91
                643ms start(0ms) fetchChunks Execute.cpp:2341
                0ms start(644ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:703
                43ms start(644ms) executePlanWithoutGroupBy Execute.cpp:2694
                  43ms start(644ms) launchGpuCode QueryExecutionContext.cpp:199
              End thread(15)
            3ms start(2030ms) collectAllDeviceResults Execute.cpp:1783
              3ms start(2030ms) reduceMultiDeviceResults Execute.cpp:887
            100ms start(2033ms) compileWorkUnit NativeCodegen.cpp:2409
              7ms start(2035ms) getInstance HashJoin.cpp:480
                7ms start(2035ms) reify JoinHashTable.cpp:487
                    New thread(16)
                      6ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
                    End thread(16)
                    New thread(17)
                      6ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
                    End thread(17)
              0ms start(2043ms) getInstance HashJoin.cpp:480
                0ms start(2043ms) reify JoinHashTable.cpp:487
                    New thread(18)
                      0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
                    End thread(18)
                    New thread(19)
                      0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
                    End thread(19)
              0ms start(2044ms) getInstance HashJoin.cpp:480
                0ms start(2044ms) reify JoinHashTable.cpp:487
                    New thread(20)
                      0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
                    End thread(20)
                    New thread(21)
                      0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
                    End thread(21)
              New thread(22)
                0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:91
                143ms start(0ms) fetchChunks Execute.cpp:2341
                0ms start(144ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:703
                95ms start(144ms) executePlanWithGroupBy Execute.cpp:2895
                  91ms start(144ms) launchGpuCode QueryExecutionContext.cpp:199
                  3ms start(236ms) getRowSet QueryExecutionContext.cpp:136
                    3ms start(236ms) reduceMultiDeviceResults Execute.cpp:887
                      3ms start(236ms) reduceMultiDeviceResultSets Execute.cpp:933
              End thread(22)
              New thread(23)
                0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:91
                76ms start(0ms) fetchChunks Execute.cpp:2341
                0ms start(76ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:703
                163ms start(76ms) executePlanWithGroupBy Execute.cpp:2895
                  158ms start(76ms) launchGpuCode QueryExecutionContext.cpp:199
                  4ms start(234ms) getRowSet QueryExecutionContext.cpp:136
                    4ms start(234ms) reduceMultiDeviceResults Execute.cpp:887
                      4ms start(234ms) reduceMultiDeviceResultSets Execute.cpp:933
              End thread(23)
            97ms start(2374ms) collectAllDeviceResults Execute.cpp:1783
              97ms start(2374ms) reduceMultiDeviceResults Execute.cpp:887
                97ms start(2374ms) reduceMultiDeviceResultSets Execute.cpp:933
          142ms start(2471ms) sort ResultSet.cpp:484
            142ms start(2471ms) baselineSort ResultSet.cpp:544
              0ms start(2613ms) createComparator ResultSet.h:638
              0ms start(2614ms) topPermutation ResultSet.cpp:846

Re-running the query changing the year from 1993 to 1994, the total time is 441 ms.

Clearing the GPU memory cache with the command \clear_gpu and re-running the query, the total time went up to 874 ms; so, to transfer the data and populate the GPU caches, the system is likely taking over 400ms.
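A minimal omnisql session sketching this kind of measurement (the table name is illustrative):

omnisql> \timing
omnisql> \clear_gpu
omnisql> SELECT COUNT(*) FROM lineitem;
omnisql> SELECT COUNT(*) FROM lineitem;

The first SELECT after \clear_gpu pays the PCIe transfer cost; the second runs against data already resident in GPU memory, so the difference between the two timings approximates the transfer time.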

So your guesses were mostly right, except that we don’t load anything automatically when the Omnisci server starts (there is a parameter that can be used to warm up some caches), and the data resides on the GPU as needed. For a complete GPU execution with the default parameters, all the data has to fit in GPU memory; if there isn’t enough GPU RAM, the query will be executed on the CPU, and if there isn’t enough CPU memory, it’s likely to fail.

The GPU memory is allocated when needed, and you can limit it with the gpu-buffer-mem-bytes parameter, which defaults to 0 and will allocate all the available GPU memory in your system (if needed).

I think I replied to your questions, but if I forgot something, please tell me.

Best Regards,
Candido :wink:

Thank you. Your answer is clear to me.

I have configured the file, adding the num-gpus and enable-debug-timer parameters:

[screenshot of the configuration file]

I did get the running time info, but only “total_time_ms” and “execution_time_ms”, not the detailed output like you showed.


BTW, I’m curious about the usage of “num-gpus = 1”. My understanding is that it sets the number of GPUs used to execute the query. I have two GPUs and I just want to use one of them, so I set “num-gpus = 1”.
However, when I run the TPC-H query, data is loaded onto both GPU0 and GPU1, but it seems only GPU0 works. I can’t quite understand what happens to the GPUs.
Are GPU0 and GPU1 working together on the query? Or does just GPU0 work, while GPU1 only holds the data and does nothing?

Hi @Eilowangfang,

You have placed the parameters num-gpus and enable-debug-timer into the [web] section, so they are ignored by the DB and passed to the omnisci_web_server, which is likely to fail. The database is therefore running on 2 GPUs, and you won’t find the new entries in the logfile.
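As a sketch of omnisci.conf (ports and paths are illustrative), DB parameters belong in the main section, before [web]:

port = 6274
data = "/var/lib/omnisci/data"
num-gpus = 1
enable-debug-timer = true

[web]
port = 6273
frontend = "/opt/omnisci/frontend"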

Looking at the nvidia-smi screenshot you posted, it looks like nothing is running on the GPU; while GPU0 reports 10%, the power consumption is so low that it’s unlikely to be running a heavy kernel. Probably the query you are running has fallen back to CPU execution, or there isn’t any query running.

While benchmarking, I suggest you collect data in a fine-grained way, sampling with an interval of 100ms or 200ms using the switch -lms (which takes its input in milliseconds), and collect the data in CSV format.

Something like this:

nvidia-smi -lms 200 --query-gpu=gpu_name,index,utilization.gpu,utilization.memory,clocks.current.sm,power.draw,memory.used --format=csv

Having multiple GPUs, this will give you an idea of how much each of them is used.

About your question of how many GPUs are used during a query, the correct answer is that all of them will be used: at some point in time every GPU will be used, and if everything is properly tuned, all the GPUs in the system will be busy during query execution.

The tables in OmniSciDB consist of macro-partitions that we call fragments; at the time of writing, a fragment is assigned to a single GPU, so if a table has 2 fragments and you have 2 GPUs, each GPU will have a fragment to work on, so both will be busy. Still, if a table has 3 fragments, the first GPU will work on two fragments while the second works on just 1, so it’s likely that the second GPU will be idle while the first one is still working.

In the same scenario, if the first fragment has more records than the second one, the first GPU will work more than the second.

You can mitigate this behavior by manually balancing the number of records across fragments, or by creating the tables with a number of shards equal to the number of GPUs in your system, as sketched below.
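As an illustrative sketch (a simplified column list, not the full TPC-H schema), a sharded table for a 2-GPU system could be created like this:

CREATE TABLE customer_sharded (
  c_custkey INTEGER NOT NULL,
  c_name TEXT ENCODING DICT(32),
  c_nationkey INTEGER,
  c_acctbal DECIMAL(12,2),
  SHARD KEY (c_custkey))
WITH (shard_count = 2);

Rows are distributed across the shards by hashing c_custkey, so each GPU gets a comparable share of the data.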

Regards,
Candido

Hi,

Thanks. It was my mistake to put the configuration lines in the wrong place.
Now, I got the detailed timer results.

When I run TPC-H SF=10 Q10, the GPU utilization is shown in the figure below:

I think the utilization is perhaps reasonable:
as my workload is small (SF=10), the kernel execution only takes around 104ms,
while compilation and other things, such as result printing, take more than 160ms.

The nvidia-smi GPU utilization denotes “the percentage of time over the last second that any of the SMs was busy”. Hence, since the kernel execution time is shorter than the other time components, I think 13% is reasonable.
As for the nvidia-smi GPU memory utilization, it actually denotes the percentage of time the memory controller (PCIe) was busy; as the query is executed with the data residing in GPU memory, 1% utilization seems reasonable to me.

Regards,
WANG Fang

Hi,

The frequencies and the power consumption look a little lower than expected, while the 104ms kernel runtime is reasonable (400ms of total time on an RTX 2080 Ti with SF=40).

The memory utilization shouldn’t be the PCIe usage, but how busy the memory is; 1% would mean the memory scan read only about 10GB in the last second… it looks quite weird.

To get the PCIe usage you have to use other tools.

Regards

Hi,

Yes, I agree that the memory utilization is weird.
I plan to profile further with nvprof/Nsight Compute.

Could you give some hints on how to use the profiling tools on omnisci?
nvprof is normally run against an executable file, while omnisci queries are issued through an interactive command.

I tried packaging the command into a shell file query.sh:

$OMNISCI_PATH/bin/omnisql -p HyperInteractive omnisci < scripts.sql

and then ran nvprof query.sh.
But it seems this does not work; no profile data is collected.

Hi @Eilowangfang,

You have to run nvprof on the omnisci_server executable.

So you have to stop the service and run the server from the command line

nvprof [your flags] omnisci_server --data /path_to_omnisci_storage [other omnisci server flags]

Then run some queries and stop the server with control+c.
You should get the output, or a binary file, depending on the parameters you specified.
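For example, a sketch that writes the profiler summary to a log file (paths are illustrative):

nvprof --log-file /tmp/omnisci_nvprof.log omnisci_server --data /var/lib/omnisci/data

Run your queries from another terminal with omnisql, then stop the server with Ctrl+C so nvprof can write its summary.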

Thank you. I tried and it works.


I have an update on the discussion of memory utilization here.
Previously, the memory utilization reported by nvidia-smi was just 1%, which is quite weird.

I just repeated the profiling using Nsight Compute, and the reported memory utilization is around 86%.

@candido.dessanti, the memory utilization here denotes the time spent on L1/L2 and global memory access. 86% might be reasonable (on TPC-H SF=10).

@Eilowangfang,

It looks to be the profile of init_hash_join, so it’s not the query itself but a kernel called to initialize the hash join buffers. The duration is 240 nanoseconds… well, it would be nice to be so fast :wink:

I will try to cook something up for you during the day, though I am having some trouble with the new driver on my workstation.

It would be nice to recover the posts from the previous version of the community forum, but I think it’s not possible.

Regards

Hi,
Yes, the running time of omnisci looks good to me.

As mentioned in the email, my next step is to survey the multi-query performance of omnisci.
I understand that omnisci currently does not support concurrent multi-query execution, but my scenario needs to support more users, achieving high query throughput.

Could you give me some hints on how to implement this?
Thank you so much.

I wanted to say that you haven’t profiled the query runtime, but a kernel that initializes some buffers to perform hash joins, and Q10 is doing some of them.

To run queries in parallel, you have to create 2 instances of omniscidb, load the same data using the same table names, and connect to the instances in a round-robin manner.

So, as the first step, you have to initialize the storage for each of the two DBs:

./bin/initdb /data_storage_1
./bin/initdb /data_storage_2

Then run two instances of the software, changing the storage and ports and limiting the system and GPU memory accessible to each instance; I allocated 16GB per instance because you are using V100S GPUs:

./bin/omnisci_server --data  /data_storage_1 --gpu-buffer-mem-bytes 16000000000 --cpu-buffer-mem-bytes 32000000000 
./bin/omnisci_server --data  /data_storage_2 --gpu-buffer-mem-bytes 16000000000 --cpu-buffer-mem-bytes 32000000000 --calcite-port 16279 --http-port 16278 --port 16274

Then create the tables on both instances and populate them; from now on, you are ready to run queries in parallel on your GPUs. Depending on the queries, you will get some performance boost compared to the usual intra-query parallelism we use in the database, though I’m not sure the performance boost justifies the waste of resources.
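A minimal client-side sketch of the round-robin idea using pymapd (credentials are the defaults; the ports match the two instances above):

from itertools import cycle
from pymapd import connect

# One connection per omnisci_server instance; the second instance
# listens on the shifted port from the example above.
connections = [
    connect(user="admin", password="HyperInteractive",
            host="localhost", dbname="omnisci", port=6274),
    connect(user="admin", password="HyperInteractive",
            host="localhost", dbname="omnisci", port=16274),
]
round_robin = cycle(connections)

def run_query(sql):
    # Dispatch each query to the next instance in turn; queries issued
    # from different threads or clients can then run in parallel.
    cursor = next(round_robin).execute(sql)
    return cursor.fetchall()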

Maybe it is better to add GPUs and rely on the num-executors parameter discussed in the thread Concurrent Queries, turning on the dynamic watchdog to stop queries that would stall the system.

Regards,

Candido

“I wanted to say that you haven’t profiled the query runtime, but a kernel that initializes some buffers to perform hash joins, and Q10 is doing some of them.”

Yes, sorry that the previous figure made the profiling result confusing. I did get a detailed profile of Q10, which I think includes all the kernels executed for Q10.
The figure below shows part of the kernels:

Thanks for your help with concurrent query implementation. I will try soon and update the performance.

You mentioned that “I’m not sure the performance boost justifies the waste of resources.”
What kinds of resources do you think would probably be wasted? (global memory, SMs, or others?)

My motivation for running multi-query is that:

  1. My dataset is small enough to fit in GPU memory.
  2. Running a single query cannot fully utilize the GPU resources.
  3. Running multiple queries can improve resource utilization and also achieve high query throughput.
    I got this motivation only because the community says so.
    However, I’m not clear on what specific kinds of GPU resources I should try to use better.
    The resources that I can recognize:
  • GPU global memory
  • SM utilization
  • registers
  • shared memory
    I still know very little about GPUs. Is there any other kind of resource that could be better utilized by concurrent multi-query execution?

Hi,

Yep, the multifrag_query kernel is the one that runs for most of the query runtime.

Well, if you are doing what I did, you are cutting your memory resources in half and doubling the space allocated on disk. When I briefly tried doing that, I got some performance improvement versus a classical serial execution, but not as big as I expected.

Maybe using a Tesla-class card is better, because I tried with RTX gaming-class cards. I am not a great GPU programmer, but maybe you could try running queries while changing the cuda-block-size and cuda-grid-size parameters; in the past the default values have been 1024 and 2 times the number of SMs on the GPU (you can get the value from the log or by looking up the specs of your cards on the internet).
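As a sketch (the values are illustrative; pick them based on your GPU), the parameters can be passed at server startup:

omnisci_server --data /var/lib/omnisci/data --cuda-block-size 512 --cuda-grid-size 160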

Hi.
Merry Christmas.

I just got a chance to access an Nvidia DGX-2 server, which has 10 V100-SXM3-32GB GPUs.
So I revisited the performance of moving data between CPU memory and GPU memory.
However, I found that the data transfer between CPU memory and GPU memory is still slow.

After reading some documents, I think that the DGX-2 only provides high bandwidth via NVLink 2.0 for GPU-to-GPU data transfer. Host-to-GPU and GPU-to-host transfers still go via PCIe 3.0, because the CPU in the DGX-2 is an Intel® Xeon® Platinum 8168, which does not support NVLink between the host and the GPUs.
Do you agree with this?

It seems I have no chance to use NVLink 2.0 for host transfers unless I can access IBM Power servers.