Multi-GPU test on TPC-H queries

We tested OmniSciDB on TPC-H SF 1/2/4/8 on 2 x RTX 2080 Ti over PCIe, but the result is almost the same as with a single 2080 Ti.

Why is multi-GPU not faster than single-GPU? Is there any optimization? (I tried the --num-gpus=2 parameter, but it was the same…)

Hi @wnn156,

Thanks for your message. For TPC-H, the largest table is LINEITEM, which has 6M rows per scale factor (SF), so at SF 8 that table has 48 million rows. OmniSci partitions data into “fragments” of, by default, 32M rows. This can be changed when creating a table using the WITH (fragment_size=X) option.
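For example, a minimal sketch of setting the option at table-creation time (column list abbreviated from the TPC-H LINEITEM schema):

-- minimal sketch; column list abbreviated from the TPC-H LINEITEM schema
CREATE TABLE lineitem (
  l_orderkey BIGINT,
  l_quantity DECIMAL(15,2),
  l_extendedprice DECIMAL(15,2),
  l_shipdate DATE
  -- ...remaining TPC-H columns...
) WITH (fragment_size = 32000000); -- the 32M-row default, shown explicitly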

Work is assigned to GPUs (and also to CPU cores, when applicable) at the granularity of fragments. So at SF 8 you’ll have one fragment of the LINEITEM table with 32M rows and one with 16M rows, meaning the first GPU gets twice the work of the second. At scale factors 1, 2, and 4 you’ll have only one fragment for LINEITEM (at SF 4 it will have 6M × 4 = 24M rows), so you’ll see no inter-GPU parallelism.

The solution is to create your tables with the fragment size parameter mentioned above: for SF 8, a fragment size of 24M for LINEITEM would balance the table between the two GPUs, while the other tables can be left at their defaults (see the sketch below). In the longer term we have some thoughts about doing more granular balancing dynamically, such as allowing for the concept of “sub-fragments”, i.e. splitting a fragment into sizes that can be evenly distributed to the available hardware. Until then, your best bet is to tweak the fragment_size parameter, or to load enough data that rough balancing happens automatically.
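As a hedged sketch (assuming your OmniSci version supports WITH options on CREATE TABLE AS SELECT), an already-loaded SF 8 LINEITEM could be copied into a balanced layout; lineitem_balanced is a hypothetical name:

-- 48M rows / 2 GPUs = one 24M-row fragment per GPU
CREATE TABLE lineitem_balanced AS (SELECT * FROM lineitem)
WITH (fragment_size = 24000000);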

(One other point worth noting: performance will generally increase to some degree with larger fragments, since per-fragment processing overhead is reduced, assuming the chosen size still allows for balancing across the hardware.)


Hi @wnn156,

Thanks for joining the forum.

If I can add something to what @todd said, I suggest increasing the SF of the TPC-H benchmark, especially if you are using GPUs with a lot of bandwidth and thousands of CUDA cores, like the RTX 2080 Ti.

As an example, using a modified version of TPC-H Q1 you will get:

| Number of GPUs | SF 40: Total Runtime (GPU Runtime) | SF 8: Total Runtime (GPU Runtime) | SF 8 balanced: Total Runtime (GPU Runtime) |
|---|---|---|---|
| 1 | 177 (165) | 50 (33) | 48 (32) |
| 2 | 105 (95-76) | 38 (26-14) | 30 (16-15) |
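For reference, an abridged version of the standard TPC-H Q1 (the exact modification used for the runs above isn’t shown here):

SELECT
  l_returnflag,
  l_linestatus,
  SUM(l_quantity)      AS sum_qty,
  SUM(l_extendedprice) AS sum_base_price,
  AVG(l_quantity)      AS avg_qty,
  COUNT(*)             AS count_order
FROM lineitem
WHERE l_shipdate <= DATE '1998-09-02'
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;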

As you can see, the scaling of GPU runtime isn’t exactly linear at SF 40 (unexpected) or at SF 8 (expected, because the fragment on the first GPU is larger than the one on the second).

The total runtime is higher because of the parsing, optimization, and final-reduction steps done on the CPU, and as you can see those steps affect the runtime of runs against smaller datasets more than bigger ones: 10 out of 105 is less than 10%, while 10 out of 38 is more than 25%.

You can get more detailed timings for the execution steps of a query by setting the enable-debug-timer parameter to true.
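For example (assuming the flag name and config-file syntax match your server version), in the server’s omnisci.conf:

# enables per-step timing traces in the server log
enable-debug-timer = true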

You will get output like the following for each query; it gives you an idea of how long each GPU has been busy and what is causing slowness in your queries. In the trace below, threads 41 and 42 each drive one GPU, and launchGpuCode shows them busy for 100ms and 97ms respectively:

117ms total duration for sql_execute
  103ms start(14ms) executeRelAlgQuery RelAlgExecutor.cpp:158
    103ms start(14ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:181
      0ms start(14ms) Query pre-execution steps RelAlgExecutor.cpp:182
      103ms start(14ms) executeRelAlgSeq RelAlgExecutor.cpp:424
        103ms start(14ms) executeRelAlgStep RelAlgExecutor.cpp:503
          103ms start(14ms) executeSort RelAlgExecutor.cpp:2261
            102ms start(14ms) executeWorkUnit RelAlgExecutor.cpp:2574
              1ms start(14ms) compileWorkUnit NativeCodegen.cpp:1647
                New thread(41)
                  0ms start(0ms) fetchChunks Execute.cpp:2089
                  0ms start(0ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:766
                  100ms start(0ms) executePlanWithGroupBy Execute.cpp:2629
                    100ms start(0ms) launchGpuCode QueryExecutionContext.cpp:195
                    0ms start(100ms) reduceMultiDeviceResultSets Execute.cpp:881
                End thread(41)
                New thread(42)
                  0ms start(0ms) fetchChunks Execute.cpp:2089
                  0ms start(0ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:766
                  97ms start(0ms) executePlanWithGroupBy Execute.cpp:2629
                    97ms start(0ms) launchGpuCode QueryExecutionContext.cpp:195
                    0ms start(97ms) reduceMultiDeviceResultSets Execute.cpp:881
                End thread(42)
              0ms start(117ms) collectAllDeviceResults Execute.cpp:1572
                0ms start(117ms) reduceMultiDeviceResultSets Execute.cpp:881
            0ms start(117ms) sort ResultSet.cpp:496

Hope this helps