A strange test result on TPC-H Query 1

We tested OmniSci on the TPC-H 5/20/100 GB benchmarks; the time spent on Query 1 was 52.009s, 106.727s, and 1.133s respectively.

Why does OmniSci run so much faster on 100 GB than on 5 GB/20 GB? Is there any special optimization?

Hi @minghong.zhou,

Thanks for joining the community forum :wink:

Regarding your problem, there isn't any specific optimization for bigger datasets.

I ran TPC-H Q1 on tpch-40 and tpch-100 datasets, and while the execution times aren't stellar (I guess this is one of the few queries that runs better on CPU), the response times are 1100ms and 2400ms using 2 GPUs, while the CPU runtime is faster (1289ms on tpch-100 using just a few cores).
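
For reference, this is the standard TPC-H Q1 from the spec (I've folded the usual 90-day delta into a literal shipdate here, since interval arithmetic support can vary between versions):

SELECT
  l_returnflag,
  l_linestatus,
  SUM(l_quantity) AS sum_qty,
  SUM(l_extendedprice) AS sum_base_price,
  SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
  SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
  AVG(l_quantity) AS avg_qty,
  AVG(l_extendedprice) AS avg_price,
  AVG(l_discount) AS avg_disc,
  COUNT(*) AS count_order
FROM lineitem
WHERE l_shipdate <= DATE '1998-09-02'
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;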

Maybe you are running into trouble when the query runs on GPU (depending on the DDL, tpch-100 can use over 20 GB of memory, so the query is probably falling back to CPU).
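
A quick way to check whether the data even fits on your GPUs is to look at the buffer pools from omnisql: \memory_summary prints CPU and GPU memory usage, and \cpu / \gpu let you force the execution device so you can compare timings directly (these are the standard omnisql backslash commands, though availability can vary slightly between releases):

omnisql> \memory_summary
omnisql> \cpu
omnisql> \gpu

Run Q1 once after \cpu and once after \gpu; if the two timings match, the query is almost certainly being punted to CPU.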

Could you share the DDL of your tables? My lineitem DDL is the following:

CREATE TABLE lineitem (
L_ORDERKEY INTEGER NOT NULL,
L_PARTKEY INTEGER NOT NULL,
L_SUPPKEY INTEGER NOT NULL,
L_LINENUMBER INTEGER NOT NULL,
L_QUANTITY DECIMAL(10,2) NOT NULL,
L_EXTENDEDPRICE DECIMAL(18,2) NOT NULL,
L_DISCOUNT DECIMAL(18,2) NOT NULL,
L_TAX DECIMAL(18,2) NOT NULL,
L_RETURNFLAG TEXT NOT NULL ENCODING DICT(8),
L_LINESTATUS TEXT NOT NULL ENCODING DICT(8),
L_SHIPDATE DATE NOT NULL ENCODING DAYS(16),
L_COMMITDATE DATE NOT NULL ENCODING DAYS(16),
L_RECEIPTDATE DATE NOT NULL ENCODING DAYS(16),
L_SHIPINSTRUCT TEXT NOT NULL ENCODING DICT(8),
L_SHIPMODE TEXT NOT NULL ENCODING DICT(8),
L_COMMENT TEXT NOT NULL ENCODING DICT(32),
SHARD KEY (L_ORDERKEY))
WITH (FRAGMENT_SIZE = 302000000, SHARD_COUNT = 2, PARTITIONS = 'SHARDED');

You can turn on the enable-debug-timer parameter (in omnisci.conf, add a line with enable-debug-timer = true) to check in the log what's going on with your query.
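
For reference, a minimal omnisci.conf could look like the following (the port and data path are just the defaults of a typical install, so adjust them to your setup); restart omnisci_server afterwards so the setting takes effect:

port = 6274
http-port = 6278
data = "/var/lib/omnisci/data"
enable-debug-timer = true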

As an example, on my workstation this is the summary output of the query:

2420ms total duration for executeRelAlgQuery
  2419ms executeWorkUnit RelAlgExecutor.cpp:1902
    1ms compileWorkUnit NativeCodegen.cpp:1570
      New thread(140328619190016)
        0ms fetchChunks Execute.cpp:2034
        0ms getQueryExecutionContext QueryMemoryDescriptor.cpp:706
        2417ms executePlanWithGroupBy Execute.cpp:2434
          2417ms launchGpuCode QueryExecutionContext.cpp:195
          0ms reduceMultiDeviceResultSets Execute.cpp:875
      End thread(140328619190016)
      New thread(140328610797312)
        0ms fetchChunks Execute.cpp:2034
        0ms getQueryExecutionContext QueryMemoryDescriptor.cpp:706
        1266ms executePlanWithGroupBy Execute.cpp:2434
          1265ms launchGpuCode QueryExecutionContext.cpp:195
          0ms reduceMultiDeviceResultSets Execute.cpp:875
      End thread(140328610797312)
    0ms collectAllDeviceResults Execute.cpp:1523
      0ms reduceMultiDeviceResultSets Execute.cpp:875

You can see exactly where (GPU or CPU) the query ran, and which step took the most time.

In the example, the query ran on two GPUs (two threads with launchGpuCode), and the bottleneck looks to be the runtime of the first GPU thread, which took around 2400ms (this is weird, since the data is balanced between the two GPUs).