Query Elapsed Time does not increase linearly.

  • 1.  Query Elapsed Time does not increase linearly.

    Posted 03-04-2019 22:14
    Query elapsed time does not increase linearly as the scale factor increases. What do you think is the biggest difference when the scale factor increases from 32G to 64G?

    Below is a table of query execution times measured with a warm cache.

    System Specification
    CPU: i7-8700K (3.7 GHz)
    GPU: GTX 1080 Ti 11 GB
    RAM: 32 GB



    ------------------------------
    Jeon Woohyuck
    ------------------------------


  • 2.  RE: Query Elapsed Time does not increase linearly.

    Posted 03-05-2019 01:34
    Edited by Candido Dessanti 03-05-2019 03:33
    Hi @Jeon Woohyuck,

    Assuming you are talking about the TPC-H schema and you are on MapD 4.4.x, it's likely that at the 64 GB scale the data needed by the queries didn't fit in the memory of your GPU. In that case the optimizer, to avoid a CPU fallback, executes the query in several trips, unloading and loading data to and from GPU memory as needed. Alternatively, you are on another release and the optimizer is falling back to CPU execution.

    First scenario:

    With Q1 at a scale of 30 GB:

    SELECT l_returnflag, sum(l_quantity) AS sum_qty, sum(l_extendedprice) AS sum_base_price, sum(l_extendedprice*(1-l_discount)) AS sum_disc_price, sum(l_extendedprice*(1-l_discount)*(1+l_tax)) AS sum_charge, avg(l_quantity) AS avg_qty, avg(l_extendedprice) AS avg_price, avg(l_discount) AS avg_disc, count(*) AS count_order FROM lineitem WHERE l_shipdate <= DATE '1996-12-01' - INTERVAL '90' DAY (3) GROUP BY l_returnflag, l_linestatus ORDER BY l_returnflag;

    The query takes 1248 ms, allocates 6415.57 MB of memory, and runs in a single trip.

    With the enable-debug-timer parameter turned on, you will see output like this:

    I0305 09:50:09.712998 129067 measure.h:73] Timer start execution_dispatch_run operator(): 1056
    I0305 09:50:09.713070 129067 measure.h:73] Timer start fetchChunks fetchChunks: 1682
    I0305 09:50:09.713325 129067 measure.h:80] Timer end fetchChunks fetchChunks: 1682 elapsed 0 ms
    I0305 09:50:09.713392 129067 measure.h:73] Timer start executePlanWithGroupBy executePlanWithGroupBy: 2092
    I0305 09:50:09.713407 129067 measure.h:73] Timer start lauchGpuCode launchGpuCode: 97
    I0305 09:50:10.844782 129067 measure.h:80] Timer end lauchGpuCode launchGpuCode: 97 elapsed 1131 ms
    I0305 09:50:10.844820 129067 measure.h:80] Timer end executePlanWithGroupBy executePlanWithGroupBy: 2092 elapsed 1131 ms
    I0305 09:50:10.844846 129067 measure.h:80] Timer end execution_dispatch_run operator(): 1056 elapsed 1131 ms

    After adding another 30 GB of data to the table, the GPU memory needed is 12831.14 MB, so the execution is split into several trips and the query time increases to 3357 ms (2.7x). In the log you will find several messages from the memory manager. Here is an extract:

    I0305 10:00:08.908902 94987 measure.h:80] Timer end execution_dispatch_run operator(): 1056 elapsed 886 ms
    I0305 10:00:08.909087 94993 measure.h:73] Timer start fetchChunks fetchChunks: 1682
    I0305 10:00:08.909134 94993 BufferMgr.cpp:425] ALLOCATION failed to find 256000000B free. Forcing Eviction. Eviction start 2875000 Number pages requested 500000 Best Eviction Start Slab 0 GPU_MGR:0
    I0305 10:00:08.933188 94993 BufferMgr.cpp:425] ALLOCATION failed to find 256000000B free. Forcing Eviction. Eviction start 0 Number pages requested 500000 Best Eviction Start Slab 1 GPU_MGR:0
    I0305 10:00:08.957132 94993 BufferMgr.cpp:425] ALLOCATION failed to find 256000000B free. Forcing Eviction. Eviction start 0 Number pages requested 500000 Best Eviction Start Slab 4 GPU_MGR:0
    I0305 10:00:08.981112 94993 BufferMgr.cpp:425] ALLOCATION failed to find 32000000B free. Forcing Eviction. Eviction start 3752714 Number pages requested 62500 Best Eviction Start Slab 0 GPU_MGR:0
    I0305 10:00:08.984266 94993 BufferMgr.cpp:425] ALLOCATION failed to find 32000000B free. Forcing Eviction. Eviction start 937500 Number pages requested 62500 Best Eviction Start Slab 5 GPU_MGR:0
    I0305 10:00:08.987442 94993 BufferMgr.cpp:425] ALLOCATION failed to find 128000000B free. Forcing Eviction. Eviction start 250000 Number pages requested 250000 Best Eviction Start Slab 3 GPU_MGR:0
    I0305 10:00:08.999441 94993 measure.h:80] Timer end fetchChunks fetchChunks: 1682 elapsed 90 ms
    I0305 10:00:08.999529 94993 measure.h:73] Timer start executePlanWithGroupBy executePlanWithGroupBy: 2092
    I0305 10:00:08.999537 94993 measure.h:73] Timer start lauchGpuCode launchGpuCode: 97
    I0305 10:00:09.149390 94993 measure.h:80] Timer end lauchGpuCode launchGpuCode: 97 elapsed 149 ms
    I0305 10:00:09.149436 94993 measure.h:80] Timer end executePlanWithGroupBy executePlanWithGroupBy: 2092 elapsed 149 ms
    I0305 10:00:09.149449 94993 measure.h:80] Timer end execution_dispatch_run operator(): 1056 elapsed 1127 ms
    I0305 10:00:09.149472 94992 measure.h:73] Timer start fetchChunks fetchChunks: 1682
    I0305 10:00:09.149579 94992 measure.h:80] Timer end fetchChunks fetchChunks: 1682 elapsed 0 ms
    I0305 10:00:09.149632 94992 measure.h:73] Timer start executePlanWithGroupBy executePlanWithGroupBy: 2092
    I0305 10:00:09.149646 94992 measure.h:73] Timer start lauchGpuCode launchGpuCode: 97
    I0305 10:00:09.370874 94992 measure.h:80] Timer end lauchGpuCode launchGpuCode: 97 elapsed 221 ms
    I0305 10:00:09.370887 94992 measure.h:80] Timer end executePlanWithGroupBy executePlanWithGroupBy: 2092 elapsed 221 ms
    I0305 10:00:09.370935 94992 measure.h:80] Timer end execution_dispatch_run operator(): 1056 elapsed 1348 ms
    I0305 10:00:09.370981 94990 measure.h:73] Timer start fetchChunks fetchChunks: 1682
    I0305 10:00:09.371073 94990 BufferMgr.cpp:425] ALLOCATION failed to find 256000000B free. Forcing Eviction. Eviction start 500000 Number pages requested 500000 Best Eviction Start Slab 4 GPU_MGR:0
    I0305 10:00:09.395375 94990 BufferMgr.cpp:425] ALLOCATION failed to find 256000000B free. Forcing Eviction. Eviction start 1000000 Number pages requested 500000 Best Eviction Start Slab 4 GPU_MGR:0
    I0305 10:00:09.417973 94990 BufferMgr.cpp:425] ALLOCATION failed to find 256000000B free. Forcing Eviction. Eviction start 1500000 Number pages requested 500000 Best Eviction Start Slab 0 GPU_MGR:0
    I0305 10:00:09.439924 94990 BufferMgr.cpp:425] ALLOCATION failed to find 256000000B free. Forcing Eviction. Eviction start 2164416 Number pages requested 500000 Best Eviction Start Slab 3 GPU_MGR:0
    I0305 10:00:09.462505 94990 measure.h:80] Timer end fetchChunks fetchChunks: 1682 elapsed 91 ms
    I0305 10:00:09.462582 94990 measure.h:73] Timer start executePlanWithGroupBy executePlanWithGroupBy: 2092
    I0305 10:00:09.462592 94990 measure.h:73] Timer start lauchGpuCode launchGpuCode: 97
    I0305 10:00:09.610553 94990 measure.h:80] Timer end lauchGpuCode launchGpuCode: 97 elapsed 147 ms
    I0305 10:00:09.610569 94990 measure.h:80] Timer end executePlanWithGroupBy executePlanWithGroupBy: 2092 elapsed 147 ms
    I0305 10:00:09.610580 94990 measure.h:80] Timer end execution_dispatch_run operator(): 1056 elapsed 1588 ms
    I0305 10:00:09.610630 94991 measure.h:73] Timer start fetchChunks fetchChunks: 1682
    I0305 10:00:09.610745 94991 BufferMgr.cpp:425] ALLOCATION failed to find 256000000B free. Forcing Eviction. Eviction start 2664416 Number pages requested 500000 Best Eviction Start Slab 3 GPU_MGR:0
    I0305 10:00:09.634245 94991 BufferMgr.cpp:425] ALLOCATION failed to find 256000000B free. Forcing Eviction. Eviction start 2250000 Number pages requested 500000 Best Eviction Start Slab 1 GPU_MGR:0
    I0305 10:00:09.656515 94991 BufferMgr.cpp:425] ALLOCATION failed to find 256000000B free. Forcing Eviction. Eviction start 1250000 Number pages requested 500000 Best Eviction Start Slab 2 GPU_MGR:0
    I0305 10:00:09.678789 94991 BufferMgr.cpp:425] ALLOCATION failed to find 256000000B free. Forcing Eviction. Eviction start 1750000 Number pages requested 500000 Best Eviction Start Slab 2 GPU_MGR:0
    I0305 10:00:09.701083 94991 measure.h:80] Timer end fetchChunks fetchChunks: 1682 elapsed 90 ms
    I0305 10:00:09.701160 94991 measure.h:73] Timer start executePlanWithGroupBy executePlanWithGroupBy: 2092
    I0305 10:00:09.701169 94991 measure.h:73] Timer start lauchGpuCode launchGpuCode: 97
    I0305 10:00:09.921865 94991 measure.h:80] Timer end lauchGpuCode launchGpuCode: 97 elapsed 220 ms
    I0305 10:00:09.921881 94991 measure.h:80] Timer end executePlanWithGroupBy executePlanWithGroupBy: 2092 elapsed 220 ms
    I0305 10:00:09.921892 94991 measure.h:80] Timer end execution_dispatch_run operator(): 1056 elapsed 1899 ms
    I0305 10:00:09.921943 94994 measure.h:73] Timer start fetchChunks fetchChunks: 1682
    I0305 10:00:09.922060 94994 BufferMgr.cpp:425] ALLOCATION failed to find 256000000B free. Forcing Eviction. Eviction start 2250000 Number pages requested 500000 Best Eviction Start Slab 2 GPU_MGR:0
    I0305 10:00:09.947115 94994 BufferMgr.cpp:425] ALLOCATION failed to find 256000000B free. Forcing Eviction. Eviction start 1500000 Number pages requested 500000 Best Eviction Start Slab 4 GPU_MGR:0
    I0305 10:00:09.971230 94994 BufferMgr.cpp:425] ALLOCATION failed to find 256000000B free. Forcing Eviction. Eviction start 0 Number pages requested 500000 Best Eviction Start Slab 0 GPU_MGR:0
    I0305 10:00:09.995249 94994 BufferMgr.cpp:425] ALLOCATION failed to find 256000000B free. Forcing Eviction. Eviction start 500000 Number pages requested 500000 Best Eviction Start Slab 0 GPU_MGR:0
    I0305 10:00:10.025403 94994 BufferMgr.cpp:425] ALLOCATION failed to find 128000000B free. Forcing Eviction. Eviction start 3820642 Number pages requested 250000 Best Eviction Start Slab 0 GPU_MGR:0
    I0305 10:00:10.037401 94994 measure.h:80] Timer end fetchChunks fetchChunks: 1682 elapsed 115 ms
    I0305 10:00:10.037493 94994 measure.h:73] Timer start executePlanWithGroupBy executePlanWithGroupBy: 2092
    I0305 10:00:10.037501 94994 measure.h:73] Timer start lauchGpuCode launchGpuCode: 97
    I0305 10:00:10.187234 94994 measure.h:80] Timer end lauchGpuCode launchGpuCode: 97 elapsed 149 ms
    I0305 10:00:10.187294 94994 measure.h:80] Timer end executePlanWithGroupBy executePlanWithGroupBy: 2092 elapsed 149 ms
    I0305 10:00:10.187307 94994 measure.h:80] Timer end execution_dispatch_run operator(): 1056 elapsed 2165 ms

    As noted above, it looks like the executor is processing one fragment after another, which adds overhead to the process (memory manager activity, multiple GPU launches and so on) and makes linear scaling of the execution time impossible.
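
The per-fragment overhead described above can be read off the debug-timer log directly. The following is a minimal Python sketch, assuming the log lines have exactly the layout of the extract above; `summarize_log` and the keys it returns are hypothetical names, not part of any OmniSci tool. It counts GPU kernel launches, sums their elapsed time, and counts the allocations that forced an eviction.

```python
import re

# Patterns modeled on the log extract quoted in this thread. Assumption:
# other server versions may format these lines differently.
KERNEL_END = re.compile(
    r"Timer end\s+\S+\s+launchGpuCode:\s*\d+\s+elapsed\s+(\d+)\s+ms"
)
EVICTION = re.compile(r"ALLOCATION failed to find (\d+)B free\. Forcing Eviction")


def summarize_log(lines):
    """Summarize GPU kernel launches and forced evictions in debug-timer output."""
    kernel_ms = []        # elapsed time of each launchGpuCode call
    requested_bytes = 0   # total size of the allocations that forced an eviction
    evictions = 0
    for line in lines:
        kernel = KERNEL_END.search(line)
        if kernel:
            kernel_ms.append(int(kernel.group(1)))
            continue
        eviction = EVICTION.search(line)
        if eviction:
            evictions += 1
            requested_bytes += int(eviction.group(1))
    return {
        "kernel_launches": len(kernel_ms),
        "kernel_ms_total": sum(kernel_ms),
        "evictions": evictions,
        "eviction_requested_bytes": requested_bytes,
    }


# Four lines copied from the extract above.
SAMPLE = [
    "I0305 10:00:08.909134 94993 BufferMgr.cpp:425] ALLOCATION failed to find 256000000B free. Forcing Eviction. Eviction start 2875000 Number pages requested 500000 Best Eviction Start Slab 0 GPU_MGR:0",
    "I0305 10:00:08.981112 94993 BufferMgr.cpp:425] ALLOCATION failed to find 32000000B free. Forcing Eviction. Eviction start 3752714 Number pages requested 62500 Best Eviction Start Slab 0 GPU_MGR:0",
    "I0305 10:00:09.149390 94993 measure.h:80] Timer end lauchGpuCode launchGpuCode: 97 elapsed 149 ms",
    "I0305 10:00:09.370874 94992 measure.h:80] Timer end lauchGpuCode launchGpuCode: 97 elapsed 221 ms",
]

if __name__ == "__main__":
    print(summarize_log(SAMPLE))
```

On a real log file the same function can be fed an open file object, since it only iterates over lines. Comparing the number of kernel launches and evictions between two scale factors shows whether the extra time comes from the kernels themselves or from moving data in and out of GPU memory.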

    In the second scenario the optimizer falls back to CPU execution, so a CPU thread is used for each fragment (12 if you are using the default fragment size). Because you have just 4C/8T on your CPU, the execution time of Q1 and Q6 is slower.

    After cutting the cores/threads on my system, I get 3613 ms for Q1 and 6119 ms for Q6.

    If you want to be sure what's happening on your system, I suggest setting enable-debug-timer to true on your server and searching the log for launchGpuCode or launchCpuCode.
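
The check suggested above can be scripted. This is a minimal Python sketch, assuming the server log has already been read into a string; `execution_device` is a hypothetical helper name, and the two marker strings are the timer names that appear in the log extracts in this thread.

```python
def execution_device(log_text):
    """Classify a debug-timer log as 'gpu', 'cpu', 'mixed', or 'unknown'."""
    on_gpu = "launchGpuCode" in log_text
    on_cpu = "launchCpuCode" in log_text
    if on_gpu and on_cpu:
        return "mixed"   # some steps ran on the GPU, others on the CPU
    if on_gpu:
        return "gpu"
    if on_cpu:
        return "cpu"
    return "unknown"     # no kernel timers found; is the debug timer enabled?


if __name__ == "__main__":
    line = "measure.h:80] Timer end lauchGpuCode launchGpuCode: 97 elapsed 1131 ms"
    print(execution_device(line))
```

A result of "cpu" or "mixed" on a server you expected to run entirely on the GPU points to the CPU fallback scenario rather than the multi-trip one.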

    On OmniSci 4.5 the behavior is different and the query falls back to CPU execution. You can also save memory by using the new encoding for the date datatype and an appropriate precision for decimal datatypes.

    This DDL would lead to a fairly large memory saving:

    CREATE TABLE (
    L_ORDERKEY INTEGER,
    L_PARTKEY INTEGER,
    L_SUPPKEY INTEGER,
    L_LINENUMBER INTEGER,
    L_QUANTITY DECIMAL(8,2) ENCODING FIXED(32),
    L_EXTENDEDPRICE DECIMAL(8,2) ENCODING FIXED(32),
    L_DISCOUNT DECIMAL(4,2) ENCODING FIXED(16),
    L_TAX DECIMAL(4,2) ENCODING FIXED(16),
    L_RETURNFLAG TEXT ENCODING DICT(8),
    L_LINESTATUS TEXT ENCODING DICT(8),
    L_SHIPDATE DATE ENCODING FIXED(16),
    L_COMMITDATE DATE ENCODING FIXED(16),
    L_RECEIPTDATE DATE ENCODING FIXED(16),
    L_SHIPINSTRUCT TEXT ENCODING DICT(8),
    L_SHIPMODE TEXT ENCODING DICT(8),
    L_COMMENT TEXT ENCODING DICT(32))
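
To get a rough feel for the saving, the per-row width implied by this DDL can be added up. The sketch below is in Python and rests on stated assumptions: FIXED(n) and DICT(n) are taken to store n bits per value, INTEGER is taken to be 32 bits, and the figure of roughly 6 million lineitem rows per unit of TPC-H scale factor is an approximation used only for illustration. The names `COLUMN_BITS`, `row_bytes` and `table_gib` are hypothetical.

```python
# Bits stored per value for each column, read off the DDL above.
# Assumptions: FIXED(n) and DICT(n) store n bits per value and INTEGER
# takes 32 bits. Dictionary storage for the TEXT columns is not counted.
COLUMN_BITS = {
    "L_ORDERKEY": 32,
    "L_PARTKEY": 32,
    "L_SUPPKEY": 32,
    "L_LINENUMBER": 32,
    "L_QUANTITY": 32,
    "L_EXTENDEDPRICE": 32,
    "L_DISCOUNT": 16,
    "L_TAX": 16,
    "L_RETURNFLAG": 8,
    "L_LINESTATUS": 8,
    "L_SHIPDATE": 16,
    "L_COMMITDATE": 16,
    "L_RECEIPTDATE": 16,
    "L_SHIPINSTRUCT": 8,
    "L_SHIPMODE": 8,
    "L_COMMENT": 32,
}

# Assumption: roughly 6 million lineitem rows per unit of scale factor.
ROWS_PER_SCALE_FACTOR = 6_000_000


def row_bytes(column_bits=COLUMN_BITS):
    """Bytes needed to store one row with the encodings above."""
    return sum(column_bits.values()) // 8


def table_gib(scale_factor, column_bits=COLUMN_BITS):
    """Approximate column data size in GiB for the given scale factor."""
    rows = scale_factor * ROWS_PER_SCALE_FACTOR
    return rows * row_bytes(column_bits) / 2**30


if __name__ == "__main__":
    print(row_bytes(), "bytes per row")
    print(f"{table_gib(32):.2f} GiB at scale factor 32")
```

A query only loads the columns it references, so the GPU footprint of Q1 is smaller than the whole-table figure; passing a dictionary restricted to those columns gives the per-query estimate.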

    ------------------------------
    Candido Dessanti
    Dba
    Crismatica consulting
    Rome
    ------------------------------



  • 3.  RE: Query Elapsed Time does not increase linearly.

    Posted 03-05-2019 21:51

    My results are like this.





    I don't have the following log.

    Did you use the 'allow-cpu-retry=true' option?

    ------------------------------
    Jeon Woohyuck
    ------------------------------



  • 4.  RE: Query Elapsed Time does not increase linearly.

    Posted 03-06-2019 01:48
    Hi @Jeon Woohyuck,

    The allow-cpu-retry parameter has to be set explicitly to true on mapd_server in versions earlier than 4.5; from 4.5 onward it is turned on by default.


    ------------------------------
    Candido Dessanti
    Dba
    Crismatica consulting
    Rome
    ------------------------------



  • 5.  RE: Query Elapsed Time does not increase linearly.

    Posted 28 days ago
    We've re-measured the query.
    We compared 16G and 32G, and used '--enable-debug-timer' to analyze the log files.

    Workload:TPC-H V.2.18.0
    Query Number: 1

    <Warm Cache>
    16G Elapsed Time:441ms


    32G Elapsed Time:1573ms



    So here's the result. Unlike in the last question, however, the log file did not show any failed GPU memory allocation. Now that the allocations succeed and the data size has doubled, we expected the time to double, but we were wrong. Why doesn't the time increase linearly when the allocations are complete?
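
For reference, the gap between the measured and the expected scaling can be put into numbers. This is a small Python sketch using only the two warm-cache timings above; `scaling_ratio` is a hypothetical helper name.

```python
def scaling_ratio(t_small_ms, t_large_ms, size_small, size_large):
    """Return (time ratio, size ratio, time ratio relative to size ratio)."""
    time_ratio = t_large_ms / t_small_ms
    size_ratio = size_large / size_small
    return time_ratio, size_ratio, time_ratio / size_ratio


# Timings from this post: 441 ms at 16G, 1573 ms at 32G.
time_ratio, size_ratio, excess = scaling_ratio(441, 1573, 16, 32)
print(f"time x{time_ratio:.2f}, data x{size_ratio:.2f}, excess x{excess:.2f}")
```

With these inputs the elapsed time grows by about 3.57x for a 2x increase in data, which is about 1.78x more than linear scaling would predict.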

    ------------------------------
    Jeon Woohyuck
    ------------------------------



  • 6.  RE: Query Elapsed Time does not increase linearly.

    Posted 27 days ago
    Hi @Jeon Woohyuck,

    If you see allocation messages in your logs, the system has to load data from somewhere, so the caches can't be considered warm.
    In your case, the system is allocating slabs in CPU and GPU memory, so it's going to read data from disk to populate the CPU caches and subsequently the GPU ones.

    Anyway, I'll check with TPC-H at scales of 16 and 32 ASAP.




    ------------------------------
    Candido Dessanti
    Dba
    Crismatica consulting
    Rome
    ------------------------------



  • 7.  RE: Query Elapsed Time does not increase linearly.

    Posted 27 days ago
    Oh, sorry. That log is from a cold cache, but the value I wrote is the time measured after running the query about 5 times.

    ------------------------------
    Jeon Woohyuck
    ------------------------------



  • 8.  RE: Query Elapsed Time does not increase linearly.

    Posted 27 days ago
    hi @Jeon Woohyuck,

    My runtimes are linear on the same hardware you are using for testing.

    In the attachment you will find the query runtimes, memory allocation and - output.

    Regards

    ------------------------------
    Candido Dessanti
    Dba
    Crismatica consulting
    Rome
    ------------------------------

    Attachment(s)

    txt
    tpch_scale_16_32.txt   7K 1 version


  • 9.  RE: Query Elapsed Time does not increase linearly.

    Posted 26 days ago
    Why do I get this result on my computer?
    That's weird. I don't know.
    Could you look at my log file?

    ------------------------------
    Jeon Woohyuck
    ------------------------------

    Attachment(s)

    txt
    tpc_16 Q1.txt   27K 1 version
    txt
    tpc_32 Q1.txt   27K 1 version


  • 10.  RE: Query Elapsed Time does not increase linearly.

    Posted 26 days ago
    Hi @Jeon Woohyuck,

    According to the log you posted for the TPCH-16 run, the OmniSci server was started in CPU mode, so the query is being executed on the CPU (3 threads).


    I0325 18:13:35.750970  9117 MapDHandler.cpp:222] Started in CPU mode
    
    I0325 18:13:41.022897 10254 measure.h:73] Timer start                        lauchCpuCode                       launchCpuCode:  529
    I0325 18:13:42.533156 10253 measure.h:80] Timer end                          lauchCpuCode                       launchCpuCode:  529 elapsed 1544 ms
    I0325 18:13:42.533260 10253 measure.h:80] Timer end                executePlanWithGroupBy              executePlanWithGroupBy: 2312 elapsed 1544 ms
    I0325 18:13:42.533289 10253 measure.h:80] Timer end                execution_dispatch_run                          operator(): 1186 elapsed 2037 ms
    I0325 18:13:42.546689 10255 measure.h:80] Timer end                          lauchCpuCode                       launchCpuCode:  529 elapsed 1540 ms
    I0325 18:13:42.546702 10255 measure.h:80] Timer end                executePlanWithGroupBy              executePlanWithGroupBy: 2312 elapsed 1540 ms
    I0325 18:13:42.546716 10255 measure.h:80] Timer end                execution_dispatch_run                          operator(): 1186 elapsed 2050 ms
    I0325 18:13:42.561115 10254 measure.h:80] Timer end                          lauchCpuCode                       launchCpuCode:  529 elapsed 1538 ms
    I0325 18:13:42.561126 10254 measure.h:80] Timer end                executePlanWithGroupBy              executePlanWithGroupBy: 2312 elapsed 1538 ms
    I0325 18:13:42.561139 10254 measure.h:80] Timer end                execution_dispatch_run                          operator(): 1186 elapsed 2064 ms
    I0325 18:13:42.561271  9177 measure.h:80] Timer end                  Exec_executeWorkUnit                 executeWorkUnitImpl: 1123 elapsed 2086 ms
    I0325 18:13:42.561300  9177 measure.h:80] Timer end                       executeWorkUnit                     executeWorkUnit: 1758 elapsed 2087 ms


    ------------------------------
    Candido Dessanti
    Dba
    Crismatica consulting
    Rome
    ------------------------------



  • 11.  RE: Query Elapsed Time does not increase linearly.

    Posted 26 days ago
    Edited by Jeon Woohyuck 26 days ago
    Oh, I'm sorry. I accidentally uploaded the wrong file. I have uploaded it again. Please check it one more time.
    (edit) File name is wrong '1G' -> 'Q1'. ^^

    ------------------------------
    Jeon Woohyuck
    ------------------------------

    Attachment(s)

    txt
    TPC_16 1G.txt   27K 1 version
    txt
    TPC_32 1G.txt   27K 1 version


  • 12.  RE: Query Elapsed Time does not increase linearly.

    Posted 25 days ago
    Hi @Jeon Woohyuck,

    You can try enabling verbose logging on your OmniSci instance to verify that the query strategy is the same on both tables, and check how many records were loaded into each table and their quality. Can you share the DDL of both tables?

    I'm asking for this information because the 400 ms for the 16 GB sizing seems a bit low compared to the 8 GB result and to what I'm getting here.

    ------------------------------
    Candido Dessanti
    Dba
    Crismatica consulting
    Rome
    ------------------------------



  • 13.  RE: Query Elapsed Time does not increase linearly.

    Posted 25 days ago
    We used the same dss.ddl file, and we created new 8G, 16G, and 32G data to put it in, but the results are the same...
    Is my computer environment a problem? Hmm...
    (GPU mode) 8G -> 16G -> 32G



    ------------------------------
    Jeon Woohyuck
    ------------------------------



  • 14.  RE: Query Elapsed Time does not increase linearly.

    Posted 21 days ago
    Hi @Jeon Woohyuck,

    I'm not saying you have a problem in your environment; I asked to be sure we are running the tests with comparable DDLs/data.
    I created a TPCH-8 table, and I'm getting inconsistent query response times too. Could you try clearing the GPU caches before running the queries?


    ------------------------------
    Candido Dessanti
    Dba
    Crismatica consulting
    Rome
    ------------------------------



  • 15.  RE: Query Elapsed Time does not increase linearly.

    Posted 20 days ago
    The first measured time is the cold-cache time, and the times measured after that are in the warm-cache state. After that, I shut down the server, started it again, and then measured again. So I think the GPU cache is empty, except when I run queries consecutively.

    ------------------------------
    Jeon Woohyuck
    ------------------------------



  • 16.  RE: Query Elapsed Time does not increase linearly.

    Posted 19 days ago
    hi @Jeon Woohyuck,

    I'm getting different results between different archs and omnisci parameters.

    Setting the instance to

    disable-multifrag=false
    allow-cpu-retry=false

    I'm getting stable results with a 1080 Ti, but the response times are quite slow, as you can see from the files I've attached to the post.

    With disable-multifrag set to true I'm getting mixed results, because the response time of the query varies depending on where in memory the data is located.

    When I increase the cardinality of the query, the response time is more stable between sizes, so there is a problem with this kind of low-cardinality query that uses global memory:

    SELECT l_returnflag,l_linestatus,extract(year from L_SHIPDATE), sum(l_quantity) AS sum_qty, sum(l_extendedprice) AS sum_base_price, sum(l_extendedprice*(1-l_discount)) AS sum_disc_price, sum(l_extendedprice*(1-l_discount)*(1+l_tax)) AS sum_charge, avg(l_quantity) AS avg_qty, avg(l_extendedprice) AS avg_price, avg(l_discount) AS avg_disc, count(*) AS count_order FROM lineitem WHERE l_shipdate <= DATE '1998-12-01' - INTERVAL '116' day (3) GROUP BY l_returnflag, l_linestatus,extract(year from L_SHIPDATE) ORDER BY l_returnflag,l_linestatus;

    Attachment(s)



  • 17.  RE: Query Elapsed Time does not increase linearly.

    Posted 19 days ago
    Okay. With --disable-multifrag=false, the time seems to scale more linearly.

    It's a slightly different question, but we ran 'nvprof' in the same environment with no options applied, and there was an interesting result.
    Up to 16G, speed-related metrics such as cache rate and device memory read/write throughput were constant, but they almost halved when we moved to 32G.
    Does this result also relate to the '--disable-multifrag' option?

    Below are the nvprof measurement results for Query 1 at 16G and 32G. The kernel I observed is multifrag_query_hoisted_literals.
    (nvprof measured more metrics, but I'll show only the speed-related ones.)






  • 18.  RE: Query Elapsed Time does not increase linearly.

    Posted 18 days ago
    Edited by Candido Dessanti 18 days ago
    Hi @Jeon Woohyuck

    Like you, I tried to figure out the reason for this strange behavior by profiling the database, but I didn't find any differences between the runs, except that the runtime increases and the memory throughput is lower.
    I assumed the problem was caused by heavy contention on atomic operations, so I lowered the number of threads allocated to the kernel, with little luck, so my assumption was wrong. Something is wrong, but I can't figure out what right now.

    Luckily this behavior seems to be limited to queries that group by multiple keys with very low cardinality. In my experience the response time of star-schema queries on a single GPU is linear, but maybe I haven't tried hard enough. That said, for such queries it would be better to place the group-by buffers in shared memory to reduce contention on global memory; nothing is perfect after all, and there is room for improvement.

    Joining a fact table with a big table, like lineitem and orders, without filtering (you need to enable the pushdown filter parameter) could lead to non-linear response times too.
    The problem arises when the right table has more than one fragment. You can address it by creating the table with a fragment size bigger than the number of records and/or by using sharding with multiple GPUs.


    E.g., query 12 of TPC-H at the 32G sizing (this works well with every query that has an equijoin between the lineitem and orders tables):

    omnisql> SELECT l_shipmode, sum(case when o_orderpriority = '1-URGENT' OR o_orderpriority = '2-HIGH' then 1 else 0 end) as high_line_count, sum(case when o_orderpriority <> '1-URGENT' AND o_orderpriority <> '2-HIGH' then 1 else 0 end) AS low_line_count FROM orders_s, lineitem WHERE o_orderkey = l_orderkey AND l_shipmode in ('MAIL', 'SHIP') AND l_commitdate < l_receiptdate AND l_shipdate < l_commitdate AND l_receiptdate >= date '1994-01-01' AND l_receiptdate < date '1994-01-01' + interval '1' year GROUP BY l_shipmode ORDER BY l_shipmode;
    l_shipmode|high_line_count|low_line_count
    MAIL|198794|298385
    SHIP|199365|299061
    2 rows returned.
    Execution time: 326 ms, Total time: 327 ms

    omnisql> SELECT l_shipmode, sum(case when o_orderpriority = '1-URGENT' OR o_orderpriority = '2-HIGH' then 1 else 0 end) as high_line_count, sum(case when o_orderpriority <> '1-URGENT' AND o_orderpriority <> '2-HIGH' then 1 else 0 end) AS low_line_count FROM orders, lineitem WHERE o_orderkey = l_orderkey AND l_shipmode in ('MAIL', 'SHIP') AND l_commitdate < l_receiptdate AND l_shipdate < l_commitdate AND l_receiptdate >= date '1994-01-01' AND l_receiptdate < date '1994-01-01' + interval '1' year GROUP BY l_shipmode ORDER BY l_shipmode;
    l_shipmode|high_line_count|low_line_count
    MAIL|198794|298385
    SHIP|199365|299061
    2 rows returned.
    Execution time: 778 ms, Total time: 794 ms

    The first query uses an orders table created with a fragment size of 64M, while the second one uses the default (32M). The table has 48M records.
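
The difference between the two orders tables comes down to how many fragments each one has. This is a minimal Python sketch, assuming that "M" means millions of rows and that a table is split into ceil(rows / fragment_size) fragments; `fragment_count` is a hypothetical helper, not an OmniSci function.

```python
def fragment_count(rows, fragment_size):
    """Number of fragments needed to hold `rows` rows (ceiling division)."""
    if rows <= 0:
        return 0
    return -(-rows // fragment_size)


ORDERS_ROWS = 48_000_000       # "The table has 48M records"
DEFAULT_FRAGMENT = 32_000_000  # default fragment size quoted in this post
LARGE_FRAGMENT = 64_000_000    # fragment size used for orders_s

print("default:", fragment_count(ORDERS_ROWS, DEFAULT_FRAGMENT), "fragments")
print("64M:    ", fragment_count(ORDERS_ROWS, LARGE_FRAGMENT), "fragment")
```

Under these assumptions the default table spans two fragments and the 64M table a single one, which matches the condition described above for the slower join.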

    Hope this helps.