
Confusing CPU/GPU choice depending on platform

  • 1.  Confusing CPU/GPU choice depending on platform

    Posted 02-18-2019 07:15

    I run TPC-H benchmarks on different platforms to compare execution times and I find the results very confusing.

    On a P100 server, the DBMS uses GPU and takes 22,000 ms to perform query 1.
    On a K80 server, the DBMS uses CPU and takes 200 ms to perform query 1.

    It’s the same data on both platforms, the same query, and the same MapD Server version (4.4.2-20190109-6fb58bf441) running in a docker container in both cases, and I activated GPU mode with \gpu on both.

    The memory summaries are:

    P100:

    MapD Server CPU Memory Summary:
                MAX            USE      ALLOCATED           FREE
       619159.44 MB      609.06 MB     4096.00 MB     3486.94 MB
    
    MapD Server GPU Memory Summary:
    [GPU]            MAX            USE      ALLOCATED           FREE
      [0]    15676.04 MB      632.56 MB     2048.00 MB     1415.44 MB
    

    K80:

    MapD Server CPU Memory Summary:
                MAX            USE      ALLOCATED           FREE
        51439.73 MB      609.06 MB     4096.00 MB     3486.94 MB
    
    MapD Server GPU Memory Summary:
    [GPU]            MAX            USE      ALLOCATED           FREE
      [0]    10836.35 MB      562.58 MB     2048.00 MB     1485.42 MB
    

    where the GPU RAM usage on K80 comes from another query.

    I can confirm via nvidia-smi that the P100 uses the GPU and the K80 doesn’t. Also, EXPLAIN says “IR for the CPU” on the K80 and “IR for the GPU” on the P100.

    Does anybody have an idea why the behaviour differs like this?

    Thank you very much!



  • 2.  RE: Confusing CPU/GPU choice depending on platform

    Posted 02-18-2019 11:05

    Hi @perdelt,

    In the 22,000 ms run, it seems the omnisci server was cold-started and the system read the data from disk (no data in the filesystem cache, so these are real reads from quite slow disks).

    to reproduce:

    # drop the filesystem cache (the redirect itself needs root, hence sh -c)
    sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
    sudo service mapd_server start
    
    from mapdql:
    SELECT l_returnflag, l_linestatus, sum(l_quantity) AS sum_qty, sum(l_extendedprice) AS sum_base_price, sum(l_extendedprice*(1-l_discount)) AS sum_disc_price, sum(l_extendedprice*(1-l_discount)*(1+l_tax)) AS sum_charge, avg(l_quantity) AS avg_qty, avg(l_extendedprice) AS avg_price, avg(l_discount) AS avg_disc, count(*) AS count_order FROM lineitem WHERE l_shipdate <= DATE '1998-12-01' - INTERVAL '90' DAY (3)  GROUP BY l_returnflag, l_linestatus ORDER BY l_returnflag, l_linestatus;
    4 rows returned.
    Execution time: 4069 ms, Total time: 4069 ms
    
    Changing the query slightly (not strictly necessary, because the omnisci server doesn't have a result cache):
    SELECT l_returnflag, l_linestatus, sum(l_quantity) AS sum_qty, sum(l_extendedprice) AS sum_base_price, sum(l_extendedprice*(1-l_discount)) AS sum_disc_price, sum(l_extendedprice*(1-l_discount)*(1+l_tax)) AS sum_charge, avg(l_quantity) AS avg_qty, avg(l_extendedprice) AS avg_price, avg(l_discount) AS avg_disc, count(*) AS count_order FROM lineitem WHERE l_shipdate <= DATE '1997-12-01' - INTERVAL '90' DAY (3)  GROUP BY l_returnflag, l_linestatus ORDER BY l_returnflag, l_linestatus;
    4 rows returned.
    Execution time: 417 ms, Total time: 418 ms
    

    The difference of roughly 3600 ms comes from the disk reads, the time spent populating the caches in system/GPU RAM, and the generation/compilation of the query plan.

    Restarting the server using another GPU, without dropping the filesystem cache, the results are quite different:

    1st query
    4 rows returned.
    Execution time: 1763 ms, Total time: 1764 ms
    2nd query
    4 rows returned.
    Execution time: 360 ms, Total time: 360 ms
    

    It’s likely that your K80 machine used the two GPUs on the card, with the GPU caches already populated, because with 30M of records (which I deduced from the memory summary) the query on CPU would use just one core.

    Anyway, if you want to be sure the P100 GPU is running the queries, check with the nvidia-smi tool using the switch -lms 100 (it will take a sample of the GPU every 100 ms).
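    As a sketch, one way to watch utilization while the query runs (the --query-gpu fields chosen here are just one possible selection):

```shell
# Sample the GPU every 100 ms while the query runs; stop with Ctrl-C.
# The selected fields are only an example of what can be watched.
nvidia-smi -lms 100 --query-gpu=utilization.gpu,memory.used --format=csv
```

    A query that actually executes on the GPU should show a burst of utilization.gpu well above 0 % for its duration.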



  • 3.  RE: Confusing CPU/GPU choice depending on platform

    Posted 02-18-2019 11:48

    Hi @aznable

    thank you, these are very good hints!

    I can confirm that it is not a cold start problem. If I rerun the query 100 times, it is still the same picture. Also if I restart the service the behaviour stays the same.

    Also, EXPLAIN still says “IR for the CPU” on the K80 and “IR for the GPU” on the P100 (and the P100 is dramatically slower).

    This is query specific. TPC-H Q5 for example shows the expected behaviour (GPU used, P100 faster than K80).

    I run the benchmarks on two different machines, each having 8 GPU cards (P100 and K80 respectively), but I allow the docker container to see only one of them. The P100 machine uses docker directly; the K80 machine uses kubernetes/docker.
    nvidia-smi shows only 1 GPU in both cases, so apparently this works.
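    For comparison, a minimal sketch of restricting a container to a single GPU with nvidia-docker2; the image name and port mapping below are hypothetical placeholders, not necessarily the setup used here:

```shell
# Expose only GPU 0 to the container via the NVIDIA container runtime.
# "mapd/image" and the port mapping are hypothetical placeholders.
docker run --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=0 \
  -p 6274:6274 \
  mapd/image
```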

    Any other hints on what might influence the decision between GPU and CPU?

    Thank you!



  • 4.  RE: Confusing CPU/GPU choice depending on platform

    Posted 02-18-2019 14:00

    Hi @perdelt,

    To be sure where the code is running, you can enable the enable-debug-timer parameter by setting it to true in your mapd.conf file.

    You will then see a launchGpuCode or launchCpuCode entry in your mapd_server.INFO log file, located in the MAPD_STORAGE/mapd_log directory.
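    As a sketch, assuming the default layout (MAPD_STORAGE stands for your storage directory), the two steps look like this:

```shell
# 1) In mapd.conf, add the following line and restart the server:
#      enable-debug-timer = true
# 2) Then filter the launcher entries from the log:
grep -E 'GpuCode|CpuCode' "$MAPD_STORAGE"/mapd_log/mapd_server.INFO
```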

    GPU run

    I0218 17:13:26.407959 39993 measure.h:73] Timer start              executePlanWithGroupBy              executePlanWithGroupBy: 2092
    I0218 17:13:26.407977 39993 measure.h:73] Timer start                        lauchGpuCode                       launchGpuCode:   97
    I0218 17:13:26.848518 39993 measure.h:80] Timer end                          lauchGpuCode                       launchGpuCode:   97 elapsed 440 ms
    I0218 17:13:26.848562 39993 measure.h:80] Timer end                executePlanWithGroupBy              executePlanWithGroupBy: 2092 elapsed 440 ms
    I0218 17:13:26.848573 39993 measure.h:80] Timer end                execution_dispatch_run                          operator(): 1056 elapsed 440 ms
    I0218 17:13:26.848807 39992 measure.h:80] Timer end                  Exec_executeWorkUnit                     executeWorkUnit:  986 elapsed 443 ms
    I0218 17:13:26.848857 39992 measure.h:80] Timer end                       executeWorkUnit                     executeWorkUnit: 1621 elapsed 443 ms
    I0218 17:13:26.848901 39992 measure.h:80] Timer end                     executeRelAlgStep                   executeRelAlgStep:  395 elapsed 443 ms
    I0218 17:13:26.848917 39992 measure.h:80] Timer end                      executeRelAlgSeq                    executeRelAlgSeq:  359 elapsed 443 ms
    I0218 17:13:26.848953 39992 measure.h:80] Timer end             executeRelAlgQueryNoRetry           executeRelAlgQueryNoRetry:  138 elapsed 444 ms
    I0218 17:13:26.849052 39992 measure.h:80] Timer end                    executeRelAlgQuery                  executeRelAlgQuery:  119 elapsed 444 ms
    I0218 17:13:26.849072 39992 measure.h:73] Timer start                        convert_rows                        convert_rows: 4143
    I0218 17:13:26.849143 39992 measure.h:80] Timer end                          convert_rows                        convert_rows: 4143 elapsed 0 ms
    I0218 17:13:26.849170 39992 measure.h:80] Timer end                       execute_rel_alg                     execute_rel_alg: 3935 elapsed 444 ms
    I0218 17:13:26.849186 39992 MapDHandler.cpp:767] sql_execute-COMPLETED Total: 472 (ms), Execution: 471 (ms)
    

    CPU Run
    I0218 17:28:11.943572 39992 MapDHandler.cpp:3918] User mapd sets CPU mode.

    I0218 17:28:13.239140 40011 measure.h:73] Timer start              executePlanWithGroupBy              executePlanWithGroupBy: 2092
    I0218 17:28:13.239159 40012 measure.h:73] Timer start              executePlanWithGroupBy              executePlanWithGroupBy: 2092
    I0218 17:28:13.239161 40011 measure.h:73] Timer start                        lauchCpuCode                       launchCpuCode:  432
    I0218 17:28:13.239183 40012 measure.h:73] Timer start                        lauchCpuCode                       launchCpuCode:  432
    I0218 17:28:15.063004 40012 measure.h:80] Timer end                          lauchCpuCode                       launchCpuCode:  432 elapsed 1823 ms
    I0218 17:28:15.063060 40012 measure.h:80] Timer end                executePlanWithGroupBy              executePlanWithGroupBy: 2092 elapsed 1823 ms
    I0218 17:28:15.063072 40012 measure.h:80] Timer end                execution_dispatch_run                          operator(): 1056 elapsed 1824 ms
    I0218 17:28:15.329355 40011 measure.h:80] Timer end                          lauchCpuCode                       launchCpuCode:  432 elapsed 2090 ms
    I0218 17:28:15.329416 40011 measure.h:80] Timer end                executePlanWithGroupBy              executePlanWithGroupBy: 2092 elapsed 2090 ms
    I0218 17:28:15.329430 40011 measure.h:80] Timer end                execution_dispatch_run                          operator(): 1056 elapsed 2090 ms
    I0218 17:28:15.329785 39992 measure.h:80] Timer end                  Exec_executeWorkUnit                     executeWorkUnit:  986 elapsed 2128 ms
    I0218 17:28:15.329826 39992 measure.h:80] Timer end                       executeWorkUnit                     executeWorkUnit: 1621 elapsed 2128 ms
    I0218 17:28:15.329859 39992 measure.h:80] Timer end                     executeRelAlgStep                   executeRelAlgStep:  395 elapsed 2128 ms
    I0218 17:28:15.329866 39992 measure.h:80] Timer end                      executeRelAlgSeq                    executeRelAlgSeq:  359 elapsed 2128 ms
    I0218 17:28:15.329900 39992 measure.h:80] Timer end             executeRelAlgQueryNoRetry           executeRelAlgQueryNoRetry:  138 elapsed 2128 ms
    I0218 17:28:15.329915 39992 measure.h:80] Timer end                    executeRelAlgQuery                  executeRelAlgQuery:  119 elapsed 2128 ms
    I0218 17:28:15.329926 39992 measure.h:73] Timer start                        convert_rows                        convert_rows: 4143
    I0218 17:28:15.330009 39992 measure.h:80] Timer end                          convert_rows                        convert_rows: 4143 elapsed 0 ms
    I0218 17:28:15.330036 39992 measure.h:80] Timer end                       execute_rel_alg                     execute_rel_alg: 3935 elapsed 2128 ms
    I0218 17:28:15.330050 39992 MapDHandler.cpp:767] sql_execute-COMPLETED Total: 2157 (ms), Execution: 2156 (ms)
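    As a quick sanity check, the launcher lines can be classified mechanically; this sketch uses a single sample line copied from the GPU run above:

```shell
# Write one sample debug-timer line to a temp file, then classify it by
# which kernel launcher it mentions (GpuCode vs CpuCode).
log=$(mktemp)
cat > "$log" <<'EOF'
I0218 17:13:26.407977 39993 measure.h:73] Timer start lauchGpuCode launchGpuCode: 97
EOF
if grep -q 'GpuCode' "$log"; then
  mode=GPU
else
  mode=CPU
fi
rm -f "$log"
echo "query ran on $mode"
```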
    

    About your problem with the Tesla P100: I can’t reproduce it because I don’t have one, but the second-card run in my previous message was on a GP102 chip, which is quite similar to the GP100.

    If you want to know more about the meaning of the debug timer log, take a look at this topic



  • 5.  RE: Confusing CPU/GPU choice depending on platform

    Posted 30 days ago
    Edited by Patrick Erdelt 30 days ago
    Hi @dwayneberry, hi @aznable,

    I have now been able to perform the same test on an AWS p3.2xlarge.
    GPU: V100
    Driver Version: 410.79

    It gives the same picture: TPC-H Q1 takes about 70,000 ms on the first run and then consistently around 23,000 ms.

    I also tried
    GPU: P100
    Driver: Driver Version: 410.48
    OmniSci Server Version: 4.5.0-20190221-e41be43ff0
    from here: https://github.com/omnisci/mapd-core/tree/master/docker
    It shows the same picture: TPC-H Q1 ~23,000 ms.

    I think it could be
    • a driver problem
    • a docker (image) problem
    • something I do wrong and haven't recognized yet
    I will keep investigating.

    Thank you!

    ------------------------------
    perdelt
    ------------------------------



  • 6.  RE: Confusing CPU/GPU choice depending on platform

    Posted 29 days ago
    If I have to guess, the problem comes from the docker image or docker configuration, but I'm an old-school guy who likes to touch the iron, so I'm wary of VMs, containers and stuff like that 😉

    ------------------------------
    Candido Dessanti
    Dba
    Crismatica consulting
    Rome
    ------------------------------



  • 7.  RE: Confusing CPU/GPU choice depending on platform

    Posted 28 days ago
    Edited by dwayneberry 28 days ago

    Hi @Patrick Erdelt

    I ran a test of TPC-H on a K80 machine here, and this is what I see in the explain:

    omnisql> explain select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price, sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*) as count_order from lineitem where l_shipdate <= date '1998-12-01' - interval '90' day (3) group by l_returnflag, l_linestatus order by l_returnflag, l_linestatus;
    Explanation
    IR for the GPU:
    ===============
    
    ; Function Attrs: uwtable
    define void @query_group_by_template(i8** nocapture readnone %byte_stream, i8* nocapture readonly %literals, i64* nocapture readnone %row_count_ptr, i64* nocapture readonly %frag_row_off_ptr, i32* %max_matched_ptr, i64* %agg_init_val, i64** %group_by_buffers, i32 %frag_idx, i64* %join_hash_tables, i32* %total_matched, i32* %error_code) #20 {
    .entry:
    %0 = getelementptr i8*, i8** %byte_stream, i32 0


    My code is being generated for the GPU.

    What scale factor of data are you running with?

    Would it be possible for you to share the omnisci_server INFO log, so we can see if anything is unusual or untoward there?

    Regards



  • 8.  RE: Confusing CPU/GPU choice depending on platform

    Posted 21 days ago
    Hi @dwayneberry,

    the scale factor is 1.

    I found these differences in the log files:

    P100:
    I0225 07:38:57.492754 9 MapDHandler.cpp:215] Started in GPU mode
    I0225 07:38:57.588497 9 EglPlatform.cpp:187] EGL Version: 1.5.
    I0225 07:38:57.588600 9 EglPlatform.cpp:189] OpenGL Version: 4.6.
    I0225 07:38:57.591351 9 DriverInstance.cpp:58] Using GfxDriver: OpenGL.
    I0225 07:38:57.591478 9 EglDisplayManager.cpp:71] Initialization call returned successfully.
    I0225 07:38:57.591502 9 EglWindow.cpp:151] EGL Setting: BITS_RGBA = 8.
    I0225 07:38:57.591511 9 EglWindow.cpp:175] EGL Setting: BITS_ALPHA = 8.
    I0225 07:38:57.642119 9 QueryRenderManager.cpp:305] QueryRenderManager initialized for rendering...
    I0225 07:38:57.642257 9 QueryRenderManager.cpp:306] Start GPU 0
    I0225 07:38:57.642345 9 QueryRenderManager.cpp:307] Num GPUs -1
    I0225 07:38:57.642376 9 QueryRenderManager.cpp:308] Render Cache Limit 500
    I0225 07:38:57.642390 9 QueryRenderManager.cpp:309] Render Mem (bytes) 500000000
    I0225 07:38:57.642415 9 QueryRenderManager.cpp:310] Poly Cache (bytes) 300000000

    K80:
    I0225 08:00:41.770045 24241 MapDHandler.cpp:215] Started in GPU mode
    I0225 08:00:41.838930 24241 Catalog.cpp:663] Successfully migrated db access privileges
    E0225 08:00:41.845654 24241 EglPlatform.cpp:59] EGL_EXT_device_query extension is not supported.
    I0225 08:00:41.846305 24272 RenderCmdQueue.cpp:145] Render thread exited normally
    E0225 08:00:41.846421 24241 MapDHandler.cpp:236] Backend rendering disabled: /home/jenkins-slave/workspace/mapd2-multi/compiler/gcc/gpu/cuda/host/centos/render/render/GfxDriver/Drivers/GL/egl/EglPlatform.cpp:59 EGL_EXT_device_query extension is not supported.
    I0225 08:00:41.846485 24241 MapDServer.cpp:949] MapD server using unencrypted connection

    Best regards

    ------------------------------
    Patrick Erdelt
    ------------------------------