
How to effectively use multi GPU for data warehouses?

  • 1.  How to effectively use multi GPU for data warehouses?

    Posted 03-04-2019 06:26
    Data warehouses typically involve combined keys (star schema, ...) and a lot of JOINs.

    Documentation says "A shard key must specify a single column to shard on. There is no support for sharding by a combination of keys."

    Obviously one option is to combine all tables into a single very redundant table.

    Is there another option?
    #Core

    ------------------------------
    Patrick Erdelt
    ------------------------------


  • 2.  RE: How to effectively use multi GPU for data warehouses?

    Posted 03-04-2019 08:13
    Edited by Candido Dessanti 03-04-2019 08:55
    Hi @Patrick Erdelt,

    You are right, but sharding is really only necessary if you have a dimension with a significant number of records (or for other reasons, like query performance on relatively small tables when you have multiple GPUs).
    Radical denormalization of an entire star schema should, in my opinion, be avoided whenever possible because of the increased disk and memory consumption, even on columnar databases. As a first step, you can consider milder denormalization strategies, like combining some dimensions into one, or merging a dimension into the fact table (or both if necessary).

    P.S. I'm referring to a Kimball star schema / DWH

    ------------------------------
    Candido Dessanti
    Dba
    consulting
    Rome
    ------------------------------



  • 3.  RE: How to effectively use multi GPU for data warehouses?

    Posted 03-19-2019 10:43
    Edited by Patrick Erdelt 03-19-2019 12:33
    Hi @Candido Dessanti

    thank you for your reply!

    As I understand, it is not sensible (or possible) to shard two dimensions, so I would shard one dimension (the biggest one) of a star schema:

    CREATE TABLE dim1
    (
    dim1_id BIGINT ENCODING FIXED(32),
    name CHAR(25),
    SHARD KEY (dim1_id)
    ) WITH (shard_count = 4, partitions='SHARDED');

    CREATE TABLE dim2
    (
    dim2_id BIGINT ENCODING FIXED(32),
    name CHAR(25)
    ) WITH (partitions='REPLICATED');

    CREATE TABLE facts
    (
    dim1_id BIGINT ENCODING FIXED(32),
    dim2_id BIGINT ENCODING FIXED(32),
    fact INTEGER,
    SHARD KEY (dim1_id)
    ) WITH (shard_count = 4, partitions='SHARDED');
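    As a toy illustration of the colocation this schema aims for (plain Python, with modulo placement standing in for the engine's real hash function, which is an assumption):

    ```python
    # Toy model of shard colocation: rows of the fact table and of the
    # sharded dimension that share a dim1_id land on the same shard, so
    # each GPU can join its shard locally with no cross-GPU traffic.
    # (Modulo placement is a stand-in for the engine's actual hash.)
    SHARD_COUNT = 4

    def shard_of(key: int) -> int:
        return key % SHARD_COUNT

    dim1 = [(i, f"name{i}") for i in range(10)]       # (dim1_id, name)
    facts = [(i % 10, i % 3, i) for i in range(100)]  # (dim1_id, dim2_id, fact)

    # Distribute rows shard by shard.
    dim1_shards = [[r for r in dim1 if shard_of(r[0]) == s] for s in range(SHARD_COUNT)]
    fact_shards = [[r for r in facts if shard_of(r[0]) == s] for s in range(SHARD_COUNT)]

    # Every fact row finds its dim1 partner on its own shard.
    for s in range(SHARD_COUNT):
        local_keys = {r[0] for r in dim1_shards[s]}
        assert all(f[0] in local_keys for f in fact_shards[s])
    ```

    The replicated dim2 needs no shard key at all: a full copy sits on every GPU, so it joins locally either way.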

    Is this expected to speed up suitable queries by factor 4, if I have 4 GPUs and everything fits into GPU memory?


    Best regards

    ------------------------------
    Patrick Erdelt
    ------------------------------



  • 4.  RE: How to effectively use multi GPU for data warehouses?

    Posted 03-20-2019 01:51
    Hi @Patrick Erdelt,

    The easy answer is yes, because the software is designed to run queries in an MPP fashion on a single GPU, so on a system with multiple GPUs the GPU-related tasks scale linearly; but consider that an additional reduction step is needed to get the final result of the query.
    Everything else, like parsing, memory management and projection, won't benefit from the additional GPUs, so the results can vary depending on the query complexity and the cardinality of the results.
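    That scaling argument is essentially Amdahl's law; a quick sketch (the parallel fraction used below is an illustrative assumption, not a measured figure):

    ```python
    # Amdahl-style estimate: only the GPU-parallel fraction p of the query
    # time scales with the number of GPUs; parsing, memory management and
    # projection stay serial. The 60/40 split below is invented for
    # illustration, not measured on any real workload.
    def speedup(p: float, n_gpus: int) -> float:
        return 1.0 / ((1.0 - p) + p / n_gpus)

    assert speedup(1.0, 4) == 4.0             # fully parallel query: ideal 4x
    assert round(speedup(0.6, 4), 2) == 1.82  # 40% serial work caps the gain
    ```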
    Have you fixed the problems you got using containers? ;)




    ------------------------------
    Candido Dessanti
    Dba
    Crismatica consulting
    Rome
    ------------------------------



  • 5.  RE: How to effectively use multi GPU for data warehouses?

    Posted 03-20-2019 07:21
    Hi @Candido Dessanti

    have you tried sharding successfully? What kind of queries profit most?

    I face some strange issues even in the simple case of sharding a single table using its primary key.

    How would you shard the TPC-H dataset?

    Thank you for asking about containers. In fact I found some more issues. I will post these in a new discussion when I am more confident about what is going wrong...



    ------------------------------
    Patrick Erdelt
    ------------------------------



  • 6.  RE: How to effectively use multi GPU for data warehouses?

    Posted 03-21-2019 09:05
    Hi @Patrick Erdelt,

    I tried sharding only briefly, for a problem that turned out to be impossible to solve, so I have to reply with a big no.

    The queries that could get a significant speedup by using sharding are:

    1) queries where two big tables are joined together, as stated in the docs, because you save a lot of traffic on the PCIe bus and the join is executed just once on each card;
    2) queries on tables with an unbalanced number of fragments, e.g. a table with 2 fragments on a system with 4 GPUs; if the queries use taxing filtering or calculations, you can use sharding to balance the data across every available GPU.

    It's useless, and probably harmful, on queries that don't scale with multiple GPUs, like high-cardinality ones, because the final reduction has to be done on a single GPU.
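    A toy sketch of that two-phase execution (pure Python standing in for what the engine does on device):

    ```python
    from collections import Counter

    # Each GPU computes a partial group-by over its own shard; one device
    # then merges the partials into the final result. With low-cardinality
    # keys the merge is cheap; with millions of distinct groups the final
    # reduction dominates, which is why high-cardinality queries don't
    # scale with extra GPUs.
    def partial_group_by(rows):
        counts = Counter()
        for key, value in rows:
            counts[key] += value
        return counts

    shards = [
        [("RUSSIA", 10), ("FRANCE", 5)],   # rows on GPU 0
        [("RUSSIA", 7), ("GERMANY", 3)],   # rows on GPU 1
    ]
    partials = [partial_group_by(s) for s in shards]  # runs in parallel

    final = Counter()
    for p in partials:        # the final reduction runs on one GPU only
        final.update(p)

    assert final["RUSSIA"] == 17
    ```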

    About TPC-H: I don't like how it is designed, because it uses an overly normalized table design and level 1 SCDs (that's a joke), but it's obvious you have to shard the lineitem and orders tables on the order key.

    Could you tell us which problems you are facing with sharding?





    ------------------------------
    Candido Dessanti
    Dba
    Crismatica consulting
    Rome
    ------------------------------



  • 7.  RE: How to effectively use multi GPU for data warehouses?

    Posted 03-22-2019 04:41
    Edited by Patrick Erdelt 30 days ago
    Hi @Candido Dessanti

    I see the following results in TPC-H, scaling factor = 1, Q5, i.e. query:
    select
    n_name,
    sum(l_extendedprice * (1 - l_discount)) as revenue
    from
    customer,
    orders,
    lineitem,
    supplier,
    nation,
    region
    where
    c_custkey = o_custkey
    and l_orderkey = o_orderkey
    and l_suppkey = s_suppkey
    and c_nationkey = s_nationkey
    and s_nationkey = n_nationkey
    and n_regionkey = r_regionkey
    and r_name = 'EUROPE'
    and o_orderdate >= date '1993-01-01'
    and o_orderdate < date '1993-01-01' + interval '1' year
    group by
    n_name
    order by
    revenue desc

    8xK80 - 8 shards
    n_name|revenue
    RUSSIA|28192512.1223
    ROMANIA|27742341.30130001
    FRANCE|27551487.674
    UNITED KINGDOM|26663225.03280001
    GERMANY|25032671.0815

    1xK80 - 1 shard
    n_name|revenue
    RUSSIA|56715418.56220009
    FRANCE|54436355.26089995
    GERMANY|53161745.64450005
    ROMANIA|52712944.00339993
    UNITED KINGDOM|52569055.94930004

    In both cases I use the same data and the same table schemas, except for the sharding. In fact I sharded lineitem and orders on the order key, so that was my best guess too. I can also verify that the number of imported rows is equal in both settings.

    The second try was with excerpts of the GDELT data set. That is one table, sharded by a single-column primary key.

    1xK80 and 4xK80 show almost the same run times. I tried a 3.7GB and a 24GB excerpt and used basic OLAP queries, count, sum and cascading group bys.
    3.7GB: Same result sets for all queries
    24GB: Same result sets for simple queries, different result sets for complex queries

    So maybe
    • there's a bug
    • sharding is weird
    • it's a container problem
    • something about the schemas (data types?)
    • something about the queries
    • something else on my side



    ------------------------------
    Patrick Erdelt
    ------------------------------



  • 8.  RE: How to effectively use multi GPU for data warehouses?

    Posted 30 days ago
    Hi  @Patrick Erdelt,

    I cannot reproduce this right now because I ran out of usable instances with multiple K80s on AWS; I'll try on the weekend.
    About the performance: it depends on the query and the data; sharing the queries you are using on the dataset would help.


    ------------------------------
    Candido Dessanti
    Dba
    Crismatica consulting
    Rome
    ------------------------------



  • 9.  RE: How to effectively use multi GPU for data warehouses?

    Posted 30 days ago
    Edited by Patrick Erdelt 30 days ago
    Hi @Candido Dessanti

    same result sets for no shards on 1xK80 and 4 shards on 4xK80:
    select MonthYear, Actor1CountryCode, count(*) c, SUM(NumMentions) m from gdelt group by MonthYear, Actor1CountryCode

    different result sets:
    select MonthYear, Actor1CountryCode, Actor2CountryCode, count(*) c, SUM(NumMentions) m from gdelt group by MonthYear, Actor1CountryCode, Actor2CountryCode

    PS: If I put select count(*) from () around the queries I obtain the same number of rows, so it's possibly just the ordering.
    Ordering itself is not possible, because of Exception: Sorting the result would be too slow

    PPS: It's not just the ordering. I fetch the result sets in Python, so I can verify they are really different.
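    For what it's worth, an order-insensitive comparison like this (treating each result set as a multiset of rows) separates "different ordering" from "genuinely different results":

    ```python
    from collections import Counter

    # Compare two result sets as multisets of rows: ordering differences
    # disappear, while real differences (missing, extra or changed rows)
    # remain visible.
    def same_results(rows_a, rows_b) -> bool:
        return Counter(map(tuple, rows_a)) == Counter(map(tuple, rows_b))

    a = [(201801, "USA", 5), (201801, "RUS", 3)]
    b = [(201801, "RUS", 3), (201801, "USA", 5)]   # same rows, other order
    assert same_results(a, b)
    assert not same_results(a, b + [(201802, "GER", 1)])
    ```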

    ------------------------------
    Patrick Erdelt
    ------------------------------



  • 10.  RE: How to effectively use multi GPU for data warehouses?

    Posted 27 days ago
    Hi @Patrick Erdelt,

    I tried the GDELT dataset (from 1st January 2015 to 30 September 2018, so 252 million records), and the results are the same on 1, 4 or 8 K80s; I tried with a non-sharded table and another one sharded on the globaleventid field, containing the same data.
    The wall-clock performance is about the same with different numbers of GPUs, mainly because the projection phase is done on CPU, and with such a high cardinality (170k+ records) it takes a long time (1500ms, 1650ms and 1820ms for the 1-, 4- and 8-GPU configurations), probably because of a PCIe bus issue; in fact the K80s are dual-GPU cards, so two GPUs share the same bus, independently of the sharding (in this case the number of fragments of the non-sharded table is 8 too).
    The pure GPU execution time for the cardinality estimation and the real execution is 240-800ms, 60-250ms and 35-170ms respectively.
    I tried with mapd 4.4.2; which release are you using?


    ------------------------------
    Candido Dessanti
    Dba
    Crismatica consulting
    Rome
    ------------------------------



  • 11.  RE: How to effectively use multi GPU for data warehouses?

    Posted 27 days ago
    Hi @Candido Dessanti

    thank you for testing!

    I use mapd 4.4.2 too, but in a Docker container, and GDELT 2018, which is about 60M records.

    How did you compare the result sets?



    ------------------------------
    Patrick Erdelt
    ------------------------------



  • 12.  RE: How to effectively use multi GPU for data warehouses?

    Posted 27 days ago
    Hi @Patrick Erdelt,

    I just used the copy command to run the same query on both tables, then compared the sorted resulting CSVs with the Unix diff command.

    copy (select ....) to '/path/query1.csv'

    I can try with 2018 data only, but using such a small dataset doesn't make sense performance-wise; the 2080ti in my workstation runs the query in less than 90ms and a single K80 would run it in 260ms (we are talking about the GPU part of the query; the projection phase takes 344ms).

    ------------------------------
    Candido Dessanti
    Dba
    Crismatica consulting
    Rome
    ------------------------------



  • 13.  RE: How to effectively use multi GPU for data warehouses?

    Posted 27 days ago

    Hi

    Thanks for your continued exploration of OmniSci

    @Patrick Erdelt  Could you share the precise CREATE TABLE statements you used to build your test tables for the above result?

    In parallel, I will test this locally to see if I can reproduce the difference in results.

    Regards



    ------------------------------
    dwayneberry
    OmniSci
    ------------------------------



  • 14.  RE: How to effectively use multi GPU for data warehouses?

    Posted 27 days ago
    Edited by dwayneberry 27 days ago

    Hi

    @Patrick Erdelt

    I have 3 GPUs in this machine, so I used this setup for my sharding:

    DROP TABLE IF EXISTS PART;
    CREATE TABLE PART (

    P_PARTKEY BIGINT,
    P_NAME TEXT ENCODING DICT,
    P_MFGR TEXT ENCODING DICT,
    P_BRAND TEXT ENCODING DICT,
    P_TYPE TEXT ENCODING DICT,
    P_SIZE INTEGER,
    P_CONTAINER TEXT ENCODING DICT,
    P_RETAILPRICE DECIMAL(16,2),
    P_COMMENT TEXT ENCODING DICT
    ) with (fragment_size=64000000);

    DROP TABLE IF EXISTS SUPPLIER;
    CREATE TABLE SUPPLIER (
    S_SUPPKEY BIGINT,
    S_NAME TEXT ENCODING DICT,
    S_ADDRESS TEXT ENCODING DICT,
    S_NATIONKEY BIGINT ,
    S_PHONE TEXT ENCODING DICT,
    S_ACCTBAL DECIMAL(16,2),
    S_COMMENT TEXT ENCODING DICT
    )with (fragment_size=64000000);

    DROP TABLE IF EXISTS PARTSUPP ;
    CREATE TABLE PARTSUPP (
    PS_PARTKEY BIGINT ,
    PS_SUPPKEY BIGINT ,
    PS_AVAILQTY INTEGER,
    PS_SUPPLYCOST DECIMAL(16,2),
    PS_COMMENT TEXT ENCODING DICT
    )with (fragment_size=64000000);

    DROP TABLE IF EXISTS CUSTOMER;
    CREATE TABLE CUSTOMER (
    C_CUSTKEY BIGINT,
    C_NAME TEXT ENCODING DICT,
    C_ADDRESS TEXT ENCODING DICT,
    C_NATIONKEY BIGINT ,
    C_PHONE TEXT ENCODING DICT,
    C_ACCTBAL DECIMAL(16,2),
    C_MKTSEGMENT TEXT ENCODING DICT,
    C_COMMENT TEXT ENCODING DICT,
    SHARD KEY (C_CUSTKEY)
    )with (fragment_size=64000000, shard_count =3, partitions='sharded');

    DROP TABLE IF EXISTS ORDERS ;
    CREATE TABLE ORDERS (
    O_ORDERKEY BIGINT,
    O_CUSTKEY BIGINT ,
    O_ORDERSTATUS TEXT ENCODING DICT,
    O_TOTALPRICE DECIMAL(16,2),
    O_ORDERDATE DATE,
    O_ORDERPRIORITY TEXT ENCODING DICT,
    O_CLERK TEXT ENCODING DICT,
    O_SHIPPRIORITY INTEGER,
    O_COMMENT TEXT ENCODING DICT,
    SHARD KEY (O_CUSTKEY)
    )with (fragment_size=64000000, shard_count=3, partitions='sharded');

    DROP TABLE IF EXISTS LINEITEM ;
    CREATE TABLE LINEITEM (
    L_ORDERKEY BIGINT ,
    L_PARTKEY BIGINT ,
    L_SUPPKEY BIGINT ,
    L_LINENUMBER INTEGER,
    L_QUANTITY INTEGER,
    L_EXTENDEDPRICE DECIMAL(16,2),
    L_DISCOUNT DECIMAL(16,2),
    L_TAX DECIMAL(16,2),
    L_RETURNFLAG TEXT ENCODING DICT,
    L_LINESTATUS TEXT ENCODING DICT,
    L_SHIPDATE DATE,
    L_COMMITDATE DATE,
    L_RECEIPTDATE DATE,
    L_SHIPINSTRUCT TEXT ENCODING DICT,
    L_SHIPMODE TEXT ENCODING DICT,
    L_COMMENT TEXT ENCODING DICT
    )with (fragment_size=64000000);

    DROP TABLE IF EXISTS NATION ;
    CREATE TABLE NATION (
    N_NATIONKEY BIGINT,
    N_NAME TEXT ENCODING DICT,
    N_REGIONKEY BIGINT ,
    N_COMMENT TEXT ENCODING DICT
    )with (fragment_size=64000000);

    DROP TABLE IF EXISTS REGION ;
    CREATE TABLE REGION (
    R_REGIONKEY BIGINT,
    R_NAME TEXT ENCODING DICT,
    R_COMMENT TEXT ENCODING DICT
    )with (fragment_size=64000000);

    I only sharded CUSTOMER and ORDERS in this case, as those are all that could be used for this query.

    I ran this query:

    select
    n_name,
    sum(l_extendedprice * (1 - l_discount)) as revenue
    from
    customer,
    orders,
    lineitem,
    supplier,
    nation,
    region
    where
    c_custkey = o_custkey
    and l_orderkey = o_orderkey
    and l_suppkey = s_suppkey
    and c_nationkey = s_nationkey
    and s_nationkey = n_nationkey
    and n_regionkey = r_regionkey
    and r_name = 'EUROPE'
    and o_orderdate >= date '1993-01-01'
    and o_orderdate < date '1993-01-01' + interval '1' year
    group by
    n_name
    order by
    revenue desc


    and got this result:

    n_name|revenue
    RUSSIA|56715418.5622
    FRANCE|54436355.2609
    GERMANY|53161745.6445
    ROMANIA|52712944.0034
    UNITED KINGDOM|52569055.9493


    This is the same result I got when I loaded the data non-sharded.

    I know there is an issue if you do not set up sharding to match the GPU count in your machine, so perhaps you are experiencing that issue.

    Sharing your CREATE TABLE statements and details of your machine and config will allow us to verify that.

    regards





    ------------------------------
    dwayneberry
    OmniSci
    ------------------------------



  • 15.  RE: How to effectively use multi GPU for data warehouses?

    Posted 26 days ago
    Edited by Patrick Erdelt 26 days ago
    Hi @Candido Dessanti, hi @dwayneberry

    the 2018 data set is about 40GB, so I thought it is more than enough for 1xK80.

    I fetch results via Python/JDBC and compare them as lists of lists and as pandas data frames.

    The schema is

    CREATE TABLE gdelt (
    GLOBALEVENTID BIGINT ENCODING FIXED(32),
    SQLDATE INTEGER,
    MonthYear TEXT ENCODING DICT(8),
    Y TEXT ENCODING DICT(8),
    FractionDate DOUBLE,
    Actor1Code TEXT ENCODING DICT(16),
    Actor1Name TEXT ENCODING NONE,
    Actor1CountryCode TEXT ENCODING DICT(8),
    Actor1KnownGroupCode TEXT ENCODING DICT(8),
    Actor1EthnicCode TEXT ENCODING DICT(16),
    Actor1Religion1Code TEXT ENCODING DICT(8),
    Actor1Religion2Code TEXT ENCODING DICT(8),
    Actor1Type1Code TEXT ENCODING DICT(8),
    Actor1Type2Code TEXT ENCODING DICT(8),
    Actor1Type3Code TEXT ENCODING DICT(8),
    Actor2Code TEXT ENCODING DICT(16),
    Actor2Name TEXT ENCODING NONE,
    Actor2CountryCode TEXT ENCODING DICT(8),
    Actor2KnownGroupCode TEXT ENCODING DICT(8),
    Actor2EthnicCode TEXT ENCODING DICT(16),
    Actor2Religion1Code TEXT ENCODING DICT(8),
    Actor2Religion2Code TEXT ENCODING DICT(8),
    Actor2Type1Code TEXT ENCODING DICT(8),
    Actor2Type2Code TEXT ENCODING DICT(8),
    Actor2Type3Code TEXT ENCODING DICT(8),
    IsRootEvent INTEGER ENCODING FIXED(8),
    EventCode TEXT ENCODING DICT(8),
    EventBaseCode TEXT ENCODING DICT(8),
    EventRootCode TEXT ENCODING DICT(8),
    QuadClass INTEGER ENCODING FIXED(8),
    GoldsteinScale DOUBLE,
    NumMentions INTEGER ENCODING FIXED(16),
    NumSources INTEGER ENCODING FIXED(8),
    NumArticles INTEGER ENCODING FIXED(16),
    AvgTone DOUBLE,
    Actor1GeoType INTEGER ENCODING FIXED(8),
    Actor1GeoFullName TEXT ENCODING NONE,
    Actor1GeoCountryCode TEXT ENCODING DICT(8),
    Actor1GeoADM1Code TEXT ENCODING DICT(16),
    Actor1GeoADM2Code TEXT ENCODING DICT(16),
    Actor1GeoLat FLOAT,
    Actor1GeoLong FLOAT,
    Actor1GeoFeatureID INTEGER,
    Actor2GeoType INTEGER ENCODING FIXED(8),
    Actor2GeoFullName TEXT ENCODING NONE,
    Actor2GeoCountryCode TEXT ENCODING DICT(8),
    Actor2GeoADM1Code TEXT ENCODING DICT(16),
    Actor2GeoADM2Code TEXT ENCODING DICT(16),
    Actor2GeoLat FLOAT,
    Actor2GeoLong FLOAT,
    Actor2GeoFeatureID INTEGER,
    ActionGeoType INTEGER ENCODING FIXED(8),
    ActionGeoFullName TEXT ENCODING NONE,
    ActionGeoCountryCode TEXT ENCODING DICT(8),
    ActionGeoADM1Code TEXT ENCODING DICT(16),
    ActionGeoADM2Code TEXT ENCODING DICT(16),
    ActionGeoLat FLOAT,
    ActionGeoLong FLOAT,
    ActionGeoFeatureID INTEGER,
    DATEADDED BIGINT,
    SOURCEURL TEXT ENCODING NONE,
    SHARD KEY (GLOBALEVENTID)
    ) WITH (shard_count = 4, partitions='SHARDED');

    The schema for TPC-H is

    CREATE TABLE nation
    (
    n_nationkey INTEGER not null,
    n_name CHAR(25) not null,
    n_regionkey INTEGER not null,
    n_comment VARCHAR(152)
    );

    CREATE TABLE region
    (
    r_regionkey INTEGER not null,
    r_name CHAR(25) not null,
    r_comment VARCHAR(152)
    );

    CREATE TABLE part
    (
    p_partkey BIGINT not null,
    p_name VARCHAR(55) not null,
    p_mfgr CHAR(25) not null,
    p_brand CHAR(10) not null,
    p_type VARCHAR(25) not null,
    p_size INTEGER not null,
    p_container CHAR(10) not null,
    p_retailprice DOUBLE PRECISION not null,
    p_comment VARCHAR(23) not null
    );

    CREATE TABLE supplier
    (
    s_suppkey BIGINT not null,
    s_name CHAR(25) not null,
    s_address VARCHAR(40) not null,
    s_nationkey INTEGER not null,
    s_phone CHAR(15) not null,
    s_acctbal DOUBLE PRECISION not null,
    s_comment VARCHAR(101) not null
    );

    CREATE TABLE partsupp
    (
    ps_partkey BIGINT not null,
    ps_suppkey BIGINT not null,
    ps_availqty BIGINT not null,
    ps_supplycost DOUBLE PRECISION not null,
    ps_comment VARCHAR(199) not null
    );

    CREATE TABLE customer
    (
    c_custkey BIGINT not null,
    c_name VARCHAR(25) not null,
    c_address VARCHAR(40) not null,
    c_nationkey INTEGER not null,
    c_phone CHAR(15) not null,
    c_acctbal DOUBLE PRECISION not null,
    c_mktsegment CHAR(10) not null,
    c_comment VARCHAR(117) not null
    );

    CREATE TABLE orders
    (
    o_orderkey BIGINT not null,
    o_custkey BIGINT not null,
    o_orderstatus CHAR(1) not null,
    o_totalprice DOUBLE PRECISION not null,
    o_orderdate DATE not null,
    o_orderpriority CHAR(15) not null,
    o_clerk CHAR(15) not null,
    o_shippriority INTEGER not null,
    o_comment VARCHAR(79) not null,
    SHARD KEY (o_orderkey)
    ) WITH (shard_count = 4, partitions='SHARDED');

    CREATE TABLE lineitem
    (
    l_orderkey BIGINT not null,
    l_partkey BIGINT not null,
    l_suppkey BIGINT not null,
    l_linenumber BIGINT not null,
    l_quantity DOUBLE PRECISION not null,
    l_extendedprice DOUBLE PRECISION not null,
    l_discount DOUBLE PRECISION not null,
    l_tax DOUBLE PRECISION not null,
    l_returnflag CHAR(1) not null,
    l_linestatus CHAR(1) not null,
    l_shipdate DATE not null,
    l_commitdate DATE not null,
    l_receiptdate DATE not null,
    l_shipinstruct CHAR(25) not null,
    l_shipmode CHAR(10) not null,
    l_comment VARCHAR(44) not null,
    SHARD KEY (l_orderkey)
    ) WITH (shard_count = 4, partitions='SHARDED');

    The number of GPUs matches. I use Docker containers to limit the GPUs (there are in fact 8), and nvidia-smi says the restriction works.

    Best regards

    ------------------------------
    Patrick Erdelt
    ------------------------------



  • 16.  RE: How to effectively use multi GPU for data warehouses?

    Posted 26 days ago
    Hi @Patrick Erdelt,

    Looking at the DDLs you are using makes me better understand your recent problems; they are a little wrong.

    You can use the ones provided by @dwayneberry for the TPC-H schema, or use the attached ones.

    With the DDL I provided for the event table of the GDELT dataset, the 60M records take about 8GB on disk, so even using all the fields in a query (remember the DB is columnar, so it will load just the columns used by the query/queries) it's not enough to fill the memory of a single K80, and your query is going to use just 0.5GB.
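    As a rough back-of-envelope check of that footprint (encoded column widths taken from the DDL above, row count approximate):

    ```python
    # A columnar engine loads only the columns a query touches. Rough
    # footprint for the group-by query on ~60M GDELT rows, using the
    # encoded widths from the DDL (DICT(8) = 1 byte, FIXED(16) = 2 bytes).
    rows = 60_000_000
    bytes_per_row = {
        "MonthYear": 1,          # TEXT ENCODING DICT(8)
        "Actor1CountryCode": 1,  # TEXT ENCODING DICT(8)
        "Actor2CountryCode": 1,  # TEXT ENCODING DICT(8)
        "NumMentions": 2,        # INTEGER ENCODING FIXED(16)
    }
    footprint_gb = rows * sum(bytes_per_row.values()) / 1024**3
    assert footprint_gb < 0.5    # well under a single K80's per-GPU memory
    ```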

    Could you retry with those optimized DDLs and report back? I hope the problems will disappear and you will have some real fun with the omnisci database ;)
    P.S. Take a look at the docs about datatypes to get a better understanding of them:

    Datatypes and Fixed Encoding
    This topic describes standard datatypes and space-saving variations for values stored in OmniSci.

    If you have any question, just post in the community forum

    ------------------------------
    Candido Dessanti
    Dba
    Crismatica consulting
    Rome
    ------------------------------

    Attachment(s)

    txt
    gdelt_sql.txt   2K 1 version
    txt
    tpch_sql.txt   2K 1 version


  • 17.  RE: How to effectively use multi GPU for data warehouses?

    Posted 24 days ago
    Hi @Candido Dessanti, hi @dwayneberry

    thank you for the schemas!

    TPC-H:
    I took the schemas and ran some benchmarks on 1xK80 (no sharding), 1xV100 (no sharding) and 4xV100 (4 shards). All show about the same run time, but (sometimes) different results, for example for Q3 and Q5.

    GDELT:
    I had to modify the table, since I use GDELT Event 2.0, which has some more fields, e.g. Actor1GeoADM2Code. I also had to change DATEADDED to BIGINT in order to import the data.

    I tried both, but unfortunately it didn't help.

    But I have found another point: since I connect via JDBC, the size of the result set is limited to 100,000 rows by the mapd driver. It seems that, depending on the sharding, I receive different "first" 100,000 rows. I haven't found a way in Python yet to change this via setMaxRows as in https://www.omnisci.com/docs/latest/6_jdbc.html

    Maybe I will start a new thread on that.

    Another point: sharding apparently has an influence on precision.
    SELECT COUNT(*) c, COUNT(DISTINCT GLOBALEVENTID) d, MIN(GLOBALEVENTID) mi, MAX(GLOBALEVENTID) ma, AVG(GLOBALEVENTID) a, STDDEV(GLOBALEVENTID) s FROM gdelt
    gives me slightly different results for the standard deviation depending on the sharding.

    When executing the following query directly in mapdql
    select MonthYear, Actor1CountryCode, Actor2CountryCode, Actor1EthnicCode, Actor2EthnicCode, Actor1Religion1Code, Actor2Religion1Code, Actor1Religion2Code, Actor2Religion2Code, count(*) c, SUM(NumMentions) m from gdelt group by MonthYear, Actor1CountryCode, Actor2CountryCode, Actor1EthnicCode, Actor2EthnicCode, Actor1Religion1Code, Actor2Religion1Code, Actor1Religion2Code, Actor2Religion2Code

    I obtain:
    1xV100: Execution time: 147 ms, Total time: 1220 ms
    4xV100: Execution time: 208 ms, Total time: 1346 ms
    so this does not show much effect, but at least I receive the same result sets!

    So I am still looking for use cases showing the benefit of sharding in a data warehouse situation and I will keep on trying.


    Best regards

    ------------------------------
    Patrick Erdelt
    ------------------------------



  • 18.  RE: How to effectively use multi GPU for data warehouses?

    Posted 23 days ago

    But I have found another point: Since I connect via JDBC, the size of the result set is limited to 100,000 rows by the mapd driver. It seems, depending on sharding I receive different "first" 100,000 rows. I haven't found a way in python yet to change this via setMaxRows like in https://www.omnisci.com/docs/latest/6_jdbc.html
    Patrick Erdelt,  03-29-2019 07:55

    Changing this behavior is a little tricky but simple; the driver retrieves a max of 100k records because, if it detects that the launched query hasn't got a LIMIT clause, it rewrites it, adding an arbitrary LIMIT 100000. So if you want to retrieve 250k records, add LIMIT 250000 to your query.
    It's normal to receive different first N rows on a massively parallel database, sharding or not.
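    A sketch of that rewrite behavior (a hypothetical re-implementation for illustration, not the actual driver code):

    ```python
    import re

    # Hypothetical re-implementation of the driver's rewrite: if the query
    # has no trailing LIMIT clause, cap it at a default; an explicit LIMIT
    # from the caller wins.
    DEFAULT_LIMIT = 100_000

    def apply_default_limit(sql: str, limit: int = DEFAULT_LIMIT) -> str:
        if re.search(r"\blimit\s+\d+\s*$", sql, re.IGNORECASE):
            return sql                      # caller already chose a limit
        return f"{sql} LIMIT {limit}"

    assert apply_default_limit("select * from gdelt").endswith("LIMIT 100000")
    assert apply_default_limit("select * from gdelt LIMIT 250000").endswith("LIMIT 250000")
    ```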

    Just a question: why are you using the JDBC driver with Python? It would be more convenient to use the pymapd driver.

    You can find installation instructions here https://pypi.org/project/pymapd/ and docs here https://pymapd.readthedocs.io/en/latest/

    As for sharding a flat table, I did some brief tests on GDELT data of 2017, consisting of 67M records, on 8 K80 GPUs with the following query:

    select MonthYear, count(*) c, SUM(NumMentions),avg(NumMentions),round_to_digit(avg(NumMentions),1) from events_2017 group by MonthYear;

    Without sharding, so using little more than 2 GPUs, the response time is around 95ms; with sharding set to 8, the response time is under 40ms. With the query you provided, the execution time on 4 GPUs is faster than on a single one, but because of the high cardinality the projection phase takes a lot of time, making it almost useless.


    1xV100: Execution time: 147 ms, Total time: 1220 ms
    4xV100: Execution time: 208 ms, Total time: 1346 ms
    so this does not show much effect, but at least I receive the same result sets!
    Patrick Erdelt,  03-29-2019 07:55
    When you say 1xV100, do you mean you are executing the query on a container with a single V100 exposed, or that you are running the query on a table that hasn't been sharded, on a container with 4xV100?

    Have a nice weekend

    ------------------------------
    Candido Dessanti
    Dba
    Crismatica consulting
    Rome, Italy
    ------------------------------



  • 19.  RE: How to effectively use multi GPU for data warehouses?

    Posted 21 days ago
    Edited by Patrick Erdelt 20 days ago
    Adding LIMIT is quite easy I think and it works. Thank you!

    I try to only use standard tools and nothing vendor-specific, to keep use cases and benchmarks comparable and as broad as possible, so I stick to Python/JayDeBeApi; but thank you for the hint!

    As I understand it, (multi-)GPU usage brings the most benefit when you have complex computation queries but little data transfer. Some candidates might be
    -) geo computations (based on lon/lat)
    -) window functions (does mapd support any of those?)
    -) subqueries (apparently only uncorrelated ones are supported)
    -) not retrieving result sets, but summaries like SUM, COUNT, MAX, MIN, AVG

    What would you recommend?

    I have access to a Kubernetes-driven HPC cluster. Limiting GPUs there happens via Docker hiding some of them. I also have access to AWS p2 and p3 machines. I don't know how AWS handles these, but I assume they are virtual, too.

    1xV100 is p3.2xlarge
    4xV100 is p3.8xlarge


    Best wishes

    ------------------------------
    Patrick Erdelt
    ------------------------------



  • 20.  RE: How to effectively use multi GPU for data warehouses?

    Posted 20 days ago

    I try to only use standard tools and nothing vendor-specific to keep use cases and benchmarks comparable and as broad as possible, so I stick to Python/JayDeBeApi, but thank you for the hint!
    Patrick Erdelt,  04-01-2019 08:42
    Well, the JDBC driver provided by omnisci isn't exactly state of the art and, I guess, it lacks a lot of the functions supported by the Python driver.


    As I understand, (multi-) GPU usage brings most benefits when you have complex computation queries but little data transfer. Some candidates might be
    -) geo computations (based on long/lat)
    -) window functions (does mapd support any of those?)
    -) subqueries (apparently only uncorrelated are supported)
    -) don't retrieve result sets, but summaries like SUM, COUNT, MAX, MIN, AVG
    Patrick Erdelt,  04-01-2019 08:42
    I would recommend anything and everything at the same time, but it depends on the query, the cardinality, the release of the software and the generation of the hardware.

    Taking as examples the queries on the GDELT and TPC-H datasets we have tried so far (a high-cardinality group by with few aggregates, a single-key group by with low cardinality and more aggregates, and a two-key group by with a lot of aggregates), they take anywhere from no advantage to a huge advantage from a multi-GPU configuration with sharding.

    GDELT v1 2017 (66M records)

    select MonthYear, Actor1CountryCode, Actor2CountryCode, Actor1EthnicCode, Actor2EthnicCode, Actor1Religion1Code, Actor2Religion1Code, Actor1Religion2Code, Actor2Religion2Code, count(*) c, SUM(NumMentions) m
    from events_2017
    group by MonthYear, Actor1CountryCode, Actor2CountryCode, Actor1EthnicCode, Actor2EthnicCode, Actor1Religion1Code, Actor2Religion1Code, Actor1Religion2Code, Actor2Religion2Code;


    sharded on PK                         8 GPUs 389ms (overhead after the Group by 200ms) 1.5x over 1 GPU
    sharded on first group's key     8 GPUs 339ms (overhead after the Group by 150ms) 1.8x over 1 GPU
    not sharded                              3 GPUs 389ms  (overhead after the Group by 60ms)  1.5x over 1 GPU
    not sharded/sharded                1 GPU   607ms (no overhead)

    This query has a preflight step, and the groups are shared in global memory. The overhead caused by the cardinality limits the benefit of the additional processing power.

    select MonthYear, count(*) c, SUM(NumMentions),avg(NumMentions),round_to_digit(avg(NumMentions),1) from events_2017 group by MonthYear;

    sharded on PK                         8 GPUs  42ms  (no overhead) 4.2x over a GPU
    not sharded                              3 GPUs  96ms  (no overhead) 1.9x over a GPU
    not sharded/sharded                1 GPU  180ms  (no overhead)

    In this query the groups are shared in on-chip memory, with little to no overhead, so having more processing power makes more sense.

    TPCH sizing 16GB (95M records)

    SELECT l_returnflag, l_linestatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price, sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*) as count_order FROM lineitem_s WHERE l_shipdate <= date '1998-12-01' - interval '90' day GROUP BY l_returnflag, l_linestatus ORDER BY l_returnflag, l_linestatus;

    sharded on HC column            8 GPUs   170ms (no overhead) 8x over 1 GPU
    not sharded                              3 GPUs   450ms (no overhead) 3x over 1 GPU
    not sharded/sharded            1 GPU   1345ms (no overhead)

    This query has a two-key group by with very low cardinality and, because the groups are shared in global memory, the concurrency is very taxing, so the scaling is almost linear across the GPUs.

    Naturally, newer hardware changes those results on aggregates; on a Turing gaming card, the response times are 3 to 7 times lower than on K80s.

    About your candidates, I can reply:
    • maybe (not tested)
    • unlikely (looking at recent commits, window functions may be supported in the next release, but CPU-only; it's tricky to execute them in an MPP way)
    • probably
    • that's true, because analytical databases aren't designed to shine in the projection phase


    I have access to a Kubernetes-driven HPC cluster. Limiting GPUs here happens via docker hiding some of them. I also have access to AWS, p2 and p3 machines. I don't know how AWS handles these, but I assume these are virtual, too.

    1xV100 is p3.2xlarge
    4xV100 is p3.8xlarge
    Patrick Erdelt,  04-01-2019 08:42
    I've no idea which is better; I think the performances are similar. BTW, you can limit the GPUs the server is going to use with a parameter in the configuration file.

    ------------------------------
    Candido Dessanti
    Dba
    Crismatica consulting
    Rome, Italy
    ------------------------------