Discussions


How to effectively use multi GPU for data warehouses?

  • 1.  How to effectively use multi GPU for data warehouses?

    Posted 20 days ago
    Data warehouses typically involve combined keys (star schema, ...) and a lot of JOINs.

    Documentation says "A shard key must specify a single column to shard on. There is no support for sharding by a combination of keys."

    Obviously one option is to combine all tables into a single very redundant table.

    Is there another option?
    #Core

    ------------------------------
    Patrick Erdelt
    ------------------------------


  • 2.  RE: How to effectively use multi GPU for data warehouses?

    Posted 20 days ago
    Edited by Candido Dessanti 20 days ago
    Hi @Patrick Erdelt,

    You are right, but sharding becomes necessary if you have a dimension with a significant number of records (or for other reasons, such as query performance on relatively small tables when you have multiple GPUs).
    Radical denormalization of an entire star schema should, in my opinion, be avoided whenever possible because of the increased disk and memory consumption, even on columnar databases. As a first step, you can consider various denormalization strategies instead, such as combining some dimensions into one, or merging the fact table with a dimension (or both if necessary); see the sketch below.
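
    For example, a partial denormalization can be done with a CTAS that folds one small dimension into the fact table, rather than flattening the whole schema (just a sketch, with made-up table and column names):

    -- fold a small store dimension into the sales fact table (hypothetical names)
    CREATE TABLE fact_sales_denorm AS
    SELECT f.*,
           s.store_name,
           s.store_region
    FROM fact_sales f
    JOIN dim_store s ON f.store_id = s.store_id;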

    P.S. I'm referring to a Kimball-style star schema/DWH

    ------------------------------
    Candido Dessanti
    Dba
    consulting
    Rome
    ------------------------------



  • 3.  RE: How to effectively use multi GPU for data warehouses?

    Posted 5 days ago
    Edited by Patrick Erdelt 5 days ago
    Hi @Candido Dessanti

    thank you for your reply!

    As I understand it, it is not sensible (or possible) to shard two dimensions, so I would shard one dimension (the biggest one) of a star schema:

    CREATE TABLE dim1
    (
      dim1_id BIGINT ENCODING FIXED(32),
      name CHAR(25),
      SHARD KEY (dim1_id)
    ) WITH (shard_count = 4, partitions='SHARDED');

    CREATE TABLE dim2
    (
      dim2_id BIGINT ENCODING FIXED(32),
      name CHAR(25)
    ) WITH (partitions='REPLICATED');

    CREATE TABLE facts
    (
      dim1_id BIGINT ENCODING FIXED(32),
      dim2_id BIGINT ENCODING FIXED(32),
      fact INTEGER,
      SHARD KEY (dim1_id)
    ) WITH (shard_count = 4, partitions='SHARDED');

    Is this expected to speed up suitable queries by a factor of 4 if I have 4 GPUs and everything fits into GPU memory?
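
    A suitable query I have in mind would be, for example, a join on the shard key with a grouped aggregation over the tables above:

    -- join fact table and sharded dimension on the shard key, then aggregate
    SELECT d.name, SUM(f.fact) AS total
    FROM facts f
    JOIN dim1 d ON f.dim1_id = d.dim1_id
    GROUP BY d.name;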


    Best regards

    ------------------------------
    Patrick Erdelt
    ------------------------------



  • 4.  RE: How to effectively use multi GPU for data warehouses?

    Posted 4 days ago
    Hi @Patrick Erdelt,

    The easy answer is yes: the software is designed to run queries MPP-style even on a single GPU, so on a system with multiple GPUs the speed-up of the GPU-related tasks would scale linearly, but keep in mind that an additional reduction step is needed to get the final result of the query.
    Everything else, such as parsing, memory management and projection, won't benefit from the additional GPUs, so the results can vary depending on the query complexity and the cardinality of the results.
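
    As a rough illustration using the schema you posted (my reading of the scaling behaviour, not a benchmark):

    -- low-cardinality aggregation: each GPU aggregates its own shards and only a
    -- small per-group reduction remains, so this kind of query tends to scale
    SELECT dim2_id, SUM(fact) AS total
    FROM facts
    GROUP BY dim2_id;

    -- high-cardinality grouping: the final reduction over many distinct groups
    -- happens on a single GPU, so the additional GPUs help much less here
    SELECT dim1_id, dim2_id, SUM(fact) AS total
    FROM facts
    GROUP BY dim1_id, dim2_id;
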
    Have you fixed the problems you had with containers? ;)




    ------------------------------
    Candido Dessanti
    Dba
    Crismatica consulting
    Rome
    ------------------------------



  • 5.  RE: How to effectively use multi GPU for data warehouses?

    Posted 4 days ago
    Hi @Candido Dessanti

    have you used sharding successfully yourself? Which kinds of queries benefit most?

    I face some strange issues even in the simple case of sharding a single table using its primary key.

    How would you shard the TPC-H dataset?

    Thank you for asking about containers. In fact, I found some more issues. I will post these in a new discussion once I am more confident about what is going wrong...



    ------------------------------
    Patrick Erdelt
    ------------------------------



  • 6.  RE: How to effectively use multi GPU for data warehouses?

    Posted 3 days ago
    Hi @Patrick Erdelt,

    I only briefly tried sharding, to work around a problem that turned out to be impossible to solve, so I have to reply with a big no.

    The queries that could get a significant speedup from sharding are:

    1) ones where two big tables are joined together, as stated in the docs, because you save a lot of traffic on the PCIe bus and the join is executed just once on each card
    2) ones where you have an unbalanced number of fragments, e.g. a table with 2 fragments on a system with 4 GPUs; if the queries use heavy filtering or calculations, you can use sharding to balance the data across all the available GPUs.

    It's not useful, and probably harmful, on queries that don't scale with multiple GPUs, such as high-cardinality ones, because the final reduction has to be done on a single GPU.

    About TPC-H: I don't like how it is designed, because it uses an overly normalized table design with type 1 SCDs (that's a joke), but it's obvious that you have to shard the lineitem and orders tables on the order key.
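
    Something along these lines, mirroring the schema style from your earlier post (column lists abbreviated, shard_count matching a 4-GPU box):

    CREATE TABLE orders
    (
      o_orderkey BIGINT,
      o_custkey BIGINT,
      o_orderdate DATE,
      -- remaining TPC-H columns omitted for brevity
      SHARD KEY (o_orderkey)
    ) WITH (shard_count = 4, partitions='SHARDED');

    CREATE TABLE lineitem
    (
      l_orderkey BIGINT,
      l_suppkey BIGINT,
      l_extendedprice DECIMAL(15,2),
      l_discount DECIMAL(15,2),
      -- remaining TPC-H columns omitted for brevity
      SHARD KEY (l_orderkey)
    ) WITH (shard_count = 4, partitions='SHARDED');

    -- with both tables sharded on the order key, rows that join land on the same GPU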

    Could you tell us which problems you are facing with sharding?





    ------------------------------
    Candido Dessanti
    Dba
    Crismatica consulting
    Rome
    ------------------------------



  • 7.  RE: How to effectively use multi GPU for data warehouses?

    Posted 2 days ago
    Edited by Patrick Erdelt 2 days ago
    Hi @Candido Dessanti

    I see the following results for TPC-H, scale factor = 1, Q5, i.e. the query:
    select
        n_name,
        sum(l_extendedprice * (1 - l_discount)) as revenue
    from
        customer,
        orders,
        lineitem,
        supplier,
        nation,
        region
    where
        c_custkey = o_custkey
        and l_orderkey = o_orderkey
        and l_suppkey = s_suppkey
        and c_nationkey = s_nationkey
        and s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'EUROPE'
        and o_orderdate >= date '1993-01-01'
        and o_orderdate < date '1993-01-01' + interval '1' year
    group by
        n_name
    order by
        revenue desc

    8xK80 - 8 shards
    n_name|revenue
    RUSSIA|28192512.1223
    ROMANIA|27742341.30130001
    FRANCE|27551487.674
    UNITED KINGDOM|26663225.03280001
    GERMANY|25032671.0815

    1xK80 - 1 shard
    n_name|revenue
    RUSSIA|56715418.56220009
    FRANCE|54436355.26089995
    GERMANY|53161745.64450005
    ROMANIA|52712944.00339993
    UNITED KINGDOM|52569055.94930004

    In both cases I use the same data and the same table schemas, except for the sharding. In fact, I sharded lineitem and orders on the order key, so that was my best guess, too. I can also verify that the number of imported rows is equal in both settings.

    The second try was with excerpts of the GDELT data set: one table, sharded by a single-column primary key.

    1xK80 and 4xK80 show almost the same run times. I tried a 3.7 GB and a 24 GB excerpt and used basic OLAP queries: count, sum and cascading group-bys.
    3.7 GB: same result sets for all queries
    24 GB: same result sets for simple queries, different result sets for complex queries

    So maybe
    • there's a bug
    • sharding is weird
    • it's a container problem
    • something about the schemas (data types?)
    • something about the queries
    • something else on my side



    ------------------------------
    Patrick Erdelt
    ------------------------------



  • 8.  RE: How to effectively use multi GPU for data warehouses?

    Posted 2 days ago
    Hi  @Patrick Erdelt,

    I cannot reproduce this right now because I ran out of usable instances with multiple K80s on AWS; I'll try over the weekend.
    As for the performance, it depends on the query and the data; sharing the queries you are running on the dataset would help.


    ------------------------------
    Candido Dessanti
    Dba
    Crismatica consulting
    Rome
    ------------------------------



  • 9.  RE: How to effectively use multi GPU for data warehouses?

    Posted 2 days ago
    Edited by Patrick Erdelt yesterday
    Hi @Candido Dessanti

    Same result sets for 1xK80 with no shards and 4xK80 with 4 shards:
    select MonthYear, Actor1CountryCode, count(*) c, SUM(NumMentions) m from gdelt group by MonthYear, Actor1CountryCode

    Different result sets:
    select MonthYear, Actor1CountryCode, Actor2CountryCode, count(*) c, SUM(NumMentions) m from gdelt group by MonthYear, Actor1CountryCode, Actor2CountryCode

    PS: If I put select count(*) from (...) around the queries, I obtain the same number of rows, so it's possibly just ordering.
    Ordering the results directly is not possible, because of Exception: Sorting the result would be too slow
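
    For the second query, that wrapper looks like this (derived-table alias added, which the parser may require):

    -- compare only the row counts of the two settings
    select count(*) from (
      select MonthYear, Actor1CountryCode, Actor2CountryCode, count(*) c, SUM(NumMentions) m
      from gdelt
      group by MonthYear, Actor1CountryCode, Actor2CountryCode
    ) t;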

    PPS: It's not just ordering. I fetch the result sets in Python, so I can verify that they are really different.

    ------------------------------
    Patrick Erdelt
    ------------------------------