Discussions

Problem with joining two tables

  • 1.  Problem with joining two tables

    Posted 09-12-2019 03:12

    Hello,

    I'll try to explain my problem as clearly as I can.

    So, basically, table A has 588360324 rows and table B has 146991045. By simply filtering table A for sID = 999888 (select count(*) from A where sID=999888) I get 35424 results.

    When I try to execute this query:


    select sum(A.val) val1,A.sDate,A.sID from A join B on B.sID=A.sID and B.id=A.bID where A.sID=999888 group by A.sDate,A.sID;

    I'm getting an error:

    Projection query output result set on table(s): B would contain 146991045 rows, which is more than the current system limit of 128000000.

    Is there any option or solution I can use to solve my problem and execute the query properly? Maybe I can change the limit? When I query the tables separately (without joins), there is no problem.

    My hardware is a PowerEdge R720 with 2x K80 (seen as 4 cards with 11 GB RAM each), a 40-core processor, and 128 GB RAM. I'm using the open-source version.


    #Uncategorized
    #Core
    #General


  • 2.  RE: Problem with joining two tables

    Posted 09-12-2019 04:21
    Hi,

    it's the watchdog exception that's blocking your query, so you can disable the watchdog and retry your query.
    To disable it, add the line enable-watchdog=false to your omnisci.conf file, then restart the server.
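
    For example, the relevant omnisci.conf entry would look like this (the rest of your configuration stays as-is, and a server restart is needed to pick it up):

    # omnisci.conf -- only the relevant line is shown
    enable-watchdog = false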

    That said, it looks like a bug to me, because I don't see any reason why there would be an intermediate projection of table B in such a simple query.

    Have you tried sharding the tables? Sharding could be beneficial for performance; see the sketch below.
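
    Sharding is declared at CREATE TABLE time. A minimal sketch, with illustrative column types, co-sharding both tables on one of the join keys, sID (the shard count must match on both sides):

    -- hypothetical DDL sketch: co-shard A and B on sID
    create table A_sharded (
      sID bigint,
      bID bigint,
      sDate date,
      val double,
      shard key (sID)
    ) with (shard_count = 4);

    create table B_sharded (
      sID bigint,
      id bigint,
      shard key (sID)
    ) with (shard_count = 4);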


  • 3.  RE: Problem with joining two tables

    Posted 09-12-2019 05:02
    Edited by Archi K 09-12-2019 05:06
    Hi!

    Thank you for the answer. Now the server tries to execute the query, but after a couple of seconds I get an error:

    Exception: Fatal error while attempting to build hash tables for join: Hash tables with more than 2B entries not supported yet

    I'm new to OmniSci, to be honest, and I'm testing its capabilities. I've read about sharded tables but haven't used them yet. Do you think this error is caused by the fact that my tables are not sharded? If so, I'll try sharding later, but as I understand it, I would have to reload my data. I planned to shard my tables later, at the performance-optimization stage, but I'm struggling to run such simple queries, which is frustrating :)


  • 4.  RE: Problem with joining two tables

    Posted 09-12-2019 06:08
    Hi,

    the table sharding was meant as a suggestion to improve performance, so I can't promise that sharding the tables will make the query run.

    The fact that you wait a couple of seconds before getting an error doesn't sound good. I can run a query like this without any problem:

    select r_name,n_name,sum(o_totalprice), count(*) from orders_1500m join customers_150m on o_custkey = c_custkey join nation on n_nationkey = c_nationkey join region on n_regionkey=r_regionkey group by 1,2

    in less than 1000 ms (filtered or not) with just 2 gaming GPUs. The big difference is that you are using two keys to join the tables. Could you add verbose=true and enable-debug-timer=true, re-run the query, and share the [mapd_storage]/data/mapd_log/omnisci_server.INFO, so we can better understand what's happening?
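
    That is, the two omnisci.conf entries would look like this (restart the server afterwards):

    verbose = true
    enable-debug-timer = true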


  • 5.  RE: Problem with joining two tables

    Posted 09-14-2019 03:42
    Edited by Archi K 09-14-2019 03:44
    Hello!

    I've performed those operations and attached the requested file.

    omnisci_server.INFO contents:

    https://pastebin.com/QyLuPjqC

    If there is no way to solve this problem, the next thing I'm thinking about is using one key instead of two, as you mentioned, and checking whether that gives any results. To do this I have to denormalize the data and merge the two keys into a single key composed of both of them, as I've done before to analyse the data in Analysis Services. Or maybe put those two tables together into one (all columns together), then import the data to OmniSci and analyse it that way.

    But for now I'll just wait for you; maybe there is an answer in my INFO file :)

    I want to thank you for your help and great commitment to my issue!




  • 6.  RE: Problem with joining two tables

    Posted 09-14-2019 23:51
    Hi @Archi K,

    thanks for posting the log, and sorry for the late reply.

    Enabling filter pushdown by setting the parameter enable-filter-push-down to true could help lower the cardinality of the join and let the query run, but the performance will not be optimal because of the large projection on the right table; so it would be better, if possible, to have a single key column to join such high-cardinality tables.
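
    In omnisci.conf that's one line (a restart is needed):

    enable-filter-push-down = true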

    If the filter pushdown works, you can do the needed ETL work inside the OmniSci database using the rowid pseudo-column that every OmniSci table has.

    So you can do a CREATE TABLE AS SELECT for the bigger table using a narrow key range, then complete the load with some INSERT INTO ... SELECT (IAS) statements.

    e.g.

    create table a_new as
    select f1,f2,f3...,fn, b.rowid as fk_to_table_b
    from a join b on b.sID=a.sID and b.id=a.bID
    where b.sID between 1 and 10;

    insert into a_new
    select f1,f2,f3...,fn, b.rowid as fk_to_table_b
    from a join b on b.sID=a.sID and b.id=a.bID
    where b.sID between 11 and 20;
    
    and so on.

    Then do the same for the b table, without the join.

    As I said before, shard the tables on the newly created FK/PK; also, changing the fragment size to 64000000 would help speed up the join. A sketch combining the two is below.
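
    Since sharding and fragment size are declared at CREATE TABLE time, you would create the new table explicitly instead of using the CTAS above, then populate it with the INSERT ... SELECT statements. A hypothetical sketch, with illustrative column types:

    create table a_new (
      sID bigint,
      sDate date,
      val double,
      fk_to_table_b bigint,
      shard key (fk_to_table_b)
    ) with (shard_count = 4, fragment_size = 64000000);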


  • 7.  RE: Problem with joining two tables

    Posted 09-16-2019 04:17
    Hello @Candido Dessanti,

    Unfortunately, enabling the filter pushdown setting doesn't work either; the exception message is the same. So the next thing I'm going to do is use one pregenerated key instead of those two, as I did in MS Analysis Services. It has already been successfully tested for data integrity, so this step is behind me :) I'll be back with info on whether it works.

    Again, thank you for your help!



  • 8.  RE: Problem with joining two tables

    Posted 09-17-2019 03:46
    Hi,

    I'm sorry it's not working; I'll try to reproduce it with some tests, because I don't have any dataset with two keys of such cardinality.

    With just one key, I don't see any reason why it shouldn't work.


  • 9.  RE: Problem with joining two tables

    Posted 09-17-2019 09:49
    Hello!

    I think it will be fine. For now I have to postpone the tests a little, as I have to prepare the data and set up replication, and that requires some processes to finish. I'll write a post as soon as I finish the one-key OmniSci test database.

    All the best!


  • 10.  RE: Problem with joining two tables

    Posted 9 days ago
    Edited by Archi K 9 days ago
    Hello again!

    So I've integrated the data to avoid any joins, and now I have one single large table with about 650M records. But even relatively simple queries perform terribly slowly :(

    For example:

    SELECT SUM(VAL_FIELD),KEY1_FIELD,cast(DATE_FIELD as date) as DATE_FIELD,KEY_FIELD2 FROM DATA_TABLE WHERE cast(DATE_FIELD as date)>='2015-1-1' AND cast(DATE_FIELD as date)<='2016-1-1' GROUP BY KEY1_FIELD,DATE_FIELD,KEY_FIELD2

    runs for a long time, whereas in SQL Server I get an answer almost immediately. When I try to put the results into another table using CTAS, this query:

    create table TEMP_COUNTS as (SELECT SUM(VAL_FIELD),KEY1_FIELD,cast(DATE_FIELD as date) as DATE_FIELD,KEY_FIELD2 FROM DATA_TABLE WHERE cast(DATE_FIELD as date)>='2015-1-1' AND cast(DATE_FIELD as date)<='2016-1-1' GROUP BY KEY1_FIELD,DATE_FIELD,KEY_FIELD2);

    runs forever and never ends; the server hangs, does nothing, and interrupting the query results in this message:

    Interrupt signal (2) received.
    Asking server to interrupt query.

    And nothing happens.

    Every time I run a query, nvidia-smi (under watch every 0.2 s) shows 0% GPU usage almost all the time and about 30% GPU memory usage, but the system performance monitor shows 100% CPU usage.

    When I try SELECT COUNT(*) FROM DATA_TABLE; I get the result in about the same time as in SQL Server, but GPU RAM usage is 100%.

    I have 2.5 TB of free disk space, a reinstalled Ubuntu OS, and a fresh OmniSci installation. Is there any possible move to get this to work? Maybe the data table cardinality is too big? If so, what is the optimal record count for OmniSci? Maybe I can't operate on such large datasets? Or maybe my hardware is too weak for my DB?

    I'm loading the data directly from SQL Server using the command line, and it creates the table automatically. I could create the table on my own, but I'm not sure whether I can shard it, since I have multiple grouping keys, and I don't know whether sharding would push things forward.

    As I mentioned above, my hardware is PowerEdge R720 with 2xK80 (seen as 4 cards 11 GB RAM each), 40 core processor, 128 GB RAM.




    Apart from this problem, can you give me advice on the best way to quickly update/insert recently changed or inserted records from the source database?

    1. Per-record inserts/updates using my own external integration solution, keeping records up to date (but possibly running into performance issues doing so), or
    2. Bulk reloading the whole table daily/weekly/monthly... (but losing data freshness and possibly having issues when the automated data import fails), or
    3. ? :)


    Best regards!


  • 11.  RE: Problem with joining two tables

    Posted 9 days ago
    Hi @Archi K,

    Thanks for continuing to test the OmniSci database.

    I did some tests with different reasonable cardinalities on an unbalanced and unoptimized table containing nearly 500M records in just 5 fragments, so queries running on the CPU will use up to 5 cores. My configuration is a 12-core Threadripper processor and 2 Nvidia RTX 2080 Ti graphics cards.

    omnisql> select count(*) from table_data
    EXPR$0
    497551202
    1 row returned.
    Execution time: 15 ms, Total time: 15 ms
    
    omnisql> select dimension_1k,dimension_20k,cast(timestamp_field as date),sum(measure_float) from adsb_airplanes_test where cast(PosTimeDate as date) >= '2018-01-01' and cast(PosTimeDate as date) <= '2019-01-01' group by 1,2,3;
    [cut]
    125652 rows returned.
    Execution time: 330 ms, Total time: 542 ms
    
    omnisql> select dimension_400k,dimension_20k,cast(timestamp_field as date),sum(measure_float) from adsb_airplanes_test where cast(PosTimeDate as date) >= '2018-01-01' and cast(PosTimeDate as date) <= '2019-01-01' group by 1,2,3;
    [cut]
    1381512 rows returned.
    Execution time: 484 ms, Total time: 2776 ms


    The discrepancy between Execution and Total time is because the database projects the columnar results and transforms them into rows for the client; this kind of job runs on the CPU and incurs some serialization, so on high-cardinality queries you will likely see the CPU quite busy with that task.

    Due to the cardinality, the number of keys, and the kind of data, those queries use a baseline hash algorithm, which is the least performant for GROUP BYs.

    Removing a key, the query will use a perfect hash, with some speedup, e.g.:

    omnisql> select dimension_1k,cast(timestamp_field as date),sum(measure_float) from adsb_airplanes_test where cast(PosTimeDate as date) >= '2018-01-01' and cast(PosTimeDate as date) <= '2019-01-01' group by 1,2;
    [cut]
    6660 rows returned.
    Execution time: 139 ms, Total time: 149 ms
    
    

    So the speed of the queries depends on various factors: cardinality, the number of results projected, the kind of data used, and how the tables are built (balanced, so the work is spread evenly across the GPUs; in my case the first GPU processes 2/3 of the data, so the execution time is far from perfect).

    I guess the problem you are facing is caused by the cardinality; I say that because the queries appear to run entirely on the CPU after the GPU run has failed, the estimator query having returned a result too big to run on a GPU fragment.

    Anyway, I could be wrong, so I suggest posting the DDLs and the estimated cardinalities returned by the queries, and turning on verbose logging by adding

    verbose=true

    to the omnisci.conf file, and posting an extract of the omnisci_server.INFO log file.


  • 12.  RE: Problem with joining two tables

    Posted 8 days ago
    Hello @Candido Dessanti,

    Thank you very much for your effort!

    I did some experiments following your advice. I think my performance problem is now solved: I've removed some of the GROUP BYs and limited the query results, and now for 650M records I'm getting about 3.5 s, which is totally fine for me. Now I have to rethink my analytic reports and try to optimize them for OmniSci instead of SQL Server.

    My biggest problem now is the second part of my last post: I have to keep the data in OmniSci up to date somehow. My source records in SQL Server are updated on a daily basis, so, most optimistically, I want to have at most one day of data delay. Is there any way to achieve this other than reloading all the data into OmniSci? I'm asking because, despite the fact that my SQL Server is located on the same physical machine as OmniSci, my data had to be denormalized to avoid joins, as we discussed before, and it takes about 10 hours to load all of it into OmniSci. Unfortunately, I can't use any "record active" column on the source, as this is the second system's architecture and it has to stay like that because of other, mostly legacy, dependencies and compatibilities :( Any ideas? :)

    Best regards!


  • 13.  RE: Problem with joining two tables

    Posted yesterday
    Hello,

    I'm sorry for reopening the thread, but maybe my last post was missed somehow.

    1. Does anybody have an idea how to achieve fast and reliable synchronization between MS SQL Server and OmniSci? :)

    2. Next, I've re-uploaded the data to keep it up to date, and almost nothing changed (I've added a few columns, but I'm not using them in queries for now; I'm going to use them later), yet I can't get the server to use the GPU :( EXPLAIN says IR for the CPU, the \gpu command does nothing, and every executed query runs on the CPU. Is there any setting I can override to make the server use the GPU rather than the CPU? Any idea why the server used the GPU for the same queries before and now it won't?

    All the best!


  • 14.  RE: Problem with joining two tables

    Posted 11 hours ago
    Edited by Candido Dessanti 10 hours ago
    Hi @Archi K,
    I'm sorry, I forgot to reply to the ETL question. Still, without knowing how the SQL Server tables are populated (you wrote it's a replica of another SQL Server) and which steps you are taking to flatten your data, it's challenging.

    There are several tools to import data into the OmniSci database without the need to write custom code, and you can find the documentation here, but the only one that can read data while pushing transformations directly from MSSQL is SQLImporter.

    The performance of this tool isn't the best (on a 25-column table it's around 17000 rows per second from an Oracle 19c database) because the operation is serialized. To improve the throughput, you have to spawn multiple processes manually (e.g. 4 processes increase the throughput to 55000 rows per second).
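
    For reference, a rough sketch of a SQLImporter invocation against SQL Server; the jar paths, credentials, connection string, and source query are all illustrative, so please check the flags against the OmniSci documentation for your version:

    java -cp /path/to/omnisci-utility.jar:/path/to/mssql-jdbc.jar \
      com.mapd.utility.SQLImporter \
      -u admin -p HyperInteractive -db omnisci --port 6274 -t data_table \
      -su source_user -sp source_password \
      -c "jdbc:sqlserver://localhost:1433;databaseName=SourceDb" \
      -ss "select * from dbo.FlattenedTable"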

    If you use different tools, like an ETL product, you can think about the StreamImporter tool, which takes a CSV stream as input and has a throughput of 50000 rows per second for a single thread and 157000 for four threads, probably more with more threads, although it doesn't scale linearly.

    If you can use the CDC feature of MSSQL, so that you can track table changes, you can speed up the loading, depending on the number of updates (single-row updates take a lot).
    An idea could be to move the rows to be updated into a staging table on the OmniSci server, then do a massive delete and an insert-as-select (for 100000 records the operation takes less than 1 second in total), but you will have to rebuild your table periodically to reclaim disk space and restore optimal performance; a rough sketch follows.
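
    A minimal sketch of that refresh, assuming a hypothetical staging table staging_changes with the same layout as your DATA_TABLE, and that your build accepts subqueries in DELETE predicates:

    -- drop the stale versions of the changed keys from the main table
    delete from DATA_TABLE
    where KEY1_FIELD in (select KEY1_FIELD from staging_changes);

    -- re-insert the fresh versions
    insert into DATA_TABLE
    select * from staging_changes;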


    About the problem of no GPU usage: first, you have to check that the server is starting in GPU mode.
    When the server starts, it writes to the log file ($OMNISCI_STORAGE/mapd_log/omnisci_server.INFO) the number of GPUs detected and detailed GPU info:
    2020-01-17T11:47:46.893972 I 19898 MapDServer.cpp:1117 OmniSci started with data directory at '/opt/mapd_storage/data50'
    2020-01-17T11:47:46.893979 I 19898 MapDServer.cpp:1126  Watchdog is set to false
    2020-01-17T11:47:46.893983 I 19898 MapDServer.cpp:1127  Dynamic Watchdog is set to false
    2020-01-17T11:47:46.893987 I 19898 MapDServer.cpp:1132  Debug Timer is set to true
    2020-01-17T11:47:46.893990 I 19898 MapDServer.cpp:1134  Maximum Idle session duration 60
    2020-01-17T11:47:46.893994 I 19898 MapDServer.cpp:1136  Maximum active session duration 43200
    2020-01-17T11:47:46.895594 I 19898 MapDHandler.cpp:228 OmniSci Server 5.0.1-xxxxxxxxxx-xxxxxxxxxx
    2020-01-17T11:47:46.895969 1 19898 MapDServer.cpp:1350 heartbeat thread starting
    2020-01-17T11:47:47.422740 I 19898 CudaMgr.cpp:318 Using 2 Gpus.
    2020-01-17T11:47:47.422780 1 19898 CudaMgr.cpp:320 Device: 0
    2020-01-17T11:47:47.422785 1 19898 CudaMgr.cpp:321 UUID: b3a74fb2-2742-8933-ca7f-fe9bf9ce7b7b
    2020-01-17T11:47:47.422790 1 19898 CudaMgr.cpp:322 Clock (khz): 1590000
    2020-01-17T11:47:47.422793 1 19898 CudaMgr.cpp:323 Compute Major: 7
    2020-01-17T11:47:47.422797 1 19898 CudaMgr.cpp:324 Compute Minor: 5
    2020-01-17T11:47:47.422801 1 19898 CudaMgr.cpp:325 PCI bus id: 9
    2020-01-17T11:47:47.422804 1 19898 CudaMgr.cpp:326 PCI deviceId id: 0
    2020-01-17T11:47:47.422808 1 19898 CudaMgr.cpp:327 Total Global memory: 10.7319 GB
    2020-01-17T11:47:47.422824 1 19898 CudaMgr.cpp:329 Memory clock (khz): 7000000
    2020-01-17T11:47:47.422828 1 19898 CudaMgr.cpp:330 Memory bandwidth: 308 GB/sec
    2020-01-17T11:47:47.422832 1 19898 CudaMgr.cpp:333 Constant Memory: 65536

    You can also use the \memory_summary command to determine how many GPUs the server is using. Please refer to the logs and monitoring doc page.

    Can you share the query you are trying to run and the DDLs of the underlying objects?