Discussions


Problem with joining two tables

  • 1.  Problem with joining two tables

    Posted 09-12-2019 03:12

    Hello,

    I'm going to try to explain my problem as best I can.

    So, basically, table A has 588360324 rows, and table B has 146991045. Simply filtering table A for sID = 999888 (select count(*) from A where sID=999888) returns 35424 results.

    When I try to execute this query:


    select sum(A.val) val1,A.sDate,A.sID from A join B on B.sID=A.sID and B.id=A.bID where A.sID=999888 group by A.sDate,A.sID;

    I'm getting an error:

    Projection query output result set on table(s): B would contain 146991045 rows, which is more than the current system limit of 128000000.

    Is there any option or solution I can use to solve my problem and execute the query properly? Maybe I can change the limit? When I query both tables separately (without joins) there is no problem.

    My hardware is a PowerEdge R720 with 2x K80 (seen as 4 cards with 11 GB RAM each), a 40-core processor, and 128 GB RAM. I'm using the open-source version.


    #Uncategorized
    #Core
    #General


  • 2.  RE: Problem with joining two tables

    Posted 09-12-2019 04:21
    Hi,

    it's the watchdog exception that's blocking your query, so you can disable the watchdog and then retry running it.
    To disable it, add the line enable-watchdog=false to your omnisci.conf file, then restart the server.
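
    For reference, the relevant part of omnisci.conf would look something like this (the port and data path here are just the defaults; keep whatever you already have):

    port = 6274
    data = "/var/lib/omnisci/data"
    enable-watchdog = false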

    That said, this looks like a bug to me, because I don't see any reason why there would be an intermediate projection of the B table in such a simple query.

    Have you tried sharding the tables? Sharding could be beneficial for performance.


  • 3.  RE: Problem with joining two tables

    Posted 09-12-2019 05:02
    Edited by Archi K 09-12-2019 05:06
    Hi!

    Thank You for the answer. Now the server is trying to execute the query. After a couple of seconds I'm getting an error:

    Exception: Fatal error while attempting to build hash tables for join: Hash tables with more than 2B entries not supported yet

    I'm new to omnisci, to be honest, and I'm testing its capabilities. I've read about sharded tables but haven't used them yet. Do You think this error is caused by the fact that my tables are not sharded? If so, I will try sharding later, but as I understand it, I would have to reload my data into the sharded tables. I planned to shard my tables later, at the performance optimization stage, but I'm struggling to perform such simple queries, which is frustrating :)


  • 4.  RE: Problem with joining two tables

    Posted 09-12-2019 06:08
    Hi,

    table sharding was meant as a suggestion to improve performance, so I can't say that the query will run once the tables are sharded.

    The fact that you are waiting a couple of seconds before getting an error doesn't sound good. I run queries like this without any problem:

    select r_name,n_name,sum(o_totalprice), count(*) from orders_1500m join customers_150m on o_custkey = c_custkey join nation on n_nationkey = c_nationkey join region on n_regionkey=r_regionkey group by 1,2

    in less than 1000 ms (filtered or not) with just 2 gaming GPUs. The big difference is that you are using two keys to join the tables. Could you add verbose=true and enable-debug-timer=true, re-run the query, and share the [mapd_storage]/data/mapd_log/omnisci_server.INFO, to better understand what's happening?
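
    Both settings just go into omnisci.conf as well (then restart the server), something like:

    verbose = true
    enable-debug-timer = true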


  • 5.  RE: Problem with joining two tables

    Posted 09-14-2019 03:42
    Edited by Archi K 09-14-2019 03:44
    Hello!

    I've performed those operations and attached the requested file.

    omnisci_server.INFO contents:

    https://pastebin.com/QyLuPjqC

    If there is no way to solve this problem, the next thing I'm thinking about is using one key instead of two, as You mentioned, and checking whether that gives any results. To do this I have to denormalize the data and merge those two keys into one composed of both of them, as I've done before to analyse the data in Analysis Services. Or maybe put those two tables together into one (all columns together), then import the data to omnisci and analyse it there.

    But now I'll just wait for You, maybe there is an answer in my INFO file :)

    I want to thank You for Your help and great commitment to my issue!




  • 6.  RE: Problem with joining two tables

    Posted 09-14-2019 23:51
    Hi @Archi K,

    thanks for posting the log and sorry for the late reply

    Enabling filter pushdown, by setting the parameter enable-filter-push-down to true, could help lower the cardinality of the join and make the query run, but query performance will not be optimal because of the large projection on the right table; so it would be better, if possible, to have a single key column to join such high-cardinality tables.
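
    If you want to try it, the parameter goes into omnisci.conf like the watchdog one, then restart the server:

    enable-filter-push-down = true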

    If the filter pushdown works, you can do the ETL work needed inside the omnisci database, using the rowid pseudo column that every omnisci table has.

    So you can do a CREATE TABLE AS SELECT (CTAS) for the bigger table using a narrow key range, then complete the load with some INSERT AS SELECT (IAS) statements, e.g.

    create table a_new as
    select f1,f2,f3...,fn, b.rowid as fk_to_table_b
    from a join b on b.sID=a.sID and b.id=a.bID
    where b.sID between 1 and 10;

    insert into a_new
    select f1,f2,f3...,fn, b.rowid as fk_to_table_b
    from a join b on b.sID=a.sID and b.id=a.bID
    where b.sID between 11 and 20;

    and so on.

    Then do the same for the b table, without the join.

    As I said before, shard the tables on the newly created FK/PK; also, changing the fragment size to 64000000 would help speed up the join. A sketch follows.
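
    Just to sketch what I mean (the column names are placeholders, and shard_count should match the number of GPUs):

    CREATE TABLE a_new (
      fk_to_table_b BIGINT,
      f1 DOUBLE,
      f2 TEXT ENCODING DICT(32),
      SHARD KEY (fk_to_table_b)
    ) WITH (SHARD_COUNT = 4, FRAGMENT_SIZE = 64000000);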


  • 7.  RE: Problem with joining two tables

    Posted 09-16-2019 04:17
    Hello @Candido Dessanti,

    Unfortunately, enabling the filter pushdown setting doesn't work either; the exception message is the same. So the next thing I'm going to do is use one pregenerated key instead of those two, as I did in MS Analysis Services. It has already been successfully tested for data integrity, so that step is behind me :) I'll be back with info on whether it works.

    Again, thank You for Your help!



  • 8.  RE: Problem with joining two tables

    Posted 09-17-2019 03:46
    Hi,

    I'm sorry it's not working; I'll try to reproduce the problem with some tests, because I don't have any dataset with two join keys of such cardinality.

    With just one key I don't see any reason why it shouldn't work.


  • 9.  RE: Problem with joining two tables

    Posted 09-17-2019 09:49
    Hello!

    I think it will be fine. For now I have to postpone the tests a little, as I have to prepare the data and set up replication, and that requires some processes to finish first. I'll write a post as soon as I finish setting up the one-key omnisci test database.

    All the best!


  • 10.  RE: Problem with joining two tables

    Posted 16 days ago
    Edited by Archi K 15 days ago
    Hello again!

    So I've performed some data integration to avoid any joins, and now I have one single large data table with about 650M records. But even relatively simple queries perform terribly slowly :(

    For example:

    SELECT SUM(VAL_FIELD),KEY1_FIELD,cast(DATE_FIELD as date) as DATE_FIELD,KEY_FIELD2 FROM DATA_TABLE WHERE cast(DATE_FIELD as date)>='2015-1-1' AND cast(DATE_FIELD as date)<='2016-1-1' GROUP BY KEY1_FIELD,DATE_FIELD,KEY_FIELD2

    runs for a long time, whereas in SQLServer I get an answer almost immediately. When I try to put the results into another table using CTAS, this query:

    create table TEMP_COUNTS as (SELECT SUM(VAL_FIELD),KEY1_FIELD,cast(DATE_FIELD as date) as DATE_FIELD,KEY_FIELD2 FROM DATA_TABLE WHERE cast(DATE_FIELD as date)>='2015-1-1' AND cast(DATE_FIELD as date)<='2016-1-1' GROUP BY KEY1_FIELD,DATE_FIELD,KEY_FIELD2);

    runs forever and never ends; the server hangs, does nothing, and interrupting the query results in this message:

    Interrupt signal (2) received.
    Asking server to interrupt query.

    And nothing happens.

    Every time I run a query, nvidia-smi (watched every 0.2 sec) shows 0% GPU usage almost all the time and about 30% GPU memory usage, but the system performance monitor shows 100% CPU usage.

    When I tried SELECT COUNT(*) FROM DATA_TABLE; I got the result in roughly the same time as in SQLServer, but GPU RAM usage was 100%.

    I have 2.5 TB of free disk space, a reinstalled Ubuntu OS, and a fresh omnisci installation. Is there any possible move to get this to work? Maybe the data table cardinality is too big? If so, what is the optimal record count for OmniSci? Maybe I can't operate on such large datasets? Or maybe my hardware is too weak for my db?

    I'm loading the data directly from SQLServer using the command line, and it creates the table automatically. I could create the table on my own, but I'm not sure whether I can shard it, as I have multiple grouping keys, and I also don't know whether sharding would push things forward.

    As I mentioned above, my hardware is PowerEdge R720 with 2xK80 (seen as 4 cards 11 GB RAM each), 40 core processor, 128 GB RAM.




    Apart from this problem, can You give me advice on the best way to quickly update/insert records that were recently changed or inserted in the source database?

    1. Per-record insert/update using my own external integration solution, keeping records up to date (but possibly running into performance issues doing so)
    or
    2. Bulk preloading the whole table daily/weekly/monthly... (but losing data freshness and possibly having issues when the automated data import fails)
    3. ? :)


    Best regards!


  • 11.  RE: Problem with joining two tables

    Posted 15 days ago
    Hi @Archi K,

    Thanks for continuing to test the OmniSci database.

    I did some tests with different reasonable cardinalities on an unbalanced and un-optimized table containing nearly 500M records in just 5 fragments, so queries running on the CPU will use up to 5 cores. My configuration is a 12-core Threadripper processor and 2 Nvidia RTX 2080 Ti graphics cards.

    omnisql> select count(*) from table_data
    EXPR$0
    497551202
    1 row returned.
    Execution time: 15 ms, Total time: 15 ms
    
    omnisql> select dimension_1k,dimnesion_20k,cast(timestamp_field as date),sum(measure_float) from adsb_airplanes_test where cast(PosTimeDate as date) >= '2018-01-01' and cast(PosTimeDate as date) <= '2019-01-01' group by 1,2,3;
    [cut]
    125652 rows returned.
    Execution time: 330 ms, Total time: 542 ms
    
    omnisql> select dimension_400k,dimnesion_20k,cast(timestamp_field as date),sum(measure_float) from adsb_airplanes_test where cast(PosTimeDate as date) >= '2018-01-01' and cast(PosTimeDate as date) <= '2019-01-01' group by 1,2,3;
    [cut]
    1381512 rows returned.
    Execution time: 484 ms, Total time: 2776 ms
    
    ​


    The discrepancy between Execution and Total time is because the database is projecting the columnar results and transforming them into rows for the client; this kind of job runs on the CPU and incurs some serialization, so on high-cardinality queries you will likely see the CPU quite busy with that kind of task.

    Due to the cardinality, the number of keys, and the kind of data, those queries are using a baseline hash algorithm, which is the least performant one for group bys.

    Removing a key, the query will use a perfect hash instead, with some speedup, e.g.

    omnisql> select dimension_1k,cast(timestamp_field as date),sum(measure_float) from adsb_airplanes_test where cast(PosTimeDate as date) >= '2018-01-01' and cast(PosTimeDate as date) <= '2019-01-01' group by 1,2;
    [cut]
    6660 rows returned.
    Execution time: 139 ms, Total time: 149 ms
    
    

    So the speed of the queries depends on various factors: the cardinality, the number of results projected, the kind of data used, and how the tables are built (balanced so the work is spread evenly across the GPUs; in my case, the first GPU is processing 2/3 of the data, so the execution time is far from perfect).

    I guess the problems you are facing are caused by the cardinality; I'm saying that because the queries look like they run entirely on the CPU after the GPU run has failed, the estimator query having returned a result too big to be run on a GPU fragment.

    Anyway, I could be wrong, so I suggest posting the DDLs and the estimated cardinalities returned by the queries, and turning on verbose logging by adding

    verbose=true

    to the omnisci.conf file, then posting an extract of the omnisci_server.INFO log file.
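
    To get a rough idea of those cardinalities, a query like this on your table is enough (column names taken from your earlier query; approx_count_distinct can be used instead if the exact counts take too long):

    select count(distinct KEY1_FIELD),
           count(distinct KEY_FIELD2),
           count(distinct cast(DATE_FIELD as date))
    from DATA_TABLE;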









  • 12.  RE: Problem with joining two tables

    Posted 14 days ago
    Hello @Candido Dessanti,

    Thank You very much for Your effort!

    I did some experiments following Your advice. I think my performance problem is now solved: I've removed some of the group bys, limited the query results, and now for 650M records I'm getting around 3.5 secs, which is totally fine for me. Now I have to rethink my analytic reports and try to optimize them for omnisci instead of SQLServer.

    My biggest problem for now is the second part of my last post. I have to keep the data in omnisci up to date somehow. My source records in sqlserver are updated on a daily basis, so ideally I want to have at most one day of data delay. Is there any way to achieve this other than reloading all the data into omnisci? I'm asking because, despite the fact that my SQLServer is located on the same physical machine as omnisci, my data had to be denormalized to avoid joins, as we discussed before, and it takes something like 10 hrs to load all of it into omnisci. Unfortunately I can't use any "record active" column on the source, as this is the second system's architecture and it has to stay like that because of other, mostly historical, dependencies and compatibilities :( Any ideas? :)

    Best regards!


  • 13.  RE: Problem with joining two tables

    Posted 7 days ago
    Hello,

    I'm sorry for reopening the thread, but maybe my last post was missed somehow.

    1. Does anybody have an idea how to achieve fast and reliable synchronization between MSSQL Server and omnisci? :)

    2. Next thing: I've reuploaded the data to keep it up to date and almost nothing changed (I've added a few columns, but I'm not using them in queries for now - I'm going to use them later), but I can't get it to use the GPU :( The EXPLAIN function says IR for the CPU, the \gpu command does nothing, and any executed query runs on the CPU. Is there any chance to override some other settings to make the server use the GPU rather than the CPU? Any idea why the server was using the GPU normally for the same queries before and now it won't?

    All the best!


  • 14.  RE: Problem with joining two tables

    Posted 6 days ago
    Edited by Candido Dessanti 6 days ago
    Hi @Archi K,
    I'm sorry, I forgot to reply to the ETL question. Still, without knowing how the SQL Server tables are populated (you wrote it is a replica of another SQL Server) and which steps you are taking to flatten your data, it's challenging.

    There are several tools to import data into the omnisci database without the need to write custom code (you can find the documentation here), but the only one that can read data while pushing transformations directly from MSSQL Server is the SQLImporter.

    The performance of this tool isn't the best (on a 25-column table it's around 17000 rows per second from an Oracle 19c database), because the operation is serialized. To improve the throughput you have to spawn multiple processes at once manually (e.g. 4 processes increase the throughput to 55000 rows per second).

    If you are using other tools such as an ETL, you can think about using the StreamImporter tool, which takes a CSV stream as input and has a throughput of 50000 rows per second for a single thread and 157000 for four threads; probably more with more threads, although it doesn't scale linearly.

    If you can use the CDC feature of MSSQL Server, so that you are able to track changes to the tables, you can speed up the loading, depending on the number of updates (single-row updates take a lot of time).
    An idea could be to move the rows to be updated onto the omnisci server, then do a massive delete and an insert as select (for 100000 records the whole operation takes less than 1 second), but you will have to rebuild your table periodically to reclaim disk space and restore optimal performance.
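
    A rough sketch of that idea, assuming a hypothetical staging table changed_rows with the same layout as a hypothetical target table flat_sales keyed on rec_id, and that a DELETE with an IN subquery behaves well in your version:

    -- load only the changed/new rows into changed_rows first (COPY, StreamImporter, ...)
    delete from flat_sales
    where rec_id in (select rec_id from changed_rows);
    insert into flat_sales select * from changed_rows;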


    About the problem of no GPU usage: first, you have to check that the server is starting in GPU mode.
    When the server starts, it writes the number of GPUs detected and the detailed GPU info to the log file ($OMNISCI_STORAGE/mapd_log/omnisci_server.INFO):
    2020-01-17T11:47:46.893972 I 19898 MapDServer.cpp:1117 OmniSci started with data directory at '/opt/mapd_storage/data50'
    2020-01-17T11:47:46.893979 I 19898 MapDServer.cpp:1126  Watchdog is set to false
    2020-01-17T11:47:46.893983 I 19898 MapDServer.cpp:1127  Dynamic Watchdog is set to false
    2020-01-17T11:47:46.893987 I 19898 MapDServer.cpp:1132  Debug Timer is set to true
    2020-01-17T11:47:46.893990 I 19898 MapDServer.cpp:1134  Maximum Idle session duration 60
    2020-01-17T11:47:46.893994 I 19898 MapDServer.cpp:1136  Maximum active session duration 43200
    2020-01-17T11:47:46.895594 I 19898 MapDHandler.cpp:228 OmniSci Server 5.0.1-xxxxxxxxxx-xxxxxxxxxx
    2020-01-17T11:47:46.895969 1 19898 MapDServer.cpp:1350 heartbeat thread starting
    2020-01-17T11:47:47.422740 I 19898 CudaMgr.cpp:318 Using 2 Gpus.
    2020-01-17T11:47:47.422780 1 19898 CudaMgr.cpp:320 Device: 0
    2020-01-17T11:47:47.422785 1 19898 CudaMgr.cpp:321 UUID: b3a74fb2-2742-8933-ca7f-fe9bf9ce7b7b
    2020-01-17T11:47:47.422790 1 19898 CudaMgr.cpp:322 Clock (khz): 1590000
    2020-01-17T11:47:47.422793 1 19898 CudaMgr.cpp:323 Compute Major: 7
    2020-01-17T11:47:47.422797 1 19898 CudaMgr.cpp:324 Compute Minor: 5
    2020-01-17T11:47:47.422801 1 19898 CudaMgr.cpp:325 PCI bus id: 9
    2020-01-17T11:47:47.422804 1 19898 CudaMgr.cpp:326 PCI deviceId id: 0
    2020-01-17T11:47:47.422808 1 19898 CudaMgr.cpp:327 Total Global memory: 10.7319 GB
    2020-01-17T11:47:47.422824 1 19898 CudaMgr.cpp:329 Memory clock (khz): 7000000
    2020-01-17T11:47:47.422828 1 19898 CudaMgr.cpp:330 Memory bandwidth: 308 GB/sec
    2020-01-17T11:47:47.422832 1 19898 CudaMgr.cpp:333 Constant Memory: 65536
    ​

    You can also use the \memory_summary command to determine how many GPUs the server is using. Please refer to the logs and monitoring doc page.
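
    For a quick check, assuming the default log location mentioned above, you can just grep the startup messages:

    grep -E "Started in|Using .* Gpus" $OMNISCI_STORAGE/mapd_log/omnisci_server.INFO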

    Can you share the query you are trying to run and the DDLs of the underlying objects?


  • 15.  RE: Problem with joining two tables

    Posted 3 days ago

    Thank You for Your reply!

    I'll try to describe our architecture first.

    So, basically, our architecture looks like this:

    1. We have an MSSQL Server which holds daily-updated sales and purchases data.
    2. This server is replicated 1:1 to our Linux MSSQL instance.
    3. The OmniSci server.

    Server (2) is an MSSQL instance that exists only to avoid a data transport layer over the internet between OmniSci and MSSQL; the second reason is that (1) is usually under heavy load. So (2) and (3) are on the same machine and the same OS, and I'm bulk loading data directly from (2) to (3). I want the data to be as up to date as possible, so I have to develop some kind of automated process for syncing (2) and (3); the most satisfying option for me would be data that is no more than one day out of date. Thanks to Your advice I'm going to check whether the CSV StreamImporter performs better. I have to include an MSSQL step for creating the file (or, because (2) and (3) are on the same machine, I could use a table as a file in mssql; that's a subject for a case study). I'm planning some experiments, so I'll be back with results in some time.
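
    For example, I could export the flattened table with something like bcp (the database, table, server, credentials and path below are placeholders only):

    # placeholders only - adjust the query, output path, server and credentials
    bcp "SELECT * FROM SalesDB.dbo.FlatSales" queryout /tmp/flat_sales.csv -c -t"," -S localhost -U sa -P secret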

     

    Next thing is my low GPU usage in OmniSci. This is how the server starts (.INFO file):

     

    2020-01-20T10:09:09.864805 I 1703 MapDServer.cpp:802 OmniSci started with data directory at '/var/lib/omnisci/data'
    2020-01-20T10:09:09.864961 I 1703 MapDServer.cpp:803  Watchdog is set to true
    2020-01-20T10:09:09.864978 I 1703 MapDServer.cpp:804  Dynamic Watchdog is set to false
    2020-01-20T10:09:09.864987 I 1703 MapDServer.cpp:809  Debug Timer is set to false
    2020-01-20T10:09:09.864994 I 1703 MapDServer.cpp:811  Maximum Idle session duration 60
    2020-01-20T10:09:09.865002 I 1703 MapDServer.cpp:813  Maximum active session duration 43200
    2020-01-20T10:09:09.868080 I 1703 MapDHandler.cpp:217 OmniSci Server 5.0.1-20191118-4c7b6d5ad6
    2020-01-20T10:09:10.444865 I 1703 CudaMgr.cpp:318 Using 4 Gpus.
    2020-01-20T10:09:10.444978 I 1703 DataMgr.cpp:111 cpuSlabSize is 4096M
    2020-01-20T10:09:10.445003 I 1703 DataMgr.cpp:113 reserved GPU memory is 128M includes render buffer allocation
    2020-01-20T10:09:10.445017 I 1703 DataMgr.cpp:127 gpuSlabSize is 2048M
    2020-01-20T10:09:10.445029 I 1703 DataMgr.cpp:127 gpuSlabSize is 2048M
    2020-01-20T10:09:10.445039 I 1703 DataMgr.cpp:127 gpuSlabSize is 2048M
    2020-01-20T10:09:10.445049 I 1703 DataMgr.cpp:127 gpuSlabSize is 2048M
    2020-01-20T10:09:10.445211 I 1703 FileMgr.cpp:204 Completed Reading table's file metadata, Elapsed time : 0ms Epoch: 0 files read: 0 table location: '/var/lib/omnisci/data/mapd_data/table_0_0'
    2020-01-20T10:09:10.445257 I 1703 Calcite.cpp:282 Creating Calcite Handler,  Calcite Port is 6279 base data dir is /var/lib/omnisci/data
    2020-01-20T10:09:10.445270 I 1703 Calcite.cpp:200 Running calcite server as a daemon
    2020-01-20T10:09:10.867840 I 1703 Calcite.cpp:239 Calcite server start took 400 ms
    2020-01-20T10:09:10.867901 I 1703 Calcite.cpp:240 ping took 19 ms
    2020-01-20T10:09:10.873781 I 1703 MapDHandler.cpp:305 Started in GPU mode

     

    And then, when I'm trying to perform such a simple query:

    select sum(SalesAmount),SaleClientID from SalesTable group by SaleClientID;

    nvidia-smi, watched every 0.1 sec, always shows 0% GPU usage, and \memory_summary looks like this:

    OmniSci Server CPU Memory Summary:
                MAX            USE      ALLOCATED           FREE
       103104.21 MB     7265.44 MB     8192.00 MB      926.56 MB

    OmniSci Server GPU Memory Summary:
    [GPU]            MAX            USE      ALLOCATED           FREE
      [0]    12078.62 MB        0.00 MB        0.00 MB        0.00 MB
      [1]    12078.62 MB        0.00 MB        0.00 MB        0.00 MB
      [2]    12078.62 MB        0.00 MB        0.00 MB        0.00 MB
      [3]    12078.62 MB        0.00 MB        0.00 MB        0.00 MB

    When I'm doing some counts (select count(*) from SalesTable;), nvidia-smi shows the GPUs involved for 1-2 secs:

    result = 634864255

    And then \memory_summary:

    OmniSci Server CPU Memory Summary:
                MAX            USE      ALLOCATED           FREE
       103104.21 MB     7265.44 MB     8192.00 MB      926.56 MB

    OmniSci Server GPU Memory Summary:
    [GPU]            MAX            USE      ALLOCATED           FREE
      [0]    12078.62 MB        0.00 MB     2048.00 MB     2048.00 MB
      [1]    12078.62 MB        0.00 MB     2048.00 MB     2048.00 MB
      [2]    12078.62 MB        0.00 MB     2048.00 MB     2048.00 MB
      [3]    12078.62 MB        0.00 MB     2048.00 MB     2048.00 MB

    This is my DDL:

    CREATE TABLE SalesTable (
    SaleClientID DECIMAL(9,0) ENCODING FIXED(32),
    SalePartnerID DECIMAL(15,0),
    SaleDecField1 DECIMAL(9,0) ENCODING FIXED(32),
    SaleDecField2 DECIMAL(9,0) ENCODING FIXED(32),
    SaleDecField3 DECIMAL(9,0) ENCODING FIXED(32),
    SaleProductID DECIMAL(15,0),
    SaleDocID DECIMAL(15,0),
    SaleType DECIMAL(2,0) ENCODING FIXED(16),
    SaleDecField4 DECIMAL(2,0) ENCODING FIXED(16),
    SalePaymentType TEXT ENCODING DICT(32),
    SaleDisplayNumber DECIMAL(14,0),
    SaleRecNumber DOUBLE,
    SaleDecField5 DECIMAL(5,0) ENCODING FIXED(32),
    SaleDate TIMESTAMP(0),
    SalePrice1 DOUBLE,
    SalePrice2 DOUBLE,
    SaleTax DOUBLE,
    SalePurTax DOUBLE,
    SalesAmount DOUBLE,
    SaleDoubleField1 DOUBLE,
    SaleDoubleField2 DOUBLE,
    SaleLimit DOUBLE,
    SaleGroup DECIMAL(1,0) ENCODING FIXED(16),
    SaleDoubleField3 DOUBLE,
    SaleTime DECIMAL(5,0) ENCODING FIXED(32),
    SaleDoubleField4 DOUBLE,
    SaleIsInvClosed TEXT ENCODING DICT(32),
    SaleIsValid DECIMAL(1,0) ENCODING FIXED(16),
    SaleIsCancelled DECIMAL(1,0) ENCODING FIXED(16),
    SaleRecDate TIMESTAMP(0),
    SaleCommonRecordID INTEGER,
    SaleDoubleField5 DOUBLE,
    SaleTimestampField1 TIMESTAMP(0),
    SaleTimestampField2 TIMESTAMP(0),
    SaleIsIncluded SMALLINT,
    SaleID DECIMAL(15,0),
    SaleMultiplier DOUBLE,
    SaleBigintField1 BIGINT,
    SaleBigintField2 BIGINT,
    SaleTimestampField3 TIMESTAMP(0),
    SaleIsOnline BOOLEAN,
    SaleAmountMlt DOUBLE,
    PurchaseID DECIMAL(15,0),
    PurchaseClientID DECIMAL(9,0) ENCODING FIXED(32),
    PurchaseDocID DECIMAL(15,0),
    PurchaseProductID DECIMAL(15,0),
    PurchaseSourceID DECIMAL(15,0),
    PurchaseProductCode TEXT ENCODING DICT(32),
    PurchaseAmount DOUBLE,
    PurchaseAmount2 DOUBLE,
    PurchaseStoreAmount DOUBLE,
    PurchaseMultiplier DOUBLE,
    PurchasePrice DOUBLE,
    PurchasePrice2 DOUBLE,
    PurchasePrice3 DOUBLE,
    PurchaseTax DOUBLE,
    PurchaseTax2 DOUBLE,
    PurchaseTimestampField4 TIMESTAMP(0),
    PurchaseRecType DECIMAL(1,0) ENCODING FIXED(16),
    PurchaseDocDate TIMESTAMP(0),
    PurchaseCode TEXT ENCODING DICT(32),
    PurchaseIsValid DECIMAL(1,0) ENCODING FIXED(16),
    PurchaseDate TIMESTAMP(0),
    PurchaseType TEXT ENCODING DICT(32),
    PurchaseIsCancelled DECIMAL(1,0) ENCODING FIXED(16),
    PurchaseRecTimestamp TIMESTAMP(0),
    PurchasePrice4 DOUBLE,
    PurchasePrice5 DOUBLE,
    PurchasePrice6 DOUBLE,
    PuchasePrice7 DOUBLE,
    PurchaseTextInfo TEXT ENCODING DICT(32),
    PurchasePromo DOUBLE,
    PurchaseAmount3 DOUBLE,
    PurchaseProductObs DECIMAL(1,0) ENCODING FIXED(16),
    PurchasePrice8 DOUBLE,
    PurchaseTextCode TEXT ENCODING DICT(32),
    PurchaseCommonRecordID INTEGER,
    PurchaseTimestamp TIMESTAMP(0),
    PurchasePrice8 DOUBLE,
    PurchasePrice9 DOUBLE,
    PurchaseIsFactory BOOLEAN,
    PurchaseIsIncluded SMALLINT,
    ProductID DECIMAL(15,0),
    ProductClientID DECIMAL(9,0) ENCODING FIXED(32),
    ProductCodePr TEXT ENCODING DICT(32),
    ProductEAN TEXT ENCODING DICT(32),
    ProductRecTimestamp TIMESTAMP(0),
    ProductCommonRecordID INTEGER,
    ProductTimestamp TIMESTAMP(0),
    ProductBigintField3 BIGINT,
    ProductBigintField4 BIGINT,
    ProductCharCode TEXT ENCODING DICT(32),
    ProductRecTimestamp TIMESTAMP(0),
    SalePrice3 DOUBLE,
    SalePrice4 DOUBLE,
    PurchasePrice10 DOUBLE,
    PurchasePrice11 DOUBLE,
    PurchasePrice12 DOUBLE,
    PurchasePrice13 DOUBLE,
    SalePrice5 DOUBLE,
    CommonRecordID BIGINT
    )




  • 16.  RE: Problem with joining two tables

    Posted 2 days ago
    Hi @Archi K,

    About the loading: my idea was to stream the CSV while it is being generated, without any file materialization; if you are going to create a file, omnisql's copy command is the faster option because it runs in parallel (the throughput is around 1.5M rows per second on my dataset with a 12c/24t processor using compressed files, more than 4M using uncompressed files).
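
    For example, once the file has been written (the path and options here are just an example):

    copy SalesTable from '/tmp/flat_sales.csv' with (delimiter=',', header='false');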

    With Oracle I can do that just using the sqlplus command line, and I can do something similar using an ETL like Informatica PowerCenter with a command on a file-writer object with any database; I'm not sure how to do it with MSSQL Server without external tools.

    About the query running on the CPU: it's strange, and it reminds me of a problem we had with FP64 on Kepler hardware in mapd 3.0, but that should have been fixed a long time ago (in the 3.1 release). Could you try to run the same query with a sum against an integer datatype? I'll try to reproduce this on an instance running Kepler hardware ASAP.



  • 17.  RE: Problem with joining two tables

    Posted 2 days ago
    Hello!

    Thank You again :) Yes, I'm planning to use pregenerated CSV files and then do some throughput experiments. Thank You for Your support in this case.


    GPU/CPU issue:

    The same query against integer runs fine in GPU mode:

    select sum(SaleCommonRecordID),SaleClientID from SalesTable group by SaleClientID;

    The memory summary and nvidia-smi show slightly higher usage than with the count(*) query. Timing:
    7768 rows returned.
    Execution time: 346 ms, Total time: 623 ms

    Which is totally fine. So there is an issue with the double fields, and Your guess was probably just right. I'll wait for Your workaround/investigation :)


  • 18.  RE: Problem with joining two tables

    Posted 2 days ago
    Hi @Archi K,

    I did some checking internally, and on Kepler these aggregations are done in CPU mode because of the lack of a native instruction to manage atomic sums on global memory.

    Maybe a solution would be to change from doubles to decimal fields (or floats); I changed the DDL of your table to remove the doubles, and changed some other datatypes to recover some memory and performance. Timestamps with fixed encoding have a range of 1901-12-13 20:45:53 to 2038-01-19 03:14:07, but they take just 4 bytes.

    CREATE TABLE SalesTable (
    SaleClientID INTEGER,
    SalePartnerID BIGINT,
    SaleDecField1 INTEGER,
    SaleDecField2 INTEGER,
    SaleDecField3 INTEGER,
    SaleProductID BIGINT,
    SaleDocID BIGINT,
    SaleType TINYINT,
    SaleDecField4 TINYINT,
    SalePaymentType TEXT ENCODING DICT(32),
    SaleDisplayNumber BIGINT,
    SaleRecNumber DECIMAL(18,5),
    SaleDecField5 INTEGER,
    SaleDate TIMESTAMP(0) ENCODING FIXED(32),
    SalePrice1 DECIMAL(18,5),
    SalePrice2 DECIMAL(18,5),
    SaleTax DECIMAL(18,5),
    SalePurTax DECIMAL(18,5),
    SalesAmount DECIMAL(18,5),
    SaleDoubleField1 DECIMAL(18,5),
    SaleDoubleField2 DECIMAL(18,5),
    SaleLimit DECIMAL(18,5),
    SaleGroup TINYINT,
    SaleDoubleField3 DECIMAL(18,5),
    SaleTime INTEGER,
    SaleDoubleField4 DECIMAL(18,5),
    SaleIsInvClosed TEXT ENCODING DICT(32),
    SaleIsValid TINYINT,
    SaleIsCancelled TINYINT,
    SaleRecDate TIMESTAMP(0) ENCODING FIXED(32),
    SaleCommonRecordID INTEGER,
    SaleDoubleField5 DECIMAL(18,5),
    SaleTimestampField1 TIMESTAMP(0) ENCODING FIXED(32),
    SaleTimestampField2 TIMESTAMP(0) ENCODING FIXED(32),
    SaleIsIncluded SMALLINT,
    SaleID BIGINT,
    SaleMultiplier DECIMAL(18,5),
    SaleBigintField1 BIGINT,
    SaleBigintField2 BIGINT,
    SaleTimestampField3 TIMESTAMP(0) ENCODING FIXED(32),
    SaleIsOnline BOOLEAN,
    SaleAmountMlt DECIMAL(18,5),
    PurchaseID BIGINT,
    PurchaseClientID INTEGER,
    PurchaseDocID BIGINT,
    PurchaseProductID BIGINT,
    PurchaseSourceID BIGINT,
    PurchaseProductCode TEXT ENCODING DICT(32),
    PurchaseAmount DECIMAL(18,5),
    PurchaseAmount2 DECIMAL(18,5),
    PurchaseStoreAmount DECIMAL(18,5),
    PurchaseMultiplier DECIMAL(18,5),
    PurchasePrice DECIMAL(18,5),
    PurchasePrice2 DECIMAL(18,5),
    PurchasePrice3 DECIMAL(18,5),
    PurchaseTax DECIMAL(18,5),
    PurchaseTax2 DECIMAL(18,5),
    PurchaseTimestampField4 TIMESTAMP(0) ENCODING FIXED(32),
    PurchaseRecType TINYINT,
    PurchaseDocDate TIMESTAMP(0) ENCODING FIXED(32),
    PurchaseCode TEXT ENCODING DICT(32),
    PurchaseIsValid TINYINT,
    PurchaseDate TIMESTAMP(0) ENCODING FIXED(32),
    PurchaseType TEXT ENCODING DICT(32),
    PurchaseIsCancelled TINYINT,
    PurchaseRecTimestamp TIMESTAMP(0) ENCODING FIXED(32),
    PurchasePrice4 DECIMAL(18,5),
    PurchasePrice5 DECIMAL(18,5),
    PurchasePrice6 DECIMAL(18,5),
    PuchasePrice7 DECIMAL(18,5),
    PurchaseTextInfo TEXT ENCODING DICT(32),
    PurchasePromo DECIMAL(18,5),
    PurchaseAmount3 DECIMAL(18,5),
    PurchaseProductObs TINYINT,
    PurchasePrice8 DECIMAL(18,5),
    PurchaseTextCode TEXT ENCODING DICT(32),
    PurchaseCommonRecordID INTEGER,
    PurchaseTimestamp TIMESTAMP(0) ENCODING FIXED(32),
    PurchasePrice9 DECIMAL(18,5),
    PurchaseIsFactory BOOLEAN,
    PurchaseIsIncluded SMALLINT,
    ProductID BIGINT,
    ProductClientID INTEGER,
    ProductCodePr TEXT ENCODING DICT(32),
    ProductEAN TEXT ENCODING DICT(32),
    ProductCommonRecordID INTEGER,
    ProductTimestamp TIMESTAMP(0) ENCODING FIXED(32),
    ProductBigintField3 BIGINT,
    ProductBigintField4 BIGINT,
    ProductCharCode TEXT ENCODING DICT(32),
    ProductRecTimestamp TIMESTAMP(0) ENCODING FIXED(32),
    SalePrice3 DECIMAL(18,5),
    SalePrice4 DECIMAL(18,5),
    PurchasePrice10 DECIMAL(18,5),
    PurchasePrice11 DECIMAL(18,5),
    PurchasePrice12 DECIMAL(18,5),
    PurchasePrice13 DECIMAL(18,5),
    SalePrice5 DECIMAL(18,5),
    CommonRecordID BIGINT)
    I think you can fine-tune some of the decimals to recover more memory.
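
    For example, if a measure never needs more than 9 digits of precision, something like

    SalePrice1 DECIMAL(9,2) ENCODING FIXED(32),

    uses 4 bytes instead of the 8 bytes of the DECIMAL(18,5) version (assuming the values really fit).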

    To avoid wasting too much time while trying the new DDL, you can create the new table and populate it with copy commands:

    copy (select * from SalesTable where rowid between 0 and 10000000) to '/temp/temp_export';
    copy SalesTableNew from '/temp/temp_export' with (header='false');
    copy (select * from SalesTable where rowid between 10000001 and 20000000) to '/temp/temp_export';
    copy SalesTableNew from '/temp/temp_export' with (header='false');
    copy (select * from SalesTable where rowid between 20000001 and 30000000) to '/temp/temp_export';
    copy SalesTableNew from '/temp/temp_export' with (header='false');
    
    and so on, then try your queries.