Expand all | Collapse all

Column limit

  • 1.  Column limit

    Posted 05-11-2019 10:06

    I have been reading the documentation trying to learn about the database's limit.
    Is there a column limit ? 
    Row limit ?
    Any other limit such as row size.  text field size etc. 
    Does the entire table needs to fit in memory on the GPU. Or there is streaming happening ? 
    How many tables can be joined ? 


  • 2.  RE: Column limit

    Posted 05-11-2019 20:55
    Hi @Jeremy Villalobos

    OmniSci treats the combined memory (VRAM) of all the GPUs on a server as its lowest level cache, i.e. it tries to keep compressed versions of hot partitions of hot columns in GPU memory whenever possible. This approach can easily scale to working sets in the terabytes on a single node. For those cases when the working set cannot entirely fit in GPU memory, OmniSci caches a greater subset of the data in CPU memory, which although slower is typically significantly larger than available GPU memory. It will then either stream data from CPU to GPU or execute queries simultaneously in CPU and GPU. Although this method is slower than if the needed data fits entirely in GPU RAM, we have still found OmniSci to be faster than other in-memory solutions in such situations due to the highly optimized nature of the database.​


  • 3.  RE: Column limit

    Posted 05-12-2019 23:25
    Hi @Jeremy Villalobos,

    In my response yesterday, I mentioned that we execute queries simultaneously in CPU and GPU. This is incorrect, but rather OmniSci halts query execution and punts to CPU when the GPU memory is exceeded. As of OmniSci version 4.5, we have better memory handling through improved estimation of query GPU memory requirement. The OmniSci server configuration parameter allow-cpu-retry is now turned on by default, which allows queries in certain low-GPU-memory situations to be identified and executed on CPU automatically. ​


  • 4.  RE: Column limit

    Posted 05-14-2019 10:26
    Edited by Candido Dessanti 05-14-2019 10:35
    Hi @Jeremy Villalobos,

    it likely there are limits because anything on this world is unlimited

    1) is there a column limit; I'm interpreting as is there a column limit on a single table?
    I created a table with 65535 columns, loaded dummy data with copy command, then tried some simple query; the system can be described as unresponsive at best because the parse time did by the calcite server is 33994 ms the first time and over 22000 the subsequent times,  as you can see here:
    omnisql> copy test_col from '/home/mapd/test_col.csv.template' with (header='false');
    Loaded: 1 recs, Rejected: 0 recs in 6.710000 secs
    1 rows returned.
    Execution time: 6826 ms, Total time: 6826 ms
    omnisql> copy test_col from '/home/mapd/test_col.csv' with (header='false');
    Loaded: 1000 recs, Rejected: 0 recs in 10.381000 secs
    1 rows returned.
    Execution time: 10490 ms, Total time: 10491 ms
    omnisql> select count() from test_col;
    1 rows returned.
    Execution time: 33994 ms, Total time: 33997 ms
    omnisql> select count() from test_col;
    1 rows returned.
    Execution time: 23983 ms, Total time: 23984 ms
    omnisql> select sum(col1) from test_col where col2=3;
    1 rows returned.
    Execution time: 24348 ms, Total time: 24348 ms
    omnisql> select sum(col1) from test_col;
    1 rows returned.
    Execution time: 26566 ms, Total time: 26566 ms
    omnisql> explain select count() from test_col;
    IR for the GPU:

    define void @query_template_0(i8** nocapture %byte_stream, i8* nocapture %literals, i64* nocapture %row_count_ptr, i64* nocapture %frag_row_off_ptr, i32* %max_matched_ptr, i64* %agg_init_val, i64** %out, i64** %unused, i32 %frag_idx, i64* %join_hash_tables, i32* %total_matched, i32* %error_code) {
      %result = alloca i64, align 8
      %row_count = load i64, i64* %row_count_ptr, align 8

    filter_false:                                     ; preds = %filter_true, %entry
      ret i32 0

    1 rows returned.
    Execution time: 26425 ms, Total time: 26426 ms
    omnisql> explain calcite select count() from test_col;
    LogicalAggregate(group=[{}], EXPR$0=[COUNT()])
        EnumerableTableScan(table=[[taxi, test_col]])

    The tp on load is pretty low (about 166 records/sec), the parse times are huge and a simple query that would take a few milliseconds, on a table with a reasonable number of columns, takes half of a minute. SO you can have a tables tens of thousands of columns, but performance wise it's not a viable option.

    So the right question would be, how many columns can I have on a table for my performance needs? The answer is that it depends on several factors and it's up to yours because you are the only one that what your needs are.

    It's also unlikely that you are going to hit a hard row limit, in is probable that you would deplete you system memory before hitting that limit on a single system, but it's probably a number too big to be typed.

    Being a columnar database the row size limit doesn't matter.
    Other important limits are

    1) the number of distinct values on a dictionary encoded fields, has been  recently raised from 1 Billion to 2 Billion
    2) the max size of a text field is 32767, the maximum precision of a timestamp has been raised to nanoseconds
    4) the maximum precision of decimals is 18 digits only (pay a lot of attention at this, when you plan your application).

    Omnisci is a columnar database, so only the columns (and dictionaries) needed by the actual query have to reside on CPU Memory and just a subset of them into GPU.
    This because operation like filtering, group bys, aggregates and joins are executed on GPU (if there is enough memory), while projections and sorts are run by the CPU.
    If you are using ALL the columns of a table for abovementioned GPU operations, all the rows of a table need to fit on GPU memory, but this is unlikely to happen.
    If the subsequent query cannot fit in free memory, the columns not needed anymore will be evicted from CPU/GPU memory and the new data will be loaded from disk/CPU memory to accommodate the new query.
    As @Veda Shankar stated on previous messages if the query cannot be run on GPU because lack of memory, the execution will fall back to the CPU.

    About streaming; What do you exactly mean?
    If you are asking if the system stream data needed by a single query from disk because you haven't enough memory (or to reduce the response lag) the reply is NO, and it's unlikely to happen in the short term.
    Streaming data in chucks from CPU Memory to GPU memory, if I remember correctly, has been possible in 4.4.1 release only
    (check out this thread: https://community.omnisci.com/communities/community-home/digestviewer/viewthread?GroupId=13&MessageKey=c72ee49b-8215-4b12-9180-5a3f26d8c2e9&CommunityKey=d06df790-8ca4-4e54-91a0-244af0228ddc&tab=digestviewer&ReturnUrl=%2fcommunities%2fcommunity-home%2fdigestviewer%3fCommunityKey%3dd06df790-8ca4-4e54-91a0-244af0228ddc%26tab%3ddigestviewer)
    , but now the feature has been removed, so maybe we will get it in the future; it's not a trivial task because omnisci, unlike some other, is doing joins on GPUs, so the memory manager has to be reworked a lot to manage a significant number of scenarios.

  • 5.  RE: Column limit

    Posted 05-15-2019 08:36
    @Candido Dessanti  and @Veda Shankar.  Thanks for the detail responses.