Omnisci SQL query to select date time range

Hi,

I am using pymapd to run my SQL queries and my trying to get the date time range from a table. I am getting a time out or error, my time_stamp column is timestamp datatype.

I am using the following query:
pd.read_sql("select * from my_table where time_stamp> = ‘2020-03-11 05:03:00’ and time_stamp<=‘2020-03-11 05:03:04’ ", connect)

Am I doing something wrong ?

Thanks,
Dipanwita.

Hi @dipanwita2019,

thanks for joinin the omnisci community forum.

Which kind of error are you getting? I did the same query on a table with 370 Million rows and 52 columns and I didn’t get any timeout or other kind of errors. The query tooks 42ms

>>> pd.read_sql("select * from flights where dep_timestamp >= '2008-02-01 00:00:00' and dep_timestamp <= '2008-02-01 01:00:00'",con);
     flight_year  flight_mmonth  flight_dayofmonth  flight_dayofweek  deptime  crsdeptime  ...                        dest_name  dest_city dest_state  dest_country   dest_lat   dest_lon
0           2008              2                  1                 5        3        1850  ...           Cleveland-Hopkins Intl  Cleveland         OH           USA  41.410892 -81.849396
1           2008              2                  1                 5        8        2220  ...             Gen Edw L Logan Intl     Boston         MA           USA  42.364349 -71.005180
2           2008              2                  1                 5       35        2005  ...                      Newark Intl     Newark         NJ           USA  40.692497 -74.168663
3           2008              2                  1                 5       52        2125  ...            Columbia Metropolitan   Columbia         SC           USA  33.938839 -81.119537
4           2008              2                  1                 5       35        2150  ...     Raleigh-Durham International    Raleigh         NC           USA  35.877640 -78.787476
..           ...            ...                ...               ...      ...         ...  ...                              ...        ...        ...           ...        ...        ...
202         2008              2                  1                 5       52        2125  ...            Columbia Metropolitan   Columbia         SC           USA  33.938839 -81.119537
203         2008              2                  1                 5       45          45  ...     George Bush Intercontinental    Houston         TX           USA  29.980473 -95.339722
204         2008              2                  1                 5       50        2050  ...   General Mitchell International  Milwaukee         WI           USA  42.947224 -87.896584
205         2008              1                 31                 4     2400        2335  ...  Washington Dulles International  Chantilly         VA           USA  38.944530 -77.455811
206         2008              2                  1                 5        8        2220  ...             Gen Edw L Logan Intl     Boston         MA           USA  42.364349 -71.005180

[207 rows x 52 columns]

How many columns and rows are trying to project with your query? if the number is very high, it could be a problem for disk subsystem and CPU that has to cerate a rowset on a large result set.

As na examples this one

>>> pd.read_sql("select * from flights where dep_timestamp >= '2008-02-01 00:00:00' and dep_timestamp <= '2008-07-01 00:00:00'",con);
         flight_year  flight_mmonth  flight_dayofmonth  flight_dayofweek  deptime  ...    dest_city  dest_state  dest_country   dest_lat    dest_lon
0               2008              2                 10                 7     1250  ...      Raleigh          NC           USA  35.877640  -78.787476
1               2008              2                 10                 7     2020  ...      Raleigh          NC           USA  35.877640  -78.787476
2               2008              2                 10                 7      700  ...      Raleigh          NC           USA  35.877640  -78.787476
3               2008              2                 10                 7     1717  ...      Raleigh          NC           USA  35.877640  -78.787476
4               2008              2                 10                 7     1511  ...         Reno          NV           USA  39.498577 -119.768066
...              ...            ...                ...               ...      ...  ...          ...         ...           ...        ...         ...
8802349         2008              6                 26                 4      739  ...      Orlando          FL           USA  28.428888  -81.316025
8802350         2008              6                  6                 5      611  ...      Houston          TX           USA  29.980473  -95.339722
8802351         2008              6                 10                 2     1141  ...  New Orleans          LA           USA  29.993389  -90.258026
8802352         2008              6                  9                 1     1525  ...       Newark          NJ           USA  40.692497  -74.168663
8802353         2008              6                 11                 3      742  ...    Arlington          VA           USA  38.852085  -77.037720

[8802354 rows x 52 columns]

that’s projecting more than 8M rows of 52 columns is taking over 141000ms while the filtering on gpu just took 283ms as you can see from thwe debug timer log

141381ms total duration for sql_execute
  286ms start(25ms) executeRelAlgQuery RelAlgExecutor.cpp:158
    285ms start(25ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:181
      0ms start(25ms) Query pre-execution steps RelAlgExecutor.cpp:182
      285ms start(25ms) executeRelAlgSeq RelAlgExecutor.cpp:424
        285ms start(25ms) executeRelAlgStep RelAlgExecutor.cpp:503
          285ms start(25ms) executeCompound RelAlgExecutor.cpp:1491
            285ms start(25ms) executeWorkUnit RelAlgExecutor.cpp:2574
              1ms start(25ms) compileWorkUnit NativeCodegen.cpp:1647
                New thread(16)
                  0ms start(0ms) fetchChunks Execute.cpp:2089
                  0ms start(0ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:766
                  2ms start(0ms) executePlanWithoutGroupBy Execute.cpp:2441
                    2ms start(0ms) launchGpuCode QueryExecutionContext.cpp:195
                End thread(16)
              0ms start(30ms) collectAllDeviceResults Execute.cpp:1572
                0ms start(30ms) reduceMultiDeviceResultSets Execute.cpp:881
              115ms start(30ms) compileWorkUnit NativeCodegen.cpp:1647
                New thread(17)
                  0ms start(55ms) fetchChunks Execute.cpp:2089
                  0ms start(55ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:766
                  56ms start(55ms) executePlanWithGroupBy Execute.cpp:2629
                    56ms start(55ms) launchGpuCode QueryExecutionContext.cpp:195
                    0ms start(111ms) reduceMultiDeviceResultSets Execute.cpp:881
                End thread(17)
                New thread(18)
                  0ms start(111ms) fetchChunks Execute.cpp:2089
                  0ms start(111ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:766
                  52ms start(111ms) executePlanWithGroupBy Execute.cpp:2629
                    52ms start(111ms) launchGpuCode QueryExecutionContext.cpp:195
                    0ms start(164ms) reduceMultiDeviceResultSets Execute.cpp:881
                End thread(18)
                New thread(19)
                  0ms start(0ms) fetchChunks Execute.cpp:2089
                  0ms start(0ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:766
                  54ms start(0ms) executePlanWithGroupBy Execute.cpp:2629
                    54ms start(0ms) launchGpuCode QueryExecutionContext.cpp:195
                    0ms start(54ms) reduceMultiDeviceResultSets Execute.cpp:881
                End thread(19)
2019-06-30T10:00:36.040398 I 17144 DBHandler.cpp:1015 stdlog sql_execute 17 141381 omnisci admin 321-a9x8 {"query_str","client","execution_time_ms","total_time_ms"} {"select * from flights where dep_timestamp >= '2008-02-01 00:00:00' and dep_timestamp <= '2008-07-01 00:00:00'","tcp:localhost:49574","310","141381"}

Hopes this helps

1 Like

Thank you for the reply.

My query is taking a long time to return the result set. Wondering if this has anything to do with the backend. Do you think my backend needs a tweak , the way the database is designed or the data is stored. I have run complex queries like groupby and joins which are relatively faster compared to this simple filtering.

Well,

I am not aware of your schema but the projections query aren’t historically so fast on columnar databases.

When you run your projection query with a filter, the system has to load into the Cpu cache every fragment (a table is subdivided fragments of a default size of 32 millions) of every column that need to be projected that satisfy the filtering condition (this is likely to happen just once), then the data blocks have to be transferred to the Gpu memory for the processing and after that you have to decode the various columns transform into rows and return the result to the client.
Unluckily the final steps are serialized so just 1 thread is working and if you are projecting a lots of rows/columns it could take a long time; add to this the time to transform into a panda’s dataframe :wink:

I suggest you to choose carefully the columns projected in your queries to improve the response time, because while in a row based db selecting one or hundreds columns is the same on a columnar one, expecially if it’s optimized for anhlitical queries like omnisci, is not.

1 Like

thank you …for the reply. This is really helpful :slight_smile:

to elaborate the question a little more

let’s try this query with four joins (on lookup tables) a medium cardinality group by with 4 columns and some aggregate, with filtering and an having clause on a 380M records fact table and some small lookups

select case when a.plane_manufacturer is null then 'Unknown' else a.plane_manufacturer end,c.carrier_name,o.airport_state,d.airport_state,count(*),max(arrdelay),avg(depdelay),avg(arrdelay) from flights f join airports o on o.iata = origin join airports d on d.iata = dest join carriers c on uniquecarrier=code join airplanes a on a.tailnum=f.tailnum where dep_timestamp between '2005-02-01 00:00:00' and '2008-05-07 00:00:00' group by 1,2,3,4 having avg(arrdelay)>10;
[CUT]
EMBRAER|Southwest Airlines|NH|PA|36|251|30.83333333333333|24.83333333333333
AIRBUS INDUSTRIE|United Air Lines|CO|NM|2805|299|13.45561497326203|12.58823529411765
5984 rows returned.
Execution time: 261 ms, Total time: 280 ms

then we will try to run a projection query narrowing the filter using the same fields on a flat table

select plane_manufacturer,carrier_name,origin_state,dest_state from flights_flat where dep_timestamp between '2008-05-01 00:00:00' and '2008-05-07 00:00:00';
[CUT]
BOMBARDIER INC|Comair|KY|TN
CANADAIR|Comair|NY|KY
348603 rows returned.
Execution time: 19 ms, Total time: 813 ms

omnisql> \memory_summary
OmniSci Server CPU Memory Summary:
            MAX            USE      ALLOCATED           FREE
    51485.61 MB      630.23 MB     4096.00 MB     3465.77 MB

OmniSci Server GPU Memory Summary:
[GPU]            MAX            USE      ALLOCATED           FREE
  [0]     9681.76 MB        0.00 MB        0.00 MB        0.00 MB
  [1]     9680.45 MB      315.11 MB     2048.00 MB     1732.89 MB

so the filtering operation is taking 19ms on one GPU (because of the narrow filter ) but to project the resultset it’s taking 813ms; the memory needed is just 630MB

asking for all columns of the table the response time and the memory consumption greatly increase

select * from flights where dep_timestamp between '2008-05-01 00:00:00' and '2008-05-07 00:00:00';
[CUT]
2008|5|5|1|631|635|759|805|AA|817|N251AA|88|90|68|-6|-4|MFE|DFW|468|13|7|0|NULL|0|NULL|NULL|NULL|NULL|NULL|2008-05-05 06:31:00|2008-05-05 07:59:00|American Airlines|Corporation|MCDONNELL DOUGLAS|2007-08-08|DC-9-82(MD-82)|Valid|Fixed Wing Multi-Engine|Turbo-Fan|1984|McAllen Miller International|McAllen|TX|USA|26.17583|-98.23861|Dallas-Fort Worth International|Dallas-Fort Worth|TX|USA|32.89595|-97.0372
2008|5|6|2|635|635|909|805|AA|817|N425AA|154|90|136|64|0|MFE|DFW|468|8|10|0|NULL|0|0|0|64|0|0|2008-05-06 06:35:00|2008-05-06 09:09:00|American Airlines|Partnership|PIPER|2005-02-10|PA-28-180|Valid|Fixed Wing Single-Engine|Reciprocating|1968|McAllen Miller International|McAllen|TX|USA|26.17583|-98.23861|Dallas-Fort Worth International|Dallas-Fort Worth|TX|USA|32.89595|-97.0372
348603 rows returned.
Execution time: 55 ms, Total time: 5185 ms
omnisql> \memory_summary

OmniSci Server CPU Memory Summary:
            MAX            USE      ALLOCATED           FREE
    51485.61 MB     8114.16 MB     8192.00 MB       77.84 MB

OmniSci Server GPU Memory Summary:
[GPU]            MAX            USE      ALLOCATED           FREE
  [0]     9681.76 MB        0.00 MB        0.00 MB        0.00 MB
  [1]     9680.45 MB      315.11 MB     2048.00 MB     1732.89 MB

So it’s better to ask for the columns needed by the application to save memory and to get better response times.

You can get also optimize your tables with the \o command of omnisql

1 Like