How to use load_table with spatial data, like POINT

Hi folks,

Is there a way to use pymapd to load big data sets (5M rows) quickly into a table?

A table without a POINT column takes about 15 seconds to load 5 million rows.
For a load with a POINT column I have to use method='rows', and it takes a very long time: about 10 minutes, and afterwards the DB server is broken (it needs a restart).

Is there a way to load data with a POINT column quickly?

Thanks

Hi @slava,

In the meantime, as a workaround, I suggest loading your data using the COPY command in omnisql. The documentation is here:

https://docs.omnisci.com/loading-and-exporting-data/command-line/load-data#csv-tsv-import

We are investigating your pymapd issue, and I hope we will come back with a solution ASAP.

Regards,
Candido

Hi @slava ,

I tried inserting geometry points with pymapd, and I succeeded using the load_table_columnar method.

Assuming we have a pandas dataframe with a geometry column containing points

>>> df.dtypes
f1            object
f2_geom    geometry
dtype: object
>>> df
             f1                     f2_geom
0  Buenos Aires  POINT (-58.66000 -34.58000)
1      Brasilia  POINT (-47.91000 -15.78000)
2      Santiago  POINT (-70.66000 -33.45000)
3        Bogota    POINT (-74.08000 4.60000)
4       Caracas   POINT (-66.86000 10.48000)

and a table called, as an example, test_point

CREATE TABLE test_point (
  f1 TEXT ENCODING DICT(32),
  f2 GEOMETRY(POINT, 4326) ENCODING COMPRESSED(32));

you can load the table with load_table_columnar

>>> con.load_table_columnar("test_point", df, preserve_index=False)
>>> c = con.cursor()
>>> c.execute("select * from test_point");
<pymapd.cursor.Cursor object at 0x7f4f79c7bd00>
>>> list(c)
[('Buenos Aires', 'POINT (-58.659999947371 -34.5799999984819)'), ('Brasilia', 'POINT (-47.9099999963818 -15.7799999955017)'), ('Santiago', 'POINT (-70.6599999920744 -33.4499999664025)'), ('Bogota', 'POINT (-74.0799999675155 4.59999998779967)'), ('Caracas', 'POINT (-66.8599999821093 10.479999990426)')]
>>> 
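In case it helps, here is a minimal sketch of how a dataframe like the one above can be built with geopandas (the column names and sample values are the ones shown above; adapt it to your own data source):

import geopandas as gpd

# sample values copied from the dataframe shown above
names = ["Buenos Aires", "Brasilia", "Santiago", "Bogota", "Caracas"]
lons = [-58.66, -47.91, -70.66, -74.08, -66.86]
lats = [-34.58, -15.78, -33.45, 4.60, 10.48]

# f2_geom becomes a geometry column holding POINTs in EPSG:4326,
# matching the GEOMETRY(POINT, 4326) column of the table
df = gpd.GeoDataFrame(
    {"f1": names, "f2_geom": gpd.points_from_xy(lons, lats)},
    geometry="f2_geom", crs="EPSG:4326")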

The throughput loading 1.6 million records into the table is around 50k records per second.
I am not sure if it's fast enough for your needs, but this way only one CPU core is used, so it may be possible to increase the throughput using multiple threads.

>>> t1 = time.perf_counter()
>>> con.load_table_columnar("test_point", df, preserve_index=False)
>>> time.perf_counter()-t1
33.880870164000044
>>> df
              f1                     geometry
0   Buenos Aires  POINT (-58.66000 -34.58000)
1       Brasilia  POINT (-47.91000 -15.78000)
2       Santiago  POINT (-70.66000 -33.45000)
3         Bogota    POINT (-74.08000 4.60000)
4        Caracas   POINT (-66.86000 10.48000)
..           ...                          ...
0   Buenos Aires  POINT (-58.66000 -34.58000)
1       Brasilia  POINT (-47.91000 -15.78000)
2       Santiago  POINT (-70.66000 -33.45000)
3         Bogota    POINT (-74.08000 4.60000)
4        Caracas   POINT (-66.86000 10.48000)

[1673960 rows x 2 columns]
>>> c.execute("select count(*) from test_point")
<pymapd.cursor.Cursor object at 0x7f4f79c7bd00>
>>> list(c)
[(1673960,)]

Could you share your code so that we can help you better?

Regards,
Candido

Hi @candido.dessanti
thank you very much for your answer.

It worked!!!
Maybe you can update the documentation; I think it would be very helpful for everyone.

So the insertion is working.
First of all, the speed of the load_table function without a geo column is unbelievably fast.
With a geo column it is fast. Only fast :slight_smile:

I created a geo column because I thought ST_CONTAINS with a geo column would be faster than with two latitude and longitude columns. But to my surprise it is slower :astonished:
Is there any reason to create a POINT column instead of two float columns?

So I removed it again from my PoC.

Thank you again for the fast answer.

Hi @slava,

Thanks for your feedback. We will expand the examples for sure, although I'm not a big fan of fixed examples because, in a way, they restrict the creativity of devs.

Anyway, to improve performance, besides using dask or other APIs to exploit more than one core, if you don't need a persistent table you can create a temporary table like this:

create temporary table temp_test_point 
(
  f1 TEXT ENCODING DICT(32),
  f2 GEOMETRY(POINT, 4326) ENCODING COMPRESSED(32));

This would improve the throughput to 60k+ records/sec. The data won't persist in the database.
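If you prefer to do everything from pymapd, a minimal sketch (assuming an existing connection con and a dataframe df like the ones above) would be:

# temporary table: the data won't persist in the database (see the note above)
con.execute("""
create temporary table temp_test_point (
  f1 TEXT ENCODING DICT(32),
  f2 GEOMETRY(POINT, 4326) ENCODING COMPRESSED(32))
""")
con.load_table_columnar("temp_test_point", df, preserve_index=False)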

If you have your data as CSV files, you can consider using the COPY command of the omnisql command-line tool that comes with the database; it's multi-threaded.

As an example, loading the raw taxi trip data provided by the city of NYC (link: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) with two points (pickup and dropoff locations), nearly 14M records, takes 38 secs for a persistent table with points, 9.5 secs for a temporary table with points, and 8.1 secs for a table with longitude and latitude columns.

omnisql> copy yellow_tripdata from '/opt/opendata/taxi/yellow_tripdata_2014-06.csv.gz' ;
Result
Loaded: 13813030 recs, Rejected: 1 recs in 8.166000 secs
omnisql> copy yellow_tripdata_p from '/opt/opendata/taxi/yellow_tripdata_2014-06.csv.gz' with(lonlat='true');
Result
Loaded: 13813025 recs, Rejected: 6 recs in 37.704000 secs
omnisql> copy yellow_tripdata_p_temporary from '/opt/opendata/taxi/yellow_tripdata_2014-06.csv.gz' with(lonlat='true');
Result
Loaded: 13813025 recs, Rejected: 6 recs in 9.518000 secs

I am surprised you are getting better performance on geo joins using lon/lat values than points, because the performance should be identical.

Joining that data with taxi_zones, I get more or less the same runtimes using the default database parameters: around 16 secs for a query like the one below with 3 RTX-class GPUs (for geo operations, a Tesla-class GPU like a V100/A100 would perform close to 16x better than GTX/RTX, because those operations make heavy use of FP64 instructions).

with point datatype:

select borough,avg(total_amount) from yellow_tripdata_p,taxi_zones 
where st_contains(omnisci_geo,pickup_point) group by 1;
borough|EXPR$1
EWR|89.24168737060042
Queens|40.25764929584506
Bronx|18.13730756739201
Manhattan|14.0137167972197
Staten Island|40.15972789115646
Brooklyn|17.20432348298326
6 rows returned.
Execution time: 16533 ms, Total time: 16535 ms

with lon/lat:

select borough,avg(total_amount) from yellow_tripdata,taxi_zones 
where st_contains(omnisci_geo,ST_SetSRID(ST_Point(pickup_longitude, pickup_latitude), 4326))
group by 1;
borough|EXPR$1
EWR|89.24168737060042
Queens|40.25774226222345
Bronx|18.13730756739201
Manhattan|14.01370850145267
Staten Island|40.15972789115646
Brooklyn|17.20429637548631
6 rows returned.
Execution time: 16475 ms, Total time: 16476 ms

but turning on the parameter enable-overlaps-hashjoin, the query using the point datatype takes half the time compared to the one using lon/lat:

omnisql> select borough,avg(total_amount) 
from yellow_tripdata_p,taxi_zones 
where st_contains(omnisci_geo,pickup_point) group by 1;
borough|EXPR$1
EWR|89.24168737060042
Queens|40.25764929584506
Bronx|18.13730756739201
Manhattan|14.0137167972197
Staten Island|40.15972789115646
Brooklyn|17.20432348298326
6 rows returned.
Execution time: 9041 ms, Total time: 9043 ms

I got up to a 9x performance improvement using this parameter on other datasets.
I suggest you rerun your queries with this parameter activated.
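If it helps, on a default installation this parameter can usually be activated by adding it to the server configuration file (omnisci.conf) and restarting the server; please check the configuration section of the documentation for your version:

enable-overlaps-hashjoin = true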

I will try to set up a multi-threaded example with pymapd to improve the throughput using that API.
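A rough sketch of the idea, assuming the default connection parameters (admin / HyperInteractive on localhost) and that the target table test_point already exists; the chunk and worker counts are arbitrary:

import numpy as np
from concurrent.futures import ThreadPoolExecutor
from pymapd import connect

def load_chunk(chunk):
    # one connection per worker: sharing a single connection across
    # threads is not guaranteed to be safe
    con = connect(user="admin", password="HyperInteractive",
                  host="localhost", dbname="omnisci")
    try:
        con.load_table_columnar("test_point", chunk, preserve_index=False)
    finally:
        con.close()

# split the dataframe into 4 roughly equal row slices and load them in parallel
chunks = np.array_split(df, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(load_chunk, chunks))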

Hope this helps,
Candido

Hi @slava,

We just released version 5.6 of our database, which introduces big enhancements for point-in-polygon operations, both performance- and functionality-wise.

Because of that, the overlaps join is now turned on by default, and it works for POINT but not for lon/lat; lon/lat support will come in the next release (what I can say for now is that building the points at runtime is going to cost a little).

So, re-running the queries on the new version, we get a phenomenal performance increase of 80x in the overlaps join:

omnisql> select borough,avg(total_amount) from yellow_tripdata_p,taxi_zones where st_contains(omnisci_geo,pickup_point) group by 1;
borough|EXPR$1
EWR|89.24168737060042
Queens|40.25781140037595
Bronx|18.13787314644442
Manhattan|14.01369628296496
Staten Island|40.15972789115646
Brooklyn|17.20451073240689
6 rows returned.
Execution time: 112 ms, Total time: 112 ms

The old version took 9041 ms, while the new version takes 112 ms.

Running the query that builds the point at runtime falls back to a loop join, but it uses an improved version of st_contains:

omnisql> select borough,avg(total_amount) from yellow_tripdata,taxi_zones where st_contains(omnisci_geo,ST_SetSRID(ST_Point(pickup_longitude, pickup_latitude), 4326)) group by 1;
borough|EXPR$1
EWR|89.24168737060042
Queens|40.27362967477962
Bronx|18.12789353612168
Manhattan|14.01586079553597
Staten Island|40.15972789115646
Brooklyn|17.20781398542617
6 rows returned.
Execution time: 10070 ms, Total time: 10070 ms

Also, on other datasets with many polygons (over 400,000), such as the census sections of Italy, you will get swift response times using just ONE gaming-class GPU.

e.g., joining 34M Italian POIs extracted from OpenStreetMap:

omnisql> select place,count(*) the_count 
         from italy_poi join istat_census_sections on st_contains(istat_census_sections.omnisci_geo,italy_poi.omnisci_geo) group by 1 order by 2 limit 5;
place|the_count
In Tul Rii|13
quarter|56
historical|74
FARM|80
undefined|80
5 rows returned.
Execution time: 949 ms, Total time: 950 ms

You will get very usable response times (in the previous version with the overlaps hash join enabled, the same query took 83235 ms to run, so the improvement is around 90x).

So with a simple upgrade, you can get a performance boost of about 80x.

Regards,
Candido