How to use load_table with spatial data, like POINT

Hi folks,

Is there a way I can use pymapd to load big data sets (5 million rows) into a table quickly?

A table without a POINT column takes about 15 seconds to load 5 million rows.
For a load with a POINT column I have to use method='rows', and it takes a very long time (about 10 minutes), after which the DB server crashes and needs a restart.

Is there a way to load data with a POINT column quickly?

Thanks

Hi @slava,

In the meantime, as a workaround, I suggest loading your data using the COPY command in omnisql. The documentation is here:

https://docs.omnisci.com/loading-and-exporting-data/command-line/load-data#csv-tsv-import
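If you want to stay in Python while using that path, the same COPY statement should also be possible to send through pymapd, since COPY is executed server-side. Below is a minimal sketch with placeholder credentials, a hypothetical table name, and a hypothetical CSV path (the file must be visible to the database server):

from pymapd import connect

# Minimal sketch: issue a server-side COPY through pymapd.
# Placeholder credentials; 'your_table' and the CSV path are hypothetical,
# and the path must exist on the server's filesystem.
con = connect(user="admin", password="HyperInteractive",
              host="localhost", dbname="omnisci")
con.execute("COPY your_table FROM '/path/to/data.csv' WITH (header='true');")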

We are investigating your pymapd issue, and I hope we will come back with a solution ASAP.

Regards,
Candido

Hi @slava ,

I tried inserting geometry points with pymapd, and I succeeded using the load_table_columnar method.

Assuming we have a pandas DataFrame with a geometry column containing points:

>>> df.dtypes
f1            object
f2_geom    geometry
dtype: object
>>> df
             f1                     f2_geom
0  Buenos Aires  POINT (-58.66000 -34.58000)
1      Brasilia  POINT (-47.91000 -15.78000)
2      Santiago  POINT (-70.66000 -33.45000)
3        Bogota    POINT (-74.08000 4.60000)
4       Caracas   POINT (-66.86000 10.48000)

and a table called, as an example, test_point:

CREATE TABLE test_point (
  f1 TEXT ENCODING DICT(32),
  f2 GEOMETRY(POINT, 4326) ENCODING COMPRESSED(32));

you can load the table with load_table_columnar:

>>> con.load_table_columnar("test_point", df, preserve_index=False)
>>> c = con.cursor()
>>> c.execute("select * from test_point");
<pymapd.cursor.Cursor object at 0x7f4f79c7bd00>
>>> list(c)
[('Buenos Aires', 'POINT (-58.659999947371 -34.5799999984819)'), ('Brasilia', 'POINT (-47.9099999963818 -15.7799999955017)'), ('Santiago', 'POINT (-70.6599999920744 -33.4499999664025)'), ('Bogota', 'POINT (-74.0799999675155 4.59999998779967)'), ('Caracas', 'POINT (-66.8599999821093 10.479999990426)')]
>>> 
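For reference, here is a minimal, self-contained sketch of how the df and con objects used above could be built; the credentials are placeholders (the OmniSci defaults), and geopandas is assumed to be available for creating the geometry column:

import pandas as pd
import geopandas as gpd
from pymapd import connect

# Minimal sketch: build a DataFrame with a geometry-dtype column and load it.
# Placeholder credentials; the lon/lat values are the ones shown above.
df = pd.DataFrame({"f1": ["Buenos Aires", "Brasilia", "Santiago",
                          "Bogota", "Caracas"]})
df["f2_geom"] = gpd.points_from_xy(
    x=[-58.66, -47.91, -70.66, -74.08, -66.86],
    y=[-34.58, -15.78, -33.45, 4.60, 10.48],
    crs="EPSG:4326",
)

con = connect(user="admin", password="HyperInteractive",
              host="localhost", dbname="omnisci")
con.load_table_columnar("test_point", df, preserve_index=False)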

The throughput loading 1.6 million records into the table is around 50k records per second.
I am not sure if it's fast enough for your needs, but this way only one CPU core is used, so it may be possible to increase the throughput using multiple threads:

>>> t1 = time.perf_counter()
>>> con.load_table_columnar("test_point", df, preserve_index=False)
>>> time.perf_counter()-t1
33.880870164000044
>>> df
              f1                     geometry
0   Buenos Aires  POINT (-58.66000 -34.58000)
1       Brasilia  POINT (-47.91000 -15.78000)
2       Santiago  POINT (-70.66000 -33.45000)
3         Bogota    POINT (-74.08000 4.60000)
4        Caracas   POINT (-66.86000 10.48000)
..           ...                          ...
0   Buenos Aires  POINT (-58.66000 -34.58000)
1       Brasilia  POINT (-47.91000 -15.78000)
2       Santiago  POINT (-70.66000 -33.45000)
3         Bogota    POINT (-74.08000 4.60000)
4        Caracas   POINT (-66.86000 10.48000)

[1673960 rows x 2 columns]
>>> c.execute("select count(*) from test_point")
<pymapd.cursor.Cursor object at 0x7f4f79c7bd00>
>>> list(c)
[(1673960,)]

Could you share your code so that we can help you better?

Regards,
Candido

Hi @candido.dessanti
thank you very much for your answer.

It worked!!!
Maybe you can update the documentation. I think it would be very helpful for everyone.

So the insertion is working.
First of all, the speed of the load_table function without a geo column is absolutely unbelievably fast.
With a geo column it is fast. Only fast :slight_smile:

I created a geo column because I thought ST_CONTAINS with a geo column would be faster than with two latitude and longitude columns. But to my surprise it is slower :astonished:
Is there any reason to create a POINT column instead of two float columns?

So I removed it again from my PoC.

Thank you again for the fast answer.

Hi @slava,

Thanks for your feedback. We will expand the examples for sure, although I'm not a big fan of fixed examples because, in a way, they restrict the creativity of devs.

Anyway, to improve performance: aside from using dask or other APIs to take advantage of more than one core, if you don't need a persistent table you can create a temporary table like this:

create temporary table temp_test_point 
(
  f1 TEXT ENCODING DICT(32),
  f2 GEOMETRY(POINT, 4326) ENCODING COMPRESSED(32));

This would improve the throughput to 60k+ records/sec. The data won't persist in the database.
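If it helps, the same flow can be driven entirely from pymapd by creating the temporary table through execute and then loading into it, reusing the con and df objects from my earlier example; this is just a sketch of the idea, not something I benchmarked here:

# Sketch only: create the temporary table and load it over the same
# pymapd connection, reusing con and df from the earlier example.
con.execute("""
    CREATE TEMPORARY TABLE temp_test_point (
      f1 TEXT ENCODING DICT(32),
      f2 GEOMETRY(POINT, 4326) ENCODING COMPRESSED(32))
""")
con.load_table_columnar("temp_test_point", df, preserve_index=False)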

If you have your data as CSV files, you can also consider using the COPY command of the omnisql command-line tool that comes with the database; it's multi-threaded.

As an example, loading the raw taxi data provided by the municipality of NYC (link: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page), with two points (pickup and dropoff location), for nearly 14 million records takes about 38 secs into a persistent table with POINT columns, 9.5 secs into a temporary table with POINT columns, and 8.1 secs into a table with plain longitude and latitude columns:

omnisql> copy yellow_tripdata from '/opt/opendata/taxi/yellow_tripdata_2014-06.csv.gz' ;
Result
Loaded: 13813030 recs, Rejected: 1 recs in 8.166000 secs
omnisql> copy yellow_tripdata_p from '/opt/opendata/taxi/yellow_tripdata_2014-06.csv.gz' with(lonlat='true');
Result
Loaded: 13813025 recs, Rejected: 6 recs in 37.704000 secs
omnisql> copy yellow_tripdata_p_temporary from '/opt/opendata/taxi/yellow_tripdata_2014-06.csv.gz' with(lonlat='true');
Result
Loaded: 13813025 recs, Rejected: 6 recs in 9.518000 secs

I am surprised you are getting better performance with geo joins using lon/lat values than with points, because the performance should be identical.

Joining that data with taxi_zones, I get more or less the same runtimes using the default parameters on the database, around 16 secs for a query like the following with 3 RTX-class GPUs (with geo operations, a Tesla-class GPU like a V100/A100 would perform close to 16x better than a GTX/RTX, because those operations heavily use FP64 instructions).

with the POINT datatype:

select borough,avg(total_amount) from yellow_tripdata_p,taxi_zones 
where st_contains(omnisci_geo,pickup_point) group by 1;
borough|EXPR$1
EWR|89.24168737060042
Queens|40.25764929584506
Bronx|18.13730756739201
Manhattan|14.0137167972197
Staten Island|40.15972789115646
Brooklyn|17.20432348298326
6 rows returned.
Execution time: 16533 ms, Total time: 16535 ms

with lon/lat:

select borough,avg(total_amount) from yellow_tripdata,taxi_zones 
where st_contains(omnisci_geo,ST_SetSRID(ST_Point(pickup_longitude, pickup_latitude), 4326))
group by 1;
borough|EXPR$1
EWR|89.24168737060042
Queens|40.25774226222345
Bronx|18.13730756739201
Manhattan|14.01370850145267
Staten Island|40.15972789115646
Brooklyn|17.20429637548631
6 rows returned.
Execution time: 16475 ms, Total time: 16476 ms

But turning on the enable-overlaps-hashjoin parameter, the runtime of the query using the POINT datatype is half that of the one using lon/lat:

omnisql> select borough,avg(total_amount) 
from yellow_tripdata_p,taxi_zones 
where st_contains(omnisci_geo,pickup_point) group by 1;
borough|EXPR$1
EWR|89.24168737060042
Queens|40.25764929584506
Bronx|18.13730756739201
Manhattan|14.0137167972197
Staten Island|40.15972789115646
Brooklyn|17.20432348298326
6 rows returned.
Execution time: 9041 ms, Total time: 9043 ms

I got up to a 9x performance improvement using this parameter on other datasets.
I suggest you re-run your queries with this parameter activated.

I will try to set up a multi-threaded example with pymapd to improve the throughput using that API.
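As a rough, untested sketch of the kind of approach I have in mind (placeholder credentials, and it assumes a DataFrame df like the one in my earlier reply): split the DataFrame into chunks and load each chunk through its own connection in a small thread pool.

from concurrent.futures import ThreadPoolExecutor

import numpy as np
from pymapd import connect

N_WORKERS = 4  # arbitrary; tune to your CPU and server

def load_chunk(chunk):
    # Each worker uses its own connection so the sessions don't interfere.
    con = connect(user="admin", password="HyperInteractive",
                  host="localhost", dbname="omnisci")
    try:
        con.load_table_columnar("test_point", chunk, preserve_index=False)
    finally:
        con.close()

# Split the DataFrame into roughly equal chunks and load them in parallel.
chunks = np.array_split(df, N_WORKERS)
with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
    list(pool.map(load_chunk, chunks))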

Hope this helps,
Candido