Unable to load geo and array columns using pyomnisci

Hello,
I’m trying to import a table using the pyomnisci load_table function with the arrow format.
The source file (Parquet) has geo and array type columns, and I have created those columns in the table accordingly.
However, I’m getting an error ("Arrow array appends not yet supported").
It would be understandable if the array columns were simply skipped,
but I’m getting this error even for geo columns.
Any help would be appreciated.

Hi,

I am not sure I understand the problem you are facing, but you can refer to this topic/post about how to insert spatial data using pymapd.

Let me know if it fixes your problem inserting geodata into omniscidb.

Regards,
Candido
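
For readers without the linked post at hand, here is a rough sketch of one way geo values are often passed to the columnar loader: as WKT strings. This is an assumption, not something the thread confirms. The helpers `point_wkt` and `polygon_wkt` are hypothetical names made up for illustration, and the commented-out pymapd call uses placeholder table and column names.

```python
def point_wkt(lon, lat):
    """Format a coordinate pair as a WKT POINT string."""
    return f"POINT ({lon} {lat})"


def polygon_wkt(ring):
    """Format a list of (x, y) pairs as a WKT POLYGON with one outer ring."""
    pts = list(ring)
    if pts[0] != pts[-1]:
        pts.append(pts[0])  # WKT rings must be closed
    coords = ", ".join(f"{x} {y}" for x, y in pts)
    return f"POLYGON (({coords}))"


print(point_wkt(-122.4, 37.8))
print(polygon_wkt([(0, 0), (1, 0), (1, 1)]))

# Assumed usage with pymapd (not run here; needs a live server):
#
# import pandas as pd
# df = pd.DataFrame({"id": [1], "pt": [point_wkt(-122.4, 37.8)]})
# con.load_table_columnar("xxx", df, preserve_index=False)
```

Building the strings in plain Python keeps the sketch dependency-free; a geometry library that can emit WKT would serve the same purpose.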

Hi Candido,
I was able to load it using the columnar method, but it’s very slow compared to the arrow method.
I will try doing it via omnisql, since our source is Parquet files in S3, and see how it goes.
Btw, is geo and array column support using arrow on the roadmap for the near future?
Thanks for all the help.

Regards,
Anirudh Simha

Hi,

I cannot provide a date, because while we would like to do everything, the engineering resources are limited, so it depends on how much a feature is requested.
If you have a GitHub account, you can open an issue at this link.

Could you provide a snippet of the code you are using to load the table with arrow? (I think the problems with arrays and geometry objects are related.)

The throughput shouldn’t be so bad with the columnar method. The problem is that the method is serial, so just one core is used, while the COPY command by default uses all the cores of the machine where the server is installed.

Candido.
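
Since each columnar load call is serial, one thing to try is splitting the rows into chunks and loading each chunk from its own thread and connection. Below is a minimal sketch of that idea. `chunk_bounds`, `parallel_load`, and `load_chunk` are hypothetical names, and the commented-out pymapd part assumes a DataFrame `df` and a connection URI `uri` that are not shown in this thread. Whether this actually helps depends on how the server serializes concurrent loads into the same table.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_bounds(n_rows, n_chunks):
    """Split range(n_rows) into at most n_chunks contiguous (start, stop) pairs."""
    if n_rows <= 0:
        return []
    size = -(-n_rows // n_chunks)  # ceiling division
    return [(start, min(start + size, n_rows)) for start in range(0, n_rows, size)]


def parallel_load(n_rows, load_chunk, n_workers=4):
    """Call load_chunk(start, stop) once per chunk, using a thread pool."""
    bounds = chunk_bounds(n_rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # list() waits for all tasks and re-raises any worker exception
        list(pool.map(lambda b: load_chunk(*b), bounds))
    return bounds


# Assumed load_chunk for pymapd (not run here; needs a live server):
#
# def load_chunk(start, stop):
#     con = pymapd.connect(uri=uri)  # one connection per task
#     try:
#         con.load_table_columnar("xxx", df.iloc[start:stop], preserve_index=False)
#     finally:
#         con.close()
#
# parallel_load(len(df), load_chunk, n_workers=8)
```

Passing the loader in as a callable keeps the chunking logic easy to check without a database.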

Sure, I’ll open an issue on GitHub :slightly_smiling_face:
Please find below the code we use for importing.

import pymapd
import pyarrow.parquet as pq
from pyarrow import fs

con = pymapd.connect(user="xxx", password="xxx", host="xxx", port=xxx, dbname="xxx")
# load arrow_table from S3
con.load_table("xxx", arrow_table, method='arrow', preserve_index=False)
con.close()

Well,

this is something I did a long time ago to load an arrow table into the database (I guess everything was local, because it used shared memory).

It’s multithreaded, but the speedup wasn’t spectacular, because the table is locked every time the load_table_arrow method is called.

import time
from threading import Thread

import pyarrow as pa
from pymapd import connect

uri = "mapd://admin:HyperInteractive@localhost:6274/omnisci?protocol=binary"
num_threads = 8

# Read the whole Arrow stream file and split it into record batches
reader = pa.RecordBatchStreamReader("/opt/root_ubuntu18/opt/opendata/flights/flights_none.arrow")
df_b = reader.read_all().to_batches()
# Batches per thread, rounded up so that every batch is assigned to some thread
num_iter = -(-len(df_b) // num_threads)

def connect_then_load(uri, table_name, thread_num, num_iter):
  # Each thread opens its own connection and loads its own contiguous slice of batches
  con = connect(uri=uri)
  for x in range(num_iter):
    idx = thread_num * num_iter + x
    if idx < len(df_b):
      con.load_table_arrow(table_name, df_b[idx])
  con.close()

start_time = time.time()

thread_runs = []
for thread_num in range(num_threads):
  t = Thread(target=connect_then_load, args=(uri, 'flights_parquet', thread_num, num_iter))
  t.start()
  thread_runs.append(t)

for t in thread_runs:
  t.join()

print(f"Loaded {len(df_b)} batches in {time.time() - start_time:.2f} s")
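
The modest speedup is what one would expect when part of every load is serialized by the table lock. A quick way to get a feel for it is Amdahl's law, sketched below. The serial fractions are illustrative values, not measurements from this setup, and `max_speedup` is a name made up for this example.

```python
def max_speedup(serial_fraction, n_threads):
    """Amdahl's law: upper bound on speedup when a fraction of the work is serialized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)


# Illustrative serial fractions (assumed, not measured)
for frac in (0.1, 0.5, 0.9):
    print(f"serial fraction {frac:.0%}, 8 threads: at most {max_speedup(frac, 8):.2f}x")
```

If most of the time of each call is spent holding the lock, adding threads cannot help much, no matter how many cores are available.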

Definitely looks worth trying!

Thanks a lot, Candido.

Regards,
Anirudh Simha

Hi Candido,
I was hoping that the COPY command would work, but it throws the same error!
This is what I’m using:

COPY xxx FROM 's3://xxx' WITH(parquet='true', s3_access_key='XXXXXXXX', s3_secret_key='XXXXXXXX');
Loader Failed due to : Arrow array appends not yet supported in 0.286000 secs