I’m trying to import a table using the pyomnisci load_table function with the arrow format.
The source file (parquet) has geo and array type columns, and I have created those columns in the table accordingly.
However, I’m getting an error (Arrow array appends not yet supported).
Understandable if the array columns were getting skipped.
But I’m getting this error even for geo columns.
Any help would be appreciated.
I am not sure I understand the problem you are facing, but you can refer to this topic/post about how to insert spatial data using pymapd.
Let me know if it fixed your problem inserting geodata into omniscidb.
I was able to load it using columnar method. But it’s very slow compared to the arrow method.
I will try doing it via omnisql since our source is parquet files in s3 and see how it goes.
Btw, is geo and array column support for arrow on the roadmap for the near future?
Thanks for all the help.
I cannot provide a date, because while we would like to do everything, the engineering resources are limited, so it depends on how much a feature is requested.
If you have a GitHub account you can open an issue at this link.
Could you provide a snippet of the code you are using to load the table with arrow? I think the problems with arrays and geometry objects are related.
The throughput shouldn’t be so bad with the columnar method. The problem is that the method is serial, so just 1 core is used, while the COPY command by default uses all the cores of the machine where the server is installed.
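A minimal sketch of the workaround discussed in this thread: use the arrow method when the table has no geo or array columns, and fall back to the slower columnar method otherwise. The helper name pick_load_method, the GEO_TYPES set, and the idea of passing column types as plain strings are assumptions made for illustration; they are not part of pymapd.

```python
# Hypothetical helper, not part of pymapd: the names and the type-string
# format (e.g. "POINT", "DOUBLE[]") are assumptions for illustration.
GEO_TYPES = {"POINT", "LINESTRING", "POLYGON", "MULTIPOLYGON", "GEOMETRY"}


def pick_load_method(column_types):
    """Return 'columnar' if any column is a geo or array type, else 'arrow'.

    column_types maps a column name to its SQL type written as a string.
    """
    for sql_type in column_types.values():
        normalized = sql_type.strip().upper()
        # Drop any parenthesised arguments, e.g. "GEOMETRY(POINT, 4326)".
        base = normalized.split("(")[0].strip()
        # Array types are assumed to be written with a trailing "[]" or "[n]".
        if base in GEO_TYPES or normalized.endswith("]"):
            return "columnar"
    return "arrow"


# Usage against a live connection (not executed here):
# method = pick_load_method({"id": "INTEGER", "geom": "POINT"})
# con.load_table("xxx", data, method=method, preserve_index=False)
```

Tables without geo or array columns keep the faster arrow path, so only the affected tables pay the cost of the serial columnar load.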
Sure, I’ll open an issue on GitHub.
Please find below the code we use for importing.
import pymapd
import pyarrow.parquet as pq
from pyarrow import fs

con = pymapd.connect(user="xxx", password="xxx", host="xxx", port=xxx, dbname="xxx")
# load arrow_table from s3
con.load_table("xxx", arrow_table, method='arrow', preserve_index=False)
con.close()
This is something I did a long time ago to load an arrow table into the database (I guess everything is local because it used shared memory).
It’s multithreaded, but the speedup wasn’t spectacular, because the table is locked every time the load_table_arrow method is called.
import time
from threading import Thread

import pyarrow as pa
from pyarrow import parquet as pq
import pymapd
from pymapd import connect

def connect_then_load(uri, table_name, thread_num, num_iter):
    # Each thread opens its own connection and loads its own slice of batches.
    con = connect(uri=uri)
    for x in range(0, num_iter):
        if thread_num * num_iter + x < len(df_b):
            con.load_table_arrow(table_name, df_b[thread_num * num_iter + x])

thread_runs = []
num_threads = 8
reader = pa.RecordBatchStreamReader("/opt/root_ubuntu18/opt/opendata/flights/flights_none.arrow")
df_b = reader.read_all().to_batches()
num_iter = round(len(df_b) / (num_threads - 1))
uri = "mapd://admin:HyperInteractive@localhost:6274/omnisci?protocol=binary"
start_time = time.time()
# Start all loader threads, then wait for them to finish.
for thread_num in range(0, num_threads):
    thread_runs.append(Thread(target=connect_then_load, args=(uri, 'flights_parquet', thread_num, num_iter)))
    thread_runs[len(thread_runs) - 1].start()
for t in thread_runs:
    t.join()
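One detail worth checking in the snippet above is how the record batches are divided among the threads. With num_iter computed by round(), some batch counts leave trailing batches unassigned (for example, 10 batches over 8 threads gives num_iter = 1, so only 8 batches are loaded). A minimal sketch of an alternative split follows; split_batches is a hypothetical helper name, not something pymapd or pyarrow provides.

```python
def split_batches(num_batches, num_threads):
    """Return one range of batch indices per thread.

    Every index in range(num_batches) appears in exactly one range, and
    the range sizes differ by at most one.
    """
    base, extra = divmod(num_batches, num_threads)
    ranges = []
    start = 0
    for thread_num in range(num_threads):
        # The first `extra` threads take one additional batch each.
        size = base + (1 if thread_num < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


# Usage inside a loader thread (not executed here):
# for i in split_batches(len(df_b), num_threads)[thread_num]:
#     con.load_table_arrow(table_name, df_b[i])
```

This only changes which batches each thread picks up; the table lock taken on every load_table_arrow call still limits the achievable speedup.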
Definitely looks worth trying!
Thanks a lot, Candido.
I was hoping that the COPY command would work, but it also throws the same error!
This is what I’m using:
COPY xxx FROM 's3://xxx' WITH (parquet='true', s3_access_key='XXXXXXXX', s3_secret_key='XXXXXXXX');
Loader Failed due to : Arrow array appends not yet supported in 0.286000 secs