Discussions

importing from raw arrow

  • 1.  importing from raw arrow

    Posted 11 days ago
    Hello!

    While investigating how to process (e.g. filter, aggregate...) data already persisted as Apache Arrow binaries, MapD/OmniSci appears to be a possible solution.  I've compiled OmniSci from sources (thanks for documentation sufficient to complete such a task).  I am now considering building a mechanism on top of the compiled MapD, for which your expert advice would be very welcome.  My appreciation to you for the existence of this forum.

    Could you please help me figure out the most expedient way, under the constraints explained below, to slurp (directly import) from disk some potentially large binary files already stored in a recent Arrow format (exactly which format is not terribly critical)?  The resulting benefit would be the ability to query such data with your SQL-like engine, in some manner yet to be determined.

    The following applies:
    * At least initially, the queries are known in advance and static.
    * The data is available locally, and there are near-real-time timing constraints: intermediate copies or format transforms are undesirable.
    * A link-level outcome (say, a relocatable library loaded into the local memory space, as opposed to a networked service) would be highly desirable.
    * There is some latitude and wiggle room to negotiate the above.
    * Expediency for a working proof of concept is as important as the potential for performance later: sensible compromising is the name of the game.

    Best regards.
    #Connectors
    #Examples

    ------------------------------
    Alberto
    ------------------------------


  • 2.  RE: importing from raw arrow

    Posted 10 days ago
    Hi, @alberto squassabia,


    The mapping of external files to an OmniSci table is already in the works, as you can see in the external tables PR in our open-source repository, and it shouldn't take much longer to be completed.

    In the meantime you can use pymapd to load Arrow files; it's quite simple to load an Arrow file with our pymapd driver (we also have a Thrift endpoint to ingest them, so it can easily be used from other languages).
    This is sample code to read and load an Arrow file in a serial fashion (I'm not the best Python programmer here :)

    import pyarrow as pa
    from pymapd import connect
    
    # Read the Arrow IPC stream and split it into record batches
    reader = pa.RecordBatchStreamReader("/files/file.arrow")
    df_b = reader.read_all().to_batches()
    
    # Connect to the server over the binary Thrift protocol
    uri = "mapd://admin:HyperInteractive@localhost:6274/omnisci?protocol=binary"
    con = connect(uri=uri)
    
    # Load the batches one at a time
    for df_batch in df_b:
        con.load_table_arrow('some_table', df_batch)
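
    One note: if 'some_table' does not exist yet on the server, a minimal sketch for creating it first (reusing con and df_b from the snippet above, and letting pymapd infer the schema from a pandas conversion of the first batch) would be:

    # Create the target table by inferring column types from the first record batch
    con.create_table('some_table', df_b[0].to_pandas())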

    To get some speedup you can use multithreading (MT):
    import math
    import pyarrow as pa
    from threading import Thread
    from pymapd import connect
    
    # Each thread opens its own connection and loads its own slice of the batches
    def connect_then_load(uri, table_name, thread_num, num_iter):
        con = connect(uri=uri)
        for x in range(num_iter):
            idx = thread_num * num_iter + x
            if idx < len(df_b):
                con.load_table_arrow(table_name, df_b[idx])
    
    # Read the Arrow IPC stream and split it into record batches
    reader = pa.RecordBatchStreamReader("/files/file.arrow")
    df_b = reader.read_all().to_batches()
    
    num_threads = 8
    # Batches per thread, rounded up so no batch is left behind
    num_iter = math.ceil(len(df_b) / num_threads)
    
    uri = "mapd://admin:HyperInteractive@localhost:6274/omnisci?protocol=binary"
    
    thread_runs = []
    for thread_num in range(num_threads):
        thread_runs.append(Thread(target=connect_then_load,
                                  args=(uri, 'some_table', thread_num, num_iter)))
        thread_runs[-1].start()
    
    for t in thread_runs:
        t.join()

    After ingestion, or mapping once that feature becomes available, you can query the data and get results back in system/GPU memory or through the network using the pymapd driver.
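
    For example, a quick sketch of the query side (the query and the col_a column are placeholders for whatever static queries you have in mind): select_ipc returns the result as a pandas DataFrame over Arrow-based shared memory, so the client must run on the same host as the server, while execute is the plain DB-API path over Thrift.

    from pymapd import connect
    
    con = connect(uri="mapd://admin:HyperInteractive@localhost:6274/omnisci?protocol=binary")
    
    # Result comes back via Arrow-based shared memory as a pandas DataFrame
    df = con.select_ipc("SELECT col_a, COUNT(*) AS n FROM some_table GROUP BY col_a")
    print(df.head())
    
    # Plain DB-API path over Thrift; rows come back as tuples
    for row in con.execute("SELECT COUNT(*) FROM some_table"):
        print(row)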

    Alternatively, you can use external tools through the JDBC or ODBC drivers, or our low-latency data visualization tool, Immerse (Immerse and ODBC are available in EE only).

    Best regards