Import data into MapD


#1

Hello.
Can you describe the detailed process of importing data into MapD using the COPY FROM command?


#2

The COPY FROM command can load data into regular tables from local files or from files on Amazon S3; the first option is faster because you don't have to download anything over the network.

To use the COPY command you can:

  1. Use mapdql installed on the machine hosting the OmniSci Core database:

mapdql -u mapd -p HyperInteractive --db mapd -q -t -n

  2. Use mapdql installed on another host:

mapdql -u mapd -p HyperInteractive --db mapd -q -t -n -s omnisci_hostname

  3. Use the SQL Editor in Immerse.

In any case, the CPU resources used will be those of the host where omnisci-core is running. By default it will use all available CPU threads on the machine, but you can limit this with the threads parameter. Also, the files you are loading from have to reside on the target host.

http://docs.mapd.com/latest/6_loading_data.html#copy-from
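
For example, to cap the load at four threads (a minimal sketch; the flights table name and the file path are placeholders):

COPY flights FROM '/data/flights.csv' WITH (threads=4);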

The number and types of the fields in the files and in the table you are loading have to match, or the rows will be discarded, and you have to create the table before you start loading data (except when you are loading AWS S3 files or importing geo files; both options need additional parameters, and the latter creates the target table for you).
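
A minimal end-to-end sketch (table name, columns, and path are placeholders; the column count and types must match the CSV):

CREATE TABLE flights (
  carrier TEXT ENCODING DICT,
  origin TEXT ENCODING DICT,
  dest TEXT ENCODING DICT,
  dep_delay INTEGER
);

COPY flights FROM '/data/flights.csv' WITH (header='true');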


#3

Thank you.
Consider this scenario: I have three files, 1.csv, 2.csv, and 3.csv, and I run COPY flights FROM '*.csv' to insert the data into OmniSci. Will OmniSci use all available CPU threads to insert 1.csv first, then 2.csv, and so on? Or will it insert the three files at the same time?

Also, will OmniSci copy all the data from the files into CPU RAM first and then insert it into the flights table? Or will it copy data from RAM into the table as soon as each chunk of the CSV files is read into RAM?


#4

Whenever you specify one or more files, all CPU threads will be used to insert the data. The data isn't loaded directly into the database's CPU memory buffer; it is written to disk in batches.

The data is read from disk and placed into the CPU memory buffer when you execute a query, and only for the columns your query needs; if the columns were already loaded into RAM by earlier queries, running another query will load just the data of the newly inserted rows.

You can observe this behavior by querying the table while the load is running; the query will be slower than usual because the memory manager has to read the newly inserted rows from disk.
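
For example, assuming the flights table from before (paths are placeholders), you can run the load in one mapdql session:

COPY flights FROM '/data/*.csv' WITH (header='true');

and, from a second session while it runs, repeat:

SELECT COUNT(*) FROM flights;

Each count should grow as batches are written, and the queries should pay the cost of reading the newly inserted rows from disk.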


#5

Do you mean that the COPY FROM command will not load any data into RAM? I noticed that the cached value grows rapidly when I import data into OmniSci (the value shown by the command free -m), and it ends up almost twice the size of the data.


#6

Filesystem caches grow because the copy command reads the local files to load and writes to the database files, and both operations go through the FS cache (you would have to mount the filesystem with direct I/O forced to avoid this behavior).
Anyway, the write caches speed up any subsequent load into system memory, because the reads are served from the FS cache rather than from disk, which is faster.

To verify this, you can run a copy command on a freshly started instance of OmniSci and check with the \memory_cpu command.
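
For example (credentials, table, and path are placeholders):

mapdql -u mapd -p HyperInteractive --db mapd
mapdql> copy flights from '/data/flights.csv' with (header='true');
mapdql> \memory_cpu

Right after the copy, \memory_cpu should show little or no CPU buffer allocated for the new data; the buffers fill up only once you query the columns.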

I have noticed a possible memory leak in the copy command, because the server allocates memory that is never released, but I need to test further.


#7

Yes, using FS caches is faster than operating on the files directly. When using the caches, OmniSci only has to copy data from the cache into the database.

I cleared the system cache and restarted mapd_server, then imported an 18GB CSV file (text, no encoding) into OmniSci; the cached value grew to 37GB. Then I truncated the table (TRUNCATE TABLE flights) and the cached value dropped to 18GB.

So, about the memory leak problem: does it mean that RAM holds two copies of the data, one copy used for writing and another that is useless, just because of the memory leak?


#8

Well,

If you are loading an uncompressed 18GB file, it's normal to have 18GB of filesystem cache for the CSV file plus the size of the data the database has written to disk for the table; when you truncate or drop the table, the files belonging to it are removed, so their caches are released.

If you want to reclaim the memory allocated for the FS cache, you can run this command (using tee, because the redirection itself needs root privileges):

echo 3 | sudo tee /proc/sys/vm/drop_caches

But the memory allocated for the FS cache is reclaimed by the system whenever it is needed.
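
A quick way to watch the effect (numbers will vary):

free -m                                      # note the "cached" value
echo 3 | sudo tee /proc/sys/vm/drop_caches   # drop page cache, dentries and inodes
free -m                                      # "cached" should shrink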

No, I'm not referring to the FS cache when I talk about a possible memory leak; on my system, when I launch the copy command, the server allocates memory that is never released, nor used.


#9

Thank you.
In fact, before importing the data I cleared the FS cache, and the cached value was less than 1GB. After importing 18GB of data, the cached value grew to 37GB. The data is 18GB, but it occupies about 36GB of cache: 18GB of FS cache for the CSV file and 18GB for the table files written by the database. Is that what you mean?

So if I truncate the table, the 18GB of table cache will be released, but the 18GB of FS cache for the CSV file will stay in RAM.