MapD and C++ memory persistence



MapD is an incredible product, but it has no C++ documentation, at least none that I could find. I can successfully compile, run, and load data from a C++ program into my MapD server with the load_table() function, as implemented in the “MapD.cpp” file.

However, I have a question about memory management. When I move my data to the server with load_table(), the transferred data remains in system memory, held by the server process. I can force the server to save the transferred data to disk by manually restarting mapd_server, but that is a terrible solution.

The server keeps the transferred data on its heap and does not clear the memory. My question is: how can I force the server to release this persistent memory?

And as a side note, if C++ documentation exists, I’d love a link.



Can you clarify what you are referring to with the statement “the transferred data remains on system memory, on the server process”?

A load_table call will allocate some memory while loading the data into the file system, but it should not persist it in system memory. When you run a query, the server moves the data into system memory (into our internal DataMgr buffers) and then manages that memory as part of our data-caching strategies. This memory will not be cleared for the life of the server process. By default, MapD server assumes it can use up to 80% of the machine’s available memory for its internal buffers before it attempts any clearing/swapping of its internally managed data.
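As an aside, the open-source server also accepts a startup option to cap the CPU buffer pool explicitly, set in mapd.conf or on the command line. The option name and value below are from memory, so treat them as an assumption and verify against `mapd_server --help`:

```
# mapd.conf — cap DataMgr's CPU buffer pool at 16 GB
# (option name is an assumption; verify with `mapd_server --help`)
cpu-buffer-mem-bytes = 17179869184
```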

There are no C++ docs as such. The code is available in the public repo. We hope to share more about the internal architecture as time allows, to help summarize/explain what is going on.



So here is a concrete example:

In my cpp program, I have the vector
auto *input_rows = new std::vector&lt;TStringRow&gt;;

input_rows holds 5 GB worth of data. The current system memory usage is 5 GB of RAM, corresponding to that same vector. Then my program executes

client.load_table(session, table_name, *input_rows);
delete input_rows;

and then the process exits (all of my program’s memory is freed).

But after my program exits, 5 GB of RAM is still in use, now listed as belonging to the mapd_server process. I can force-clear the server’s memory with

sudo systemctl restart mapd_server

which seems to force the server to save what is in RAM to a file on disk.

My question is: how can I make mapd_server clear its memory without a restart? You may be wondering why this is an issue. The problem is that when 20 GB or 40 GB of data are transferred between my program and MapD, let’s say in a batch, the system will eventually run out of RAM, since MapD keeps everything transferred to it on the heap. Not until I restart the server does it save the transferred data to disk.



As @dwayneberry pointed out, when MapD is dealing with data of this size, DataMgr will swap it out of the heap once the data manager buffers reach 80% of the system’s main memory. So feel free to load larger data sets into MapD.

My personal suggestion would be to try loading this larger data set and let us know how it went. We love feedback from our users.

mapdql does have \clear_cpu and \clear_gpu commands that you can use to clear data buffers from main memory. I realize we haven’t documented them; we will make sure the docs are updated. However, we strongly discourage the use of \clear_cpu and \clear_gpu, as the data is supposed to stay in cache as long as possible.

We are working on a high-level architecture document which will serve as a guide for the community to understand the workings of the MapD SQL engine.

Thank you very much for using and contributing to MapD.




@amadis thanks for the additional information. I think I understand better what you are doing now.

The key here, I believe, is that you are inserting 5 GB worth of data in a single call.

We use Thrift as our services layer, and I suspect that when moving large buffers like this in a single transfer, Thrift sets aside some working space.

I did a simple test like yours where I sent 22 GB+ of data over in 10 GB chunks. The MapD server’s memory footprint would increase to about 11 GB, drop back down to around 6 GB, then slowly work back up to 11 GB as the transfer proceeded. After the transfer completed, the server settled back to 6 GB. I am speculating here that this low-water mark is related to the working space Thrift sets aside for the data transfer.

Normally, in our own code that uses the load_table<_whatever> APIs, we break the calls up into separate batches (see StreamInsert in the examples for a program that calls load_table in batches). So if you want to load tens or hundreds of GB, you do not make a single load_table call with all the data; you make a series of calls, each with however many records you choose. You should then see a stable memory footprint for mapd_server that does not grow as you make more load_table calls. It will, of course, grow once you start querying the data.

Regarding the \clear_gpu and \clear_cpu mapdql commands: these will not reduce the memory allocated to the mapd_server process from a system perspective. They work at our DataMgr level and are useful for our own internal processes.

From a straight bulk-load performance perspective, you may want to look at using the COPY FROM command for bulk loads where you can.
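A minimal example of the shape of that command (the table name, file path, and option are illustrative, not from this thread):

```sql
COPY my_table FROM '/path/to/data.csv' WITH (header='true');
```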

I hope this points you in the right direction.



Thanks @dwayneberry, you’re right! Calling load_table() with smaller batches of rows stabilized the system memory. I broke my large table into 100,000 rows per load call and was able to transfer the data without memory problems and with the same execution speed. :slight_smile: