Performance with individual inserts


#1

I have noticed poor select performance on our MapD table compared to MySQL, particularly after doing individual inserts. I understand that individual inserts generate a lot of extra metadata that occasionally has to be read from disk during later selects, and that bulk inserts are recommended instead.

I could not find much in the documentation on this metadata except for this forum post:

Questions:

  1. Over a period of time with many inserts (both bulk and individual), will table performance continue to degrade as more metadata is generated? Is there a way to clean up/compact this metadata so MapD treats the table as if all the data had just been bulk inserted again?

  2. If we want data to continue to stream in regularly all day, what would be best? Let’s assume 1000 rows per second come in:
    a) An individual insert for each and every row.
    b) Batching up 1000 rows and inserting in bulk once per second.
    c) Batching up X rows and inserting in bulk once per Y (please give guidance on choosing X and Y here, or the reasoning behind choosing them).

  3. Why does a second select operation, run after the first one, show degraded performance on a system where the entire table can fit into GPU memory?

Thanks,
Brian


#2

Hi Brian -

The important thing to note here is that OmniSci and MySQL are fundamentally different styles of database: OmniSci is a columnar database (sometimes referred to as an “analytics” database) and MySQL is a row-oriented database (sometimes referred to as a “transactional” database). So if your use case is one where each row is important (e.g., an order to be shipped) rather than a handful of columns (e.g., calculating transaction volume by customer type, state, and day of week), then MySQL could be the more appropriate tool.

I’m not aware of specific functionality to do this, other than perhaps periodically taking the records that were inserted in small batches and re-loading them a day at a time, or at whatever interval makes sense for your business.

Individual inserts will never be a good idea, especially at the volumes you are describing. (b) and (c) are the same general idea; the more you can batch together, the better off you will be.

How you choose the batch size is up to your business needs. If you don’t need to insert every second, but instead only need to insert every minute, you’ll immediately have 1/60th of the inserts. The tradeoff is really how fresh you need the data to be, not what batch size OmniSci will consume optimally.
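
To make that concrete, a client-side batching loop along the lines of (b)/(c) might look roughly like the sketch below. This is only an outline assuming the pymapd Python client and a hypothetical, pre-created `events` table; adjust the connection details, table name, and interval to your own setup:

```python
# Rough sketch of option (b)/(c): buffer incoming rows and flush them to the
# server in one bulk load per interval instead of one INSERT per row.
# Assumes the pymapd client and a pre-created "events" table whose columns
# match the buffered tuples -- both are placeholders for your own setup.
import time
import pymapd

con = pymapd.connect(user="admin", password="HyperInteractive",
                     host="localhost", dbname="omnisci")

FLUSH_INTERVAL_SECONDS = 1.0   # the "Y" above; tune to how fresh the data must be
buffer = []
last_flush = time.time()

def handle_incoming_row(row):
    """Queue one incoming row (a tuple in the table's column order)."""
    buffer.append(row)
    maybe_flush()

def maybe_flush():
    global last_flush
    if buffer and time.time() - last_flush >= FLUSH_INTERVAL_SECONDS:
        con.load_table("events", buffer)   # one bulk insert for the whole batch
        buffer.clear()
        last_flush = time.time()
```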

Can you provide more detail on your third question? It seems unrelated to the others, so I want to make sure I understand what issues you are seeing.

Best,
Randy


#3

Hi @brian_earp,

To give a bit more color to @randyzwitch’s comment: indeed, we do keep metadata per insert (whether singleton or batch) so that we can roll back if necessary. However, we currently do not vacuum this metadata, so in the case of singleton inserts a large amount of metadata ends up being stored on disk (and has to be read when a table is first touched). We are currently looking at what it would take to clean up this metadata and to allow a maximum number of entries per DB page or per table.

We hope to be able to prioritize this relatively soon, and we’ll make sure to follow up in this thread with relevant updates. In the meantime, as @randyzwitch mentioned, it’s best to batch your inserts into the largest groups that make logical sense for your business needs.

Regards


#4

As pointed out by @randyzwitch, OmniSci is a columnar database, so it’s unlikely that you will find all the columns of a particular table in system or GPU memory; columns not referenced in the first query but needed in the second one are read from disk.

If you want to warm up the caches, there is a server parameter called --db-query-list; you can read more about this topic here.
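
As a purely illustrative client-side alternative (a sketch assuming the pymapd client and hypothetical table/column names; the --db-query-list parameter achieves the same thing at server startup), you can simply run the queries your dashboards rely on once, so the relevant columns are pulled into memory ahead of time:

```python
# Sketch: warm the CPU/GPU caches by touching the columns later queries will use,
# so the first "real" query doesn't pay the disk-read cost.
# Table and column names here are placeholders.
import pymapd

con = pymapd.connect(user="admin", password="HyperInteractive",
                     host="localhost", dbname="omnisci")

warmup_queries = [
    "SELECT COUNT(*) FROM events",
    "SELECT MIN(event_time), MAX(event_time) FROM events",
    "SELECT customer_type, COUNT(*) FROM events GROUP BY customer_type",
]

for q in warmup_queries:
    con.execute(q)   # results are discarded; the goal is just to load the columns
```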