I am currently doing some research on in-memory databases, both CPU- and GPU-accelerated. I don’t know if this is the right place to ask about our use case, but I’ll give it a shot.
We have 3 tables:
- Order (600 GB, 20 columns, 3 billion rows)
- Customer (60 GB, 12 columns, 1 million rows)
- Item (200 GB, 32 columns, 500 million rows)
It’s a standard star schema with Order as the fact table. About 500 analysts use this concurrently. Today our solution is based on Hadoop and pre-built aggregates.
Let’s say 200 GB of the data is hot. What would our setup need to look like to run MapD, just to get an overview of the infrastructure cost?
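For context, here is the back-of-envelope arithmetic I’ve been doing myself. The VRAM size and overhead factor below are my own assumptions for illustration, not MapD figures:

```python
import math

# Rough GPU count needed to keep the hot working set in VRAM.
# All figures are illustrative assumptions, not vendor guidance.
HOT_DATA_GB = 200        # hot working set from our use case
VRAM_PER_GPU_GB = 24     # e.g. a 24 GB card (assumption)
OVERHEAD = 1.5           # headroom for intermediate results (assumption)

gpus_needed = math.ceil(HOT_DATA_GB * OVERHEAD / VRAM_PER_GPU_GB)
print(gpus_needed)  # 13 GPUs under these assumptions
```

If the overhead factor for query intermediates is much higher in practice, the count scales accordingly, which is why I’m asking about real-world sizing.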
I also have a couple of questions regarding MapD:
- Does MapD keep all the table data loaded in DRAM?
- Does MapD fill VRAM with everything it needs for the SQL query?
- What happens if the data doesn’t fit? Will the query fail, or just run slower?
- Is there logic to keep datasets that other concurrent users are querying (hot data) resident in memory?
- Complex queries need a lot more memory to build the result set. What happens if this overflows VRAM, or is the result stored somewhere else?
- Will it be a problem for MapD to share the GPU machines with other tenants?
- We also have event streams of about 100 GB/day. I understand it isn’t feasible to keep this in memory, but is MapD planning a component for loading on demand from disk-based storage such as Hadoop? Or is MapD purely for a dataset you know you will query, much like the MOLAP cube use case?
- I read that upsert support is coming in the future; is it on this year’s roadmap?
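On the event-stream point above, the arithmetic behind my assumption that in-memory storage isn’t feasible is simply:

```python
# How fast a 100 GB/day stream outgrows any in-memory tier.
DAILY_GB = 100

for days in (30, 90, 365):
    print(f"{days} days -> {days * DAILY_GB / 1000} TB")
# 30 days  ->  3.0 TB
# 90 days  ->  9.0 TB
# 365 days -> 36.5 TB
```

That is why some form of on-demand loading from colder storage matters for us.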
Thank you for your time,