Avoiding Duplicate Rows



New to MapD and just gotten up and running to the point that I’m playing with time series data. In our MySQL workflow, we use indexes not only for performance, but to allow us to avoid duplicate inserts using insert ignore. I’ve seen that MapD doesn’t require indexes for high performance, but was wondering if there was a certain method to avoid having duplicate data in a MapD table.

Unique Set of Columns

So far the design point has been around lightening fast analytic queries and fast data ingest. While unique indexes are great for OLTP-style record lookups and for enforcing uniqueness constraints, they’re less useful for analytic queries spanning millions of rows, and the need to maintain these indexes slows down inserts. You’ll notice quite a few databases have the concept of a declared but unenforced constraint, which can be useful in query optimization cardinality estimates. MapD does these on the fly, but it certainly could be useful to have them declared. And nothing wrong with having fully enforced constraints at some point.

Unique key in mapd core?


As a way of detecting duplication prior to loading data into MapD you can perform the following steps:

Files to be inserted are labeled A,B,C… Z for our example sake

  1. Load file A into table MYTABLE
  2. execute query select t1.uniqueCol from MYTABLE t1 join MYTABLE t2 on t1.uCol = t2.uCol;
  3. There should be no rows returned, if not your first A file was not unique
  4. LOAD file B into table TEMPTABLE
  5. execute query select t1.uniqueCol from MYTABLE t1 join TEMPTABLE t2 on t1.uCol = t2.uCol;
  6. There should be no rows returned if file B is unique
  7. fix B if not unique with the details from the select
  8. load the fixed B file in to MYFILE
    9.drop table TEMPTABLE
    9 Continue step 4 -9 for the rest of the set to load and any other file you need as a check prior to loading to real MYTABLE

the load into MapD tables is very fast so this should be able to be scripted pretty easily and you at least know you have no duplicated



Many thanks, that makes a lot of sense. I take it the ultra fast workflow is the reason for a lack of update and delete queries (ie, its easier to just dump and reload as needed)?