Slow distinct count


The first time I tried a distinct count it took 37 seconds. The second time it took 1 second, which is still slow. Why so long? And why was the first one longer than the second? If it matters, bounce is currently a “char(1)”. This is running on an Amazon p2.xlarge.

mapdql> select count(distinct bounce) from content2;
1 rows returned.
Execution time: 37575 ms, Total time: 37576 ms
mapdql> select count(distinct bounce) from content2;
1 rows returned.
Execution time: 1056 ms, Total time: 1057 ms
mapdql> select count(*) from content2;
1 rows returned.
Execution time: 52 ms, Total time: 53 ms



So the initial delay on the first run is loading the data from disk. What are the details of the AWS disks you are currently storing your data on? Assuming 500 MB/s from disk, you were moving 3 GB of data, so I would expect the initial load to take about 6 seconds.


Are there any operations running before the SELECT?

That would explain why the first one was slow, but what about the second one? Most columnar databases can do distinct counts very fast. Is that being done on the GPU or the CPU?

BTW, I was using the io1 SSD drives with 20,000 IOPS.


I replicated an 832M-row table with a char(1) column. On a system with two 1TB SSDs (SATA 3) in RAID 0 and 2X Titan XP GPUs, from a cold start the first query took 1.614 seconds with a warm filesystem cache, or 4.16 seconds with a cold cache. We currently do not compress chars in any way beyond 4 bytes per field, so on this system we are seeing 832M rows × 4 bytes = 3.33 GB / 4.16 seconds ≈ 800 MB/sec, which is within 80% of the theoretical max for the disk bandwidth. As @dwanyeberry noted, there may be an issue with the disk setup on AWS.
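The bandwidth math above can be checked back-of-the-envelope (assuming, as stated, 4 uncompressed bytes per char field):

```python
# Sanity check of the disk-bandwidth arithmetic above.
# Assumption: chars are stored uncompressed at 4 bytes per field.
rows = 832_000_000
bytes_per_field = 4

total_gb = rows * bytes_per_field / 1e9                    # data scanned
cold_load_seconds = 4.16                                   # cold-cache query time
throughput_mb_s = rows * bytes_per_field / 1e6 / cold_load_seconds

print(f"{total_gb:.2f} GB scanned, {throughput_mb_s:.0f} MB/s effective")
```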

Running the queries again with the data cached, I saw query times of 0.258 seconds (equating to 0.516 seconds per GPU), or about twice the performance you are seeing (likely the speedup of Pascal over Kepler). I am guessing you are hitting a bit of a pathological case on the GPU: since we use bitmaps to keep track of the unique values, with only two distinct values everything falls into the same byte, so all threads attempt atomics on the same 4-byte word. A higher number of distinct values would likely improve performance (though note that even 32 values would still fall into the same 4-byte word).
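For readers unfamiliar with the bitmap approach, here is an illustrative sketch (not MapD's actual code) of how bitmap-based COUNT(DISTINCT) works, and why two distinct char values concentrate all updates on one word:

```python
# Illustrative sketch of bitmap COUNT(DISTINCT): one bit per possible
# value. On a GPU, the |= on a shared word becomes an atomicOr, and
# every thread updating the same word serializes on that atomic.
def bitmap_count_distinct(values, cardinality):
    words = [0] * ((cardinality + 31) // 32)   # 32 value-bits per word
    for v in values:                           # atomicOr on the GPU
        words[v // 32] |= 1 << (v % 32)
    return sum(bin(w).count("1") for w in words)

# 'Y'/'N' bounce flags encode to, say, 0 and 1 -- both bits live in
# words[0], which is the contention hot spot described above.
codes = [0, 1] * 1000
print(bitmap_count_distinct(codes, 256))  # -> 2
```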

There are of course optimizations that could be done here, like using shared memory for the reductions, which suffers less from atomic contention in cases like this. It would also make sense to move to needing only one byte to store chars.

Also, I would note that clients often run with many more GPUs (e.g. the p2.16xlarge) than the single-GPU system you are running on, and for a query like this they should see near-linear scale-up (i.e. your 1-second query time might drop to 60 ms). And given that the system you are running on costs only $0.90/hour and you were running an aggregation over more than 800M rows, I'm curious which other systems you have seen perform more quickly at this price point (assuming they are not pre-aggregating the data somehow, as systems like Vertica can do, which typically does not play well with ad-hoc filters). Even with the optimizations it would be profitable for us to exploit still on the table, we generally don't see analytic systems getting close to MapD in performance for the same price, at least for simple (i.e. BI-class) queries.



You could try the following rewrite: select count(*) from (select distinct bounce from content2); I'd expect it to be significantly faster. As @todd said, there's significant contention in the original query, which the rewrite mitigates. The inner query will use 32 bins for each of the two groups, which dramatically reduces the contention. At some point we'll rewrite this type of query automatically; it's just a matter of priorities.
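For anyone wanting to convince themselves the rewrite is answer-preserving, here is a quick check using SQLite as a stand-in (the table and column names match the thread; the performance characteristics of course have nothing to do with MapD's GPU execution):

```python
# Verify that the rewritten query returns the same answer as the
# original COUNT(DISTINCT) form, using an in-memory SQLite table.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table content2 (bounce char(1))")
con.executemany("insert into content2 values (?)",
                [("Y",), ("N",)] * 500)

(direct,) = con.execute(
    "select count(distinct bounce) from content2").fetchone()
(rewrite,) = con.execute(
    "select count(*) from (select distinct bounce from content2)").fetchone()

print(direct, rewrite)  # -> 2 2
```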


Yes @asuhan that dropped it to 72ms on my 2 GPU machine.


mapdql> select count(distinct bounce) from content limit 10;
1 rows returned.
Execution time: 685 ms, Total time: 685 ms
mapdql> select count(*) from (select distinct bounce from content);
1 rows returned.
Execution time: 147 ms, Total time: 147 ms

So changing the query does make a difference in this case. I don’t want to change all my queries, though.

FYI, MemSQL does the first query in 15 ms on an m4.10xlarge instance. It doesn't accept the second query's syntax.

FYI2, I’ve found some cases where doing a distinct count with the \cpu option on is faster than with \gpu (94ms vs 143ms in one test where the count was 6).


It’s fairly simple to avoid scanning the input altogether and execute this query using only the metadata, in constant time. We might do it. Once filters are added to the query, though, it’s going to drop back to the full-scan behavior.


In our particular case we don’t need to apply a filter to the distinct count operation so that’d be a win for us.


I’m of two minds about it, but I’ll probably add this optimization next week since we also generate this query from our UI.


Some improvements just landed in master that should drop COUNT(DISTINCT) times for low-cardinality columns. (Note it's not in a release yet.)


Hey Ed,
Improved performance for low cardinality COUNT(DISTINCT) landed in this last release.

Let us know if you like what you see :slight_smile: