I replicated this with an 832M-row table with a char(1) column. On a system with two 1 TB SSDs (SATA 3) in RAID 0 and two Titan XP GPUs, the first query after a cold start took 1.614 seconds with a warm filesystem cache, or 4.16 seconds with a cold cache. We are currently not compressing chars in any way beyond 4 bytes per field, so on this system we are seeing 832M rows * 4 bytes = 3.33 GB / 4.16 seconds = roughly 800 MB/sec, which is within 80% of the theoretical max for the disk bandwidth. As @dwanyeberry noted, there may be an issue with the disk setup on AWS.
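If you want to run the same back-of-the-envelope check against your own cold-cache timing, here is a minimal Python sketch. The function name is just for illustration, and it assumes 1 MB = 10^6 bytes and that the scan reads exactly 4 bytes per row with nothing else hitting the disk.

```python
def scan_throughput_mb_per_s(rows, bytes_per_value, seconds):
    """Effective read rate in MB/s, taking 1 MB = 10**6 bytes."""
    total_bytes = rows * bytes_per_value
    return total_bytes / seconds / 1e6


rows = 832_000_000          # row count from the test above
cold_cache_seconds = 4.16   # first query with a cold filesystem cache

# 4 bytes per char(1) value, since chars are not compressed below that
cold = scan_throughput_mb_per_s(rows, 4, cold_cache_seconds)
print(f"cold-cache scan rate: {cold:.0f} MB/s")
```

Plugging in your own row count and first-query time on AWS should show fairly quickly whether the disk is the bottleneck there.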
With the data cached, I then saw query times of 0.258 seconds (equating to 0.516 seconds per GPU), or about twice the performance you are seeing, likely due to the speedup of Pascal over Kepler. I am guessing that you are hitting a bit of a pathological case on the GPU: we use bitmaps to keep track of the unique values, so with only two distinct values both bits fall into the same byte, and all threads end up attempting atomics on the same 4-byte sequence. A higher number of distinct values would likely improve performance, though note that even 32 values would still fall into the same 4-byte sequence.
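To make the contention point concrete, here is a rough Python sketch of how distinct-value ids map onto 32-bit words of a bitmap. It assumes the ids are dense and start at 0, and that one bit is kept per distinct value; the helper names are hypothetical and this is not our actual code.

```python
WORD_BITS = 32  # assumption: the bitmap is updated with 4-byte (32-bit) atomics


def bitmap_slot(value_id):
    """Return (word index, bit offset) for a distinct-value id in a flat bitmap."""
    return divmod(value_id, WORD_BITS)


def words_touched(num_distinct):
    """Count how many separate 32-bit words the ids 0..num_distinct-1 land in."""
    return len({bitmap_slot(v)[0] for v in range(num_distinct)})


for n in (2, 32, 33, 1024):
    print(n, "distinct values ->", words_touched(n), "word(s) under atomic update")
```

With 2 or 32 distinct values every thread targets the same word, so the atomics serialize; only beyond 32 values do the updates start to spread across more than one word.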
There are of course optimizations that could be done here, such as using shared memory for the reductions, which suffers less from atomic contention in cases like this. It would also be logical to move to needing only one byte to store chars.
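As a rough illustration of why a shared-memory reduction helps, here is a CPU-side Python model that only counts how many atomics would reach global memory. It is a sketch under simplifying assumptions (fixed-size thread blocks, one global atomic per distinct value seen by each block), not GPU code and not how MapD implements it; the function names are made up for the example.

```python
from collections import Counter


def global_atomics_naive(values):
    """Every row issues one atomic directly against the global result."""
    return len(values)


def global_atomics_privatized(values, block_size):
    """Each block accumulates locally first, then flushes once per distinct value it saw."""
    total = 0
    for start in range(0, len(values), block_size):
        # Counter stands in for a per-block accumulator held in shared memory
        local = Counter(values[start:start + block_size])
        total += len(local)
    return total


rows = ["a", "b"] * 512  # 1024 rows, two distinct values
print("naive:", global_atomics_naive(rows))
print("privatized:", global_atomics_privatized(rows, block_size=256))
```

In this toy case the contended global atomics drop from one per row to a handful per block, which is the effect a shared-memory reduction is after.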
I would also note that clients often run with many more GPUs (like the p2.16xlarge) than the single-GPU system you are running on, and a query like this should see near-linear scale-up (i.e. your 1-second query time might drop to about 60 ms). Given that the system you are running on only costs $0.90/hour and you were running an aggregation over more than 800M rows, I'm curious which other systems you have seen perform more quickly at this price point (assuming they are not pre-aggregating the data somehow, as systems like Vertica can do, which typically does not play well with ad-hoc filters). For all the optimizations that would still be profitable for us to exploit, we generally don't see analytic systems getting close to MapD in terms of performance for the same price, at least for simple (i.e. BI-class) queries.
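For reference, the scale-up estimate is just the single-GPU time divided by the GPU count. The sketch below assumes perfectly linear scaling with no per-query overhead, which is the best case rather than a measured result, and takes the 16 GPUs of a p2.16xlarge as the target.

```python
def ideal_multi_gpu_ms(single_gpu_seconds, num_gpus):
    """Best-case query time in milliseconds under perfectly linear scaling."""
    return single_gpu_seconds * 1000.0 / num_gpus


# 1-second query on one GPU, spread across the 16 GPUs of a p2.16xlarge
print(f"{ideal_multi_gpu_ms(1.0, 16):.1f} ms")
```

Real numbers will come in somewhat above this, since some fixed cost per query does not shrink with more GPUs.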