Discussions


Cost of Omnisci Database Vs Hadoop

  • 1.  Cost of Omnisci Database Vs Hadoop

    Posted 25 days ago
    Hello,

    I've been playing around with the open source version of Omnisci database using a Docker instance on my Mac. I am impressed with its ease of use. I was able to insert data using JDBC, run queries etc. I also opened an account on Omnisci cloud & created a dashboard. Very impressed with the speed, ease of use & its graphical capabilities. 

    I am excited & ready to approach my clients. Omnisci seems to be a perfect tool for providing sub-second response times to queries against huge amounts of data.

    I do want to understand the cost effectiveness of a solution based on Omnisci, though. GPU-based machines are expensive! It seems Omnisci stores data on the local hard disk. It doesn't seem to integrate with cheaper storage systems such as HDFS, S3, ADLS (Azure), etc. Is that true? Can I attach an EBS volume to a VM on AWS? Any other ideas for saving costs?

    We have many petabytes of data. We would have to transfer that from HDFS/ADLS (Hive) to Omnisci. I understand that we can create a distributed cluster of Omnisci databases, but then the cost goes up. Has anyone done any cost analysis of Omnisci vs. Hadoop-based solutions?

    I really (really) want to get my clients excited about using Omnisci but I need some help in convincing them. Please share information regarding costs.

    Thanks.
    #General

    ------------------------------
    Ajay Chitre
    https://medium.com/@chitre.ajay
    ------------------------------


  • 2.  RE: Cost of Omnisci Database Vs Hadoop

    Posted 24 days ago
    Hi @AJAY CHITRE,

    The software is very fast and user-friendly; the first time I approached MapD, I was running queries on tables with 100M+ rows in less than an hour. It's easy to install, the datatypes are well defined, and the ingestion commands are fast and easy to use.
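
    To give an idea of how simple ingestion is, here is a minimal sketch (the table, columns and file path are made up for illustration, not from a real schema): create a table, then bulk-load a CSV with the COPY command.

        CREATE TABLE flights (
          carrier TEXT ENCODING DICT(8),  -- low-cardinality string, dictionary encoded
          dep_ts TIMESTAMP,               -- departure timestamp
          dep_delay SMALLINT,             -- delay in minutes fits in 2 bytes
          distance FLOAT
        );

        -- bulk-load a CSV file that has a header row
        COPY flights FROM '/data/flights_2019.csv' WITH (header='true');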

    About cost-effectiveness: yes, the GPUs are expensive, but they deliver a lot of performance, so the question to ask is: how much would I have to pay for hardware running software A to get the same performance I'm getting with software B on n GPUs?
    When I benchmarked Omnisci for the first time on a single GPU, I compared it against Oracle 12c with the in-memory option on 16 cores/32 threads, and the performance difference was huge: a 75x speedup. After upgrading Oracle to the latest version and optimizing the Oracle schema, the difference fell back to 37x.
    The GPU system was the equivalent of a Standard_ND6s on Azure (1 P40 GPU), which costs around 2 euro per hour, while the CPU machine was something like a D32 v3 that costs 1.25 euro, so for roughly double the hardware cost you're getting a ~35x speedup.
    To get the same performance I'd have had to increase the CPU core count to around 560 (16 cores x 35 ≈ 560) ;)
    Anyway, as you probably noticed, Omnisci is also very fast using CPUs only; in the same benchmark mentioned above, using just 4 CPU cores I got a 5x speedup over Oracle, so it's not just a matter of which hardware you use; how well the software is written is almost as important.
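
    If I remember correctly, the omnisql (formerly mapdql) client even lets you switch the execution device within a session, so you can try the same SQL on both CPU and GPU; something along these lines (the table and query are just placeholders):

        \cpu
        SELECT carrier, COUNT(*) FROM flights GROUP BY carrier;  -- executed on CPU
        \gpu
        SELECT carrier, COUNT(*) FROM flights GROUP BY carrier;  -- same query, executed on GPU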
    You can read about a direct comparison between Spark/Hive and Omnisci on a particular use case, written by @Randy Zwitch, at the following link.
    About storage costs, I understand your concerns, because on Azure and AWS they are ridiculously expensive; to save space and money, you should choose the right datatypes when creating a table to maximize the savings.
    As an example (it's not the best one), the raw data for NYC taxi rides in 2010 is 22.5 GB; loaded into Omnisci it takes just 7.7 GB, and on a filesystem that allows compression, such as btrfs, 3.1 GB (a performance penalty will occur while reading data if you are on fast storage). On datasets with more and longer string fields, the savings are even more significant.
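
    To make the datatype point concrete, here is a sketch of a "naive" table definition next to a tuned one (the columns are illustrative, not the actual taxi schema); dictionary-encoded text and fixed-length encodings are where most of the savings come from:

        -- naive definition: 8 bytes per value, strings stored uncompressed
        CREATE TABLE trips_naive (
          vendor_id TEXT ENCODING NONE,
          passenger_count BIGINT,
          pickup_datetime TIMESTAMP,
          total_amount DOUBLE
        );

        -- tuned definition: same data, a fraction of the footprint
        CREATE TABLE trips_tuned (
          vendor_id TEXT ENCODING DICT(8),              -- a handful of distinct vendors -> 1 byte per row
          passenger_count SMALLINT,                     -- 2 bytes instead of 8
          pickup_datetime TIMESTAMP ENCODING FIXED(32), -- 4 bytes instead of 8
          total_amount DECIMAL(10,2)                    -- exact values; avoids floating-point rounding
        );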

    I'm not experienced with EBS volumes on AWS, but AFAIK they are network drives presented as block devices to the OS, so I guess you can use them.
    The concern is about the performance of this kind of disk, because it looks to peak at around 10 gigabits per second per system, so if a query needs to load a large volume of data from storage, the system could stall waiting for the data.
    I guess this is the main reason why the Omnisci AMI images don't use this kind of storage; a compressed filesystem could be beneficial here, but I've never tried this particular mix.

    In the 4.6 release, bulk copy from Apache Parquet files has been added, and support for the format as a native storage format is in the works, as you can read in Rachel Wang's blog post, Announcing Omnisci 4.6.
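
    If I recall the 4.6 syntax correctly (double-check the docs, and the path here is made up), loading from Parquet should look something like this:

        COPY trips FROM '/data/trips_2010.parquet' WITH (parquet='true');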

    If you think essential features are missing, you can ask for them here or directly on GitHub.

    Doing a generic cost analysis with so little information isn't easy, but keep in mind that the Omnisci database is an in-memory columnar database, designed to run analytical queries as fast as possible.
    So, as an example, streaming data from storage isn't implemented yet, and if you have a lot of data that you are actively using, you will need a lot of system memory too.

    I hope I've given you some useful hints to better evaluate our database.


  • 3.  RE: Cost of Omnisci Database Vs Hadoop

    Posted 23 days ago
    Adding to Candido's answer, another thing to keep in mind is that all of the Hadoop workflow isn't necessarily possible in OmniSci, and vice versa. So comparing costs only makes sense for certain workloads. For example, I look at Hadoop as much from the HDFS storage side as from the computation side. Hadoop clusters are good for storing data until you might need it, and tools like Spark are great for being able to deal with arbitrary data. With Amazon EMR, S3 possibly reduces the need for dedicated hardware for HDFS, but I suppose that also depends on the specific workflow.

    So for a relational database use case, OmniSci is going to be many times faster than a Hadoop workflow (say, Hive or Impala), but if you want to parse arbitrary text files, OmniSci isn't going to (currently) be a good solution. The same goes if you want to compute on arbitrary file formats without doing ETL first. OmniSci added Parquet reading support in version 4.6, but ultimately we still convert that to the OmniSci binary format and store it on disk. In the short term, we're not going to be able to point to arbitrary file formats, so if that's important, then OmniSci can be a complement to, but not a replacement for, a Hadoop workflow.

    GPU instances are expensive for sure, but it's also the case that they provide functionality that's not possible with CPU-based solutions. It will be up to you and your clients to understand whether they need the sub-second query ability or the visual analytics capabilities of Immerse in OmniSci Enterprise Edition. If they need that, then buying the server hardware can be cost-effective vs. cloud, or using 1-year or 3-year reserved pricing (AWS does both, I believe Azure does as well) can be another way to save. Another way to save is also to not bring all of the petabytes of data over to OmniSci, but rather use OmniSci for data that they know they want to explore. In that sense, OmniSci is a complement to another cheaper tool for storage and batch processing, and OmniSci can be the accelerator for analytics.




  • 4.  RE: Cost of Omnisci Database Vs Hadoop

    Posted 16 days ago
    Thanks Candido & Randy for the detailed replies. Here's what I gathered from your responses:

    1) Omnisci is a perfect tool for use cases that require sub-second response time. 
    2) It won't make sense to transfer all data from HDFS to Omnisci. It's not really storage for archiving data.
    3) Omnisci doesn't force us to use GPUs. We can experiment on CPU-based machines. The code remains the same for both.
    4) Attaching storage such as EBS will cause performance degradation. It's best to use local storage.
    5) We need to pay special attention to data types as they directly affect storage cost.
    6) We can start moving data into Omnisci gradually. (But the question is: if we add more machines to the cluster, would the data get balanced across all nodes automatically? Would data for old tables always remain on the old nodes? I wouldn't think so.)

    The bottom line is, GPUs are expensive, but they provide functionality that otherwise wouldn't be possible with CPU-based solutions.

    Thanks for the insights. Will present them to my client & hopefully we will start using Omnisci soon. 


  • 5.  RE: Cost of Omnisci Database Vs Hadoop

    Posted 15 days ago
    Edited by Candido Dessanti 15 days ago
    Hi @AJAY CHITRE,

    I want to be clearer on points 4 and 5.

    4) I have minimal experience with EBS volumes, but I wanted to warn you about the possible performance degradation you can incur using this kind of disk:
    In the best case they are limited to an aggregate speed of 14 Gbps (for smaller instances the limit is 1.5 Gbps) on an EBS-Optimized machine (you have to use this option to avoid competing with your network bandwidth), whatever the number of disks you are using.
    This could be a limiting factor if a lot of evictions occur because of a lack of system memory, but it depends on your data and how it is going to be used.
    If your data stays in system memory for a reasonable time, disk storage is not going to be a performance-limiting factor.

    5) Choosing the right datatypes will not only save disk space, it will also improve performance: less data to read from disk means faster disk reads, less system memory used, and less traffic on the GPU bus. All of this leads to lower hardware costs, or more data capacity at the same cost.