Table Partition on Date Field --- Missing Feature

Hi omniSci Team,

In OmniSci Database, most of the time i am facing challenge and restricting to utilize for use case due to one Limitation and that is :: Unavailability of Table Partition option (with feature) on Date Field.

I think making Table partition option on Date field (with features) can make OMNISCI Database extremely useful and efficient for many many more use cases.

Let me try to explain

  1. Data Deletion : As objective to 30 Days records. Hence would like to drop any Data which is older than 30 days. Now I am using Delete command but this will not clear OS level disk Space. If Daily Partition option is available, can drop daily Partition which will be less costly query and release OS disk Space.

  2. Data Insertion : I am using Kafka Importer for data Loading. As would like to make data availability quickly so using the small Batch Size (–batch). This create lots of small Chunks of 16MB. Lets assume it create 1000 Blocks of 16MB per day. If Daily Partition Merge Block feature is available, can run Merge Block query next day to rearrange data for a Day partition and may be Data Blocks will reduce to 10 or 50 Blocks. This will improve Database performance and speed up Database restart.

  3. Backup & Restore : Lets assume Daily Record count is 100 Million and would like to keep 30 days data. If Daily Partition option is available then only need to take 100 Million record Backup and restore. Taking 3 Billion record Backup and Restore is not efficient and time consuming job as Failure Chances are high.

  4. Efficient Utilization of GPU and CPU Memory : Based on Query where Clause (if Date Field included), Optimizer can load required Partition in Memory and utilize memory more efficiently.

In above, i always mentioned Date partition (with Features), (With Features) denotes
a) Drop partition
b) Copy a partition
c) Backup a partition
d) Restore a partition
e) Merge DB Blocks for a partition
f) CTAS from Partition
g) Load Partition in Memory based on where Clause (if date field Included)

Regards,
Sumit

1 Like

Sumit, thanks for the input. these are good points - some of them we are already at work on, around improving deletes and also optimizing how we handle epochs on disk. While we dont support date-based partitions yet, we are also looking into how we can use filter pushdowns in queries to improve performance for date-based queries that result in wide scans (e.g. 2 years worth of data). More to come in the next few months.

All that said, a better path to loading omnisci is to avoid too many loads with small batches - this causes metadata to grow. As i have pointed out earlier, as a column store, it is hard for us (and many other column stores) to handle high-throughput small batch data loads. It might be better in the short term if you redo your import process to have larger than 16MB batches

Hi @Venkat_Krishnamurthy,

I am in agreement with your point to do writing in large chunks to utilize the 16MB DB Blocks. However many Real time use cases are required continuous data ingestion and commit to make data available. Like i mentioned in this case we will get next day to consolidate the data and reduce DB Blocks.
However, now in OMNISCI database no option is available other than CTAS on complete table. As we all know CTAS on billion of records is time consuming job and may be it can fail also.

i think we can discuss separately on use cases.

Regards,
Sumit

Sumit,
One other thing you can try to avoid having to run deletes is set a max_rows property on the DDL of the table in question. This property will automatically remove rows on a first in first out basis.

Here’s a link to the max_rows property:
https://docs.omnisci.com/latest/5_tables.html#encoding-spec

Sam

1 Like

Hi @sam.carroll

As this will remove rows without actual knowledge and this will create bigger problem. as requirement to drop all records which are x Days old to keep uniformity in reports and analytics.

Regards,
sumit

Sumit, these are all good input and suggestions, but as you can imagine, we cannot immediately work on date partitions because there are customer priorities on our roadmap that take precedence.

For now, you’ll need to the make the adjustments at your end to figure out the right batch size. Like I said, we’re working on ways to improve this.

Hi @Venkat_Krishnamurthy,

Thank you for sharing feedback and i can totally understand design pipelines.

For the time being, i will use CTAS method so we can make data available as soon as possible and manage Data Blocks.

Regards,
Sumit

Regards,
sumit

a warm welcome to @sam.carroll to our forum.

@Sumit adding the feature you requested took a very long time also to Oracle, and while it was very important to them to get some performance.
This isn’t true for our database, because we have metadata recorded for each column of each fragment of data, so if you run a query for a specific range of time, just the fragments containing data of that rage are read from disk and cached to CPU and GPU for further processing.

When we fix the way, we delete data, or better, optimize data after deletes, the needs for a horizontal partition will be less obvious.

Hi @candido.dessanti and Omnisci team,

Yes I am in agreement with you that adding a new feature is complex.

As Omnisci team is Active in Telecom domain and hence let me share with you most frequent use case in Telecom domain and that can give you idea about what could be useful feature

  1. Telecom data analytics dependent on CDR (Call Detail Records)/ BDR (balance Detail Records)/ Data Usage Detail Records/Syslogs.
  2. Telecom system generate 200Million to 800Million records per day for mid-size network.
  3. Daily Data consumption is in range of 10GB to 25GB and depend on number of fields in Records.
  4. Most cases Telecom operator keep 30 days Data in table for analytics and use system 24 x 7.
  5. Based on FIFO (First In First Out), system need to delete 30 days old data.
  6. Backup Team, take daily backup before deleting the records from Main Table. This Backup stored in Historical database server and tape drive for future reference.
  7. System HDD’s/SSD’s dimension are [30Days x Daily Volume]+ 50% additional usable space.
  8. CPU and RAM are based on product requirement and product specification.

I am seeing following challenges

  • How to manage 30 days data in Table in efficient way
  • How to take daily Partition Backup
  • How to manage disk space.

For the above challenges please let us know your suggestion. I firmly believe one problem has many solutions. My suggested Daily partition is one of the solution. However, new generation database can advise other more efficient option.