Calculating the percentile

Hello,

Simple problem really. I’m trying to calculate a set of percentiles (e.g. 50, 68, 90) on groups within a table. I’ve noticed the percentile_cont function is a reserved word in the documents, but the function is not currently implemented. I’ve attempted to go down the path of defining the function myself using the UDF extension to no avail and after looking online for several hours I’ve almost given up.

As an example using the nyc_tree_2015_683k dataset, I would like to be able to do the following (or something as close to it).

select 
    zipcode, 
    count(tree_dbh), 
    ave(tree_dbh), 
    percentile(tree_dbh, 0.50), 
    percentile(tree_dbh, 0.68), 
    percentile(tree_dbh, 0.90) 
from nyc_tree_2015_683k 
group by zipcode;

I should also mention that I have tried executing the query in a Jupyter notebook using ibis, pymapd and rbc.

*edited to align with the following post

For interest, below is the UDF I define and pass in the omnisci.config file, which is not even working now as running ./startomnisci --config omnisci.config just quits not long after starting.

    #include <cstdint>
    #include <math.h>
    #include <limits>
    #include <type_traits>
    #if defined(clang) && defined(CUDA) && defined(CUDA_ARCH) #define DEVICE device #else #define DEVICE #endif
    #if defined(clang) && defined(CUDA) && defined(CUDA_ARCH) #define NEVER_INLINE #else #define NEVER_INLINE attribute((noinline)) #endif
    #if defined(clang) && defined(CUDA) && defined(CUDA_ARCH) #define ALWAYS_INLINE #else #define ALWAYS_INLINE attribute((always_inline)) #endif
    #define EXTENSION_NOINLINE extern "C" NEVER_INLINE DEVICE

    EXTENSION_NOINLINE
    double percentile(const Array<double> arr, double q) {
        double perc, rank, crank, frank;
        int32_t length, ridx, cidx, fidx;
        
        length = arr.getSize();
        rank = 1. + *static_cast<double>(length-1)*q;
        
        crank = ceil(rank);
        frank = floor(rank);
        ridx = *static_cast<int32_t>(rank);
        cidx = *static_cast<int32_t>(crank);
        fidx = *static_cast<int32_t>(frank)
        
        if (crank == frank) & (frank == rank) {
            perc = arr[ridx];
        } else {
            perc = (crank-rank)*arr[frank] + (rank-frank)*arr[crank];
        }
        return perc;
    }

Hi @atomslinger,

Welcome to the community forum :wink:

the only window’s functions avaiable right noew are listed here
https://docs.omnisci.com/latest/5_dml.html#window-functions
and as younoticed the perrcentile_cont or percentile_rank aren’t implemented yet so the error.

I don’t think your UDF would work because you are assuming that you gettint into an array the distinct and order values of the tree_dbh values within the groups.

Trying to give you an alternative using sql only

Hi @candido.dessanti

I just tried implementing this solution from stackoverflow (the code which I derived my UDF from):

https://stackoverflow.com/questions/14300004/postgresql-equivalent-of-oracles-percentile-cont-function/14309370#14309370

However, does omniscidb currently support functions? My attempt at creating even simple ones appears to fail:

CREATE OR REPLACE FUNCTION totalRecords ()  
RETURNS integer AS $total$  
declare  
    total integer;  
BEGIN  
   SELECT count(tree_dbh) into total FROM nyc_tree_2015_683k;  
   RETURN total;  
END;  
$total$ LANGUAGE omnisql;

Hi,

nope, currently only UDF functions written in C Language are supported. The problem is that it’s likely you have to use windows functions, that runs on CPU so performance is going to be not so good