I am researching a Twitter data-mining project and came across the OmniSci TweetMap demo @ https://www.omnisci.com/demos/tweetmap . It is a fantastic platform, and I am planning to do a similar project using OmniSci Free.
May I know if the Twitter data was obtained from the official Twitter API? According to some sources, the share of Twitter users who attach geolocation data to their tweets is low. How did you obtain the geolocation for so many tweets?
I asked internally and, as I imagined, we didn't use anything other than the official Twitter API, filtering for the tweets that have some kind of geo-coordinates associated with them, whatever the precision.
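To illustrate that kind of filtering, here is a minimal sketch, assuming the Twitter API v1.1 tweet JSON layout, where a tweet may carry an exact GeoJSON point in `coordinates` or only a `place` with a `bounding_box` (the function name `extract_point` is my own, not part of any API):

```python
def extract_point(tweet: dict):
    """Return a (lon, lat) pair for a tweet, or None if it has no geo data.

    Prefers the exact GeoJSON point in `coordinates`; falls back to the
    centroid of `place.bounding_box` when only a place is attached.
    """
    coords = tweet.get("coordinates")
    if coords and coords.get("type") == "Point":
        lon, lat = coords["coordinates"]  # GeoJSON order is [lon, lat]
        return lon, lat

    place = tweet.get("place")
    if place and place.get("bounding_box"):
        ring = place["bounding_box"]["coordinates"][0]  # polygon ring
        lon = sum(p[0] for p in ring) / len(ring)       # crude centroid
        lat = sum(p[1] for p in ring) / len(ring)
        return lon, lat

    return None  # no geo information at all -> tweet is dropped
```

Tweets that return `None` here are simply discarded, which is why only a fraction of the full stream survives the geo filter.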
So we are displaying only 600 million tweets covering nearly 7 months, while we have been collecting them for years, for obvious reasons.
We are thrilled to see your dashboard using Twitter data; as you probably know, this was one of the first demos released to show the capabilities of the database coupled with back-end rendering, so it has a special place in our hearts.
If we can do something to accelerate your project, just ask.
Candido and the rest of omnisci/mapd team
I have started working on designing the data model and parsing the datasets. I am inspired by an interesting project from some years ago called "We Feel Fine".
For the backend, I think I will create a VM in GCP with a GPU and install OmniSci Free.
Do you know the most convenient way to install it? How can I prototype at minimum cost (minimum CPU RAM / GPU RAM) for, e.g., 10 million records?
For the frontend, I plan to code the data-visualisation part using Three.js, but I guess the dashboards and charts in OmniSci Free are already good for some initial data exploration.
I took a look at “We feel fine” (good for them ;), and I guess that your project will use something more complex to analyze the text (just tweets?).
If you are on GPU, you have lots of power to process messages and infer the feeling of the writer, without looking for a specific string.
That said, to do some prototyping with just 10 million records, you can get an idea by looking at this page in our docs; if you want to be more precise, you have to know how many columns you are using and how much memory is needed by each column.
So, for example, if you are going to use 4 dictionary-encoded strings of 32 bits, 2 dictionary-encoded strings of 16 bits, 1 compressed point (64 bits), and 2 SMALLINTs, the math is fairly simple: add all the sizes, then multiply by the number of records you want to process.
In this example, the number of bytes is 4×4 (32-bit dictionary-encoded) + 2×2 (16-bit dictionary-encoded) + 1×8 (POINT) + 2×2 (SMALLINT), so 32 bytes; multiplied by 10 million records that is just 320 megabytes, for 100 million 3.2 gigabytes, and so on.
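The arithmetic above can be sketched in a few lines of Python (the per-column byte sizes are the ones from the example; adjust the list to match your actual schema):

```python
# Bytes per value for the example column mix (sizes as quoted above).
BYTES_PER_COLUMN = [
    4, 4, 4, 4,  # 4 dictionary-encoded strings, 32-bit
    2, 2,        # 2 dictionary-encoded strings, 16-bit
    8,           # 1 compressed point (64 bits)
    2, 2,        # 2 SMALLINTs
]

row_bytes = sum(BYTES_PER_COLUMN)  # 32 bytes per record

def table_bytes(num_records: int) -> int:
    """Raw column storage needed for num_records rows of this schema."""
    return row_bytes * num_records

print(table_bytes(10_000_000) / 1e6)   # 320.0 (megabytes)
print(table_bytes(100_000_000) / 1e9)  # 3.2 (gigabytes)
```

This covers only the raw column storage; as noted below, joins, GROUP BYs, and rendering need extra headroom on top of it.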
A community user shared a spreadsheet to calculate the memory needed to store the data.
Add to that some buffers needed for joins and/or GROUP BYs, and the space needed for rendering (assuming you are using the back-end map renderer); so every NVIDIA card released in recent years is going to fit, and it is able to process such a low number of records.
To develop the prototype, you can use something like our recommended AWS configuration: a machine with 4–8 CPUs and 64 GB of RAM, plus an NVIDIA Tesla T4, which has 16 GB of RAM and 2560 CUDA cores (it also has some Tensor cores, if you need to use a previously trained model).
The GPU itself costs just 0.36 USD/hour, and a c2-8 would cost under 0.4 USD per hour; together they pack more than enough memory and compute power, for a total of less than 1 USD per hour.
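As a quick sanity check on those figures (the prices are the ones quoted above; they are on-demand rates that change over time, so verify against current GCP pricing):

```python
# Hourly on-demand prices quoted in this thread (USD); treat as assumptions.
T4_PER_HOUR = 0.36        # NVIDIA Tesla T4 GPU
MACHINE_PER_HOUR = 0.40   # c2 machine with 8 vCPUs / 64 GB RAM (upper bound)

total_per_hour = T4_PER_HOUR + MACHINE_PER_HOUR
print(f"~{total_per_hour:.2f} USD/hour")  # ~0.76 USD/hour, under 1 USD
```

At that rate, even a full day of prototyping stays under roughly 20 USD, so it is cheap to spin the VM up only when you need it.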
It would be nice to see your implementation of Three.js with our database, but remember that moving a lot of data from server to client is going to limit the performance of your application (and that's the reason why we also developed a back-end rendering engine).