How to downgrade the version

How can I downgrade from the current version to an earlier one?

Current Version :
Server Version : 5.1.2-20200304-e118922084

The server is not working; I have been waiting for output for the last 2 hours.

I updated the system last month (February) and it was working perfectly fine. However, today, after updating to the latest version, I am getting no response even after 2 hours.

Please note: the data volume is less than a few GB, and the system RAM is 192 GB.

I am not sure how to get out of this problem.

How did you install 5.1.2?

Upgrading the OmniSci version:

systemctl stop omnisci_server

yum update omnisci

systemctl start omnisci_server
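For the downgrade asked about at the top of the thread, a possible sketch using yum (assumptions: the older package version is still available in the configured repository, and the version string below is only an example; back up the data directory first, since newer storage formats may not be readable by older binaries):

```shell
# List the package versions the repo still offers
yum --showduplicates list omnisci

# Roll back to a specific earlier version (5.0.1 is a placeholder)
systemctl stop omnisci_server
yum downgrade omnisci-5.0.1
systemctl start omnisci_server
```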

What does systemctl status omnisci_server show?

omnisql> \status
Server Version : 5.1.2-20200304-e118922084

Server Start Time : 2020-03-06 : 16:37:37

[root@utility1 ~]# systemctl status omnisci_server
● omnisci_server.service - OmniSci database server
Loaded: loaded (/usr/lib/systemd/system/omnisci_server.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2020-03-06 20:37:34 +04; 1h 18min ago
Main PID: 199238 (omnisci_server)
Tasks: 103
CGroup: /system.slice/omnisci_server.service
├─199238 /opt/omnisci//bin/omnisci_server --config /opt/data/omnisci.conf
└─199253 -Xmx1024m -DMAPD_LOG_DIR=/opt/data/data -jar /opt/omnisci/bin/calcite-1.0-SNAPSHOT-jar-with-dependencies.jar -e /opt/omnisci/QueryEngine/ -d /opt/data/data -p 6279 -m 6274 -T -P -Y -Z -c /opt/data/omnisci.conf

Mar 06 20:37:34 utility1 systemd[1]: Started OmniSci database server.
Mar 06 20:37:35 utility1 omnisci_server[199238]: 2020-03-06T20:37:35.483735 E 199238 MapDHandler.cpp:225 This build isn’t CUDA enabled, will run on CPU
Mar 06 20:37:35 utility1 omnisci_server[199238]: 2020-03-06T20:37:35.494217 E 199238 Calcite.cpp:216 Appears to be orphaned Calcite server already running, shutting it down
Mar 06 20:37:35 utility1 omnisci_server[199238]: 2020-03-06T20:37:35.494328 E 199238 Calcite.cpp:218 Please check that you are not trying to run two servers on same port
Mar 06 20:37:35 utility1 omnisci_server[199238]: 2020-03-06T20:37:35.494383 E 199238 Calcite.cpp:219 Attempting to shutdown orphaned Calcite server
Mar 06 20:37:35 utility1 omnisci_server[199238]: 2020-03-06T20:37:35.549881 E 199238 Calcite.cpp:224 orphaned Calcite server shutdown
[root@utility1 ~]#

omnisql> \t
^C
Interrupt signal (2) received.
[wicp@utility1 ~]$


OmniSci Server CPU Memory Summary:
MAX USE ALLOCATED FREE
102529.43 MB 44.43 MB 4096.00 MB 4051.57 MB

OmniSci Server GPU Memory Summary:
[GPU] MAX USE ALLOCATED FREE

omnisql>

Please let us know how I can get out of this problem. I am stuck.

I need help.

If required, I can restart and share more detailed information.

If you have gotten to this point, it feels like you have a running server:

OmniSci Server CPU Memory Summary:
MAX USE ALLOCATED FREE
102529.43 MB 44.43 MB 4096.00 MB 4051.57 MB

OmniSci Server GPU Memory Summary:
[GPU] MAX USE ALLOCATED FREE

omnisql>

The only thing that stands out to me from the snippet you posted is that another Calcite server was running, and that OmniSciDB tried to disable that. Have you completely restarted the server (not just OmniSci), to see if that fixes the issues you are seeing?
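To check whether a second omnisci_server or an orphaned Calcite process is still around, a quick sketch (assuming the default ports from the service output above: 6274 for omnisci_server, 6279 for Calcite):

```shell
# Any omnisci_server or Calcite JVM processes still alive?
pgrep -af 'omnisci_server|calcite' || echo "no omnisci processes found"

# Who is listening on the default OmniSci ports?
ss -ltnp | grep -E ':(6274|6279)\b' || echo "ports 6274/6279 are free"
```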

[wicp@utility1 ~]$ ps -ef | grep -i omni
omnisci  199238      1 11 Mar06 ?     00:43:00 /opt/omnisci//bin/omnisci_server --config /opt/data/omnisci.conf
omnisci  199253 199238  0 Mar06 ?     00:00:42 -Xmx1024m -DMAPD_LOG_DIR=/opt/data/data -jar /opt/omnisci/bin/calcite-1.0-SNAPSHOT-jar-with-dependencies.jar -e /opt/omnisci/QueryEngine/ -d /opt/data/data -p 6279 -m 6274 -T -P -Y -Z -c /opt/data/omnisci.conf
wicp     222119 221845  0 02:39 pts/0 00:00:00 grep --color=auto -i omni
[wicp@utility1 ~]$

Please let us know if there is any other way to check whether another OmniSci server is running.

Hi Sumit, sorry to hear you’re stuck. If you’ve not tried this already,

  1. Do you have a backup of your OmniSci data directory? Can you make sure you do that first?
  2. Test with a clean install of 5.1.2 from a tarball: start omnisci from that bin directory and see if it works with a separate data directory (i.e., without pointing it at your existing data directory).
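For step 1, a minimal backup sketch (both paths are assumptions to adjust for your install; the server should be stopped so no files change during the copy):

```shell
OMNISCI_DATA=/opt/data/data        # adjust to your storage-data path
BACKUP_DIR=/opt/backups            # adjust to wherever backups live

# Stop the server so the data directory is quiescent, then archive it
systemctl stop omnisci_server
mkdir -p "$BACKUP_DIR"
tar czf "$BACKUP_DIR/omnisci-data-$(date +%Y%m%d).tar.gz" \
    -C "$(dirname "$OMNISCI_DATA")" "$(basename "$OMNISCI_DATA")"
systemctl start omnisci_server
```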

Hi @Venkat_Krishnamurthy,

a) Thank you for the kind words. I will check with the team whether they have taken a backup of the data directory. If not, we can create a clear procedure for backup and restoration. (Please let us know if such a procedure is available in the OmniSci documentation or elsewhere.)

b) On another note, I would like to highlight the following critical points for the OmniSci team, from a user's point of view.
b1) Yesterday I struggled the whole day to get the system up and running, even after multiple process restarts and a server reboot. It suddenly started working after 5-6 hours, but I am still not sure what will happen if I restart again.
b2) I am running the OmniSci database on a standalone server, with no other processes on this high-end machine (40 cores, 192 GB RAM, 10 HDDs). The data volume is less than a few GB. However, yesterday, after restarting the OmniSci database, the server load average was touching 30-35 (on other days it does not cross 1-2).
b3) My biggest concern is the OmniSci INFO log. As a technical person I can tell there was definitely some issue in the system or the OmniSci database, yet I could not analyze anything from this log: it looked normal even while I could not get a response to the simplest query, \t, to list the available tables. Maybe my knowledge of OmniSci logs is limited, but it also meant I had no useful details to share with the community so they could help me.
b4) Please let us know whether any core dump or detailed trace option is available in the OmniSci database, so that we can enable it for a short duration during an issue, analyze the output, and share it with the OmniSci team for verification and advice. With no logs during the issue, I feel I am in a clueless state and only god can safeguard me.

Thank you for the feedback, we will take the logging into consideration.

Please note that there are actually 4 types of OmniSci logs, each with increasing severity:

omnisci_server.INFO 
omnisci_server.WARNING
omnisci_server.ERROR
omnisci_server.FATAL

So while the INFO log may not show any issues, the other log levels might.
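A quick way to scan all four levels at once (the log directory path is an assumption; by default the logs live under the data directory):

```shell
LOG_DIR=/opt/data/data/mapd_log    # adjust to your log directory

# Show the tail of each log level - problems can surface in
# WARNING/ERROR/FATAL even when INFO looks normal
for level in INFO WARNING ERROR FATAL; do
  f="$LOG_DIR/omnisci_server.$level"
  if [ -e "$f" ]; then
    echo "== $level =="
    tail -n 20 "$f"
  fi
done
```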

From your description of the server doing something for hours, then eventually coming back, it sounds like your table metadata might have issues. How is the data being loaded into your database, in batches or individual row inserts?

Sumit - what version did you upgrade from?

I understand your frustration. There are several large omnisci customers running in production without encountering similar issues. We would like to help you get to the bottom of what happened, but you need to provide us with additional info in the form of logs to help. It is hard to diagnose problems just through your forum messages. Can you upload your full set of logs somewhere for us to take a look?

Hi @randyzwitch

Thanks for your feedback. I am planning to restart omnisci_server once again, to make sure that if it ever restarts again in the future, I will not face this issue.

This time I will capture the logs mentioned above and share them with the OmniSci team to diagnose the issue, if any.

Hi @Venkat_Krishnamurthy,

Thank you for your feedback. You mentioned my frustration, but there are other aspects to it.

I personally feel connected to the OmniSci database, and I believe its design philosophy has the capability to change the world of analytical databases.

However, when I face a challenge and am not able to resolve the issue, I feel alone and react accordingly.

At present, the biggest challenge I am facing is the lack of log detail, especially during an incident. For example, when I issue \t to list the available tables, or run select count(*), column x from Table Y, and cannot get any output, I don't know what to do next.

We appreciate your feeling as strongly as you do about OmniSci, @Sumit - thank you for that! While we're trying to figure out the problem, can you help answer the following questions:

  1. How frequently do you do any sort of update/delete/insert operations on the database? I.e., how many times a day/week/month, and how many records do you usually have in these batches?
  2. How long have you been doing these updates? e.g. past year? a few months?
  3. What version did you most recently upgrade from?

Another note: your server is definitely quite capable on compute capacity, but using HDDs may also be a contributing factor here (again, I'm guessing, given what you've described above). We usually recommend faster NVMe storage with OmniSci because, as we discussed earlier, we don't store the entire database in CPU memory, so you can actually run databases larger than available memory, unlike several in-memory databases. But for that to be fast, we prefer high-speed I/O in the form of SSDs/NVMe.

My guess is that you are facing some latency issues stemming from long-running metadata operations as Randy points out. We can get on a call to discuss this further if you want.

Hi @Venkat_Krishnamurthy and @randyzwitch,

I was able to simulate the reported issue and connect many dots; however, a few points are still missing, and I would request your help with them.

A) Today I again restarted the OmniSci database using the following commands:

Stopping the OmniSci server:

systemctl stop omnisci_server

Starting the OmniSci server:

systemctl start omnisci_server

Action 1

  1. I observed that, again, I was not getting responses to my simple queries after restarting the OmniSci database. I then started analyzing step by step.
  2. My observations were as follows:
    a. The disk partition was approximately 40% full (400 GB).
    b. Disk IOPS increased drastically while running one simple query (select count(*) from Table X), touching 4,000 TPS and reading at 100 MB/s.
    c. The \memory_summary output showed 100 GB of available memory but 0.00 MB in use.
    d. There were no errors in the INFO or ERROR logs.

Action 2

  1. Based on Action 1, I suspected an issue with table fragmentation and started analyzing the table sizes using SQLite queries.
  2. During the analysis I observed that the tables were highly fragmented: the actual record counts were very low (a few thousand), yet the table sizes were on the order of 40 GB, 62 GB, 100 GB, and 192 GB.
  3. I defragmented the tables as follows:
    3.1. Select count(*) from X;
    3.2. Create Table TEMP_X as select * from X;
    3.3. Drop table X;
    3.4. Create table X as select * from TEMP_X;
    3.5. Drop Table TEMP_X;

Observations on Action 2
Case 1

  1. When I executed the defragmentation steps for a table into which no data was being inserted, system IOPS was 2,000+ with reads at 100 MB/s. Even so, the query execution took 6-7 minutes.
  2. During the select count(*) from X, I observed that the \memory_summary USE value stayed constantly at 0.00 MB.
  3. There was no intermittent status update in the log between the query being accepted by Calcite and the output being received, even though the gap between those two events was more than 6-7 minutes, and I was clueless about what was happening in the system.
    a. 2020-03-08T12:42:01.182590 I 422998 MapDHandler.cpp:883 stdlog_begin sql_execute 54 0 WICP_MAPD_DB WICP_MAPD 239-eBLm {"query_str"} {"select count(*) from EVR;"}
    b. 2020-03-08T12:47:04.128277 I 422998 FileMgr.cpp:204 Completed Reading table's file metadata, Elapsed time : 302904ms Epoch: 424416 files read: 6011 table location: '/opt/data/data/mapd_data/table_2_34'

Case 2

  4. When I executed a select query for a table into which data was being inserted, system IOPS was 4,000+ with reads at 100 MB/s, and the query execution took more than 150 minutes. This time I was in a panic: there was no information in the INFO log, \memory_summary showed 0.00 MB of memory in use, the system load average was above 30, and IOPS was above 4,000.

Questions to OmniSci Team

  1. No progress information is available in the INFO log between "2020-03-08T12:42:01.217941 I 422998 Calcite.cpp:504 Time in Thrift 1 (ms), Time in Java Calcite server 32 (ms)" and "2020-03-08T12:47:04.128277 I 422998 FileMgr.cpp:204 Completed Reading table's file metadata, Elapsed time : 302904ms Epoch: 424416 files read: 6011 table location: '/opt/data/data/mapd_data/table_2_34'". That is a gap of more than 5 minutes, and it is not good for an administrator to be clueless about a query's progress.
  2. Why was data not loaded into memory for "select count(*) from X" or "select count(column1) from X"? Is that correct behavior?
  3. While data loading was in progress for Table X, "select count(*) from Y" took more than 150 minutes, and during this time I was not able to get the output of any query. This was a system outage.
  4. How can a fragmented table create such a big issue and force X hours of service outage? What methods are available to avoid such scenarios?
  5. System throughput was 4,000 IOPS, reading at 125 MB/s; how could reading take more than 150 minutes? At this speed we could read more than 200 GB of data in less than 30 minutes.
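The arithmetic behind question 5 can be checked directly; at 125 MB/s, 150 minutes of sequential reading corresponds to roughly 1.1 TB, far more than the data on disk, which is consistent with the same files (or per-file metadata) being read repeatedly rather than in one sequential pass:

```shell
# Time to read a given volume at a fixed rate (integer minutes)
rate_mb_s=125
for gb in 200 1100; do
  secs=$(( gb * 1024 / rate_mb_s ))
  echo "$gb GB at $rate_mb_s MB/s: $(( secs / 60 )) min"
done
# -> 200 GB takes ~27 min; 1100 GB (~1.1 TB) takes ~150 min
```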

Waiting for expert advice.

@Sumit, please see above. Both @Venkat_Krishnamurthy and I believe that metadata could be the issue you are seeing, but without knowing your insert pattern, it's hard for us to provide any other advice.