Unable to instantiate CudaMgr

Hello,

When I try to start the OmniSci server, it enters CPU-only mode because it cannot instantiate CudaMgr.
This is the error shown: “DBHandler.cpp:266 Unable to instantiate CudaMgr, falling back to CPU-only mode. CUDA Error (999): unknown error”

Do you know what might be the issue?

Thank you for your time.

Hi,

This happens because the system cannot detect the GPUs correctly.

Could you post the output of the nvidia-smi command?
What system are you on (OS, hardware)?
Which version of OmniSciDB have you installed?
After the CUDA unknown error, are you getting something like “no GPUs detected”?

I am sorry to ask so many questions, but the 999 error is quite generic.
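
The details requested above can be gathered in one pass. This is a minimal Python sketch, assuming a Linux host with `nvidia-smi` on the PATH; the helper names `run_command` and `collect_gpu_report` are only illustrative and are not part of OmniSciDB. The OmniSciDB version still has to be added by hand.

```python
import platform
import shutil
import subprocess


def run_command(cmd, timeout=60):
    """Run cmd and return (exit_code, output); exit_code is None if it could not run."""
    if shutil.which(cmd[0]) is None:
        return None, f"{cmd[0]} not found on PATH"
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # nvidia-smi can hang when the driver is in a bad state
        return None, f"{cmd[0]} timed out after {timeout}s"
    return result.returncode, (result.stdout + result.stderr).strip()


def collect_gpu_report():
    """Collect OS details and the nvidia-smi output for a support post."""
    code, output = run_command(["nvidia-smi"])
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "nvidia_smi_exit_code": code,
        "nvidia_smi_output": output,
    }


if __name__ == "__main__":
    for key, value in collect_gpu_report().items():
        print(f"{key}:\n{value}\n")
```

Pasting the printed report into a reply covers the OS, hardware architecture, and nvidia-smi questions in one go.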

Regards,
Candido

Hi candido,

I am on CentOS, and this is the output of nvidia-smi:

No, I did not get a “no GPUs detected” error.

Thanks for your help, I really appreciate it.

Best regards,
Sama

Hi,

I ran some tests, including with a driver close to yours (455.23.04; I can’t find the .05 anywhere), and I can’t reproduce your issue.

It looks like something is preventing you from using the GPUs. We recently had trouble with NVIDIA Fabric Manager on DGX and HGX systems, but I don’t think your system has an NVLink switch; maybe I’m wrong.
What kind of hardware are you using? Is it an on-premises physical machine or an AWS instance? (On an AWS instance I could reproduce it.)

Also, error 999 could mean that the NVIDIA driver is in a bad state and that a reboot (or a driver reset) is needed. Can you reboot the machine and try again?
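
To check whether the driver itself returns error 999, independently of OmniSci, the CUDA driver API can be called directly. This is a minimal Python sketch using `ctypes`, assuming the driver library is installed as `libcuda.so.1`. It calls the real driver entry points `cuInit` and `cuDeviceGetCount`, but it is not claimed to mirror what CudaMgr does internally; the function names `probe_cuda_driver` and `describe` are illustrative.

```python
import ctypes

CUDA_SUCCESS = 0
CUDA_ERROR_UNKNOWN = 999  # the code reported in the OmniSci log


def probe_cuda_driver(library_name="libcuda.so.1"):
    """Call cuInit and cuDeviceGetCount through the CUDA driver API.

    Returns (status, device_count); status is None if the library cannot be loaded.
    """
    try:
        libcuda = ctypes.CDLL(library_name)
    except OSError:
        return None, 0

    status = libcuda.cuInit(0)
    if status != CUDA_SUCCESS:
        return status, 0

    count = ctypes.c_int(0)
    status = libcuda.cuDeviceGetCount(ctypes.byref(count))
    return status, count.value


def describe(status):
    """Turn a probe status into a short human-readable message."""
    if status is None:
        return "CUDA driver library could not be loaded"
    if status == CUDA_SUCCESS:
        return "driver initialised"
    if status == CUDA_ERROR_UNKNOWN:
        return "CUDA_ERROR_UNKNOWN (999): a reboot or driver reset may be needed"
    return f"CUDA error {status}"


if __name__ == "__main__":
    status, count = probe_cuda_driver()
    print(describe(status), "- devices:", count)
```

If this small probe also reports 999, the problem sits at the driver or system level rather than in the database server.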

I had the same issue on a Dell Cauldron with 8x T4s and on an HPE Apollo 6500 with 8x A100s sxm2. Running Ubuntu 20.04, I uninstalled all nvidia-* and cuda-* packages that had been installed with apt, then rebooted. Next I installed the latest 460.73.01 driver from the run file, then cuda-toolkit-11.2 from its run file so that it did not install drivers. After that I installed nvidia-fabricmanager and started its service daemon, started the nv-hostengine and persistenced daemons, and made sure the post-installation actions from the CUDA Toolkit Documentation (Installation Guide Linux) were completed. Finally I rebooted and started omnisci_server.
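
Before starting omnisci_server after a sequence like this, it can help to confirm that the daemons are actually running. This is a rough Python sketch, assuming a systemd host; the unit names in `EXPECTED_UNITS` are assumptions that depend on how the driver and Fabric Manager were packaged, so adjust them to match your install. nv-hostengine is left out because its unit name varies with the DCGM version.

```python
import subprocess

# Assumed unit names; they vary with driver and packaging, so adjust as needed.
EXPECTED_UNITS = ["nvidia-fabricmanager", "nvidia-persistenced"]


def unit_state(unit):
    """Return the state printed by `systemctl is-active`, or 'unavailable' without systemctl."""
    try:
        result = subprocess.run(
            ["systemctl", "is-active", unit],
            capture_output=True, text=True, timeout=15,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "unavailable"
    return result.stdout.strip() or "unknown"


def inactive_units(states):
    """Given {unit: state}, list the units that are not reported as 'active'."""
    return sorted(unit for unit, state in states.items() if state != "active")


if __name__ == "__main__":
    states = {unit: unit_state(unit) for unit in EXPECTED_UNITS}
    for unit, state in states.items():
        print(f"{unit}: {state}")
    missing = inactive_units(states)
    if missing:
        print("Not active:", ", ".join(missing))
```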

Thanks @mjj203,

We also had a similar issue with DGX and HGX systems. That is why I asked @missasma which system is in use.

I have fixed the issue. It turned out that the Nvidia MPS service was causing the problem with detecting the GPU.
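
For anyone who wants to check whether MPS is running on their own machine, this is a small Python sketch that scans `/proc` for the MPS daemons. It assumes a Linux host and that the daemons run under their usual binary names, `nvidia-cuda-mps-control` and `nvidia-cuda-mps-server`. The function name `find_mps_processes` is illustrative, and the thread does not state which steps were actually taken to resolve the problem.

```python
import os

# Usual names of the MPS control daemon and server binaries (assumed unchanged).
MPS_BINARIES = ("nvidia-cuda-mps-control", "nvidia-cuda-mps-server")


def find_mps_processes(proc_root="/proc"):
    """Return a sorted list of (pid, name) for MPS daemons found under proc_root."""
    found = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return found
    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as handle:
                argv0 = handle.read().split(b"\0", 1)[0].decode(errors="replace")
        except OSError:
            continue  # process exited or is not readable
        name = os.path.basename(argv0)
        if name in MPS_BINARIES:
            found.append((int(entry), name))
    return sorted(found)


if __name__ == "__main__":
    processes = find_mps_processes()
    if processes:
        for pid, name in processes:
            print(f"MPS process running: {name} (pid {pid})")
    else:
        print("No MPS daemon found")
```

If the control daemon shows up, NVIDIA's MPS documentation describes stopping it with `echo quit | nvidia-cuda-mps-control`, after which omnisci_server can be restarted to see whether the GPUs are detected.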

Hi @missasma,

That’s great news. I hope you will be satisfied with OmniSciDB.

It would be nice if you could share what you did to make the software work in your environment; it could be useful for other community users having the same problem.

Best Regards,
Candido