Frequent calcite crash?


#1

Hi,

I’m running a mapd instance with 200k rows on a Google Cloud instance with 1 Tesla K80. I’m using Immerse as a client. I’m unable to stabilize a running instance of mapd.

Some usage details:

I’ve created a dashboard with 2 tables and a scatter plot, and I use the 2 tables to filter the scatter plot. When I select multiple entries from these tables (roughly 4 or more filters, sometimes more, sometimes fewer), I eventually hit the error "MapDHandler.cpp:3260] Exception: connect() failed: Connection refused", and I have to restart mapd_server to regain access to the interface.

I could use some help understanding why my instance is crashing and how best to resolve it. I’m attaching some logs and details below that I think are relevant; I can attach more if needed.

Thanks!

schema:

CREATE TABLE test_data (
input decimal(19,8) DEFAULT NULL,
output decimal(19,8) DEFAULT NULL,
position_index decimal(19,2) DEFAULT NULL,
channel_name varchar(255) DEFAULT NULL,
file_name varchar(255) DEFAULT NULL,
acc_name varchar(255) DEFAULT NULL,
cart_name varchar(255) DEFAULT NULL,
id bigint DEFAULT NULL
);

nvidia-smi output immediately after a crash:

Wed Dec  6 18:58:52 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   64C    P0    74W / 149W |   2877MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2159      G   /usr/lib/xorg/Xorg                            15MiB |
|    0      2603    C+G   /opt/mapd/bin/mapd_server                   2847MiB |
+-----------------------------------------------------------------------------+

INFO log output:

I1206 18:55:20.873397 2603 MapDServer.cpp:606] MapD started with data directory at '/var/lib/mapd/data'
I1206 18:55:20.873548 2603 MapDServer.cpp:613] Watchdog is set to 1
I1206 18:55:20.873555 2603 MapDServer.cpp:635] cuda block size 0
I1206 18:55:20.873559 2603 MapDServer.cpp:636] cuda grid size 0
I1206 18:55:20.873562 2603 MapDServer.cpp:637] calcite JVM max memory 1024
I1206 18:55:20.873566 2603 MapDServer.cpp:638] MapD Server Port 9091
I1206 18:55:20.873569 2603 MapDServer.cpp:639] MapD Calcite Port 9093
I1206 18:55:20.873585 2603 MapDHandler.cpp:151] MapD Server 3.3.1-20171108-32e7bcc
I1206 18:55:21.822178 2603 CudaMgr.cpp:127] Using 1 Gpus.
I1206 18:55:21.822242 2603 DataMgr.cpp:120] cpuSlabSize is 2958.11M
I1206 18:55:21.822257 2603 DataMgr.cpp:122] reserved GPU memory is 604.837M includes render buffer allocation
I1206 18:55:21.836956 2603 DataMgr.cpp:132] gpuSlabSize is 2048M
I1206 18:55:21.856178 2603 FileMgr.cpp:173] Completed Reading table’s file metadata, Elasped time : 0ms Epoch: 0 files read: 0 table location: '/var/lib/mapd/data/mapd_data/table_0_0/'
I1206 18:55:21.856680 2603 Calcite.cpp:156] Creating Calcite Handler, Calcite Port is 9093 base data dir is /var/lib/mapd/data
I1206 18:55:21.856691 2603 Calcite.cpp:95] Running calcite server as a daemon
I1206 18:55:25.509320 2603 Calcite.cpp:124] Calcite server start took 3300 ms
I1206 18:55:25.509354 2603 Calcite.cpp:125] ping took 94 ms

ERROR log output:

Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E1206 18:57:54.359479  2697 MapDHandler.cpp:3260] Exception: connect() failed: Connection refused
E1206 18:57:54.359545  5005 MapDHandler.cpp:3260] Exception: connect() failed: Connection refused
E1206 18:57:55.104563  4855 MapDHandler.cpp:2240] Exception: connect() failed: Connection refused
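
The error lines above follow the log line format quoted at the top of the ERROR log. As a minimal sketch (not part of MapD; the names `parse_line` and `errors_by_thread` are made up for illustration), the following Python groups error records by thread id, which makes it easier to match each "Connection refused" to the query that thread was running in the ALL log:

```python
import re

# Matches the prefix described in the log header:
# [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
LOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<source>[^\]\s]+)\] (?P<msg>.*)$"
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["thread"] = int(record["thread"])
    return record

def errors_by_thread(lines):
    """Map thread id -> list of (time, source, msg) for level-E records only."""
    grouped = {}
    for line in lines:
        record = parse_line(line)
        if record is not None and record["level"] == "E":
            grouped.setdefault(record["thread"], []).append(
                (record["time"], record["source"], record["msg"])
            )
    return grouped

# Lines copied from the logs in this post.
SAMPLE = [
    "I1206 18:57:53.736174  2697 MapDHandler.cpp:586] sql_execute-COMPLETED Total: 4587 (ms), Execution: 4584 (ms)",
    "E1206 18:57:54.359479  2697 MapDHandler.cpp:3260] Exception: connect() failed: Connection refused",
    "E1206 18:57:54.359545  5005 MapDHandler.cpp:3260] Exception: connect() failed: Connection refused",
    "E1206 18:57:55.104563  4855 MapDHandler.cpp:2240] Exception: connect() failed: Connection refused",
]

for thread, records in sorted(errors_by_thread(SAMPLE).items()):
    for time, source, msg in records:
        print(thread, time, source, msg)
```

Grouping by thread id is a convenience choice here: in the ALL output, each failing thread (2697, 5005, 4855) logs its own request shortly before the error.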

ALL output, truncated to the query that caused the crash:

I1206 18:57:49.110793  2697 MapDHandler.cpp:568] sql_execute :12ZJ2CKx1khjZSVPP7KcyPJfafT7lSGO:query_str:SELECT COUNT(*) as val FROM test_data WHERE (id = 44494 OR id = 44355 OR id = 44291) AND (position_index = 6) AND (input >= -1.7342400790567456 AND input <= 2.970683636030559) AND (output >= -319.1088760820893 AND output <= 383.94926969623407)
I1206 18:57:49.111295  2697 Calcite.cpp:247] User mapd catalog mapd sql 'SELECT COUNT(*) as val FROM test_data WHERE (id = 44494 OR id = 44355 OR id = 44291) AND (position_index = 6) AND (input >= -1.7342400790567456 AND input <= 2.970683636030559) AND (output >= -319.1088760820893 AND output <= 383.94926969623407)'
I1206 18:57:49.270043  2697 Calcite.cpp:260] Time in Thrift 138 (ms), Time in Java Calcite server 20 (ms)
I1206 18:57:53.736174  2697 MapDHandler.cpp:586] sql_execute-COMPLETED Total: 4587 (ms), Execution: 4584 (ms)
I1206 18:57:54.109606  4855 MapDHandler.cpp:2096] render_vega :12ZJ2CKx1khjZSVPP7KcyPJfafT7lSGO:widget_id:4:compressionLevel:3:vega_json:{"width":1908,"height":853,"data":[{"name":"points","sql":"SELECT input as x,output as y,channel_name as color,test_data.rowid FROM test_data WHERE (id = 44494 OR id = 44355 OR id = 44291) AND (position_index = 6) AND (input >= -1.7342400790567456 AND input <= 2.970683636030559) AND (output >= -319.1088760820893 AND output <= 383.94926969623407) LIMIT 2000000"}],"scales":[{"name":"x","type":"linear","domain":[-1.7342400790567456,2.970683636030559],"range":"width"},{"name":"y","type":"linear","domain":[-319.1088760820893,383.94926969623407],"range":"height"},{"name":"points_fillColor","type":"ordinal","domain":["Channel 3 1","Channel 2 1 2","Channel 1 1","Other"],"range":["#ea5545","#bdcf32","#b33dc6","#27aeef"],"default":"#27aeef","nullValue":"#CACACA"}],"marks":[{"type":"points","from":{"data":"points"},"properties":{"x":{"scale":"x","field":"x"},"y":{"scale":"y","field":"y"},"size":4,"fillColor":{"scale":"points_fillColor","field":"color"}}}]}
I1206 18:57:54.109870  2697 MapDHandler.cpp:568] sql_execute :12ZJ2CKx1khjZSVPP7KcyPJfafT7lSGO:query_str:SELECT id as key0,COUNT(*) AS val FROM test_data WHERE (position_index = 6) AND (input >= -1.7342400790567456 AND input <= 2.970683636030559) AND (output >= -319.1088760820893 AND output <= 383.94926969623407) GROUP BY key0 ORDER BY key0 DESC LIMIT 400 OFFSET 0
I1206 18:57:54.110329  2697 Calcite.cpp:247] User mapd catalog mapd sql 'SELECT id as key0,COUNT(*) AS val FROM test_data WHERE (position_index = 6) AND (input >= -1.7342400790567456 AND input <= 2.970683636030559) AND (output >= -319.1088760820893 AND output <= 383.94926969623407) GROUP BY key0 ORDER BY key0 DESC LIMIT 400 OFFSET 0'
I1206 18:57:54.165184  5005 MapDHandler.cpp:568] sql_execute :12ZJ2CKx1khjZSVPP7KcyPJfafT7lSGO:query_str:SELECT position_index as key0,COUNT(*) AS val FROM test_data WHERE (id = 44494 OR id = 44355 OR id = 44291) AND (input >= -1.7342400790567456 AND input <= 2.970683636030559) AND (output >= -319.1088760820893 AND output <= 383.94926969623407) GROUP BY key0 ORDER BY key0 LIMIT 250 OFFSET 0
I1206 18:57:54.165601  5005 Calcite.cpp:247] User mapd catalog mapd sql 'SELECT position_index as key0,COUNT(*) AS val FROM test_data WHERE (id = 44494 OR id = 44355 OR id = 44291) AND (input >= -1.7342400790567456 AND input <= 2.970683636030559) AND (output >= -319.1088760820893 AND output <= 383.94926969623407) GROUP BY key0 ORDER BY key0 LIMIT 250 OFFSET 0'
E1206 18:57:54.359479  2697 MapDHandler.cpp:3260] Exception: connect() failed: Connection refused
E1206 18:57:54.359545  5005 MapDHandler.cpp:3260] Exception: connect() failed: Connection refused
I1206 18:57:54.926841  2691 Calcite.cpp:247] User mapd catalog mapd sql 'SELECT input as x,output as y,channel_name as color,test_data.rowid FROM test_data WHERE (id = 44494 OR id = 44355 OR id = 44291) AND (position_index = 6) AND (input >= -1.7342400790567456 AND input <= 2.970683636030559) AND (output >= -319.1088760820893 AND output <= 383.94926969623407) LIMIT 2000000'
E1206 18:57:55.104563  4855 MapDHandler.cpp:2240] Exception: connect() failed: Connection refused

#2

Hi,

How are you starting mapd? If you are using systemd, could you review the journalctl output for mapd_server?

Also check the system logs for any related messages.

Is this repeatable? How many queries do you execute before it happens?

Regards


#3

Hi Dwayne,

I was able to repeat this pretty reliably. However, I spoke with Marc since posting this question, and increasing the CPU RAM on the instance (6 GB -> 48 GB) seems to have resolved the issue.
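
The thread does not establish why the extra RAM helped, so the following is only a rough way to watch host memory while reproducing the problem. It is a minimal sketch that assumes a Linux host exposing /proc/meminfo with a MemAvailable field; the helper names and the sample numbers are hypothetical and are not measurements from this instance:

```python
def parse_meminfo(text):
    """Return {field: value in kB} from /proc/meminfo-style text."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values

def available_gib(text):
    """Convert the MemAvailable field (reported in kB) to GiB."""
    return parse_meminfo(text)["MemAvailable"] / (1024 ** 2)

# Hypothetical sample text, for illustration only.
SAMPLE_MEMINFO = "MemTotal:        6291456 kB\nMemAvailable:    1048576 kB\n"
print(f"{available_gib(SAMPLE_MEMINFO):.2f} GiB available")

# On a real host, one would read the live file instead:
#   with open("/proc/meminfo") as f:
#       print(available_gib(f.read()))
```

If the available figure drops toward zero just before the "Connection refused" errors appear, that would be consistent with memory pressure, though it would not prove it.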


#4

@ckguven glad to hear it’s working.