Exception:Query couldn't keep the entire working set of columns in GPU memory


#1

Hello,everyone.
I want to execute a sql statement on mapd with 10 millions data. But it told me Exception:Query couldn’t keep the entire working set of columns in GPU memory. But in fact, the VRAM was not fully used.

The sql statement is
select * from t_test t WHERE T.PASS_TIME >= ‘2017-04-02 00:00:00’ AND T.PASS_TIME < ‘2017-04-03 00:00:00’ and t.crossing_id = 385 and t.plate_info like ‘%23%’ order by t.pass_time limit 100 offset 900;

The nvidia-smi info is
MapD Server CPU Memory Summary:
MAX USE ALLOCATED FREE
102928.38 MB 10299.68 MB 12288.00 MB 1988.32 MB

MapD Server GPU Memory Summary:
[GPU] MAX USE ALLOCATED FREE
[0] 7478.38 MB 488.28 MB 2048.00 MB 1559.72 MB
[1] 7478.38 MB 488.28 MB 2048.00 MB 1559.72 MB
[2] 7478.38 MB 488.28 MB 2048.00 MB 1559.72 MB
[3] 7478.38 MB 61.04 MB 2048.00 MB 1986.96 MB
[4] 7478.38 MB 0.00 MB 2048.00 MB 2048.00 MB
[5] 7478.38 MB 0.00 MB 2048.00 MB 2048.00 MB
[6] 7478.38 MB 0.00 MB 2048.00 MB 2048.00 MB
[7] 7478.38 MB 0.00 MB 2048.00 MB 2048.00 MB


#2

First,i want to know why does mapd throw an exception ‘out of GPU memory’ with so much VRAM left?

Second,i think turning the watch-dog off or turning cpu-retry on will help me. So i restarted mapd_server like that and executed the sql statement again. It passed and. I found that, in the log file, there was description Query ran out of GPU memory, punt to CPU. I am interested in what it means,whether it will re-execute the sql statement on CPU or not? And whether it will take more time than running the sql statement only on CPU.

Thank you.


#3

turning off the watchdog and cpu-retry will help you a bit, but ins’t resolutive, because the top-n optimization will be used by the CPU. The top-n optimization has been introducted on mapd 3.x and is intended to run with a group-by query because to be performant is memory intensive.

As example I will take a quite small table from tpch benchmark

mapdql> select count() from part;
21934075
mapdql> select * from part where p_size between 10 and 20 order by p_size limit 100 offset 18000;
Exception: Not enough host memory to execute the query
mapdql> \cpu
mapdql> select * from part where p_size between 10 and 20 order by p_size limit 100 offset 18000;
100 rows returned.
Execution time: 242 ms, Total time: 303 ms

the second query has been pushed to cpu with top-n optimization and need a large amount of memory to run

E1127 08:27:57.614783 27781 ExecutionDispatch.cpp:239] Failed to allocate 241986240512 bytes of memory
E1127 08:27:57.614902 21250 MapDHandler.cpp:4352] Exception: Not enough host memory to execute the query

So I suggest to run this kind of querie in cpu mode; this will disable top-n optimization
From mapdql type \cpu and re-run your queries


#4

Do you mean that mapd will enable top-N optimization in GPU mode with watch-dog open? And if VRAM is not enough, the sql statement will be punted to CPU, the top-N optimization will be executed on CPU again and need a lot of memory? But if we run the sql statement in CPU mode, top-N optimization will be disabled?

If it is true, how to explain that the sql statement is punted to CPU with so much GPU VRAM left?

MapD Server CPU Memory Summary:
MAX USE ALLOCATED FREE
102928.38 MB 10299.68 MB 12288.00 MB 1988.32 MB

MapD Server GPU Memory Summary:
[GPU] MAX USE ALLOCATED FREE
[0] 7478.38 MB 488.28 MB 2048.00 MB 1559.72 MB
[1] 7478.38 MB 488.28 MB 2048.00 MB 1559.72 MB
[2] 7478.38 MB 488.28 MB 2048.00 MB 1559.72 MB
[3] 7478.38 MB 61.04 MB 2048.00 MB 1986.96 MB
[4] 7478.38 MB 0.00 MB 2048.00 MB 2048.00 MB
[5] 7478.38 MB 0.00 MB 2048.00 MB 2048.00 MB
[6] 7478.38 MB 0.00 MB 2048.00 MB 2048.00 MB
[7] 7478.38 MB 0.00 MB 2048.00 MB 2048.00 MB


#5

Considers those four queries

select * from part where p_size between 10 and 20 order by p_size limit 100;
select * from part where p_size between 10 and 20 order by p_size limit 100 offset 200;
select * from part where p_size between 10 and 20 order by p_size limit 100 offset 1200;
select * from part where p_size between 10 and 20 order by p_size limit 100 offset 2200;

QUERY 1
I1128 08:29:14.917922 29328 MapDHandler.cpp:727] sql_execute :mapd_LIe :query_str:select * from part where p_size between 10 and 20 order by p_size limit 100;
I1128 08:29:14.918151 29328 Calcite.cpp:376] User mapd catalog tpch sql ‘select * from part where p_size between 10 and 20 order by p_size limit 100;’
I1128 08:29:14.937813 29328 Calcite.cpp:395] Time in Thrift 0 (ms), Time in Java Calcite server 19 (ms)
I1128 08:29:15.008134 29328 MapDHandler.cpp:757] sql_execute-COMPLETED Total: 90 (ms), Execution: 88 (ms)

The query is executed on GPU because the top-n buffer is smaller than the default slab size (2gb)

this is the memory summary

MapD Server GPU Memory Summary:
[GPU] MAX USE ALLOCATED FREE
[0] 10381.47 MB 83.67 MB 2048.00 MB 1964.33 MB

As you can see is used just the memory needed fo store the p_size field for filtering

Second Query

I1128 08:29:34.746502 29328 MapDHandler.cpp:727] sql_execute :mapd_LIe :query_str:select * from part where p_size between 10 and 20 order by p_size limit 100 offset 200;
I1128 08:29:34.746831 29328 Calcite.cpp:376] User mapd catalog tpch sql 'select * from part where p_size between 10 and 20 order by p_size limit 100 offset 200;'
I1128 08:29:34.763414 29328 Calcite.cpp:395] Time in Thrift 0 (ms), Time in Java Calcite server 16 (ms)
**I1128 08:29:34.891887  3556 BufferMgr.cpp:40] OOM trace:**

** runImpl:305**
** prepareTopNHeapsDevBuffer:785**
** alloc:42 : device_id 0, num_bytes 4011917312**
** I1128 08:29:34.893864 3557 BufferMgr.cpp:40] OOM trace:**
** runImpl:305**
** prepareTopNHeapsDevBuffer:785**
** alloc:42 : device_id 0, num_bytes 4011917312**
I1128 08:29:34.893952 29328 RelAlgExecutor.cpp:1881] Query ran out of GPU memory, punt to CPU
I1128 08:29:35.035068 29328 MapDHandler.cpp:757] sql_execute-COMPLETED Total: 288 (ms), Execution: 286 (ms)

The query has been redirected to CPU because the memory needed for the top-n buffer exceeded the size of the slab

The third query has been run as the second one

1128 08:29:42.439116 29328 Calcite.cpp:376] User mapd catalog tpch sql 'select * from part where p_size between 10 and 20 order by p_size limit 100 offset 1200;'
I1128 08:29:42.472694 29328 Calcite.cpp:395] Time in Thrift 0 (ms), Time in Java Calcite server 33 (ms)
I1128 08:29:42.593864  3564 BufferMgr.cpp:40] OOM trace:
runImpl:305
prepareTopNHeapsDevBuffer:785
alloc:42 : device_id 0, num_bytes **17381261312**
I1128 08:29:42.595613  3565 BufferMgr.cpp:40] OOM trace:
runImpl:305
prepareTopNHeapsDevBuffer:785
alloc:42 : device_id 0, num_bytes **17381261312**
I1128 08:29:42.595716 29328 RelAlgExecutor.cpp:1881] Query ran out of GPU memory, punt to CPU
I1128 08:29:42.744390 29328 MapDHandler.cpp:757] sql_execute-COMPLETED Total: 305 (ms), Execution: 300 (ms)

Fourth query

I1128 08:29:47.022013 29328 MapDHandler.cpp:727] sql_execute :mapd_LIe :query_str:select * from part where p_size between 10 and 20 order by p_size limit 100 offset 2200;
I1128 08:29:47.022338 29328 Calcite.cpp:376] User mapd catalog tpch sql 'select * from part where p_size between 10 and 20 order by p_size limit 100 offset 2200;'
I1128 08:29:47.036861 29328 Calcite.cpp:395] Time in Thrift 0 (ms), Time in Java Calcite server 14 (ms)
I1128 08:29:47.151688  3572 BufferMgr.cpp:40] OOM trace:
runImpl:220
QueryExecutionContext:136 : group_buffer_size **30750605312**
E1128 08:29:47.151819  3572 ExecutionDispatch.cpp:239] Failed to allocate 30750605312 bytes of memory
E1128 08:29:47.152097 29328 MapDHandler.cpp:4352] **Exception: Not enough host memory to execute the query**

The query didn’t execute on GPU or CPU because the memory needed for the buffer
was insufficient on but devices (mapd is running on a tiny system)
But running explicitly in CPU mode the llvm generated is different so that the query can run

I1128 08:36:50.423583  3740 MapDHandler.cpp:727] sql_execute :mapd_2N6 :query_str:select * from part where p_size between 10 and 20 order by p_size limit 100 offset 2200;
I1128 08:36:50.423826  3740 Calcite.cpp:376] User mapd catalog tpch sql 'select * from part where p_size between 10 and 20 order by p_size limit 100 offset 2200;'
I1128 08:36:50.437062  3740 Calcite.cpp:395] Time in Thrift 1 (ms), Time in Java Calcite server 12 (ms)
I1128 08:36:50.675320  3740 MapDHandler.cpp:757] sql_execute-COMPLETED Total: 251 (ms), Execution: 243 (ms)

the memory summary looks like unchanged, but we can go depeer on GPU memory usage with the \memory_gpu command of mapdql

select * from part where p_size between 10 and 20 order by p_size limit 10;

GPU[0] Slab Information:
SLAB     ST_PAGE NUM_PAGE  TOUCH         CHUNK_KEY
   0           0   171360     15 USED    5,1,6,0,
   0      171360  4022944     29 FREE
---------------------------------------------------------------

a single slab of 2gb (4194304 pages) as been allocated to accomodate the column; the rest of free memory has been used for the top-n buffer

to increase the memory need we will change the query
select * from part where p_size between 10 and 20 order by p_size limit 100 offset 50, so the top-n buffer need the space to accomodate 150 entries

GPU[0] Slab Information:
SLAB     ST_PAGE NUM_PAGE  TOUCH         CHUNK_KEY
   0           0   171360     49 USED    5,1,6,0,
   0      171360  4022944     63 FREE
   1           0  4194304     64 FREE
---------------------------------------------------------------

the system has allocated another slab because the free space in the old one was insufficient.

so why the system doesn’t allocate additions slabs to serve queries that need biggers top-n buffers? Simply because the software now can use just a single slab, so everything that need more than 2gb has to be redirected to CPU that has a bigger slab (4gb) and at least in this particular case uses less memory for each entry


#6

Thank you so much, your answer is of great use to me. And i also have some questions.

As you say, the slab of GPU is 2GB, the slab of CPU is 4GB, if the memory for top-N buffer is more than 2GB, it will get memory from CPU, and if the memory for top-N buffer is more than 4GB, then both GPU and CPU can not provide slab with enough memory for the top-N buffer.

But in your third query, the information of top-N buffer is loc:42 : device_id 0, num_bytes **17381261312. The memory is about 16GB and the query can be executed successfully. Does it mean that cpu memory do not have the limit of slab size?


#7

Hi @hchen,

I’m happy that you find my replies helpful, but take into account I can be wrong because my knowledge of code is limited.

For some operations, the memory used by the GPU is larger than the same operation used in CPU (I don’t remember if it’s 4 or 8 times) and in the log, it’s showed the memory needed to run the operation on GPU, so it’s likely the memory usage on CPU is lower and I haven’t any evidence of more CPU slab

here is the output of \memory_cpu command

CPU[0] Slab Information:
SLAB     ST_PAGE NUM_PAGE  TOUCH         CHUNK_KEY
   0           0   171360    157 USED    5,1,1,0,
   0      171360   171360    158 USED    5,1,2,0,
   0      342720   171360    159 USED    5,1,3,0,
   0      514080   171360    160 USED    5,1,4,0,
   0      685440   171360    161 USED    5,1,5,0,
   0      856800   171360    162 USED    5,1,7,0,
   0     1028160   342720    163 USED    5,1,8,0,
   0     1370880   171360    164 USED    5,1,9,0,
   0     1542240    85680    165 USED    5,1,10,0,
   0     1627920   171360    167 USED    5,1,6,0,
   0     1799280  6589328      0 FREE

giung futher with the test

I1129 09:24:22.514248  2708 Calcite.cpp:376] User mapd catalog tpch sql 'select * from part where p_size between 10 and 20 order by p_size limit 100 offset 2100;'
I1129 09:24:22.524663  2708 Calcite.cpp:395] Time in Thrift 0 (ms), Time in Java Calcite server 8 (ms)
I1129 09:24:22.529317  2795 BufferMgr.cpp:40] OOM trace:
runImpl:305
prepareTopNHeapsDevBuffer:785
alloc:42 : device_id 0, num_bytes 29413670912
I1129 09:24:22.531236  2796 BufferMgr.cpp:40] OOM trace:
runImpl:305
prepareTopNHeapsDevBuffer:785
alloc:42 : device_id 0, num_bytes **29413670912**
I1129 09:24:22.531364  2708 RelAlgExecutor.cpp:1881] Query ran out of GPU memory, punt to CPU
I1129 09:24:22.667938  2708 MapDHandler.cpp:757] sql_execute-COMPLETED Total: 153 (ms), Execution: 145 (ms)

and 29413670912 is the higher value the query run on my system (I’ve just 16GB of memory installed now), so 28GB doesn’t fit

the free memory in CPU slab 0 is 6589328*512 = 3373735936, so maybe it’s enough to run the query.

switching to cpu mode looks like mapd is allocating memory but not so much (more or less 410mb) that’s released after the statement completion


#8

You mean that if we run the query on GPU, it need 29413670912bytes(28GB) of memory, so the query can not pass. But if the query is punted to CPU, it is likely that it need only less than 3373735936bytes(3GB). The reason for this phenomenon is that the process of running the query on GPU is different from running the query on CPU. The optimization of query on GPU may bring about much consumption of memory. Can i understand this problem in this way?

Besides, if the memory needed(the lower value) is beyond the size of slab on CPU, will mapd throw an exception and cancel the query? or maybe the memory needed is fixed,(more or less 410mb)?

Thank you for your detailed answer to my question.


#9

Well looks like that with CPU the memory footprint is signifincantly smaller, but I don’t think there isn’t anything fixed in this particular case; the llvm to serve this kind of query in CPU mode has to be different; I guess it’s doing a sort, but now I’ve not time to check, so I’ll do later.

Anyway the top-n code has been developed to serve aggregate queries with a few columns returned; if you use select * the memory footprint will be higher because all the fields have to be loaded in the buffer


#10

Thank you for your reply.