About Nsight Systems:
I'm sorry to give you such a basic example, but it's the first time I'm using it.
I launched the profiler this way:
candido@zion-legion:~$ /opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/nsys profile -t cuda --cuda-flush-interval=10 /opt/mapd/omnisci-os-5.5.1-20210126-84edb9bd7b-Linux-x86_64/bin/omnisci_server --data /mapd_storage/data_test
Warning: LBR backtrace method is not supported on this platform. DWARF backtrace method will be used.
Collecting data...
After running one query I stopped the server with Ctrl+C, then I extracted this basic report:
/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/nsys stats --report gputrace --report cudaapisum --report gpukernsum --format table report1.qdrep
Using report1.sqlite file for stats and reports.
Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/gputrace report1.sqlite] to console...
+------------+----------------+--------+------+------+------+-------+------+------+---------+---------+---------+---------+------------+----------+----------+-------------------------+-----+------+----------------------------------+
| Start(sec) | Duration(nsec) | CorrId | GrdX | GrdY | GrdZ | BlkX | BlkY | BlkZ | Reg/Trd | StcSMem | DymSMem | Bytes | Thru(MB/s) | SrcMemKd | DstMemKd | Device | Ctx | Strm | Name |
+------------+----------------+--------+------+------+------+-------+------+------+---------+---------+---------+---------+------------+----------+----------+-------------------------+-----+------+----------------------------------+
| 41,000000 | 896 | 46 | | | | | | | | | | 8 | 8,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 41,000000 | 544 | 48 | | | | | | | | | | 8 | 14,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 41,000000 | 544 | 50 | | | | | | | | | | 8 | 14,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 41,000000 | 544 | 52 | | | | | | | | | | 8 | 14,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 41,000000 | 576 | 54 | | | | | | | | | | 8 | 13,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 41,000000 | 575 | 56 | | | | | | | | | | 4 | 6,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 41,000000 | 576 | 58 | | | | | | | | | | 4 | 6,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 41,000000 | 544 | 60 | | | | | | | | | | 16 | 29,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 41,000000 | 5.631 | 62 | | | | | | | | | | 49.152 | 8.728,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 41,000000 | 544 | 64 | | | | | | | | | | 4 | 7,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 41,000000 | 576 | 66 | | | | | | | | | | 16 | 27,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 41,000000 | 15.198 | 67 | 12 | 1 | 1 | 1.024 | 1 | 1 | 35 | 0 | 0 | | | | | GeForce GTX 1050 Ti (0) | 1 | 7 | multifrag_query_hoisted_literals |
| 41,000000 | 7.903 | 76 | | | | | | | | | | 49.152 | 6.219,000 | Device | Pageable | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy DtoH] |
| 41,000000 | 15.038 | 78 | | | | | | | | | | 98.304 | 6.537,000 | Device | Pageable | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy DtoH] |
| 41,000000 | 15.070 | 80 | | | | | | | | | | 98.304 | 6.523,000 | Device | Pageable | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy DtoH] |
| 89,000000 | 543 | 99 | | | | | | | | | | 8 | 14,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 89,000000 | 544 | 101 | | | | | | | | | | 8 | 14,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 89,000000 | 576 | 103 | | | | | | | | | | 8 | 13,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 89,000000 | 576 | 105 | | | | | | | | | | 8 | 13,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 89,000000 | 577 | 107 | | | | | | | | | | 8 | 13,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 89,000000 | 576 | 109 | | | | | | | | | | 4 | 6,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 89,000000 | 576 | 111 | | | | | | | | | | 4 | 6,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 89,000000 | 575 | 113 | | | | | | | | | | 16 | 27,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 89,000000 | 5.440 | 115 | | | | | | | | | | 49.152 | 9.035,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 89,000000 | 544 | 117 | | | | | | | | | | 4 | 7,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 89,000000 | 576 | 119 | | | | | | | | | | 16 | 27,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 89,000000 | 12.798 | 120 | 12 | 1 | 1 | 1.024 | 1 | 1 | 34 | 0 | 0 | | | | | GeForce GTX 1050 Ti (0) | 1 | 7 | multifrag_query_hoisted_literals |
| 89,000000 | 7.615 | 122 | | | | | | | | | | 49.152 | 6.454,000 | Device | Pageable | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy DtoH] |
| 89,000000 | 15.038 | 124 | | | | | | | | | | 98.304 | 6.537,000 | Device | Pageable | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy DtoH] |
| 89,000000 | 15.102 | 126 | | | | | | | | | | 98.304 | 6.509,000 | Device | Pageable | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy DtoH] |
| 96,000000 | 576 | 145 | | | | | | | | | | 8 | 13,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 96,000000 | 545 | 147 | | | | | | | | | | 8 | 14,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 96,000000 | 576 | 149 | | | | | | | | | | 8 | 13,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 96,000000 | 576 | 151 | | | | | | | | | | 4 | 6,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 96,000000 | 544 | 153 | | | | | | | | | | 4 | 7,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 96,000000 | 576 | 155 | | | | | | | | | | 8 | 13,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 96,000000 | 5.759 | 157 | | | | | | | | | | 49.152 | 8.534,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 96,000000 | 576 | 159 | | | | | | | | | | 4 | 6,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 96,000000 | 576 | 161 | | | | | | | | | | 8 | 13,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 96,000000 | 576 | 163 | | | | | | | | | | 8 | 13,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 96,000000 | 14.623 | 164 | 12 | 1 | 1 | 1.024 | 1 | 1 | 26 | 0 | 8 | | | | | GeForce GTX 1050 Ti (0) | 1 | 7 | multifrag_query_hoisted_literals |
| 96,000000 | 7.615 | 166 | | | | | | | | | | 49.152 | 6.454,000 | Device | Pageable | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy DtoH] |
| 96,000000 | 704 | 168 | | | | | | | | | | 8 | 11,000 | Device | Pageable | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy DtoH] |
| 120,000000 | 576 | 187 | | | | | | | | | | 8 | 13,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 120,000000 | 544 | 189 | | | | | | | | | | 8 | 14,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 120,000000 | 544 | 191 | | | | | | | | | | 16 | 29,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 120,000000 | 576 | 193 | | | | | | | | | | 8 | 13,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 120,000000 | 896 | 195 | | | | | | | | | | 16 | 17,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 120,000000 | 545 | 197 | | | | | | | | | | 16 | 29,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 120,000000 | 544 | 199 | | | | | | | | | | 4 | 7,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 120,000000 | 544 | 201 | | | | | | | | | | 4 | 7,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 120,000000 | 575 | 203 | | | | | | | | | | 16 | 27,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 120,000000 | 5.439 | 205 | | | | | | | | | | 49.152 | 9.036,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 120,000000 | 576 | 207 | | | | | | | | | | 4 | 6,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 120,000000 | 544 | 209 | | | | | | | | | | 16 | 29,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 120,000000 | 21.629 | 210 | 12 | 1 | 1 | 1.024 | 1 | 1 | 34 | 0 | 0 | | | | | GeForce GTX 1050 Ti (0) | 1 | 7 | multifrag_query_hoisted_literals |
| 120,000000 | 18.045 | 212 | | | | | | | | | | 49.152 | 2.723,000 | Device | Pageable | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy DtoH] |
| 120,000000 | 29.916 | 214 | | | | | | | | | | 196.608 | 6.572,000 | Device | Pageable | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy DtoH] |
| 120,000000 | 30.044 | 216 | | | | | | | | | | 196.608 | 6.544,000 | Device | Pageable | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy DtoH] |
| 146,000000 | 544 | 228 | | | | | | | | | | 92 | 169,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 146,000000 | 576 | 237 | | | | | | | | | | 8 | 13,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 146,000000 | 576 | 239 | | | | | | | | | | 8 | 13,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 146,000000 | 544 | 241 | | | | | | | | | | 8 | 14,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 146,000000 | 544 | 243 | | | | | | | | | | 8 | 14,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 146,000000 | 544 | 245 | | | | | | | | | | 8 | 14,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 146,000000 | 576 | 247 | | | | | | | | | | 4 | 6,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 146,000000 | 575 | 249 | | | | | | | | | | 4 | 6,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 146,000000 | 544 | 251 | | | | | | | | | | 16 | 29,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 146,000000 | 5.503 | 253 | | | | | | | | | | 49.152 | 8.931,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 146,000000 | 576 | 255 | | | | | | | | | | 4 | 6,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 146,000000 | 577 | 257 | | | | | | | | | | 16 | 27,000 | Pageable | Device | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy HtoD] |
| 146,000000 | 13.182 | 258 | 12 | 1 | 1 | 1.024 | 1 | 1 | 35 | 0 | 0 | | | | | GeForce GTX 1050 Ti (0) | 1 | 7 | multifrag_query_hoisted_literals |
| 146,000000 | 7.583 | 260 | | | | | | | | | | 49.152 | 6.481,000 | Device | Pageable | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy DtoH] |
| 146,000000 | 15.102 | 262 | | | | | | | | | | 98.304 | 6.509,000 | Device | Pageable | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy DtoH] |
| 146,000000 | 15.038 | 264 | | | | | | | | | | 98.304 | 6.537,000 | Device | Pageable | GeForce GTX 1050 Ti (0) | 1 | 7 | [CUDA memcpy DtoH] |
+------------+----------------+--------+------+------+------+-------+------+------+---------+---------+---------+---------+------------+----------+----------+-------------------------+-----+------+----------------------------------+
Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/cudaapisum report1.sqlite] to console...
+---------+-----------------+-----------+---------------+-------------+-------------+--------------------+
| Time(%) | Total Time (ns) | Num Calls | Average | Minimum | Maximum | Name |
+---------+-----------------+-----------+---------------+-------------+-------------+--------------------+
| 63,0 | 210.179.805 | 1 | 210.179.805,0 | 210.179.805 | 210.179.805 | cuCtxCreate_v2 |
| 18,0 | 62.207.501 | 5 | 12.441.500,0 | 11.229.654 | 13.941.043 | cuLinkComplete |
| 16,0 | 53.269.735 | 5 | 10.653.947,0 | 1.042.573 | 13.806.811 | cuModuleLoadDataEx |
| 1,0 | 4.027.716 | 1 | 4.027.716,0 | 4.027.716 | 4.027.716 | cuMemAlloc_v2 |
| 0,0 | 1.589.085 | 6 | 264.847,0 | 1.290 | 1.576.353 | cuLinkDestroy |
| 0,0 | 899.224 | 14 | 64.230,0 | 9.146 | 121.435 | cuMemcpyDtoH_v2 |
| 0,0 | 690.838 | 6 | 115.139,0 | 76.462 | 228.353 | cuLinkCreate_v2 |
| 0,0 | 381.841 | 56 | 6.818,0 | 2.779 | 33.834 | cuMemcpyHtoD_v2 |
| 0,0 | 121.654 | 5 | 24.330,0 | 13.613 | 47.737 | cuLaunchKernel |
| 0,0 | 39.575 | 30 | 1.319,0 | 286 | 19.080 | cuEventCreate |
+---------+-----------------+-----------+---------------+-------------+-------------+--------------------+
Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/gpukernsum report1.sqlite] to console...
+---------+-----------------+-----------+----------+---------+---------+----------------------------------+
| Time(%) | Total Time (ns) | Instances | Average | Minimum | Maximum | Name |
+---------+-----------------+-----------+----------+---------+---------+----------------------------------+
| 100,0 | 77.430 | 5 | 15.486,0 | 12.798 | 21.629 | multifrag_query_hoisted_literals |
+---------+-----------------+-----------+----------+---------+---------+----------------------------------+
multifrag_query_hoisted_literals is the kernel that runs the query; the remaining rows in the gputrace report are memory copies between the host and the GPU.
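As a quick sanity check on those numbers, here is a small Python sketch (my own, not something nsys produces). The durations are copied by hand from the gputrace table above, and parse_ns is just a hypothetical helper name. It assumes the dots in the Duration(nsec) column are thousands separators from my locale, which looks right because the kernel durations then add up to the gpukernsum total.

```python
def parse_ns(text):
    """Convert a duration such as '15.198' (dot = thousands separator) to an int of nanoseconds."""
    return int(text.replace(".", ""))


# Duration(nsec) of the five multifrag_query_hoisted_literals rows in gputrace.
kernel_ns = [parse_ns(d) for d in ["15.198", "12.798", "14.623", "21.629", "13.182"]]

# Copies around the first kernel launch (the rows with Start(sec) = 41).
htod_ns = [parse_ns(d) for d in
           ["896", "544", "544", "544", "576", "575", "576", "544", "5.631", "544", "576"]]
dtoh_ns = [parse_ns(d) for d in ["7.903", "15.038", "15.070"]]

kernel_total = sum(kernel_ns)                 # should match gpukernsum "Total Time (ns)"
copy_first = sum(htod_ns) + sum(dtoh_ns)      # GPU-side copy time around the first launch

print("kernel total (ns):", kernel_total)
print("first launch: kernel (ns) =", kernel_ns[0], " copies (ns) =", copy_first)
```

The kernel total comes out at 77430 ns, the same as the 77.430 shown by gpukernsum. For the first launch the copies add up to 49561 ns against 15198 ns for the kernel itself, so on the GPU side more time goes into moving data than into running the query. These are only the device-side durations; the host-side cost of the API calls is in the cudaapisum report.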