Neuron Runtime Profiler
Overview
Neuron Runtime Profiler is a performance profiling tool built into Neuron Runtime. When running a model, Neuron Runtime Profiler can output useful statistics about the model’s execution.
Usage
To enable Neuron Runtime Profiler, use one of the following methods:
Set environment variable:
MTKNN_ENABLE_PROFILER=1
Set Android property:
setprop debug.neuron.runtime.EnableProfiler 1
To save Neuron Runtime Profiler reports to a CSV log file, use one of the following methods:
Set environment variable:
MTKNN_PROFILER_LOG_PATH=<LOG_FULL_PATH>
Set Android property:
setprop debug.neuron.runtime.ProfilerLogPath <LOG_FULL_PATH>
Note
You must specify a full file path. For example, “/data/local/tmp/workspace/Profiler.log”.
You must create the log directories before execution.
If no log path is specified, Neuron Runtime Profiler writes log messages to the Android system log. You can read this using logcat.
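Putting the two environment-variable settings together, a minimal sketch of preparing a profiling run (the workspace path is illustrative; substitute your own):

```shell
# Create the log directory first -- the profiler will not create it.
LOG_DIR=./workspace                     # e.g. /data/local/tmp/workspace on device
mkdir -p "$LOG_DIR"

# Enable the profiler and point it at a CSV log file.
export MTKNN_ENABLE_PROFILER=1
export MTKNN_PROFILER_LOG_PATH="$LOG_DIR/Profiler.log"

# Run the model in this environment (e.g. with neuronrt); the report is
# then written to $MTKNN_PROFILER_LOG_PATH instead of the Android system log.
```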
Profiler Reports
A Neuron Runtime Profiler report consists of several tables in CSV format. Each table describes one scenario. The table’s rows list each sub-scenario of the scenario.
Table Components
Title: The scenario name and label. Example: “Runtime Status (LEVEL 1) - Summary”.
Subheader: Total execution time of this scenario. Example: “Total Execution Time: 1980.27001953125 ms over 100 times inference.”
Columns:
Time: The execution time of the sub-scenario. If a sub-scenario was executed more than one time, this value is the sum of its execution times.
Ratio: The ratio of the sub-scenario’s execution time to the total execution time of the scenario.
ExeTimes: The number of times that the sub-scenario was executed.
Time/Inf.: The average execution time per inference. If the sub-scenario was executed only once, this value is left blank (rendered as dashes in the report).
Description: The description of the sub-scenario.
GraphIdx: The index of the graph during inference.
ExeIdx: The index of the execution run.
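These columns are related: Time/Inf. is simply Time divided by ExeTimes. As a quick check against the "Set Inference Input" row of the level 1 example below (4.264 ms total over 100 runs):

```shell
# Time/Inf. = Time / ExeTimes, rounded to three decimal places.
awk 'BEGIN { time = 4.264; exe_times = 100; printf "%.3f\n", time / exe_times }'
# prints 0.043, matching the Time/Inf. column of that row
```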
For the inference stage, the report stores profiler information at several levels. Lower levels (e.g. level 1) include the function call overhead of higher levels (e.g. level 2). Note that the order of sections is not guaranteed. For example:
Level 1 Summary
Level 1 Breakdown
The execution time history of a scenario. (Note: history is recorded only for a few major scenarios.)
Level 2 Summary
Level 2 Breakdown
…
Level Summary
The hierarchical structure of the profiler report is summarized below:
Level 1: Inference
Level 2: Input Preprocess
Level 2: Execute On Device
Level 3: APUSys Driver Execution Time
Level 4: APUSys IP Execution Time
Level 2: Output Postprocess
Example
In this example, the user runs inference on a model 100 times. The model contains a single graph, which is compiled into four MDLA subgraphs that execute in parallel. The example is divided into four levels.
Level 1 Example
The Runtime Status (LEVEL 1) section shows the overall runtime status across the 100 inference runs. Some pre-processing sub-scenarios were executed only once, so their “Time/Inf.” values are blank.
----------------------------------------------------------------------------------------------------
Runtime Status (LEVEL 1) - Summary
----------------------------------------------------------------------------------------------------
Total Execution Time: 1980.27001953125 ms over 100 times inference.
====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
4.377, 0.2%, 1, -----------, Read Compiled Network
1.201, 0.1%, 1, -----------, Extract Compiled Network
3.822, 0.2%, 1, -----------, Map Device Memory
0.392, 0.0%, 1, -----------, Fill Constant Data
0.013, 0.0%, 1, -----------, Deserialize DLA Sections
0.000, 0.0%, 1, -----------, Process target-specific data
4.264, 0.2%, 100, 0.043, Set Inference Input
1.868, 0.1%, 100, 0.019, Set Inference Output
1964.332, 99.2%, 100, 19.643, Inference
1980.270, 100.0%, 100, 19.705, Total
----------------------------------------------------------------------------------------------
Runtime Status (LEVEL 1) - Breakdown
----------------------------------------------------------------------------------------------
Inference: avg Time/inf.: 19.64332275390625 ms over 100 times inference.
====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
17.303, ---, #0, Inference
16.273, ---, #1, Inference
... , ... , ... , ...
17.883, ---, #98, Inference
18.883, ---, #99, Inference
----------------------------------------------------------------------------------------------
Read Compiled Network: Load the compiled network into the runtime.
Extract Compiled Network: Extract the compiled network to get the required information for preprocessing.
Map Device Memory: Allocate buffers on the device.
Fill Constant Data: Prepare the weight data.
Deserialize DLA Sections: Deserialize the data saved in the DLA file.
Process target-specific data: If necessary, pre-process the compiled result according to the target’s requirements.
Set Inference Input: Set and allocate resources for inference inputs.
Set Inference Output: Set and allocate resources for inference outputs.
Inference: Run inference on this network.
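As a sanity check, the Total row of a summary table is the sum of the rows above it. Recomputing it for the level 1 summary:

```shell
# Sum the Time column of all level 1 sub-scenarios.
awk 'BEGIN { printf "%.2f\n", 4.377+1.201+3.822+0.392+0.013+0.000+4.264+1.868+1964.332 }'
# prints 1980.27, matching the reported Total Execution Time
```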
Level 2 Example
Level 2 is a breakdown of the Inference scenario in level 1. There are two scenarios in level 2:
Transit Status: The execution time of transitions between subgraphs.
APUSys Device Status: Shows execution data for the hardware device. The hardware device might be the MVPU, VPU, or EDMA depending on the application scenario.
----------------------------------------------------------------------------------------------------
Transit Status (LEVEL 2) - Summary
----------------------------------------------------------------------------------------------------
Total Execution Time: 177.159423828125 ms over 100 times inference.
====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
0.001, 0.0%, 1, -----------, Build Transit Map
177.158, 100.0%, 100, 1.772, Transit Execute
177.159, 100.0%, 100, 1.772, Total
----------------------------------------------------------------------------------------------
Build Transit Map: Construct the transition mapping table between subgraphs.
Transit Execute: Transfer output data of the source subgraph to the input tensor of the destination subgraph. This is executed before a subgraph starts inference.
----------------------------------------------------------------------------------------------------
APUSys Device Status (LEVEL 2) - Summary
----------------------------------------------------------------------------------------------------
Total Execution Time: 1728.017822265625 ms over 100 times inference.
====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
176.188, 10.2%, 100, 1.762, Input Preprocess
1415.348, 81.9%, 100, 14.153, Execute On Device
136.482, 7.9%, 100, 1.365, Output Postprocess
1728.018, 100.0%, 100, 17.280, Total
----------------------------------------------------------------------------------------------
APUSys Device Status (LEVEL 2) - Breakdown
----------------------------------------------------------------------------------------------
Input Preprocess: avg Time/inf.: 1.7618798828125 ms over 100 times inference.
====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
0.930, #0, #0, Input Preprocess
1.135, #0, #1, Input Preprocess
... , ... , ... , ...
1.220, #0, #98, Input Preprocess
1.529, #0, #99, Input Preprocess
----------------------------------------------------------------------------------------------
Execute On Device: avg Time/inf.: 14.1534765625 ms over 100 times inference.
====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
13.921, #0, #0, Execute On Device
12.963, #0, #1, Execute On Device
... , ... , ... , ...
13.751, #0, #98, Execute On Device
13.904, #0, #99, Execute On Device
----------------------------------------------------------------------------------------------
Output Postprocess: avg Time/inf.: 1.36482177734375 ms over 100 times inference.
====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
1.620, #0, #0, Output Postprocess
1.162, #0, #1, Output Postprocess
... , ... , ... , ...
1.144, #0, #98, Output Postprocess
1.527, #0, #99, Output Postprocess
----------------------------------------------------------------------------------------------
Input Preprocess: Pre-processing of input data. Depends on the device’s requirements.
Execute On Device: Get the commands of the device, pass these commands to the kernel level, and then execute.
Output Postprocess: Post-processing of output data. Depends on the device’s requirements.
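Note that the two level 2 scenarios do not fully account for the level 1 Inference time: Transit Status (177.159 ms) plus APUSys Device Status (1728.018 ms) is less than the 1964.332 ms Inference total. The residual is the level 1 to level 2 function call overhead:

```shell
# Level 1 Inference total minus both level 2 scenario totals.
awk 'BEGIN { printf "%.3f\n", 1964.332 - 177.159 - 1728.018 }'
# prints 59.155 (ms of runtime overhead across 100 inferences)
```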
Level 3 Example
Level 3 is a breakdown of Execute On Device in level 2. Level 3 might show MDLA, MVPU, VPU, or EDMA, depending on the application scenario.
----------------------------------------------------------------------------------------------------
APUSys Driver Execution Time (LEVEL 3) - Summary
----------------------------------------------------------------------------------------------------
Total Execution Time: 5001.183000000001 ms over 100 times inference.
====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
1229.648, 24.6%, 100, 12.296, MDLA_3_0 - subgraph#0-0
1209.381, 24.2%, 100, 12.094, MDLA_3_0 - subgraph#0-1
1297.657, 25.9%, 100, 12.977, MDLA_3_0 - subgraph#0-2
1264.497, 25.3%, 100, 12.645, MDLA_3_0 - subgraph#0-3
5001.183, 100.0%, 100, 50.012, Total
----------------------------------------------------------------------------------------------
APUSys Driver Execution Time (LEVEL 3) - Breakdown
----------------------------------------------------------------------------------------------
MDLA_3_0 - subgraph#0-0: avg Time/inf.: 12.296480000000006 ms over 100 times inference.
====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
12.296, #0, #0, MDLA_3_0 - subgraph#0-0
11.994, #0, #1, MDLA_3_0 - subgraph#0-0
... , ... , ... , ...
12.188, #0, #98, MDLA_3_0 - subgraph#0-0
12.223, #0, #99, MDLA_3_0 - subgraph#0-0
----------------------------------------------------------------------------------------------
MDLA_3_0 - subgraph#0-1: avg Time/inf.: 12.093809999999998 ms over 100 times inference.
====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
12.079, #0, #0, MDLA_3_0 - subgraph#0-1
11.973, #0, #1, MDLA_3_0 - subgraph#0-1
... , ... , ... , ...
12.118, #0, #98, MDLA_3_0 - subgraph#0-1
11.968, #0, #99, MDLA_3_0 - subgraph#0-1
----------------------------------------------------------------------------------------------
MDLA_3_0 - subgraph#0-2: avg Time/inf.: 12.976569999999999 ms over 100 times inference.
====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
13.047, #0, #0, MDLA_3_0 - subgraph#0-2
12.091, #0, #1, MDLA_3_0 - subgraph#0-2
... , ... , ... , ...
12.769, #0, #98, MDLA_3_0 - subgraph#0-2
13.003, #0, #99, MDLA_3_0 - subgraph#0-2
----------------------------------------------------------------------------------------------
MDLA_3_0 - subgraph#0-3: avg Time/inf.: 12.64497 ms over 100 times inference.
====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
12.668, #0, #0, MDLA_3_0 - subgraph#0-3
12.061, #0, #1, MDLA_3_0 - subgraph#0-3
... , ... , ... , ...
12.361, #0, #98, MDLA_3_0 - subgraph#0-3
12.598, #0, #99, MDLA_3_0 - subgraph#0-3
----------------------------------------------------------------------------------------------
APUSys Driver Execution Time: The execution time at the APUSys driver layer.
MDLA_3_0 - subgraph#X-Y: The execution time of subgraph#X-Y.
In this example, the original network is compiled into one MDLA main graph (X = 0), which is composed of four parallel subgraphs (Y = 0 to 3). Because the subgraphs run in parallel, the effective execution time is the maximum of the four subgraph#0-Y times, which is 12.977 ms. The Execute On Device time in level 2 is 14.153 ms, which means there is (14.153 - 12.977) ms of function call and kernel driver overhead between level 2 and level 3.
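Evaluating that subtraction with the numbers from the report:

```shell
# Level 2 Execute On Device time minus the slowest parallel subgraph (level 3).
awk 'BEGIN { l2 = 14.153; l3_max = 12.977; printf "%.3f\n", l2 - l3_max }'
# prints 1.176 (ms of per-inference overhead between level 2 and level 3)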
Note
If a scenario contains parallel subgraphs, the Total Execution Time can be ignored: it is the sum of all parallel sub-scenario times, whereas the effective wall-clock time is the maximum among them.
Level 3 information may not be available for some devices, such as TFLiteCPU and GPU.
Level 4 Example
Level 4 is a breakdown of the level 3 scenarios. Level 4 might show MDLA, MVPU, VPU, or EDMA, depending on the application scenario.
----------------------------------------------------------------------------------------------------
APUSys IP Time (LEVEL 4) - Summary
----------------------------------------------------------------------------------------------------
Total Execution Time: 4883.115 ms over 100 times inference.
====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
1206.809, 24.7%, 100, 12.068, MDLA_3_0 - subgraph#0-0
1190.617, 24.4%, 100, 11.906, MDLA_3_0 - subgraph#0-1
1260.328, 25.8%, 100, 12.603, MDLA_3_0 - subgraph#0-2
1225.361, 25.1%, 100, 12.254, MDLA_3_0 - subgraph#0-3
4883.115, 100.0%, 100, 48.831, Total
----------------------------------------------------------------------------------------------
APUSys IP Time (LEVEL 4) - Breakdown
----------------------------------------------------------------------------------------------
MDLA_3_0 - subgraph#0-0: avg Time/inf.: 12.068090000000002 ms over 100 times inference.
====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
12.040, #0, #0, MDLA_3_0 - subgraph#0-0
11.940, #0, #1, MDLA_3_0 - subgraph#0-0
... , ... , ... , ...
12.073, #0, #98, MDLA_3_0 - subgraph#0-0
11.966, #0, #99, MDLA_3_0 - subgraph#0-0
----------------------------------------------------------------------------------------------
MDLA_3_0 - subgraph#0-1: avg Time/inf.: 11.906169999999992 ms over 100 times inference.
====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
11.878, #0, #0, MDLA_3_0 - subgraph#0-1
11.876, #0, #1, MDLA_3_0 - subgraph#0-1
... , ... , ... , ...
11.896, #0, #98, MDLA_3_0 - subgraph#0-1
11.719, #0, #99, MDLA_3_0 - subgraph#0-1
----------------------------------------------------------------------------------------------
MDLA_3_0 - subgraph#0-2: avg Time/inf.: 12.60328 ms over 100 times inference.
====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
12.630, #0, #0, MDLA_3_0 - subgraph#0-2
12.021, #0, #1, MDLA_3_0 - subgraph#0-2
... , ... , ... , ...
12.331, #0, #98, MDLA_3_0 - subgraph#0-2
12.553, #0, #99, MDLA_3_0 - subgraph#0-2
----------------------------------------------------------------------------------------------
MDLA_3_0 - subgraph#0-3: avg Time/inf.: 12.253610000000004 ms over 100 times inference.
====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
12.257, #0, #0, MDLA_3_0 - subgraph#0-3
11.961, #0, #1, MDLA_3_0 - subgraph#0-3
... , ... , ... , ...
12.154, #0, #98, MDLA_3_0 - subgraph#0-3
12.179, #0, #99, MDLA_3_0 - subgraph#0-3
----------------------------------------------------------------------------------------------
APUSys IP Time: The lowest-level IP execution time reported by the APUSys driver.
MDLA_3_0 - subgraph#X-Y: The execution IP time of subgraph#X-Y.
In this example, the Execute On Device time in level 2 is 14.153 ms, and the maximum driver execution time in level 3 is 12.977 ms. This means there is (14.153 - 12.977) ms of function call and kernel driver overhead between level 2 and level 3, and a further (12.977 - 12.603) ms of overhead between the kernel driver and actual execution on the hardware IP.
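The two overhead figures can be evaluated directly from the reported numbers:

```shell
# Gap between level 2 and level 3, and between level 3 and level 4 (IP time).
awk 'BEGIN { printf "%.3f %.3f\n", 14.153 - 12.977, 12.977 - 12.603 }'
# prints "1.176 0.374"
```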
Note
If there are parallel subgraphs in a scenario, then Total Execution Time can be ignored because it is always the sum of one sub-scenario.
Level 4 information may not be available for some devices, such as TFLiteCPU and GPU.
Per-OP (Per-Operation) Performance Profiling
Neuron Runtime Profiler provides per-op performance profiling of MDLA compute devices.
Usage
To save Neuron Runtime Profiler per-op report to a CSV file, use one of the following methods:
Add the following option when compiling to DLA using ncc-tflite:
--gen-debug-info
Set the following environment variable during inference with neuronrt:
export MTKNN_PER_OP_PROFILE=1
Note
Limitations:
Not supported when the -c option is specified (e.g. -c 10).
Does not support MDLA 1.5/1.7/2.0 with Android S or later.
Does not support multiple MDLA 1.5/1.7/2.0 devices.
Per-OP Profiler Reports
Per-OP profiler logs the following information in CSV format:
Overall performance
Execution time of each operation at different levels. Lower levels (e.g. level 1) include the function call overhead of higher levels (e.g. level 2).
The location ID corresponds to the location information in the TFLite model.
MDLA 1.5/1.7/2.0 Example
In this example, the user runs inference on a model with 124 operations. The model is compiled into two subgraphs, which are executed by the EDMA and MDLA 1.5. Neuron Runtime Profiler generates a per-op report only when the scenario uses a single MDLA core; it does not produce a report for scenarios that use multiple MDLA cores.
----------------------------------------------------------------------------------------------------
APUSys Device Status (LEVEL 2) - Summary
----------------------------------------------------------------------------------------------------
Total Execution Time: 124.781005859375 ms over 1 times inference.
====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
0.160, 0.1%, 1, 0.160, Input Preprocess
6.204, 5.0%, 1, 6.204, Execute On Device (subgraph#0)
2.666, 2.1%, 1, 2.666, Execute On Device (subgraph#1 location#0#1: CONV_2D + CONV_2D)
0.885, 0.7%, 1, 0.885, Execute On Device (subgraph#1 location#2: MAX_POOL_2D)
... ... ... ... ...
... ... ... ... ...
... ... ... ... ...
0.992, 0.8%, 1, 0.992, Execute On Device (subgraph#1 location#120: CONV_2D)
0.985, 0.8%, 1, 0.985, Execute On Device (subgraph#1 location#121: CONV_2D)
1.164, 0.9%, 1, 1.164, Execute On Device (subgraph#1 location#122: AVERAGE_POOL_2D)
0.937, 0.8%, 1, 0.937, Execute On Device (subgraph#1 location#123: CONV_2D)
0.069, 0.1%, 1, 0.069, Output Postprocess
124.781, 100.0%, 1, 124.781, Total
----------------------------------------------------------------------------------------------------
APUSys Driver Execution Time (LEVEL 3) - Summary
----------------------------------------------------------------------------------------------------
Total Execution Time: 38.046000000000014 ms over 1 times inference.
====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
0.369, 1.0%, 1, 0.369, EDMA - subgraph#0
1.318, 3.5%, 1, 1.318, MDLA - subgraph#1 location#0#1: CONV_2D + CONV_2D
0.209, 0.5%, 1, 0.209, MDLA - subgraph#1 location#2: MAX_POOL_2D
... ... ... ... ...
... ... ... ... ...
... ... ... ... ...
0.248, 0.7%, 1, 0.248, MDLA - subgraph#1 location#120: CONV_2D
0.249, 0.7%, 1, 0.249, MDLA - subgraph#1 location#121: CONV_2D
0.444, 1.2%, 1, 0.444, MDLA - subgraph#1 location#122: AVERAGE_POOL_2D
0.353, 0.9%, 1, 0.353, MDLA - subgraph#1 location#123: CONV_2D
38.046, 100.0%, 1, 38.046, Total
----------------------------------------------------------------------------------------------------
APUSys IP Time (LEVEL 4) - Summary
----------------------------------------------------------------------------------------------------
Total Execution Time: 22.043999999999993 ms over 1 times inference.
====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
0.314, 1.4%, 1, 0.314, EDMA - subgraph#0
1.077, 4.9%, 1, 1.077, MDLA - subgraph#1 location#0#1: CONV_2D + CONV_2D
0.128, 0.6%, 1, 0.128, MDLA - subgraph#1 location#2: MAX_POOL_2D
... ... ... ... ...
... ... ... ... ...
... ... ... ... ...
0.155, 0.7%, 1, 0.155, MDLA - subgraph#1 location#120: CONV_2D
0.160, 0.7%, 1, 0.160, MDLA - subgraph#1 location#121: CONV_2D
0.115, 0.5%, 1, 0.115, MDLA - subgraph#1 location#122: AVERAGE_POOL_2D
0.057, 0.3%, 1, 0.057, MDLA - subgraph#1 location#123: CONV_2D
22.044, 100.0%, 1, 22.044, Total
MDLA 3.0 Example
In this example, the user runs inference on a model with 305 operations. The model is compiled into one main graph, which contains four MDLA subgraphs for parallel execution. Neuron-5.0 supports per-op profiling in scenarios that use multiple MDLA cores; in such a scenario, the log contains one per-op profiling report for each MDLA core.
For MDLA 3.0 applications, the per-op profiling result of each operation is measured at the IP time level. It does not include the synchronization overhead between cores in a multiple-MDLA scenario.
----------------------------------------------------------------------------------------------------
MDLA Core-0 Per-OP Profiling Report - Summary
----------------------------------------------------------------------------------------------------
Total Execution Time: 5.215456216216212 ms over 1 times inference.
====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
0.043, 0.8%, 1, 0.043, MDLA - subgraph#0-1 location#0#1: CONV_2D+MAX_POOL_2D
0.020, 0.4%, 1, 0.020, MDLA - subgraph#0-1 location#2: CUSTOM
0.235, 4.5%, 1, 0.235, MDLA - subgraph#0-1 location#3#4#5: DEPTHWISE_CONV_2D+CONV_2D+CONV_2D
0.068, 1.3%, 1, 0.068, MDLA - subgraph#0-1 location#6: CONCATENATION
... ... ... ... ...
0.016, 0.3%, 1, 0.016, MDLA - subgraph#0-1 location#299: CONV_2D
0.010, 0.2%, 1, 0.010, MDLA - subgraph#0-1 location#302#303: DEPTHWISE_CONV_2D+MEAN
0.036, 0.7%, 1, 0.036, MDLA - subgraph#0-1 location#304: CONV_2D
5.215, 100.0%, 1, 5.215, Total
----------------------------------------------------------------------------------------------------
MDLA Core-1 Per-OP Profiling Report - Summary
----------------------------------------------------------------------------------------------------
Total Execution Time: 5.0539675675675655 ms over 1 times inference.
====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
0.041, 0.8%, 1, 0.041, MDLA - subgraph#0-1 location#0#1: CONV_2D+MAX_POOL_2D
0.050, 1.0%, 1, 0.050, MDLA - subgraph#0-1 location#2: CUSTOM
0.070, 1.4%, 1, 0.070, MDLA - subgraph#0-1 location#3#4#5: DEPTHWISE_CONV_2D+CONV_2D+CONV_2D
0.041, 0.8%, 2, 0.041, MDLA - subgraph#0-1 location#6: CONCATENATION
... ... ... ... ...
0.010, 0.2%, 1, 0.010, MDLA - subgraph#0-1 location#302#303: DEPTHWISE_CONV_2D+MEAN
0.033, 0.7%, 1, 0.033, MDLA - subgraph#0-1 location#304: CONV_2D
5.054, 100.0%, 1, 5.054, Total
----------------------------------------------------------------------------------------------------
MDLA Core-2 Per-OP Profiling Report - Summary
----------------------------------------------------------------------------------------------------
Total Execution Time: 5.783833513513511 ms over 1 times inference.
====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
0.048, 0.8%, 1, 0.048, MDLA - subgraph#0-1 location#0#1: CONV_2D+MAX_POOL_2D
0.020, 0.4%, 1, 0.020, MDLA - subgraph#0-1 location#2: CUSTOM
0.771, 13.3%, 1, 0.771, MDLA - subgraph#0-1 location#3#4#5: DEPTHWISE_CONV_2D+CONV_2D+CONV_2D
... ... ... ... ...
0.005, 0.1%, 1, 0.005, MDLA - subgraph#0-1 location#302#303: DEPTHWISE_CONV_2D+MEAN
0.036, 0.6%, 1, 0.036, MDLA - subgraph#0-1 location#304: CONV_2D
0.026, 0.4%, 1, 0.026, MDLA - subgraph#0-1 location#306: SOFTMAX
5.784, 100.0%, 1, 5.784, Total
----------------------------------------------------------------------------------------------------
MDLA Core-3 Per-OP Profiling Report - Summary
----------------------------------------------------------------------------------------------------
Total Execution Time: 5.429357837837842 ms over 1 times inference.
====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
0.042, 0.8%, 1, 0.042, MDLA - subgraph#0-1 location#0#1: CONV_2D+MAX_POOL_2D
0.021, 0.4%, 1, 0.021, MDLA - subgraph#0-1 location#2: CUSTOM
0.413, 7.6%, 1, 0.413, MDLA - subgraph#0-1 location#3#4#5: DEPTHWISE_CONV_2D+CONV_2D+CONV_2D
0.091, 1.7%, 1, 0.091, MDLA - subgraph#0-1 location#7: CUSTOM
... ... ... ... ...
0.005, 0.1%, 1, 0.005, MDLA - subgraph#0-1 location#302#303: DEPTHWISE_CONV_2D+MEAN
0.035, 0.6%, 1, 0.035, MDLA - subgraph#0-1 location#304: CONV_2D
0.035, 0.6%, 1, 0.035, MDLA - subgraph#0-1 location#306: SOFTMAX
5.429, 100.0%, 1, 5.429, Total
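Because the per-op report is plain CSV, the hottest operations can be found with standard tools. A minimal sketch that ranks rows by execution time (the sample file name is illustrative; the three rows are abridged from the Core-0 report above):

```shell
# Write a few sample per-op rows, then sort numerically by the Time column.
cat > ./perop_sample.csv <<'EOF'
0.043, 0.8%, 1, 0.043, MDLA - subgraph#0-1 location#0#1: CONV_2D+MAX_POOL_2D
0.235, 4.5%, 1, 0.235, MDLA - subgraph#0-1 location#3#4#5: DEPTHWISE_CONV_2D+CONV_2D+CONV_2D
0.020, 0.4%, 1, 0.020, MDLA - subgraph#0-1 location#2: CUSTOM
EOF
sort -t, -k1,1 -rn ./perop_sample.csv | head -1
# slowest op first: the 0.235 ms fused DEPTHWISE_CONV_2D+CONV_2D+CONV_2D row
```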