Neuron Runtime Profiler

Overview

Neuron Runtime Profiler is a performance profiling tool built into Neuron Runtime. When running a model, Neuron Runtime Profiler can output useful statistics about the model’s execution.

Usage

To enable Neuron Runtime Profiler, use one of the following methods:

  1. Set environment variable:

    MTKNN_ENABLE_PROFILER=1
    
  2. Set Android property:

    setprop debug.neuron.runtime.EnableProfiler 1
    

To save Neuron Runtime Profiler reports to a CSV log file, use one of the following methods:

  1. Set environment variable:

    MTKNN_PROFILER_LOG_PATH=<LOG_FULL_PATH>
    
  2. Set Android property:

    setprop debug.neuron.runtime.ProfilerLogPath <LOG_FULL_PATH>
    

Note

  • You must specify a full file path. For example, “/data/local/tmp/workspace/Profiler.log”.

  • You must create the log directories before execution.

  • If no log path is specified, Neuron Runtime Profiler writes log messages to the Android system log. You can read this using logcat.
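
A minimal setup sketch, assuming an adb shell session and the example path above; the final inference command is a placeholder, so substitute your usual neuronrt or application invocation:

    mkdir -p /data/local/tmp/workspace                  # the log directory must exist beforehand
    export MTKNN_ENABLE_PROFILER=1                      # enable the profiler
    export MTKNN_PROFILER_LOG_PATH=/data/local/tmp/workspace/Profiler.log
    <run inference as usual, e.g. with neuronrt>        # the report is written to Profiler.log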

Profiler Reports

A Neuron Runtime Profiler report consists of several tables in CSV format. Each table describes one scenario. The table’s rows list each sub-scenario of the scenario.

Table Components

  • Title: The scenario name and label. Example: “Runtime Status (LEVEL 1) - Summary”.

  • Subheader: Total execution time of this scenario. Example: “Total Execution Time: 1980.27001953125 ms over 100 times inference.”

  • Columns:

    • Time: The execution time of the sub-scenario. If a sub-scenario was executed more than one time, this value is the sum of its execution times.

    • Ratio: The ratio of the sub-scenario’s execution time to the total execution time of the scenario.

    • ExeTimes: The number of times that the sub-scenario was executed.

    • Time/Inf.: The average execution time per inference. If the sub-scenario was executed only one time, this value is left blank (shown as dashes in the report).

    • Description: The description of the sub-scenario.

    • GraphIdx: The index of the graph during inference.

    • ExeIdx: The index of the execution run (for example, #0 to #99 over 100 inferences).
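
As a quick worked example using the Level 1 summary shown later (the Inference row has Time = 1964.332 ms over ExeTimes = 100, and the scenario Total is 1980.270 ms):

    Time/Inf. = Time / ExeTimes = 1964.332 / 100      ≈ 19.643 ms
    Ratio     = Time / Total    = 1964.332 / 1980.270 ≈ 99.2%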

For the inference stage, the report stores profiler information at several levels. Lower levels (e.g. Level 1) include the function calling overhead of higher levels (e.g. Level 2). Note that the order of the sections is not guaranteed. For example:

  • Level 1 Summary

  • Level 1 Breakdown

    • The execution time history of a scenario. (Note: history is recorded only for a few major scenarios.)

  • Level 2 Summary

  • Level 2 Breakdown

Level Summary

The hierarchical structure of the profiler report is summarized below:

  • Level 1: Inference

    • Level 2: Input Preprocess

    • Level 2: Execute On Device

      • Level 3: APUSys Driver Execution Time

        • Level 4: APUSys IP Execution Time

    • Level 2: Output Postprocess

Example

In this example, the user runs inference on a model 100 times. The model contains only one graph, which is compiled into four MDLA subgraphs that execute in parallel. The example is divided into four levels.

Level 1 Example

The Runtime Status (LEVEL 1) section shows the overall runtime status during the 100 inference runs. Some pre-processing sub-scenarios were executed only once, so their “Time/Inf.” values are blank.

----------------------------------------------------------------------------------------------------
                                 Runtime Status (LEVEL 1) - Summary
----------------------------------------------------------------------------------------------------
  Total Execution Time: 1980.27001953125 ms over 100 times inference.
 ====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
       4.377,      0.2%,          1, -----------, Read Compiled Network
       1.201,      0.1%,          1, -----------, Extract Compiled Network
       3.822,      0.2%,          1, -----------, Map Device Memory
       0.392,      0.0%,          1, -----------, Fill Constant Data
       0.013,      0.0%,          1, -----------, Deserialize DLA Sections
       0.000,      0.0%,          1, -----------, Process target-specific data
       4.264,      0.2%,        100,       0.043, Set Inference Input
       1.868,      0.1%,        100,       0.019, Set Inference Output
    1964.332,     99.2%,        100,      19.643, Inference
    1980.270,    100.0%,        100,      19.705, Total
----------------------------------------------------------------------------------------------
                                Runtime Status (LEVEL 1) - Breakdown
----------------------------------------------------------------------------------------------
  Inference: avg Time/inf.: 19.64332275390625 ms over 100 times inference.
 ====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
      17.303,          ---,         #0, Inference
      16.273,          ---,         #1, Inference
       ...  ,         ... ,       ... ,    ...
      17.883,          ---,        #98, Inference
      18.883,          ---,        #99, Inference
 ----------------------------------------------------------------------------------------------
  • Read Compiled Network: Load the compiled network into the runtime.

  • Extract Compiled Network: Extract the compiled network to get the required information for preprocessing.

  • Map Device Memory: Allocate buffers on the device.

  • Fill Constant Data: Prepare the weight data.

  • Deserialize DLA Sections: Deserialize the data saved in the DLA file.

  • Process target-specific data: If necessary, pre-process the compiled result according to the target’s requirements.

  • Set Inference Input: Set and allocate resources for inference inputs.

  • Set Inference Output: Set and allocate resources for inference outputs.

  • Inference: Run inference on this network.

Level 2 Example

Level 2 is a breakdown of the Inference scenario in Level 1. There are two scenarios in Level 2:

  • Transit Status: The execution time of transitions between subgraphs.

  • APUSys Device Status: Shows execution data for the hardware device. The hardware device might be the MDLA, MVPU, VPU, or EDMA, depending on the application scenario.

----------------------------------------------------------------------------------------------------
                                 Transit Status (LEVEL 2) - Summary
----------------------------------------------------------------------------------------------------
  Total Execution Time: 177.159423828125 ms over 100 times inference.
 ====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
       0.001,      0.0%,          1, -----------, Build Transit Map
     177.158,    100.0%,        100,       1.772, Transit Execute
     177.159,    100.0%,        100,       1.772, Total
 ----------------------------------------------------------------------------------------------
  • Build Transit Map: Construct the transition mapping table between subgraphs.

  • Transit Execute: Transfer output data of the source subgraph to the input tensor of the destination subgraph. This is executed before a subgraph starts inference.

----------------------------------------------------------------------------------------------------
                              APUSys Device Status (LEVEL 2) - Summary
----------------------------------------------------------------------------------------------------
  Total Execution Time: 1728.017822265625 ms over 100 times inference.
 ====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
     176.188,     10.2%,        100,       1.762, Input Preprocess
    1415.348,     81.9%,        100,      14.153, Execute On Device
     136.482,      7.9%,        100,       1.365, Output Postprocess
    1728.018,    100.0%,        100,      17.280, Total
----------------------------------------------------------------------------------------------
                             APUSys Device Status (LEVEL 2) - Breakdown
----------------------------------------------------------------------------------------------
  Input Preprocess: avg Time/inf.: 1.7618798828125 ms over 100 times inference.
 ====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
       0.930,           #0,         #0, Input Preprocess
       1.135,           #0,         #1, Input Preprocess
       ...  ,         ... ,       ... ,    ...
       1.220,           #0,        #98, Input Preprocess
       1.529,           #0,        #99, Input Preprocess
 ----------------------------------------------------------------------------------------------
  Execute On Device: avg Time/inf.: 14.1534765625 ms over 100 times inference.
 ====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
      13.921,           #0,         #0, Execute On Device
      12.963,           #0,         #1, Execute On Device
       ...  ,         ... ,       ... ,    ...
      13.751,           #0,        #98, Execute On Device
      13.904,           #0,        #99, Execute On Device
 ----------------------------------------------------------------------------------------------
  Output Postprocess: avg Time/inf.: 1.36482177734375 ms over 100 times inference.
 ====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
       1.620,           #0,         #0, Output Postprocess
       1.162,           #0,         #1, Output Postprocess
       ...  ,         ... ,       ... ,    ...
       1.144,           #0,        #98, Output Postprocess
       1.527,           #0,        #99, Output Postprocess
 ----------------------------------------------------------------------------------------------
  • Input Preprocess: Pre-processing of input data, as required by the device.

  • Execute On Device: Get the device commands, pass them to the kernel driver, and then execute them.

  • Output Postprocess: Post-processing of output data, as required by the device.

Level 3 Example

Level 3 is a breakdown of Execute On Device in level 2. Level 3 might show MDLA, MVPU, VPU, or EDMA, depending on the application scenario.

----------------------------------------------------------------------------------------------------
                          APUSys Driver Execution Time (LEVEL 3) - Summary
----------------------------------------------------------------------------------------------------
  Total Execution Time: 5001.183000000001 ms over 100 times inference.
 ====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
    1229.648,     24.6%,        100,      12.296, MDLA_3_0 - subgraph#0-0
    1209.381,     24.2%,        100,      12.094, MDLA_3_0 - subgraph#0-1
    1297.657,     25.9%,        100,      12.977, MDLA_3_0 - subgraph#0-2
    1264.497,     25.3%,        100,      12.645, MDLA_3_0 - subgraph#0-3
    5001.183,    100.0%,        100,      50.012, Total
----------------------------------------------------------------------------------------------
                         APUSys Driver Execution Time (LEVEL 3) - Breakdown
----------------------------------------------------------------------------------------------
  MDLA_3_0 - subgraph#0-0: avg Time/inf.: 12.296480000000006 ms over 100 times inference.
 ====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
      12.296,           #0,         #0, MDLA_3_0 - subgraph#0-0
      11.994,           #0,         #1, MDLA_3_0 - subgraph#0-0
       ...  ,         ... ,       ... ,    ...
      12.188,           #0,        #98, MDLA_3_0 - subgraph#0-0
      12.223,           #0,        #99, MDLA_3_0 - subgraph#0-0
 ----------------------------------------------------------------------------------------------
  MDLA_3_0 - subgraph#0-1: avg Time/inf.: 12.093809999999998 ms over 100 times inference.
 ====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
      12.079,           #0,         #0, MDLA_3_0 - subgraph#0-1
      11.973,           #0,         #1, MDLA_3_0 - subgraph#0-1
       ...  ,         ... ,       ... ,    ...
      12.118,           #0,        #98, MDLA_3_0 - subgraph#0-1
      11.968,           #0,        #99, MDLA_3_0 - subgraph#0-1
 ----------------------------------------------------------------------------------------------
  MDLA_3_0 - subgraph#0-2: avg Time/inf.: 12.976569999999999 ms over 100 times inference.
 ====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
      13.047,           #0,         #0, MDLA_3_0 - subgraph#0-2
      12.091,           #0,         #1, MDLA_3_0 - subgraph#0-2
       ...  ,         ... ,       ... ,    ...
      12.769,           #0,        #98, MDLA_3_0 - subgraph#0-2
      13.003,           #0,        #99, MDLA_3_0 - subgraph#0-2
 ----------------------------------------------------------------------------------------------
  MDLA_3_0 - subgraph#0-3: avg Time/inf.: 12.64497 ms over 100 times inference.
 ====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
      12.668,           #0,         #0, MDLA_3_0 - subgraph#0-3
      12.061,           #0,         #1, MDLA_3_0 - subgraph#0-3
       ...  ,         ... ,       ... ,    ...
      12.361,           #0,        #98, MDLA_3_0 - subgraph#0-3
      12.598,           #0,        #99, MDLA_3_0 - subgraph#0-3
 ----------------------------------------------------------------------------------------------
  • APUSys Driver Execution Time: The execution time at the APUSys driver layer.

  • MDLA_3_0 - subgraph#X-Y: The execution time of subgraph#X-Y.

In this example, the original network is compiled into one MDLA main graph (X = 0), which is composed of four parallel subgraphs (Y ranges from 0 to 3). The effective execution time is therefore the maximum time among the four parallel subgraphs of graph#0, which is 12.977 ms. The Execute On Device time in Level 2 is 14.153 ms, which means that there is a (14.153 - 12.977) ms function call and kernel driver overhead between Level 2 and Level 3.
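
Restated with the values from the summaries above:

    Execute On Device (Level 2, per inference)  : 14.153 ms
    Max parallel subgraph time (Level 3)        : 12.977 ms   (subgraph#0-2)
    Level 2 -> Level 3 overhead                 : 14.153 - 12.977 = 1.176 ms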

Note

  • If there are parallel subgraphs in a scenario, then Total Execution Time can be ignored, because it is the sum of all of the sub-scenario times even though the sub-scenarios run in parallel.

  • Level 3 information may not be available for some devices, such as TFLiteCPU and GPU.

Level 4 Example

Level 4 is a breakdown of Level 3. Level 4 might show MDLA, MVPU, VPU, or EDMA, depending on the application scenario.

----------------------------------------------------------------------------------------------------
                                 APUSys IP Time (LEVEL 4) - Summary
----------------------------------------------------------------------------------------------------
  Total Execution Time: 4883.115 ms over 100 times inference.
 ====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
    1206.809,     24.7%,        100,      12.068, MDLA_3_0 - subgraph#0-0
    1190.617,     24.4%,        100,      11.906, MDLA_3_0 - subgraph#0-1
    1260.328,     25.8%,        100,      12.603, MDLA_3_0 - subgraph#0-2
    1225.361,     25.1%,        100,      12.254, MDLA_3_0 - subgraph#0-3
    4883.115,    100.0%,        100,      48.831, Total
----------------------------------------------------------------------------------------------
                                APUSys IP Time (LEVEL 4) - Breakdown
----------------------------------------------------------------------------------------------
  MDLA_3_0 - subgraph#0-0: avg Time/inf.: 12.068090000000002 ms over 100 times inference.
 ====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
      12.040,           #0,         #0, MDLA_3_0 - subgraph#0-0
      11.940,           #0,         #1, MDLA_3_0 - subgraph#0-0
       ...  ,         ... ,       ... ,    ...
      12.073,           #0,        #98, MDLA_3_0 - subgraph#0-0
      11.966,           #0,        #99, MDLA_3_0 - subgraph#0-0
 ----------------------------------------------------------------------------------------------
  MDLA_3_0 - subgraph#0-1: avg Time/inf.: 11.906169999999992 ms over 100 times inference.
 ====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
      11.878,           #0,         #0, MDLA_3_0 - subgraph#0-1
      11.876,           #0,         #1, MDLA_3_0 - subgraph#0-1
       ...  ,         ... ,       ... ,    ...
      11.896,           #0,        #98, MDLA_3_0 - subgraph#0-1
      11.719,           #0,        #99, MDLA_3_0 - subgraph#0-1
 ----------------------------------------------------------------------------------------------
  MDLA_3_0 - subgraph#0-2: avg Time/inf.: 12.60328 ms over 100 times inference.
 ====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
      12.630,           #0,         #0, MDLA_3_0 - subgraph#0-2
      12.021,           #0,         #1, MDLA_3_0 - subgraph#0-2
       ...  ,         ... ,       ... ,    ...
      12.331,           #0,        #98, MDLA_3_0 - subgraph#0-2
      12.553,           #0,        #99, MDLA_3_0 - subgraph#0-2
 ----------------------------------------------------------------------------------------------
  MDLA_3_0 - subgraph#0-3: avg Time/inf.: 12.253610000000004 ms over 100 times inference.
 ====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
      12.257,           #0,         #0, MDLA_3_0 - subgraph#0-3
      11.961,           #0,         #1, MDLA_3_0 - subgraph#0-3
       ...  ,         ... ,       ... ,    ...
      12.154,           #0,        #98, MDLA_3_0 - subgraph#0-3
      12.179,           #0,        #99, MDLA_3_0 - subgraph#0-3
 ----------------------------------------------------------------------------------------------
  • APUSys IP Time: The lowest-level IP execution time reported by the APUSys driver.

  • MDLA_3_0 - subgraph#X-Y: The IP execution time of subgraph#X-Y.

In this example, the Execute On Device time in Level 2 is 14.153 ms, and the maximum driver execution time in Level 3 is 12.977 ms. This means that there is a (14.153 - 12.977) ms function call and kernel driver overhead between Level 2 and Level 3, and a (12.977 - 12.603) ms overhead between the kernel driver issuing the commands and the hardware executing them (Level 3 vs. Level 4).
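
Restated with the values from the Level 3 and Level 4 summaries:

    Max driver time (Level 3)                   : 12.977 ms   (subgraph#0-2)
    Max IP time (Level 4)                       : 12.603 ms   (subgraph#0-2)
    Level 3 -> Level 4 overhead                 : 12.977 - 12.603 = 0.374 ms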

Note

  • If there are parallel subgraphs in a scenario, then Total Execution Time can be ignored, because it is the sum of all of the sub-scenario times even though the sub-scenarios run in parallel.

  • Level 4 information may not be available for some devices, such as TFLiteCPU and GPU.

Per-OP (Per-Operation) Performance Profiling

Neuron Runtime Profiler provides per-op performance profiling of MDLA compute devices.

Usage

To save the Neuron Runtime Profiler per-op report to a CSV file, use one of the following methods (a combined sketch follows the note below):

  1. Add the following option when compiling to DLA using ncc-tflite:

    --gen-debug-info
    
  2. Set the following environment variable when running inference with neuronrt:

    export MTKNN_PER_OP_PROFILE=1
    

Note

Limitations:

  • Not supported when the -c option is specified (e.g. -c 10).

  • Does not support MDLA 1.5/1.7/2.0 with Android S or later.

  • Does not support multiple MDLA 1.5/1.7/2.0 devices.
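
A minimal end-to-end sketch that applies both the compile-time flag and the runtime variable. The model file name and all other ncc-tflite and neuronrt options are placeholders; only --gen-debug-info and MTKNN_PER_OP_PROFILE come from this section:

    # Host side: compile the TFLite model to DLA with per-op debug info
    ncc-tflite --gen-debug-info <your usual compile options> model.tflite

    # Device side: enable per-op profiling, then run inference as usual
    export MTKNN_PER_OP_PROFILE=1
    <run neuronrt with the compiled DLA and your usual options>

Remember that the -c option must not be specified (see the limitations above).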

Per-OP Profiler Reports

The per-op profiler logs the following information in CSV format:

  • Overall performance

  • The execution time of each operation at different levels. Lower levels (e.g. Level 1) include the function calling overhead of higher levels (e.g. Level 2).

  • The location ID of each operation, which corresponds to the location information in the TFLite model.

MDLA 1.5/1.7/2.0 Example

In this example, the user runs inference on a model with 124 operations. The model is compiled into two subgraphs, which are executed by the EDMA and by MDLA 1.5. Neuron Runtime Profiler generates a per-op report only if the scenario uses a single MDLA core; it does not produce a report for scenarios that use multiple MDLA cores.

----------------------------------------------------------------------------------------------------
                              APUSys Device Status (LEVEL 2) - Summary
----------------------------------------------------------------------------------------------------
  Total Execution Time: 124.781005859375 ms over 1 times inference.
 ====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
       0.160,      0.1%,          1,       0.160, Input Preprocess
       6.204,      5.0%,          1,       6.204, Execute On Device (subgraph#0)
       2.666,      2.1%,          1,       2.666, Execute On Device (subgraph#1 location#0#1: CONV_2D + CONV_2D)
       0.885,      0.7%,          1,       0.885, Execute On Device (subgraph#1 location#2: MAX_POOL_2D)
        ...        ...          ...         ...    ...
        ...        ...          ...         ...    ...
        ...        ...          ...         ...    ...
       0.992,      0.8%,          1,       0.992, Execute On Device (subgraph#1 location#120: CONV_2D)
       0.985,      0.8%,          1,       0.985, Execute On Device (subgraph#1 location#121: CONV_2D)
       1.164,      0.9%,          1,       1.164, Execute On Device (subgraph#1 location#122: AVERAGE_POOL_2D)
       0.937,      0.8%,          1,       0.937, Execute On Device (subgraph#1 location#123: CONV_2D)
       0.069,      0.1%,          1,       0.069, Output Postprocess
     124.781,    100.0%,          1,     124.781, Total

----------------------------------------------------------------------------------------------------
                          APUSys Driver Execution Time (LEVEL 3) - Summary
----------------------------------------------------------------------------------------------------
  Total Execution Time: 38.046000000000014 ms over 1 times inference.
 ====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
       0.369,      1.0%,          1,       0.369, EDMA - subgraph#0
       1.318,      3.5%,          1,       1.318, MDLA - subgraph#1 location#0#1: CONV_2D + CONV_2D
       0.209,      0.5%,          1,       0.209, MDLA - subgraph#1 location#2: MAX_POOL_2D
        ...        ...          ...         ...    ...
        ...        ...          ...         ...    ...
        ...        ...          ...         ...    ...
       0.248,      0.7%,          1,       0.248, MDLA - subgraph#1 location#120: CONV_2D
       0.249,      0.7%,          1,       0.249, MDLA - subgraph#1 location#121: CONV_2D
       0.444,      1.2%,          1,       0.444, MDLA - subgraph#1 location#122: AVERAGE_POOL_2D
       0.353,      0.9%,          1,       0.353, MDLA - subgraph#1 location#123: CONV_2D
      38.046,    100.0%,          1,      38.046, Total

----------------------------------------------------------------------------------------------------
                                 APUSys IP Time (LEVEL 4) - Summary
----------------------------------------------------------------------------------------------------
  Total Execution Time: 22.043999999999993 ms over 1 times inference.
 ====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
       0.314,      1.4%,          1,       0.314, EDMA - subgraph#0
       1.077,      4.9%,          1,       1.077, MDLA - subgraph#1 location#0#1: CONV_2D + CONV_2D
       0.128,      0.6%,          1,       0.128, MDLA - subgraph#1 location#2: MAX_POOL_2D
        ...        ...          ...         ...    ...
        ...        ...          ...         ...    ...
        ...        ...          ...         ...    ...
       0.155,      0.7%,          1,       0.155, MDLA - subgraph#1 location#120: CONV_2D
       0.160,      0.7%,          1,       0.160, MDLA - subgraph#1 location#121: CONV_2D
       0.115,      0.5%,          1,       0.115, MDLA - subgraph#1 location#122: AVERAGE_POOL_2D
       0.057,      0.3%,          1,       0.057, MDLA - subgraph#1 location#123: CONV_2D
      22.044,    100.0%,          1,      22.044, Total

MDLA 3.0 Example

In this example, the user runs inference on a model with 305 operations. The model is compiled into one main graph, which contains four MDLA subgraphs for parallel execution. Neuron-5.0 supports per-op profiling in scenarios that use multiple MDLA cores; in such a scenario, the log contains one per-op profiling report per MDLA core.

For MDLA 3.0 applications, the per-op profiling result of each operation is reported at the IP time level (Level 4). It does not include the synchronization overhead between cores in a multiple-MDLA scenario.

----------------------------------------------------------------------------------------------------
                           MDLA Core-0 Per-OP Profiling Report - Summary
----------------------------------------------------------------------------------------------------
  Total Execution Time: 5.215456216216212 ms over 1 times inference.
 ====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
       0.043,      0.8%,          1,       0.043, MDLA - subgraph#0-1 location#0#1: CONV_2D+MAX_POOL_2D
       0.020,      0.4%,          1,       0.020, MDLA - subgraph#0-1 location#2: CUSTOM
       0.235,      4.5%,          1,       0.235, MDLA - subgraph#0-1 location#3#4#5: DEPTHWISE_CONV_2D+CONV_2D+CONV_2D
       0.068,      1.3%,          1,       0.068, MDLA - subgraph#0-1 location#6: CONCATENATION
        ...        ...         ...          ...    ...
       0.016,      0.3%,          1,       0.016, MDLA - subgraph#0-1 location#299: CONV_2D
       0.010,      0.2%,          1,       0.010, MDLA - subgraph#0-1 location#302#303: DEPTHWISE_CONV_2D+MEAN
       0.036,      0.7%,          1,       0.036, MDLA - subgraph#0-1 location#304: CONV_2D
       5.215,    100.0%,          1,       5.215, Total

----------------------------------------------------------------------------------------------------
                           MDLA Core-1 Per-OP Profiling Report - Summary
----------------------------------------------------------------------------------------------------
  Total Execution Time: 5.0539675675675655 ms over 1 times inference.
 ====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
       0.041,      0.8%,          1,       0.041, MDLA - subgraph#0-1 location#0#1: CONV_2D+MAX_POOL_2D
       0.050,      1.0%,          1,       0.050, MDLA - subgraph#0-1 location#2: CUSTOM
       0.070,      1.4%,          1,       0.070, MDLA - subgraph#0-1 location#3#4#5: DEPTHWISE_CONV_2D+CONV_2D+CONV_2D
       0.041,      0.8%,          2,       0.041, MDLA - subgraph#0-1 location#6: CONCATENATION
        ...        ...         ...          ...    ...
       0.010,      0.2%,          1,       0.010, MDLA - subgraph#0-1 location#302#303: DEPTHWISE_CONV_2D+MEAN
       0.033,      0.7%,          1,       0.033, MDLA - subgraph#0-1 location#304: CONV_2D
       5.054,    100.0%,          1,       5.054, Total

----------------------------------------------------------------------------------------------------
                           MDLA Core-2 Per-OP Profiling Report - Summary
----------------------------------------------------------------------------------------------------
  Total Execution Time: 5.783833513513511 ms over 1 times inference.
 ====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
       0.048,      0.8%,          1,       0.048, MDLA - subgraph#0-1 location#0#1: CONV_2D+MAX_POOL_2D
       0.020,      0.4%,          1,       0.020, MDLA - subgraph#0-1 location#2: CUSTOM
       0.771,     13.3%,          1,       0.771, MDLA - subgraph#0-1 location#3#4#5: DEPTHWISE_CONV_2D+CONV_2D+CONV_2D
        ...        ...         ...          ...    ...
       0.005,      0.1%,          1,       0.005, MDLA - subgraph#0-1 location#302#303: DEPTHWISE_CONV_2D+MEAN
       0.036,      0.6%,          1,       0.036, MDLA - subgraph#0-1 location#304: CONV_2D
       0.026,      0.4%,          1,       0.026, MDLA - subgraph#0-1 location#306: SOFTMAX
       5.784,    100.0%,          1,       5.784, Total

----------------------------------------------------------------------------------------------------
                           MDLA Core-3 Per-OP Profiling Report - Summary
----------------------------------------------------------------------------------------------------
  Total Execution Time: 5.429357837837842 ms over 1 times inference.
 ====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
       0.042,      0.8%,          1,       0.042, MDLA - subgraph#0-1 location#0#1: CONV_2D+MAX_POOL_2D
       0.021,      0.4%,          1,       0.021, MDLA - subgraph#0-1 location#2: CUSTOM
       0.413,      7.6%,          1,       0.413, MDLA - subgraph#0-1 location#3#4#5: DEPTHWISE_CONV_2D+CONV_2D+CONV_2D
       0.091,      1.7%,          1,       0.091, MDLA - subgraph#0-1 location#7: CUSTOM
        ...        ...         ...          ...    ...
       0.005,      0.1%,          1,       0.005, MDLA - subgraph#0-1 location#302#303: DEPTHWISE_CONV_2D+MEAN
       0.035,      0.6%,          1,       0.035, MDLA - subgraph#0-1 location#304: CONV_2D
       0.035,      0.6%,          1,       0.035, MDLA - subgraph#0-1 location#306: SOFTMAX
       5.429,    100.0%,          1,       5.429, Total
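
Reading the four per-core reports together (a rough check that ignores the inter-core synchronization overhead mentioned above): the per-core totals are 5.215, 5.054, 5.784, and 5.429 ms, so the IP execution time of one inference is bounded by the slowest core:

    max(5.215, 5.054, 5.784, 5.429) ≈ 5.784 ms per inference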