Neuron Runtime Profiler

Overview

Neuron Runtime Profiler is a performance profiling tool built into Neuron Runtime. When running a model, Neuron Runtime Profiler can output useful statistics about the model’s execution.

Usage

To enable Neuron Runtime Profiler, use one of the following methods:

  1. Set environment variable:

    MTKNN_ENABLE_PROFILER=1
    
  2. Set Android property:

    setprop debug.neuron.runtime.EnableProfiler 1
    

To save Neuron Runtime Profiler reports to a CSV log file, use one of the following methods:

  1. Set environment variable:

    MTKNN_PROFILER_LOG_PATH=<LOG_FULL_PATH>
    
  2. Set Android property:

    setprop debug.neuron.runtime.ProfilerLogPath <LOG_FULL_PATH>
    

Note

  • You must specify a full file path. For example, “/data/local/tmp/workspace/Profiler.log”.

  • You must create the log directories before execution.

  • If no log path is specified, Neuron Runtime Profiler writes log messages to the Android system log. You can read this using logcat.
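
A minimal setup sketch, assuming an adb shell session and the example path above; the final inference command is a placeholder, so substitute your usual neuronrt or application invocation:

    mkdir -p /data/local/tmp/workspace                  # the log directory must exist beforehand
    export MTKNN_ENABLE_PROFILER=1                      # enable the profiler
    export MTKNN_PROFILER_LOG_PATH=/data/local/tmp/workspace/Profiler.log
    <run inference as usual, e.g. with neuronrt>        # the report is written to Profiler.log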

Profiler Reports

A Neuron Runtime Profiler report consists of several tables in CSV format. Each table describes one scenario. The table’s rows list each sub-scenario of the scenario.

Table Components

  • Title: The scenario name and label. Example: “Runtime Status (LEVEL 1) - Summary”.

  • Subheader: Total execution time of this scenario. Example: “Total Execution Time: 1980.27001953125 ms over 100 times inference.”

  • Columns:

    • Time: The execution time of the sub-scenario. If a sub-scenario was executed more than one time, this value is the sum of its execution times.

    • Ratio: The ratio of the sub-scenario’s execution time to the total execution time of the scenario.

    • ExeTimes: The number of times that the sub-scenario was executed.

    • Time/Inf.: The average execution time per inference. If the sub-scenario was executed only one time, this value is left blank (shown as dashes in the report).

    • Description: The description of the sub-scenario.

    • GraphIdx: The index of the graph during inference.

    • ExeIdx: The index of the execution run (for example, #0 to #99 over 100 inferences).
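
As a quick worked example using the Level 1 summary shown later (the Inference row has Time = 1964.332 ms over ExeTimes = 100, and the scenario Total is 1980.270 ms):

    Time/Inf. = Time / ExeTimes = 1964.332 / 100      ≈ 19.643 ms
    Ratio     = Time / Total    = 1964.332 / 1980.270 ≈ 99.2%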

For the inference stage, the report stores profiler information at several levels. Lower levels (e.g. Level 1) include the function calling overhead of higher levels (e.g. Level 2). Note that the order of the sections is not guaranteed. For example:

  • Level 1 Summary

  • Level 1 Breakdown

    • The execution time history of a scenario. (Note: history is recorded only for a few major scenarios.)

  • Level 2 Summary

  • Level 2 Breakdown

Level Summary

The hierarchical structure of the profiler report is summarized below:

  • Level 1: Inference

    • Level 2: Input Preprocess

    • Level 2: Execute On Device

      • Level 3: APUSys Driver Execution Time

        • Level 4: APUSys IP Execution Time

    • Level 2: Output Postprocess

Example

In this example, the user runs inference on a model 100 times. The model contains only one graph, which is compiled into four MDLA subgraphs that execute in parallel. The example is divided into four levels.

Level 1 Example

The Runtime Status (LEVEL 1) section shows the overall runtime status during the 100 inference runs. Some pre-processing sub-scenarios were executed only once, so their “Time/Inf.” values are blank.

----------------------------------------------------------------------------------------------------
                                 Runtime Status (LEVEL 1) - Summary
----------------------------------------------------------------------------------------------------
  Total Execution Time: 1980.27001953125 ms over 100 times inference.
 ====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
       4.377,      0.2%,          1, -----------, Read Compiled Network
       1.201,      0.1%,          1, -----------, Extract Compiled Network
       3.822,      0.2%,          1, -----------, Map Device Memory
       0.392,      0.0%,          1, -----------, Fill Constant Data
       0.013,      0.0%,          1, -----------, Deserialize DLA Sections
       0.000,      0.0%,          1, -----------, Process target-specific data
       4.264,      0.2%,        100,       0.043, Set Inference Input
       1.868,      0.1%,        100,       0.019, Set Inference Output
    1964.332,     99.2%,        100,      19.643, Inference
    1980.270,    100.0%,        100,      19.705, Total
----------------------------------------------------------------------------------------------
                                Runtime Status (LEVEL 1) - Breakdown
----------------------------------------------------------------------------------------------
  Inference: avg Time/inf.: 19.64332275390625 ms over 100 times inference.
 ====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
      17.303,          ---,         #0, Inference
      16.273,          ---,         #1, Inference
       ...  ,         ... ,       ... ,    ...
      17.883,          ---,        #98, Inference
      18.883,          ---,        #99, Inference
 ----------------------------------------------------------------------------------------------
  • Read Compiled Network: Load the compiled network into the runtime.

  • Extract Compiled Network: Extract the compiled network to get the required information for preprocessing.

  • Map Device Memory: Allocate buffers on the device.

  • Fill Constant Data: Prepare the weight data.

  • Deserialize DLA Sections: Deserialize the data saved in the DLA file.

  • Process target-specific data: If necessary, pre-process the compiled result according to the target’s requirements.

  • Set Inference Input: Set and allocate resources for inference inputs.

  • Set Inference Output: Set and allocate resources for inference outputs.

  • Inference: Run inference on this network.

Level 2 Example

Level 2 is a breakdown of the Inference scenario in Level 1. There are two scenarios in Level 2:

  • Transit Status: The execution time of transitions between subgraphs.

  • APUSys Device Status: Shows execution data for the hardware device. The hardware device might be the MDLA, MVPU, VPU, or EDMA, depending on the application scenario.

----------------------------------------------------------------------------------------------------
                                 Transit Status (LEVEL 2) - Summary
----------------------------------------------------------------------------------------------------
  Total Execution Time: 177.159423828125 ms over 100 times inference.
 ====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
       0.001,      0.0%,          1, -----------, Build Transit Map
     177.158,    100.0%,        100,       1.772, Transit Execute
     177.159,    100.0%,        100,       1.772, Total
 ----------------------------------------------------------------------------------------------
  • Build Transit Map: Construct the transition mapping table between subgraphs.

  • Transit Execute: Transfer output data of the source subgraph to the input tensor of the destination subgraph. This is executed before a subgraph starts inference.

----------------------------------------------------------------------------------------------------
                              APUSys Device Status (LEVEL 2) - Summary
----------------------------------------------------------------------------------------------------
  Total Execution Time: 1728.017822265625 ms over 100 times inference.
 ====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
     176.188,     10.2%,        100,       1.762, Input Preprocess
    1415.348,     81.9%,        100,      14.153, Execute On Device
     136.482,      7.9%,        100,       1.365, Output Postprocess
    1728.018,    100.0%,        100,      17.280, Total
----------------------------------------------------------------------------------------------
                             APUSys Device Status (LEVEL 2) - Breakdown
----------------------------------------------------------------------------------------------
  Input Preprocess: avg Time/inf.: 1.7618798828125 ms over 100 times inference.
 ====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
       0.930,           #0,         #0, Input Preprocess
       1.135,           #0,         #1, Input Preprocess
       ...  ,         ... ,       ... ,    ...
       1.220,           #0,        #98, Input Preprocess
       1.529,           #0,        #99, Input Preprocess
 ----------------------------------------------------------------------------------------------
  Execute On Device: avg Time/inf.: 14.1534765625 ms over 100 times inference.
 ====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
      13.921,           #0,         #0, Execute On Device
      12.963,           #0,         #1, Execute On Device
       ...  ,         ... ,       ... ,    ...
      13.751,           #0,        #98, Execute On Device
      13.904,           #0,        #99, Execute On Device
 ----------------------------------------------------------------------------------------------
  Output Postprocess: avg Time/inf.: 1.36482177734375 ms over 100 times inference.
 ====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
       1.620,           #0,         #0, Output Postprocess
       1.162,           #0,         #1, Output Postprocess
       ...  ,         ... ,       ... ,    ...
       1.144,           #0,        #98, Output Postprocess
       1.527,           #0,        #99, Output Postprocess
 ----------------------------------------------------------------------------------------------
  • Input Preprocess: Pre-processing of input data, as required by the device.

  • Execute On Device: Get the device commands, pass them to the kernel driver, and then execute them.

  • Output Postprocess: Post-processing of output data, as required by the device.

Level 3 Example

Level 3 is a breakdown of Execute On Device in level 2. Level 3 might show MDLA, MVPU, VPU, or EDMA, depending on the application scenario.

----------------------------------------------------------------------------------------------------
                          APUSys Driver Execution Time (LEVEL 3) - Summary
----------------------------------------------------------------------------------------------------
  Total Execution Time: 5001.183000000001 ms over 100 times inference.
 ====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
    1229.648,     24.6%,        100,      12.296, MDLA_3_0 - subgraph#0-0
    1209.381,     24.2%,        100,      12.094, MDLA_3_0 - subgraph#0-1
    1297.657,     25.9%,        100,      12.977, MDLA_3_0 - subgraph#0-2
    1264.497,     25.3%,        100,      12.645, MDLA_3_0 - subgraph#0-3
    5001.183,    100.0%,        100,      50.012, Total
----------------------------------------------------------------------------------------------
                         APUSys Driver Execution Time (LEVEL 3) - Breakdown
----------------------------------------------------------------------------------------------
  MDLA_3_0 - subgraph#0-0: avg Time/inf.: 12.296480000000006 ms over 100 times inference.
 ====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
      12.296,           #0,         #0, MDLA_3_0 - subgraph#0-0
      11.994,           #0,         #1, MDLA_3_0 - subgraph#0-0
       ...  ,         ... ,       ... ,    ...
      12.188,           #0,        #98, MDLA_3_0 - subgraph#0-0
      12.223,           #0,        #99, MDLA_3_0 - subgraph#0-0
 ----------------------------------------------------------------------------------------------
  MDLA_3_0 - subgraph#0-1: avg Time/inf.: 12.093809999999998 ms over 100 times inference.
 ====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
      12.079,           #0,         #0, MDLA_3_0 - subgraph#0-1
      11.973,           #0,         #1, MDLA_3_0 - subgraph#0-1
       ...  ,         ... ,       ... ,    ...
      12.118,           #0,        #98, MDLA_3_0 - subgraph#0-1
      11.968,           #0,        #99, MDLA_3_0 - subgraph#0-1
 ----------------------------------------------------------------------------------------------
  MDLA_3_0 - subgraph#0-2: avg Time/inf.: 12.976569999999999 ms over 100 times inference.
 ====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
      13.047,           #0,         #0, MDLA_3_0 - subgraph#0-2
      12.091,           #0,         #1, MDLA_3_0 - subgraph#0-2
       ...  ,         ... ,       ... ,    ...
      12.769,           #0,        #98, MDLA_3_0 - subgraph#0-2
      13.003,           #0,        #99, MDLA_3_0 - subgraph#0-2
 ----------------------------------------------------------------------------------------------
  MDLA_3_0 - subgraph#0-3: avg Time/inf.: 12.64497 ms over 100 times inference.
 ====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
      12.668,           #0,         #0, MDLA_3_0 - subgraph#0-3
      12.061,           #0,         #1, MDLA_3_0 - subgraph#0-3
       ...  ,         ... ,       ... ,    ...
      12.361,           #0,        #98, MDLA_3_0 - subgraph#0-3
      12.598,           #0,        #99, MDLA_3_0 - subgraph#0-3
 ----------------------------------------------------------------------------------------------
  • APUSys Driver Execution Time: The execution time at the APUSys driver layer.

  • MDLA_3_0 - subgraph#X-Y: The execution time of subgraph#X-Y.

In this example, the original network is compiled into one MDLA main graph (X = 0), which is composed of four parallel subgraphs (Y ranges from 0 to 3). The effective execution time is therefore the maximum time among the four parallel subgraphs of graph#0, which is 12.977 ms. The Execute On Device time in Level 2 is 14.153 ms, which means that there is a (14.153 - 12.977) ms function call and kernel driver overhead between Level 2 and Level 3.
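
Restated with the values from the summaries above:

    Execute On Device (Level 2, per inference)  : 14.153 ms
    Max parallel subgraph time (Level 3)        : 12.977 ms   (subgraph#0-2)
    Level 2 -> Level 3 overhead                 : 14.153 - 12.977 = 1.176 ms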

Note

  • If there are parallel subgraphs in a scenario, then Total Execution Time can be ignored, because it is the sum of all of the sub-scenario times even though the sub-scenarios run in parallel.

  • Level 3 information may not be available for some devices, such as TFLiteCPU and GPU.

Level 4 Example

Level 4 is a breakdown of Level 3. Level 4 might show MDLA, MVPU, VPU, or EDMA, depending on the application scenario.

----------------------------------------------------------------------------------------------------
                                 APUSys IP Time (LEVEL 4) - Summary
----------------------------------------------------------------------------------------------------
  Total Execution Time: 4883.115 ms over 100 times inference.
 ====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
    1206.809,     24.7%,        100,      12.068, MDLA_3_0 - subgraph#0-0
    1190.617,     24.4%,        100,      11.906, MDLA_3_0 - subgraph#0-1
    1260.328,     25.8%,        100,      12.603, MDLA_3_0 - subgraph#0-2
    1225.361,     25.1%,        100,      12.254, MDLA_3_0 - subgraph#0-3
    4883.115,    100.0%,        100,      48.831, Total
----------------------------------------------------------------------------------------------
                                APUSys IP Time (LEVEL 4) - Breakdown
----------------------------------------------------------------------------------------------
  MDLA_3_0 - subgraph#0-0: avg Time/inf.: 12.068090000000002 ms over 100 times inference.
 ====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
      12.040,           #0,         #0, MDLA_3_0 - subgraph#0-0
      11.940,           #0,         #1, MDLA_3_0 - subgraph#0-0
       ...  ,         ... ,       ... ,    ...
      12.073,           #0,        #98, MDLA_3_0 - subgraph#0-0
      11.966,           #0,        #99, MDLA_3_0 - subgraph#0-0
 ----------------------------------------------------------------------------------------------
  MDLA_3_0 - subgraph#0-1: avg Time/inf.: 11.906169999999992 ms over 100 times inference.
 ====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
      11.878,           #0,         #0, MDLA_3_0 - subgraph#0-1
      11.876,           #0,         #1, MDLA_3_0 - subgraph#0-1
       ...  ,         ... ,       ... ,    ...
      11.896,           #0,        #98, MDLA_3_0 - subgraph#0-1
      11.719,           #0,        #99, MDLA_3_0 - subgraph#0-1
 ----------------------------------------------------------------------------------------------
  MDLA_3_0 - subgraph#0-2: avg Time/inf.: 12.60328 ms over 100 times inference.
 ====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
      12.630,           #0,         #0, MDLA_3_0 - subgraph#0-2
      12.021,           #0,         #1, MDLA_3_0 - subgraph#0-2
       ...  ,         ... ,       ... ,    ...
      12.331,           #0,        #98, MDLA_3_0 - subgraph#0-2
      12.553,           #0,        #99, MDLA_3_0 - subgraph#0-2
 ----------------------------------------------------------------------------------------------
  MDLA_3_0 - subgraph#0-3: avg Time/inf.: 12.253610000000004 ms over 100 times inference.
 ====Time===, ==GraphIdx==, ==ExeIdx==, ===Description===
      12.257,           #0,         #0, MDLA_3_0 - subgraph#0-3
      11.961,           #0,         #1, MDLA_3_0 - subgraph#0-3
       ...  ,         ... ,       ... ,    ...
      12.154,           #0,        #98, MDLA_3_0 - subgraph#0-3
      12.179,           #0,        #99, MDLA_3_0 - subgraph#0-3
 ----------------------------------------------------------------------------------------------
  • APUSys IP Time: The lowest-level IP execution time reported by the APUSys driver.

  • MDLA_3_0 - subgraph#X-Y: The IP execution time of subgraph#X-Y.

In this example, the Execute On Device time in Level 2 is 14.153 ms, and the maximum driver execution time in Level 3 is 12.977 ms. This means that there is a (14.153 - 12.977) ms function call and kernel driver overhead between Level 2 and Level 3, and a (12.977 - 12.603) ms overhead between the kernel driver issuing the commands and the hardware executing them (Level 3 vs. Level 4).
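
Restated with the values from the Level 3 and Level 4 summaries:

    Max driver time (Level 3)                   : 12.977 ms   (subgraph#0-2)
    Max IP time (Level 4)                       : 12.603 ms   (subgraph#0-2)
    Level 3 -> Level 4 overhead                 : 12.977 - 12.603 = 0.374 ms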

Note

  • If there are parallel subgraphs in a scenario, then Total Execution Time can be ignored, because it is the sum of all of the sub-scenario times even though the sub-scenarios run in parallel.

  • Level 4 information may not be available for some devices, such as TFLiteCPU and GPU.

Per-OP (Per-Operation) Performance Profiling

Neuron Runtime Profiler provides per-op performance profiling of MDLA compute devices.

Usage

To save the Neuron Runtime Profiler per-op report to a CSV file, use one of the following methods (a combined sketch follows the note below):

  1. Add the following option when compiling to DLA using ncc-tflite:

    --gen-debug-info
    
  2. Set the following environment variable when running inference with neuronrt:

    export MTKNN_PER_OP_PROFILE=1
    

Note

Limitations:

  • Not supported when the -c option is specified (e.g. -c 10).

  • Does not support MDLA 1.5/1.7/2.0 with Android S or later.

  • Does not support multiple MDLA 1.5/1.7/2.0 devices.
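
A minimal end-to-end sketch that applies both the compile-time flag and the runtime variable. The model file name and all other ncc-tflite and neuronrt options are placeholders; only --gen-debug-info and MTKNN_PER_OP_PROFILE come from this section:

    # Host side: compile the TFLite model to DLA with per-op debug info
    ncc-tflite --gen-debug-info <your usual compile options> model.tflite

    # Device side: enable per-op profiling, then run inference as usual
    export MTKNN_PER_OP_PROFILE=1
    <run neuronrt with the compiled DLA and your usual options>

Remember that the -c option must not be specified (see the limitations above).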

Per-OP Profiler Reports

The per-op profiler logs the following information in CSV format:

  • Overall performance

  • The execution time of each operation at different levels. Lower levels (e.g. Level 1) include the function calling overhead of higher levels (e.g. Level 2).

  • The location ID of each operation, which corresponds to the location information in the TFLite model.

MDLA 1.5/1.7/2.0 Example

In this example, the user runs inference on a model with 124 operations. The model is compiled into two subgraphs, which are executed by the EDMA and by MDLA 1.5. Neuron Runtime Profiler generates a per-op report only if the scenario uses a single MDLA core; it does not produce a report for scenarios that use multiple MDLA cores.

----------------------------------------------------------------------------------------------------
                              APUSys Device Status (LEVEL 2) - Summary
----------------------------------------------------------------------------------------------------
  Total Execution Time: 124.781005859375 ms over 1 times inference.
 ====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
       0.160,      0.1%,          1,       0.160, Input Preprocess
       6.204,      5.0%,          1,       6.204, Execute On Device (subgraph#0)
       2.666,      2.1%,          1,       2.666, Execute On Device (subgraph#1 location#0#1: CONV_2D + CONV_2D)
       0.885,      0.7%,          1,       0.885, Execute On Device (subgraph#1 location#2: MAX_POOL_2D)
        ...        ...          ...         ...    ...
        ...        ...          ...         ...    ...
        ...        ...          ...         ...    ...
       0.992,      0.8%,          1,       0.992, Execute On Device (subgraph#1 location#120: CONV_2D)
       0.985,      0.8%,          1,       0.985, Execute On Device (subgraph#1 location#121: CONV_2D)
       1.164,      0.9%,          1,       1.164, Execute On Device (subgraph#1 location#122: AVERAGE_POOL_2D)
       0.937,      0.8%,          1,       0.937, Execute On Device (subgraph#1 location#123: CONV_2D)
       0.069,      0.1%,          1,       0.069, Output Postprocess
     124.781,    100.0%,          1,     124.781, Total

----------------------------------------------------------------------------------------------------
                          APUSys Driver Execution Time (LEVEL 3) - Summary
----------------------------------------------------------------------------------------------------
  Total Execution Time: 38.046000000000014 ms over 1 times inference.
 ====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
       0.369,      1.0%,          1,       0.369, EDMA - subgraph#0
       1.318,      3.5%,          1,       1.318, MDLA - subgraph#1 location#0#1: CONV_2D + CONV_2D
       0.209,      0.5%,          1,       0.209, MDLA - subgraph#1 location#2: MAX_POOL_2D
        ...        ...          ...         ...    ...
        ...        ...          ...         ...    ...
        ...        ...          ...         ...    ...
       0.248,      0.7%,          1,       0.248, MDLA - subgraph#1 location#120: CONV_2D
       0.249,      0.7%,          1,       0.249, MDLA - subgraph#1 location#121: CONV_2D
       0.444,      1.2%,          1,       0.444, MDLA - subgraph#1 location#122: AVERAGE_POOL_2D
       0.353,      0.9%,          1,       0.353, MDLA - subgraph#1 location#123: CONV_2D
      38.046,    100.0%,          1,      38.046, Total

----------------------------------------------------------------------------------------------------
                                 APUSys IP Time (LEVEL 4) - Summary
----------------------------------------------------------------------------------------------------
  Total Execution Time: 22.043999999999993 ms over 1 times inference.
 ====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
       0.314,      1.4%,          1,       0.314, EDMA - subgraph#0
       1.077,      4.9%,          1,       1.077, MDLA - subgraph#1 location#0#1: CONV_2D + CONV_2D
       0.128,      0.6%,          1,       0.128, MDLA - subgraph#1 location#2: MAX_POOL_2D
        ...        ...          ...         ...    ...
        ...        ...          ...         ...    ...
        ...        ...          ...         ...    ...
       0.155,      0.7%,          1,       0.155, MDLA - subgraph#1 location#120: CONV_2D
       0.160,      0.7%,          1,       0.160, MDLA - subgraph#1 location#121: CONV_2D
       0.115,      0.5%,          1,       0.115, MDLA - subgraph#1 location#122: AVERAGE_POOL_2D
       0.057,      0.3%,          1,       0.057, MDLA - subgraph#1 location#123: CONV_2D
      22.044,    100.0%,          1,      22.044, Total

MDLA 3.0 Example

In this example, the user runs inference on a model with 305 operations. The model is compiled into one main graph, which contains four MDLA subgraphs for parallel execution. Neuron-5.0 supports per-op profiling in scenarios that use multiple MDLA cores; in such a scenario, the log contains one per-op profiling report per MDLA core.

For MDLA 3.0 applications, the per-op profiling result of each operation is reported at the IP time level (Level 4). It does not include the synchronization overhead between cores in a multiple-MDLA scenario.

----------------------------------------------------------------------------------------------------
                           MDLA Core-0 Per-OP Profiling Report - Summary
----------------------------------------------------------------------------------------------------
  Total Execution Time: 5.215456216216212 ms over 1 times inference.
 ====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
       0.043,      0.8%,          1,       0.043, MDLA - subgraph#0-1 location#0#1: CONV_2D+MAX_POOL_2D
       0.020,      0.4%,          1,       0.020, MDLA - subgraph#0-1 location#2: CUSTOM
       0.235,      4.5%,          1,       0.235, MDLA - subgraph#0-1 location#3#4#5: DEPTHWISE_CONV_2D+CONV_2D+CONV_2D
       0.068,      1.3%,          1,       0.068, MDLA - subgraph#0-1 location#6: CONCATENATION
        ...        ...         ...          ...    ...
       0.016,      0.3%,          1,       0.016, MDLA - subgraph#0-1 location#299: CONV_2D
       0.010,      0.2%,          1,       0.010, MDLA - subgraph#0-1 location#302#303: DEPTHWISE_CONV_2D+MEAN
       0.036,      0.7%,          1,       0.036, MDLA - subgraph#0-1 location#304: CONV_2D
       5.215,    100.0%,          1,       5.215, Total

----------------------------------------------------------------------------------------------------
                           MDLA Core-1 Per-OP Profiling Report - Summary
----------------------------------------------------------------------------------------------------
  Total Execution Time: 5.0539675675675655 ms over 1 times inference.
 ====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
       0.041,      0.8%,          1,       0.041, MDLA - subgraph#0-1 location#0#1: CONV_2D+MAX_POOL_2D
       0.050,      1.0%,          1,       0.050, MDLA - subgraph#0-1 location#2: CUSTOM
       0.070,      1.4%,          1,       0.070, MDLA - subgraph#0-1 location#3#4#5: DEPTHWISE_CONV_2D+CONV_2D+CONV_2D
       0.041,      0.8%,          2,       0.041, MDLA - subgraph#0-1 location#6: CONCATENATION
        ...        ...         ...          ...    ...
       0.010,      0.2%,          1,       0.010, MDLA - subgraph#0-1 location#302#303: DEPTHWISE_CONV_2D+MEAN
       0.033,      0.7%,          1,       0.033, MDLA - subgraph#0-1 location#304: CONV_2D
       5.054,    100.0%,          1,       5.054, Total

----------------------------------------------------------------------------------------------------
                           MDLA Core-2 Per-OP Profiling Report - Summary
----------------------------------------------------------------------------------------------------
  Total Execution Time: 5.783833513513511 ms over 1 times inference.
 ====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
       0.048,      0.8%,          1,       0.048, MDLA - subgraph#0-1 location#0#1: CONV_2D+MAX_POOL_2D
       0.020,      0.4%,          1,       0.020, MDLA - subgraph#0-1 location#2: CUSTOM
       0.771,     13.3%,          1,       0.771, MDLA - subgraph#0-1 location#3#4#5: DEPTHWISE_CONV_2D+CONV_2D+CONV_2D
        ...        ...         ...          ...    ...
       0.005,      0.1%,          1,       0.005, MDLA - subgraph#0-1 location#302#303: DEPTHWISE_CONV_2D+MEAN
       0.036,      0.6%,          1,       0.036, MDLA - subgraph#0-1 location#304: CONV_2D
       0.026,      0.4%,          1,       0.026, MDLA - subgraph#0-1 location#306: SOFTMAX
       5.784,    100.0%,          1,       5.784, Total

----------------------------------------------------------------------------------------------------
                           MDLA Core-3 Per-OP Profiling Report - Summary
----------------------------------------------------------------------------------------------------
  Total Execution Time: 5.429357837837842 ms over 1 times inference.
 ====Time===, ==Ratio==, =ExeTimes=, =Time/Inf.=, ===Description===
       0.042,      0.8%,          1,       0.042, MDLA - subgraph#0-1 location#0#1: CONV_2D+MAX_POOL_2D
       0.021,      0.4%,          1,       0.021, MDLA - subgraph#0-1 location#2: CUSTOM
       0.413,      7.6%,          1,       0.413, MDLA - subgraph#0-1 location#3#4#5: DEPTHWISE_CONV_2D+CONV_2D+CONV_2D
       0.091,      1.7%,          1,       0.091, MDLA - subgraph#0-1 location#7: CUSTOM
        ...        ...         ...          ...    ...
       0.005,      0.1%,          1,       0.005, MDLA - subgraph#0-1 location#302#303: DEPTHWISE_CONV_2D+MEAN
       0.035,      0.6%,          1,       0.035, MDLA - subgraph#0-1 location#304: CONV_2D
       0.035,      0.6%,          1,       0.035, MDLA - subgraph#0-1 location#306: SOFTMAX
       5.429,    100.0%,          1,       5.429, Total
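
Reading the four per-core reports together (a rough check that ignores the inter-core synchronization overhead mentioned above): the per-core totals are 5.215, 5.054, 5.784, and 5.429 ms, so the IP execution time of one inference is bounded by the slowest core:

    max(5.215, 5.054, 5.784, 5.429) ≈ 5.784 ms per inference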