.. include:: /keyword.rst

==========
|i350-EVK|
==========

.. contents:: Sections
   :local:
   :depth: 2

MT8365 System on Chip
=====================

=========  =======================
Hardware   MT8365
=========  =======================
CPU        4x CA53 2.0GHz
GPU        ARM Mali-G52
AI         APU (VPU)
=========  =======================

Please refer to the :doc:`MT8365 (Genio 350) ` page for detailed specifications.

APU
***

The APU includes a multi-core processor combined with intelligent control logic. It is 2X more power
efficient than a GPU and delivers class-leading edge-AI processing performance of up to 0.3T.

Overview
========

On |i350-EVK-REF-BOARD|, we provide ``tensorflow lite`` with hardware acceleration to develop and
deploy a wide range of machine learning applications. The following figure illustrates the machine
learning software stack:

.. figure:: /_asset/sw_rity_ml-guide_g350_sw_stack.svg
   :align: center

   Machine learning software stack on |i350-EVK-REF-BOARD|.

By using `TensorFlow Lite Delegates `_, you can enable hardware acceleration of TensorFlow Lite
models by leveraging on-device accelerators such as the GPU and the Digital Signal Processor (DSP).

|IOT-YOCTO| already integrates the following 3 delegates:

- `GPU delegate `_: The GPU delegate uses OpenGL ES compute shaders on the device to run inference
  on TensorFlow Lite models.
- `Arm NN delegate `_: Arm NN is a set of open-source software that enables machine learning
  workloads on Arm hardware devices. It provides a bridge between existing neural network
  frameworks and Cortex-A CPUs and Arm Mali GPUs.
- `NNAPI delegate `_: It provides acceleration for TensorFlow Lite models on Android devices with
  supported hardware accelerators. Google has since ported NNAPI from Android to ChromeOS
  (`NNAPI on ChromeOS `_), and |IOT-YOCTO| adapts this port to the platform.

.. note::

   - Currently, NNAPI on Linux supports only **one** HAL, which must be built in at compile time.
   - The HAL is a dynamically shared library named ``libvendor-nn-hal.so``.
   - By default, |IOT-YOCTO| uses the **XtensaANN** HAL, which drives the VPU from Cadence. You can
     find this setting in ``$BUILD_DIR/conf/local.conf``:

     .. code::

        ...
        PREFERRED_PROVIDER_virtual/libvendor-nn-hal:genio-350-evk = "xtensa-ann-bin"

.. note::

   Software information, command operations, and test results presented in this chapter are based
   on the latest version of IoT Yocto (v23.0) and |i350-EVK-REF-BOARD|.

--------------------------------

TensorFlow Lite and Delegates
=============================

IoT Yocto integrates ``tensorflow lite`` and the ``Arm NN delegate`` to provide neural network
acceleration.
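As a quick orientation before the version details below, the following sketch shows how one of
these external delegates is typically loaded from Python. It is only an illustration: it assumes
the ``tflite_runtime`` package is available, the delegate library, option keys, and model path
simply mirror the demo later in this chapter, and the random input is just a smoke test.

.. code:: python

   # Minimal sketch: load the Arm NN external delegate and run one dummy inference.
   # Paths and option keys mirror the demo commands in this chapter.
   import numpy as np
   import tflite_runtime.interpreter as tflite

   armnn_delegate = tflite.load_delegate(
       "/usr/lib/libarmnnDelegate.so.29",
       options={"backends": "GpuAcc,CpuAcc", "logging-severity": "info"},
   )

   interpreter = tflite.Interpreter(
       model_path="/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite",
       experimental_delegates=[armnn_delegate],
   )
   interpreter.allocate_tensors()

   # Feed random uint8 data shaped like the model input, just to confirm the delegate runs.
   inp = interpreter.get_input_details()[0]
   interpreter.set_tensor(inp["index"], np.random.randint(0, 256, size=inp["shape"], dtype=np.uint8))
   interpreter.invoke()
   print(interpreter.get_tensor(interpreter.get_output_details()[0]["index"]).shape)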
The software versions are as follows:

==========  =======  ==============================================
Component   Version  Supported Operations
==========  =======  ==============================================
`TFLite `_  2.14.0   `TFLite Ops `_
`Arm NN `_  24.02    `Arm NN TFLite Delegate Supported Operators `_
`NNAPI `_   1.3      `Android Neural Networks `_
==========  =======  ==============================================

.. note::

   According to the Arm NN `setup script `_, the Arm NN delegate unit tests are verified against a
   TensorFlow Lite build without XNNPACK support. In order to verify, through its unit tests, that
   the Arm NN delegate is properly integrated on IoT Yocto, IoT Yocto does not enable XNNPACK
   support for TensorFlow Lite by default. The following are the execution command and results for
   the Arm NN delegate unit tests; all tests should pass. If XNNPACK is enabled in TensorFlow Lite,
   the Arm NN delegate unit tests will fail.

.. prompt:: bash # auto

   # DelegateUnitTests
   ...
   ...
   ===============================================================================
   [doctest] test cases: 670 | 670 passed | 0 failed | 0 skipped
   [doctest] assertions: 53244 | 53244 passed | 0 failed |
   [doctest] Status: SUCCESS!
   Info: Shutdown time: 53.35 ms.

If you have to use TensorFlow Lite with XNNPACK, you can set ``tflite_with_xnnpack`` to true in
``src/meta-nn/recipes-tensorflow/tensorflow-lite/tensorflow-lite_%.bbappend`` and rebuild the
TensorFlow Lite package.

.. prompt:: text # auto

   CUSTOM_BAZEL_FLAGS += " --define tflite_with_xnnpack=true "

Supported Operations
********************

.. csv-table:: Supported Operations
   :class: longtable
   :file: /_asset/tables/ml-g350-op-latest-v23_1_0.csv
   :width: 50%
   :widths: 25 25 25 25

--------------------------------

.. _ml_demo-g350:

Demo
****

A Python demo application for image recognition is built into the image and can be found in the
``/usr/share/label_image`` directory. It is adapted from the upstream `label_image.py `__ example.

.. prompt:: bash # auto

   # cd /usr/share/label_image
   # ls -l
   -rw-r--r-- 1 root root 940650 Mar 9 2018 grace_hopper.bmp
   -rw-r--r-- 1 root root 61306 Mar 9 2018 grace_hopper.jpg
   -rw-r--r-- 1 root root 10479 Mar 9 2018 imagenet_slim_labels.txt
   -rw-r--r-- 1 root root 95746802 Mar 9 2018 inception_v3_2016_08_28_frozen.pb
   -rw-r--r-- 1 root root 4388 Mar 9 2018 label_image.py
   -rw-r--r-- 1 root root 10484 Mar 9 2018 labels_mobilenet_quant_v1_224.txt
   -rw-r--r-- 1 root root 4276352 Mar 9 2018 mobilenet_v1_1.0_224_quant.tflite
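For reference, the following is a simplified, interpreter-only sketch of what ``label_image.py``
does when no delegate is given. It is not the shipped script: it assumes the quantized MobileNet
model and label file listed above, and uses Pillow for image loading, as the upstream example does.

.. code:: python

   # Simplified label_image flow on the CPU (illustrative, not the shipped script).
   import numpy as np
   from PIL import Image
   import tflite_runtime.interpreter as tflite

   interpreter = tflite.Interpreter(model_path="mobilenet_v1_1.0_224_quant.tflite")
   interpreter.allocate_tensors()
   inp = interpreter.get_input_details()[0]
   out = interpreter.get_output_details()[0]

   # Resize the test image to the model input size and run one inference.
   height, width = inp["shape"][1], inp["shape"][2]
   image = Image.open("grace_hopper.jpg").resize((width, height))
   interpreter.set_tensor(inp["index"], np.expand_dims(np.asarray(image, dtype=np.uint8), 0))
   interpreter.invoke()

   # Print the top-5 labels for the quantized MobileNet output.
   scores = interpreter.get_tensor(out["index"])[0]
   labels = [line.strip() for line in open("labels_mobilenet_quant_v1_224.txt")]
   for i in scores.argsort()[-5:][::-1]:
       print(labels[i], scores[i])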
Basic commands for running the demo with different delegates are as follows.

- Execute on **CPU**

  .. prompt:: bash # auto

     # cd /usr/share/label_image
     # python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite

- Execute on **GPU**, with **GPU delegate**

  .. prompt:: bash # auto

     # cd /usr/share/label_image
     # python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/gpu_external_delegate.so

- Execute on **GPU**, with **Arm NN delegate**

  .. prompt:: bash # auto

     # cd /usr/share/label_image
     # python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/libarmnnDelegate.so.29 -o "backends:GpuAcc,CpuAcc"

- Execute on **VPU**, with **NNAPI delegate**

  .. prompt:: bash # auto

     # cd /usr/share/label_image
     # python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/nnapi_external_delegate.so

--------------------------------

.. _ml_benchmark-g350:

Benchmark Tool
**************

``benchmark_model`` is provided in `TensorFlow Performance Measurement `__ for performance
evaluation. Basic commands for running the benchmark tool with the CPU and different delegates are
as follows.

- Execute on **CPU** (4 threads)

  .. prompt:: bash # auto

     # benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --num_threads=4 --use_xnnpack=0 --num_runs=10

- Execute on **GPU**, with **GPU delegate**

  .. prompt:: bash # auto

     # benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --use_gpu=1 --gpu_precision_loss_allowed=1 --use_xnnpack=0 --num_runs=10

- Execute on **GPU**, with **Arm NN delegate**

  .. prompt:: bash # auto

     # benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --external_delegate_path=/usr/lib/libarmnnDelegate.so.29 --external_delegate_options="backends:GpuAcc,CpuAcc" --use_xnnpack=0 --num_runs=10

- Execute on **VPU**, with **NNAPI delegate**

  .. prompt:: bash # auto

     # benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --use_nnapi=1 --disable_nnapi_cpu=1 --use_xnnpack=0 --num_runs=10

.. _ml_benchmark-result-g350:

Benchmark Result
****************

The following table shows the benchmark results under :ref:`performance mode `.

.. csv-table:: Average inference time (ms)
   :class: longtable
   :file: /_asset/tables/ml-g350-benchamrk-latest-v23_1_0.csv
   :width: 50%
   :widths: 16 16 16 16 16 16
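To reproduce results like these on your own board, you can wrap the commands above in a small
script. The sketch below simply runs ``benchmark_model`` once per configuration and saves the raw
output for later inspection; it assumes ``benchmark_model`` is on the ``PATH``, as in the commands
above, and does not parse the timing output.

.. code:: python

   # Run benchmark_model once per configuration and store the raw output.
   # The flag sets mirror the benchmark commands above.
   import subprocess

   MODEL = "/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite"
   COMMON = ["--graph=" + MODEL, "--use_xnnpack=0", "--num_runs=10"]

   CONFIGS = {
       "cpu": ["--num_threads=4"],
       "gpu-delegate": ["--use_gpu=1", "--gpu_precision_loss_allowed=1"],
       "armnn-delegate": [
           "--external_delegate_path=/usr/lib/libarmnnDelegate.so.29",
           "--external_delegate_options=backends:GpuAcc,CpuAcc",
       ],
       "nnapi-delegate": ["--use_nnapi=1", "--disable_nnapi_cpu=1"],
   }

   for name, extra in CONFIGS.items():
       result = subprocess.run(["benchmark_model"] + COMMON + extra,
                               capture_output=True, text=True)
       with open(f"benchmark_{name}.log", "w") as log:
           log.write(result.stdout + result.stderr)
       print(name, "exit code:", result.returncode)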
--------------------------------

.. _ml_performance-mode-g350:

Performance Mode
================

Force the CPU, GPU, and APU (VPU) to run at maximum frequency. A combined sketch that applies all
of the settings below follows this list.

- **CPU at maximum frequency**

  Command to set the performance governor for the CPU:

  .. prompt:: bash # auto

     # echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor

- **Disable CPU idle**

  Command to disable CPU idle:

  .. prompt:: bash # auto

     # for j in 3 2 1 0; do for i in 3 2 1 0 ; do echo 1 > /sys/devices/system/cpu/cpu$i/cpuidle/state$j/disable ; done ; done

- **GPU at maximum frequency**

  Please refer to :ref:`Adjust GPU Frequency ` to fix the GPU at its maximum frequency.
  Alternatively, you can set the performance governor for the GPU, which statically keeps the GPU
  at the highest frequency.

  .. prompt:: bash # auto

     # echo performance > /sys/devices/platform/soc/13040000.mali/devfreq/13040000.mali/governor

- **APU at maximum frequency**

  Currently, the VPU always runs at maximum frequency.

- **Disable thermal**

  .. prompt:: bash # auto

     # echo disabled > /sys/class/thermal/thermal_zone0/mode
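The following is a convenience sketch, not part of IoT Yocto, that applies the sysfs settings from
this section in one pass. It only wraps the commands above; run it as root, and note that the Mali
devfreq path and the number of CPU cores and idle states are board-specific.

.. code:: python

   # Apply the performance-mode settings above in one pass (run as root).
   from pathlib import Path

   def write(path, value):
       """Write a value to a sysfs node, skipping nodes that do not exist."""
       node = Path(path)
       if node.exists():
           node.write_text(value)
       else:
           print("skip (not found):", path)

   # CPU governor to performance.
   write("/sys/devices/system/cpu/cpufreq/policy0/scaling_governor", "performance")

   # Disable all CPU idle states on the four cores.
   for cpu in range(4):
       for state in range(4):
           write(f"/sys/devices/system/cpu/cpu{cpu}/cpuidle/state{state}/disable", "1")

   # GPU devfreq governor to performance.
   write("/sys/devices/platform/soc/13040000.mali/devfreq/13040000.mali/governor", "performance")

   # Disable the thermal zone.
   write("/sys/class/thermal/thermal_zone0/mode", "disabled")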
--------------------------------

Troubleshooting
===============

Adjust Logging Severity Level for ArmNN delegate
*************************************************

You can set the logging severity level for the ArmNN delegate via the option key
``logging-severity`` when the delegate loads. The possible values of ``logging-severity`` are
``trace``, ``debug``, ``info``, ``warning``, ``error``, and ``fatal``. Taking the demo as an
example, add the option ``logging-severity:debug`` to enable debug logs.

.. prompt:: bash # auto

   # cd /usr/share/label_image
   # python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/libarmnnDelegate.so.29 -o "backends:GpuAcc,CpuAcc;logging-severity:debug"

Adjust Logging Severity Level for NNAPI delegate
*************************************************

NNAPI is adapted from ChromeOS (`NNAPI on ChromeOS `_); refer to its `Debugging `_ section to find
out how to adjust the logging severity level. Currently there are two separate logging methods to
assist in debugging.

- VLOG

  You can set the logging severity level for the NNAPI delegate through the environment variable
  ``DEBUG_NN_VLOG``. It must be set before NNAPI loads, as it is only read on startup.
  ``DEBUG_NN_VLOG`` is a list of tags, delimited by spaces, commas, or colons, indicating which
  logging is to be done. The tags are ``compilation``, ``cpuexe``, ``driver``, ``execution``,
  ``manager``, ``model``, and ``all``. Taking the demo as an example, set the environment variable
  ``DEBUG_NN_VLOG=compilation`` to enable the compilation log.

  .. prompt:: bash # auto

     # export DEBUG_NN_VLOG=compilation
     # python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/nnapi_external_delegate.so

- ANDROID_LOG_TAGS

  The ``ANDROID_LOG_TAGS`` environment variable can be set to filter log output. See the
  `Android Filtering log output `_ instructions for details on how to configure this environment
  variable. Taking the demo as an example, set the environment variable ``ANDROID_LOG_TAGS="*:d"``
  to enable debug-level logs.

  .. prompt:: bash # auto

     # export ANDROID_LOG_TAGS="*:d"
     # python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/nnapi_external_delegate.so

Determine What Operations are Executed by VP6
**********************************************

By enabling the compilation log, we can determine which operations are executed by the VP6. The
default name of the NNAPI HAL is ``cros-nnapi-default``. If you find a log line similar to
``ModelBuilder::findBestDeviceForEachOperation(CONV_2D) = 0 (cros-nnapi-default)``, it means the
operation (``CONV_2D``) is supported by the NNAPI HAL and can be executed by the VP6; otherwise,
the operation falls back to CPU execution.

.. note::

   Set the environment variable ``DEBUG_NN_VLOG`` to ``compilation`` before running the NN model.

- **OP is executed by VPU**

  .. prompt:: bash # auto

     # export DEBUG_NN_VLOG=compilation
     # python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/nnapi_external_delegate.so
     ...
     ExecutionPlan.cpp:2057] ModelBuilder::findBestDeviceForEachOperation(CONV_2D) = 0 (cros-nnapi-default)
     ...

- **OP falls back to CPU execution**

  .. prompt:: bash # auto

     # export DEBUG_NN_VLOG=compilation
     # python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/nnapi_external_delegate.so
     ...
     ExecutionPlan.cpp:2037] Device cros-nnapi-default can not do operation CONV_2D
     ExecutionPlan.cpp:2057] ModelBuilder::findBestDeviceForEachOperation(CONV_2D) = 1 (nnapi-reference)
     ...

Is It Possible to Run Floating Point Model?
********************************************

Yes. A floating point model can run on the CPU and the GPU if all of its operations are supported.

Is FP16 (Half Precision Floating Point) Supported on APU?
***********************************************************

The operations that the APU supports are the default QUANT8 implementations. Some operations may
also support an FP16 variant, such as:

- ANEURALNETWORKS_FULLY_CONNECTED
- ANEURALNETWORKS_CAST
- ANEURALNETWORKS_AXIS_ALIGNED_BBOX_TRANSFORM
- ANEURALNETWORKS_DETECTION_POSTPROCESSING
- ANEURALNETWORKS_GENERATE_PROPOSALS
- ANEURALNETWORKS_HEATMAP_MAX_KEYPOINT
- ANEURALNETWORKS_BOX_WITH_NMS_LIMIT
- ANEURALNETWORKS_LOG_SOFTMAX
- ANEURALNETWORKS_TRANSPOSE
- ANEURALNETWORKS_RESIZE_NEAREST_NEIGHBOR
- ANEURALNETWORKS_RSQRT
- ANEURALNETWORKS_SQRT
- ANEURALNETWORKS_DIV

Does IoT Yocto Provide OpenCV Support? If So, What Version of OpenCV Is Provided?
***********************************************************************************

IoT Yocto provides OpenCV, and it is provided as-is: the OpenCV Yocto integration comes directly
from OpenEmbedded. You can find the recipe in
``src/meta-openembedded/meta-oe/recipes-support/opencv/opencv_${version}.bb``. If necessary, you
can integrate another version of OpenCV by yourself.

Regarding Benchmark Results, Why Do Some Models Run Faster on the CPU Than on the GPU?
****************************************************************************************

Many factors can affect the efficiency of the GPU. GPU operations are asynchronous, so the CPU
might not be able to fill up the GPU cores in time. In addition, if some operations in the model
are not supported by the GPU, they fall back to the CPU for execution. In that case, it can be more
efficient to execute all operations on the CPU with multiple threads than to split the model into
many subgraphs and execute them on different backends.

Do You Have Information About Accuracy with ARM NN?
*****************************************************

ARM NN is provided as-is; you can find the recipe in
``src/meta-nn/recipes-armnn/armnn/armnn_${version}.bb``. We did not evaluate the accuracy of ARM
NN, but ARM NN provides a tool, `ModelAccuracyTool-Armnn `_, for measuring the Top-5 accuracy of a
model against an image dataset.

What `TFLite Quantization Method `_ Is Supported on APU?
**********************************************************

- About APU

  For post-training quantization, quantization-aware training, and post-training dynamic range
  quantization: the APU supports the default QUANT8 implementations. If all operations in the
  model are supported by the APU, the model can run on the APU (a minimal quantization sketch
  follows this list).

- About CPU

  The CPU path is implemented by TFLite, so these quantization methods are supported on the CPU.

- About GPU

  Please refer to the `ARM NN documentation `_ to check the restrictions on operations.
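For the APU point above, the following is a minimal sketch of full-integer post-training
quantization with the standard TensorFlow Lite converter, which produces the QUANT8 style of model
the APU expects. The Keras model and the random representative dataset are placeholders for your
own model and calibration data; whether the resulting model runs entirely on the APU still depends
on the operations it contains.

.. code:: python

   # Minimal full-integer post-training quantization sketch (placeholder model and data).
   import numpy as np
   import tensorflow as tf

   model = tf.keras.applications.MobileNet(weights=None)  # placeholder model

   def representative_dataset():
       # Yield a handful of calibration samples shaped like the model input.
       for _ in range(100):
           yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

   converter = tf.lite.TFLiteConverter.from_keras_model(model)
   converter.optimizations = [tf.lite.Optimize.DEFAULT]
   converter.representative_dataset = representative_dataset
   # Force integer-only kernels and 8-bit input/output tensors.
   converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
   converter.inference_input_type = tf.uint8
   converter.inference_output_type = tf.uint8

   with open("model_quant.tflite", "wb") as f:
       f.write(converter.convert())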
Is It Possible to Run Multiple Models Simultaneously on VP6?
**************************************************************

Currently the VP6 can only process one operation at a time and cannot handle multiple operations
in parallel, so you cannot run multiple models simultaneously on the VP6.

How to Develop Cadence VP6 Firmware
************************************

For Genio 350-EVK, the Cadence VP6 firmware is released as a binary. It is implemented by MediaTek
using the Cadence Xtensa SDK and toolchain, and the firmware source code is MediaTek proprietary.
If you would like to develop Cadence VP6 firmware, you have to prepare the following items:

- VP6 configuration file: you need to sign an NDA with MediaTek to request the VP6 configuration
  file.
- Cadence Xtensa toolchain: you need to contact Cadence directly to get the Xtensa toolchain and
  its development documents.

The figure below shows the APU (VP6) software stack:

.. figure:: /_asset/sw_rity_ml-guide_g350_vpu_sw_stack.svg
   :align: center

   APU (VP6) software stack on |i350-EVK-REF-BOARD|.

You can find the source code or binary of each module:

- CPU side:

  ==============================  ==============  ================================================================================================
  Module                          Release Policy  Repo
  ==============================  ==============  ================================================================================================
  NNAPI                           Source release  Source: https://chromium.googlesource.com/aosp/platform/frameworks/ml/+/refs/heads/master/nn/
  Xtensa ANN                      Binary release  Binary: https://gitlab.com/mediatek/aiot/nda-cadence/prebuilts
  `libAPU`                        Source release  Source: https://gitlab.com/mediatek/aiot/bsp/open-amp
  APU `Remoteproc` Kernel driver  Source release  Source: https://gitlab.com/mediatek/aiot/bsp/linux/-/blob/mtk-v5.15-dev/drivers/remoteproc/mtk_apu_rproc.c
  APU `Rpmsg` Kernel driver       Source release  Source: https://gitlab.com/mediatek/aiot/bsp/linux/-/blob/mtk-v5.15-dev/drivers/rpmsg/apu_rpmsg.c
  ==============================  ==============  ================================================================================================

- VP6 side:

  ========  ==============  ==============================================================
  Module    Release Policy  Repo
  ========  ==============  ==============================================================
  Firmware  Binary release  Binary: https://gitlab.com/mediatek/aiot/nda-cadence/prebuilts
  ========  ==============  ==============================================================
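Since the VP6 firmware is loaded through the APU `Remoteproc` kernel driver listed above, one quick
way to confirm that a firmware image is loaded is to inspect the generic remoteproc sysfs nodes.
The sketch below only reads standard ``/sys/class/remoteproc`` attributes; which instance, if any,
corresponds to the VP6 depends on the board's device tree, so treat it as a generic check rather
than a documented interface of this BSP.

.. code:: python

   # List remoteproc instances with their state and firmware name.
   # Which instance corresponds to the VP6 is board/device-tree specific.
   from pathlib import Path

   for rproc in sorted(Path("/sys/class/remoteproc").glob("remoteproc*")):
       name = (rproc / "name").read_text().strip() if (rproc / "name").exists() else "?"
       state = (rproc / "state").read_text().strip()
       firmware = (rproc / "firmware").read_text().strip()
       print(f"{rproc.name}: name={name} state={state} firmware={firmware}")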