.. include:: /keyword.rst

==========
|i350-EVK|
==========

.. contents:: Sections
   :local:
   :depth: 2

MT8365 System on Chip
=====================

========= =======================
Hardware  MT8365
========= =======================
CPU       4x CA53 2.0GHz
GPU       ARM Mali-G52
========= =======================

Please refer to the :doc:`MT8365 (Genio 350) ` to find detailed specifications.

Overview
========

On |i350-EVK-REF-BOARD|, we provide ``tensorflow lite`` with hardware acceleration to develop and deploy a wide range of machine learning applications. The following figure illustrates the machine learning software stack:

.. figure:: /_asset/sw_rity_ml-guide_g350_sw_stack.svg
   :align: center

   Machine learning software stack on |i350-EVK-REF-BOARD|.

By using `TensorFlow Lite Delegates `_, you can enable hardware acceleration of TFLite models by leveraging on-device accelerators such as the GPU and Digital Signal Processor (DSP).

|IOT-YOCTO| has already integrated the TFLite GPU delegate:

- `GPU delegate `_: The GPU delegate uses OpenGL ES compute shaders on the device to run inference on TFLite models.

--------------------------------

TensorFlow Lite and Delegates
=============================

IoT Yocto integrates ``tensorflow lite`` to provide GPU neural network acceleration. The software version is as follows:

============ ========= ======================
Component    Version   Supported Operations
============ ========= ======================
`TFLite `_   2.16.0    `TFLite Ops `_
============ ========= ======================

.. note::
   If you want to use TensorFlow Lite with XNNPACK, set ``tflite_with_xnnpack`` to true in the following file, then rebuild the TensorFlow Lite package:
   ``t/src/meta-nn/recipes-tensorflow/tensorflow-lite/tensorflow-lite_%.bbappend``

   .. prompt:: text # auto

      CUSTOM_BAZEL_FLAGS += " --define tflite_with_xnnpack=true "

Supported Operations
********************

.. csv-table:: Supported Operations
   :class: longtable
   :file: /_asset/tables/ml-g350-op-latest-v23_1_0.csv
   :width: 50%
   :widths: 25 25 25 25

--------------------------------

.. _ml_demo-g350:

Demo
****

A Python demo application for image recognition is built into the image and can be found in the ``/usr/share/label_image`` directory. It is adapted from the upstream `label_image.py `__.

.. prompt:: bash # auto

   # cd /usr/share/label_image
   # ls -l
   -rw-r--r-- 1 root root   940650 Mar  9  2018 grace_hopper.bmp
   -rw-r--r-- 1 root root    61306 Mar  9  2018 grace_hopper.jpg
   -rw-r--r-- 1 root root    10479 Mar  9  2018 imagenet_slim_labels.txt
   -rw-r--r-- 1 root root 95746802 Mar  9  2018 inception_v3_2016_08_28_frozen.pb
   -rw-r--r-- 1 root root     4388 Mar  9  2018 label_image.py
   -rw-r--r-- 1 root root    10484 Mar  9  2018 labels_mobilenet_quant_v1_224.txt
   -rw-r--r-- 1 root root  4276352 Mar  9  2018 mobilenet_v1_1.0_224_quant.tflite

Basic commands for running the demo with different delegates are as follows; a Python sketch of the same flow is shown after the commands.

- Execute on **CPU**

  .. prompt:: bash # auto

     # cd /usr/share/label_image
     # python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite

- Execute on **GPU**, with **GPU delegate**

  .. prompt:: bash # auto

     # cd /usr/share/label_image
     # python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/gpu_external_delegate.so
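The GPU delegate can also be loaded directly from the TFLite Python API. The following is a minimal sketch, assuming the ``tflite_runtime`` Python package and Pillow are available on the image; it loads ``/usr/lib/gpu_external_delegate.so`` as an external delegate and classifies the bundled test image, mirroring the GPU demo command above.

.. code-block:: python

   # Minimal sketch: run the bundled MobileNet model through the GPU external
   # delegate from Python. Assumes the tflite_runtime package and Pillow are
   # installed on the image; the file paths are the ones shipped in
   # /usr/share/label_image.
   import numpy as np
   from PIL import Image
   from tflite_runtime.interpreter import Interpreter, load_delegate

   # Load the GPU delegate installed by IoT Yocto as an external delegate library.
   gpu_delegate = load_delegate("/usr/lib/gpu_external_delegate.so")

   interpreter = Interpreter(
       model_path="/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite",
       experimental_delegates=[gpu_delegate],
   )
   interpreter.allocate_tensors()

   input_details = interpreter.get_input_details()[0]
   output_details = interpreter.get_output_details()[0]

   # Resize the test image to the model input shape (1, 224, 224, 3) and run it.
   height, width = input_details["shape"][1:3]
   image = Image.open("/usr/share/label_image/grace_hopper.jpg").resize((width, height))
   interpreter.set_tensor(input_details["index"],
                          np.expand_dims(np.asarray(image, dtype=np.uint8), 0))
   interpreter.invoke()

   # Print the best-scoring label from the quantized output tensor.
   scores = interpreter.get_tensor(output_details["index"])[0]
   labels = open("/usr/share/label_image/labels_mobilenet_quant_v1_224.txt").read().splitlines()
   print(labels[int(np.argmax(scores))])

Omitting ``experimental_delegates`` runs the same model entirely on the CPU, which is what the first demo command above does.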
--------------------------------

.. _ml_benchmark-g350:

Benchmark Tool
**************

``benchmark_model`` is provided in `TensorFlow Performance Measurement `__ for performance evaluation.

Basic commands for running the benchmark tool with the CPU and different delegates are as follows.

- Execute on **CPU** (4 threads)

  .. prompt:: bash # auto

     # benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --num_threads=4 --use_xnnpack=0 --num_runs=10

- Execute on **GPU**, with **GPU delegate**

  .. prompt:: bash # auto

     # benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --use_gpu=1 --gpu_precision_loss_allowed=1 --use_xnnpack=0 --num_runs=10

.. _ml_benchmark-result-g350:

Benchmark Result
****************

The following table shows the benchmark results under :ref:`performance mode `.

.. csv-table:: Average inference time (ms)
   :class: longtable
   :file: /_asset/tables/ml-g350-benchamrk-latest-v23_1_0.csv
   :width: 50%
   :widths: 16 16 16 16 16 16

--------------------------------

.. _ml_performance-mode-g350:

Performance Mode
================

Force the CPU and GPU to run at their maximum frequencies. A combined sketch of these settings is shown after the list.

- **CPU at maximum frequency**

  Set the CPU frequency governor to performance mode.

  .. prompt:: bash # auto

     # echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor

- **Disable CPU idle**

  Disable the CPU idle states.

  .. prompt:: bash # auto

     # for j in 3 2 1 0; do for i in 3 2 1 0 ; do echo 1 > /sys/devices/system/cpu/cpu$i/cpuidle/state$j/disable ; done ; done

- **GPU at maximum frequency**

  Please refer to :ref:`Adjust GPU Frequency ` to fix the GPU at its maximum frequency. Alternatively, you can set the ``performance`` governor for the GPU devfreq, which statically keeps the GPU at its highest frequency.

  .. prompt:: bash # auto

     # echo performance > /sys/devices/platform/soc/13040000.mali/devfreq/13040000.mali/governor

- **Disable thermal**

  .. prompt:: bash # auto

     # echo disabled > /sys/class/thermal/thermal_zone0/mode
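The same settings can also be applied from a script, for example before an automated benchmark run. The following is a minimal sketch under the assumption that it is executed as root on the board; the sysfs paths are the ones used in the commands above, and the ``write_sysfs`` helper is hypothetical.

.. code-block:: python

   # Minimal sketch (hypothetical helper, not part of the image): apply the
   # performance-mode settings listed above in one go. Must be run as root.
   from pathlib import Path

   def write_sysfs(path: str, value: str) -> None:
       """Write a value to a sysfs node, mirroring 'echo value > path'."""
       Path(path).write_text(value)

   # CPU frequency governor to performance.
   write_sysfs("/sys/devices/system/cpu/cpufreq/policy0/scaling_governor", "performance")

   # Disable every CPU idle state on all four cores.
   for cpu in range(4):
       for state in range(4):
           write_sysfs(f"/sys/devices/system/cpu/cpu{cpu}/cpuidle/state{state}/disable", "1")

   # GPU devfreq governor to performance.
   write_sysfs("/sys/devices/platform/soc/13040000.mali/devfreq/13040000.mali/governor", "performance")

   # Disable the thermal zone so thermal management does not lower the frequencies.
   write_sysfs("/sys/class/thermal/thermal_zone0/mode", "disabled")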
--------------------------------

Troubleshooting
===============

Is It Possible to Run a Floating Point Model?
*********************************************

Yes. A floating point model can run on the CPU and GPU if all of its operations are supported.

Is FP16 (Half Precision Floating Point) Supported on APU?
************************************************************

The operations supported by the APU default to the QUANT8 implementation; some operations may also support an FP16 variant, such as:

- ANEURALNETWORKS_FULLY_CONNECTED
- ANEURALNETWORKS_CAST
- ANEURALNETWORKS_AXIS_ALIGNED_BBOX_TRANSFORM
- ANEURALNETWORKS_DETECTION_POSTPROCESSING
- ANEURALNETWORKS_GENERATE_PROPOSALS
- ANEURALNETWORKS_HEATMAP_MAX_KEYPOINT
- ANEURALNETWORKS_BOX_WITH_NMS_LIMIT
- ANEURALNETWORKS_LOG_SOFTMAX
- ANEURALNETWORKS_TRANSPOSE
- ANEURALNETWORKS_RESIZE_NEAREST_NEIGHBOR
- ANEURALNETWORKS_RSQRT
- ANEURALNETWORKS_SQRT
- ANEURALNETWORKS_DIV

Does IoT Yocto Provide OpenCV Support? If So, What Version of OpenCV Is Provided?
************************************************************************************

Yes. OpenCV is provided as-is: its Yocto integration comes directly from OpenEmbedded. You can find the recipe in ``src/meta-openembedded/meta-oe/recipes-support/opencv/opencv_${version}.bb``. If necessary, you can integrate another version of OpenCV yourself.

Regarding Benchmark Results, Why Are Some Models Faster on the CPU Than on the GPU?
**************************************************************************************

Many factors can affect the efficiency of the GPU. GPU operations are asynchronous, so the CPU might not be able to keep the GPU cores busy. If some operations in the model are not supported by the GPU, they fall back to the CPU for execution. In that case, it can be more efficient to execute all operations on the CPU with multiple threads than to split the model into many subgraphs and execute them on different backends.
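To check which backend is faster for a particular model, you can time both configurations directly from Python. The following is a minimal sketch, assuming the ``tflite_runtime`` package is available on the image and using the MobileNet model shipped with the demo; ``benchmark_model`` above remains the more thorough tool.

.. code-block:: python

   # Minimal sketch: compare average inference time on the CPU vs. the GPU
   # delegate for the bundled MobileNet model. Assumes the tflite_runtime
   # package is installed on the image.
   import time
   import numpy as np
   from tflite_runtime.interpreter import Interpreter, load_delegate

   MODEL = "/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite"

   def average_ms(delegates, runs=10):
       """Return the average invoke() time in milliseconds for the given delegates."""
       interpreter = Interpreter(model_path=MODEL, experimental_delegates=delegates)
       interpreter.allocate_tensors()
       inp = interpreter.get_input_details()[0]
       interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
       interpreter.invoke()  # warm-up run, excluded from the measurement
       start = time.monotonic()
       for _ in range(runs):
           interpreter.invoke()
       return (time.monotonic() - start) * 1000.0 / runs

   print("CPU :", average_ms([]), "ms")
   print("GPU :", average_ms([load_delegate("/usr/lib/gpu_external_delegate.so")]), "ms")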