.. include:: /keyword.rst

============
|G1200-EVK|
============

.. contents:: Sections
   :local:
   :depth: 4

MT8395 System on Chip
=====================

=========  ==============================
Hardware   MT8395
=========  ==============================
CPU        4x CA78 2.2GHz, 4x CA55 2.0GHz
GPU        ARM G57
AI         APU 3.0 (2x MDLA 2.0, 2x VPU)
=========  ==============================

Please refer to the :doc:`MT8395 (Genio 1200) ` to find detailed specifications.

APU
***

The MediaTek AI Processing Unit (APU) is a high-performance hardware engine for deep learning, optimized for bandwidth and power efficiency. The APU architecture consists of big, small, and tiny cores. This highly heterogeneous design is suited for a wide variety of modern smartphone tasks, such as AI-camera, AI-assistant, and OS or in-app enhancements. The new APU 3.0 is a scalable AI architecture that offers 4 TOPS of computing power.

Overview
========

On |G1200-EVK-REF-BOARD|, we provide different software solutions to boost AI computing on the GPU and the APU.

GPU Neural Network Acceleration
*******************************

We provide ``TensorFlow Lite`` with hardware acceleration to develop and deploy a wide range of machine learning applications. By using `TensorFlow Lite Delegates `_, you can enable hardware acceleration of TFLite models by leveraging on-device accelerators such as the GPU and the Digital Signal Processor (DSP).

|IOT-YOCTO| has already integrated the following two delegates:

- `GPU delegate `_: The GPU delegate uses OpenGL ES compute shaders on the device to run inference on TFLite models.
- `Arm NN delegate `_: Arm NN is a set of open-source software that enables machine learning workloads on Arm hardware devices. It provides a bridge between existing neural network frameworks and Arm Cortex-A CPUs and Arm Mali GPUs.

APU Neural Network Acceleration
*******************************

We introduce the MediaTek-proprietary machine learning solution, **NeuroPilot**, on |IOT-YOCTO| for |G1200-EVK-REF-BOARD|. NeuroPilot is a collection of software tools and APIs that are at the center of MediaTek's AI ecosystem. With NeuroPilot, users can develop and deploy AI applications on edge devices with extremely high efficiency. This makes a wide variety of AI applications run faster, while also keeping data private.

On |G1200-EVK-REF-BOARD|, we support both the online inference path, the ``Neuron Stable Delegate``, and the offline inference path, the ``Neuron SDK``; both are part of the NeuroPilot software collection for APU acceleration. The Neuron Stable Delegate is a MediaTek Neuron delegate implemented against the stable delegate interface provided by TensorFlow Lite. The Neuron SDK provides a Neuron compiler (``ncc-tflite``) to convert TFLite models to MediaTek-proprietary binaries (DLA, Deep Learning Archive) for deployment on MediaTek platforms. The resulting models are highly efficient, with reduced latency and a smaller memory footprint. The Neuron SDK also provides the ``Neuron Run-time API``, a set of APIs that users can invoke from within a C/C++ program to create a run-time environment, parse a compiled model file, and perform on-device network inference.

.. figure:: /_asset/sw_rity_ml-guide_g1200_sw_stack.png
   :align: center
   :scale: 70%

   Machine learning software stack on |G1200-EVK-REF-BOARD|

.. note::
   Software information, command operations, and test results presented in this chapter are based on the latest version of IoT Yocto (v23.0) and |G1200-EVK-REF-BOARD|.
--------------------------------

Tensorflow Lite and Delegates
=============================

IoT Yocto integrates ``Tensorflow Lite`` and the ``Arm NN delegate`` to provide GPU neural network acceleration. The software versions are as follows:

==========  =========  ==============================================
Component   Version    Support Operations
==========  =========  ==============================================
`TFLite `_  2.14.0     `TFLite Ops `_
`Arm NN `_  24.02      `Arm NN TFLite Delegate Supported Operators `_
==========  =========  ==============================================

.. note::
   According to the Arm NN `setup script `_, the Arm NN delegate unit tests are verified against TensorFlow Lite built without XNNPACK support. To allow the Arm NN delegate integration on IoT Yocto to be verified through these unit tests, IoT Yocto does not enable XNNPACK support for TensorFlow Lite by default.

   The following are the execution commands and results for the Arm NN delegate unit tests. All tests should pass. If XNNPACK is enabled in TensorFlow Lite, the Arm NN delegate unit tests will fail.

   .. prompt:: bash # auto

      # DelegateUnitTests
      ...
      ...
      ===============================================================================
      [doctest] test cases:   670 |   670 passed | 0 failed | 0 skipped
      [doctest] assertions: 53244 | 53244 passed | 0 failed |
      [doctest] Status: SUCCESS!
      Info: Shutdown time: 53.35 ms.

If you have to use Tensorflow Lite with XNNPACK, you can set ``tflite_with_xnnpack`` to true in the following file: ``t/src/meta-nn/recipes-tensorflow/tensorflow-lite/tensorflow-lite_%.bbappend``, and rebuild the Tensorflow Lite package.

.. prompt:: text # auto

   CUSTOM_BAZEL_FLAGS += " --define tflite_with_xnnpack=true "
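After changing this flag, rebuild the package and redeploy it to the target. A minimal rebuild sketch, assuming a configured Yocto build environment where the recipe is named ``tensorflow-lite`` (matching the ``.bbappend`` above); the exact steps depend on your build setup:

.. prompt:: bash $ auto

   $ bitbake -c cleansstate tensorflow-lite
   $ bitbake tensorflow-lite

Rebuilding the full image afterwards ensures the updated library is installed on the target.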
Supported Operations
********************

.. csv-table:: Supported Operations
   :class: longtable
   :file: /_asset/tables/ml-g1200-op-latest-v23_1_0.csv
   :width: 50%
   :widths: 50 50

Demo
****

A Python demo application for image recognition is built into the image and can be found in the ``/usr/share/label_image`` directory. It is adapted from the upstream `label_image.py `__.

.. prompt:: bash # auto

   # cd /usr/share/label_image
   # ls -l
   -rw-r--r-- 1 root root   940650 Mar  9  2018 grace_hopper.bmp
   -rw-r--r-- 1 root root    61306 Mar  9  2018 grace_hopper.jpg
   -rw-r--r-- 1 root root    10479 Mar  9  2018 imagenet_slim_labels.txt
   -rw-r--r-- 1 root root 95746802 Mar  9  2018 inception_v3_2016_08_28_frozen.pb
   -rw-r--r-- 1 root root     4388 Mar  9  2018 label_image.py
   -rw-r--r-- 1 root root    10484 Mar  9  2018 labels_mobilenet_quant_v1_224.txt
   -rw-r--r-- 1 root root  4276352 Mar  9  2018 mobilenet_v1_1.0_224_quant.tflite

Basic commands for running the demo with different delegates are as follows.

- Execute on **CPU**

  .. prompt:: bash # auto

     # cd /usr/share/label_image
     # python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite

- Execute on **GPU**, with **GPU delegate**

  .. prompt:: bash # auto

     # cd /usr/share/label_image
     # python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/gpu_external_delegate.so

- Execute on **GPU**, with **Arm NN delegate**

  .. prompt:: bash # auto

     # cd /usr/share/label_image
     # python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/libarmnnDelegate.so.29 -o "backends:GpuAcc,CpuAcc"

.. note::
   As of the IoT Yocto v24.0 release, TensorFlow does not yet officially support Python bindings for the ``Stable Delegate``, so a Python demo for the Stable Delegate is not provided here.

--------------------------------

Benchmark Tool
**************

``benchmark_model`` is provided in `Tensorflow Performance Measurement `__ for performance evaluation.

Basic commands for running the benchmark tool with the CPU and different delegates are as follows.

- Execute on **CPU** (8 threads):

  .. prompt:: bash # auto

     # benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --num_threads=8 --num_runs=10

- Execute on **GPU**, with **GPU delegate**:

  .. prompt:: bash # auto

     # benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --use_gpu=1 --gpu_precision_loss_allowed=1 --num_runs=10

- Execute on **GPU**, with **Arm NN delegate**:

  .. prompt:: bash # auto

     # benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --external_delegate_path=/usr/lib/libarmnnDelegate.so.29 --external_delegate_options="backends:GpuAcc,CpuAcc" --num_runs=10

- Execute on **APU**, with **Neuron Delegate**:

  .. prompt:: bash # auto

     # benchmark_model --stable_delegate_settings_file=/usr/share/label_image/stable_delegate_settings.json --use_nnapi=false --use_xnnpack=false --use_gpu=false --min_secs=20 --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite

--------------------------------

Neuron SDK
==========

On |G1200-EVK-REF-BOARD|, |IOT-YOCTO| supports the ``Neuron SDK``, which is part of the MediaTek NeuroPilot software collection. The Neuron SDK provides the Neuron compiler (``ncc-tflite``) to convert TFLite models to MediaTek-proprietary binaries (DLA, Deep Learning Archive) for deployment on MediaTek platforms. The Neuron SDK also provides the ``Neuron Run-time API``, a set of APIs that users can invoke from within a C/C++ program to create a run-time environment, parse a compiled model file, and perform on-device network inference.

Please refer to the :doc:`Neuron SDK chapter ` for full details.

.. note::
   To enable Neuron SDK support on |G1200-EVK-REF-BOARD|, enable the ``apusys`` device tree overlay (``apusys.dtbo``) when flashing the image.

   .. prompt:: bash # auto

      # genio-flash --load-dtbo apusys.dtbo

Supported Operations
********************

Refer to :doc:`Supported Operations ` to find all the neural network operations supported by the Neuron SDK, and any restrictions placed on their use.

.. note::
   Different compute devices may have restrictions on supported operations. These restrictions are a function of:

   #. Op type
   #. Op parameters (e.g. kernel dimensions and modifiers, such as stride)
   #. Tensor dimensions (both input and output)
   #. SoC platform
   #. Numeric format, both data type and quantization method

   Each device has its own guidelines and restrictions.
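For reference, the offline flow can also be exercised manually: ``ncc-tflite`` compiles a TFLite model into a DLA file that the Neuron run-time then loads. A minimal sketch, assuming a quantized model in the current directory; the option names used here (for example the ``--arch`` target) are illustrative and may differ between Neuron SDK versions, so verify them with ``ncc-tflite --help``:

.. prompt:: bash # auto

   # # compile for the MDLA 2.0 device on MT8395 (illustrative options)
   # ncc-tflite --arch=mdla2.0 mobilenet_v1_1.0_224_quant.tflite -o mobilenet_v1_1.0_224_quant.dla

The demo and benchmark scripts below perform this compilation step automatically.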
Demo
****

A Python demo application for image recognition is built into the image and can be found in the ``/usr/share/demo_dla`` directory.

.. prompt:: bash # auto

   # cd /usr/share/demo_dla
   # ls -l
   -rw-r--r-- 1 root root   61306 Mar  9  2018 grace_hopper.jpg
   -rw-r--r-- 1 root root   10479 Mar  9  2018 imagenet_slim_labels.txt
   -rw-r--r-- 1 root root    1402 Mar  9  2018 label_image.py
   -rw-r--r-- 1 root root 4276352 Mar  9  2018 mobilenet_v1_1.0_224_quant.tflite

Use the command ``python3 label_image.py`` to run the demo. The demo program converts ``mobilenet_v1_1.0_224_quant.tflite`` into a DLA file, then runs inference on the APU to classify the image ``grace_hopper.jpg``. Finally, it prints the image classification result, which should be "military uniform".

.. prompt:: bash # auto

   # cd /usr/share/demo_dla
   # python3 label_image.py
   /usr/share/demo_dla/mobilenet_v1_1.0_224_quant.dla
   /usr/share/demo_dla/grace_hopper.bin
   WARNING: dlopen failed: libcmdl_ndk.mtk.vndk.so and libcmdl_ndk.mtk.so not found
   WARNING: CmdlLibManager cannot get dlopen handle.
   [apusys][info]apusysSession: Session(0xaaaaf9d4eb80): thd(runtime_api_sam) version(3) log(0)
   The required size of the input buffer is 150528
   The required size of the output buffer is 1001
   [apusys][info]run: Cmd v2(0xaaaaf9d737e0): run
   [apusys][info]run: Cmd v2(0xaaaaf9d737e0): run done(0)
   The top index is 653
   The image: military uniform

Benchmark Tool
**************

A Python application for benchmarking is built into the image and can be found in the ``/usr/share/benchmark_dla`` directory.

.. prompt:: bash # auto

   # cd /usr/share/benchmark_dla
   # ls -l
   -rw-r--r-- 1 root root 26539112 Mar  9  2018 ResNet50V2_224_1.0_quant.tflite
   -rw-r--r-- 1 root root     9020 Mar  9  2018 benchmark.py
   -rw-r--r-- 1 root root 23942928 Mar  9  2018 inception_v3_quant.tflite
   -rw-r--r-- 1 root root  3577760 Mar  9  2018 mobilenet_v2_1.0_224_quant.tflite
   -rw-r--r-- 1 root root  6885840 Mar  9  2018 ssd_mobilenet_v1_coco_quantized.tflite

Use the command ``python3 benchmark.py --auto`` to run the benchmark. It finds all TFLite models in ``/usr/share/benchmark_dla``, compiles them into DLA files, and then runs inference on the APU. Finally, the benchmark results are saved in ``/usr/share/benchmark_dla/benchmark.log``.

.. prompt:: bash # auto

   # cd /usr/share/benchmark_dla
   # # run benchmark to evaluate inference time of each model in the current folder
   # python3 benchmark.py --auto

.. prompt:: bash # auto

   # # check inference time of each model
   # cat benchmark.log
   [INFO] mobilenet_v2_1.0_224_quant.tflite, mdla2.0, avg inference time: 3.2
   [INFO] mobilenet_v2_1.0_224_quant.tflite, vpu, avg inference time: 14.1
   [INFO] ssd_mobilenet_v1_coco_quantized.tflite, mdla2.0, avg inference time: 4.2
   [INFO] ssd_mobilenet_v1_coco_quantized.tflite, vpu, avg inference time: 20.1
   [INFO] ResNet50V2_224_1.0_quant.tflite, mdla2.0, avg inference time: 8.2
   [INFO] ResNet50V2_224_1.0_quant.tflite, vpu, avg inference time: 56.3
   [INFO] inception_v3_quant.tflite, mdla2.0, avg inference time: 11.2
   [INFO] inception_v3_quant.tflite, vpu, avg inference time: 71.4

--------------------------------

Benchmark Result
================

The following table shows the benchmark results under :ref:`performance mode <ml_performance-mode-g1200>`.

.. csv-table:: Average inference time (ms)
   :class: longtable
   :file: /_asset/tables/ml-g1200-benchmark-latest-v24_0.csv
   :width: 50%
   :widths: 14 14 14 14 14 14 14 14

--------------------------------

.. _ml_performance-mode-g1200:

Performance Mode
================

Force the CPU, GPU, and APU to run at their maximum frequencies.
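Before applying the settings below, you can record the current configuration so it can be restored after benchmarking; these reads use the same sysfs nodes that the commands below write to:

.. prompt:: bash # auto

   # cat /sys/devices/system/cpu/cpufreq/policy*/scaling_governor
   # cat /sys/devices/platform/soc/13000000.mali/devfreq/13000000.mali/governor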
- **CPU at maximum frequency**

  Command to set performance mode for the CPU governor:

  .. prompt:: bash # auto

     # echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
     # echo performance > /sys/devices/system/cpu/cpufreq/policy4/scaling_governor

- **Disable CPU idle**

  Command to disable CPU idle:

  .. prompt:: bash # auto

     # for j in 2 1 0; do for i in 7 6 5 4 3 2 1 0 ; do echo 1 > /sys/devices/system/cpu/cpu$i/cpuidle/state$j/disable ; done ; done

- **GPU at maximum frequency**

  Please refer to :ref:`Adjust GPU Frequency ` to fix the GPU at its maximum frequency. Alternatively, you can set the performance mode for the GPU governor, which statically keeps the GPU at its highest frequency.

  .. prompt:: bash # auto

     # echo performance > /sys/devices/platform/soc/13000000.mali/devfreq/13000000.mali/governor

- **APU Performance Hints**

  The APU operates in performance mode by default, so no adjustments are necessary. To reduce performance, refer to the :ref:`QoS Tuning Flow ` and set lower ``qos.boostValue`` values.

  If you suspect a performance issue related to APUSYS frequency, use the following DebugFS node to force APUSYS to run at the highest operating points. Then, compare the actual model inference time for any differences:

  .. prompt:: bash # auto

     # echo dvfs_debug 0 > /sys/kernel/debug/apusys/power

  .. warning::
     Note that this is a debug feature and should not be used in production images. As a general guideline, kernel DebugFS should be disabled in production environments.

- **Disable thermal**

  .. prompt:: bash # auto

     # echo disabled > /sys/class/thermal/thermal_zone0/mode
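Once benchmarking is complete, restore the settings you changed. For example, thermal management can be re-enabled through the same sysfs node (a sketch; on standard Linux thermal zones the ``mode`` attribute accepts ``enabled``), and any governor values recorded earlier can be written back the same way:

.. prompt:: bash # auto

   # echo enabled > /sys/class/thermal/thermal_zone0/mode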