Genio 350-EVK

MT8365 System On Chip

Hardware: MT8365
CPU: 4x Cortex-A53, 2.0 GHz
GPU: ARM Mali-G52
AI: APU (VPU)

Please refer to the MT8365 (Genio 350) page for detailed specifications.

APU

The APU includes a multi-core processor combined with intelligent control logic. It is 2x more power efficient than a GPU and delivers class-leading edge-AI processing performance of up to 0.3 TOPS.

Overview

On Genio 350-EVK, we provide TensorFlow Lite with hardware acceleration to develop and deploy a wide range of machine learning applications. The following figure illustrates the machine learning software stack:

Figure: Machine learning software stack on Genio 350-EVK.

By using TensorFlow Lite Delegates, you can enable hardware acceleration of TensorFlow Lite models by leveraging on-device accelerators such as the GPU and the digital signal processor (DSP). IoT Yocto already integrates the following three delegates (a minimal Python usage sketch follows this list):

  • GPU delegate: The GPU delegate uses OpenGL ES compute shaders on the device to run inference on TensorFlow Lite models.

  • Arm NN delegate: Arm NN is a set of open-source software that enables machine learning workloads on Arm hardware. It provides a bridge between existing neural network frameworks and Arm Cortex-A CPUs and Arm Mali GPUs.

  • NNAPI delegate: The NNAPI delegate provides acceleration for TensorFlow Lite models on devices with supported hardware accelerators. NNAPI originated on Android, and Google has since ported it to ChromeOS (NNAPI on ChromeOS); IoT Yocto adapts that port.
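
For reference, the sketch below shows how an external delegate can be loaded and used for inference from Python. It is a minimal illustration only: it assumes the tflite_runtime Python package is available on the image (the full tensorflow package exposes the same API under tf.lite), and it uses the GPU delegate library and demo model paths that appear later in this chapter.

    # Minimal sketch: load an external TFLite delegate and run one inference.
    # Paths below are the GPU delegate and demo model shipped on the image.
    import numpy as np
    import tflite_runtime.interpreter as tflite

    delegate = tflite.load_delegate("/usr/lib/gpu_external_delegate.so")

    interpreter = tflite.Interpreter(
        model_path="/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite",
        experimental_delegates=[delegate])
    interpreter.allocate_tensors()

    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]

    # Feed a dummy input with the model's expected shape and dtype.
    interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
    interpreter.invoke()
    print(interpreter.get_tensor(out["index"]).flatten()[:5])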

Note

  • Currently, NNAPI on Linux supports only one HAL, which must be chosen at build time.

  • The HAL is a dynamic shared library named libvendor-nn-hal.so.

  • By default, IoT Yocto uses the XtensaANN HAL, which drives the Cadence VPU. The selection can be found in $BUILD_DIR/conf/local.conf:

    ...
    PREFERRED_PROVIDER_virtual/libvendor-nn-hal:genio-350-evk = "xtensa-ann-bin"
    

Note

Software information, command operations, and test results presented in this chapter are based on the latest version of IoT Yocto (v23.0) running on Genio 350-EVK.


Tensorflow Lite and Delegates

IoT Yocto integrates TensorFlow Lite and the Arm NN delegate to provide neural network acceleration. The software versions are as follows:

Component | Version | Supported Operations
TFLite    | 2.10.0  | TFLite Ops
Arm NN    | 23.02   | Arm NN TFLite Delegate Supported Operators
NNAPI     | 1.3     | Android Neural Networks

Note

According to the Arm NN setup script, the Arm NN delegate unit tests are verified against a TensorFlow Lite build without XNNPACK support. So that these unit tests can verify that the Arm NN delegate is properly integrated on IoT Yocto, XNNPACK support for TensorFlow Lite is not enabled by default.

The following are the execution commands and results for the Arm NN delegate unit tests. All tests should pass. If XNNPACK is enabled in TensorFlow Lite, the Arm NN delegate unit tests will fail.

DelegateUnitTests
...
...
===============================================================================
[doctest] test cases:   670 |   670 passed | 0 failed | 0 skipped
[doctest] assertions: 53244 | 53244 passed | 0 failed |
[doctest] Status: SUCCESS!
Info: Shutdown time: 53.35 ms.

If you need to use TensorFlow Lite with XNNPACK, set tflite_with_xnnpack to true in src/meta-nn/recipes-tensorflow/tensorflow-lite/tensorflow-lite_%.bbappend and rebuild the TensorFlow Lite package.

CUSTOM_BAZEL_FLAGS += " --define tflite_with_xnnpack=true "

Supported Operations

TFLite 2.10.0 | ARMNN 23.02 | NNAPI 1.3 | Xtensa-ANN 1.3.1

abs

ABS

ANEURALNETWORKS_ABS

add

ADD

ANEURALNETWORKS_ADD

ANEURALNETWORKS_ADD

add_n

arg_max

ARGMAX

ANEURALNETWORKS_ARGMAX

arg_min

ARGMIN

ANEURALNETWORKS_ARGMIN

assign_variable

average_pool_2d

AVERAGE_POOL_2D

ANEURALNETWORKS_AVERAGE_POOL_2D

ANEURALNETWORKS_AVERAGE_POOL_2D

AVERAGE_POOL_3D

basic_lstm

batch_matmul

BATCH_MATMUL

batch_to_space_nd

BATCH_TO_SPACE_ND

ANEURALNETWORKS_BATCH_TO_SPACE_ND

bidirectional_sequence_lstm

broadcast_args

broadcast_to

bucketize

call_once

cast

CAST

ANEURALNETWORKS_CAST

ANEURALNETWORKS_CAST

ceil

complex_abs

concatenation

CONCATENATION

ANEURALNETWORKS_CONCATENATION

ANEURALNETWORKS_CONCATENATION

control_node

conv_2d

CONV_2D

ANEURALNETWORKS_CONV_2D

ANEURALNETWORKS_CONV_2D

conv_3d

CONV_3D

conv_3d_transpose

cos

cumsum

custom

custom_tf

densify

depth_to_space

DEPTH_TO_SPACE

ANEURALNETWORKS_DEPTH_TO_SPACE

ANEURALNETWORKS_DEPTH_TO_SPACE

depthwise_conv_2d

DEPTHWISE_CONV_2D

ANEURALNETWORKS_DEPTHWISE_CONV_2D

ANEURALNETWORKS_DEPTHWISE_CONV_2D

dequantize

DEQUANTIZE

ANEURALNETWORKS_DEQUANTIZE

div

DIV

ANEURALNETWORKS_DIV

ANEURALNETWORKS_DIV

dynamic_update_slice

elu

ELU

ANEURALNETWORKS_ELU

embedding_lookup

equal

EQUAL

ANEURALNETWORKS_EQUAL

exp

EXP

ANEURALNETWORKS_EXP

expand_dims

EXPAND_DIMS

ANEURALNETWORKS_EXPAND_DIMS

external_const

fake_quant

fill

FILL

ANEURALNETWORKS_FILL

floor

FLOOR

ANEURALNETWORKS_FLOOR

floor_div

FLOOR_DIV

floor_mod

fully_connected

FULLY_CONNECTED

ANEURALNETWORKS_FULLY_CONNECTED

ANEURALNETWORKS_FULLY_CONNECTED

gather

GATHER

ANEURALNETWORKS_GATHER

gather_nd

GATHER_ND

gelu

greater

GREATER

ANEURALNETWORKS_GREATER

greater_equal

GREATER_OR_EQUAL

ANEURALNETWORKS_GREATER_EQUAL

hard_swish

HARD_SWISH

ANEURALNETWORKS_HARD_SWISH

hashtable

hashtable_find

hashtable_import

hashtable_size

if

imag

l2_normalization

L2_NORMALIZATION

ANEURALNETWORKS_L2_NORMALIZATION

ANEURALNETWORKS_L2_NORMALIZATION

L2_POOL_2D

ANEURALNETWORKS_L2_POOL_2D

leaky_relu

less

LESS

ANEURALNETWORKS_LESS

less_equal

LESS_OR_EQUAL

ANEURALNETWORKS_LESS_EQUAL

local_response_normalization

LOCAL_RESPONSE_NORMALIZATION

ANEURALNETWORKS_LOCAL_RESPONSE_NORMALIZATION

log

LOG

ANEURALNETWORKS_LOG

log_softmax

LOG_SOFTMAX

ANEURALNETWORKS_LOG_SOFTMAX

ANEURALNETWORKS_LOG_SOFTMAX

logical_and

LOGICAL_AND

ANEURALNETWORKS_LOGICAL_AND

logical_not

LOGICAL_NOT

ANEURALNETWORKS_LOGICAL_NOT

logical_or

LOGICAL_OR

ANEURALNETWORKS_LOGICAL_OR

logistic

LOGISTIC

ANEURALNETWORKS_LOGISTIC

ANEURALNETWORKS_LOGISTIC

lstm

LSTM

ANEURALNETWORKS_LSTM

matrix_diag

matrix_set_diag

max_pool_2d

MAX_POOL_2D

ANEURALNETWORKS_MAX_POOL_2D

ANEURALNETWORKS_MAX_POOL_2D

MAX_POOL_3D

maximum

MAXIMUM

ANEURALNETWORKS_MAXIMUM

ANEURALNETWORKS_MAXIMUM

mean

MEAN

ANEURALNETWORKS_MEAN

minimum

MINIMUM

ANEURALNETWORKS_MINIMUM

ANEURALNETWORKS_MINIMUM

mirror_pad

MIRROR_PAD

mul

MUL

ANEURALNETWORKS_MUL

ANEURALNETWORKS_MUL

multinomial

neg

NEG

ANEURALNETWORKS_NEG

no_value

non_max_suppression_v4

non_max_suppression_v5

not_equal

NOT_EQUAL

ANEURALNETWORKS_NOT_EQUAL

NumericVerify

one_hot

pack

PACK

pad

PAD

ANEURALNETWORKS_PAD

padv2

ANEURALNETWORKS_PAD_V2

poly_call

pow

ANEURALNETWORKS_POW

prelu

PRELU

ANEURALNETWORKS_PRELU

ANEURALNETWORKS_PRELU

pseudo_const

pseudo_qconst

pseudo_sparse_const

pseudo_sparse_qconst

quantize

QUANTIZE

ANEURALNETWORKS_QUANTIZE

random_standard_normal

random_uniform

range

rank

RANK

ANEURALNETWORKS_RANK

read_variable

real

reduce_all

reduce_any

ANEURALNETWORKS_REDUCE_ANY

reduce_max

REDUCE_MAX

ANEURALNETWORKS_REDUCE_MAX

reduce_min

REDUCE_MIN

ANEURALNETWORKS_REDUCE_MIN

reduce_prod

REDUCE_PROD

ANEURALNETWORKS_REDUCE_PROD

relu

RELU

ANEURALNETWORKS_RELU

ANEURALNETWORKS_RELU

relu6

RELU6

ANEURALNETWORKS_RELU6

ANEURALNETWORKS_RELU6

relu_n1_to_1

RELU_N1_TO_1

ANEURALNETWORKS_RELU1

ANEURALNETWORKS_RELU1

reshape

RESHAPE

ANEURALNETWORKS_RESHAPE

ANEURALNETWORKS_RESHAPE

resize_bilinear

RESIZE_BILINEAR

ANEURALNETWORKS_RESIZE_BILINEAR

resize_nearest_neighbor

RESIZE_NEAREST_NEIGHBOR

ANEURALNETWORKS_RESIZE_NEAREST_NEIGHBOR

ANEURALNETWORKS_RESIZE_NEAREST_NEIGHBOR

reverse_sequence

reverse_v2

rfft2d

round

rsqrt

RSQRT

ANEURALNETWORKS_RSQRT

ANEURALNETWORKS_RSQRT

scatter_nd

segment_sum

select

ANEURALNETWORKS_SELECT

ANEURALNETWORKS_SELECT

select_v2

shape

SHAPE

sin

SIN

ANEURALNETWORKS_SIN

slice

ANEURALNETWORKS_SLICE

softmax

SOFTMAX

ANEURALNETWORKS_SOFTMAX

ANEURALNETWORKS_SOFTMAX

space_to_batch_nd

SPACE_TO_BATCH_ND

ANEURALNETWORKS_SPACE_TO_BATCH_ND

space_to_depth

SPACE_TO_DEPTH

ANEURALNETWORKS_SPACE_TO_DEPTH

ANEURALNETWORKS_SPACE_TO_DEPTH

sparse_to_dense

split

SPLIT

ANEURALNETWORKS_SPLIT

split_v

SPLIT_V

sqrt

SQRT

ANEURALNETWORKS_SQRT

ANEURALNETWORKS_SQRT

square

squared_difference

squeeze

SQUEEZE

ANEURALNETWORKS_SQUEEZE

strided_slice

STRIDED_SLICE

ANEURALNETWORKS_STRIDED_SLICE

sub

SUB

ANEURALNETWORKS_SUB

sum

SUM

svdf

ANEURALNETWORKS_SVDF

tanh

TANH

ANEURALNETWORKS_TANH

tile

ANEURALNETWORKS_TILE

topk_v2

ANEURALNETWORKS_TOPK_V2

ANEURALNETWORKS_TOPK_V2

transpose

TRANSPOSE

ANEURALNETWORKS_TRANSPOSE

ANEURALNETWORKS_TRANSPOSE

transpose_conv

TRANSPOSE_CONV

ANEURALNETWORKS_TRANSPOSE_CONV_2D

ANEURALNETWORKS_TRANSPOSE_CONV_2D

unidirectional_sequence_lstm

UNIDIRECTIONAL_SEQUENCE_LSTM

ANEURALNETWORKS_UNIDIRECTIONAL_SEQUENCE_LSTM

unidirectional_sequence_rnn

ANEURALNETWORKS_UNIDIRECTIONAL_SEQUENCE_RNN

unique

unpack

UNPACK

unsorted_segment_max

unsorted_segment_prod

unsorted_segment_sum

var_handle

where

while

ANEURALNETWORKS_WHILE

yield

zeros_like

ANEURALNETWORKS_HASHTABLE_LOOKUP

ANEURALNETWORKS_HASHTABLE_LOOKUP

ANEURALNETWORKS_EMBEDDING_LOOKUP

ANEURALNETWORKS_EMBEDDING_LOOKUP

ANEURALNETWORKS_LSH_PROJECTION

ANEURALNETWORKS_RNN

ANEURALNETWORKS_AXIS_ALIGNED_BBOX_TRANSFORM

ANEURALNETWORKS_AXIS_ALIGNED_BBOX_TRANSFORM

ANEURALNETWORKS_BOX_WITH_NMS_LIMIT

ANEURALNETWORKS_BOX_WITH_NMS_LIMIT

ANEURALNETWORKS_BIDIRECTIONAL_SEQUENCE_LSTM

ANEURALNETWORKS_BIDIRECTIONAL_SEQUENCE_RNN

ANEURALNETWORKS_CHANNEL_SHUFFLE

ANEURALNETWORKS_CHANNEL_SHUFFLE

ANEURALNETWORKS_DETECTION_POSTPROCESSING

ANEURALNETWORKS_DETECTION_POSTPROCESSING

ANEURALNETWORKS_GENERATE_PROPOSALS

ANEURALNETWORKS_GENERATE_PROPOSALS

ANEURALNETWORKS_GROUPED_CONV_2D

ANEURALNETWORKS_GROUPED_CONV_2D

ANEURALNETWORKS_HEATMAP_MAX_KEYPOINT

ANEURALNETWORKS_HEATMAP_MAX_KEYPOINT

ANEURALNETWORKS_INSTANCE_NORMALIZATION

ANEURALNETWORKS_QUANTIZED_16BIT_LSTM

ANEURALNETWORKS_QUANTIZED_16BIT_LSTM

ANEURALNETWORKS_QUANTIZED_LSTM

ANEURALNETWORKS_QUANTIZED_LSTM

ANEURALNETWORKS_RANDOM_MULTINOMIAL

ANEURALNETWORKS_REDUCE_ALL

ANEURALNETWORKS_REDUCE_SUM

ANEURALNETWORKS_ROI_ALIGN

ANEURALNETWORKS_ROI_POOLING

ANEURALNETWORKS_IF


Demo

A Python demo application for image recognition is built into the image and can be found in the /usr/share/label_image directory. It is adapted from the upstream label_image.py.

cd /usr/share/label_image
ls -l

-rw-r--r-- 1 root root   940650 Mar  9  2018 grace_hopper.bmp
-rw-r--r-- 1 root root    61306 Mar  9  2018 grace_hopper.jpg
-rw-r--r-- 1 root root    10479 Mar  9  2018 imagenet_slim_labels.txt
-rw-r--r-- 1 root root 95746802 Mar  9  2018 inception_v3_2016_08_28_frozen.pb
-rw-r--r-- 1 root root     4388 Mar  9  2018 label_image.py
-rw-r--r-- 1 root root    10484 Mar  9  2018 labels_mobilenet_quant_v1_224.txt
-rw-r--r-- 1 root root  4276352 Mar  9  2018 mobilenet_v1_1.0_224_quant.tflite

Basic commands for running the demo with different delegates are as follows; a Python equivalent is sketched after the list.

  • Execute on CPU

cd /usr/share/label_image
python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite

  • Execute on GPU, with GPU delegate

cd /usr/share/label_image
python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/gpu_external_delegate.so

  • Execute on GPU, with Arm NN delegate

cd /usr/share/label_image
python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/libarmnnDelegate.so.28 -o "backends:GpuAcc,CpuAcc"

  • Execute on VPU, with NNAPI delegate

cd /usr/share/label_image
python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/nnapi_external_delegate.so
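
These command-line runs can also be driven from Python. The short sketch below shows how the Arm NN delegate command above maps onto the TFLite Python API, with the -o option string expressed as an options dictionary; it assumes the tflite_runtime package described earlier.

    # Sketch: load the Arm NN delegate with the same backend options as the
    # "-e /usr/lib/libarmnnDelegate.so.28 -o backends:GpuAcc,CpuAcc" run above.
    import tflite_runtime.interpreter as tflite

    armnn_delegate = tflite.load_delegate(
        "/usr/lib/libarmnnDelegate.so.28",
        options={"backends": "GpuAcc,CpuAcc"})

    interpreter = tflite.Interpreter(
        model_path="/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite",
        experimental_delegates=[armnn_delegate])
    interpreter.allocate_tensors()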

Benchmark Tool

benchmark_model is provided in TensorFlow Performance Measurement for performance evaluation.

Basic commands for running the benchmark tool on the CPU and with the different delegates are as follows; a simple Python timing sketch follows the list.

  • Execute on CPU (4 threads)

benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --num_threads=4 --use_xnnpack=0 --num_runs=10

  • Execute on GPU, with GPU delegate

benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --use_gpu=1 --allow_fp16=0 --gpu_precision_loss_allowed=0 --use_xnnpack=0 --num_runs=10

  • Execute on GPU, with Arm NN delegate

benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --external_delegate_path=/usr/lib/libarmnnDelegate.so.28 --external_delegate_options="backends:GpuAcc,CpuAcc" --use_xnnpack=0 --num_runs=10

  • Execute on VPU, with NNAPI delegate

benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --use_nnapi=1 --disable_nnapi_cpu=1 --use_xnnpack=0 --num_runs=10
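
If benchmark_model is not convenient, a rough equivalent can be scripted in Python. The sketch below averages 10 timed runs after one untimed warm-up; it assumes the tflite_runtime package and the demo model path, and it approximates rather than reproduces benchmark_model's exact methodology.

    # Rough sketch: average CPU inference time over 10 runs (cf. --num_runs=10).
    import time
    import numpy as np
    import tflite_runtime.interpreter as tflite

    interpreter = tflite.Interpreter(
        model_path="/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite",
        num_threads=4)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))

    interpreter.invoke()  # warm-up run, not timed

    start = time.perf_counter()
    for _ in range(10):
        interpreter.invoke()
    print("average inference time: %.3f ms" % ((time.perf_counter() - start) / 10 * 1e3))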

Benchmark Result

The following table shows the benchmark results under performance mode. Values are the average inference time (ms) over 10 runs of each model (.tflite).

Model (.tflite)                 | CPU (4 threads) | GPU     | ARMNN (GpuAcc) | ARMNN (CpuAcc) | NNAPI: VPU
inception_v3                    | 710.991         | 640.675 | 492.128        | 400.797        | Not executed by VPU
inception_v3_quant              | 326.554         | 644.713 | 272.593        | 286.147        | 99.65
mobilenet_v2_1.0.224            | 54.559          | 49.993  | 54.016         | 53.24          | Not executed by VPU
mobilenet_v2_1.0.224_quant      | 29.235          | 51.367  | 35.322         | 34.735         | 21.322
ResNet50V2_224_1.0              | 476.963         | 415.715 | 364.339        | 268.352        | Not executed by VPU
ResNet50V2_224_1.0_quant        | 261.817         | 423.382 | 193.165        | 211.424        | 158.617
ssd_mobilenet_v1_coco           | 148.977         | 178.77  | 163.025        | 125.623        | Not executed by VPU
ssd_mobilenet_v1_coco_quantized | 73.39           | 182.983 | 83.417         | 74.805         | 31.805


Performance Mode

Force the CPU, GPU, and APU (VPU) to run at their maximum frequencies. The individual settings are listed below; a small script that applies all of them follows the list.

  • CPU at maximum frequency

    Command to set performance mode for CPU governor.

    echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
    
  • Disable CPU idle

    Command to disable CPU idle.

    for j in 3 2 1 0; do for i in 3 2 1 0 ; do echo 1 > /sys/devices/system/cpu/cpu$i/cpuidle/state$j/disable ; done ; done
    
  • GPU at maximum frequency

    Please refer to Adjust GPU Frequency to fix the GPU at its maximum frequency.

    Alternatively, you can simply set the GPU governor to performance, which statically pins the GPU at its highest frequency.

    echo performance > /sys/devices/platform/soc/13040000.mali/devfreq/13040000.mali/governor
    
  • APU at maximum frequency

    Currently, VPU is always running at maximum frequency.

  • Disable thermal

    echo disabled > /sys/class/thermal/thermal_zone0/mode
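
The settings above can also be applied in one step. Below is a hypothetical helper script that simply writes the same values to the same sysfs nodes listed above (run as root on Genio 350-EVK; adjust the paths if your kernel exposes different nodes).

    # Hypothetical helper: apply the performance-mode settings listed above by
    # writing the same values to the same sysfs nodes. Requires root.
    from pathlib import Path

    def sysfs_write(path, value):
        Path(path).write_text(value)

    # CPU governor to performance
    sysfs_write("/sys/devices/system/cpu/cpufreq/policy0/scaling_governor", "performance")

    # Disable all CPU idle states on the four cores
    for cpu in range(4):
        for state in range(4):
            sysfs_write(f"/sys/devices/system/cpu/cpu{cpu}/cpuidle/state{state}/disable", "1")

    # GPU governor to performance
    sysfs_write("/sys/devices/platform/soc/13040000.mali/devfreq/13040000.mali/governor", "performance")

    # Disable thermal management
    sysfs_write("/sys/class/thermal/thermal_zone0/mode", "disabled")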
    

Troubleshooting

Adjust Logging Severity Level for ARMNN delegate

You can set the logging severity level for the ARMNN delegate via the option key logging-severity when the delegate loads. The possible values of logging-severity are trace, debug, info, warning, error, and fatal.

Taking the demo as an example, add the option logging-severity:debug to enable the debug log.

cd /usr/share/label_image
python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/libarmnnDelegate.so.28 -o "backends:GpuAcc,CpuAcc;logging-severity:debug"

Adjust Logging Severity Level for NNAPI delegate

NNAPI is adapted from ChromeOS (NNAPI on ChromeOS); you can refer to its Debugging section to find out how to adjust the logging severity level.

Currently there are two separate logging methods to assist in debugging.

  • VLOG

    You can set the logging severity level for NNAPI delegate through the environment variable DEBUG_NN_VLOG. It must be set before NNAPI loads, as it is only read on startup. DEBUG_NN_VLOG is a list of tags, delimited by spaces, commas, or colons, indicating which logging is to be done. The tags are compilation, cpuexe, driver, execution, manager, model and all.

    Take the demo as an example: set the environment variable DEBUG_NN_VLOG=compilation to enable the compilation log.

    export DEBUG_NN_VLOG=compilation
    python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/nnapi_external_delegate.so
    
  • ANDROID_LOG_TAGS

    The ANDROID_LOG_TAGS environment variable can be set to filter log output. See the Android Filtering log output instructions for details on how to configure this environment variable for logging output.

    Take the demo as an example: set the environment variable ANDROID_LOG_TAGS="*:d" to enable the debug level log.

    export ANDROID_LOG_TAGS="*:d"
    python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/nnapi_external_delegate.so
    

Determine What Operations are Executed by VP6

By enabling the compilation log, we can determine what operations are executed by VP6.

The default name of the NNAPI HAL is cros-nnapi-default. If you find a log line similar to ModelBuilder::findBestDeviceForEachOperation(CONV_2D) = 0 (cros-nnapi-default), it means this operation (CONV_2D) is supported by the NNAPI HAL and can be executed by VP6; otherwise, the operation falls back to CPU execution.

Note

Set the environment variable DEBUG_NN_VLOG to compilation before running the NN model.

  • OP is executed by VPU

    export DEBUG_NN_VLOG=compilation
    python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/nnapi_external_delegate.so
    ...
    ExecutionPlan.cpp:2057] ModelBuilder::findBestDeviceForEachOperation(CONV_2D) = 0 (cros-nnapi-default)
    ...
    
  • OP falls back to CPU execution

    export DEBUG_NN_VLOG=compilation
    python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/nnapi_external_delegate.so
    ...
    ExecutionPlan.cpp:2037] Device cros-nnapi-default can not do operation CONV_2D
    ExecutionPlan.cpp:2057] ModelBuilder::findBestDeviceForEachOperation(CONV_2D) = 1 (nnapi-reference)
    ...
    

Is It Possible to Run a Floating Point Model?

Yes. A floating point model can run on the CPU and GPU if all of its operations are supported.

Is FP16 (Half Precision Floating Point) Supported on APU?

The operations supported by the APU default to the QUANT8 implementation; some operations may also support an FP16 variant, such as:

- ANEURALNETWORKS_FULLY_CONNECTED
- ANEURALNETWORKS_CAST
- ANEURALNETWORKS_AXIS_ALIGNED_BBOX_TRANSFORM
- ANEURALNETWORKS_DETECTION_POSTPROCESSING
- ANEURALNETWORKS_GENERATE_PROPOSALS
- ANEURALNETWORKS_HEATMAP_MAX_KEYPOINT
- ANEURALNETWORKS_BOX_WITH_NMS_LIMIT
- ANEURALNETWORKS_LOG_SOFTMAX
- ANEURALNETWORKS_TRANSPOSE
- ANEURALNETWORKS_RESIZE_NEAREST_NEIGHBOR
- ANEURALNETWORKS_RSQRT
- ANEURALNETWORKS_SQRT
- ANEURALNETWORKS_DIV

Does IoT Yocto Provide OpenCV Support? If So, What Version of OpenCV Is Provided?

IoT Yocto provides OpenCV as-is: the OpenCV Yocto integration comes directly from openembedded, and you can find the recipe in src/meta-openembedded/meta-oe/recipes-support/opencv/opencv_${version}.bb. If necessary, you can integrate another version of OpenCV yourself.

Regarding Benchmark Results, Why Do Some Models Run Faster on the CPU Than on the GPU?

Many factors can affect GPU efficiency. GPU operations are asynchronous, and the CPU might not be able to keep the GPU cores fully occupied. If some operations in the model are not supported by the GPU, they fall back to the CPU for execution; in that case, it may be more efficient to execute all operations on the CPU with multiple threads than to split the model into many subgraphs and execute them on different backends.

Do You Have Information About Accuracy with ARM NN?

ARM NN is provided as-is; you can find the recipe in src/meta-nn/recipes-armnn/armnn/armnn_${version}.bb. We did not evaluate the accuracy of ARM NN, but ARM NN provides a tool, ModelAccuracyTool-Armnn, for measuring the Top-5 accuracy of a model against an image dataset.

What TFLite Quantization Methods Are Supported on the APU?

  • About APU

    Post-training quantization, quantization-aware training, and post-training dynamic range quantization: the APU supports the default QUANT8 implementation, so if all operations in the model are supported by the APU, the model can run on the APU (a conversion sketch follows this list).

  • About CPU

    The CPU path is implemented by TFLite, and these quantization methods are all supported on the CPU.

  • About GPU

    Please refer to the ARM NN documentation to check the restrictions on supported operations.
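
For completeness, the sketch below shows one way to produce a QUANT8 model with full-integer post-training quantization using the TensorFlow 2.x converter. It is run on a development host with the tensorflow package; the saved-model path and input shape are placeholders for your own model.

    # Sketch: full-integer post-training quantization producing a QUANT8 .tflite
    # model of the kind the APU can execute when all of its operations are supported.
    import numpy as np
    import tensorflow as tf

    def representative_dataset():
        # Replace with ~100 real samples matching the model's input shape.
        for _ in range(100):
            yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

    converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")  # placeholder path
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.uint8
    converter.inference_output_type = tf.uint8

    with open("model_quant.tflite", "wb") as f:
        f.write(converter.convert())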

Is It Possible to Run Multiple Models Simultaneously on VP6?

Currently, the VP6 can only process one operation at a time and cannot handle multiple operations concurrently, so you cannot run multiple models simultaneously on the VP6.

How to Develop Cadence VP6 Firmware

For Genio 350-EVK, the Cadence VP6 firmware is released in binary form. It is implemented by MediaTek using the Cadence Xtensa SDK and toolchain, and the firmware source code is MediaTek proprietary. If you would like to develop Cadence VP6 firmware, you have to prepare the following items:

  • VP6 configuration file: You need to sign an NDA with MediaTek to request the VP6 configuration file.

  • Cadence Xtensa toolchains: You need to contact Cadence directly to obtain the Xtensa toolchains and development documentation.

The figure below shows the APU (VP6) software stack:

Figure: APU (VP6) software stack on Genio 350-EVK.

You can find the source code or binary of each module.