Performance Benchmarks

The IoT Yocto image includes dedicated benchmarking tools to evaluate AI inference performance on Genio platforms. These tools enable the comparison of different hardware backends—CPU, GPU, and NPU—to validate model compilation and deployment strategies.

  • Online Inference Benchmark: Based on the upstream benchmark_model utility from the TFLite project.

  • Offline Inference Benchmark: A MediaTek-provided Python application (benchmark.py) that executes compiled Deep Learning Archive (DLA) models through the Neuron runtime.

For detailed per-platform benchmark data, refer to the TFLite section of the Model Zoo.

Online Inference Performance

The image provides the benchmark_model utility for measuring online inference performance. The following examples demonstrate how to benchmark the CPU, GPU, and NPU backends.

Execute on CPU

The developer must specify the --num_threads argument based on the hardware capabilities of the target platform. The following example uses eight threads as a baseline; adjust this value to match the available CPU cores on the specific Genio SoC.

benchmark_model \
    --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite \
    --num_threads=8 \
    --num_runs=10
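
To choose a --num_threads value that matches the target, you can query the core count at run time and pass it straight to the benchmark. This is an optional convenience, not part of the stock example:

# Query the number of available CPU cores on the target
nproc

# Reuse the same benchmark command with the detected core count
benchmark_model \
    --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite \
    --num_threads=$(nproc) \
    --num_runs=10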

Execute on GPU (GPU delegate)

benchmark_model \
    --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite \
    --use_gpu=1 \
    --gpu_precision_loss_allowed=1 \
    --num_runs=10
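
Setting --gpu_precision_loss_allowed=1 lets the GPU delegate compute in reduced (FP16) precision. To see how much of the GPU result depends on this, you can optionally rerun the same benchmark with precision loss disabled; this variant is a suggested comparison, not part of the stock example:

benchmark_model \
    --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite \
    --use_gpu=1 \
    --gpu_precision_loss_allowed=0 \
    --num_runs=10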

Execute on NPU (Neuron Stable Delegate)

benchmark_model \
    --stable_delegate_settings_file=/usr/share/label_image/stable_delegate_settings.json \
    --use_nnapi=false \
    --use_xnnpack=false \
    --use_gpu=false \
    --min_secs=20 \
    --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite
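
The --stable_delegate_settings_file option points benchmark_model at a JSON description of the stable delegate to load; on this image it selects the Neuron delegate. The exact contents can vary between IoT Yocto releases, so inspect the file shipped on your build rather than relying on any particular layout:

# Review the Neuron stable delegate configuration used by the example above
cat /usr/share/label_image/stable_delegate_settings.json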

Offline Inference Performance

The IoT Yocto image pre-installs a Python benchmarking suite in the /usr/share/benchmark_dla directory.

cd /usr/share/benchmark_dla
ls -l
-rw-r--r-- 1 root root 26539112 Mar  9  2018 ResNet50V2_224_1.0_quant.tflite
-rw-r--r-- 1 root root     9020 Mar  9  2018 benchmark.py
-rw-r--r-- 1 root root 23942928 Mar  9  2018 inception_v3_quant.tflite
-rw-r--r-- 1 root root  3577760 Mar  9  2018 mobilenet_v2_1.0_224_quant.tflite
-rw-r--r-- 1 root root  6885840 Mar  9  2018 ssd_mobilenet_v1_coco_quantized.tflite

To initiate the automated benchmark, execute the script with the --auto flag. The script identifies all .tflite models in the directory, compiles them into .dla binaries, measures their inference speed on the NPU, and saves the results to /usr/share/benchmark_dla/benchmark.log.

cd /usr/share/benchmark_dla
# Run benchmark to evaluate inference time of each model in the current folder
python3 benchmark.py --auto
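
If the compiled .dla binaries are retained in the working directory after the run (whether they are kept depends on the benchmark.py version, so treat this as an assumption), you can confirm the offline compilation step by listing them:

# List any DLA binaries produced from the .tflite models
ls -l *.dla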

After the run completes, inspect the log file to review each model's inference time on the MDLA and VPU:

# Check inference time of each model
cat benchmark.log
[INFO] inception_v3_quant.tflite, mdla3.0, : 7.36
[INFO] inception_v3_quant.tflite, vpu, : 75.57
[INFO] mobilenet_v2_1.0_224_quant.tflite, mdla3.0, : 2.48
[INFO] mobilenet_v2_1.0_224_quant.tflite, vpu, : 18.58
[INFO] ssd_mobilenet_v1_coco_quantized.tflite, mdla3.0, : 3.28
[INFO] ssd_mobilenet_v1_coco_quantized.tflite, vpu, : 24.06
[INFO] ResNet50V2_224_1.0_quant.tflite, mdla3.0, : 7.37
[INFO] ResNet50V2_224_1.0_quant.tflite, vpu, : 61.09
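
Because every log line carries the model name and the compute device, ordinary shell tools are enough to slice the results. For example, to show only the MDLA measurements:

# Filter the log for results measured on the MDLA
grep mdla3.0 benchmark.log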

Note

The benchmark.py script performs AI inference by calling neuronrt through a Python API wrapper. The neuronrt component itself is implemented as a C/C++ program that uses the Neuron Runtime API. The Neuron Runtime API does not provide an official Python interface; the Python script in this image acts only as a convenience wrapper for benchmarking.
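
Assuming neuronrt is installed in the default search path on the image (an assumption here, since the exact install location can vary by build), you can confirm that the underlying runtime CLI is present:

# Verify that the Neuron runtime CLI wrapped by benchmark.py is available
which neuronrt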