ONNX Development

This section describes the tools and methods for preparing, optimizing, and debugging ONNX models for Genio platforms.

Model Quantization

Quantization reduces model size and inference latency by representing weights and activations with 8-bit integers instead of 32-bit floats. For detailed principles, refer to the official ONNX Runtime Quantization guide.
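To illustrate the principle, the sketch below shows symmetric 8-bit linear quantization in NumPy. This is a simplified illustration, not the ONNX Runtime implementation, which also handles zero points, per-channel scales, and operator fusion:

```python
import numpy as np

def quantize_int8(x):
    # Symmetric linear quantization: q = round(x / scale), with scale = max|x| / 127
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    # Recover an approximation of the original float values
    return q.astype(np.float32) * scale

x = np.array([0.1, -0.5, 1.27], dtype=np.float32)
q, scale = quantize_int8(x)
print(q)                          # e.g. [ 10 -50 127]
print(dequantize_int8(q, scale))  # close to the original values
```

The quantization error is bounded by half the scale per element, which is why calibration (choosing good scales from representative data) matters for accuracy.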

Dynamic versus Static Quantization

  • Dynamic Quantization: Calculates quantization parameters for activations at runtime. It is simpler to set up because it needs no calibration data, but computing the parameters on every run adds latency.

  • Static Quantization: Uses calibration data to pre-calculate parameters. MediaTek recommends the QDQ (Quantize-Dequantize) format for optimal performance on Genio.

The following script performs static QDQ quantization:

from onnxruntime.quantization import (CalibrationDataReader, QuantFormat,
                                      QuantType, quantize_static)
import numpy as np

class CalibrationReader(CalibrationDataReader):
    def __init__(self, input_name="input", num_samples=16):
        # Replace the random tensors below with real calibration samples
        self.data = iter(
            [{input_name: np.random.rand(1, 3, 224, 224).astype(np.float32)}
             for _ in range(num_samples)])

    def get_next(self):
        # Return the next input feed dict, or None when calibration data is exhausted
        return next(self.data, None)

# Perform quantization to QDQ format
quantize_static("model_fp32.onnx", "model_int8.onnx", CalibrationReader(),
                quant_format=QuantFormat.QDQ, weight_type=QuantType.QInt8)

Advanced Quantization

  • INT4/UINT4 Quantization: Supported for MatMul and Gather operators via weight-only quantization. Use the matmul_4bits_quantizer module from onnxruntime.quantization for implementation.

  • Float16 (FP16) Conversion: Decreases model size by 50% with minimal accuracy loss.

import onnx
from onnxconverter_common import float16

model = onnx.load("model.onnx")
# keep_io_types=True preserves float32 graph inputs/outputs for easier integration
model_fp16 = float16.convert_float_to_float16(model, keep_io_types=True)
onnx.save(model_fp16, "model_fp16.onnx")

Making Dynamic Shapes Static

Genio NPU accelerators require static input shapes. The developer must convert symbolic dimensions (e.g., batch) or unnamed dynamic dimensions (?) to fixed values on the host PC before deployment.

Fixing Symbolic Dimensions

python3 -m onnxruntime.tools.make_dynamic_shape_fixed --dim_param batch --dim_value 1 model.onnx model.fixed.onnx

Fixing Unnamed Dimensions

python3 -m onnxruntime.tools.make_dynamic_shape_fixed --input_name x --input_shape 1,3,224,224 model.onnx model.fixed.onnx

Profiling and Debugging

Performance Profiling

The developer can enable latency profiling to generate a JSON trace file compatible with tools like Perfetto.

import onnxruntime as ort

options = ort.SessionOptions()
options.enable_profiling = True
session = ort.InferenceSession("model.onnx", sess_options=options)
trace_file = session.end_profiling()  # call after inference; returns the JSON trace path

Logging and Threading

  • Logging: Set log_severity_level = 0 for verbose output during development (severity runs from 0 = VERBOSE to 4 = FATAL; use 2 or higher in production).

  • Thread Management: Configure intra_op_num_threads to control CPU core utilization. For Genio platforms, the developer may adjust this based on the number of available physical cores.

Optimization Tools

  • Olive: MediaTek recommends using Olive, a hardware-aware optimization tool from Microsoft, for end-to-end compression and tuning.

  • TAO on Genio: Leverage NVIDIA TAO models optimized for NeuroPilot. Refer to TAO on Genio.

  • ORT GenAI: (Experimental) The ONNX Runtime GenAI package (onnxruntime-genai) supports small language models (SLMs) such as the Phi family on Genio CPUs.