.. include:: /keyword.rst
=============
ONNX Runtime
=============
Introduction
------------
`ONNX Runtime <https://onnxruntime.ai>`_ is a high-performance, cross-platform engine for scoring and training machine learning models in the ONNX (Open Neural Network Exchange) format. It accelerates inference and training for models from popular frameworks such as PyTorch, TensorFlow/Keras, and classical libraries like scikit-learn, LightGBM, and XGBoost. ONNX Runtime is compatible with various hardware, drivers, and operating systems, leveraging hardware accelerators and advanced graph optimizations for optimal performance.
ONNX Runtime on Genio
---------------------
ONNX models can be executed efficiently on Genio platforms using ONNX Runtime. We currently support ONNX Runtime v1.20.2 on both Kirkstone and Scarthgap Yocto versions. The ONNX Runtime recipes are included in the ``meta-mediatek-experimental`` layer.
To build ONNX Runtime packages as part of your Rity build, follow these steps:
1. Initialize the repo, using the branch that matches your Yocto release (Kirkstone or Scarthgap):

   .. code-block:: shell

      repo init -u https://gitlab.com/mediatek/aiot/bsp/manifest.git -b rity/kirkstone -m meta-mediatek-experimental.xml
      repo init -u https://gitlab.com/mediatek/aiot/bsp/manifest.git -b rity/scarthgap -m meta-mediatek-experimental.xml

2. Sync the repo:

   .. code-block:: shell

      repo sync -j 48

3. Add the ``meta-mediatek-experimental`` layer to your ``bblayers.conf``.
4. Add the ONNX Runtime packages to your ``local.conf`` (for example, by appending them to ``IMAGE_INSTALL``):

   .. code-block:: shell

      IMAGE_INSTALL:append = " onnxruntime onnxruntime-examples onnxruntime-staticdev "

5. If using Kirkstone, also add the following to your ``local.conf``:

   .. code-block:: shell

      PREFERRED_VERSION_cmake = "3.26.4"

6. Build your Rity image as usual:

   .. code-block:: shell

      bitbake rity-demo-image

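
Once the image is flashed and the board has booted, a quick check (assuming the ONNX Runtime Python bindings are installed in the image, as used in the examples below) confirms that the runtime is available:

.. code-block:: shell

   # On the Genio board: print the installed ONNX Runtime version and the available execution providers
   python3 -c "import onnxruntime as ort; print(ort.__version__, ort.get_available_providers())"
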
Executing ONNX Models on Genio
------------------------------
With ONNX Runtime support on Genio, executing ONNX models is straightforward. Below is a sample Python script to benchmark your ONNX models on CPU:

.. code-block:: python

   import onnxruntime as ort
   import numpy as np
   import time

   def load_model(model_path):
       """Load the ONNX model."""
       session_options = ort.SessionOptions()
       session_options.intra_op_num_threads = 4
       session_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
       session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
       execution_providers = ['XnnpackExecutionProvider', 'CPUExecutionProvider']
       session = ort.InferenceSession(model_path, sess_options=session_options, providers=execution_providers)
       return session

   def generate_sample_input(session):
       """Generate a sample input based on the model's input shape."""
       input_name = session.get_inputs()[0].name
       input_shape = session.get_inputs()[0].shape
       sample_input = np.random.random(input_shape).astype(np.float32)
       return {input_name: sample_input}

   def run_inference(session, input_data):
       """Run inference and measure time taken."""
       start_time = time.time()
       outputs = session.run(None, input_data)
       end_time = time.time()
       return outputs, end_time - start_time

   def benchmark_model(model_path, num_iterations=100):
       """Benchmark the ONNX model."""
       session = load_model(model_path)
       input_data = generate_sample_input(session)
       total_time = 0.0
       for _ in range(num_iterations):
           _, inference_time = run_inference(session, input_data)
           total_time += inference_time
       average_time = total_time / num_iterations
       print(f"Average inference time over {num_iterations} iterations: {average_time:.6f} seconds")

   if __name__ == "__main__":
       model_path = "path/to/your/model.onnx"
       benchmark_model(model_path)

Examples
--------
Below is an example of running the ``onnxruntime_test.py`` script with the EfficientNet-Lite4 ONNX model for image classification on Genio 700:

.. code-block:: shell

   root@genio-700-evk:~/onnxruntime_example# python3 onnxruntime_test.py -i kitten.jfif -l labels_map.txt -m /home/root/efficientnet-lite4/efficientnet-lite4.onnx
   0.88291174 281: 'tabby, tabby cat',
   0.093538865 285: 'Egyptian cat',
   0.023412632 282: 'tiger cat',
   3.1124982e-05 539: 'doormat, welcome mat',
   1.9749632e-05 287: 'lynx, catamount',
   time: 103.599ms

This example runs on a single CPU thread using the ``CPUExecutionProvider``.
ONNX Runtime Execution Providers
--------------------------------
ONNX Runtime supports various hardware acceleration libraries through its extensible Execution Providers (EP) framework, enabling optimal execution of ONNX models on different platforms. This flexibility allows developers to deploy ONNX models in diverse environments—cloud or edge—and optimize execution by leveraging the platform's compute capabilities.
On Genio, the following execution providers are supported and integrated by default:
- ``CPUExecutionProvider``
- ``XnnpackExecutionProvider``

Both providers execute on the CPU.
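
The execution providers for a session are selected when the session is created. The sketch below (the model path is a placeholder) shows how to list the providers available in the installed build and how to request XNNPACK with a CPU fallback; the ``intra_op_num_threads`` provider option passed to XNNPACK follows the upstream XNNPACK EP documentation and is shown here only as an illustrative setting.

.. code-block:: python

   import onnxruntime as ort

   # Execution providers compiled into this ONNX Runtime build
   print(ort.get_available_providers())

   # Prefer XNNPACK, fall back to the default CPU provider.
   # "model.onnx" is a placeholder path.
   session = ort.InferenceSession(
       "model.onnx",
       providers=[
           ("XnnpackExecutionProvider", {"intra_op_num_threads": 2}),
           "CPUExecutionProvider",
       ],
   )

   # Providers actually used by this session, in priority order
   print(session.get_providers())
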
Quantization Methods
--------------------
Quantization in ONNX Runtime refers to 8-bit linear quantization of ONNX models. For more information, see the official documentation: `ONNX Runtime Quantization `_
**Dynamic Quantization**
Dynamic quantization calculates quantization parameters (scale and zero point) for activations dynamically during inference, typically resulting in higher accuracy at the cost of increased inference time.

.. code-block:: python

   import onnx
   from onnxruntime.quantization import quantize_dynamic, QuantType

   model_fp32 = 'path/to/the/model.onnx'
   model_quant = 'path/to/the/model.quant.onnx'
   quantized_model = quantize_dynamic(model_fp32, model_quant)

**Static Quantization**
Static quantization uses calibration data to determine quantization parameters, which are then stored in the quantized model. There are two static quantization formats:
- **QOperator (Operator-oriented):** Quantized operators have their own ONNX definitions (e.g., QLinearConv, MatMulInteger).
- **QDQ (Tensor-oriented):** Inserts `QuantizeLinear` and `DeQuantizeLinear` operators between original operators to simulate quantization and dequantization.

.. code-block:: python

   import onnx
   import onnxruntime
   from onnxruntime.quantization import quantize_static, CalibrationDataReader, QuantType, QuantFormat
   import numpy as np
   import argparse

   class DummyCalibrationDataReader(CalibrationDataReader):
       def __init__(self, input_name, input_shape, num_samples=100):
           self.input_name = input_name
           self.data = [{self.input_name: np.random.rand(*input_shape).astype(np.float32)} for _ in range(num_samples)]
           self.iterator = iter(self.data)

       def get_next(self):
           return next(self.iterator, None)

   def get_input_details(model_path):
       model = onnx.load(model_path)
       graph = model.graph
       input_tensor = graph.input[0]
       input_name = input_tensor.name
       input_shape = [dim.dim_value for dim in input_tensor.type.tensor_type.shape.dim]
       return input_name, input_shape

   def quantize_model(fp32_model_path, int8_model_path):
       input_name, input_shape = get_input_details(fp32_model_path)
       calibration_data_reader = DummyCalibrationDataReader(input_name, input_shape)
       quantize_static(fp32_model_path, int8_model_path, calibration_data_reader, quant_format=QuantFormat.QOperator, weight_type=QuantType.QInt8)

   if __name__ == "__main__":
       parser = argparse.ArgumentParser(description="Quantize FP32 ONNX model to INT8 using dummy inputs")
       parser.add_argument("fp32_model_path", type=str, help="Path to the FP32 ONNX model")
       parser.add_argument("int8_model_path", type=str, help="Path to save the INT8 ONNX model")
       args = parser.parse_args()
       quantize_model(args.fp32_model_path, args.int8_model_path)
       print(f"Quantized model saved to {args.int8_model_path}")

Make Dynamic Shapes Static
--------------------------
Many ONNX models use dynamic shapes, which may not always be compatible with certain execution providers. In such cases, it is often necessary to convert dynamic shapes to static shapes. This conversion is typically performed on a host machine, after which the static ONNX model can be deployed to the target device for application use.
For more details, refer to the official documentation:
`ONNX Runtime: Make Dynamic Shape Fixed `_
**Making Symbolic Dimensions Fixed**
Consider a model (as visualized in Netron) with a symbolic dimension called ``batch`` for the batch size in ``input:0``. We want to update this symbolic dimension to a fixed value, such as 1.

.. figure:: /_asset/dynamic_shape_1.PNG
   :align: center

   Example model with symbolic dimension

To convert the symbolic dimension to a fixed value, run the following command:

.. code-block:: shell

   python3 -m onnxruntime.tools.make_dynamic_shape_fixed --dim_param batch --dim_value 1 model.onnx model.fixed.onnx

After this replacement, the shape for ``input:0`` will be fixed to ``[1, 36, 36, 3]``.

.. figure:: /_asset/dynamic_shape_2.PNG
   :align: center

   Example model with fixed dimension

**Making Input Shape Fixed**
Some models have unnamed dynamic dimensions, often represented as ``?`` in Netron (for example, in the input ``x``). Since these dimensions are unnamed, you can update the shape using the ``--input_shape`` option.

.. figure:: /_asset/dynamic_shape_3.PNG
   :align: center

   Example model with unnamed dynamic dimension

To fix these dynamic shapes, use the following command:

.. code-block:: shell

   python3 -m onnxruntime.tools.make_dynamic_shape_fixed --input_name x --input_shape 1,3,960,960 model.onnx model.fixed.onnx

After running this command, the shape for ``x`` will be fixed to ``[1, 3, 960, 960]``.

.. figure:: /_asset/dynamic_shape_4.PNG
   :align: center

   Example model with fixed dynamic dimension

ONNX TAO
--------
For more information about TAO, please refer to the official documentation:
`MediaTek Genio TAO Documentation `_
ORT-GENAI
---------
Genio supports ONNX Runtime GenAI packages, which are available as Python pip packages integrated into the Yocto build. Currently, this feature is in the experimental phase, and GenAI models are executed only on the CPU.
With the ONNX Runtime GenAI package, you can run models from the Phi family (text and vision) as well as DeepSeek on Genio. This enables the development of chatbots and experimentation with the conversational capabilities of small language models (SLMs). As of now, version 0.6 has been tested on Genio.
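
As an illustration, the sketch below shows a minimal text-generation loop with the ``onnxruntime_genai`` Python package. It follows the upstream onnxruntime-genai examples rather than a Genio-specific API; the model folder and prompt are placeholders, and exact API details may differ between releases.

.. code-block:: python

   import onnxruntime_genai as og

   # Placeholder: folder containing an exported GenAI model (e.g. a Phi model) and its genai_config.json
   model = og.Model("path/to/model_folder")
   tokenizer = og.Tokenizer(model)
   stream = tokenizer.create_stream()

   params = og.GeneratorParams(model)
   params.set_search_options(max_length=256)

   generator = og.Generator(model, params)
   generator.append_tokens(tokenizer.encode("What is ONNX Runtime?"))  # placeholder prompt

   # Generate one token at a time and print it as soon as it is decoded
   while not generator.is_done():
       generator.generate_next_token()
       print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
   print()
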
*Note: This section will be updated as more features and support become available.*