Qualcomm Hexagon Plugin Interface (QHPI)

Overview

QHPI provides a set of well-defined, strongly typed C APIs that enable operator writers to create and register operators with the QNN HTP Backend (BE). It replaces the legacy DEF_PACKAGE_OPT-based operator system in QNN HTP.

Key Features

  • API/ABI Compatibility

    Provides strong API and ABI compatibility for operator packages through versioned op package registration APIs and data structures. The versioned APIs enable customers to develop op packages using a given QHPI API version and continue to use those in future SDKs without any further changes if the customer does not need features from the newer QHPI API versions. A QHPI op package built with an older SDK is expected to continue working on newer SDKs without requiring recompilation.

  • Multi-threading Support

    Allows kernels to execute in parallel across multiple hardware threads, improving performance on NPUs.

  • Smooth Transition

    Coexists with legacy C++ DEF_PACKAGE_OPT-based packages to simplify migration. QHPI integrates seamlessly with existing QNN tools and the QNN HTP Backend operator package workflow, requiring minimal changes.

Key Terms

The rest of this document uses the following key terms:

  • Operator packages

    An operator package (also referred to as an op package) is a collection of one or more HTP operator implementations.

    • Built as a dynamically linked library.

    • Provided at graph preparation and execution time to supply custom operator implementations.

    • Must implement the QNN Operator Package Interface API as defined in QnnOpPackage.h.

    • Includes a well-defined entry point for the HTP backend to invoke during dynamic loading.

    • For more details, see Op Packages.

  • Kernel

    A specific C/C++ function invoked during execution. Kernels are typically associated with specific layouts and element types.

  • Operator

    A named node in a machine learning graph. Operators are generic with respect to layout, element type, and storage placement. At the end of graph preparation, every operator is associated with a specific kernel.

  • Precomputation

    The QHPI equivalent of the COMPILER_FOR macro: a function that can be invoked when a prepared graph is loaded for execution.

  • Multithreading

    Operators can be invoked multiple times with distinct slice identifiers, enabling parallel execution across different hardware threads of the same type.

Please refer to "<QNN_SDK_ROOT>/include/HTP/core/qhpi.h" for QHPI API definitions.

API/ABI Compatibility

API and ABI compatibility is one of the key capabilities that QHPI offers to operator packages. To accomplish this, all APIs and data structures defined in "<QNN_SDK_ROOT>/include/HTP/core/qhpi.h" are versioned to ensure compatibility. Any significant change to a data structure or function is handled by creating a new version of that data structure or function with a _vXXX suffix, where XXX is the version number.

The example below illustrates an updated QHPI operator data structure, QHPI_OpInfo_v1, which now references the latest QHPI kernel version, QHPI_Kernel_v1.

typedef struct {
   const char *name;
   uint32_t num_kernels;
   QHPI_Kernel_v1 *kernels;
   QHPI_RewriteOpFunc early_rewrite;
   QHPI_TileShapeRequired shape_required;
   QHPI_TileShapeLegalized shape_legalized;
   QHPI_BuildTileOfOp build_tile;
   QHPI_RewriteOpFunc late_rewrite;
} QHPI_OpInfo_v1;

Here is an example of a versioned function in the API, qhpi_register_ops_v1, introduced to register the latest operator data structure, QHPI_OpInfo_v1.

// register a collection of v1 operators
uint32_t qhpi_register_ops_v1(uint32_t num_ops, QHPI_OpInfo_v1 *operators, const char *package);

The versioned APIs and data structures will enable SDK users to develop op packages using a given SDK and continue using them on future SDKs without any further changes or recompilation.

Quick-Start Checklist for Migrating to QHPI

Follow these steps to migrate an existing legacy operator package to QHPI:

  1. Update XML Configuration

  • Set UseQHPI="true" in the OpDefCollection element.

  • Verify operator definitions in XML follow the OpDef schema.

  • Reference: QNN XML Op Def.

  2. Generate QHPI Skeleton

  • Use qnn-op-package-generator to create a skeletal QHPI implementation.

  • Command:

    <QNN_SDK_ROOT>/bin/x86_64-linux-clang/qnn-op-package-generator --config <xml_file> --output_path <output_dir>
    
  • Reference: Generating Op Packages.

  3. Implement Op Package Entry Point

  • Ensure the generated skeleton includes the mandatory function qhpi_init(), which is required for successful loading and registration of the QHPI operator package.

  • Register QHPI ops with QNN HTP BE using the appropriate versioned registration API. For example, the initial QHPI release provides the qhpi_register_ops_v1() API, which can be invoked from qhpi_init() to register operators with QNN HTP BE.

Note

qhpi_init() is the QHPI equivalent of the legacy macro INIT_PKG_CORE_INIT_FUNC. QHPI supports versioned registration APIs; please pick the appropriate API based on SDK and op/kernel/tensor properties.

  4. Implement Operator Logic

  • Replace legacy macros (e.g., REGISTER_OP) with:

    • QHPI_Kernel_vxxx structures for kernel definitions.

    • QHPI_Tensor_Signature_vxxx for input/output tensor properties.

    • QHPI_OpInfo_vxxx for operator-to-kernel mapping.

  • Ensure:

    • function_name is unique.

    • Attributes such as resources and source_destructive are examined and initialized appropriately.

    • Kernels are ordered appropriately, with the most preferred kernel listed first. By default, QHPI selects kernels in the order they appear in QHPI_OpInfo_vxxx; the op writer can override this using predicates (see Predicates for details).

  5. Handle Kernel Invocation

  • Implement kernel execution functions for:

    • Default execution.

    • Precomputation (optional).

  • See the discussion on Operator Implementations for details.

  6. Enable Advanced Features

  • Multithreading: Set the multithreaded flag and use the slice APIs if the kernel implementation can be parallelized across multiple hardware threads.

  • Source Destructive: Set the source_destructive flag if the first input and output can share memory.

  • Cost Function: Optional; used for predicting performance (not for kernel selection).

  7. Implement Rewrite Rules

  • Rewrite callbacks can be used to rewrite operators into a new subgraph containing other QHPI and QNN operators.

  • Replace op rewrite DEF_PACKAGE_OPTIMIZATION rules with optimizations implemented using the appropriate QHPI C API rewrite callbacks.

    • early_rewrite: Optional function, when specified is invoked early during graph compilation prior to op tiling.

    • late_rewrite: Optional function, when specified is invoked during graph compilation after op tiling.

  8. Implement Tiling Rules

  • Tiling callbacks allow the op writer to customize HTP’s choices on splitting a QHPI operator in a given graph.

  • Replace any DEF_PACKAGE_OPTIMIZATION rules using AUTOSPLIT and TILING with implementations using appropriate QHPI tiling callbacks.

    • shape_required: Optional function, when implemented can be used to enforce requirements on tiling dimensions.

    • shape_legalized: Optional function, when implemented can be used to adjust the HTP’s tile choice based on tiling heuristics.

    • build_tile: Optional function, when implemented can be used to create a tiled operator to compute a specified output slice.

  9. Validate and Test

  • Build the operator package using Makefiles generated in step 2.

  • Verify the operator package works with the HTP backend by following the workflow mentioned in the Custom op package tutorial. Note that the steps for building and executing a model using QNN HTP BE have not changed due to QHPI.

  10. Refer to the following sample legacy-to-QHPI ports in the SDK:

  • ${QNN_SDK_ROOT}/examples/QNN/OpPackage/HTP/QHPI/

  • ${QNN_SDK_ROOT}/examples/QNN/OpPackageGenerator/generated/HTP/

The following section provides additional details on these steps.

QHPI-Based Operator Package Details

A QNN operator package implementation consists of three components:

  • QNN Operator Package Skeletal Generation

  • Operator Implementation

  • Operator Registration

QHPI does not change the QNN op package interface definition. The QNN operator package interface remains the same whether you use legacy APIs or QHPI for operator implementations. However, the operator implementation and registration APIs must be updated to use QHPI.

The following section provides examples showing how to create a new QHPI-based operator package or migrate an existing legacy package to QHPI. Several QHPI-based op implementation examples can be found in the SDK at:

${QNN_SDK_ROOT}/examples/QNN/OpPackage/HTP/QHPI/

QNN Operator Package Skeletal Generation

QNN provides a streamlined way to create the required operator package interface implementation and supporting build files for generating QHPI-based operator packages using the qnn-op-package-generator tool.

To use the tool, package information and operators must be defined using the XML OpDef schema, as described in QNN XML Op Def.

To enable QHPI, update the XML configuration file by setting the UseQHPI="true" attribute in the OpDefCollection element, as shown below:

<?xml version="1.0" encoding="UTF-8"?>
<OpDefCollection
   PackageName="ExampleOpPackage"
   Domain="aisw"
   Version="1.0"
   UseQHPI="true">
   ...
</OpDefCollection>

Sample XML configurations can be found in Example XML Op Def Configs and in the SDK at:

${QNN_SDK_ROOT}/examples/QNN/OpPackageGenerator

Based on the input/output data types and parameters specified in the XML configuration file, the qnn-op-package-generator tool creates a QHPI-based skeletal implementation using definitions provided in the SDK. The kernel tiling, execution, and other functions are stubbed out in the generated skeleton and must be implemented by the developer.

The generated skeletal implementation also includes the mandatory entry-point function qhpi_init(), which is required for successful loading and registration of a QHPI operator package.

Note

The qhpi_init() function is the QHPI equivalent of the legacy macro INIT_PKG_CORE_INIT_FUNC.

Given an XML configuration file, a skeletal implementation for the op package can be generated using the following command:

<QNN_SDK_ROOT>/bin/x86_64-linux-clang/qnn-op-package-generator --config <xml_file> --output_path <output_dir>

Further details on op package skeleton generation can be found at Generating Op Packages.

Operator Implementations

The operator implementation is the primary area where QHPI APIs differ from the legacy C++ macro-based approach. The following sections provide details on QHPI operator implementation.

Operator Definition

In the legacy interface, the DEF_PACKAGE_OP macro and its variants declare kernels and associate them with operators. For example:

template <typename Ttype>
GraphStatus asin_opt(Ttype &out, const Ttype &in);

DEF_PACKAGE_OP(asin_opt<QuantUint16Tensor>, "Asin_16");
DEF_PACKAGE_OP(asin_opt<QuantUint16Tensor_TCM>, "Asin_16");
DEF_PACKAGE_OP(asin_opt<QUint16CroutonTensor>, "Asin_16");
DEF_PACKAGE_OP(asin_opt<QUint16CroutonTensor_TCM>, "Asin_16");

In this example, four kernels are associated with the operator "Asin_16". These kernels differ in layout and memory placement of the input tensor. The macro uses C++ templates to interpret type signatures and match kernels to tensor types.

In QHPI, kernels are declared using the static data structure QHPI_Kernel_vxxx, which defines kernel attributes such as the function name, resources, input/output signatures, and flags.

Example:

static QHPI_Kernel_v1 asin16_kernels[] = {{
   .function_name = THIS_PKG_NAME_STR "::" "asin_16_flat",
   .function = asin_16<QuantUint16Tensor>,
   .resources = QHPI_RESOURCE_HVX,
   .source_destructive = true,
   .min_inputs = 1,
   .input_signature = &sig_flat_16,
   .min_outputs = 1,
   .output_signature = &sig_flat_16,
}, ... };

QHPI_Tensor_Signature_vxxx

Captures tensor properties such as element type, layout, storage, and memory placement. The legacy equivalent is the DEF_TENSOR_PROPERTIES macro.

Example:

static QHPI_Tensor_Signature_v1 sig_flat_16 = {
   .element_type = QHPI_QUInt16,
   .layout = QHPI_Layout_Flat4,
   .storage = QHPI_Storage_Direct,
   .mem_placement = QHPI_MemLoc_DDR_OR_TCM,
};

QHPI_OpInfo_vxxx

Defines an operator and associates it with one or more kernels.

Example:

static QHPI_OpInfo_v1 ops[] = {{
   .name = THIS_PKG_NAME_STR "::" "Asin_16",
   .num_kernels = 2,
   .kernels = asin16_kernels,
}, ... };

Note

Operator names follow the convention PackageName::OperatorName.

Kernels are matched in the order they appear in the operator definition by default.
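To make the default in-order matching concrete, here is a small self-contained sketch. The mock types and the select_kernel helper are illustrative only; the real matching operates on QHPI_Tensor_Signature_vxxx structures inside the HTP backend.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Mock stand-ins for illustration; not the actual QHPI structures.
struct MockSignature { int element_type; int layout; };
struct MockKernel   { std::string name; MockSignature input; };

// Return the index of the first kernel whose input signature matches the
// actual tensor signature, mirroring the documented in-order default.
int select_kernel(const std::vector<MockKernel> &kernels, const MockSignature &actual) {
   for (size_t i = 0; i < kernels.size(); ++i) {
      if (kernels[i].input.element_type == actual.element_type &&
          kernels[i].input.layout == actual.layout)
         return static_cast<int>(i);
   }
   return -1; // no kernel accepts this signature
}
```

Because the scan stops at the first match, placing the most preferred kernel first in QHPI_OpInfo_vxxx is what makes it win.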

Kernel Implementation

Every QHPI kernel defined using QHPI_Kernel_vxxx must provide an execution function that implements the kernel's computation during inference.

Example kernel execution signature:

template <typename Ttype>
inline GraphStatus asin_opt(Ttype &out, const Ttype &in);

template<typename TensorType>
static uint32_t asin_16(QHPI_RuntimeHandle *,
                        uint32_t num_outputs, QHPI_Tensor **outputs,
                        uint32_t num_inputs, const QHPI_Tensor *const *inputs) {
   return asin_opt<TensorType>(*reinterpret_cast<TensorType *>(outputs[0]),
                              *reinterpret_cast<const TensorType *>(inputs[0]));
}

Precomputation

QHPI supports precomputation to optimize inference via additional function pointers in QHPI_Kernel_v1:

  • do_precomputation_function

    Called during graph load to initialize a data block. This API replaces the legacy COMPILER_FOR macro. This function has access to tensor info such as shape, block table, and quantization parameters. Any computation based on this information may be done and stored in the data block for use later during graph inference.

  • function_with_precomputed_data

    Called during inference with the runtime handle and the precomputed data block from do_precomputation_function. This is an alternative to a kernel’s default execution function specified in function.

    Example:

    static QHPI_Kernel_v1 kernels[] = {{
       .function_name = THIS_PKG_NAME_STR "::" "Asin_16",
       ...
       .precomputed_data_size = sizeof(Precompute),
       .do_precomputation_function = asin_do_precomputation,
       .function_with_precomputed_data = asin_use_precomputation,
    }};
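As a hedged illustration of the precompute/execute split, the sketch below builds a lookup table once and consumes it per inference. The function names, signatures, and Precompute layout are hypothetical simplifications; the actual prototypes in qhpi.h take runtime handles and QHPI tensors.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical data block, sized via precomputed_data_size in the kernel struct.
struct Precompute { float lut[256]; };

// Analogue of do_precomputation_function: runs once at graph load and derives
// a lookup table from an assumed per-tensor quantization step size.
void asin_do_precomputation_sketch(Precompute *blk, float stepsize) {
   for (int i = 0; i < 256; ++i)
      blk->lut[i] = std::asin(static_cast<float>(i) * stepsize);
}

// Analogue of function_with_precomputed_data: runs per inference and consumes
// the precomputed block instead of recomputing asin for every element.
void asin_use_precomputation_sketch(const Precompute *blk,
                                    const unsigned char *in, float *out, int n) {
   for (int i = 0; i < n; ++i)
      out[i] = blk->lut[in[i]];
}
```

The payoff is that the per-element transcendental moves from every inference into a one-time graph-load step.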
    

Multithreading

  • Enable by setting the multithreaded flag in QHPI_Kernel_v1.

  • Enables multi-threaded execution of a kernel across multiple hardware threads.

  • Access slice information for the current thread and the total number of slices via runtime functions:

uint32_t num_slices = qhpi_num_slices(fh);
uint32_t slice_number = qhpi_slice_number(fh);
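The per-slice work partitioning a kernel performs with these values can be sketched as follows. qhpi_num_slices and qhpi_slice_number are the real runtime calls; the range arithmetic below is an illustrative pattern, not SDK code.

```cpp
#include <cassert>
#include <cstdint>

struct Range { uint32_t begin, end; }; // half-open [begin, end)

// Divide `total` rows as evenly as possible across `num_slices`, giving one
// extra row to each of the first (total % num_slices) slices. Each hardware
// thread would call this with its own qhpi_slice_number(fh) result.
Range slice_range(uint32_t total, uint32_t num_slices, uint32_t slice_number) {
   uint32_t base  = total / num_slices;
   uint32_t extra = total % num_slices;
   uint32_t begin = slice_number * base +
                    (slice_number < extra ? slice_number : extra);
   uint32_t len   = base + (slice_number < extra ? 1u : 0u);
   return {begin, begin + len};
}
```

Since the ranges are disjoint and cover all rows, each slice can process its range without synchronizing with the others.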

Source Destructive

Specify source_destructive = true in QHPI_Kernel_v1 if the first input and output tensors can share memory. Such a kernel must ensure that it reads the input before writing the corresponding output location.

Note

This optimization is opportunistic and the kernel must be written to run correctly when the tensors do not share the same memory location.
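As an illustrative sketch (not SDK code), a kernel body satisfying this requirement reads each input element before overwriting the same index, so it behaves identically whether or not in and out alias:

```cpp
#include <cassert>

// In-order elementwise ReLU: safe when out aliases in (source destructive)
// and equally correct when the buffers are distinct.
void relu_inplace_safe(const float *in, float *out, int n) {
   for (int i = 0; i < n; ++i) {
      float v = in[i];              // read the input element first
      out[i] = v > 0.0f ? v : 0.0f; // then overwrite the same location
   }
}
```

A kernel that wrote ahead of its reads (e.g., writing out[i+1] before reading in[i+1]) would corrupt its own input in the aliased case and must not set source_destructive.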

Cost Function

In the legacy APIs, the cost functions influenced both kernel selection and execution time prediction. In QHPI, however, they are only used for predicting execution times, and not for kernel selection. Please see Predicates for more on kernel selection.

Example:

float cost_func(const uint32_t num_inputs, const QHPI_Tensor *const *inputs) {
   QHPI_Shape shape = qhpi_tensor_shape(inputs[0]);
   unsigned size = shape.dims[0] * shape.dims[1] * shape.dims[2] * shape.dims[3];
   return size * 0.2f + 10.0f;
}

Optimization Rules

QHPI replaces the DEF_OPT Domain Specific Language (DSL) with a simplified C API for graph rewrites. Op writers can implement the following optional rewrite callbacks, which run before and after tiling:

  • early_rewrite

    Invoked before any tiling is performed by the compiler. This function can rewrite the operator into a new subgraph of operators. Operators in this subgraph may be QHPI operators or standard QNN operators.

    Example:

    static const QHPI_Op *relu_to_relu_minmax_quant_rewrite(const QHPI_Op *op) {
       QHPI_OpRef input = qhpi_op_input(op, 0);
       QHPI_OutputDef input_output = qhpi_op_output(input.op, input.output_number);
    
       // Check if input is quantized type
       if (input_output.type != QHPI_QUInt8 && input_output.type != QHPI_QUInt16 &&
          input_output.type != QHPI_QInt8 && input_output.type != QHPI_QInt16) {
          return op;
       }
    
       // Create ReluMinMax with min=0.0f, max=INF
       QHPI_OpRef min_const = gen_const_scalar_f32(op, 0.0f);
       QHPI_OpRef max_const = gen_const_scalar_f32(op, INFINITY);
    
       QHPI_OpRef inputs[] = {input, min_const, max_const};
       QHPI_OutputDef output = qhpi_op_output(op, 0);
    
       return qhpi_op_create(op, THIS_PKG_NAME_STR "::ReluMinMax", 3, inputs, 1, &output);
    }
    
  • late_rewrite

    This allows the op package to rewrite operators after tiling into a new subgraph. In this case, the new subgraph should only contain plugin operators and a small set of additional operators such as Slice_shape, Concat, and Reshape. The late rewrite can also be used to introduce scratch space after tiling as unused outputs.

    Example:

    static const QHPI_Op *relu_late_rewrite(const QHPI_Op *op) {
       // Use late rewrite to add scratch
       if (qhpi_op_num_outputs(op) > 1)
          return op;
       QHPI_OutputDef outputs[2];
       outputs[0] = qhpi_op_output(op, 0);
       outputs[1] = {.type = QHPI_Int32,
                      .shape = {.rank = 4, .dims = {1, 1, 1, 32}}};
       QHPI_OpRef input = qhpi_op_input(op, 0);
       return qhpi_op_create(op, qhpi_op_name(op), 1, &input, 2, outputs);
    }
    

    The explicit phase ordering supported in DEF_OPT is replaced by simpler pre- and post-tiling callback functions.

Tiling

QHPI supports direct callbacks into a centralized tiling algorithm, which decides how to split operators into smaller versions. By hooking into the central tiler, an op package can enhance parallelism across functional units and minimize peak memory footprint, helping data remain in TCM and improving end-to-end latency. The central tiler also weighs the costs of over-decomposition, avoiding excessive inter-op communication and aligning chunking dimensions where possible to minimize concatenation and slicing. As a result, the central tiler chooses chunk sizes for every operator it processes.

It is strongly recommended that users opt in to these callbacks by (at a minimum) implementing a build_tile function.

During tiling, several optional functions may be defined to drive the tiler's choices on how to split a QHPI operator in the graph. These include:

  • shape_required

    Callback which is passed an instance of a plugin operator and returns a shape object that forces certain sizes on each tiling dimension. This function is optional; if omitted, no restrictions are placed at the start of tiling.

    Example:

    static QHPI_Shape relu_shape_required(const QHPI_Op *op) {
       // Define tiling requirements - split on height dimension
       static QHPI_Shape required = {
          .rank = 4,
          .dims = {1, RELU_TILE_HEIGHT, 0, RELU_CHANNEL_SPLIT_SIZE}
       };
       return required;
    }
    
  • shape_legalized

    Callback which is passed an instance of a plugin operator and a candidate tile shape. The function returns a “legal” tile shape after considering the initially proposed one from the central tiler’s heuristics. This supports scenarios with operator-specific requirements on the shape (e.g., some dimension must be a multiple of some value for good performance). This function is optional; if omitted, no restrictions are assumed beyond those imposed by a shape_required function, if one is provided.

    Example:

    static QHPI_Shape relu_shape_legalized(const QHPI_Op *op, const QHPI_Shape *proposed) {
       static QHPI_Shape legal = {
          .rank = 4,
          .dims = {1, 8, 0, 256}
       };
       ...
       return legal;
    }
    
  • build_tile

    Callback which is passed an instance of a plugin operator, a starting location, and an extent of the first output, and is expected to create a new instance of the operator that computes that particular output tile. The key aspect is determining the new inputs to this operator, which will typically be slices of the original operator’s inputs. This function is required if you would like your op split into smaller chunks; when omitted, the op is passed back without splitting.

    Example:

    static const QHPI_Op *relu_build_tile(const QHPI_Op *op,
                                        const QHPI_Shape *out_start,
                                        const QHPI_Shape *out_extent) {
       // Get input reference
       QHPI_OpRef input_ref = qhpi_op_input(op, 0);
    
       // For ReLU, input and output have same dimensions, so input slice = output slice
       QHPI_Shape in_start = *out_start;
       QHPI_Shape in_extent = *out_extent;
    
       // Create input slice
       QHPI_OpRef input_slice = qhpi_op_slice(input_ref, &in_start, &in_extent);
    
       // Build tiled operator with sliced input
       QHPI_OpRef inputs[] = {input_slice};
    
       QHPI_OutputDef outputs[] = {
          {.type = qhpi_op_output(op, 0).type,
             .quant_parameters = qhpi_op_output(op, 0).quant_parameters,
             .shape = *out_extent}
       };
    
       return qhpi_op_create(op, qhpi_op_name(op), 1, inputs, 1, outputs);
    }
    

These tiling callbacks can be invoked several times during prepare to perform chunk size evaluation and generate the new sub-operations.
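A toy sketch of that evaluation loop follows. The legalize callback stands in for a shape_legalized-style rule and the row budget for a TCM footprint constraint; the real tiler's heuristics and signatures differ.

```cpp
#include <cassert>
#include <cstdint>

using LegalizeFn = uint32_t (*)(uint32_t proposed_height);

// Hypothetical op-specific rule: tile heights must be multiples of 8.
uint32_t legalize_multiple_of_8(uint32_t proposed) {
   return (proposed + 7u) / 8u * 8u; // round up to the next multiple of 8
}

// Evaluate candidate heights and keep the largest legalized one that still
// fits within `budget` rows.
uint32_t choose_tile_height(uint32_t total_rows, uint32_t budget, LegalizeFn legalize) {
   uint32_t best = legalize(1); // smallest legal tile as the fallback
   for (uint32_t h = 1; h <= total_rows; ++h) {
      uint32_t legal = legalize(h);
      if (legal <= budget && legal > best)
         best = legal;
   }
   return best;
}
```

The repeated calls mirror why a tiling callback must be cheap and side-effect free: it may be consulted once per candidate chunk size during prepare.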

Predicates

By default, QHPI matches kernels based on tensor signatures, in the order the kernels were specified at op registration time. The op writer can further influence kernel matching by implementing an optional predicate callback function, which returns true to select the kernel or false to skip it in favor of the next one.

Example:

uint32_t asin_plugin_default_predicate(const QHPI_Op *op, const uint32_t num_inputs, const QHPI_Tensor *const *inputs)
{
   if (num_inputs == 0) {
      return 0u; // false
   }
   for (uint32_t i = 0; i < num_inputs; i++) {
      if (inputs[i] == nullptr) {
         return 0u; // false
      }
      QHPI_Shape shape = qhpi_tensor_shape(inputs[i]);
      for (uint32_t d = 0; d < shape.rank; d++) {
         if (shape.dims[d] == 0) {
            return 0u; // false
         }
      }
      QHPI_Quant_Parameters qp = qhpi_tensor_quant_parameters(inputs[i]);
      if (qp.stepsize == 0.0f) {
         return 0u; // false
      }
   }
   return 1u; // non-zero => "true"
}

Operator Registration

QHPI operators defined using QHPI_OpInfo_vxxx can be registered with the QNN HTP BE using the corresponding versioned registration API, qhpi_register_ops_vXXX, from the op package dynamic library entry point qhpi_init().

Example:

// OpInfo definitions
static QHPI_OpInfo_v1 ops[] = {
   {
      .name = THIS_PKG_NAME_STR "::Relu",
      .num_kernels = 2,
      .kernels = relu_kernels,
      .early_rewrite = relu_to_relu_minmax_quant,
      .shape_required = relu_shape_required,
      .build_tile = relu_build_tile,
   },
   // ...
};

// Registration function for regular ReLU operations
void register_relu_ops()
{
   qhpi_register_ops_v1(sizeof(ops) / sizeof(ops[0]), ops, THIS_PKG_NAME_STR);
}

extern "C" const char *qhpi_init()
{
   // Register the ops with HTP BE
   register_relu_ops();
   return THIS_PKG_NAME_STR;
}

The next step after creating and building a QHPI op package is to build and execute a model that uses operators implemented in the op package. The steps for building and executing a model using QNN HTP BE have not changed due to QHPI, and are outlined in Custom op package tutorial.