Qualcomm Hexagon Plugin Interface (QHPI)¶
Overview¶
QHPI provides a set of well-defined, strongly typed C APIs that enable operator writers to create and register operators with the QNN HTP Backend (BE). It replaces the legacy DEF_PACKAGE_OPT-based operator system in QNN HTP.
Key Features¶
API/ABI Compatibility
Provides strong API and ABI compatibility for operator packages through versioned op package registration APIs and data structures. The versioned APIs enable customers to develop op packages using a given QHPI API version and continue to use those in future SDKs without any further changes if the customer does not need features from the newer QHPI API versions. A QHPI op package built with an older SDK is expected to continue working on newer SDKs without requiring recompilation.
Multi-threading Support
Enables multi-threaded execution of kernels for improved performance on NPUs.
Smooth Transition
Coexists with legacy C++ DEF_PACKAGE_OPT-based packages to simplify migration. QHPI integrates seamlessly with existing QNN tools and the QNN HTP Backend operator package workflow, requiring minimal changes.
Key Terms¶
The rest of this document uses the following key terms:
Operator packages
An operator package (also referred to as an op package) is a collection of one or more HTP operator implementations.
Built as a dynamically linked library.
Provided at graph preparation and execution time to supply custom operator implementations.
Must implement the QNN Operator Package Interface API as defined in QnnOpPackage.h.
Includes a well-defined entry point for the HTP backend to invoke during dynamic loading.
For more details, see Op Packages.
Kernel
A specific C/C++ function invoked during execution. Kernels are typically associated with specific layouts and element types.
Operator
A named node in a machine learning graph. Operators are generic with respect to layout, element type, and storage placement. At the end of graph preparation, every operator is associated with a specific kernel.
Precomputation
The equivalent of the COMPILER_FOR macro, where a function can be invoked when a prepared graph is loaded for execution.
Multithreading
Operators can be invoked multiple times with distinct slice identifiers, enabling parallel execution across different hardware threads of the same type.
Please refer to "<QNN_SDK_ROOT>/include/HTP/core/qhpi.h" for QHPI API definitions.
API/ABI Compatibility¶
API and ABI compatibility is one of the key capabilities that QHPI offers to operator packages. To accomplish this, all APIs and data structures defined in "<QNN_SDK_ROOT>/include/HTP/core/qhpi.h" are versioned to ensure compatibility. Any significant change to a data structure or function is handled by creating a new version of that data structure or function, using a _vxxx suffix, where xxx is the version number.
The example below illustrates an updated QHPI operator data structure, QHPI_OpInfo_v1, which now references the latest QHPI kernel version, QHPI_Kernel_v1.
typedef struct {
const char *name;
uint32_t num_kernels;
QHPI_Kernel_v1 *kernels;
QHPI_RewriteOpFunc early_rewrite;
QHPI_TileShapeRequired shape_required;
QHPI_TileShapeLegalized shape_legalized;
QHPI_BuildTileOfOp build_tile;
QHPI_RewriteOpFunc late_rewrite;
} QHPI_OpInfo_v1;
Here is an example of a versioned function in the API, qhpi_register_ops_v1, introduced to register the latest operator data structure, QHPI_OpInfo_v1.
// register a collection of v1 operators
uint32_t qhpi_register_ops_v1(uint32_t num_ops, QHPI_OpInfo_v1 *operators, const char *package);
The versioned APIs and data structures will enable SDK users to develop op packages using a given SDK and continue using them on future SDKs without any further changes or recompilation.
Quick-Start Checklist for Migrating to QHPI¶
Follow these steps to migrate an existing legacy operator package to QHPI:
Update XML Configuration
Set UseQHPI="true" in the OpDefCollection element.
Verify that operator definitions in the XML follow the OpDef schema.
Reference: QNN XML Op Def.
Generate QHPI Skeleton
Use qnn-op-package-generator to create a skeletal QHPI implementation.
Command:
<QNN_SDK_ROOT>/bin/x86_64-linux-clang/qnn-op-package-generator --config <xml_file> --output_path <output_dir>
Reference: Generating Op Packages.
Implement Op Package Entry Point
Ensure the generated skeleton includes the mandatory function qhpi_init(), which is required for successful loading and registration of the QHPI operator package.
Register QHPI ops with the QNN HTP BE using the appropriate versioned registration API. For example, the initial QHPI release provides the qhpi_register_ops_v1() API, which can be invoked from qhpi_init() to register operators with the QNN HTP BE.
Note
qhpi_init() is the QHPI equivalent of the legacy macro INIT_PKG_CORE_INIT_FUNC.
QHPI supports versioned registration APIs; please pick the appropriate API based on SDK and op/kernel/tensor properties.
Implement Operator Logic
Replace legacy macros (e.g., REGISTER_OP) with:
QHPI_Kernel_vxxx structures for kernel definitions.
QHPI_Tensor_Signature_vxxx for input/output tensor properties.
QHPI_OpInfo_vxxx for operator-to-kernel mapping.
Ensure:
function_name is unique.
Attributes such as resources and source_destructive are examined and initialized appropriately.
Kernels are ordered appropriately, with the most preferred kernel listed first. By default, kernels are selected by QHPI in the order they appear in QHPI_OpInfo_vxxx; this can, however, be overridden by the op writer using Predicates. Refer to Predicates for more on this.
Handle Kernel Invocation
Implement kernel execution functions for:
Default execution.
Precomputation (optional).
See the discussion on Operator Implementations for details.
Enable Advanced Features
Multithreading: Set the multithreaded flag and use the slice APIs if the kernel implementation can be parallelized across multiple hardware threads.
Source Destructive: Set the source_destructive flag if the first input/output can share memory.
Cost Function: Optional; used for predicting performance (not for kernel selection).
Implement Rewrite Rules
Rewrite callbacks can be used to rewrite operators into a new subgraph containing other QHPI and QNN operators.
Replace op rewrite DEF_PACKAGE_OPTIMIZATION rules with optimizations implemented using the appropriate QHPI C API rewrite callbacks.
early_rewrite: Optional function; when specified, it is invoked early during graph compilation, prior to op tiling.
late_rewrite: Optional function; when specified, it is invoked during graph compilation after op tiling.
Implement Tiling Rules
Tiling callbacks allow the op writer to customize HTP’s choices on splitting a QHPI operator in a given graph.
Replace any DEF_PACKAGE_OPTIMIZATION rules using AUTOSPLIT and TILING with implementations using the appropriate QHPI tiling callbacks.
shape_required: Optional function; when implemented, it can be used to enforce requirements on tiling dimensions.
shape_legalized: Optional function; when implemented, it can be used to adjust the HTP's tile choice based on tiling heuristics.
build_tile: Optional function; when implemented, it can be used to create a tiled operator that computes a specified output slice.
Validate and Test
Build the operator package using Makefiles generated in step 2.
Verify the operator package works with the HTP backend by following the workflow mentioned in the Custom op package tutorial. Note that the steps for building and executing a model using QNN HTP BE have not changed due to QHPI.
Please refer to the following sample legacy-to-QHPI ports in the SDK:
${QNN_SDK_ROOT}/examples/QNN/OpPackage/HTP/QHPI/
${QNN_SDK_ROOT}/examples/QNN/OpPackageGenerator/generated/HTP/
The following section provides additional details on these steps.
QHPI-Based Operator Package Details¶
A QNN operator package implementation consists of three components:
QNN Operator Package Skeletal Generation
Operator Implementation
Operator Registration
QHPI does not change the QNN op package interface definition. The QNN operator package interface remains the same whether you use legacy APIs or QHPI for operator implementations. However, the operator implementation and registration APIs must be updated to use QHPI.
The following section provides examples showing how to create a new QHPI-based operator package or migrate an existing legacy package to QHPI. Several QHPI-based op implementation examples can be found in the SDK at:
${QNN_SDK_ROOT}/examples/QNN/OpPackage/HTP/QHPI/
QNN Operator Package Skeletal Generation¶
QNN provides a streamlined way to create the required operator package interface implementation and supporting build files for generating QHPI-based operator packages using the qnn-op-package-generator tool.
To use the tool, package information and operators must be defined using the XML OpDef schema, as described in QNN XML Op Def.
To enable QHPI, update the XML configuration file by setting the UseQHPI="true" attribute in the OpDefCollection element, as shown below:
<?xml version="1.0" encoding="UTF-8"?>
<OpDefCollection
PackageName="ExampleOpPackage"
Domain="aisw"
Version="1.0"
UseQHPI="true">
...
</OpDefCollection>
Sample XML configurations can be found in Example XML Op Def Configs and in the SDK at:
${QNN_SDK_ROOT}/examples/QNN/OpPackageGenerator
Based on the input/output data types and parameters specified in the XML configuration file, the qnn-op-package-generator tool creates a QHPI-based skeletal implementation using definitions provided in the SDK. The kernel tiling, execution, and other functions are stubbed out in the generated skeleton and must be implemented by the developer.
The generated skeletal implementation also includes the mandatory entry-point function qhpi_init(), which is required for successful loading and registration of a QHPI operator package.
Note
The qhpi_init() function is the QHPI equivalent of the legacy macro INIT_PKG_CORE_INIT_FUNC.
Given an XML configuration file, a skeletal implementation for the op package can be generated using the following command:
<QNN_SDK_ROOT>/bin/x86_64-linux-clang/qnn-op-package-generator --config <xml_file> --output_path <output_dir>
Further details on op package skeleton generation can be found at Generating Op Packages.
Operator Implementations¶
The operator implementation is the primary area where QHPI APIs differ from the legacy C++ macro-based approach. The following sections provide details on QHPI operator implementation.
Operator Definition¶
In the legacy interface, the DEF_PACKAGE_OP macro and its variants declare kernels and associate them with operators. For example:
template <typename Ttype>
GraphStatus asin_opt(Ttype &out, const Ttype &in);
DEF_PACKAGE_OP(asin_opt<QuantUint16Tensor>, "Asin_16");
DEF_PACKAGE_OP(asin_opt<QuantUint16Tensor_TCM>, "Asin_16");
DEF_PACKAGE_OP(asin_opt<QUint16CroutonTensor>, "Asin_16");
DEF_PACKAGE_OP(asin_opt<QUint16CroutonTensor_TCM>, "Asin_16");
In this example, four kernels are associated with the operator "Asin_16". These kernels differ in layout and memory placement of the input tensor. The macro uses C++ templates to interpret type signatures and match kernels to tensor types.
In QHPI, kernels are declared using a static QHPI_Kernel_vxxx data structure, which defines kernel attributes such as the function name, resources, input/output signatures, and flags.
Example:
static QHPI_Kernel_v1 asin16_kernels[] = {{
.function_name = THIS_PKG_NAME_STR "::" "asin_16_flat",
.function = asin_16<QuantUint16Tensor>,
.resources = QHPI_RESOURCE_HVX,
.source_destructive = true,
.min_inputs = 1,
.input_signature = &sig_flat_16,
.min_outputs = 1,
.output_signature = &sig_flat_16,
}, ... };
QHPI_Tensor_Signature_vxxx
Captures tensor properties such as element type, layout, storage, and memory placement. The legacy macro equivalent is DEF_TENSOR_PROPERTIES.
Example:
static QHPI_Tensor_Signature_v1 sig_flat_16 = {
.element_type = QHPI_QUInt16,
.layout = QHPI_Layout_Flat4,
.storage = QHPI_Storage_Direct,
.mem_placement = QHPI_MemLoc_DDR_OR_TCM,
};
QHPI_OpInfo_vxxx
Defines an operator and associates it with one or more kernels.
Example:
static QHPI_OpInfo_v1 ops[] = {{
.name = THIS_PKG_NAME_STR "::" "Asin_16",
.num_kernels = 2,
.kernels = asin16_kernels,
}, ... };
Note
Operator names follow the convention PackageName::OperatorName.
Kernels are matched in the order they appear in the operator definition by default.
Kernel Implementation¶
Every QHPI kernel defined using QHPI_Kernel_vxxx must include an execution function that implements the kernel for inference.
Example kernel execution signature:
template <typename Ttype>
inline GraphStatus asin_opt(Ttype &out, const Ttype &in);
template<typename TensorType>
static uint32_t asin_16(QHPI_RuntimeHandle *,
uint32_t num_outputs, QHPI_Tensor **outputs,
uint32_t num_inputs, const QHPI_Tensor *const *inputs) {
return asin_opt<TensorType>(*reinterpret_cast<TensorType *>(outputs[0]),
*reinterpret_cast<const TensorType *>(inputs[0]));
}
Precomputation¶
QHPI supports precomputation for optimal inferencing via additional function pointers in QHPI_Kernel_v1:
do_precomputation_function
Called during graph load to initialize a data block. This API replaces the legacy COMPILER_FOR macro. The function has access to tensor information such as shape, block table, and quantization parameters. Any computation based on this information may be done and stored in the data block for later use during graph inference.
function_with_precomputed_data
Called during inference with the runtime handle and the precomputed data block from do_precomputation_function. This is an alternative to a kernel’s default execution function specified in function.
Example:
static QHPI_Kernel_v1 kernels[] = {{
    .function_name = THIS_PKG_NAME_STR "::" "Asin_16",
    ...
    .precomputed_data_size = sizeof(Precompute),
    .do_precomputation_function = asin_do_precomputation,
    .function_with_precomputed_data = asin_use_precomputation,
}};
Multithreading¶
Enable by setting the multithreaded flag in QHPI_Kernel_v1. This enables multi-threaded execution of a kernel across multiple hardware threads.
Access the slice information for the current thread and the total number of slices via the runtime functions:
uint32_t num_slices = qhpi_num_slices(fh);
uint32_t slice_number = qhpi_slice_number(fh);
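As an illustration of how these two values are typically used, the sketch below (plain C, with a hypothetical helper name that is not part of the QHPI API) partitions the rows of a tensor as evenly as possible across slices. A multithreaded kernel would obtain the values from qhpi_num_slices() and qhpi_slice_number() and then process only its own row range:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: given the values returned by qhpi_num_slices() and
 * qhpi_slice_number(), compute the half-open row range [begin, end) that
 * this slice should process. Rows are distributed as evenly as possible:
 * the first (num_rows % num_slices) slices each get one extra row. */
static void slice_row_range(uint32_t num_rows, uint32_t num_slices,
                            uint32_t slice_number,
                            uint32_t *begin, uint32_t *end) {
    uint32_t base = num_rows / num_slices;   /* rows every slice gets */
    uint32_t extra = num_rows % num_slices;  /* leftover rows to spread out */
    *begin = slice_number * base + (slice_number < extra ? slice_number : extra);
    *end = *begin + base + (slice_number < extra ? 1u : 0u);
}
```

Each invocation of the kernel then loops only over its own range, so the slices jointly cover the tensor exactly once without synchronization.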
Source Destructive¶
Specify source_destructive = true in QHPI_Kernel_v1 if the first input and output tensors can share memory. Such a kernel must ensure that it reads the input before writing the corresponding output location.
Note
This optimization is opportunistic and the kernel must be written to run correctly when the tensors do not share the same memory location.
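A minimal sketch of this read-before-write discipline in a kernel body, using a plain quantized ReLU loop (hypothetical helper, not the QHPI API): each element is read into a local variable before the corresponding output slot is written, so the loop remains correct whether or not out aliases in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative source-destructive-safe kernel body: because `out` may alias
 * `in` when the backend shares their memory, each input element is read into
 * a local variable before the same-index output location is written. */
static void relu_u8_inplace_safe(uint8_t *out, const uint8_t *in,
                                 size_t n, uint8_t zero_point) {
    for (size_t i = 0; i < n; i++) {
        uint8_t v = in[i];                        /* read first...           */
        out[i] = v > zero_point ? v : zero_point; /* ...then write that slot */
    }
}
```

An elementwise loop like this is trivially safe in place; kernels with wider read footprints (e.g., convolutions) generally cannot set source_destructive because an output write would clobber input data still needed for neighboring outputs.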
Cost Function¶
In the legacy APIs, the cost functions influenced both kernel selection and execution time prediction. In QHPI, however, they are only used for predicting execution times, and not for kernel selection. Please see Predicates for more on kernel selection.
Example:
float cost_func(const uint32_t num_inputs, const QHPI_Tensor *const *inputs) {
QHPI_Shape shape = qhpi_tensor_shape(inputs[0]);
unsigned size = shape.dims[0] * shape.dims[1] * shape.dims[2] * shape.dims[3];
return size * 0.2f + 10.0f;
}
Optimization Rules¶
QHPI replaces the DEF_OPT Domain Specific Language (DSL) with a simplified C API for graph rewrites. Op writers can implement the following optional rewrite callbacks:
early_rewrite
Invoked before any tiling is performed by the compiler. This function can rewrite the operator into a new subgraph of operators. Operators in this subgraph may be QHPI operators or standard QNN operators.
Example:
static const QHPI_Op *relu_to_relu_minmax_quant_rewrite(const QHPI_Op *op) {
    QHPI_OpRef input = qhpi_op_input(op, 0);
    QHPI_OutputDef input_output = qhpi_op_output(input.op, input.output_number);
    // Check if input is quantized type
    if (input_output.type != QHPI_QUInt8 && input_output.type != QHPI_QUInt16 &&
        input_output.type != QHPI_QInt8 && input_output.type != QHPI_QInt16) {
        return op;
    }
    // Create ReluMinMax with min=0.0f, max=INF
    QHPI_OpRef min_const = gen_const_scalar_f32(op, 0.0f);
    QHPI_OpRef max_const = gen_const_scalar_f32(op, INFINITY);
    QHPI_OpRef inputs[] = {input, min_const, max_const};
    QHPI_OutputDef output = qhpi_op_output(op, 0);
    return qhpi_op_create(op, THIS_PKG_NAME_STR "::ReluMinMax", 3, inputs, 1, &output);
}
late_rewrite
This allows the op package to rewrite operators after tiling into a new subgraph. In this case, the new subgraph should only contain plugin operators and a small set of additional operators such as Slice_shape, Concat, and Reshape. The late rewrite can also be used to introduce scratch space after tiling as unused outputs.
Example:
static const QHPI_Op *relu_late_rewrite(const QHPI_Op *op) {
    // Use late rewrite to add scratch
    if (qhpi_op_num_outputs(op) > 1)
        return op;
    QHPI_OutputDef outputs[2];
    outputs[0] = qhpi_op_output(op, 0);
    outputs[1] = {.type = QHPI_Int32, .shape = {.rank = 4, .dims = {1, 1, 1, 32}}};
    QHPI_OpRef input = qhpi_op_input(op, 0);
    return qhpi_op_create(op, qhpi_op_name(op), 1, &input, 2, outputs);
}
The explicit phase ordering supported in DEF_OPT is replaced by simpler pre- and post-tiling callback functions.
Tiling¶
QHPI supports direct callbacks into a centralized tiling algorithm, which decides how to create smaller versions of operators. By hooking into the central tiler, you can enhance parallelism across functional units and minimize the peak memory footprint, staying in TCM and improving end-to-end latency. At the same time, the central tiler weighs the costs of over-decomposing operators: it avoids excessive inter-op communication and aligns chunking dimensions when possible to minimize concatenation and slicing. The result is that the central tiler chooses chunk sizes for every operator it processes.
It is strongly recommended that users opt in to these callbacks by (at a minimum) providing a build_tile function.
During tiling, several functions may be defined to drive the tiler’s choices on how a QHPI operator in the graph is split. These include:
shape_required
Callback which is passed an instance of a plugin operator and returns a shape object that forces certain sizes on each tiling dimension. This function is optional; if omitted, no restrictions are placed at the start of tiling.
Example:
static QHPI_Shape relu_shape_required(const QHPI_Op *op) {
    // Define tiling requirements - split on height dimension
    static QHPI_Shape required = {
        .rank = 4,
        .dims = {1, RELU_TILE_HEIGHT, 0, RELU_CHANNEL_SPLIT_SIZE}
    };
    return required;
}
shape_legalized
Callback which is passed an instance of a plugin operator and a candidate tile shape, and returns a “legal” tile shape after considering the one initially proposed by the central tiler’s heuristics. This is intended to support scenarios with operator-specific requirements on the shape (e.g., a dimension must be a multiple of some value for good performance). This function is optional; if omitted, no restrictions are assumed beyond those imposed by a shape_required function, if provided.
Example:
static QHPI_Shape relu_shape_legalized(const QHPI_Op *op) {
    static QHPI_Shape legal = {
        .rank = 4,
        .dims = {1, 8, 0, 256}
    };
    ...
    return legal;
}
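The kind of adjustment a shape_legalized implementation typically makes can be sketched with a plain C helper (hypothetical, not part of the QHPI API) that rounds each proposed tile dimension up to an operator-preferred multiple, leaving unconstrained dimensions untouched:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_RANK 4

/* Hypothetical legalization helper: round each proposed tile dimension up to
 * the nearest multiple preferred by the kernel (e.g., an HVX vector width).
 * A multiple of 0 means "no constraint on this dimension". */
static void legalize_dims(uint32_t dims[MAX_RANK],
                          const uint32_t multiples[MAX_RANK]) {
    for (int d = 0; d < MAX_RANK; d++) {
        uint32_t m = multiples[d];
        if (m != 0 && dims[d] % m != 0) {
            dims[d] += m - dims[d] % m; /* round up to the next multiple of m */
        }
    }
}
```

A real shape_legalized callback would apply logic like this to the candidate tile shape and return the adjusted shape to the central tiler.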
build_tile
Callback which is passed an instance of a plugin operator, a starting location, and an extent of the first output, and is expected to create a new instance of the operator that computes that particular output tile. The key aspect is determining the new inputs to this operator, which will (typically) be slices of the inputs of the original. This function is required if you would like your op to be split into smaller chunks; when omitted, the op is passed back without splitting.
Example:
static const QHPI_Op *relu_build_tile(const QHPI_Op *op, const QHPI_Shape *out_start, const QHPI_Shape *out_extent) {
    // Get input reference
    QHPI_OpRef input_ref = qhpi_op_input(op, 0);
    // For ReLU, input and output have same dimensions, so input slice = output slice
    QHPI_Shape in_start = *out_start;
    QHPI_Shape in_extent = *out_extent;
    // Create input slice
    QHPI_OpRef input_slice = qhpi_op_slice(input_ref, &in_start, &in_extent);
    // Build tiled operator with sliced input
    QHPI_OpRef inputs[] = {input_slice};
    QHPI_OutputDef outputs[] = {
        {.type = qhpi_op_output(op, 0).type,
         .quant_parameters = qhpi_op_output(op, 0).quant_parameters,
         .shape = *out_extent}
    };
    return qhpi_op_create(op, qhpi_op_name(op), 1, inputs, 1, outputs);
}
These tiling callbacks can be invoked several times during prepare to perform chunk size evaluation and generate the new sub-operations.
Predicates¶
By default, QHPI matches kernels based on tensor signatures, in the order the kernels were specified at op registration time. The op writer can, however, further influence kernel matching by implementing an optional predicate callback function, which returns true to select the kernel or false to skip over it to the next one.
Example:
uint32_t asin_plugin_default_predicate(const QHPI_Op *op, const uint32_t num_inputs, const QHPI_Tensor *const *inputs)
{
if (num_inputs == 0) {
return 0u; // false
}
for (uint32_t i = 0; i < num_inputs; i++) {
if (inputs[i] == nullptr) {
return 0u; // false
}
QHPI_Shape shape = qhpi_tensor_shape(inputs[i]);
for (uint32_t d = 0; d < shape.rank; d++) {
if (shape.dims[d] == 0) {
return 0u; // false
}
}
QHPI_Quant_Parameters qp = qhpi_tensor_quant_parameters(inputs[i]);
if (qp.stepsize == 0.0f) {
return 0u; // false
}
}
return 1u; // non-zero => "true"
}
Operator Registration¶
QHPI operators defined using QHPI_OpInfo_vxxx can be registered with the QNN HTP BE using the corresponding versioned registration API, qhpi_register_ops_vxxx, as part of the op package dynamic library entry point qhpi_init().
Example:
// OpInfo definitions
static QHPI_OpInfo_v1 ops[] = {
{
.name = THIS_PKG_NAME_STR "::Relu",
.num_kernels = 2,
.kernels = relu_kernels,
.early_rewrite = relu_to_relu_minmax_quant,
.shape_required = relu_shape_required,
.build_tile = relu_build_tile,
},
// ...
};
// Registration function for regular ReLU operations
void register_relu_ops()
{
qhpi_register_ops_v1(sizeof(ops) / sizeof(ops[0]), ops, THIS_PKG_NAME_STR);
}
extern "C" const char *qhpi_init()
{
// Register the ops with HTP BE
register_relu_ops();
return THIS_PKG_NAME_STR;
}
The next step after creating and building a QHPI op package is to build and execute a model that uses operators implemented in the op package. The steps for building and executing a model using QNN HTP BE have not changed due to QHPI, and are outlined in Custom op package tutorial.