prefix_quant

The prefix_quant feature is a quantization technique for large language models (LLMs). By pre-inserting “prefixed outlier tokens” into the KV cache, it significantly reduces activation quantization error, yielding more stable activation distributions and higher quantized-model accuracy, which makes it better suited to engineering-grade deployment of large models.

Tutorial on how to use prefix_quant cache in genie-t2t-run

Note

To use the prefix_quant cache in genie-t2t-run, two steps are required: 1. set “bos-token” to -1 in the Genie config; 2. remove the prefix tokens from the prompt (the prefix cache provides these tokens automatically).
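As a minimal sketch of step 1, the fragment below shows a Genie config with “bos-token” set to -1. Only the “bos-token”: -1 entry is prescribed by the steps above; the surrounding keys and the nesting under a tokenizer section are illustrative assumptions, and the exact schema may differ between Genie versions, so merge the setting into your existing config (e.g. llama3-3b-htp.json) rather than copying this file wholesale.

```json
{
  "dialog": {
    "tokenizer": {
      "path": "tokenizer.json",
      "bos-token": -1
    }
  }
}
```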

In genie-t2t-run, the --restore (or -r) option loads the prefix_quant cache before a query is handled when running in basic dialog mode.

For example:

./genie-t2t-run -c llama3-3b-htp.json \
                -p "Tell me about Qualcomm." \
                -r /data/local/tmp/llama3-3b/prefix_cache

Tutorial on how to use prefix_quant cache in genie-app

Note

To use the prefix_quant cache in genie-app, two steps are required: 1. set “bos-token” to -1 in the Genie config; 2. remove the prefix tokens from the prompt (the prefix cache provides these tokens automatically).

In genie-app, the “dialog restore DIALOG_NAME PATH” command can be used in a genie-app script to load the prefix_quant cache before a query is handled when running in basic dialog mode.

Script for genie-app
profile create profile1
log create log1 verbose log.txt
dialog config create config1 llama3-3b-htp.json

dialog config bind profile config1 profile1
dialog config bind log config1 log1

dialog create dialog1 config1

dialog restore dialog1 /data/local/tmp/llama3-3b/prefix_cache

dialog query dialog1 "Tell me about qualcomm."

profile save profile1 profile1.json

dialog config free config1
dialog free dialog1
profile free profile1
log free log1