prefix_quant¶
The prefix_quant feature is a quantization technique for large language models (LLMs). By pre-inserting “prefixed outlier tokens” into the KV cache, it significantly reduces activation quantization error, yielding more stable activation distributions and higher quantized-model accuracy, which makes it better suited to production deployment of large models.
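A toy, self-contained illustration (not the prefix_quant implementation) of the underlying problem: a handful of activation outliers inflate the per-tensor quantization scale and therefore the quantization error of every other element. These are the outliers the prefixed tokens are meant to absorb out of the quantized range.

```python
import random

def quantize_dequantize(xs):
    # Symmetric per-tensor int8 round-trip: the scale is set by the
    # largest-magnitude element, so one outlier coarsens every value.
    scale = max(abs(v) for v in xs) / 127.0
    return [max(-127, min(127, round(v / scale))) * scale for v in xs]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
acts = [random.gauss(0.0, 1.0) for _ in range(1024)]          # well-behaved activations
with_outliers = [50.0] * 4 + acts[4:]                         # same tensor plus 4 outliers

err_plain = mse(acts, quantize_dequantize(acts))
err_outliers = mse(with_outliers, quantize_dequantize(with_outliers))
# err_outliers is orders of magnitude larger than err_plain
```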
Tutorial on how to use prefix_quant cache in genie-t2t-run¶
Note
To use the prefix_quant cache in genie-t2t-run, two steps are required: 1. set “bos-token” to -1 in the genie config; 2. remove the prefix tokens from the prompt (the prefix cache will provide these tokens automatically).
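For step 1, the change lands in the genie config JSON (here, llama3-3b-htp.json). The nesting shown below is only an assumption for illustration; where the key sits depends on your config schema, but the required setting is “bos-token”: -1.

```json
{
  "dialog": {
    "tokenizer": {
      "bos-token": -1
    }
  }
}
```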
In genie-t2t-run, the --restore (or -r) option can be used to load the prefix_quant cache prior to handling a query when running in basic dialog mode.
For example:
./genie-t2t-run -c llama3-3b-htp.json \
    -p "Tell me about Qualcomm." \
    -r /data/local/tmp/llama3-3b/prefix_cache
Tutorial on how to use prefix_quant cache in genie-app¶
Note
To use the prefix_quant cache in genie-app, two steps are required: 1. set “bos-token” to -1 in the genie config; 2. remove the prefix tokens from the prompt (the prefix cache will provide these tokens automatically).
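Step 2 means the prompt you pass to the query must not repeat the tokens already baked into the prefix cache. A minimal Python sketch of that trimming, done before handing the prompt to Genie (the function name and the “&lt;PFX&gt;” marker are illustrative assumptions, not part of the Genie API):

```python
def strip_prefix(prompt: str, prefix: str) -> str:
    """Drop a known prefix string from the prompt; the restored
    prefix cache supplies those tokens instead."""
    if prompt.startswith(prefix):
        return prompt[len(prefix):].lstrip()
    return prompt

# The cache already contains the tokens for "<PFX>", so remove them:
print(strip_prefix("<PFX> Tell me about Qualcomm.", "<PFX>"))
```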
In genie-app, the “dialog restore DIALOG_NAME PATH” command can be used in a genie-app script to load the prefix_quant cache prior to handling a query when running in basic dialog mode.
profile create profile1
log create log1 verbose log.txt
dialog config create config1 llama3-3b-htp.json
dialog config bind profile config1 profile1
dialog config bind log config1 log1
dialog create dialog1 config1
dialog restore dialog1 /data/local/tmp/llama3-3b/prefix_cache
dialog query dialog1 "Tell me about Qualcomm."
profile save profile1 profile1.json
dialog config free config1
dialog free dialog1
profile free profile1
log free log1