llama inference numbers

#llama #large_language_models #technology #performance

| hostname | device | model            | time per token |
|----------|--------|------------------|----------------|
| arae     | cpu    | 7b q4 (non-gptq) | 1000ms         |
| arae     | cpu    | 7b q4 (gptq)     | TODO           |
| wall     | cpu    | 7b q4 (non-gptq) | 160ms          |
| wall     | cpu    | 7b q4 (gptq)     | 160ms          |
| wall     | cpu    | 7b               | 300ms          |
| wall     | gpu    | 7b q4 (non-gptq) | TODO           |
| wall     | gpu    | 7b q4 (gptq)     | TODO           |
| wall     | gpu    | 13b q4 (gptq)    | 66ms           |
| wall     | gpu    | 7b               | ENOMEM         |
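
These figures are per-token latencies, i.e. the inverse of throughput: 1000ms/token is about 1 token/s, while 66ms/token is about 15 tokens/s. A minimal sketch of how such a number can be measured, assuming a `generate(prompt, n_tokens)` callable wrapping whatever inference binary or binding is in use (the name and signature are placeholders, not part of the setup above):

```python
import time

def ms_per_token(generate, prompt: str, n_tokens: int = 64) -> float:
    """Return mean milliseconds per generated token.

    `generate` is a hypothetical stand-in for the actual inference
    call (e.g. a llama.cpp binding); it is assumed to produce exactly
    n_tokens tokens for the given prompt.
    """
    start = time.perf_counter()
    generate(prompt, n_tokens)  # one full generation pass
    elapsed_s = time.perf_counter() - start
    return elapsed_s * 1000.0 / n_tokens
```

Note this folds prompt evaluation into the average; llama.cpp's own timing output reports prompt eval and generation separately, so the two methods won't agree exactly.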

machine setups

wall:

arae: