llama inference numbers
#llama #large_language_models #technology #performance
| hostname | device | model | time per token |
|---|---|---|---|
| arae | cpu | 7b q4 (non-gptq) | 1000ms |
| arae | cpu | 7b q4 (gptq) | TODO |
| wall | cpu | 7b q4 (non-gptq) | 160ms |
| wall | cpu | 7b q4 (gptq) | 160ms |
| wall | cpu | 7b | 300ms |
| wall | gpu | 7b q4 (non-gptq) | TODO |
| wall | gpu | 7b q4 (gptq) | TODO |
| wall | gpu | 13b q4 (gptq) | 66ms |
| wall | gpu | 7b | ENOMEM |
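Per-token latency can be flipped into throughput (tokens/sec) for easier comparison. A minimal sketch, using only the completed rows from the table above (the `TODO` and `ENOMEM` rows are skipped):

```python
# Latencies (ms/token) copied from the table above; TODO/ENOMEM rows omitted.
latencies_ms = {
    ("arae", "cpu", "7b q4 (non-gptq)"): 1000,
    ("wall", "cpu", "7b q4 (non-gptq)"): 160,
    ("wall", "cpu", "7b q4 (gptq)"): 160,
    ("wall", "cpu", "7b"): 300,
    ("wall", "gpu", "13b q4 (gptq)"): 66,
}

# tokens/sec = 1000 / (ms per token)
for (host, device, model), ms in latencies_ms.items():
    print(f"{host}/{device} {model}: {1000 / ms:.2f} tok/s")
```

So the fastest configuration measured (13b q4 gptq on wall's GPU at 66ms) works out to roughly 15 tok/s, versus 1 tok/s for arae's CPU.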
machine setups #
wall:
- AMD Ryzen 5 4600G @ 3.7GHz
- 16GB RAM
- GPU: RTX 3060 (12GB VRAM)
arae:
- laptop
- Intel i5-1035G1 @ 3.6GHz
- 8GB RAM
- can't run the non-quantized model