llama inference numbers
#llama #large_language_models #technology #performance
| hostname | device | model | time per token |
|---|---|---|---|
| arae | cpu | 7b q4 (non-gptq) | 1000ms |
| arae | cpu | 7b q4 (gptq) | TODO |
| wall | cpu | 7b q4 (non-gptq) | 160ms |
| wall | cpu | 7b q4 (gptq) | 160ms |
| wall | cpu | 7b | 300ms |
| wall | gpu | 7b q4 (non-gptq) | TODO |
| wall | gpu | 7b q4 (gptq) | TODO |
| wall | gpu | 13b q4 (gptq) | 66ms |
| wall | gpu | 7b | ENOMEM |
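For comparing these rows it's easier to think in tokens/second; a quick sketch converting the measured times above (labels are just shorthand for the table rows):

```python
# Convert measured time-per-token into tokens/second.
ms_per_token = {
    "arae cpu 7b q4": 1000,
    "wall cpu 7b q4": 160,
    "wall cpu 7b": 300,
    "wall gpu 13b q4 (gptq)": 66,
}

for setup, ms in ms_per_token.items():
    print(f"{setup}: {1000 / ms:.1f} tokens/s")
```

So the 13b gptq run on the 3060 is roughly 15x faster than the laptop CPU run.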
machine setups #
wall:
- AMD Ryzen 5 4600G @ 3.7GHz
- 16GB RAM
- GPU: RTX 3060 12GB
arae:
- laptop
- Intel i5-1035G1 @ 3.6GHz
- 8GB RAM
- can't run the non-quantized model (needs more than 8GB of RAM)
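The ENOMEM row and arae's inability to load the non-quantized model are consistent with a back-of-the-envelope weight-size estimate (a sketch only; it counts weights alone and ignores KV cache, activations, and runtime overhead):

```python
def weights_gb(params_billion, bytes_per_weight):
    # Rough size of the model weights alone, in GB.
    return params_billion * bytes_per_weight

# fp16 = 2 bytes/weight; 4-bit quantized ~ 0.5 bytes/weight
print(weights_gb(7, 2))     # 14.0 GB -> over 12GB VRAM (wall gpu) and 8GB RAM (arae)
print(weights_gb(7, 0.5))   # 3.5 GB  -> fits everywhere
print(weights_gb(13, 0.5))  # 6.5 GB  -> fits in 12GB VRAM
```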