llama inference numbers
#llama #large_language_models #technology #performance
| hostname | device | model | time per token |
|---|---|---|---|
| arae | cpu | 7b q4 (non-gptq) | 1000ms |
| arae | cpu | 7b q4 (gptq) | TODO |
| wall | cpu | 7b q4 (non-gptq) | 160ms |
| wall | cpu | 7b q4 (gptq) | 160ms |
| wall | cpu | 7b | 300ms |
| wall | gpu | 7b q4 (non-gptq) | TODO |
| wall | gpu | 7b q4 (gptq) | TODO |
| wall | gpu | 13b q4 (gptq) | 66ms |
| wall | gpu | 7b | ENOMEM |
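Per-token latency can be flipped into throughput (tokens/sec) for easier comparison. A minimal sketch, using only the completed rows from the table above (the `TODO` and `ENOMEM` rows are skipped):

```python
# Latencies (ms/token) copied from the table above; TODO/ENOMEM rows omitted.
latencies_ms = {
    ("arae", "cpu", "7b q4 (non-gptq)"): 1000,
    ("wall", "cpu", "7b q4 (non-gptq)"): 160,
    ("wall", "cpu", "7b q4 (gptq)"): 160,
    ("wall", "cpu", "7b"): 300,
    ("wall", "gpu", "13b q4 (gptq)"): 66,
}

# tokens/sec = 1000 / (ms per token)
for (host, device, model), ms in latencies_ms.items():
    print(f"{host}/{device} {model}: {1000 / ms:.2f} tok/s")
```

So the fastest configuration measured (13b q4 gptq on wall's GPU at 66ms) works out to roughly 15 tok/s, versus 1 tok/s for arae's CPU.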
machine setups #
wall:
- AMD Ryzen 5 4600G @ 3.7GHz
- 16GB RAM
- GPU: RTX 3060 (12GB VRAM)
arae:
- laptop
- Intel i5-1035G1 @ 3.6GHz
- 8GB RAM
- can't run the non-quantized model