llama2 inference numbers
#llama #large_language_models #technology #performance %at=2023-07-25T01:18:10 remember, the gold **SUBJECTIVE** target is 10 tok/s, which works out to about 100 ms per token (i think that's roughly what chatgpt runs at, though it may be closer to 40 tok/s; i haven't measured it yet). goi…