llama2 inference numbers
#llama #large_language_models #technology #performance
remember, the gold SUBJECTIVE target is 10tok/s (i think thats roughly what chatgpt runs at, possibly more like 40tok/s, i havent measured it myself), which works out to around 100ms per token. going slower might be good enough for you though!!
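the tables below report time per token, so here's a quick arithmetic sketch for converting between ms/token and tok/s. the 122.55ms value is just one of the wall cpu rows from the table below, used as an example:

```bash
# tokens per second = 1000 ms / (ms per token)
# 122.55ms is the wall "cpu (openblas -t 8)" row from the table below
ms_per_token=122.55
echo "scale=2; 1000 / $ms_per_token" | bc   # ~8.15 tok/s, just under the 10tok/s target
```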
all experiments were run with llama.cpp
models used:
- https://huggingface.co/TheBloke/llama2_7b_chat_uncensored-GGML
- https://huggingface.co/TheBloke/Wizard-Vicuna-13B-Uncensored-GGML
| hostname | device | model | quantization | time per token (eval time) |
|---|---|---|---|---|
| arae | cpu (openblas, -t 8) | llama2_7b_chat_uncensored | q4_k_m | 656.91ms |
| arae | cpu (openblas, -t 1) | llama2_7b_chat_uncensored | q4_k_m | 585.76ms |
| arae | cpu (openblas, -t 4) | llama2_7b_chat_uncensored | q4_k_m | 332.28ms |
| arae | cpu (openblas, -t 7) | llama2_7b_chat_uncensored | q4_k_m | 327.42ms |
| arae | cpu+gpu (native -t 7, cublas -ngl 12) | llama2_7b_chat_uncensored | q4_k_m | 304.32ms |
| arae | cpu+gpu (openblas -t 7, cublas -ngl 12) | llama2_7b_chat_uncensored | q4_k_m | 296.30ms |
| wall | cpu (openblas -t 12) | llama2_7b_chat_uncensored | q4_k_m | 148.85ms |
| wall | cpu (openblas -t 11) | llama2_7b_chat_uncensored | q4_k_m | 132.51ms |
| wall | cpu (native -t 7) | llama2_7b_chat_uncensored | q4_k_m | 127.32ms |
| wall | cpu (openblas -t 10) | llama2_7b_chat_uncensored | q4_k_m | 126.40ms |
| wall | cpu (openblas -t 7) | llama2_7b_chat_uncensored | q4_k_m | 125.30ms |
| wall | cpu (openblas -t 9) | llama2_7b_chat_uncensored | q4_k_m | 124.18ms |
| wall | cpu (openblas -t 8) | llama2_7b_chat_uncensored | q4_k_m | 122.55ms |
| wall | cpu+gpu (openblas -t 8, cublas -ngl 3) | llama2_7b_chat_uncensored | q4_k_m | 113.28ms |
| wall | cpu+gpu (openblas -t 8, cublas -ngl 10) | llama2_7b_chat_uncensored | q4_k_m | 92.63ms |
| wall | cpu+gpu (openblas -t 8, cublas -ngl 50) | Wizard-Vicuna-13B-Uncensored-GGML | q4_k_m | 33.96ms |
| wall | cpu+gpu (openblas -t 8, cublas -ngl 30) | llama2_7b_chat_uncensored | q4_k_m | 33.74ms |
| wall | cpu+gpu (openblas -t 8, cublas -ngl 50) | llama2_7b_chat_uncensored | q4_k_m | 19.30ms |
| wall | cpu?+gpu (openblas -t 8, cublas -ngl 1000) | llama2_7b_chat_uncensored | q4_k_m | 19.44ms |
arae is the same arae from llama inference numbers, except the cublas runs now get to use a:
- nvidia mx350 gpu (2gb vram, allegedly)

wall is the same wall from llama inference numbers. no hardware changes for those runs.
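if you want to check what vram a gpu actually reports (this is where the "allegedly" above and the nvidia-smi number later come from), a query like this should show it, assuming nvidia-smi came with your driver:

```bash
# gpu name and total vram as reported by the driver
nvidia-smi --query-gpu=name,memory.total --format=csv
```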
setting up cublas #
i hate the christ that is Python Packaging, as seen in my python packaging patterns

- install miniconda somewhere. get miniconda here. run the script
- i dont like doing `conda init`. activate it manually - for fish, i run `eval "$(/home/luna/tmp/miniconda3/bin/conda shell.fish hook)"` manually when i want to enter conda
- `conda create -n llamacpp`
- `conda activate llamacpp`
- `conda install -c 'nvidia/label/cuda-12.2.0' cuda-toolkit`
  - around 10gb of disk space needed
- get llama.cpp
- edit its makefile's `LDFLAGS` to point it to your conda environment (it assumes the CUDA toolkit is on `/opt/cuda`; we dont do that because this method of installing CUDA is non-standard! but it's easily reproducible at least.)
  - e.g. add `-L/home/luna/tmp/miniconda3/envs/llamacpp/lib`
- `make LLAMA_CUBLAS=1`
- `wget 'https://huggingface.co/TheBloke/llama2_7b_chat_uncensored-GGML/resolve/main/llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin'`
  - other quantization methods may be wanted, like q5 for larger ram systems??
- the compiled `./main` will have successfully linked to the CUDA libraries in the conda path, but your dynamic linker does not know about those. that means directly using the binary leads to `./main: error while loading shared libraries: libcublas.so.12: cannot open shared object file: No such file or directory`. use `LD_LIBRARY_PATH` overrides to fix this
- full run looks like this (see the consolidated sketch after this list)
  - `env LD_LIBRARY_PATH=/home/luna/tmp/miniconda3/envs/llamacpp/lib ./main -m ./llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin --color -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 -t 7 -ngl 12 -p 'Write a paragraph about the hit game Among Us'`
  - `-ngl` is the amount of model layers sent to the gpu. for my 400ms/tok number i used `-ngl 12`, anything larger is an OOM (my gpu has 2gb vram as reported by nvidia-smi)
  - those are bad parameters for actual inference. if you want to actually use this for anything, choose different parameters
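put together, the whole cublas setup looks roughly like this. it's a sketch rather than a copy-paste script: the miniconda path and env name are from my machine, i'm assuming llama.cpp comes from github.com/ggerganov/llama.cpp, and it expects to be run line by line in a shell where the conda hook above has already been eval'd:

```bash
# sketch of the full cuBLAS setup; paths and names below are from my setup, adjust to yours
conda create -n llamacpp
conda activate llamacpp
conda install -c 'nvidia/label/cuda-12.2.0' cuda-toolkit   # ~10gb of disk space

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# edit the Makefile by hand at this point: add -L/home/luna/tmp/miniconda3/envs/llamacpp/lib
# (i.e. -L$CONDA_PREFIX/lib) to LDFLAGS so the build links against the conda CUDA toolkit
make LLAMA_CUBLAS=1

wget 'https://huggingface.co/TheBloke/llama2_7b_chat_uncensored-GGML/resolve/main/llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin'

# LD_LIBRARY_PATH override so ./main can find libcublas.so.12 at runtime
env LD_LIBRARY_PATH=/home/luna/tmp/miniconda3/envs/llamacpp/lib ./main \
  -m ./llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin \
  --color -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 \
  -t 7 -ngl 12 -p 'Write a paragraph about the hit game Among Us'
```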
setting up without cublas #
- get llama.cpp
- `make`. you need `gcc` and `g++`
- `wget 'https://huggingface.co/TheBloke/llama2_7b_chat_uncensored-GGML/resolve/main/llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin'`
- `./main -m ./llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin --color -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 -t 7 -p 'Write a paragraph about the hit game Among Us'`
  - same thing as with cuBLAS, but without `-ngl` and `LD_LIBRARY_PATH` (consolidated sketch below)
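and the consolidated cpu-only version, same caveats as the cublas sketch above (my paths, assumed clone url):

```bash
# cpu-only sketch; no conda or CUDA needed, just gcc and g++
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

wget 'https://huggingface.co/TheBloke/llama2_7b_chat_uncensored-GGML/resolve/main/llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin'

./main -m ./llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin \
  --color -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 \
  -t 7 -p 'Write a paragraph about the hit game Among Us'
```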