llama2 inference numbers
#llama #large_language_models #technology #performance
remember, the gold SUBJECTIVE target is 10tok/s, which is around 100ms per token (i think thats roughly what chatgpt runs at, though it might be closer to 40tok/s, i havent measured those numbers myself). going slower might be good enough for you though!!
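if you want to compare the numbers below against that target, the conversion is just 1000 divided by the per-token time in milliseconds. a quick sketch of the arithmetic in shell (nothing llama.cpp-specific; the 122.55ms value is just one row from the table below):

```
# convert llama.cpp's "ms per token" into tokens per second
# e.g. wall's cpu-only 122.55ms/token works out to roughly 8.2 tok/s
ms_per_token=122.55
awk -v ms="$ms_per_token" 'BEGIN { printf "%.2f tok/s\n", 1000.0 / ms }'
```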
all experiments were run with llama.cpp
models used:
- https://huggingface.co/TheBloke/llama2_7b_chat_uncensored-GGML
- https://huggingface.co/TheBloke/Wizard-Vicuna-13B-Uncensored-GGML
hostname | device | model | quantization | time per token (eval time)
---|---|---|---|---
arae | cpu (openblas, -t 8) | llama2_7b_chat_uncensored | q4_k_m | 656.91ms
arae | cpu (openblas, -t 1) | llama2_7b_chat_uncensored | q4_k_m | 585.76ms
arae | cpu (openblas, -t 4) | llama2_7b_chat_uncensored | q4_k_m | 332.28ms
arae | cpu (openblas, -t 7) | llama2_7b_chat_uncensored | q4_k_m | 327.42ms
arae | cpu+gpu (native -t 7, cublas -ngl 12) | llama2_7b_chat_uncensored | q4_k_m | 304.32ms
arae | cpu+gpu (openblas -t 7, cublas -ngl 12) | llama2_7b_chat_uncensored | q4_k_m | 296.30ms
wall | cpu (openblas -t 12) | llama2_7b_chat_uncensored | q4_k_m | 148.85ms
wall | cpu (openblas -t 11) | llama2_7b_chat_uncensored | q4_k_m | 132.51ms
wall | cpu (native -t 7) | llama2_7b_chat_uncensored | q4_k_m | 127.32ms
wall | cpu (openblas -t 10) | llama2_7b_chat_uncensored | q4_k_m | 126.40ms
wall | cpu (openblas -t 7) | llama2_7b_chat_uncensored | q4_k_m | 125.30ms
wall | cpu (openblas -t 9) | llama2_7b_chat_uncensored | q4_k_m | 124.18ms
wall | cpu (openblas -t 8) | llama2_7b_chat_uncensored | q4_k_m | 122.55ms
wall | cpu+gpu (openblas -t 8, cublas -ngl 3) | llama2_7b_chat_uncensored | q4_k_m | 113.28ms
wall | cpu+gpu (openblas -t 8, cublas -ngl 10) | llama2_7b_chat_uncensored | q4_k_m | 92.63ms
wall | cpu+gpu (openblas -t 8, cublas -ngl 50) | Wizard-Vicuna-13B-Uncensored-GGML | q4_k_m | 33.96ms
wall | cpu+gpu (openblas -t 8, cublas -ngl 30) | llama2_7b_chat_uncensored | q4_k_m | 33.74ms
wall | cpu+gpu (openblas -t 8, cublas -ngl 50) | llama2_7b_chat_uncensored | q4_k_m | 19.30ms
wall | cpu?+gpu (openblas -t 8, cublas -ngl 1000) | llama2_7b_chat_uncensored | q4_k_m | 19.44ms
arae is the same machine as in the llama inference numbers, except cublas now gets a gpu:
- nvidia mx350 (2gb vram, allegedly)
wall is the same as wall from the llama inference numbers. no hardware changes for those runs.
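for reference, the "time per token (eval time)" column is what llama.cpp prints in its timing summary at the end of a run. here is a rough sketch of how a thread-count sweep like the wall rows above can be scripted; the grep pattern assumes the ggml-era `llama_print_timings: ... eval time ...` output, so adjust it if your build prints timings differently:

```
#!/bin/sh
# sweep thread counts on a cpu build and keep only llama.cpp's eval-time summary.
# assumption: the binary prints "llama_print_timings: ... eval time ..." lines at
# the end of a run (ggml-era builds did); tweak the grep if yours differs.
MODEL=./llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin
for t in 1 4 7 8 9 10 11 12; do
    echo "== -t $t =="
    ./main -m "$MODEL" -t "$t" -n 128 \
        -p 'Write a paragraph about the hit game Among Us' 2>&1 \
        | grep 'eval time'
done
```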
setting up cublas #
i hate the christ that is Python Packaging, as seen in my python packaging patterns.
- install miniconda somewhere. get miniconda here. run the script
- i dont like doing conda init. activate it manually instead - for fish, i run
  eval "$(/home/luna/tmp/miniconda3/bin/conda shell.fish hook)"
  manually when i want to enter conda (a bash variant is sketched right after these conda steps)
- create an environment and install the CUDA toolkit into it:
  conda create -n llamacpp
  conda activate llamacpp
  conda install -c 'nvidia/label/cuda-12.2.0' cuda-toolkit
- around 10gb disk space needed
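if you're on bash instead of fish, the equivalent manual activation (same idea, paths are my machine's, adjust to wherever you put miniconda) looks like this:

```
# enter conda without `conda init` - bash equivalent of the fish hook above
eval "$(/home/luna/tmp/miniconda3/bin/conda shell.bash hook)"
conda activate llamacpp
```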
- get llama.cpp
- edit its makefile's LDFLAGS to point it at your conda environment (the makefile assumes the CUDA toolkit is on /opt/cuda; we dont do that because this method of installing CUDA is non-standard! but it's easily reproducible at least)
  - e.g. add -L/home/luna/tmp/miniconda3/envs/llamacpp/lib
- build with
  make LLAMA_CUBLAS=1
- download a model:
  wget 'https://huggingface.co/TheBloke/llama2_7b_chat_uncensored-GGML/resolve/main/llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin'
  - other quantizations may be worth grabbing, like q5 for systems with more ram
- the compiled ./main will have linked against the CUDA libraries in the conda path, but your dynamic linker does not know about those at runtime. directly running the binary gives you
  ./main: error while loading shared libraries: libcublas.so.12: cannot open shared object file: No such file or directory
  so use an LD_LIBRARY_PATH override to fix it
- a full run looks like this
  env LD_LIBRARY_PATH=/home/luna/tmp/miniconda3/envs/llamacpp/lib ./main -m ./llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin --color -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 -t 7 -ngl 12 -p 'Write a paragraph about the hit game Among Us'
- -ngl is the number of model layers offloaded to the gpu. for my 400ms/tok number i used -ngl 12; anything higher is an OOM (my gpu has 2gb vram as reported by nvidia-smi). a small sweep sketch for finding your own ceiling follows after this list
- those are bad sampling parameters for actual inference. if you want to actually use this for anything, choose different parameters
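since the right -ngl depends on your vram, the quickest way to find the ceiling is to sweep it and watch for OOM. a rough sketch: the 'out of memory' match is an assumption about how the CUDA error gets printed, and the eval-time grep assumes the same llama_print_timings output as before.

```
#!/bin/sh
# try increasing -ngl values until the gpu runs out of memory.
# on my 2gb mx350 the ceiling is around -ngl 12; on wall even -ngl 1000 fits,
# as the table above shows.
MODEL=./llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin
LIB=/home/luna/tmp/miniconda3/envs/llamacpp/lib
for ngl in 3 6 9 12 15 20 30 50; do
    echo "== -ngl $ngl =="
    env LD_LIBRARY_PATH="$LIB" ./main -m "$MODEL" -t 7 -ngl "$ngl" -n 64 \
        -p 'Write a paragraph about the hit game Among Us' 2>&1 \
        | grep -E 'eval time|out of memory' || echo "run failed (probably OOM)"
done
```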
setting up without cublas #
- get llama.cpp
- build with
  make
  - you need gcc and g++
- download a model:
  wget 'https://huggingface.co/TheBloke/llama2_7b_chat_uncensored-GGML/resolve/main/llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin'
- a full run looks like this
  ./main -m ./llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin --color -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 -t 7 -p 'Write a paragraph about the hit game Among Us'
- same thing as with cuBLAS, but without -ngl and LD_LIBRARY_PATH
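the "native" rows in the table above are plain make builds and the "openblas" rows use the openblas backend. a hedged sketch of that variant, assuming your llama.cpp checkout still takes the LLAMA_OPENBLAS flag (the ggml-era Makefile did) and openblas is installed system-wide:

```
# rebuild with the openblas backend instead of the plain "native" build
make clean
make LLAMA_OPENBLAS=1
./main -m ./llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin -t 8 \
    -p 'Write a paragraph about the hit game Among Us'
```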