llama2 inference numbers
#llama #large_language_models #technology #performance
remember, the gold SUBJECTIVE target is 10tok/s (i think thats roughly what chatgpt runs at, possibly more like 40tok/s, i havent measured it myself), which works out to around 100ms per token. going slower might be good enough for you though!!
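the tables below report time per token, so here's a quick arithmetic sketch for converting between ms/token and tok/s. the 122.55ms value is just one of the wall cpu rows from the table below, used as an example:

```bash
# tokens per second = 1000 ms / (ms per token)
# 122.55ms is the wall "cpu (openblas -t 8)" row from the table below
ms_per_token=122.55
echo "scale=2; 1000 / $ms_per_token" | bc   # ~8.15 tok/s, just under the 10tok/s target
```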
all experiments were run with llama.cpp
models used:
- https://huggingface.co/TheBloke/llama2_7b_chat_uncensored-GGML
- https://huggingface.co/TheBloke/Wizard-Vicuna-13B-Uncensored-GGML
| hostname | device | model | quantization | time per token (eval time) |
|---|---|---|---|---|
| arae | cpu (openblas, -t 8) | llama2_7b_chat_uncensored | q4_k_m | 656.91ms |
| arae | cpu (openblas, -t 1) | llama2_7b_chat_uncensored | q4_k_m | 585.76ms |
| arae | cpu (openblas, -t 4) | llama2_7b_chat_uncensored | q4_k_m | 332.28ms |
| arae | cpu (openblas, -t 7) | llama2_7b_chat_uncensored | q4_k_m | 327.42ms |
| arae | cpu+gpu (native -t 7, cublas -ngl 12) | llama2_7b_chat_uncensored | q4_k_m | 304.32ms |
| arae | cpu+gpu (openblas -t 7, cublas -ngl 12) | llama2_7b_chat_uncensored | q4_k_m | 296.30ms |
| wall | cpu (openblas -t 12) | llama2_7b_chat_uncensored | q4_k_m | 148.85ms |
| wall | cpu (openblas -t 11) | llama2_7b_chat_uncensored | q4_k_m | 132.51ms |
| wall | cpu (native -t 7) | llama2_7b_chat_uncensored | q4_k_m | 127.32ms |
| wall | cpu (openblas -t 10) | llama2_7b_chat_uncensored | q4_k_m | 126.40ms |
| wall | cpu (openblas -t 7) | llama2_7b_chat_uncensored | q4_k_m | 125.30ms |
| wall | cpu (openblas -t 9) | llama2_7b_chat_uncensored | q4_k_m | 124.18ms |
| wall | cpu (openblas -t 8) | llama2_7b_chat_uncensored | q4_k_m | 122.55ms |
| wall | cpu+gpu (openblas -t 8, cublas -ngl 3) | llama2_7b_chat_uncensored | q4_k_m | 113.28ms |
| wall | cpu+gpu (openblas -t 8, cublas -ngl 10) | llama2_7b_chat_uncensored | q4_k_m | 92.63ms |
| wall | cpu+gpu (openblas -t 8, cublas -ngl 50) | Wizard-Vicuna-13B-Uncensored-GGML | q4_k_m | 33.96ms |
| wall | cpu+gpu (openblas -t 8, cublas -ngl 30) | llama2_7b_chat_uncensored | q4_k_m | 33.74ms |
| wall | cpu+gpu (openblas -t 8, cublas -ngl 50) | llama2_7b_chat_uncensored | q4_k_m | 19.30ms |
| wall | cpu?+gpu (openblas -t 8, cublas -ngl 1000) | llama2_7b_chat_uncensored | q4_k_m | 19.44ms |
arae is the same arae from llama inference numbers, except the cublas runs now get to use a:
- nvidia mx350 gpu (2gb vram, allegedly)

wall is the same wall from llama inference numbers. no hardware changes for those runs.
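if you want to check what vram a gpu actually reports (this is where the "allegedly" above and the nvidia-smi number later come from), a query like this should show it, assuming nvidia-smi came with your driver:

```bash
# gpu name and total vram as reported by the driver
nvidia-smi --query-gpu=name,memory.total --format=csv
```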
setting up cublas #
i hate the christ that is Python Packaging, as seen in my python packaging patterns

- install miniconda somewhere. get miniconda here. run the script
- i dont like doing `conda init`. activate it manually - for fish, i run `eval "$(/home/luna/tmp/miniconda3/bin/conda shell.fish hook)"` manually when i want to enter conda
- `conda create -n llamacpp`
- `conda activate llamacpp`
- `conda install -c 'nvidia/label/cuda-12.2.0' cuda-toolkit`
  - around 10gb of disk space needed
- get llama.cpp
- edit its makefile's `LDFLAGS` to point it to your conda environment (it assumes the CUDA toolkit is on `/opt/cuda`; we dont do that because this method of installing CUDA is non-standard! but it's easily reproducible at least.)
  - e.g. add `-L/home/luna/tmp/miniconda3/envs/llamacpp/lib`
- `make LLAMA_CUBLAS=1`
- `wget 'https://huggingface.co/TheBloke/llama2_7b_chat_uncensored-GGML/resolve/main/llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin'`
  - other quantization methods may be wanted, like q5 for larger ram systems??
- the compiled `./main` will have successfully linked to the CUDA libraries in the conda path, but your dynamic linker does not know about those. that means directly using the binary leads to `./main: error while loading shared libraries: libcublas.so.12: cannot open shared object file: No such file or directory`. use `LD_LIBRARY_PATH` overrides to fix this
- full run looks like this (see the consolidated sketch after this list)
  - `env LD_LIBRARY_PATH=/home/luna/tmp/miniconda3/envs/llamacpp/lib ./main -m ./llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin --color -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 -t 7 -ngl 12 -p 'Write a paragraph about the hit game Among Us'`
  - `-ngl` is the amount of model layers sent to the gpu. for my 400ms/tok number i used `-ngl 12`, anything larger is an OOM (my gpu has 2gb vram as reported by nvidia-smi)
  - those are bad parameters for actual inference. if you want to actually use this for anything, choose different parameters
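put together, the whole cublas setup looks roughly like this. it's a sketch rather than a copy-paste script: the miniconda path and env name are from my machine, i'm assuming llama.cpp comes from github.com/ggerganov/llama.cpp, and it expects to be run line by line in a shell where the conda hook above has already been eval'd:

```bash
# sketch of the full cuBLAS setup; paths and names below are from my setup, adjust to yours
conda create -n llamacpp
conda activate llamacpp
conda install -c 'nvidia/label/cuda-12.2.0' cuda-toolkit   # ~10gb of disk space

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# edit the Makefile by hand at this point: add -L/home/luna/tmp/miniconda3/envs/llamacpp/lib
# (i.e. -L$CONDA_PREFIX/lib) to LDFLAGS so the build links against the conda CUDA toolkit
make LLAMA_CUBLAS=1

wget 'https://huggingface.co/TheBloke/llama2_7b_chat_uncensored-GGML/resolve/main/llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin'

# LD_LIBRARY_PATH override so ./main can find libcublas.so.12 at runtime
env LD_LIBRARY_PATH=/home/luna/tmp/miniconda3/envs/llamacpp/lib ./main \
  -m ./llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin \
  --color -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 \
  -t 7 -ngl 12 -p 'Write a paragraph about the hit game Among Us'
```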
setting up without cublas #
- get llama.cpp
- `make`. you need `gcc` and `g++`
- `wget 'https://huggingface.co/TheBloke/llama2_7b_chat_uncensored-GGML/resolve/main/llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin'`
- `./main -m ./llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin --color -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 -t 7 -p 'Write a paragraph about the hit game Among Us'`
  - same thing as with cuBLAS, but without `-ngl` and `LD_LIBRARY_PATH` (consolidated sketch below)
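and the consolidated cpu-only version, same caveats as the cublas sketch above (my paths, assumed clone url):

```bash
# cpu-only sketch; no conda or CUDA needed, just gcc and g++
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

wget 'https://huggingface.co/TheBloke/llama2_7b_chat_uncensored-GGML/resolve/main/llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin'

./main -m ./llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin \
  --color -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 \
  -t 7 -p 'Write a paragraph about the hit game Among Us'
```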