llama2 inference numbers

#llama #large_language_models #technology #performance


remember, the gold SUBJECTIVE target is 10tok/s, which works out to around 100ms per token (i think chatgpt runs faster, probably 40tok/s, but i haven't measured it). going slower might still be good enough for you though!!

all experiments running with llama.cpp

models used:

| hostname | device | model | quantization | time per token (eval time) |
| --- | --- | --- | --- | --- |
| arae | cpu (openblas, -t 8) | llama2_7b_chat_uncensored | q4_k_m | 656.91ms |
| arae | cpu (openblas, -t 1) | llama2_7b_chat_uncensored | q4_k_m | 585.76ms |
| arae | cpu (openblas, -t 4) | llama2_7b_chat_uncensored | q4_k_m | 332.28ms |
| arae | cpu (openblas, -t 7) | llama2_7b_chat_uncensored | q4_k_m | 327.42ms |
| arae | cpu+gpu (native -t 7, cublas -ngl 12) | llama2_7b_chat_uncensored | q4_k_m | 304.32ms |
| arae | cpu+gpu (openblas -t 7, cublas -ngl 12) | llama2_7b_chat_uncensored | q4_k_m | 296.30ms |
| wall | cpu (openblas -t 12) | llama2_7b_chat_uncensored | q4_k_m | 148.85ms |
| wall | cpu (openblas -t 11) | llama2_7b_chat_uncensored | q4_k_m | 132.51ms |
| wall | cpu (native -t 7) | llama2_7b_chat_uncensored | q4_k_m | 127.32ms |
| wall | cpu (openblas -t 10) | llama2_7b_chat_uncensored | q4_k_m | 126.40ms |
| wall | cpu (openblas -t 7) | llama2_7b_chat_uncensored | q4_k_m | 125.30ms |
| wall | cpu (openblas -t 9) | llama2_7b_chat_uncensored | q4_k_m | 124.18ms |
| wall | cpu (openblas -t 8) | llama2_7b_chat_uncensored | q4_k_m | 122.55ms |
| wall | cpu+gpu (openblas -t 8, cublas -ngl 3) | llama2_7b_chat_uncensored | q4_k_m | 113.28ms |
| wall | cpu+gpu (openblas -t 8, cublas -ngl 10) | llama2_7b_chat_uncensored | q4_k_m | 92.63ms |
| wall | cpu+gpu (openblas -t 8, cublas -ngl 50) | Wizard-Vicuna-13B-Uncensored-GGML | q4_k_m | 33.96ms |
| wall | cpu+gpu (openblas -t 8, cublas -ngl 30) | llama2_7b_chat_uncensored | q4_k_m | 33.74ms |
| wall | cpu+gpu (openblas -t 8, cublas -ngl 50) | llama2_7b_chat_uncensored | q4_k_m | 19.30ms |
| wall | cpu?+gpu (openblas -t 8, cublas -ngl 1000) | llama2_7b_chat_uncensored | q4_k_m | 19.44ms |
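converting eval time per token into throughput is just 1000/ms. a quick sketch applying that to the endpoints of the table above (times copied from the rows; nothing else is measured here):

```python
def tok_per_s(ms_per_token: float) -> float:
    """tokens per second from eval time in milliseconds per token."""
    return 1000.0 / ms_per_token

# the 100ms/token target is exactly 10 tok/s
print(tok_per_s(100.0))             # 10.0
# best cpu-only run on wall (-t 8): just under the target
print(round(tok_per_s(122.55), 2))  # 8.16
# best gpu run on wall (-ngl 50): well past it
print(round(tok_per_s(19.30), 2))   # 51.81
```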

arae is the same as arae from llama inference numbers, except cublas now gets a

wall is the same as wall from llama inference numbers. no hardware changes for those runs

setting up cublas #

i hate the christ that is Python Packaging, as seen in my python packaging patterns

  1. install miniconda somewhere. get miniconda here. run the script
    1. i dont like doing conda init. activate it manually
    2. for fish, i run eval "$(/home/luna/tmp/miniconda3/bin/conda shell.fish hook)" manually when i want to enter conda
  2. conda create -n llamacpp
  3. conda activate llamacpp
  4. conda install -c 'nvidia/label/cuda-12.2.0' cuda-toolkit
    1. around 10gb disk space needed
  5. get llama.cpp
  6. edit its makefile's LDFLAGS to point to your conda environment (it assumes the CUDA toolkit is on /opt/cuda. we don't have it there because this method of installing CUDA is non-standard! but it's easily reproducible at least.)
    1. e.g. add -L/home/luna/tmp/miniconda3/envs/llamacpp/lib
  7. make LLAMA_CUBLAS=1
  8. wget 'https://huggingface.co/TheBloke/llama2_7b_chat_uncensored-GGML/resolve/main/llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin'
    1. other quantization methods may be preferable, like q5 if you have more ram??
  9. the compiled ./main will have successfully linked against the CUDA libraries in the conda path, but your dynamic linker does not know about them. that means directly running the binary fails with ./main: error while loading shared libraries: libcublas.so.12: cannot open shared object file: No such file or directory. use an LD_LIBRARY_PATH override to fix this
  10. the full run looks like this: env LD_LIBRARY_PATH=/home/luna/tmp/miniconda3/envs/llamacpp/lib ./main -m ./llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin --color -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 -t 7 -ngl 12 -p 'Write a paragraph about the hit game Among Us'
    1. -ngl is the number of model layers sent to the gpu. for my ~300ms/tok arae numbers i used -ngl 12; anything higher is an OOM (my gpu has 2gb vram as reported by nvidia-smi)
    2. those are bad parameters for actual inference. if you want to actually use this for anything, choose different parameters
      1. this might help you
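the steps above can be sketched as one script. paths (the miniconda prefix, home dir) are examples from this note, adjust for your machine; the makefile edit from step 6 still has to be done by hand:

```shell
#!/bin/sh
set -e

CONDA=/home/luna/tmp/miniconda3
ENVLIB="$CONDA/envs/llamacpp/lib"

# env + CUDA toolkit from the nvidia channel (~10gb of disk)
"$CONDA/bin/conda" create -y -n llamacpp
"$CONDA/bin/conda" install -y -n llamacpp -c 'nvidia/label/cuda-12.2.0' cuda-toolkit

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# step 6 goes here: hand-edit the makefile's LDFLAGS to add -L"$ENVLIB"
make LLAMA_CUBLAS=1

wget 'https://huggingface.co/TheBloke/llama2_7b_chat_uncensored-GGML/resolve/main/llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin'

# LD_LIBRARY_PATH override so the binary finds libcublas at runtime
env LD_LIBRARY_PATH="$ENVLIB" ./main \
  -m ./llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin \
  --color -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 \
  -t 7 -ngl 12 -p 'Write a paragraph about the hit game Among Us'
```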

setting up without cublas #

  1. get llama.cpp
  2. make. you need gcc and g++
  3. wget 'https://huggingface.co/TheBloke/llama2_7b_chat_uncensored-GGML/resolve/main/llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin'
  4. ./main -m ./llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin --color -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 -t 7 -p 'Write a paragraph about the hit game Among Us'
    1. same thing as with cuBLAS, but without -ngl and LD_LIBRARY_PATH
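the cpu-only steps collapse to a much shorter script (needs gcc, g++, make, and wget):

```shell
#!/bin/sh
set -e

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make   # plain build, no BLAS/CUDA flags

wget 'https://huggingface.co/TheBloke/llama2_7b_chat_uncensored-GGML/resolve/main/llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin'

./main -m ./llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin \
  --color -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 \
  -t 7 -p 'Write a paragraph about the hit game Among Us'
```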