llama2 inference numbers

#llama #large_language_models #technology #performance

%at=2023-07-25T01:18:10

remember, the gold (subjective) target is 10 tok/s, which is around 100ms per token (i think that's roughly what chatgpt runs at, though it might be more like 40 tok/s; i haven't measured it). going slower might be good enough for you though!!
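
if you think in tok/s instead, the two are just reciprocals (tok/s = 1000 / ms-per-token). a throwaway shell one-liner to convert the table numbers below; awk is just my choice here, nothing llama.cpp-specific:

```sh
# ms per token -> tok/s: 100 ms/token = 10 tok/s
# e.g. the 122.55ms wall row works out to ~8.2 tok/s
echo 122.55 | awk '{ printf "%.1f tok/s\n", 1000 / $1 }'
```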

all experiments were run with llama.cpp

| hostname | device | model | quantization | time per token (eval time) |
| --- | --- | --- | --- | --- |
| arae | cpu (openblas, -t 8) | llama2_7b_chat_uncensored | q4_k_m | 656.91ms |
| arae | cpu (openblas, -t 1) | llama2_7b_chat_uncensored | q4_k_m | 585.76ms |
| arae | cpu (openblas, -t 4) | llama2_7b_chat_uncensored | q4_k_m | 332.28ms |
| arae | cpu (openblas, -t 7) | llama2_7b_chat_uncensored | q4_k_m | 327.42ms |
| arae | cpu+gpu (native -t 7, cublas -ngl 12) | llama2_7b_chat_uncensored | q4_k_m | 304.32ms |
| arae | cpu+gpu (openblas -t 7, cublas -ngl 12) | llama2_7b_chat_uncensored | q4_k_m | 296.30ms |
| wall | cpu (openblas -t 12) | llama2_7b_chat_uncensored | q4_k_m | 148.85ms |
| wall | cpu (openblas -t 11) | llama2_7b_chat_uncensored | q4_k_m | 132.51ms |
| wall | cpu (native -t 7) | llama2_7b_chat_uncensored | q4_k_m | 127.32ms |
| wall | cpu (openblas -t 10) | llama2_7b_chat_uncensored | q4_k_m | 126.40ms |
| wall | cpu (openblas -t 7) | llama2_7b_chat_uncensored | q4_k_m | 125.30ms |
| wall | cpu (openblas -t 9) | llama2_7b_chat_uncensored | q4_k_m | 124.18ms |
| wall | cpu (openblas -t 8) | llama2_7b_chat_uncensored | q4_k_m | 122.55ms |
| wall | cpu+gpu (openblas -t 8, cublas -ngl 3) | llama2_7b_chat_uncensored | q4_k_m | 113.28ms |
| wall | cpu+gpu (openblas -t 8, cublas -ngl 10) | llama2_7b_chat_uncensored | q4_k_m | 92.63ms |
| wall | cpu+gpu (openblas -t 8, cublas -ngl 50) | Wizard-Vicuna-13B-Uncensored-GGML | q4_k_m | 33.96ms |
| wall | cpu+gpu (openblas -t 8, cublas -ngl 30) | llama2_7b_chat_uncensored | q4_k_m | 33.74ms |
| wall | cpu+gpu (openblas -t 8, cublas -ngl 50) | llama2_7b_chat_uncensored | q4_k_m | 19.30ms |
| wall | cpu?+gpu (openblas -t 8, cublas -ngl 1000) | llama2_7b_chat_uncensored | q4_k_m | 19.44ms |

arae is the same as arae from llama inference numbers, except cublas now gets a chance to run on it (see the setup below)

wall is the same as wall from llama inference numbers. no hardware changes for those runs
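
the "time per token (eval time)" column is the ms-per-token figure from the llama_print_timings block that llama.cpp prints when a run finishes. from memory it looks roughly like this (the numbers here are made up to match the 92.63ms wall row, and the exact layout shifts between versions):

```
llama_print_timings:      sample time = ...
llama_print_timings: prompt eval time = ...
llama_print_timings:        eval time = 11856.64 ms /   128 runs   (   92.63 ms per token)
```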

setting up cublas #

i hate the christ that is Python Packaging, as seen in my python packaging patterns

  1. install miniconda somewhere. get miniconda here. run the script
    1. i don't like doing conda init; activate it manually instead
    2. for fish, i run eval "$(/home/luna/tmp/miniconda3/bin/conda shell.fish hook)" manually when i want to enter conda
  2. conda create -n llamacpp
  3. conda activate llamacpp
  4. conda install -c 'nvidia/label/cuda-12.2.0' cuda-toolkit
    1. around 10gb disk space needed
  5. get llama.cpp
  6. edit its makefile's LDFLAGS to point to your conda environment (the makefile assumes the CUDA toolkit lives at /opt/cuda; we don't have that, because this method of installing CUDA is non-standard! but it's easily reproducible at least)
    1. e.g. add -L/home/luna/tmp/miniconda3/envs/llamacpp/lib
  7. make LLAMA_CUBLAS=1
  8. wget 'https://huggingface.co/TheBloke/llama2_7b_chat_uncensored-GGML/resolve/main/llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin'
    1. other quantization methods may be worth grabbing instead, like the q5 variants if you have more ram
  9. the compiled ./main will have successfully linked against the CUDA libraries in the conda path, but your dynamic linker doesn't know about them at runtime. that means directly running the binary fails with ./main: error while loading shared libraries: libcublas.so.12: cannot open shared object file: No such file or directory. use an LD_LIBRARY_PATH override to fix this
  10. a full run looks like this (the whole setup is also condensed into a script after this list): env LD_LIBRARY_PATH=/home/luna/tmp/miniconda3/envs/llamacpp/lib ./main -m ./llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin --color -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 -t 7 -ngl 12 -p 'Write a paragraph about the hit game Among Us'
    1. -ngl is the number of model layers offloaded to the gpu. for the ~300ms/tok arae numbers above i used -ngl 12; anything higher OOMs (that gpu only has 2gb vram, as reported by nvidia-smi)
    2. those are bad parameters for actual inference. if you want to actually use this for anything, choose different parameters
      1. this might help you
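
for convenience, here's the whole cublas setup condensed into one copy-pasteable sketch. the paths are from my machine, the llama.cpp github url is filled in by me (the steps above just say "get llama.cpp"), and the sed line is a rough stand-in for the manual makefile edit in step 6 (it appends to any line starting with LDFLAGS, so eyeball the result):

```sh
#!/bin/sh -e
CONDA=/home/luna/tmp/miniconda3
ENVLIB=$CONDA/envs/llamacpp/lib

# steps 2-4: env + cuda toolkit (~10gb of disk). no conda activate needed
# if we install straight into the env with -n
$CONDA/bin/conda create -y -n llamacpp
$CONDA/bin/conda install -y -n llamacpp -c 'nvidia/label/cuda-12.2.0' cuda-toolkit

# steps 5-7: get llama.cpp and point its linker flags at the conda env
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
sed -i '/^LDFLAGS/ s|$| -L'"$ENVLIB"'|' Makefile  # stand-in for the manual edit
make LLAMA_CUBLAS=1

# step 8: grab the model
wget 'https://huggingface.co/TheBloke/llama2_7b_chat_uncensored-GGML/resolve/main/llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin'

# steps 9-10: run with the conda libs visible to the dynamic linker
env LD_LIBRARY_PATH=$ENVLIB ./main \
  -m ./llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin \
  --color -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 \
  -t 7 -ngl 12 -p 'Write a paragraph about the hit game Among Us'
```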
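
to see what step 9 is on about, ldd shows whether the dynamic linker can resolve the cuda libs (output shape approximate, from memory):

```sh
$ ldd ./main | grep cublas
        libcublas.so.12 => not found
$ env LD_LIBRARY_PATH=/home/luna/tmp/miniconda3/envs/llamacpp/lib ldd ./main | grep cublas
        libcublas.so.12 => /home/luna/tmp/miniconda3/envs/llamacpp/lib/libcublas.so.12 (0x...)
```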

setting up without cublas #

  1. get llama.cpp
  2. make. you need gcc and g++
  3. wget 'https://huggingface.co/TheBloke/llama2_7b_chat_uncensored-GGML/resolve/main/llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin'
  4. ./main -m ./llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin --color -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 -t 7 -p 'Write a paragraph about the hit game Among Us'
    1. same thing as with cuBLAS, but without -ngl and LD_LIBRARY_PATH. if you want to reproduce the thread sweep from the table, see the sketch after this list
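
the wall cpu rows in the table came from sweeping -t; a dumb loop like this reproduces that (a sketch: -n 64 just caps generation length, and the grep leans on the timings format i sketched earlier, so it may need tweaking per version):

```sh
# sweep thread counts like the wall cpu rows above
for t in 1 4 7 8 9 10 11 12; do
  ./main -m ./llama2_7b_chat_uncensored.ggmlv3.q4_K_M.bin \
    -t "$t" -n 64 -p 'Write a paragraph about the hit game Among Us' \
    2>&1 >/dev/null \
    | grep -E 'llama_print_timings: +eval time' \
    | sed "s/^/-t $t /"   # tag each timing line with its thread count
done
```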