mixtral inference numbers

see the others:

i'm not running perplexity scores on these (that matters more for a quantization comparison chart than for performance parameter tweaking); i just want to find out how fast i can rip this. that said, the paragraphs about Among Us i'm getting are plausible, which suggests the model actually works.

#large_language_models #machine_learning #performance

%at=2023-12-11T18:50:02.068Z

system:

using the mixtral branch of llama.cpp. i run it like this:

$ ./main -m /.../models/mixtral-8x7b-v0.1.Q4_0.gguf -b 256 -n 100 -t 5 -p 'Among Us is'
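
to sweep the -t values in the table below without retyping, a small loop over the same command is enough. this is only a sketch under assumptions: the same elided model path as above, and the timing summary printing lines containing 'eval time' (which the grep keeps).

# sketch: run the same prompt at several thread counts and keep only the timing summary lines
for t in 2 5 8 10 12; do
  echo "== -t $t =="
  ./main -m /.../models/mixtral-8x7b-v0.1.Q4_0.gguf -b 256 -n 100 -t "$t" -p 'Among Us is' 2>&1 \
    | grep -i 'eval time'
done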

model quantizations from TheBloke here: https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF
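
if the file needs re-downloading, something like this should do it (sketch: assumes huggingface-cli from the huggingface_hub pip package, and the q4_0 filename as listed in that repo):

# sketch: fetch only the q4_0 gguf from TheBloke's repo into ./models
# (adjust --local-dir to wherever the models directory actually lives)
huggingface-cli download TheBloke/Mixtral-8x7B-v0.1-GGUF mixtral-8x7b-v0.1.Q4_0.gguf --local-dir ./models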

with pure gcc11 (no openblas):
system_info: n_threads = [...] / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

with openblas (see BLAS = 1):
system_info: n_threads = 2 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
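
for reference, the builds were along these lines. sketch only: LLAMA_OPENBLAS / LLAMA_CUBLAS are the Makefile switches llama.cpp used around this time, and the gcc-11 override is a guess at how the 'gcc11' builds were selected, so check the current README before copying.

# plain cpu build (the 'gcc11' rows; the compiler override is an assumption)
make clean && make CC=gcc-11 CXX=g++-11
# openblas build (this is what flips BLAS = 1 in system_info)
make clean && make LLAMA_OPENBLAS=1
# cuda build for the cpu+gpu rows that offload layers with -ngl
make clean && make LLAMA_CUBLAS=1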

(vram isn't applicable for the cpu-only runs; they sit at around 27GB of ram, which shows up as kernel page cache because llama.cpp mmaps the model file by default)
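
to watch that number yourself, the page cache figures are in /proc/meminfo on linux (sketch; run it in a second terminal while ./main is going):

# sketch: watch page cache and available ram while the model is being read/used
watch -n1 'grep -E "^(Cached|MemAvailable|MemFree):" /proc/meminfo'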

device                            quantization   eval time per token (lower is faster)   vram
cpu (gcc11, -t 2)                 q4_0           331ms                                   n/a
cpu (gcc11, -t 5)                 q4_0           209ms                                   n/a
cpu (gcc11, -t 8)                 q4_0           225ms                                   n/a
cpu (gcc11, -t 10)                q4_0           216ms                                   n/a
cpu (gcc11, -t 12)                q4_0           244ms                                   n/a
cpu (openblas, -t 2)              q4_0           331ms                                   n/a
cpu v2 (openblas, -t 2)           q4_0           283ms                                   n/a
cpu (openblas, -t 5)              q4_0           211ms                                   n/a
cpu (openblas, -t 8)              q4_0           222ms                                   n/a
cpu (openblas, -t 10)             q4_0           216ms                                   n/a
cpu v2 (openblas, -t 10)          q4_0           223ms                                   n/a
cpu v2 (openblas, -t 14)          q4_0           220ms                                   n/a
cpu v2 (openblas, -t 16)          q4_0           245ms                                   n/a
cpu (openblas, -t 12)             q4_0           246ms                                   n/a
cpu+gpu (gcc11, -t 10, -ngl 5)    q4_0           194ms                                   4302mib
cpu+gpu (gcc11, -t 10, -ngl 10)   q4_0           165ms                                   8312mib
cpu+gpu (gcc11, -t 10, -ngl 13)   q4_0           146ms                                   10718mib
cpu+gpu (gcc11, -t 10, -ngl 14)   q4_0           140ms                                   11520mib
cpu+gpu (gcc11, -t 10, -ngl 15)   q4_0           OOM                                     OOM
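
the -ngl rows above were found by hand; a sketch for automating the sweep is below. assumptions: a single nvidia gpu, the same elided model path, and that nvidia-smi's memory.used is close enough to what the table calls vram.

# sketch: step -ngl up, log each run, sample gpu memory mid-run, then pull the timing line
for ngl in 5 10 13 14 15; do
  echo "== -ngl $ngl =="
  ./main -m /.../models/mixtral-8x7b-v0.1.Q4_0.gguf -b 256 -n 100 -t 10 -ngl "$ngl" -p 'Among Us is' \
    >"ngl_$ngl.log" 2>&1 &
  pid=$!
  sleep 60   # rough guess: enough time to load the offloaded layers and start generating
  nvidia-smi --query-gpu=memory.used --format=csv,noheader
  wait "$pid"
  grep -i 'eval time' "ngl_$ngl.log"
done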