mixtral inference numbers
see the others:
i am not running perplexity scores on these (that matters more for a quantization comparison chart than for performance parameter tweaking); i just want to find out how fast i can rip this. that said, the paragraphs about Among Us it generates are plausible, which suggests the model actually works.
#large_language_models #machine_learning #performance
system:
- cpu: AMD Ryzen 4600G
- cpu v2: AMD Ryzen 7 5800X3D
- ram: 32GB, 2666MT/s
- gpu: RTX3060 12GB
using the mixtral branch of llama.cpp. i run it like this:

```
$ ./main -m /.../models/mixtral-8x7b-v0.1.Q4_0.gguf -b 256 -n 100 -t 5 -p 'Among Us is'
```
model quantizations from TheBloke here: https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF
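if you need the file, here is a sketch of how i'd grab just the Q4_0 gguf with huggingface-cli (assumes the huggingface_hub cli is installed; the filename is copied from the command above, double-check it against the repo):

```
# assumption: pip install -U huggingface_hub for the cli
huggingface-cli download TheBloke/Mixtral-8x7B-v0.1-GGUF \
  mixtral-8x7b-v0.1.Q4_0.gguf \
  --local-dir ./models
```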
with a plain gcc11 build (no openblas):

```
system_info: n_threads = [...] / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
```
with openblas (note BLAS = 1):

```
system_info: n_threads = 2 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
```
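for completeness, the builds are roughly this (a sketch, assuming the mixtral branch keeps the usual llama.cpp Makefile flags of that era; the cpu+gpu rows below additionally need a cuda-enabled build):

```
# plain build, no BLAS (BLAS = 0 in system_info)
# i build with gcc11; add CC=gcc-11 CXX=g++-11 if that isn't your default compiler
make clean && make -j

# openblas build (BLAS = 1 in system_info)
make clean && make -j LLAMA_OPENBLAS=1

# cublas build for the -ngl offload rows
make clean && make -j LLAMA_CUBLAS=1
```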
(vram isn't applicable for the cpu-only runs, but it's around 27GB of ram used in kernel pages)
device | quantization | eval time per token (lower is better) | vram used |
---|---|---|---|
cpu (gcc11, -t 2) | q4_0 | 331ms | n/a |
cpu (gcc11, -t 5) | q4_0 | 209ms | n/a |
cpu (gcc11, -t 8) | q4_0 | 225ms | n/a |
cpu (gcc11, -t 10) | q4_0 | 216ms | n/a |
cpu (gcc11, -t 12) | q4_0 | 244ms | n/a |
cpu (openblas, -t 2) | q4_0 | 331ms | n/a |
cpu (openblas, -t 5) | q4_0 | 211ms | n/a |
cpu (openblas, -t 8) | q4_0 | 222ms | n/a |
cpu (openblas, -t 10) | q4_0 | 216ms | n/a |
cpu (openblas, -t 12) | q4_0 | 246ms | n/a |
cpu v2 (openblas, -t 2) | q4_0 | 283ms | n/a |
cpu v2 (openblas, -t 10) | q4_0 | 223ms | n/a |
cpu v2 (openblas, -t 14) | q4_0 | 220ms | n/a |
cpu v2 (openblas, -t 16) | q4_0 | 245ms | n/a |
cpu+gpu (gcc11, -t 10, -ngl 5) | q4_0 | 194ms | 4302mib |
cpu+gpu (gcc11, -t 10, -ngl 10) | q4_0 | 165ms | 8312mib |
cpu+gpu (gcc11, -t 10, -ngl 13) | q4_0 | 146ms | 10718mib |
cpu+gpu (gcc11, -t 10, -ngl 14) | q4_0 | 140ms | 11520mib |
cpu+gpu (gcc11, -t 10, -ngl 15) | q4_0 | OOM | OOM |
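i collected the table by hand, but a rough sweep script like this would reproduce it (a sketch: it assumes ./main and the model path from above, and greps the 'eval time' line out of the llama_print_timings output, so it also catches the prompt eval line):

```
#!/usr/bin/env bash
# assumption: run from the llama.cpp dir, cublas build for the -ngl loop
MODEL=/.../models/mixtral-8x7b-v0.1.Q4_0.gguf

# cpu rows: sweep thread counts
for t in 2 5 8 10 12; do
  echo "== -t $t =="
  ./main -m "$MODEL" -b 256 -n 100 -t "$t" -p 'Among Us is' 2>&1 \
    | grep 'eval time'
done

# cpu+gpu rows: sweep offloaded layers at -t 10
for ngl in 5 10 13 14; do
  echo "== -ngl $ngl =="
  ./main -m "$MODEL" -b 256 -n 100 -t 10 -ngl "$ngl" -p 'Among Us is' 2>&1 \
    | grep 'eval time'
done
```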