mixtral inference numbers
see the others:
i am not running perplexity scores on these (that matters more for a quantization comparison chart than for performance parameter tweaking); i just want to find out how fast i can rip this. that said, the paragraphs about Among Us it generates are plausible, which suggests the model actually works.
#large_language_models #machine_learning #performance
system:
- cpu: AMD Ryzen 4600G
- cpu v2: AMD Ryzen 7 5800X3D
- ram: 32GB, 2666MT/s
- gpu: RTX3060 12GB
using the mixtral branch of llama.cpp. i run it like this:

```
$ ./main -m /.../models/mixtral-8x7b-v0.1.Q4_0.gguf -b 256 -n 100 -t 5 -p 'Among Us is'
```
model quantizations from TheBloke here: https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF
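if you need the file, here is a sketch of how i'd grab just the Q4_0 gguf with huggingface-cli (assumes the huggingface_hub cli is installed; the filename is copied from the command above, double-check it against the repo):

```
# assumption: pip install -U huggingface_hub for the cli
huggingface-cli download TheBloke/Mixtral-8x7B-v0.1-GGUF \
  mixtral-8x7b-v0.1.Q4_0.gguf \
  --local-dir ./models
```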
with a plain gcc11 build (no openblas):

```
system_info: n_threads = [...] / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
```
with openblas (note BLAS = 1):

```
system_info: n_threads = 2 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
```
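for completeness, the builds are roughly this (a sketch, assuming the mixtral branch keeps the usual llama.cpp Makefile flags of that era; the cpu+gpu rows below additionally need a cuda-enabled build):

```
# plain build, no BLAS (BLAS = 0 in system_info)
# i build with gcc11; add CC=gcc-11 CXX=g++-11 if that isn't your default compiler
make clean && make -j

# openblas build (BLAS = 1 in system_info)
make clean && make -j LLAMA_OPENBLAS=1

# cublas build for the -ngl offload rows
make clean && make -j LLAMA_CUBLAS=1
```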
(vram isn't applicable for the cpu-only runs, but it's around 27GB of ram used in kernel pages)
device | quantization | eval time per token (lower is better) | vram used |
---|---|---|---|
cpu (gcc11, -t 2) | q4_0 | 331ms | n/a |
cpu (gcc11, -t 5) | q4_0 | 209ms | n/a |
cpu (gcc11, -t 8) | q4_0 | 225ms | n/a |
cpu (gcc11, -t 10) | q4_0 | 216ms | n/a |
cpu (gcc11, -t 12) | q4_0 | 244ms | n/a |
cpu (openblas, -t 2) | q4_0 | 331ms | n/a |
cpu (openblas, -t 5) | q4_0 | 211ms | n/a |
cpu (openblas, -t 8) | q4_0 | 222ms | n/a |
cpu (openblas, -t 10) | q4_0 | 216ms | n/a |
cpu (openblas, -t 12) | q4_0 | 246ms | n/a |
cpu v2 (openblas, -t 2) | q4_0 | 283ms | n/a |
cpu v2 (openblas, -t 10) | q4_0 | 223ms | n/a |
cpu v2 (openblas, -t 14) | q4_0 | 220ms | n/a |
cpu v2 (openblas, -t 16) | q4_0 | 245ms | n/a |
cpu+gpu (gcc11, -t 10, -ngl 5) | q4_0 | 194ms | 4302mib |
cpu+gpu (gcc11, -t 10, -ngl 10) | q4_0 | 165ms | 8312mib |
cpu+gpu (gcc11, -t 10, -ngl 13) | q4_0 | 146ms | 10718mib |
cpu+gpu (gcc11, -t 10, -ngl 14) | q4_0 | 140ms | 11520mib |
cpu+gpu (gcc11, -t 10, -ngl 15) | q4_0 | OOM | OOM |
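i collected the table by hand, but a rough sweep script like this would reproduce it (a sketch: it assumes ./main and the model path from above, and greps the 'eval time' line out of the llama_print_timings output, so it also catches the prompt eval line):

```
#!/usr/bin/env bash
# assumption: run from the llama.cpp dir, cublas build for the -ngl loop
MODEL=/.../models/mixtral-8x7b-v0.1.Q4_0.gguf

# cpu rows: sweep thread counts
for t in 2 5 8 10 12; do
  echo "== -t $t =="
  ./main -m "$MODEL" -b 256 -n 100 -t "$t" -p 'Among Us is' 2>&1 \
    | grep 'eval time'
done

# cpu+gpu rows: sweep offloaded layers at -t 10
for ngl in 5 10 13 14; do
  echo "== -ngl $ngl =="
  ./main -m "$MODEL" -b 256 -n 100 -t 10 -ngl "$ngl" -p 'Among Us is' 2>&1 \
    | grep 'eval time'
done
```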