llama3 inference numbers
#llama #large_language_models #technology #performance
timeline:
- started project
%at=2024-04-20T00:16:05.592Z - finished this version
%at=2024-05-19T22:20:38.907Z
to continue the tradition from llama inference numbers, llama2 inference numbers, and mixtral inference numbers, here are the llama3 numbers. this time i made a script, llamabench, to scale out to more machines and ensure my data is consistent across all of them
all benches running with llama.cpp commit 0e4802b2ecbaab04b4f829fde4a3096ca19c84b5 (2024-04-19)
model used: https://huggingface.co/NousResearch/Meta-Llama-3-8B-Instruct-GGUF/tree/main, at Q4_K_M quantization across all machines
reading recommendation #
there is a LOT of data. i recommend skipping to the findings, which come with pretty graphs.
inference hardware #
many thanks to the people who provided hardware to run the benchmarking script:
- the akaris': switchblade, arctic-rose, sidekick, apparition, obstinate-serenity
- utsuhorocks': fusion
- cynthia's: DESKTOP-T0IEVSO
- ena's: steamdeck
hostname | cpu | ram | gpu |
---|---|---|---|
wall | AMD Ryzen 7 5800X3D | 32GB (4x8) DDR4-2666 | NVIDIA GALAX RTX3060 12GB |
elpis | AMD Ryzen 5 4600G | 16GB (1x16) DDR4-2666 | AMD MAXSUN RX580 2048SP |
switchblade (steam deck lcd) | Zen 2 4c/8t, 2.4-3.5GHz (up to 448 GFlops FP32) | 16 GB LPDDR5 on-board RAM (5500 MT/s quad 32-bit channels) | N/A |
steamdeck (also an lcd edition) | same as switchblade | same as switchblade | N/A |
chlorine (steam deck oled) | Zen 2 4c/8t, 2.4-3.5GHz (up to 448 GFlops FP32) | 16 GB LPDDR5 on-board RAM (6400 MT/s quad 32-bit channels) | N/A |
arctic-rose | Intel Xeon E5-2667 v2 | 32GB (4x8) DDR3-1866 ECC | NVIDIA MSI RTX2060 12GB |
fusion | Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz | 32GB (4x8) DDR3-1600 | N/A |
DESKTOP-T0IEVSO | AMD Ryzen 7 3700X | 32GB (2x16) DDR4-1800 | N/A |
sidekick | Intel Core i5-2400 | 16GB (4x4) DDR3-1600 | N/A |
apparition | Intel Xeon E5345 (x2) | 12GB ??? DDR2-??? (need to open it up and check) | N/A |
obstinate-serenity | Intel(R) Core(TM) i5-4200H CPU @ 2.80GHz | 8GB (2x4) DDR3L-1600 | N/A |
reticent-iris | Intel Core i7-4770 | 12GB (3x4) DDR3-1600 | N/A |
ubuntu-8gb-hel1-1 | Hetzner CX31 (2 "shared" vcpu) | 8GB RAM (QEMU) | N/A |
ubuntu-8gb-hel1-2 | Hetzner CCX13 (2 "dedicated" vcpu) | 8GB RAM (QEMU) | N/A |
scw-beautiful-hertz | Scaleway DEV1-L (4vcpu) | 8GB RAM (QEMU) | N/A |
inference software #
hostname | distro | version | kernel version | gcc | nvidia driver | openblas |
---|---|---|---|---|---|---|
wall | void linux | n/a (rolling release) | 6.6.22_1 | 11.2 | 550.67 | 0.3.27 |
elpis | void linux | n/a (rolling release) | 6.6.22_1 | 13.2 | n/a | 0.3.26 |
switchblade | steamos | beta 3.5.17 + gentoo prefix | 6.1.52-valve16-1-neptune-61 | 13.2.1 | n/a | 0.3.25 |
chlorine | steamos | 3.5.19 + gentoo prefix | 6.1.52-valve16-1-neptune-61 | 13.2.1 | n/a | 0.3.25 |
steamdeck | steamos | 3.5.7 | 6.1.52-valve9-1-neptune-61 | 13.1.1 | n/a | 0.3.23 |
arctic-rose (unstable sw atm) | gentoo | n/a (rolling) | 6.6.28-gentoo-myriad | 13.2.1 | 550.78 | 0.3.25 |
fusion | debian | 11 | 5.10.0-20-amd64 | 10.2.1 | n/a | 0.3.13 |
DESKTOP-T0IEVSO | windows | 10 22H2 + MSYS2 2024-01-13 | n/a | 13.2.0 | (amd) 24.3.1 | 0.3.27 |
ubuntu-8gb-hel1-1 | ubuntu | 24.04 noble | 6.8.0-31-generic | 13.2.0 | n/a | 0.3.26 |
ubuntu-8gb-hel1-2 | ubuntu | 24.04 noble | 6.8.0-31-generic | 13.2.0 | n/a | 0.3.26 |
scw-beautiful-hertz | ubuntu | 22.04 jammy | 5.15.0-102-generic | 11.4.0 | n/a | 0.3.20 |
notes #
- amd gpu devices are running with the amdgpu kernel driver
- some machines have `-v2` on them. some data had to be redone for those machines due to methodology fixes or interesting software updates
the graphs #
cpu graphs #
the "clean" style means it ran without any BLAS libraries, just raw gcc
there are various categories of compute we will look into with later graphs
openblas #
the first interesting comparison we can make is how much openblas actually contributes to a system's inference speed, and in general the results are somewhat inconclusive: +2t/s at best.
avx #
one question we had while going through the data is what avx does for inference speed. while we don't have two systems of comparable specs, one with avx and one without, we have a few that are "close enough":
- apparition:
n_threads = 1 / 8 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
- arctic-rose:
n_threads = 1 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
- reticent-iris:
n_threads = 1 / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
primary differences between apparition and arctic-rose:
- ram: from DDR2-667 to DDR3-1866 (by raw numbers, roughly a 2.8x improvement in ram transfer rate)
- cpu: Xeon E5345 -> Xeon E5-2667
- 65 nm -> 32 nm
- Q1'07 -> Q1'12 (5 years difference)
- 2.33 GHz -> 2.90 GHz
- 8 MB L2 Cache -> 15 MB Intel® Smart Cache
- whatever "smart" is
- 80 W TDP -> 130 W TDP
while this comparison can't cleanly isolate avx vs. no avx, those changes to the system bring a massive improvement to inference speed (from basically 0.5t/s up to 10t/s).
steam deck lcd vs oled #
it is known that the deck oled's upgraded SoC can bring up to a 10FPS uplift depending on the game, as shown by GamersNexus' data
what about inference speed? actually not much. (chlorine is an oled model, the others are lcd)
future work should look into running the vulkan backend on those devices. we were unable to get all the sdks working right to build llama.cpp with it.
(note: steamdeck was likely not running with the cpu scaling governor set to performance, which is probably why its numbers don't align as well as those of switchblade, which is also a steam deck lcd)
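as an aside, here's a small python sketch for checking what governor each core is actually on. this is not part of llamabench, just an illustration, and it assumes the usual linux cpufreq sysfs layout:

```python
# print the scaling governor for every core via the cpufreq sysfs interface
# (only works on linux machines that expose /sys/devices/system/cpu/*/cpufreq)
from pathlib import Path

for gov in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*/cpufreq/scaling_governor")):
    cpu = gov.parent.parent.name  # e.g. "cpu0"
    print(cpu, gov.read_text().strip())
```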
what about vps'es? #
the primary point in ggml's manifesto reads:
Inference at the edge
well, what about the cheapest edge we have? random vps'es from some of the cheap cloud providers?
at most, 5t/s, with the dedicated hetzner instance getting the best per-thread speed (of course, since it has dedicated vcpus, though i'm not sure if they are actually pinned to real hardware cores rather than shared. it looks like it from the data, at least)
also be aware that the cpu scaling governors are not set to performance: those cpus are QEMU virtualized, so i am unable to set them.
gpu graphs #
for gpu benchmarks, the `-t` and `-ngl` parameter spaces were greatly reduced so that benchmarks don't take days to finish. because of that, their resolution is decreased
these are shown as 5 graphs at different values of `-ngl`, from 0 to 33 (max). only two systems have nvidia gpus, so only those two have both vulkan and cuda comparisons.
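for reference, here is a minimal sketch of what such a sweep could look like. this is not the actual llamabench script; the model path, prompt, and grid values are placeholders:

```python
# minimal sketch of a -t / -ngl sweep over llama.cpp's main binary
# (not the real llamabench script; paths, prompt, and grid values are placeholders)
import itertools
import subprocess

MODEL = "Meta-Llama-3-8B-Instruct-Q4_K_M.gguf"  # placeholder path
THREADS = [4, 8]          # reduced -t space
NGL = [0, 9, 17, 25, 33]  # reduced -ngl space, 33 = all layers offloaded

for t, ngl in itertools.product(THREADS, NGL):
    subprocess.run(
        ["./main", "-m", MODEL, "-p", "hello there", "-n", "128",
         "-t", str(t), "-ngl", str(ngl)],
        check=True,
    )
```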
by the shown data, the llama.cpp cuda backend can provide absurd amounts of speed compared to vulkan, even though they perform almost the same at `-ngl 0`. heavily consider it, even if finding a way to build it is a mess (hehe my python packaging patterns)
and elpis's RX580-2048SP-that-came-from-aliexpress-which-is-probably-downclocked can at best pull off 20t/s, which is somewhat surprising considering it's a 600BRL card (wall's RTX3060 pulls approx. 35t/s on vulkan, and that card goes for around 2000BRL)
link to the raw data #
structured as tsv because of the `raw_timings` column
the `raw_timings` column is json: a list of 5 lists (each benchmark setting is run 5 times for averaging), each containing 3 elements extracted from llama.cpp output:
- "sample ms per token"
- "prompt eval ms per token"
- "eval ms per token"
i don't actually know what those mean exactly, so i took the number that sounded the most correct (eval ms per token) and did `1000/eval_ms_per_token` to derive the tokens/second speed.
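as a concrete sketch of that derivation (only the `raw_timings` column is described above; the hostname column name is a guess, and the averaging step is my own choice):

```python
# sketch of deriving tokens/second from the raw_timings column of the tsv
import csv
import json
import statistics

with open("llama3_luna_data.tsv", newline="") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        # raw_timings: 5 runs x [sample ms/token, prompt eval ms/token, eval ms/token]
        runs = json.loads(row["raw_timings"])
        eval_ms = statistics.mean(run[2] for run in runs)  # eval ms per token, averaged
        print(row.get("hostname", "?"), 1000 / eval_ms, "t/s")
```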
if that's incorrect, feel free to suggest something and i'll rebuild the data (i don't think the findings would change that much from it, but it'd be good to have it correct for a possible llama4 release... you'll do it, right zuck? you'll release it to the mortals below you?)
https://raw.githubusercontent.com/lun-4/llamabench/mistress/llama3_luna_data.tsv