Flash-Moe: Running a 397B Parameter Model on a Mac with 48GB RAM

(github.com)

188 points | by mft_ 5 hours ago

27 comments

tarruda 2 hours ago

Note that this is not the only way to run Qwen 3.5 397B on consumer devices: there are excellent ~2.5 BPW quants available that make it viable for 128G devices.

I've had great success (~20 t/s) running it on a M1 Ultra with room for 256k context. Here are some lm-evaluation-harness results I ran against it:

    mmlu: 87.86%

    gpqa diamond: 82.32%

    gsm8k: 86.43%

    ifeval: 75.90%

More details of my experience:

- https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discu...

- https://gist.github.com/simonw/67c754bbc0bc609a6caedee16fef8...

Overall an excellent model to have for offline inference.

Aurornis 2 hours ago
The method in this link is already using a 2-bit quant. They also reduced the number of experts per token from 10 to 4 which is another layer of quality degradation.
In my experience the 2-bit quants can produce output to short prompts that makes sense but they aren’t useful for doing work with longer sessions.
This project couldn’t even get useful JSON out of the model because it can’t produce the right token for quotes:
> *2-bit quantization produces \name\ instead of "name" in JSON output, making tool calling unreliable.
- tarruda 2 hours ago
  I can't say anything about the OP method, but I already tested the smol-IQ2_XS quant (which has 2.46 BPW) with the pi harness. I did not do a very long session because token generation and prompt processing get very slow, but I think I worked up to ~70k context and it maintained a lot of coherence in the session. IIRC the GPQA diamond is supposed to exercise long chains of thought and it scored exceptionally well with 82% (the original BF16 official number is 88%: https://huggingface.co/Qwen/Qwen3.5-397B-A17B).
  Note that not all quants are the same at a certain BPW. The smol-IQ2_XS quant I linked is pretty dynamic, with some tensors having q8_0 type, some q6_k and some q4_k (while the majority is iq2_xs). In my testing, this smol-IQ2_XS quant is the best available at this BPW range.
  Eventually I might try a more practical eval such as terminal bench.
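As a rough sanity check on the 2.46 BPW figure, the average bits per weight can be recovered from a file size and a parameter count. The sketch below plugs in the 113.41 GiB and 396.35 B values that llama-bench reports for this quant further down the thread; the function name is made up for illustration.

```python
GIB = 1024 ** 3  # llama-bench reports sizes in GiB

def bits_per_weight(size_gib: float, params_billions: float) -> float:
    """Average number of bits stored per parameter in a quantized model file."""
    total_bits = size_gib * GIB * 8
    return total_bits / (params_billions * 1e9)

# Values copied from the llama-bench table in this thread.
bpw = bits_per_weight(113.41, 396.35)
print(f"{bpw:.2f} bits per weight")
```

Because the quant mixes tensor types, this is only an average over the whole file, not the width of any single tensor.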
  - Aurornis 1 hour ago
    > I did not do a very long session
    This is always the problem with the 2-bit and even 3-bit quants: They look promising in short sessions but then you try to do real work and realize they’re a waste of time.
    Running a smaller dense model like 27B produces better results than 2-bit quants of larger models in my experience.
    - singpolyma3 24 minutes ago
      Lots of people seem to use 4bit. Do you think that's worth it vs a smaller model in some cases?
      - hnfong 3 minutes ago
        Generally the perplexity charts indicate that quality drops significantly below 4-bit, so in that sense 4-bit is the sweet spot if you're resource constrained.
- simonw 1 hour ago
  The project doesn't just use 2-bit - that was one of the formats they tried, but when that didn't give good tool calls they switched to 4-bit.

arjie 41 minutes ago

What's the tok/s you get these days? Does it actually work well when you use more of that context?

By the way, it's been a long time since I last saw your username. You're the guy who launched Neovim! Boy what a success. Definitely the Kickstarter/Bountysource I've been a tiny part of that had the best outcome. I use it every day.

tarruda 7 minutes ago

> What's the tok/s you get these days?

I ran llama-bench a couple of weeks ago when there was a big speed improvement on llama.cpp (https://github.com/ggml-org/llama.cpp/pull/20361#issuecommen...):

    % llama-bench -m ~/ml-models/huggingface/ubergarm/Qwen3.5-397B-A17B-GGUF/smol-IQ2_XS/Qwen3.5-397B-A17B-smol-IQ2_XS-00001-of-00004.gguf -fa 1 -t 1 -ngl 99 -b 2048 -ub 2048 -d 0,10000,20000,30000,40000,50000,60000,70000,80000,90000,100000,150000,200000,250000
    ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
    ggml_metal_library_init: using embedded metal library
    ggml_metal_library_init: loaded in 0.008 sec
    ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
    ggml_metal_device_init: GPU name:   MTL0
    ggml_metal_device_init: GPU family: MTLGPUFamilyApple7  (1007)
    ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
    ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
    ggml_metal_device_init: simdgroup reduction   = true
    ggml_metal_device_init: simdgroup matrix mul. = true
    ggml_metal_device_init: has unified memory    = true
    ggml_metal_device_init: has bfloat            = true
    ggml_metal_device_init: has tensor            = false
    ggml_metal_device_init: use residency sets    = true
    ggml_metal_device_init: use shared buffers    = true
    ggml_metal_device_init: recommendedMaxWorkingSetSize  = 134217.73 MB
    | model                          |       size |     params | backend    | threads | n_ubatch | fa |            test |                  t/s |
    | ------------------------------ | ---------: | ---------: | ---------- | ------: | -------: | -: | --------------: | -------------------: |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |           pp512 |        189.67 ± 1.98 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |           tg128 |         19.98 ± 0.01 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d10000 |        168.92 ± 0.55 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d10000 |         18.93 ± 0.02 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d20000 |        152.42 ± 0.22 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d20000 |         17.87 ± 0.01 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d30000 |        139.37 ± 0.28 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d30000 |         17.12 ± 0.01 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d40000 |        128.38 ± 0.33 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d40000 |         16.38 ± 0.00 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d50000 |        118.07 ± 0.55 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d50000 |         15.66 ± 0.00 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d60000 |        108.44 ± 0.38 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d60000 |         14.98 ± 0.01 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d70000 |         98.85 ± 0.18 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d70000 |         14.36 ± 0.00 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d80000 |         91.39 ± 0.49 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d80000 |         13.84 ± 0.00 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d90000 |         85.76 ± 0.24 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d90000 |         13.30 ± 0.00 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | pp512 @ d100000 |         80.19 ± 0.83 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | tg128 @ d100000 |         12.82 ± 0.00 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | pp512 @ d150000 |         54.46 ± 0.33 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | tg128 @ d150000 |         10.17 ± 0.09 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | pp512 @ d200000 |         47.05 ± 0.15 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | tg128 @ d200000 |          9.04 ± 0.02 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | pp512 @ d250000 |         40.71 ± 0.26 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | tg128 @ d250000 |          8.01 ± 0.02 |

    build: d28961d81 (8299)

So it starts at 20 tps tg and 190 tps pp with empty context and ends at 8 tps tg and 40 tps pp with 250k prefill.

I suspect that there are still a lot of optimizations to be implemented for Qwen 3.5 on llama.cpp; I wouldn't be surprised to reach 25 tps in a few months.

> You're the guy who launched Neovim!

That's me ;D

> I use it every day.

So do I for the past 12 years! Though I admit in the past year I greatly reduced the amount of code I write by hand :/

arjie 6 minutes ago
That's surprisingly fast. Thanks for sharing.

outlog 1 hour ago
What is the power usage? Maybe https://www.coconut-flavour.com/coconutbattery/ can give you an estimate?
- tarruda 44 minutes ago
  I don't think I've ever seen the M1 Ultra GPU exceed 80 W in asitop.

Aurornis 2 hours ago
Reading the details, he is using 2-bit quantization and reduced the number of experts per token from 10 down to 4 to get 5 tokens/sec. Cool proof of concept but it’s far from the quality and performance of the 397B model as normally used. Dropping the number of experts is particularly misleading.
This is some interesting work, but applying such extreme measures to LLMs to get them to run severely degrades quality. I know he claims negligible quality loss, but in my experience 2-bit quantizations are completely useless for real work. You can get them to respond to prompts, but they lose their intelligence and will go around in circles.
He also shows 5-6 tokens per second. Again that’s impressive for a large model on limited hardware but it’s very slow. Between the severely degraded model abilities and the extremely slow output the 397B result should be considered an attempt at proving something can technically run, not evidence that it can run well and produce output you’d expect from a 397B model.
He even mentions the obvious problems with his changes:
> *2-bit quantization produces \name\ instead of "name" in JSON output, making tool calling unreliable.
So right out of the gate this isn’t useful if you want to do anything with it. He could have tried smaller models or less aggressive quantization to get actually useful output from the model, but it wouldn’t look as impressive. It’s honestly getting kind of exhausting to read all of these AI-coded (admitted in the link) and AI-written papers made more for resume building. It would have been interesting to see this work applied to running a useful model that hadn’t been lobotomized instead of applying tricks to get an impressive headline but useless output.
homarp 4 hours ago
/r/localllama discussion: https://old.reddit.com/r/LocalLLaMA/comments/1rxmmu5/running...
mannyv 17 minutes ago
Everyone is focused on the bad 2 bit result but who cares? He says don’t use it because it’s bad.
justacatbot 1 hour ago
The quality degradation at 2-bit is a real issue. For actual work tasks, a well-tuned 30B at 4-bit usually outperforms a 70B+ at 2-bit in my experience. The expert reduction on top of that compounds things - you're essentially running a fairly different model. Still interesting to see the upper bound of what consumer hardware can attempt, even if the result isn't production-ready.
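To make the expert-reduction point concrete, here is a minimal sketch of top-k routing in a mixture-of-experts layer. It assumes a plain softmax over the selected experts' scores; the actual Qwen router may normalize differently, and the scores below are invented for illustration.

```python
import math

def route_top_k(logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def kept_mass(logits, k_full, k_reduced):
    """Share of the k_full routing weight that was carried by the k_reduced experts."""
    full = route_top_k(logits, k_full)
    reduced = route_top_k(logits, k_reduced)
    return sum(full[i] for i in reduced)

# Hypothetical router scores for 16 experts (not taken from any real model).
logits = [2.0, 1.5, 1.2, 1.0, 0.9, 0.8, 0.7, 0.6,
          0.5, 0.4, 0.1, 0.0, -0.2, -0.5, -1.0, -2.0]
print(f"K=4 keeps {kept_mass(logits, 10, 4):.0%} of the K=10 routing weight")
```

Whatever the exact normalization, the experts ranked 5 through 10 contribute nothing at K=4, so the layer output differs from what the model saw during training.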
zozbot234 3 hours ago
The github page mentions that a naïve mmap approach is bottlenecked by per-page overhead. Can this be mitigated by setting up explicit "huge" pages? (2M using the CONT PTE feature if the "native" page size is 16k; 32M using a PMD level block mapping; or 1G using the CONT PMD feature.) Does macOS support this out of the box? Alternatively, one might use a simple mmap and then something like posix_fadvise to set up prefetching of the data.
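The last idea (a simple mmap plus a prefetch hint) can be sketched as follows. This is an illustration only, not the project's code: the function name is hypothetical, and the `os.posix_fadvise` call is guarded because not every platform exposes it, which is part of the question being asked here.

```python
import mmap
import os
import tempfile

def read_with_prefetch(path, offset, length):
    """Hint the kernel to prefetch a byte range, then read it through mmap."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Advisory only: the kernel may ignore it, and some platforms lack the call.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]  # slicing copies the bytes out
    finally:
        os.close(fd)

# Demo on a small temporary file standing in for a weights file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 64)
    demo_path = tmp.name
try:
    chunk = read_with_prefetch(demo_path, 4096, 1024)
    print(len(chunk))
finally:
    os.remove(demo_path)
```

The hint does not change page size or per-page fault overhead; it only lets the kernel start the read earlier.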
mkw 1 hour ago
TLDR I took a stab at leveraging Dan's work and making it more practical:
https://github.com/matt-k-wong/mlx-flash
2 bit quantization lobotomizes the model but is impressive nonetheless! Maybe one day we'll be able to have intelligent 2 bit quants... I wonder.
My version supports 4-bit quantization, hybrid streaming (disk + RAM), and arbitrary model compatibility; it is tested on Mamba2 and sets up the framework for LM Studio integration.
I leveraged this work (credit to Danveloper) and am in the middle of making it work on more practical models and quants. It still uses flash streaming, but with a control knob so you can choose how much RAM to use. In the most extreme case it uses as little RAM as possible but is very slow; in the balanced case you use some RAM and it's much faster.
I designed it around the intelligence-dense Nemotron 3 Nano 30B and Nemotron Cascade 2 30B models (which are smaller, with higher intelligence density); it can run on low-end 16GB machines, though you can run arbitrarily large models on larger machines (designed for very low end, but capable of high end).
JSR_FDED 4 hours ago
This is a very impressive result. If I understand correctly the bottleneck is the SSD in this architecture - the author seems to get almost 15GB/s - but I seem to remember the max b/w was about 8GB/s. What am I missing?
- Roxxik 3 hours ago
  IO is very bursty in these setups. When the router results are in you can start loading experts from SSD. In this brief moment the SSD is saturated.
  Outside of that the SSD is idling.
  Table 3 shows for K=4 experts an IO of 943 MB/Tok at 3.15 Tok/s giving an average IO of 2970 MB/s far below what the SSD could do.
  I'm not sure, but not all expert weights are used immediately. Maybe they could do async reads for the down tensors parallelizing compute with IO.
  Not sure if this works on Mac, I only tested my larger than RAM setup on Linux with io_uring O_DIRECT reads and I saw that about 20% of total reads do finish while my fused upgate matmul is already running.
  Edit: Typos
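The average-IO arithmetic above can be written out as a small sketch. The 943 MB/token and 3.15 tok/s figures are the ones quoted from Table 3; the 15 GB/s peak is the "almost 15GB/s" mentioned in the parent comment and is treated here as an assumption, as are the function names.

```python
def average_io_mb_per_s(mb_per_token, tokens_per_s):
    """Sustained read rate implied by per-token IO and generation speed."""
    return mb_per_token * tokens_per_s

def ssd_busy_fraction(average_mb_per_s, peak_mb_per_s):
    """Rough share of time the SSD would spend at peak rate to move that data."""
    return average_mb_per_s / peak_mb_per_s

avg = average_io_mb_per_s(943, 3.15)
busy = ssd_busy_fraction(avg, 15_000)  # assumed peak of about 15 GB/s
print(f"average IO: {avg:.0f} MB/s, SSD busy roughly {busy:.0%} of the time")
```

A low busy fraction is consistent with bursty IO: the SSD is saturated briefly after each routing decision and idle otherwise.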
  - zozbot234 3 hours ago
    The github page mentions that you can't overlap SSD traffic and GPU compute on Apple Silicon, you get heavy contention for the shared hardware resources.
- Aurornis 2 hours ago
  PCIe 5 doubles the maximum throughput. That’s why the numbers for newer SSDs are about double what you recall for the old maximum.
- rado 4 hours ago
  MacBook Pro M5 Pro and M5 Max have such SSD speed
  - selimthegrim 3 hours ago
    I have an MBP M4 Pro and a WD Black SN850x in an external TB5 enclosure and I easily get 6-7 GB/s
bertili 4 hours ago
Very impressive! I wonder if there is a similar path for Linux using system memory instead of SSD? Hell, maybe even a case for the return of some kind of ROMs of weights?
- daemonologist 2 hours ago
  Most definitely - the popular engines have extensive support for doing this and controlling exactly which weights end up where (llama.cpp: https://github.com/ggml-org/llama.cpp/blob/master/tools/cli/... , vllm: https://docs.vllm.ai/en/stable/configuration/engine_args/#of... , sglang (haven't tried this): https://docs.sglang.io/advanced_features/server_arguments.ht...).
  Even with a MoE model, which has to move a relatively small portion of the weights around, you do end up quite bandwidth constrained though.
- zozbot234 3 hours ago
  Loading experts to system memory is supported by most local-AI frameworks. But you do not gain much by running that part of the decode on GPU, since decode is not compute-limited and the CPU-GPU transfer involves overhead. It's best to use the GPU for speeding up the shared part of the model.
- Aurornis 2 hours ago
  Using system memory and CPU compute for some of the layers that don’t fit into GPU memory is already supported by common tools.
  It’s workable for mixture of experts models but the performance falls off a cliff as soon as the model overflows out of the GPU and into system RAM. There is another performance cliff when the model has to be fetched from disk on every pass.
  - zozbot234 2 hours ago
    It's less of a "performance falls off a cliff" problem and more of a "once you offload to RAM/storage, your bottleneck is the RAM/storage and basically everything else no longer matters". This means if you know you're going to be relying on heavy offload, you stop optimizing for e.g. lots of VRAM and GPU compute since that doesn't matter. That saves resources that you can use for scaling out.
- K0balt 3 hours ago
  My thoughts exactly. Something like this could make it so that modest GPU capacity, like a pair of 3090s , and lots of RAM could make big inference more practical for personal labs
haomingkoo 1 hour ago
Really interesting approach. Curious how the 2-bit quantization affects the model's reasoning ability on longer chains of thought vs shorter prompts. The benchmarks look solid but real-world usage seems like a different story based on the comments here.
m-hodges 2 hours ago
As frontier models get closer and closer to consumer hardware, what’s the most for the API-driven $trillion labs?
- OJFord 2 hours ago
  Assuming 'moat' – they'll push the frontier forward; they don't really have to worry until progress levels off.
  At that point, I suppose there's still paid harnesses (people have always paid for IDEs despite FOSS options) partly for mindshare, and they could use expertise & compute capacity to provide application-specific training for enterprises that need it.
- stri8ted 2 hours ago
  48 GB is not consumer hardware. But fundamentally, there are economies of scale due to batching, power distribution, better utilization, etc., which means data center tokens will be cheaper. Also, as the cost of training (frontier) models increases, it's not clear the Chinese companies will continue open sourcing them. Notice for example, that Qwen-Max is not open source.
  - zozbot234 2 hours ago
    Nothing obviously prevents using this approach, e.g. for 3B-active or 10B-active models, which do run on consumer hardware. I'd love to see how the 3B performs with this on the MacBook Neo, for example. More relevantly, data-center scale tokens are only cheaper for the specific type of tokens data centers sell. If you're willing to wait long enough for your inferences (and your overall volume is low enough that you can afford this) you can use approaches like OP's (offloading read-only data to storage) to handle inference on low-performing, slow "edge" devices.
- BoredomIsFun 1 hour ago
  > the API-driven $trillion labs?
  here we go: https://huggingface.co/collections/trillionlabs/tri-series
maxloh 2 hours ago
Can you add a license to the repo? Legally we couldn't run any code without a license attached to it.
- Wowfunhappy 1 hour ago
  ...you can't redistribute code without a license, but surely you can legally run it, can't you?
  Like, if I write a blog post and put it on my blog, you're allowed to read it, right?
  Heck, if my blog contains some Javascript code I wrote, I would imagine your web browser is allowed to run that code without opening you up to copyright infringement, even if I didn't provide an explicit license.
383toast 2 hours ago
yeah 4tok/s is kinda unusable though
spwa4 3 hours ago
Does this mean that it should be possible to load up a system with ~10 (seems to me at least the number of active experts) SSDs to get 40 tok/s even on truly gigantic models?
- zozbot234 3 hours ago
  SSD bandwidth will ultimately be limited by the amount of PCIe lanes you have available (for something other than the Apple Silicon internal storage). So the approach has inherent limitations. You can of course scale out to multiple systems to get more throughput.
  You can use this approach with Intel Optane, which is wearout-resistant unlike NAND and can thus substitute for RAM. Last I checked, it was available quite cheap on the secondary market, ~$1/GB as opposed to ~$15/GB or more for DRAM. (Of course that's nowhere near as cheap as NAND, which is around ~$0.1/GB but quite wearout-prone with heavy writes.)
  - spwa4 2 hours ago
    Yeah, PCIe is the bottleneck. The point being that whether the data originates from RAM or from NVME or Optane, you cannot get data to the GPU faster with RAM than with SSDs.
    Meanwhile PCIe switches exist. So why not build:
    1 CPU + memory + ...
    N PCIe switch with each 1 low-memory GPU + 6 NVME drives (in theory 5 can saturate the GPU)
    Each of those should only bother the CPU when they have some tokens produced and have plenty of PCIe lanes to get at their data.
    Such a setup should be able to get a 6 to 8 times speedup from the solution detailed here, and a model compute increase should make relatively little difference in performance.
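For a sense of scale on the 40 tok/s question, here is an IO-only back-of-the-envelope sketch. It reuses the 943 MB/token figure cited earlier in the thread for K=4 and ignores compute time, PCIe topology, and caching, so it is a bound rather than a prediction; the function names are hypothetical.

```python
def required_bandwidth_gb_per_s(target_tokens_per_s, mb_per_token):
    """Sustained read bandwidth needed if every token streams mb_per_token."""
    return target_tokens_per_s * mb_per_token / 1000

def io_bound_tokens_per_s(bandwidth_gb_per_s, mb_per_token):
    """Upper bound on tokens/s when storage bandwidth is the only limit."""
    return bandwidth_gb_per_s * 1000 / mb_per_token

need = required_bandwidth_gb_per_s(40, 943)
print(f"40 tok/s at 943 MB/token needs about {need:.1f} GB/s of sustained reads")
```

Any real setup would land below the IO-only bound, since compute and IO cannot be perfectly overlapped.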
lostmsu 3 hours ago
How large is the KV cache?
- xbar 2 hours ago
  0.1 GB per full-attention layer and "The model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention." So, 1.5 GB.
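The estimate above counts only the full-attention layers. A tiny sketch of that arithmetic, taking the 0.1 GB per-layer figure as given (the context length it corresponds to is not stated here):

```python
def kv_cache_gb(full_attention_layers, gb_per_layer):
    """KV cache size when only full-attention layers keep a growing cache."""
    return full_attention_layers * gb_per_layer

total_layers = 60
linear_attention_layers = 45  # GatedDeltaNet layers, excluded from this estimate
full_layers = total_layers - linear_attention_layers

print(f"{kv_cache_gb(full_layers, 0.1):.1f} GB across {full_layers} layers")
```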
pdyc 4 hours ago
Impressive. I wish someone would take a stab at using this technique on mobile GPUs; even if it does not use storage it would still be a win. I am running llama.cpp on an Adreno 830 with OpenCL and I am getting a pathetic 2-3 t/s for output tokens.
harshhhhhhhhh 4 hours ago
Seems promising, this is the way. Can someone benchmark this?
- frwickst 4 hours ago
  I'm getting 6.55t/s using the Qwen3.5-397B-A17B-4bit model with the command: ./infer --prompt "Explain quantum computing" --tokens 100
  MacBook Pro M5 Pro (64GB RAM)
  - j45 3 hours ago
    Appreciate the data point. M5 Max would also be interesting to see once available in desktop form.
  - logicallee 4 hours ago
    can you post the final result (or as far as you got before you killed it) to show us how cohesive and good it is? I'd like to see an example of the output of this.
    - frwickst 3 hours ago
      Since the output is quite long, here is a link: https://pastebin.com/k76wiVGP
      - hrimfaxi 3 hours ago
        Why does this G character appear to prefix most of the output? ("Ġlike")
          - frwickst 3 hours ago
            It is a tokenizer artifact most likely (https://github.com/huggingface/transformers/issues/4786). So the output is not properly decoded in this case, it should just be a space.
          - kgeist 3 hours ago
            The original tokens have Ġ instead of space. I had this issue too when writing an inference engine for Qwen. You have to "normalize" those special characters.
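The "normalize" step can be sketched as below, assuming the vocabulary uses the GPT-2 style byte-level mapping in which every byte is shown as a printable character and the space byte appears as Ġ. Whether this project's tokenizer follows exactly that table is an assumption, and the helper names are made up.

```python
def byte_to_char_table():
    """Byte-to-character table in the style of GPT-2 byte-level BPE."""
    printable = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    table = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in table:
            # Non-printable bytes (including space) are moved to code points 256 and up.
            table[b] = chr(256 + shift)
            shift += 1
    return table

CHAR_TO_BYTE = {ch: b for b, ch in byte_to_char_table().items()}

def decode_tokens(tokens):
    """Join raw token strings and map them back to UTF-8 text."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in "".join(tokens))
    return raw.decode("utf-8", errors="replace")

print(decode_tokens(["Quantum", "Ġcomputing", "Ġis", "Ġlike"]))
```

Special tokens that are not built from these byte characters would need separate handling before this step; the sketch raises `KeyError` on them.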
vilequeef 4 hours ago
Why so much RAM?
- vilequeef 3 hours ago
  Oh Mac, unified. Sometimes it takes a downvote
rvz 4 hours ago
The technical write up is great, but Mac users should not get too excited just yet on running 300B+ parameter models locally as the TPS isn't that good.
>...at 4.4+ tokens/second
That is even when it is using 4-bit quantization and it is still at that speed.
> The entire 209GB model streams from SSD through a custom Metal compute pipeline.
This is my main problem.
If I were to run this on a Mac SSD, 24/7 for heavy usage such as Openclaw, that is going to significantly reduce the lifetime of the SSD.
Can't imagine using this in the long term right now, but improvements will follow. Still a great write up anyways.
- Roxxik 4 hours ago
  Does an SSD meaningfully degrade by read only workloads?
  - JSR_FDED 4 hours ago
    Nope, reads don’t cause wear
    - zozbot234 3 hours ago
      No appreciable wear of course, but read disturb (requiring occasional rewrites) becomes more of an issue as NAND fabrication advances.
- etiam 4 hours ago
  > If I were to run this on a Mac SSD, 24/7 for heavy usage such as Openclaw, that is going to significantly reduce the lifetime of the SSD.
  How sure are you about that? I've never looked closer at how a large LLM with mixture of experts architecture switches between expert modules, but staying on roughly the same topic for the use (as it often would when editing the same codebase), I wouldn't be surprised to see the switches of composition are fairly rare, fairly small, and to the extent it happens it's repeated reads from the flash disk rather than writes it tends to cause.
  - frotaur 3 hours ago
    Afaik the experts are not usually very interpretable, and generally would be surprised if at least one does not change every token. I don't know what happens in practice, but I know at least during training, nothing is done to minimize the number of expert switches between tokens.
- Wowfunhappy 3 hours ago
  Eh. I mean, 4 tokens a second works fine if you're patient. Go do something else while you wait.
  I feel like whenever I'm trying to find information on which local models will work on my hardware, I have to overestimate because people don't know how to wait for things.
  Also, reading data doesn't cause SSD wear.
- hrmtst93837 4 hours ago
  If you want decent throughput and do not care about burning SSD write cycles on a box that was never meant to act like a tiny inference server, a used server with actual RAM is still the cheaper and less silly option. I wouldn't expect Apple's warranty team to be much help.
  - K0balt 3 hours ago
    Is it doing a bunch of ssd writes?
    - mkw 46 minutes ago
      stream from the SSD, perform the calculation, discard, repeat