2 comments

  • heggenhougen 1 hour ago
    Methodology and reproducibility details:

    All benchmarks were run on the same machine: HP All‑in‑One, Intel i7‑1165G7 (4 cores), 64 GB RAM.

    All tests use identical inputs, identical weights, identical precision, and identical batch size.

    Dense baseline uses the system BLAS (MKL/oneDNN depending on environment).

    Vendor sparse baseline uses standard CSR/COO kernels.

    The custom sparse operator runs in the same Python environment and on the same CPU.

    All baselines (dense and vendor sparse) run normally on this hardware; the custom operator only changes runtime performance, not model executability.

    Wall‑clock time is measured with time.perf_counter() around the matmul call.

    Power readings come from psutil.sensors_battery() and psutil.cpu_freq(); these are not calibrated against external instrumentation.

    “Effective TFLOPS” = nominal dense FLOPs ÷ wall‑clock time.

    Values above hardware peak indicate that fewer multiply‑accumulate operations were actually executed than the dense computation would require; it is a throughput‑equivalent figure, not a measure of real hardware FLOPS.

    Dense TFLOPS is the actual hardware utilization number.

    “Tokens/s” is computed as 1 ÷ (per‑iteration wall‑clock time).

    TTFT (time to first token) is measured as the time from operator invocation to first output.

    All outputs are SHA‑256‑verified to match dense results bit‑for‑bit.

    No quantization, no weight modification, and no model retraining were used.

    All JSON blocks in the post are the raw outputs from the benchmark script.
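    The timing and SHA‑256 verification steps above can be sketched roughly as follows. This is a minimal illustration of the stated methodology, not the actual benchmark script; `bench`, `dense`, and the matrix sizes are placeholders I chose for the example:

```python
import hashlib
import time

import numpy as np

def bench(op, x, iters=100):
    """Time a matmul-like callable and hash its output bit-for-bit."""
    out = op(x)  # warm-up call; also the output we verify
    digest = hashlib.sha256(np.ascontiguousarray(out).tobytes()).hexdigest()
    t0 = time.perf_counter()
    for _ in range(iters):
        op(x)
    # Per-iteration wall-clock time, as in the methodology above
    wall = (time.perf_counter() - t0) / iters
    return wall, digest

# Example: dense baseline checked against itself
# (digests must match bit-for-bit, as claimed for the custom operator).
W = np.random.rand(256, 512).astype(np.float32)
x = np.random.rand(512).astype(np.float32)
dense = lambda v: W @ v

t, d1 = bench(dense, x)
_, d2 = bench(dense, x)
assert d1 == d2  # SHA-256 verification: identical outputs

tokens_per_s = 1.0 / t                   # "Tokens/s" definition
eff_tflops = (2 * W.size) / t / 1e12     # nominal dense FLOPs / wall-clock time
```

    Any comparison between operators would call `bench` with the same `W` and `x` and compare both the wall times and the digests.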

  • heggenhougen 1 hour ago
    I ran four feed‑forward network (FFN) layers from real LLMs on a consumer HP All‑in‑One PC (Intel i7‑1165G7, 4 cores, 64 GB RAM). Each test compares:

    vendor dense baseline

    vendor sparse baseline

    a custom sparse operator

    identical inputs

    identical weights

    identical precision

    SHA‑256‑verified outputs

    wall‑clock timing

    psutil‑based power measurement

    All baselines (dense and vendor sparse) run normally on this machine. The custom operator only changes runtime performance; it is not required to execute the models.

    Below are the raw results and full JSON for reproducibility.

    Mistral‑7B Wanda FFN (4096×14336 @ 55% sparsity)

    Speedup vs dense: 84.3×

    Speedup vs vendor sparse: 163.4×

    Energy reduction: 98.8%

    Tokens/s: 53,330

    Dense TFLOPS: 0.07

    Effective TFLOPS: 6.26

    TTFT: 0.000590 s

    python { "benchmark": "Mistral-7B Wanda 55% Sparse FFN", "sparsity_pct": 55.0, "matrix": "4096x14336", "speedup_vs_dense_x": 84.3, "speedup_vs_sparse_x": 163.4, "energy_savings_pct": 98.8, "tokens_per_s_rolv": 53330.0, "tokens_per_s_dense": 633.0, "tokens_per_s_sparse": 326.0, "nominal_gflops_per_iter": 0.94, "dense_tflops": 0.07, "eff_tflops_rolv": 6.26, "ttft_s": 0.00059, "platform": "Intel Core i7-1165G7", "hardware": "4 cores — 63.7 GB RAM" } GPT‑J‑6B FFN (4096×16384 @ 40% sparsity) Speedup vs dense: 90.6×

    Speedup vs vendor sparse: 174.8×

    Energy reduction: 98.9%

    Tokens/s: 38,191

    Dense TFLOPS: 0.06

    Effective TFLOPS: 5.13

    TTFT: 0.000387 s

    python { "benchmark": "GPT-J-6B 40% Sparse FFN", "sparsity_pct": 40.0, "matrix": "4096x16384", "speedup_vs_dense_x": 90.6, "speedup_vs_sparse_x": 174.8, "energy_savings_pct": 98.9, "tokens_per_s_rolv": 38191.0, "tokens_per_s_dense": 422.0, "tokens_per_s_sparse": 218.0, "nominal_gflops_per_iter": 1.074, "dense_tflops": 0.06, "eff_tflops_rolv": 5.13, "ttft_s": 0.000387, "platform": "Intel Core i7-1165G7", "hardware": "4 cores — 63.7 GB RAM" } Llama‑2‑7B FFN (4096×11008 @ 70% sparsity) Speedup vs dense: 87.4×

    Speedup vs vendor sparse: 116.1×

    Energy reduction: 98.9%

    Tokens/s: 73,916

    Dense TFLOPS: 0.08

    Effective TFLOPS: 6.67

    TTFT: 0.000392 s

    python { "benchmark": "Llama-2-7B 70% Sparse FFN", "sparsity_pct": 70.0, "matrix": "4096x11008", "speedup_vs_dense_x": 87.4, "speedup_vs_sparse_x": 116.1, "energy_savings_pct": 98.9, "tokens_per_s_rolv": 73916.0, "tokens_per_s_dense": 845.0, "tokens_per_s_sparse": 637.0, "nominal_gflops_per_iter": 0.721, "dense_tflops": 0.08, "eff_tflops_rolv": 6.67, "ttft_s": 0.000392, "platform": "Intel Core i7-1165G7", "hardware": "4 cores — 63.7 GB RAM" } BERT‑Base FFN (3072×768, dense) Speedup vs dense: 4.8×

    Speedup vs vendor sparse: 23.9×

    Energy reduction: 79.0%

    Tokens/s: 104,131

    Dense TFLOPS: 0.10

    Effective TFLOPS: 0.49

    TTFT: 0.000322 s

    python { "benchmark": "BERT-Base Real FFN", "sparsity_pct": 0.0, "matrix": "3072x768", "speedup_vs_dense_x": 4.8, "speedup_vs_sparse_x": 23.9, "energy_savings_pct": 79.0, "tokens_per_s_rolv": 104131.0, "tokens_per_s_dense": 21895.0, "tokens_per_s_sparse": 4349.0, "nominal_gflops_per_iter": 0.038, "dense_tflops": 0.10, "eff_tflops_rolv": 0.49, "ttft_s": 0.000322, "platform": "Intel Core i7-1165G7", "hardware": "4 cores — 63.7 GB RAM" } Notes All baselines (dense and vendor sparse) run on the same machine.

    “Effective TFLOPS” = nominal dense FLOPs ÷ wall‑clock time.

    Values above hardware peak indicate that fewer multiply‑accumulate operations were actually executed than the dense computation would require; it is a throughput‑equivalent figure, not a measure of real hardware FLOPS.

    Dense TFLOPS is the actual hardware utilization.

    Power readings from psutil; not calibrated against external instrumentation.

    All outputs are SHA‑256‑verified to match dense results.
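    For reference, the metric definitions in these notes reduce to a couple of lines of arithmetic. The sketch below applies them to hypothetical wall‑clock and FLOP values of my own choosing, not to numbers from the runs above:

```python
# Metric definitions from the notes, applied to hypothetical values.
wall_s = 2.5e-4          # per-iteration wall-clock time (hypothetical)
nominal_flops = 0.94e9   # nominal dense FLOPs per iteration (hypothetical)

# "Tokens/s" = 1 / per-iteration wall-clock time.
tokens_per_s = 1.0 / wall_s                   # 4000.0

# "Effective TFLOPS" = nominal dense FLOPs / wall-clock time.
# A value above hardware peak would mean fewer MACs were actually executed
# than the dense computation nominally requires.
eff_tflops = nominal_flops / wall_s / 1e12    # ≈ 3.76
```

    Both figures are derived from the same single wall‑clock measurement, which is why a faster operator raises tokens/s and effective TFLOPS in lockstep.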