public inbox for linux-arm-kernel@lists.infradead.org
From: "Ard Biesheuvel" <ardb@kernel.org>
To: "Demian Shulhan" <demyansh@gmail.com>,
	"Mark Rutland" <mark.rutland@arm.com>
Cc: "Christoph Hellwig" <hch@lst.de>, "Song Liu" <song@kernel.org>,
	"Yu Kuai" <yukuai@fnnas.com>, "Will Deacon" <will@kernel.org>,
	"Catalin Marinas" <catalin.marinas@arm.com>,
	"Mark Brown" <broonie@kernel.org>,
	linux-arm-kernel@lists.infradead.org, robin.murphy@arm.com,
	"Li Nan" <linan122@huawei.com>,
	linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation
Date: Mon, 30 Mar 2026 18:39:49 +0200	[thread overview]
Message-ID: <9a12e043-8200-4650-bfe2-cbece57a4f87@app.fastmail.com> (raw)
In-Reply-To: <CAOLeWCsxhzdxQviizJ4X4VOp_28LCtO-RjWoCcZG29rQw86NVg@mail.gmail.com>

Hi Demian,

On Sun, 29 Mar 2026, at 15:01, Demian Shulhan wrote:
> I want to address the comment about the marginal 0.3% speedup on the
> 8-disk benchmark. While the pure memory bandwidth on a small array is
> indeed bottlenecked, it doesn't reveal the whole picture. I extracted
> the SVE and NEON implementations into a user-space benchmark to
> measure the actual hardware efficiency using perf stat, running on the
> same AWS Graviton3 (Neoverse-V1) instance. The results show a massive
> difference in CPU efficiency. For the same 8-disk workload, the svex4
> implementation requires about 35% fewer instructions and 46% fewer CPU
> cycles compared to neonx4 (7.58 billion instructions vs 11.62
> billion). This translates directly into significant energy savings and
> reduced pressure on the CPU frontend, which would leave more compute
> resources available for network and NVMe queues during an array
> rebuild.
>

I think the results are impressive, but I'd like to better understand
their implications for a real-world scenario. Is this code only a
bottleneck when rebuilding an array? Is it really that much more power
efficient, given that the registers (and ALU paths) are twice the size?
And given the I/O load of rebuilding a 24+ disk array, how much CPU
throughput can we make use of meaningfully in such a scenario?
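
For readers without the patch at hand: the per-byte math that both the
NEON and SVE versions vectorize is the standard RAID6 P/Q syndrome over
GF(2^8) with polynomial 0x11d. A scalar reference sketch (illustrative
names, not the kernel's actual code):

```c
#include <stdint.h>
#include <stddef.h>

/* GF(2^8) multiply-by-2 with the RAID6 polynomial 0x11d. */
static uint8_t gf_mul2(uint8_t v)
{
    return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0x00));
}

/* Scalar P/Q generation over 'disks' data buffers of 'bytes' each:
 * P is plain XOR parity, Q = sum over d of g^d * D_d with g = 2,
 * accumulated via Horner's rule from the highest-numbered disk down. */
static void gen_syndrome_ref(int disks, size_t bytes,
                             uint8_t **data, uint8_t *p, uint8_t *q)
{
    for (size_t b = 0; b < bytes; b++) {
        uint8_t wp = data[disks - 1][b];
        uint8_t wq = wp;

        for (int d = disks - 2; d >= 0; d--) {
            wp ^= data[d][b];
            wq = gf_mul2(wq) ^ data[d][b];
        }
        p[b] = wp;
        q[b] = wq;
    }
}
```

The SIMD variants do the same per-byte work, just 16 bytes (NEON) or one
vector length (SVE) at a time, with the conditional turned into a
compare-mask and XOR.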

Supporting SVE in the kernel primarily impacts the size of the per-task
buffers needed to preserve and restore the vector context. Fortunately,
these are no longer allocated for the lifetime of the task, but
dynamically (by scoped_ksimd()), so the main impediment has recently
been removed. But as Mark pointed out, there are other things to
take into account. Nonetheless, our position has always been that a
compelling use case could convince us that the additional complexity
of in-kernel SVE is justified.
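
For a sense of scale, the architectural SVE state that a context switch
must be able to preserve can be computed as follows (kernel-side padding
and metadata deliberately ignored):

```c
#include <stddef.h>

/* Architectural SVE register state, in bytes: 32 Z vector registers of
 * VL bits each, 16 predicate registers of VL/8 bits each, plus the
 * same-sized first-fault register (FFR). */
static size_t sve_state_bytes(unsigned int vl_bits)
{
    size_t z   = 32 * (vl_bits / 8);   /* Z0-Z31 */
    size_t p   = 16 * (vl_bits / 64);  /* P0-P15 */
    size_t ffr =      vl_bits / 64;    /* FFR */
    return z + p + ffr;
}
```

At Graviton3's 256-bit VL that is 1092 bytes, versus 512 bytes for the
fixed 32x16-byte FPSIMD register file; at the architectural maximum VL
of 2048 bits it grows to 8736 bytes per task, which is what made
lifetime-of-the-task allocation unattractive.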

> Furthermore, as Christoph suggested, I tested scalability on wider
> arrays since the default kernel benchmark is hardcoded to 8 disks,
> which doesn't give the unrolled SVE loop enough data to shine. On a
> 16-disk array, svex4 hits 15.1 GB/s compared to 8.0 GB/s for neonx4.
> On a 24-disk array, while neonx4 chokes and drops to 7.8 GB/s, svex4
> maintains a stable 15.0 GB/s — effectively doubling the throughput.

Does this mean the kernel benchmark is no longer fit for purpose? If
it cannot distinguish between implementations that differ in performance
by a factor of 2, I don't think we can rely on it to pick the optimal one.

> I agree this patch should be put on hold for now. My intention is to
> leave these numbers here as evidence that implementing SVE context
> preservation in the kernel (the "good use case") is highly justifiable
> from both a power-efficiency and a wide-array throughput perspective
> for modern ARM64 hardware.
>

Could you please summarize the results? Please also include the
command line, a link to the test source, and the vector length used
by the implementation.



> Thanks again for your time and review!
>
> ---------------------------------------------------
> User space test results:
> ==================================================
>     RAID6 SVE Benchmark Results (AWS Graviton3)
> ==================================================
> Instance Details:
> Linux ip-172-31-87-234 6.8.0-1047-aws #50~22.04.1-Ubuntu SMP Thu Feb
> 19 20:49:25 UTC 2026 aarch64 aarch64 aarch64 GNU/Linux
> --------------------------------------------------
>
> [Test 1: Energy Efficiency / Instruction Count (8 disks)]
> Running baseline (neonx4)...
> algo=neonx4 ndisks=8 iterations=1000000 time=2.681s MB/s=8741.36
>
>  Performance counter stats for './raid6_bench neonx4 8 1000000':
>
>        11626717224      instructions                     #    1.67  insn per cycle
>         6946699489      cycles
>          257013219      L1-dcache-load-misses
>
>        2.681213149 seconds time elapsed
>
>        2.676771000 seconds user
>        0.002000000 seconds sys
>
>
> Running SVE (svex1)...
> algo=svex1 ndisks=8 iterations=1000000 time=1.688s MB/s=13885.23
>
>  Performance counter stats for './raid6_bench svex1 8 1000000':
>
>        10527277490      instructions                     #    2.40  insn per cycle
>         4379539835      cycles
>          175695656      L1-dcache-load-misses
>
>        1.688852006 seconds time elapsed
>
>        1.687298000 seconds user
>        0.000999000 seconds sys
>
>
> Running SVE unrolled x4 (svex4)...
> algo=svex4 ndisks=8 iterations=1000000 time=1.445s MB/s=16215.04
>
>  Performance counter stats for './raid6_bench svex4 8 1000000':
>
>         7587813392      instructions                     #    2.02  insn per cycle
>         3748486131      cycles
>          213816184      L1-dcache-load-misses
>
>        1.446032415 seconds time elapsed
>
>        1.442412000 seconds user
>        0.002996000 seconds sys
>
> ==================================================
> [Test 2: Scalability on Wide RAID Arrays (MB/s)]
> --- 16 Disks ---
> algo=neonx4 ndisks=16 iterations=1000000 time=6.783s MB/s=8062.33
> algo=svex1 ndisks=16 iterations=1000000 time=4.912s MB/s=11132.90
> algo=svex4 ndisks=16 iterations=1000000 time=3.601s MB/s=15188.85
>
> --- 24 Disks ---
> algo=neonx4 ndisks=24 iterations=1000000 time=11.011s MB/s=7805.02
> algo=svex1 ndisks=24 iterations=1000000 time=8.843s MB/s=9718.26
> algo=svex4 ndisks=24 iterations=1000000 time=5.719s MB/s=15026.92
>
> Extra tests:
> --- 48 Disks ---
> algo=neonx4 ndisks=48 iterations=500000 time=11.826s MB/s=7597.25
> algo=svex4 ndisks=48 iterations=500000 time=5.808s MB/s=15468.10
> --- 96 Disks ---
> algo=neonx4 ndisks=96 iterations=200000 time=9.783s MB/s=7507.01
> algo=svex4 ndisks=96 iterations=200000 time=4.701s MB/s=15621.17
> ==================================================
>




Thread overview: 8+ messages
     [not found] <20260318150245.3080719-1-demyansh@gmail.com>
2026-03-24  7:45 ` [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation Christoph Hellwig
2026-03-24  8:00 ` Ard Biesheuvel
2026-03-24 10:04   ` Mark Rutland
2026-03-29 13:01     ` Demian Shulhan
2026-03-30  5:30       ` Christoph Hellwig
2026-03-30 16:39       ` Ard Biesheuvel [this message]
2026-03-31  6:36         ` Christoph Hellwig
2026-03-31 13:18           ` Demian Shulhan
