Date: Mon, 30 Mar 2026 18:39:49 +0200
From: "Ard Biesheuvel"
To: "Demian Shulhan", "Mark Rutland"
Cc: "Christoph Hellwig", "Song Liu", "Yu Kuai", "Will Deacon",
 "Catalin Marinas", "Mark Brown", linux-arm-kernel@lists.infradead.org,
 robin.murphy@arm.com, "Li Nan", linux-raid@vger.kernel.org,
 linux-kernel@vger.kernel.org
Message-Id: <9a12e043-8200-4650-bfe2-cbece57a4f87@app.fastmail.com>
In-Reply-To:
References: <20260318150245.3080719-1-demyansh@gmail.com>
Subject: Re: [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation

Hi Demian,

On Sun, 29 Mar 2026, at 15:01, Demian Shulhan wrote:
> I want to address the comment about the marginal 0.3% speedup on the
> 8-disk benchmark. While memory bandwidth is indeed the bottleneck on a
> small array, that number doesn't reveal the whole picture. I extracted
> the SVE and NEON implementations into a user-space benchmark to
> measure the actual hardware efficiency using perf stat, running on the
> same AWS Graviton3 (Neoverse-V1) instance. The results show a large
> difference in CPU efficiency. For the same 8-disk workload, the svex4
> implementation requires about 35% fewer instructions and 46% fewer CPU
> cycles than neonx4 (7.58 billion instructions vs 11.62 billion). This
> translates directly into significant energy savings and reduced
> pressure on the CPU frontend, which would leave more compute resources
> available for the network and NVMe queues during an array rebuild.
>

I think the results are impressive, but I'd like to better understand
their implications in a real-world scenario. Is this code only a
bottleneck when rebuilding an array? Is it really that much more power
efficient, given that the registers (and ALU paths) are twice the size?
And given the I/O load of rebuilding a 24+ disk array, how much CPU
throughput can we meaningfully make use of in such a scenario?

Supporting SVE in the kernel primarily affects the size of the per-task
buffers we need for preserving and restoring the context. Fortunately,
these are no longer allocated for the lifetime of the task, but
dynamically (by scoped_ksimd()), so that main impediment has recently
been removed. But as Mark pointed out, there are other things to take
into account.
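For reference, the sketch below shows roughly how lib/raid6/neon.c
enters a kernel-mode SIMD section today, and an SVE variant would
presumably need the same shape of wrapper. It is illustrative only
(the sve4 function names are made up, not taken from your patch), and
scoped_ksimd() would take over the begin/end bookkeeping:

    #include <asm/neon.h>

    /*
     * Illustrative sketch only, not the actual patch.
     * kernel_neon_begin()/kernel_neon_end() currently license
     * NEON/FPSIMD use in the kernel; letting the inner loop use SVE
     * instructions as well is what the per-task context buffer
     * discussion above is about.
     */
    static void raid6_sve4_gen_syndrome_real(int disks, size_t bytes,
                                             void **ptrs);

    static void raid6_sve4_gen_syndrome(int disks, size_t bytes,
                                        void **ptrs)
    {
            kernel_neon_begin();    /* claim the CPU's SIMD/FP state */
            raid6_sve4_gen_syndrome_real(disks, bytes, ptrs);
            kernel_neon_end();      /* release it again */
    }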
Nonetheless, our position has always been that a compelling use case
could convince us that the additional complexity of in-kernel SVE is
justified.

> Furthermore, as Christoph suggested, I tested scalability on wider
> arrays, since the default kernel benchmark is hardcoded to 8 disks,
> which doesn't give the unrolled SVE loop enough data to shine. On a
> 16-disk array, svex4 hits 15.1 GB/s compared to 8.0 GB/s for neonx4.
> On a 24-disk array, while neonx4 drops to 7.8 GB/s, svex4 maintains a
> stable 15.0 GB/s, effectively doubling the throughput.

Does this mean the kernel benchmark is no longer fit for purpose? If it
cannot distinguish between implementations that differ in performance
by a factor of 2, I don't think we can rely on it to pick the optimal
one.

> I agree this patch should be put on hold for now. My intention is to
> leave these numbers here as evidence that implementing SVE context
> preservation in the kernel (the "good use case") is highly justifiable
> from both a power-efficiency and a wide-array throughput perspective
> on modern arm64 hardware.

Could you please summarize the results? The output below seems to have
become mangled a bit. Please also include the command line, a link to
the test source, and the vector length of the implementation.

> Thanks again for your time and review!
>
> ---------------------------------------------------
> User space test results:
> ==================================================
> RAID6 SVE Benchmark Results (AWS Graviton3)
> ==================================================
> Instance Details:
> Linux ip-172-31-87-234 6.8.0-1047-aws #50~22.04.1-Ubuntu SMP Thu Feb
> 19 20:49:25 UTC 2026 aarch64 aarch64 aarch64 GNU/Linux
> --------------------------------------------------
>
> [Test 1: Energy Efficiency / Instruction Count (8 disks)]
>
> Running baseline (neonx4)...
> algo=neonx4 ndisks=8 iterations=1000000 time=2.681s MB/s=8741.36
>
>  Performance counter stats for './raid6_bench neonx4 8 1000000':
>
>     11626717224      instructions           # 1.67 insn per cycle
>      6946699489      cycles
>       257013219      L1-dcache-load-misses
>
>       2.681213149 seconds time elapsed
>       2.676771000 seconds user
>       0.002000000 seconds sys
>
> Running SVE (svex1)...
> algo=svex1 ndisks=8 iterations=1000000 time=1.688s MB/s=13885.23
>
>  Performance counter stats for './raid6_bench svex1 8 1000000':
>
>     10527277490      instructions           # 2.40 insn per cycle
>      4379539835      cycles
>       175695656      L1-dcache-load-misses
>
>       1.688852006 seconds time elapsed
>       1.687298000 seconds user
>       0.000999000 seconds sys
>
> Running SVE unrolled x4 (svex4)...
> algo=svex4 ndisks=8 iterations=1000000 time=1.445s MB/s=16215.04
>
>  Performance counter stats for './raid6_bench svex4 8 1000000':
>
>      7587813392      instructions           # 2.02 insn per cycle
>      3748486131      cycles
>       213816184      L1-dcache-load-misses
>
>       1.446032415 seconds time elapsed
>       1.442412000 seconds user
>       0.002996000 seconds sys
>
> ==================================================
> [Test 2: Scalability on Wide RAID Arrays (MB/s)]
>
> --- 16 Disks ---
> algo=neonx4 ndisks=16 iterations=1000000 time=6.783s MB/s=8062.33
> algo=svex1 ndisks=16 iterations=1000000 time=4.912s MB/s=11132.90
> algo=svex4 ndisks=16 iterations=1000000 time=3.601s MB/s=15188.85
>
> --- 24 Disks ---
> algo=neonx4 ndisks=24 iterations=1000000 time=11.011s MB/s=7805.02
> algo=svex1 ndisks=24 iterations=1000000 time=8.843s MB/s=9718.26
> algo=svex4 ndisks=24 iterations=1000000 time=5.719s MB/s=15026.92
>
> Extra tests:
> --- 48 Disks ---
> algo=neonx4 ndisks=48 iterations=500000 time=11.826s MB/s=7597.25
> algo=svex4 ndisks=48 iterations=500000 time=5.808s MB/s=15468.10
> --- 96 Disks ---
> algo=neonx4 ndisks=96 iterations=200000 time=9.783s MB/s=7507.01
> algo=svex4 ndisks=96 iterations=200000 time=4.701s MB/s=15621.17
> ==================================================
>
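Judging by the perf headers in the output above, I assume the
invocation was something along these lines (the event list is my guess
based on the counters shown, so please include the exact command line
when you repost):

  perf stat -e instructions,cycles,L1-dcache-load-misses \
      ./raid6_bench neonx4 8 1000000
  perf stat -e instructions,cycles,L1-dcache-load-misses \
      ./raid6_bench svex4 8 1000000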