Hello,

Miquel Sabaté Solà @ 2026-06-07 22:38 +02:

> Hi,
>
> Paul Walmsley @ 2026-06-06 18:50 -06:
>
>> Hi,
>>
>> On Fri, 5 Jun 2026, Miquel Sabaté Solà wrote:
>>
>>> Support for atomic Compare-And-Swap instructions has been in the RISC-V
>>> port of the Linux kernel for a long time. That being said, we apparently
>>> never bothered to set HAVE_CMPXCHG_DOUBLE and HAVE_CMPXCHG_LOCAL in the
>>> Kconfig, despite having all the framework to support them.
>>>
>>> Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
>>> ---
>>> This is a resend of [1], rebased on top of the latest commit from the
>>> for-next branch.
>>>
>>> I have built this patch with multiple configurations and ran it with KVM
>>> (the VisionFive2 board that I have lacks the needed extensions). All seems
>>> to work, but I do wonder if we did not enable these for a reason or this
>>> just slipped through. So far in the code I believe everything is in place,
>>> and I haven't seen any commit in the git log stating otherwise.
>>>
>>> [1] https://lore.kernel.org/all/20260220074449.8526-1-mssola@mssola.com/
>>
>> Thanks for the patch.  Your comments above are why I've been hesitant to
>> merge it.  I'm not aware of any publicly available hardware that supports
>> Zacas/Zabha.  No one has stepped forward to provide any Tested-by:s on
>> hardware that hasn't been released yet.  You mention that you tested on
>> your VisionFive2 board, but it would not have exercised those code paths.
>
> No, I mention that I ran it _only_ on KVM, as my VisionFive2 board lacks
> these extensions and hence I couldn't possible have tested this there :)
>
>>
>> Of course, we already have Zacas/Zabha support, merged back in 2024, in
>> cmpxchg.h.  I assume (?) that it was tested in QEMU, but I don't see any
>> comments about that in the patch series.  No one sent any Tested-by:s
>> then, either.
>>
>> It would be good if you (and ideally others) could put this patch through
>> some testing on QEMU with Zacas and Zabha enabled, before we merge it.
>> The affected code paths for HAVE_CMPXCHG_LOCAL seem to primarily involve
>> per-CPU counters and MM zone counters, so those would be the areas to
>> focus.  HAVE_CMPXCHG_DOUBLE seems to do nothing useful other than
>> preventing the AMD IOMMU driver from being selected if it's not present,
>> so that part of the patch seems fairly useless.  In fact I'd suggest
>> dropping that from the patch and just sending a separate patch to remove
>> HAVE_CMPXCHG_DOUBLE from the kernel completely.
>
> To be fair, on QEMU I only "tested" it by booting it, running a few
> things for some time and ensuring that nothing got totally broken in the
> process while taking a look at the kernel logs.
>
> In any case, let me double check with QEMU with these extensions enabled
> and I'll try to be more thorough about it. I'll do just that whenever I
> have some spare time during the following week :)

So much for "the following week" :) Sorry for being a bit late.

I have finally taken a deeper look at this, and I have the following
comments that I hope you can further clarify so we can land a more
precise patch.

First of all, HAVE_CMPXCHG_LOCAL is fundamentally used in two places: in
mm/vmstat.c and in lib/percpu_counter.c. In the latter it is used in the
percpu_counter_add_batch() function, which is used tree-wide. Hence,
selecting this config option has a wide effect.

As far as I can tell, both HAVE_CMPXCHG_LOCAL and HAVE_CMPXCHG_DOUBLE
could have been added in this patch series [1] by Alexandre Ghiti, but
I've been unable to tell how it was actually tested (no mentions on that
on no Tested-by either). On my side, the board I have doesn't have
neither Zacas nor the Zabha extensions so, as I mentioned on my previous
email, I have to test everything via KVM.

Having that into account, if I build the kernel with or without the
patch applied, functions like percpu_counter_add_batch() have a
different implementation, as expected (good ol' objdump -d vmlinux
--disassemble=percpu_counter_add_batch confirmed as much); and from a
general outlook, the given assembly code is what I'd expect. In order to
test it, I have to say that I've relied on checking that selftests and
similar continue to pass before/after the patch. In particular, I've run
the selftests from btrfs, and they seem just fine (and toggling a
breakpoint inside of percpu_counter_add_batch() confirms that it's being
constantly called, so we are definitely calling the "new" code).

>
> As for HAVE_CMPXCHG_DOUBLE, removing it makes sense. Let me just take
> another look and I will send a separate patch whenever I'm ready for it.

Here I'm a bit conflicted, because you mentioned that it's only used for the AMD
IOMMU driver, but support on different architectures like 5284e1b4bc8a ("arm64:
xchg: Implement cmpxchg_double") or f0e4b1b6e295 ("LoongArch: Add 128-bit atomic
cmpxchg support") hint into other places where this is needed. In fact, for
loongarch, it's apparently needed to fix multiple tests on sched_ext. Thus, I
believe that sending a patch to drop it from the kernel would be misguided.

As for RISC-V, first of all I believe that my patch should've also included that
HAVE_CMPXCHG_DOUBLE can only be selected on CONFIG_64BIT. In fact, commit
f7bd2be7663c ("riscv: Implement arch_cmpxchg128() using Zacas"), which brings
support for this (again, from the same series [1]), also requires this
configuration option. That being said, this commit doesn't say how it was
tested, and there are not Tested-by either. On my end, so far I got no luck
trying the sched_ext route mentioned in the loongarch support, which is a bit of
a hussle in KVM with cross-compilation and the library requirements.

But as a final note, I have used hackbench on this. I have statically compiled
hackbench from source [2] at commit e6fb0b2c8ad1 ("rt-tests: Add AGENTS.md guide
for AI coding assistants") (i.e. current master). This binary has been deployed
in three iterations where the kernel was built as:

1. Before the patch.
2. With the patch (both HAVE_CMPXCHG_LOCAL and HAVE_CMPXCHG_DOUBLE selected).
3. With only HAVE_CMPXCHG_LOCAL selected.

Again, given that I can only test this on QEMU, this whole thing is tested on
virtualized environments, so take it with a grain of salt :)

In any case, I got the following results:

|        | Mainline | Both selected | Only _LOCAL selected |
|--------+----------+---------------+----------------------|
| Mean   |  50.9966 |       54.2076 |              51.6488 |
| StdDev |   0.1508 |        0.1097 |               0.7051 |

As you can see, the results on Mainline and "only _LOCAL selected" are more or
less the same, but not so much whenever _DOUBLE is also selected. I figure that
that these instructions are not as fast as expected on my virtualized
environment.

With all of this, I'm torned on whether we should include HAVE_CMPXCHG_DOUBLE,
as code-wise is pretty much the same as HAVE_CMPXCHG_LOCAL, but I can also see
how it can be removed as I cannot properly test it. As a middle ground, maybe we
can leave a TODO in __arch_cmpxchg128() asking to introduce HAVE_CMPXCHG_DOUBLE
whenever someone can actually test it and guarantee that there are no
performance penalties as I saw on my virtualized environment.

And that's enough of my rumblings :) I'd appreciate any comments on this.

Thanks,
Miquel

[1] https://lore.kernel.org/all/20241103145153.105097-1-alexghiti@rivosinc.com/
[2] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git

>
>>
>>
>> - Paul
>
> Thanks for your input!
> Miquel
>
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv