Generic Linux architectural discussions


* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Christian Brauner @ 2026-06-10  7:28 UTC
  To: Andy Lutomirski
  Cc: Li Chen, Kees Cook, Alexander Viro, linux-fsdevel, linux-api,
	linux-kernel, linux-mm, linux-arch, linux-doc, linux-kselftest,
	x86, Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Jan Kara, Jonathan Corbet,
	Shuah Khan
In-Reply-To: <CALCETrWJQpLR4n1cpichBk8=uExSKLWTMGU3BufGdk_WE_p5UA@mail.gmail.com>

On Mon, Jun 08, 2026 at 05:01:57PM -0700, Andy Lutomirski wrote:
> On Thu, May 28, 2026 at 4:05 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
> > > Hi,
> > >
> > > This is an early RFC for an idea that is probably still rough in both the
> > > UAPI and implementation details. Sorry for the rough edges; I am sending
> > > it now to check whether this direction is worth pursuing and to get
> > > feedback on the kernel/userspace boundary.
> >
> > The idea of having a builder api for exec isn't all that crazy. But it
> > should simply be built on top of pidfds and thus pidfs itself instead.
> > It has all the basic infrastructure in place already. Any implementation
> > should also allow userspace to implement posix_spawn() on top of it.
> >
> > fd = pidfd_open(0, PIDFD_EMPTY /* or better name */)
> >
> > pidfd_config(fd, ...) // modeled similar to fsconfig()
> >
> 
> After contemplating this for a bit... why pidfd?  Doesn't a pidfd
> refer to an actual process that is, or at least was, running?  This
> new thing is a process that we are contemplating spawning.  I can
> imagine that basically all pidfd APIs would be a bit confused by the
> nonexistence of the process in question.

I don't think that would be a problem because every api just needs to
handle ESRCH. Ignoring that for a second: the mount api has a builder fd
that is later transformed into a mount fd. The same is easily doable here,
with the builder fd becoming a pidfd. My point is that all the
infrastructure building blocks already
exist in pidfs.
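
To illustrate the ESRCH point, here is a minimal userspace sketch that uses
only the existing pidfd_open(2) and pidfd_send_signal(2) syscalls. It does
not use the proposed builder API: PIDFD_EMPTY and pidfd_config() are only
suggestions in this thread. The sketch takes a pidfd to a child, reaps the
child, and then probes the pidfd with signal 0. On kernels with pidfd support
the probe is expected to fail with ESRCH. The helper names sys_pidfd_open(),
sys_pidfd_send_signal() and probe_reaped_pidfd() are made up for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Raw syscall wrappers, so the sketch does not depend on a recent glibc. */
static int sys_pidfd_open(pid_t pid, unsigned int flags)
{
	return (int)syscall(SYS_pidfd_open, pid, flags);
}

static int sys_pidfd_send_signal(int pidfd, int sig)
{
	assert(pidfd >= 0);
	return (int)syscall(SYS_pidfd_send_signal, pidfd, sig, NULL, 0);
}

/*
 * Fork a child, open a pidfd for it, reap it, then probe the pidfd.
 * Returns the errno seen when probing the reaped process, 0 if the probe
 * unexpectedly succeeded, or -1 if the setup itself failed.
 */
static int probe_reaped_pidfd(void)
{
	pid_t child;
	int pidfd, err;

	child = fork();
	if (child < 0)
		return -1;
	if (child == 0)
		_exit(0);

	/* A zombie still has a pid, so this works even if the child is gone. */
	pidfd = sys_pidfd_open(child, 0);
	if (waitpid(child, NULL, 0) != child || pidfd < 0) {
		if (pidfd >= 0)
			close(pidfd);
		return -1;
	}

	/* Signal 0 only checks whether the target can still be signalled. */
	err = sys_pidfd_send_signal(pidfd, 0) < 0 ? errno : 0;
	close(pidfd);
	return err;
}
```

A caller of a pidfd-based API would treat that errno the same way whether the
process has exited or, in the builder case, has not been spawned yet.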


* Re: [PATCH v12 00/15] barrier: Add smp_cond_load_{relaxed,acquire}_timeout()
From: Ankur Arora @ 2026-06-10  6:44 UTC
  To: Ankur Arora
  Cc: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf, arnd,
	catalin.marinas, will, peterz, akpm, mark.rutland, harisokn, cl,
	ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, broonie, joao.m.martins,
	boris.ostrovsky, konrad.wilk, ashok.bhat
In-Reply-To: <20260608080440.127491-1-ankur.a.arora@oracle.com>


Summarizing all of the bot reviews (sashiko/bpf-bot):

Most of the comments are the same as in v11. Let me outline the ones I
think are notable:

  - edge cases around the timeout value (-1, S64_MAX, U64_MAX).

    I've noted in the first patch how these cases are probably best
    addressed at review time instead of complicating the implementation
    like in https://lore.kernel.org/lkml/874iklm1uy.fsf@oracle.com/

  - as a side-effect of enabling ARCH_HAS_CPU_RELAX, acpi_processor_setup_cstates()
    unintentionally enables a NOP poll_idle() (patch-2). I've described
    it in more detail in my reply to that patch.

    Will fix this.

  - potentially missed control dependency in the timeout case of
    smp_cond_load_acquire_timeout(). Probably need a better fix for
    this than I have.

    Need more thinktime as the bots would say.  Will address this one
    and the one below in reply to patches 6, 7.

  - possibly torn reads with atomic64_cond_read_*_timeout() on 32-bit
    architectures.

Ankur

Ankur Arora <ankur.a.arora@oracle.com> writes:

> Hi,
>
> Main change in this version:
>
>   - addressed some review comments from sashiko (see commit notes)
>     - The one notable change is to the implementation of
>       smp_cond_load_acquire_timeout() where there was a missed
>       control dependency in the timeout case.
>       All the others are minor.
>   - fixed a low probability race in the kunit test added in v11.
>   - added a bunch of kunit tests validating the implementation's
>     use of the clock.
>
> Andrew, if the changes look okay, could we take this in your mm-nonmm
> tree as before?
>
> The core kernel often uses smp_cond_load_{relaxed,acquire}() to spin
> on condition variables with architectural primitives used to avoid
> hammering the relevant cachelines.
>
> (This primitive can vary greatly across architectures: on x86 it's a
> cpu_relax() to slow down the pipeline. On arm64, this is a __cmpwait()
> which waits for a cacheline to change state in a time limited fashion.)
>
> Regardless of architectural details, typical smp_cond_load*() usage
> does not allow for termination until the condition change occurs.
>
> Beyond the core kernel, there are cases where it is useful to additionally
> terminate on a timeout. Two cases:
>
>   - cpuidle poll_idle(): wait for need-resched until the cpuidle polling
>     duration expires.
>
>   - rqspinlock: nested qspinlock acquisition that terminates on timeout
>     or deadlock.
>
> Accordingly add two interfaces (with their generic and arm64 specific
> implementations):
>
>    smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr, timeout)
>    smp_cond_load_acquire_timeout(ptr, cond_expr, time_expr, timeout)
>
> Also add tif_need_resched_relaxed_wait() which wraps the polling
> pattern and its scheduler specific details in poll_idle().
> In addition add atomic_cond_read_*_timeout(),
> atomic64_cond_read_*_timeout(), and atomic_long wrappers.
>
> Structurally, both the smp_cond_load_*_timeout() interfaces are similar
> to smp_cond_load*(), with the addition of a rate-limited time-check.
>
> Usage
> ==
>
> These interfaces drop straight-forwardly into the rqspinlock logic
> since qspinlock already uses smp_cond_load*(), and the time-check
> extension can now be used for timeout and deadlock handling.
>
> Using tif_need_resched_relaxed_wait() in poll_idle() removes any
> architectural details allowing arm64 to straight-forwardly support
> that path.
> (However, for efficiency reasons cpuidle/poll_state.c continues to
> depend on ARCH_HAS_CPU_RELAX since that is defined on architectures
> with an optimized architectural primitive.)
>
>
> Performance
> ==
>
> Apart from simplifications due to this change, supporting polling in
> cpuidle on arm64 helps improve wakeup latency (needs a few cpuidle/acpi
> patches):
>
>
>   # perf stat -r 5 --cpu 4,5 -e task-clock,cycles,instructions,sched:sched_wake_idle_without_ipi \
>   perf bench sched pipe -l 1000000 -c 4
>
>   # No haltpoll (and, no TIF_POLLING_NRFLAG):
>
>   Performance counter stats for 'CPU(s) 4,5' (5 runs):
>
>          25,229.57 msec task-clock                       #    2.000 CPUs utilized               ( +-  7.75% )
>     45,821,250,284      cycles                           #    1.816 GHz                         ( +- 10.07% )
>     26,557,496,665      instructions                     #    0.58  insn per cycle              ( +-  0.21% )
>                  0      sched:sched_wake_idle_without_ipi #    0.000 /sec
>
>        12.615 +- 0.977 seconds time elapsed  ( +-  7.75% )
>
>
>   # Haltpoll:
>
>   Performance counter stats for 'CPU(s) 4,5' (5 runs):
>
>          15,131.58 msec task-clock                       #    2.000 CPUs utilized               ( +- 10.00% )
>     34,158,188,839      cycles                           #    2.257 GHz                         ( +-  6.91% )
>     20,824,950,916      instructions                     #    0.61  insn per cycle              ( +-  0.09% )
>          1,983,822      sched:sched_wake_idle_without_ipi #  131.105 K/sec                       ( +-  0.78% )
>
>         7.566 +- 0.756 seconds time elapsed  ( +- 10.00% )
>
>   We get improved latency because we don't switch in and out of a
>   deeper sleep state or from the hypervisor. This also causes us to
>   execute ~20% fewer instructions.
>
>
> Haris Okanovic also saw improvement in real workloads due to the
> cpuidle changes: "observed 4-6% improvements in memcached, cassandra,
> mysql, and postgresql under certain loads. Other applications likely
> benefit too." [12]
>
>
> Changelog:
>   v11 [13] (as listed above):
>     - addressed some review comments from sashiko (see commit notes)
>       - The one notable change is to the implementation of
>         smp_cond_load_acquire_timeout() where there was a missed
>         control dependency in the timeout case.
>       All the others are minor.
>     - fixed a low probability race in the kunit test added in v11.
>     - added a bunch of kunit tests validating the implementation's
>       use of the clock.
>
>   v10 [10]:
>    - add a comment mentioning that smp_cond_load_relaxed_timeout() might
>      be using architectural primitives that don't support MMIO.
>      (David Laight, Catalin Marinas)
>    - added a kunit test for smp_cond_load_relaxed_timeout() (Andrew
>      Morton.)
>
>   v9 [9]:
>    - s/@cond/@cond_expr/ (Randy Dunlap)
>    - Clarify that SMP_TIMEOUT_POLL_COUNT is only around memory
>      addresses. (David Laight)
>    - Add the missing config ARCH_HAS_CPU_RELAX in arch/arm64/Kconfig.
>      (Catalin Marinas).
>    - Switch to arch_counter_get_cntvct_stable() (via __delay_cycles())
>      in the cmpwait path instead of using arch_timer_read_counter().
>      (Catalin Marinas)
>
>   v8 [0]:
>    - Defer evaluation of @time_expr_ns to when we hit the slowpath.
>       (comment from Alexei Starovoitov).
>
>    - Mention that cpu_poll_relax() is better than raw CPU polling
>      only where ARCH_HAS_CPU_RELAX is defined.
>      - also define ARCH_HAS_CPU_RELAX for arm64.
>       (Came out of a discussion with Will Deacon.)
>
>    - Split out WFET and WFE handling. I was doing both of these
>      in a common handler.
>      (From Will Deacon and in an earlier revision by Catalin Marinas.)
>
>    - Add mentions of atomic_cond_read_{relaxed,acquire}(),
>      atomic_cond_read_{relaxed,acquire}_timeout() in
>      Documentation/atomic_t.txt.
>
>    - Use the BIT() macro to do the checking in tif_bitset_relaxed_wait().
>
>    - Cleanup unnecessary assignments, casts etc in poll_idle().
>      (From Rafael Wysocki.)
>
>    - Fixup warnings from kernel build robot
>
>
>   v7 [1]:
>    - change the interface to separately provide the timeout. This is
>      useful for supporting WFET and similar primitives which can do
>      timed waiting (suggested by Arnd Bergmann).
>
>    - Adapting rqspinlock code to this changed interface also
>      necessitated allowing time_expr to fail.
>    - rqspinlock changes to adapt to the new smp_cond_load_acquire_timeout().
>
>    - add WFET support (suggested by Arnd Bergmann).
>    - add support for atomic-long wrappers.
>    - add a new scheduler interface tif_need_resched_relaxed_wait() which
>      encapsulates the polling logic used by poll_idle().
>      - interface suggested by (Rafael J. Wysocki).
>
>
>   v6 [2]:
>    - fixup missing timeout parameters in atomic64_cond_read_*_timeout()
>    - remove a race between setting of TIF_NEED_RESCHED and the call to
>      smp_cond_load_relaxed_timeout(). This would mean that dev->poll_time_limit
>      would be set even if we hadn't spent any time waiting.
>      (The original check compared against local_clock(), which would have been
>      fine, but I was instead using a cheaper check against _TIF_NEED_RESCHED.)
>    (Both from meta-CI bot)
>
>
>   v5 [3]:
>    - use cpu_poll_relax() instead of cpu_relax().
>    - instead of defining an arm64 specific
>      smp_cond_load_relaxed_timeout(), just define the appropriate
>      cpu_poll_relax().
>    - re-read the target pointer when we exit due to the time-check.
>    - s/SMP_TIMEOUT_SPIN_COUNT/SMP_TIMEOUT_POLL_COUNT/
>    (Suggested by Will Deacon)
>
>    - add atomic_cond_read_*_timeout() and atomic64_cond_read_*_timeout()
>      interfaces.
>    - rqspinlock: use atomic_cond_read_acquire_timeout().
>    - cpuidle: use smp_cond_load_relaxed_timeout() for polling.
>    (Suggested by Catalin Marinas)
>
>    - rqspinlock: define SMP_TIMEOUT_POLL_COUNT to be 16k for non arm64
>
>
>   v4 [4]:
>     - naming change 's/timewait/timeout/'
>     - resilient spinlocks: get rid of res_smp_cond_load_acquire_waiting()
>       and fixup use of RES_CHECK_TIMEOUT().
>     (Both suggested by Catalin Marinas)
>
>   v3 [5]:
>     - further interface simplifications (suggested by Catalin Marinas)
>
>   v2 [6]:
>     - simplified the interface (suggested by Catalin Marinas)
>        - get rid of wait_policy, and a multitude of constants
>        - adds a slack parameter
>       This helped remove a fair amount of duplicated code and, in
>       hindsight, unnecessary constants.
>
>   v1 [7]:
>      - add wait_policy (coarse and fine)
>      - derive spin-count etc at runtime instead of using arbitrary
>        constants.
>
> Haris Okanovic tested v4 of this series with poll_idle()/haltpoll patches. [8]
>
> Comments appreciated!
>
> Thanks
> Ankur
>
>  [0] https://lore.kernel.org/lkml/20251215044919.460086-1-ankur.a.arora@oracle.com/
>  [1] https://lore.kernel.org/lkml/20251028053136.692462-1-ankur.a.arora@oracle.com/
>  [2] https://lore.kernel.org/lkml/20250911034655.3916002-1-ankur.a.arora@oracle.com/
>  [3] https://lore.kernel.org/lkml/20250911034655.3916002-1-ankur.a.arora@oracle.com/
>  [4] https://lore.kernel.org/lkml/20250829080735.3598416-1-ankur.a.arora@oracle.com/
>  [5] https://lore.kernel.org/lkml/20250627044805.945491-1-ankur.a.arora@oracle.com/
>  [6] https://lore.kernel.org/lkml/20250502085223.1316925-1-ankur.a.arora@oracle.com/
>  [7] https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com/
>  [8] https://lore.kernel.org/lkml/2cecbf7fb23ee83a4ce027e1be3f46f97efd585c.camel@amazon.com/
>  [9] https://lore.kernel.org/lkml/20260209023153.2661784-1-ankur.a.arora@oracle.com/
>  [10] https://lore.kernel.org/lkml/20260316013651.3225328-1-ankur.a.arora@oracle.com/
>  [11] https://lore.kernel.org/lkml/20230809134837.GM212435@hirez.programming.kicks-ass.net/
>  [12] https://lore.kernel.org/lkml/c6f3c8d3f1f2e89a9dc7ae22482973b5a51b08cb.camel@amazon.com/
>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: bpf@vger.kernel.org
> Cc: linux-arch@vger.kernel.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-pm@vger.kernel.org
>
> Ankur Arora (15):
>   asm-generic: barrier: Add smp_cond_load_relaxed_timeout()
>   arm64: barrier: Support smp_cond_load_relaxed_timeout()
>   arm64/delay: move some constants out to a separate header
>   arm64: support WFET in smp_cond_load_relaxed_timeout()
>   arm64: rqspinlock: Remove private copy of
>     smp_cond_load_acquire_timewait()
>   asm-generic: barrier: Add smp_cond_load_acquire_timeout()
>   atomic: Add atomic_cond_read_*_timeout()
>   locking/atomic: scripts: build atomic_long_cond_read_*_timeout()
>   bpf/rqspinlock: switch check_timeout() to a clock interface
>   bpf/rqspinlock: Use smp_cond_load_acquire_timeout()
>   sched: add need-resched timed wait interface
>   cpuidle/poll_state: Wait for need-resched via
>     tif_need_resched_relaxed_wait()
>   arm64/delay: enable testing smp_cond_load_relaxed_timeout()
>   barrier: add tests for smp_cond_load_*_timeout()
>   barrier: add clock tests for smp_cond_load_relaxed_timeout()
>
>  Documentation/atomic_t.txt           |  14 +-
>  arch/arm64/Kconfig                   |   3 +
>  arch/arm64/include/asm/barrier.h     |  23 ++++
>  arch/arm64/include/asm/cmpxchg.h     |  62 +++++++--
>  arch/arm64/include/asm/delay-const.h |  28 ++++
>  arch/arm64/include/asm/rqspinlock.h  |  85 ------------
>  arch/arm64/lib/delay.c               |  17 +--
>  drivers/clocksource/arm_arch_timer.c |   2 +
>  drivers/cpuidle/poll_state.c         |  21 +--
>  drivers/soc/qcom/rpmh-rsc.c          |   8 +-
>  include/asm-generic/barrier.h        |  97 ++++++++++++++
>  include/linux/atomic.h               |  10 ++
>  include/linux/atomic/atomic-long.h   |  18 ++-
>  include/linux/sched/idle.h           |  29 +++++
>  kernel/bpf/rqspinlock.c              |  77 +++++++----
>  lib/Kconfig.debug                    |  10 ++
>  lib/tests/Makefile                   |   1 +
>  lib/tests/barrier-timeout-test.c     | 185 +++++++++++++++++++++++++++
>  scripts/atomic/gen-atomic-long.sh    |  16 ++-
>  19 files changed, 528 insertions(+), 178 deletions(-)
>  create mode 100644 arch/arm64/include/asm/delay-const.h
>  create mode 100644 lib/tests/barrier-timeout-test.c


--
ankur
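
The quoted cover letter describes smp_cond_load_relaxed_timeout() as
smp_cond_load*() plus a rate-limited time check. A rough userspace analogue
with C11 atomics might look like the sketch below. It is not the kernel
macro: cond_load_relaxed_timeout(), now_ns() and the POLL_COUNT value are
assumptions made for illustration, the condition is reduced to a bit mask,
and a plain loop stands in for cpu_poll_relax(). It follows two points from
the changelog: the clock is first read only on the slow path, and the target
is re-read when the loop exits because of the time check.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Arbitrary value for this sketch: number of polls between clock reads. */
#define POLL_COUNT 200u

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Poll *ptr until (value & mask) is non-zero or roughly timeout_ns has
 * passed. Returns the last value loaded; the caller re-checks the mask to
 * tell success from timeout.
 */
static uint32_t cond_load_relaxed_timeout(_Atomic uint32_t *ptr, uint32_t mask,
					  uint64_t timeout_ns)
{
	uint64_t deadline = 0;
	bool armed = false;
	unsigned int polls = 0;
	uint32_t val;

	assert(mask != 0);
	for (;;) {
		val = atomic_load_explicit(ptr, memory_order_relaxed);
		if (val & mask)
			break;
		if (++polls < POLL_COUNT)
			continue;	/* fast path: no clock read */
		polls = 0;
		if (!armed) {
			/* The clock is first read only once we get here. */
			deadline = now_ns() + timeout_ns;
			armed = true;
		} else if (now_ns() >= deadline) {
			/* Re-read the target before giving up. */
			val = atomic_load_explicit(ptr, memory_order_relaxed);
			break;
		}
	}
	return val;
}
```

Because the deadline is armed on the first slow-path pass, the effective
wait in this sketch is the timeout plus up to POLL_COUNT polls.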


* Re: [PATCH v12 02/15] arm64: barrier: Support smp_cond_load_relaxed_timeout()
From: Ankur Arora @ 2026-06-10  6:32 UTC
  To: bot+bpf-ci
  Cc: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf, arnd,
	catalin.marinas, will, peterz, akpm, mark.rutland, harisokn, cl,
	rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai, rdunlap,
	david.laight.linux, broonie, joao.m.martins, boris.ostrovsky,
	konrad.wilk, ashok.bhat, ankur.a.arora, ast, andrii, daniel,
	martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai
In-Reply-To: <23edac0734ee3ef334237165500d8f34e3c649081db7b3140a4a8596a202f339@mail.kernel.org>


bot+bpf-ci@kernel.org writes:

>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index fe60738e5943..fa676428ec3f 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -1606,6 +1606,9 @@ config ARCH_SUPPORTS_CRASH_DUMP
>>  config ARCH_DEFAULT_CRASH_DUMP
>>  	def_bool y
>>
>> +config ARCH_HAS_CPU_RELAX
>> +	def_bool y
>> +
>
> ARCH_HAS_CPU_RELAX isn't private to the barrier code.  Before this it was
> defined only by arch/x86, and the generic kernel already keys cpuidle
> behavior off it in two places that arm64 builds.
>
> In drivers/cpuidle/Makefile:
>
>   obj-$(CONFIG_ARCH_HAS_CPU_RELAX)	  += poll_state.o
>
> and in drivers/acpi/processor_idle.c:
>
>   #define ACPI_IDLE_STATE_START	(IS_ENABLED(CONFIG_ARCH_HAS_CPU_RELAX) ? 1 : 0)
>   ...
>   if (IS_ENABLED(CONFIG_ARCH_HAS_CPU_RELAX)) {
>           cpuidle_poll_state_init(drv);
>           count = 1;
>   }
>
> On arm64, ACPI_PROCESSOR is "default y" and depends on ARM64, and it
> selects ACPI_PROCESSOR_IDLE which selects CPU_IDLE.  So on an ACPI arm64
> system this now builds poll_state.o and makes acpi_processor_setup_cstates()
> register a POLL idle state as C-state index 0, shifting the firmware
> C-states up by one.

Yes, this is a bug. The additional POLL idle state wasn't intended
to be registered as part of this patch.

With this change the first c-state ends up executing a NOP poll_idle(),
since current_set_polling_and_test() reduces to tif_need_resched() when
TIF_POLLING_NRFLAG is not defined. Even so, this unnecessarily registers
the additional c-state.

I missed that in my testing because my test machine has LPI states which
need an explicit change to enable this.
(https://lore.kernel.org/lkml/20250218213337.377987-7-ankur.a.arora@oracle.com/)

Ankur


> The commit message describes this symbol only as indicating "an optimized
> implementation of cpu_poll_relax()", and the generic patch describes it as
> "a cpu_poll_relax() implementation that is expected to be cheaper".  Neither
> mentions the cpuidle/ACPI poll_state side effect.
>
>
> Is enabling the cpuidle polling idle state on arm64 intended here, or is
> ARCH_HAS_CPU_RELAX only meant to feed smp_cond_load_relaxed_timeout()?  If
> the latter, would a dedicated symbol (or decoupling poll_state/ACPI from
> ARCH_HAS_CPU_RELAX) avoid the unintended idle-path change?

>
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27125050324


--
ankur
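
The index shift the bot describes can be modelled in a few lines of plain C.
This is a toy model, not the ACPI driver code: struct idle_table,
setup_states() and state_index() are hypothetical names, and the boolean
parameter stands in for IS_ENABLED(CONFIG_ARCH_HAS_CPU_RELAX).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_STATES 8

/* Hypothetical stand-in for a cpuidle driver's state table. */
struct idle_table {
	const char *name[MAX_STATES];
	size_t count;
};

/*
 * Build the table: an optional POLL state first, then the firmware
 * provided C-states in order.
 */
static void setup_states(struct idle_table *t, bool arch_has_cpu_relax,
			 const char *const *fw_states, size_t nr_fw)
{
	size_t i;

	t->count = 0;
	if (arch_has_cpu_relax)
		t->name[t->count++] = "POLL";

	assert(t->count + nr_fw <= MAX_STATES);
	for (i = 0; i < nr_fw; i++)
		t->name[t->count++] = fw_states[i];
}

/* Return the index of the named state, or -1 if it is not registered. */
static int state_index(const struct idle_table *t, const char *name)
{
	size_t i;

	for (i = 0; i < t->count; i++)
		if (strcmp(t->name[i], name) == 0)
			return (int)i;
	return -1;
}
```

With the flag off, the first firmware state sits at index 0. With it on,
POLL takes index 0 and every firmware state moves up by one, which is the
unintended change discussed in this reply.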


* Re: [PATCH] rqspinlock: Fix order in raw_res_spin_(un)lock_irq to allow schedule
From: Waiman Long @ 2026-06-10  0:06 UTC
  To: Arnd Bergmann, Gabriele Monaco, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, bpf, Linux-Arch, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
In-Reply-To: <d40ba64d-78d9-45f5-99b9-4bfb1fc27f6c@app.fastmail.com>

On 6/9/26 7:22 AM, Arnd Bergmann wrote:
> On Tue, Jun 9, 2026, at 11:49, Gabriele Monaco wrote:
>> raw_res_spin_unlock_irqrestore() calls raw_res_spin_unlock() and then
>> restores interrupts, this means preemption is enabled when interrupts
>> are still disabled (as part of raw_res_spin_unlock()) so this cannot
>> trigger an actual preemption.
>> This is inconsistent with other spinlock implementations
>> (raw_spin_unlock_irqrestore() and bpf_res_spin_unlock_irqrestore()
>> itself).
>>
>> Adjust the macro to ensure interrupts are enabled before enabling
>> preemption, allowing to schedule at that point. Make the same
>> modification in the error path of raw_res_spin_lock_irqsave().
>>
>> Fixes: 101acd2e78b1 ("rqspinlock: Add macros for rqspinlock usage")
> Should this be Cc:stable@vger.kernel.org to get backported?
>
> Did you see this cause measurable performance problems,
> or did you find it through inspection?
>
>> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> Acked-by: Arnd Bergmann <arnd@arndb.de> # asm-generic
>
> This should probably get merged through the BPF tree, but I've
> added the kernel/locking maintainers to Cc as well, since I
> feel it's more useful to have them look at it than me.
>
> Maybe it would be good to update (as a separate patch) the
> MAINTAINERS file so the locking subsystem also includes the
> headers currently missing:
>
> arch/*/include/asm/*spinlock*.h
> arch/*/include/asm/*rwlock*.h
> include/asm-generic/*spinlock*.h
> include/asm-generic/*rwlock*.h
>
>         Arnd
>
> (full patch quoted below)
>
>> ---
>>   include/asm-generic/rqspinlock.h | 14 +++++++++++---
>>   1 file changed, 11 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/asm-generic/rqspinlock.h
>> b/include/asm-generic/rqspinlock.h
>> index 151d267a49..4d46643f46 100644
>> --- a/include/asm-generic/rqspinlock.h
>> +++ b/include/asm-generic/rqspinlock.h
>> @@ -243,12 +243,20 @@ static __always_inline void
>> res_spin_unlock(rqspinlock_t *lock)
>>   	({                                        \
>>   		int __ret;                        \
>>   		local_irq_save(flags);            \
>> -		__ret = raw_res_spin_lock(lock);  \
>> -		if (__ret)                        \
>> +		preempt_disable();                \
>> +		__ret = res_spin_lock(lock);      \
>> +		if (__ret) {                      \
>>   			local_irq_restore(flags); \
>> +			preempt_enable();         \
>> +		}                                 \
>>   		__ret;                            \
>>   	})
>>
>> -#define raw_res_spin_unlock_irqrestore(lock, flags) ({
>> raw_res_spin_unlock(lock); local_irq_restore(flags); })
>> +#define raw_res_spin_unlock_irqrestore(lock, flags) \
>> +	({                                          \
>> +		res_spin_unlock(lock);              \
>> +		local_irq_restore(flags);           \
>> +		preempt_enable();                   \
>> +	})
>>
>>   #endif /* __ASM_GENERIC_RQSPINLOCK_H */
>>
>> base-commit: e43ffb69e0438cddd72aaa30898b4dc446f664f8
>> -- 
>> 2.54.0

Disabling interrupts also disables preemption. However, allowing the
scheduler to preempt at the point where preemption is re-enabled makes
these res_spin_lock APIs behave more like regular spinlocks. So

Acked-by: Waiman Long <longman@redhat.com>
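
The ordering difference discussed in this thread can be shown with a small
state model. It is only a sketch: struct cpu_model and the model_*()
helpers are invented for illustration and reduce the kernel's behaviour to
one rule, namely that dropping the preempt count to zero reschedules only
if a reschedule is pending and interrupts are enabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy per-CPU state; not a kernel structure. */
struct cpu_model {
	bool irqs_disabled;
	int preempt_count;
	bool need_resched;
	int reschedules;
};

/* Reschedule only when the count hits zero with interrupts enabled. */
static void model_preempt_enable(struct cpu_model *c)
{
	assert(c->preempt_count > 0);
	if (--c->preempt_count == 0 && c->need_resched && !c->irqs_disabled) {
		c->need_resched = false;
		c->reschedules++;
	}
}

static void model_lock_irqsave(struct cpu_model *c)
{
	c->irqs_disabled = true;
	c->preempt_count++;
}

/* Old ordering: drop the preempt count first, then restore interrupts. */
static void model_unlock_old(struct cpu_model *c)
{
	model_preempt_enable(c);
	c->irqs_disabled = false;
}

/* Patched ordering: restore interrupts first, then drop the preempt count. */
static void model_unlock_new(struct cpu_model *c)
{
	c->irqs_disabled = false;
	model_preempt_enable(c);
}
```

In the model, a reschedule request that arrives while the lock is held is
left pending by model_unlock_old() and acted on by model_unlock_new().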



* [RFC PATCH 6/6] arm64: hyperv: Implement hv_is_isolation_supported() for CCA Realms
From: Kameron Carr @ 2026-06-09 18:10 UTC
  To: kys, haiyangz, wei.liu, decui, longli
  Cc: catalin.marinas, will, mark.rutland, lpieralisi, sudeep.holla,
	arnd, thuth, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-arch, mhklinux
In-Reply-To: <20260609181030.2378391-1-kameroncarr@linux.microsoft.com>

Provide an arm64 implementation of hv_is_isolation_supported() that
overrides the __weak default in drivers/hv/hv_common.c.

The implementation deliberately does not depend on
hv_is_hyperv_initialized() because hv_common_init() consults
hv_is_isolation_supported() before hyperv_initialized is set.

Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
---
 arch/arm64/hyperv/mshyperv.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/arm64/hyperv/mshyperv.c b/arch/arm64/hyperv/mshyperv.c
index b595b2b9bdbbb..b9b1c2f8e3ec7 100644
--- a/arch/arm64/hyperv/mshyperv.c
+++ b/arch/arm64/hyperv/mshyperv.c
@@ -213,3 +213,8 @@ bool hv_isolation_type_cca(void)
 {
 	return is_realm_world();
 }
+
+bool hv_is_isolation_supported(void)
+{
+	return is_realm_world();
+}
-- 
2.45.4



* [RFC PATCH 5/6] arm64: hyperv: Route hypercalls through RSI host call in CCA Realms
From: Kameron Carr @ 2026-06-09 18:10 UTC
  To: kys, haiyangz, wei.liu, decui, longli
  Cc: catalin.marinas, will, mark.rutland, lpieralisi, sudeep.holla,
	arnd, thuth, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-arch, mhklinux
In-Reply-To: <20260609181030.2378391-1-kameroncarr@linux.microsoft.com>

Modify the five hypercall wrapper functions to check is_realm_world()
and use the per-CPU rsi_host_call structure when inside a Realm.

Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
---
 arch/arm64/hyperv/hv_core.c | 175 +++++++++++++++++++++++++++++-------
 1 file changed, 141 insertions(+), 34 deletions(-)

diff --git a/arch/arm64/hyperv/hv_core.c b/arch/arm64/hyperv/hv_core.c
index e33a9e3c366a1..1759998ef2667 100644
--- a/arch/arm64/hyperv/hv_core.c
+++ b/arch/arm64/hyperv/hv_core.c
@@ -16,6 +16,7 @@
 #include <asm-generic/bug.h>
 #include <hyperv/hvhdk.h>
 #include <asm/mshyperv.h>
+#include <asm/rsi.h>
 
 /*
  * hv_do_hypercall- Invoke the specified hypercall
@@ -25,12 +26,32 @@ u64 hv_do_hypercall(u64 control, void *input, void *output)
 	struct arm_smccc_res	res;
 	u64			input_address;
 	u64			output_address;
+	struct rsi_host_call *hostcall;
+	unsigned long flags;
+	u64 ret;
 
 	input_address = input ? virt_to_phys(input) : 0;
 	output_address = output ? virt_to_phys(output) : 0;
 
-	arm_smccc_1_1_hvc(HV_FUNC_ID, control,
-			  input_address, output_address, &res);
+	if (is_realm_world()) {
+		local_irq_save(flags);
+		hostcall = *this_cpu_ptr(hyperv_pcpu_hostcall_struct);
+		memset(hostcall, 0, sizeof(*hostcall));
+		hostcall->gprs[0] = HV_FUNC_ID;
+		hostcall->gprs[1] = control;
+		hostcall->gprs[2] = input_address;
+		hostcall->gprs[3] = output_address;
+
+		if (rsi_host_call(virt_to_phys(hostcall)) == RSI_SUCCESS)
+			ret = hostcall->gprs[0];
+		else
+			ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+		local_irq_restore(flags);
+		return ret;
+	}
+
+	arm_smccc_1_1_hvc(HV_FUNC_ID, control, input_address,
+			  output_address, &res);
 	return res.a0;
 }
 EXPORT_SYMBOL_GPL(hv_do_hypercall);
@@ -45,9 +66,28 @@ u64 hv_do_fast_hypercall8(u16 code, u64 input)
 {
 	struct arm_smccc_res	res;
 	u64			control;
+	struct rsi_host_call *hostcall;
+	unsigned long flags;
+	u64 ret;
 
 	control = (u64)code | HV_HYPERCALL_FAST_BIT;
 
+	if (is_realm_world()) {
+		local_irq_save(flags);
+		hostcall = *this_cpu_ptr(hyperv_pcpu_hostcall_struct);
+		memset(hostcall, 0, sizeof(*hostcall));
+		hostcall->gprs[0] = HV_FUNC_ID;
+		hostcall->gprs[1] = control;
+		hostcall->gprs[2] = input;
+
+		if (rsi_host_call(virt_to_phys(hostcall)) == RSI_SUCCESS)
+			ret = hostcall->gprs[0];
+		else
+			ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+		local_irq_restore(flags);
+		return ret;
+	}
+
 	arm_smccc_1_1_hvc(HV_FUNC_ID, control, input, &res);
 	return res.a0;
 }
@@ -62,9 +102,29 @@ u64 hv_do_fast_hypercall16(u16 code, u64 input1, u64 input2)
 {
 	struct arm_smccc_res	res;
 	u64			control;
+	struct rsi_host_call *hostcall;
+	unsigned long flags;
+	u64 ret;
 
 	control = (u64)code | HV_HYPERCALL_FAST_BIT;
 
+	if (is_realm_world()) {
+		local_irq_save(flags);
+		hostcall = *this_cpu_ptr(hyperv_pcpu_hostcall_struct);
+		memset(hostcall, 0, sizeof(*hostcall));
+		hostcall->gprs[0] = HV_FUNC_ID;
+		hostcall->gprs[1] = control;
+		hostcall->gprs[2] = input1;
+		hostcall->gprs[3] = input2;
+
+		if (rsi_host_call(virt_to_phys(hostcall)) == RSI_SUCCESS)
+			ret = hostcall->gprs[0];
+		else
+			ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+		local_irq_restore(flags);
+		return ret;
+	}
+
 	arm_smccc_1_1_hvc(HV_FUNC_ID, control, input1, input2, &res);
 	return res.a0;
 }
@@ -76,24 +136,44 @@ EXPORT_SYMBOL_GPL(hv_do_fast_hypercall16);
 void hv_set_vpreg(u32 msr, u64 value)
 {
 	struct arm_smccc_res res;
+	struct rsi_host_call *hostcall;
+	unsigned long flags;
+	u64 status;
+
+	if (is_realm_world()) {
+		local_irq_save(flags);
+		hostcall = *this_cpu_ptr(hyperv_pcpu_hostcall_struct);
+		memset(hostcall, 0, sizeof(*hostcall));
+		hostcall->gprs[0] = HV_FUNC_ID;
+		hostcall->gprs[1] = HVCALL_SET_VP_REGISTERS |
+				    HV_HYPERCALL_FAST_BIT |
+				    HV_HYPERCALL_REP_COMP_1;
+		hostcall->gprs[2] = HV_PARTITION_ID_SELF;
+		hostcall->gprs[3] = HV_VP_INDEX_SELF;
+		hostcall->gprs[4] = msr;
+		hostcall->gprs[6] = value;
 
-	arm_smccc_1_1_hvc(HV_FUNC_ID,
-		HVCALL_SET_VP_REGISTERS | HV_HYPERCALL_FAST_BIT |
-			HV_HYPERCALL_REP_COMP_1,
-		HV_PARTITION_ID_SELF,
-		HV_VP_INDEX_SELF,
-		msr,
-		0,
-		value,
-		0,
-		&res);
+		if (rsi_host_call(virt_to_phys(hostcall)) == RSI_SUCCESS)
+			status = hostcall->gprs[0];
+		else
+			status = HV_STATUS_INVALID_HYPERCALL_INPUT;
+		local_irq_restore(flags);
+	} else {
+		arm_smccc_1_1_hvc(HV_FUNC_ID,
+				  HVCALL_SET_VP_REGISTERS |
+					  HV_HYPERCALL_FAST_BIT |
+					  HV_HYPERCALL_REP_COMP_1,
+				  HV_PARTITION_ID_SELF, HV_VP_INDEX_SELF, msr,
+				  0, value, 0, &res);
+		status = res.a0;
+	}
 
 	/*
-	 * Something is fundamentally broken in the hypervisor if
-	 * setting a VP register fails. There's really no way to
-	 * continue as a guest VM, so panic.
+	 * Something is fundamentally broken in the hypervisor (or, in a
+	 * Realm, the RMM denied the host call) if setting a VP register
+	 * fails. There's really no way to continue as a guest VM, so panic.
 	 */
-	BUG_ON(!hv_result_success(res.a0));
+	BUG_ON(!hv_result_success(status));
 }
 EXPORT_SYMBOL_GPL(hv_set_vpreg);
 
@@ -108,29 +188,56 @@ void hv_get_vpreg_128(u32 msr, struct hv_get_vp_registers_output *result)
 {
 	struct arm_smccc_1_2_regs args;
 	struct arm_smccc_1_2_regs res;
+	struct rsi_host_call *hostcall;
+	u64 status;
 
-	args.a0 = HV_FUNC_ID;
-	args.a1 = HVCALL_GET_VP_REGISTERS | HV_HYPERCALL_FAST_BIT |
-			HV_HYPERCALL_REP_COMP_1;
-	args.a2 = HV_PARTITION_ID_SELF;
-	args.a3 = HV_VP_INDEX_SELF;
-	args.a4 = msr;
+	if (is_realm_world()) {
+		unsigned long flags;
 
-	/*
-	 * Use the SMCCC 1.2 interface because the results are in registers
-	 * beyond X0-X3.
-	 */
-	arm_smccc_1_2_hvc(&args, &res);
+		local_irq_save(flags);
+		hostcall = *this_cpu_ptr(hyperv_pcpu_hostcall_struct);
+		memset(hostcall, 0, sizeof(*hostcall));
+
+		hostcall->gprs[0] = HV_FUNC_ID;
+		hostcall->gprs[1] = HVCALL_GET_VP_REGISTERS |
+				    HV_HYPERCALL_FAST_BIT |
+				    HV_HYPERCALL_REP_COMP_1;
+		hostcall->gprs[2] = HV_PARTITION_ID_SELF;
+		hostcall->gprs[3] = HV_VP_INDEX_SELF;
+		hostcall->gprs[4] = msr;
+
+		if (rsi_host_call(virt_to_phys(hostcall)) == RSI_SUCCESS) {
+			status = hostcall->gprs[0];
+			result->as64.low = hostcall->gprs[6];
+			result->as64.high = hostcall->gprs[7];
+		} else {
+			status = HV_STATUS_INVALID_HYPERCALL_INPUT;
+		}
+		local_irq_restore(flags);
+	} else {
+		args.a0 = HV_FUNC_ID;
+		args.a1 = HVCALL_GET_VP_REGISTERS | HV_HYPERCALL_FAST_BIT |
+			  HV_HYPERCALL_REP_COMP_1;
+		args.a2 = HV_PARTITION_ID_SELF;
+		args.a3 = HV_VP_INDEX_SELF;
+		args.a4 = msr;
+
+		/*
+		 * Use the SMCCC 1.2 interface because the results are in
+		 * registers beyond X0-X3.
+		 */
+		arm_smccc_1_2_hvc(&args, &res);
+		status = res.a0;
+		result->as64.low = res.a6;
+		result->as64.high = res.a7;
+	}
 
 	/*
-	 * Something is fundamentally broken in the hypervisor if
-	 * getting a VP register fails. There's really no way to
-	 * continue as a guest VM, so panic.
+	 * Something is fundamentally broken in the hypervisor (or, in a
+	 * Realm, the RMM denied the host call) if getting a VP register
+	 * fails. There's really no way to continue as a guest VM, so panic.
 	 */
-	BUG_ON(!hv_result_success(res.a0));
-
-	result->as64.low = res.a6;
-	result->as64.high = res.a7;
+	BUG_ON(!hv_result_success(status));
 }
 EXPORT_SYMBOL_GPL(hv_get_vpreg_128);
 
-- 
2.45.4


^ permalink raw reply related

* [RFC PATCH 4/6] Drivers: hv: Mark shared memory as decrypted for CCA Realms
From: Kameron Carr @ 2026-06-09 18:10 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli
  Cc: catalin.marinas, will, mark.rutland, lpieralisi, sudeep.holla,
	arnd, thuth, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-arch, mhklinux
In-Reply-To: <20260609181030.2378391-1-kameroncarr@linux.microsoft.com>

In hv_common_cpu_init(), the per-CPU hypercall input/output pages need
to be marked as decrypted (shared) for confidential VM isolation types.
This is already done for SNP and TDX isolation; extend the same handling
to Arm CCA Realm guests so that the host hypervisor can access the
shared hypercall buffers.

is_realm_world() is only declared in arch/arm64/include/asm/rsi.h, so
using it directly in the arch-neutral drivers/hv/hv_common.c would
break the x86 build. Introduce a Hyper-V-specific helper following the
established hv_isolation_type_snp() / hv_isolation_type_tdx() pattern.

On architectures other than arm64 the weak default keeps the existing
behaviour.
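
As a rough illustration of the weak-default pattern described above, here
is a userspace sketch. It is not the kernel code: the function names are
made up for the example, and it assumes the GCC/Clang `weak` attribute
rather than the kernel's __weak macro. A generic file supplies a fallback
that an architecture may override with a strong definition:

```c
#include <stdbool.h>

/*
 * Hypothetical userspace stand-in for the generic fallback. With no strong
 * definition linked in, this weak one is used and reports "not a CCA guest".
 */
__attribute__((weak)) bool isolation_type_cca_sketch(void)
{
	return false;
}

/* Caller in "common" code: it needs only the prototype, no arch header. */
bool needs_decrypted_buffers_sketch(bool snp, bool tdx)
{
	return snp || tdx || isolation_type_cca_sketch();
}
```

If another object file defined isolation_type_cca_sketch() without the weak
attribute, the linker would pick that definition instead, which is the role
the arm64 implementation plays in this patch.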

Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
---
 arch/arm64/hyperv/mshyperv.c   | 5 +++++
 drivers/hv/hv_common.c         | 9 ++++++++-
 include/asm-generic/mshyperv.h | 1 +
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/hyperv/mshyperv.c b/arch/arm64/hyperv/mshyperv.c
index 08fec82691683..b595b2b9bdbbb 100644
--- a/arch/arm64/hyperv/mshyperv.c
+++ b/arch/arm64/hyperv/mshyperv.c
@@ -208,3 +208,8 @@ bool hv_is_hyperv_initialized(void)
 	return hyperv_initialized;
 }
 EXPORT_SYMBOL_GPL(hv_is_hyperv_initialized);
+
+bool hv_isolation_type_cca(void)
+{
+	return is_realm_world();
+}
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 6b67ac6167891..010c7d98b5de1 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -499,7 +499,8 @@ int hv_common_cpu_init(unsigned int cpu)
 		}
 
 		if (!ms_hyperv.paravisor_present &&
-		    (hv_isolation_type_snp() || hv_isolation_type_tdx())) {
+		    (hv_isolation_type_snp() || hv_isolation_type_tdx() ||
+		     hv_isolation_type_cca())) {
 			ret = set_memory_decrypted((unsigned long)mem, pgcount);
 			if (ret) {
 				/* It may be unsafe to free 'mem' */
@@ -666,6 +667,12 @@ bool __weak hv_isolation_type_tdx(void)
 }
 EXPORT_SYMBOL_GPL(hv_isolation_type_tdx);
 
+bool __weak hv_isolation_type_cca(void)
+{
+	return false;
+}
+EXPORT_SYMBOL_GPL(hv_isolation_type_cca);
+
 void __weak hv_setup_vmbus_handler(void (*handler)(void))
 {
 }
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index bf601d67cecb9..1fa79abce743c 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -79,6 +79,7 @@ u64 hv_do_fast_hypercall16(u16 control, u64 input1, u64 input2);
 
 bool hv_isolation_type_snp(void);
 bool hv_isolation_type_tdx(void);
+bool hv_isolation_type_cca(void);
 
 /*
  * On architectures where Hyper-V doesn't support AEOI (e.g., ARM64),
-- 
2.45.4


^ permalink raw reply related

* [RFC PATCH 3/6] arm64: hyperv: Add per-CPU RSI host call infrastructure for CCA Realms
From: Kameron Carr @ 2026-06-09 18:10 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli
  Cc: catalin.marinas, will, mark.rutland, lpieralisi, sudeep.holla,
	arnd, thuth, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-arch, mhklinux
In-Reply-To: <20260609181030.2378391-1-kameroncarr@linux.microsoft.com>

Arm CCA Realms cannot issue Hyper-V hypercalls via HVC; the guest must
route them through the RSI_HOST_CALL interface, which takes the IPA of a
per-CPU rsi_host_call structure as its argument.

Add hyperv_pcpu_hostcall_struct as a per-CPU pointer to that buffer and
allocate it for the boot CPU during hyperv_init() and for each secondary
CPU in hv_cpu_init(). The allocation is gated on is_realm_world() so
non-Realm arm64 Hyper-V guests pay no memory cost.
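
The allocate-once-and-reuse behaviour can be sketched in userspace roughly
as follows. This is only an illustration under stated assumptions: a plain
array stands in for the per-CPU pointer, calloc() stands in for the kernel
allocator (so the 256-byte alignment of the real structure is not
modelled), and all names are hypothetical.

```c
#include <stdlib.h>

#define NR_SLOTS_SKETCH 4	/* stand-in for the number of possible CPUs */
#define BUF_SIZE_SKETCH 256	/* size of one host call structure */

/* One pointer per "CPU"; NULL until that CPU first comes online. */
static void *hostcall_slot_sketch[NR_SLOTS_SKETCH];

/* Returns 0 on success, -1 on a bad index or allocation failure. */
int cpu_online_sketch(unsigned int cpu)
{
	if (cpu >= NR_SLOTS_SKETCH)
		return -1;

	/* A buffer left over from an earlier online/offline cycle is reused. */
	if (!hostcall_slot_sketch[cpu]) {
		void *mem = calloc(1, BUF_SIZE_SKETCH);

		if (!mem)
			return -1;
		hostcall_slot_sketch[cpu] = mem;
	}
	return 0;
}

/* Accessor used to observe which buffer a slot currently holds. */
void *hostcall_buf_sketch(unsigned int cpu)
{
	return cpu < NR_SLOTS_SKETCH ? hostcall_slot_sketch[cpu] : NULL;
}
```

Calling cpu_online_sketch() twice for the same index leaves the first
buffer in place, mirroring how a re-onlined CPU keeps its earlier
allocation.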

Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
---
 arch/arm64/hyperv/mshyperv.c      | 78 ++++++++++++++++++++++++++++++-
 arch/arm64/include/asm/mshyperv.h |  3 ++
 2 files changed, 79 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/hyperv/mshyperv.c b/arch/arm64/hyperv/mshyperv.c
index 4fdc26ade1d74..08fec82691683 100644
--- a/arch/arm64/hyperv/mshyperv.c
+++ b/arch/arm64/hyperv/mshyperv.c
@@ -15,10 +15,16 @@
 #include <linux/errno.h>
 #include <linux/version.h>
 #include <linux/cpuhotplug.h>
+#include <linux/slab.h>
+#include <linux/percpu.h>
 #include <asm/mshyperv.h>
+#include <asm/rsi.h>
 
 static bool hyperv_initialized;
 
+void * __percpu *hyperv_pcpu_hostcall_struct;
+EXPORT_SYMBOL_GPL(hyperv_pcpu_hostcall_struct);
+
 int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
 {
 	hv_get_vpreg_128(HV_REGISTER_HYPERVISOR_VERSION,
@@ -60,6 +66,46 @@ static bool __init hyperv_detect_via_acpi(void)
 
 #endif
 
+static void hv_hostcall_free(void)
+{
+	int cpu;
+
+	if (!hyperv_pcpu_hostcall_struct)
+		return;
+
+	for_each_possible_cpu(cpu)
+		kfree(*per_cpu_ptr(hyperv_pcpu_hostcall_struct, cpu));
+	free_percpu(hyperv_pcpu_hostcall_struct);
+	hyperv_pcpu_hostcall_struct = NULL;
+}
+
+static int hv_cpu_init(unsigned int cpu)
+{
+	void **hostcall_struct;
+	gfp_t flags;
+	void *mem;
+
+	if (hyperv_pcpu_hostcall_struct) {
+		/* hv_cpu_init() can be called with IRQs disabled from hv_resume() */
+		flags = irqs_disabled() ? GFP_ATOMIC : GFP_KERNEL;
+
+		hostcall_struct = (void **)this_cpu_ptr(hyperv_pcpu_hostcall_struct);
+		/*
+		 * The hostcall_struct memory is not freed when the CPU
+		 * goes offline. If a previously offlined CPU is brought
+		 * back online, the memory is reused here.
+		 */
+		if (!*hostcall_struct) {
+			mem = kzalloc_obj(struct rsi_host_call, flags);
+			if (!mem)
+				return -ENOMEM;
+			*hostcall_struct = mem;
+		}
+	}
+
+	return hv_common_cpu_init(cpu);
+}
+
 static bool __init hyperv_detect_via_smccc(void)
 {
 	uuid_t hyperv_uuid = UUID_INIT(
@@ -73,6 +119,8 @@ static bool __init hyperv_detect_via_smccc(void)
 static int __init hyperv_init(void)
 {
 	struct hv_get_vp_registers_output	result;
+	void **hostcall_struct;
+	void *mem;
 	u64	guest_id;
 	int	ret;
 
@@ -85,6 +133,27 @@ static int __init hyperv_init(void)
 	if (!hyperv_detect_via_acpi() && !hyperv_detect_via_smccc())
 		return 0;
 
+	/*
+	 * The RSI host-call buffer is only ever used when
+	 * is_realm_world() is true. Skip the per-CPU allocation on
+	 * non-Realm guests.
+	 */
+	if (is_realm_world()) {
+		hyperv_pcpu_hostcall_struct = alloc_percpu(void *);
+		if (!hyperv_pcpu_hostcall_struct)
+			return -ENOMEM;
+
+		hostcall_struct = (void **)this_cpu_ptr(hyperv_pcpu_hostcall_struct);
+		if (!*hostcall_struct) {
+			mem = kzalloc_obj(struct rsi_host_call);
+			if (!mem) {
+				ret = -ENOMEM;
+				goto free_hostcall_mem;
+			}
+			*hostcall_struct = mem;
+		}
+	}
+
 	/* Setup the guest ID */
 	guest_id = hv_generate_guest_id(LINUX_VERSION_CODE);
 	hv_set_vpreg(HV_REGISTER_GUEST_OS_ID, guest_id);
@@ -106,12 +175,13 @@ static int __init hyperv_init(void)
 
 	ret = hv_common_init();
 	if (ret)
-		return ret;
+		goto free_hostcall_mem;
 
 	ret = cpuhp_setup_state(CPUHP_AP_HYPERV_ONLINE, "arm64/hyperv_init:online",
-				hv_common_cpu_init, hv_common_cpu_die);
+				hv_cpu_init, hv_common_cpu_die);
 	if (ret < 0) {
 		hv_common_free();
+		hv_hostcall_free();
 		return ret;
 	}
 
@@ -125,6 +195,10 @@ static int __init hyperv_init(void)
 
 	hyperv_initialized = true;
 	return 0;
+
+free_hostcall_mem:
+	hv_hostcall_free();
+	return ret;
 }
 
 early_initcall(hyperv_init);
diff --git a/arch/arm64/include/asm/mshyperv.h b/arch/arm64/include/asm/mshyperv.h
index b721d3134ab66..65a00bd14c6cb 100644
--- a/arch/arm64/include/asm/mshyperv.h
+++ b/arch/arm64/include/asm/mshyperv.h
@@ -63,4 +63,7 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
 
 #include <asm-generic/mshyperv.h>
 
+/* Per-CPU RSI host call structure for CCA Realms */
+extern void *__percpu *hyperv_pcpu_hostcall_struct;
+
 #endif
-- 
2.45.4


^ permalink raw reply related

* [RFC PATCH 2/6] firmware: smccc: Detect hypervisor via RSI host call in CCA Realms
From: Kameron Carr @ 2026-06-09 18:10 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli
  Cc: catalin.marinas, will, mark.rutland, lpieralisi, sudeep.holla,
	arnd, thuth, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-arch, mhklinux
In-Reply-To: <20260609181030.2378391-1-kameroncarr@linux.microsoft.com>

Modify arm_smccc_hypervisor_has_uuid() to check is_realm_world() and
use rsi_host_call() to query the hypervisor vendor UUID when inside a
Realm. The realm path is factored into a helper,
arm_smccc_realm_get_hypervisor_uuid(), that owns a file-static
rsi_host_call buffer (uuid_hc) serialized by a spinlock.

The RSI-specific includes, file-static state and helper are guarded
with CONFIG_ARM64 because <asm/rsi.h> does not exist on 32-bit ARM.

For non-Realm environments, the existing arm_smccc_1_1_invoke() path
is unchanged.

Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
---
 drivers/firmware/smccc/smccc.c | 41 +++++++++++++++++++++++++++++++++-
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/smccc/smccc.c b/drivers/firmware/smccc/smccc.c
index bdee057db2fd3..6b465e65472b0 100644
--- a/drivers/firmware/smccc/smccc.c
+++ b/drivers/firmware/smccc/smccc.c
@@ -12,6 +12,12 @@
 #include <linux/platform_device.h>
 #include <asm/archrandom.h>
 
+#ifdef CONFIG_ARM64
+#include <linux/cleanup.h>
+#include <linux/spinlock.h>
+#include <asm/rsi.h>
+#endif
+
 static u32 smccc_version = ARM_SMCCC_VERSION_1_0;
 static enum arm_smccc_conduit smccc_conduit = SMCCC_CONDUIT_NONE;
 
@@ -67,12 +73,45 @@ s32 arm_smccc_get_soc_id_revision(void)
 }
 EXPORT_SYMBOL_GPL(arm_smccc_get_soc_id_revision);
 
+#ifdef CONFIG_ARM64
+static struct rsi_host_call uuid_hc;
+static DEFINE_SPINLOCK(uuid_hc_lock);
+
+/*
+ * Helper function to get the hypervisor UUID via an RsiHostCall.
+ */
+static bool arm_smccc_realm_get_hypervisor_uuid(struct arm_smccc_res *res)
+{
+	guard(spinlock_irqsave)(&uuid_hc_lock);
+
+	memset(&uuid_hc, 0, sizeof(uuid_hc));
+	uuid_hc.gprs[0] = ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID;
+
+	if (rsi_host_call(__pa_symbol(&uuid_hc)) != RSI_SUCCESS)
+		return false;
+
+	res->a0 = uuid_hc.gprs[0];
+	res->a1 = uuid_hc.gprs[1];
+	res->a2 = uuid_hc.gprs[2];
+	res->a3 = uuid_hc.gprs[3];
+	return true;
+}
+#endif
+
 bool arm_smccc_hypervisor_has_uuid(const uuid_t *hyp_uuid)
 {
 	struct arm_smccc_res res = {};
 	uuid_t uuid;
 
-	arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID, &res);
+#ifdef CONFIG_ARM64
+	if (is_realm_world()) {
+		if (!arm_smccc_realm_get_hypervisor_uuid(&res))
+			return false;
+	} else
+#endif
+		arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID,
+				     &res);
+
 	if (res.a0 == SMCCC_RET_NOT_SUPPORTED)
 		return false;
 
-- 
2.45.4


^ permalink raw reply related

* [RFC PATCH 1/6] arm64: rsi: Add RSI host call structure and helper function
From: Kameron Carr @ 2026-06-09 18:10 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli
  Cc: catalin.marinas, will, mark.rutland, lpieralisi, sudeep.holla,
	arnd, thuth, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-arch, mhklinux
In-Reply-To: <20260609181030.2378391-1-kameroncarr@linux.microsoft.com>

Add struct rsi_host_call to rsi_smc.h, which represents the host call
data structure used by the Realm Management Monitor (RMM) for the
RSI_HOST_CALL interface. The structure contains a 16-bit immediate field
and 31 general-purpose register values, aligned to 256 bytes as required
by the CCA RMM specification.
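
For reference, the layout arithmetic can be checked with a small userspace
sketch. The struct name below is hypothetical and fixed-width stdint types
stand in for the kernel's u16/u64; it assumes a C11 compiler with the
GCC/Clang `aligned` attribute. The 16-bit immediate sits at offset 0, the
registers start at offset 8 after padding, and 8 + 31 * 8 = 256 bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the layout described above; not the kernel header. */
struct host_call_layout_sketch {
	uint16_t immediate;	/* offset 0, followed by 6 bytes of padding */
	uint64_t gprs[31];	/* offset 8, 31 * 8 = 248 bytes */
} __attribute__((aligned(256)));

/* Compile-time checks of the offsets and the total size. */
static_assert(offsetof(struct host_call_layout_sketch, immediate) == 0,
	      "immediate must be the first field");
static_assert(offsetof(struct host_call_layout_sketch, gprs) == 8,
	      "gprs must start at the first 8-byte boundary");
static_assert(sizeof(struct host_call_layout_sketch) == 256,
	      "structure must be exactly 256 bytes");
static_assert(_Alignof(struct host_call_layout_sketch) == 256,
	      "structure must be 256-byte aligned");
```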

Add rsi_host_call() static inline wrapper in rsi_cmds.h that invokes
SMC_RSI_HOST_CALL with the physical address of the host call structure.
This will be used by Hyper-V guest code to route hypercalls through the
RSI interface when running inside an Arm CCA Realm.

Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
---
 arch/arm64/include/asm/rsi_cmds.h | 9 +++++++++
 arch/arm64/include/asm/rsi_smc.h  | 6 ++++++
 2 files changed, 15 insertions(+)

diff --git a/arch/arm64/include/asm/rsi_cmds.h b/arch/arm64/include/asm/rsi_cmds.h
index 2c8763876dfb7..83b4b1f598454 100644
--- a/arch/arm64/include/asm/rsi_cmds.h
+++ b/arch/arm64/include/asm/rsi_cmds.h
@@ -159,4 +159,13 @@ static inline unsigned long rsi_attestation_token_continue(phys_addr_t granule,
 	return res.a0;
 }
 
+static inline long rsi_host_call(phys_addr_t host_call_struct)
+{
+	struct arm_smccc_res res;
+
+	arm_smccc_smc(SMC_RSI_HOST_CALL, host_call_struct, 0, 0, 0, 0, 0, 0,
+		      &res);
+	return res.a0;
+}
+
 #endif /* __ASM_RSI_CMDS_H */
diff --git a/arch/arm64/include/asm/rsi_smc.h b/arch/arm64/include/asm/rsi_smc.h
index e19253f96c940..ffea93340ed7f 100644
--- a/arch/arm64/include/asm/rsi_smc.h
+++ b/arch/arm64/include/asm/rsi_smc.h
@@ -142,6 +142,12 @@ struct realm_config {
 	 */
 } __aligned(0x1000);
 
+struct rsi_host_call {
+	u16 immediate;
+	u64 gprs[31];
+} __aligned(256);
+static_assert(sizeof(struct rsi_host_call) == 256);
+
 #endif /* __ASSEMBLER__ */
 
 /*
-- 
2.45.4


^ permalink raw reply related

* [RFC PATCH 0/6] arm64: hyperv: Add Realm support for Hyper-V
From: Kameron Carr @ 2026-06-09 18:10 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli
  Cc: catalin.marinas, will, mark.rutland, lpieralisi, sudeep.holla,
	arnd, thuth, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-arch, mhklinux

From: Kameron Carr <kameroncarr@microsoft.com>

Realms (CoCo VMs on ARM) require host calls to be routed through the RMM
(Realm Management Monitor) via the RSI (Realm Service Interface). This
series implements most of the necessary changes to support Realms on
Hyper-V.

One required change is not included in this series. The two buffers
allocated via vzalloc() in netvsc_init_buf() cannot be decrypted in
vmbus_establish_gpadl(). Currently only linearly mapped memory can be
decrypted. See my RFC patch [1]. I will implement the accompanying netvsc
changes based on the feedback I receive on that patch.

This patch series was tested by booting a Realm on Cobalt 200 running
Windows. I decreased the buffer size and used kzalloc() in
netvsc_init_buf() in my testing as a workaround for the issue mentioned
above.

[1] https://lore.kernel.org/all/20260521205834.1012925-1-kameroncarr@linux.microsoft.com/

Kameron Carr (6):
  arm64: rsi: Add RSI host call structure and helper function
  firmware: smccc: Detect hypervisor via RSI host call in CCA Realms
  arm64: hyperv: Add per-CPU RSI host call infrastructure for CCA Realms
  Drivers: hv: Mark shared memory as decrypted for CCA Realms
  arm64: hyperv: Route hypercalls through RSI host call in CCA Realms
  arm64: hyperv: Implement hv_is_isolation_supported() for CCA Realms

 arch/arm64/hyperv/hv_core.c       | 175 ++++++++++++++++++++++++------
 arch/arm64/hyperv/mshyperv.c      |  88 ++++++++++++++-
 arch/arm64/include/asm/mshyperv.h |   3 +
 arch/arm64/include/asm/rsi_cmds.h |   9 ++
 arch/arm64/include/asm/rsi_smc.h  |   6 +
 drivers/firmware/smccc/smccc.c    |  41 ++++++-
 drivers/hv/hv_common.c            |   9 +-
 include/asm-generic/mshyperv.h    |   1 +
 8 files changed, 294 insertions(+), 38 deletions(-)

base-commit: 7a035678fc2bdee81881170764ef08a91a076147
-- 
2.45.4

^ permalink raw reply

* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Jann Horn @ 2026-06-09 17:53 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Mateusz Guzik, Christian Brauner, Li Chen, Kees Cook,
	Alexander Viro, linux-fsdevel, linux-api, linux-kernel, linux-mm,
	linux-arch, linux-doc, linux-kselftest, x86, Arnd Bergmann,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Jan Kara, Jonathan Corbet,
	Shuah Khan
In-Reply-To: <lhubjdk1c1m.fsf@oldenburg.str.redhat.com>

On Tue, Jun 9, 2026 at 8:08 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Jann Horn:
>
> >> Per the above, the primary win would stem from *NOT* messing with mm.
> >
> > As you write below, I think we have that with CLONE_VM? The C function
> > vfork() is kind of a terrible API because of its returns-twice
> > behavior, but I think if process cloning with CLONE_VM|CLONE_VFORK was
> > wrapped by libc in a way similar to clone() (with the child executing
> > a separate handler function), or if it was used in the implementation
> > of some higher-level process-spawning API, it would be a perfectly
> > fine API?
>
> No, there is still a problem with SIGTSTP handling because we cannot
> atomically unmask the signal during execve.  We need to unblock SIGTSTP
> before execve in the new process, but this means that it can get
> suspended by SIGTSTP.  Consequently, the execve never happens and the
> original process is stuck in vfork:
>
>   posix_spawn: parent can get stuck in uninterruptible sleep if child
>   receives SIGTSTP early enough
>   <https://inbox.sourceware.org/libc-help/2921668c-773e-465d-9480-0abb6f979bf9@www.fastmail.com/>
>
> More on the low-level side, it's difficult to make sure that execve gets
> a consistent snapshot of the environ vector.  Both vfork and execve need
> to be async-signal-safe.  Any locking or memory allocation (except for
> the stack …) persists in the original process after vfork returns.  The

I think that's not entirely accurate; if you call set_robust_list() on
a futex list, then call execve(), the futexes should be released once
the process switches to a new MM, in
begin_new_exec -> exec_mmap -> exec_mm_release -> futex_exec_release
-> futex_cleanup -> exit_robust_list.

So in theory you could use clone() with CLONE_VM and without
CLONE_VFORK, and let the parent either wait for a futex that is
released on exec, or somehow asynchronously check later whether the
futex is still held... probably not the nicest building block but
maybe workable? Though I guess it would fit more nicely if there was a
"munmap() this range on exec" API...

> environ vector can be large, so making a copy on the stack is not ideal.
> It's even harder for getenv/setenv/unsetenv implementations that use
> locking instead of software transactional memory.

Makes sense, that kind of sounds like a pain inherent in being able to
execute from signal handler context...

^ permalink raw reply

* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: John Ericson @ 2026-06-09 17:27 UTC (permalink / raw)
  To: Li Chen, Andy Lutomirski
  Cc: Christian Brauner, Kees Cook, Al Viro, linux-fsdevel, linux-api,
	LKML, linux-mm, linux-arch, linux-doc, linux-kselftest, x86,
	Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Jan Kara, Jonathan Corbet,
	Shuah Khan
In-Reply-To: <19eacd64508.26b92c022125848.262962729296162879@linux.beauty>

On Tue, Jun 9, 2026, at 10:43 AM, Li Chen wrote:
> Hi Andy,
>
> ---- On Tue, 09 Jun 2026 08:01:57 +0800  Andy Lutomirski <luto@kernel.org> wrote ---
> > [...]
> >
> > After contemplating this for a bit... why pidfd?  Doesn't a pidfd
> > refer to an actual process that is, or at least was, running?  This
> > new thing is a process that we are contemplating spawning.  I can
> > imagine that basically all pidfd APIs would be a bit confused by the
> > nonexistence of the process in question.
> >
>
> Yes, I think that is a real concern.
>
> In my current local WIP I tried to keep that distinction explicit.
> pidfd_spawn_open() returns a pidfs-backed builder fd, not a normal pidfd
> referring to a process. The builder fd is allocated as an anonymous pidfs
> file with builder-specific file operations:
>
>     file = pidfs_alloc_anon_file("[pidfd_spawn]",
>                                  &pidfd_spawn_builder_fops, builder,
>                                  O_RDWR);
>

What does your builder fd point to, explicitly? For example in my other reply I
talked about how it was "real" process state. In my FreeBSD patch, for example,
I found there was already a status for a process "in exec", and I figured that
was clean to reuse for one of these "embryonic" processes that also hadn't
started running. I would reckon that Linux probably has some similar notions.

> and the normal pidfd helpers still reject it because it does not use the
> ordinary pidfd file operations:
>
>     struct pid *pidfd_pid(const struct file *file)
>     {
>         if (file->f_op != &pidfs_file_operations)
>             return ERR_PTR(-EBADF);
>         return file_inode(file)->i_private;
>     }
>
> So the current split is:
>
>     builder_fd = pidfd_spawn_open(...);       /* builder object */
>     pidfd_config(builder_fd, ...);
>     child_pidfd = pidfd_spawn_run(builder_fd, ...); /* real pidfd */
>
> Only the last fd is a normal pidfd for an actual child process. The builder
> fd is only accepted by the builder operations.
>
> This avoids having to define what waitid(P_PIDFD), pidfd_send_signal(),
> pidfd_getfd(), poll(), etc. mean before the process exists.

I wouldn't be so sure this is necessary/good. For example, I think it could
make sense to wait on a process that has yet to be started; one just waits for
both the process to start and the process to exit. Obviously a blocking syscall
in the thread that is spawning the process is not useful, but the asynchronous
poll variation seems fine.

As long as there is real process state here, it shouldn't be too hard to
implement.

> The downside is that it adds a separate open-style entry point and is less
> uniform than the pidfd_open(0, PIDFD_EMPTY) spelling Christian sketched.

I do think there is no point having two file descriptors. The file descriptor
that previously referred to the builder/embryonic process then can refer to the
real process, right?

> If people think there is a better way to represent the pre-spawn builder
> state, or if the preference is to integrate it directly into pidfd_open()
> with an explicit empty/future-pidfd state, I would be happy to discuss that.

Hope the above answers your question? I suppose my ideas lean more on the
"future" than "empty" side --- there is indeed a thread in the thread group,
with real VM/namespace/file descriptor etc. state. Moreover, state gets
initialized before the process is started, so the actual start is a pretty
lightweight step of just letting the scheduler know the now-ready process can
be scheduled. The only thing that distinguishes the embryonic process from a
real one is simply that it isn't running --- i.e. isn't (yet) available to be
scheduled --- so the pidfd's holders are free to poke at its state.

Cheers,

John

^ permalink raw reply

* Re: [PATCH] rqspinlock: Fix order in raw_res_spin_(un)lock_irq to allow schedule
From: Alexei Starovoitov @ 2026-06-09 16:57 UTC (permalink / raw)
  To: Gabriele Monaco, Arnd Bergmann, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, bpf, Linux-Arch, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long
In-Reply-To: <defc6b1339dea2f64f936902dd6e1f850cff9f4e.camel@redhat.com>

On Tue Jun 9, 2026 at 6:04 AM PDT, Gabriele Monaco wrote:
> On Tue, 2026-06-09 at 13:22 +0200, Arnd Bergmann wrote:
>> Should this be Cc:stable@vger.kernel.org to get backported?
>
> Not sure if the Fixes: is enough to trigger the automation, I rarely
> remember to Cc:stable@vger.kernel.org and they're usually picked.
>
> In case I guess I'd need to re-submit the patch right?

Yes. For whatever reason the patch didn't reach the patchwork.

Please resubmit with [PATCH bpf-next] subject, so that CI can test it properly.
And collect Acks.

^ permalink raw reply

* Re: [PATCH] rqspinlock: Fix order in raw_res_spin_(un)lock_irq to allow schedule
From: Gabriele Monaco @ 2026-06-09 16:17 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, Arnd Bergmann, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman, bpf,
	Linux-Arch, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long
In-Reply-To: <DJ4LJ3T8PYRG.3TG272FHH78RM@gmail.com>

On Tue, 2026-06-09 at 16:42 +0200, Kumar Kartikeya Dwivedi wrote:
> On Tue Jun 9, 2026 at 3:04 PM CEST, Gabriele Monaco wrote:
> > On Tue, 2026-06-09 at 13:22 +0200, Arnd Bergmann wrote:
> > 
> > > Did you see this cause measurable performance problems,
> > > or did you find it through inspection?
> > 
> > I noticed it while debugging an ENOMEM issue in the test_maps BPF
> > selftest on PREEMPT_RT and this was an obvious culprit (irq_work
> > not scheduled during a stress run). Turns out the problem is still
> > there after this fix though.
> 
> I would imagine this to be the least of your problems, I think
> there's a bunch of blockers for complete selftests passing with
> PREEMPT_RT support in BPF.
> 

Well, I'm starting to believe that too...
At the moment, on well-tuned machines, we are only observing ENOMEM
issues in test_maps, and only when it's literally hogging the
allocator (preallocation is off and 100 threads do updates in parallel).

What else are you expecting to fail under PREEMPT_RT?

> > > > Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> > > 
> > > Acked-by: Arnd Bergmann <arnd@arndb.de> # asm-generic
> 
> The patch makes sense to me as well.
> 
> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

Thanks,
Gabriele


^ permalink raw reply

* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Li Chen @ 2026-06-09 14:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christian Brauner, Kees Cook, Alexander Viro, linux-fsdevel,
	linux-api, linux-kernel, linux-mm, linux-arch, linux-doc,
	linux-kselftest, x86, Arnd Bergmann, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Jan Kara,
	Jonathan Corbet, Shuah Khan
In-Reply-To: <CALCETrWJQpLR4n1cpichBk8=uExSKLWTMGU3BufGdk_WE_p5UA@mail.gmail.com>

Hi Andy,

 ---- On Tue, 09 Jun 2026 08:01:57 +0800  Andy Lutomirski <luto@kernel.org> wrote --- 
 > On Thu, May 28, 2026 at 4:05 AM Christian Brauner <brauner@kernel.org> wrote:
 > >
 > > On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
 > > > Hi,
 > > >
 > > > This is an early RFC for an idea that is probably still rough in both the
 > > > UAPI and implementation details. Sorry for the rough edges; I am sending
 > > > it now to check whether this direction is worth pursuing and to get
 > > > feedback on the kernel/userspace boundary.
 > >
 > > The idea of having a builder api for exec isn't all that crazy. But it
 > > should simply be built on top of pidfds and thus pidfs itself instead.
 > > It has all the basic infrastructure in place already. Any implementation
 > > should also allow userspace to implement posix_spawn() on top of it.
 > >
 > > fd = pidfd_open(0, PIDFD_EMPTY /* or better name */)
 > >
 > > pidfd_config(fd, ...) // modeled similar to fsconfig()
 > >
 > 
 > After contemplating this for a bit... why pidfd?  Doesn't a pidfd
 > refer to an actual process that is, or at least was, running?  This
 > new thing is a process that we are contemplating spawning.  I can
 > imagine that basically all pidfd APIs would be a bit confused by the
 > nonexistence of the process in question.
 > 

Yes, I think that is a real concern.

In my current local WIP I tried to keep that distinction explicit.
pidfd_spawn_open() returns a pidfs-backed builder fd, not a normal pidfd
referring to a process. The builder fd is allocated as an anonymous pidfs
file with builder-specific file operations:

    file = pidfs_alloc_anon_file("[pidfd_spawn]",
                                 &pidfd_spawn_builder_fops, builder,
                                 O_RDWR);

and the normal pidfd helpers still reject it because it does not use the
ordinary pidfd file operations:

    struct pid *pidfd_pid(const struct file *file)
    {
        if (file->f_op != &pidfs_file_operations)
            return ERR_PTR(-EBADF);
        return file_inode(file)->i_private;
    }

So the current split is:

    builder_fd = pidfd_spawn_open(...);       /* builder object */
    pidfd_config(builder_fd, ...);
    child_pidfd = pidfd_spawn_run(builder_fd, ...); /* real pidfd */

Only the last fd is a normal pidfd for an actual child process. The
builder fd is only accepted by the builder operations.
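
To make the rejection mechanism concrete, here is a self-contained
userspace sketch of the same idea: a helper identifies the file type by
comparing its operations pointer and returns -EBADF for anything else.
Every type and name below is a made-up stand-in rather than the kernel's
struct file or pidfs code, and the error is returned as an int instead of
through ERR_PTR().

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for file_operations and struct file. */
struct file_ops_sketch {
	const char *name;
};

struct file_sketch {
	const struct file_ops_sketch *f_op;
	void *private_data;
};

const struct file_ops_sketch pidfd_ops_sketch = { "pidfd" };
const struct file_ops_sketch builder_ops_sketch = { "pidfd_spawn" };

/*
 * Stores the payload in *out and returns 0 only for files that use the
 * ordinary pidfd operations; a builder file is rejected with -EBADF and
 * *out is left untouched.
 */
int pidfd_payload_sketch(const struct file_sketch *file, void **out)
{
	if (file->f_op != &pidfd_ops_sketch)
		return -EBADF;
	*out = file->private_data;
	return 0;
}
```

With this shape, existing helpers need no knowledge of the builder at all;
they simply never match its operations table.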
                                                                                                       
This avoids having to define what waitid(P_PIDFD), pidfd_send_signal(),
pidfd_getfd(), poll(), etc. mean before the process exists. The downside
is that it adds a separate open-style entry point and is less uniform than
the pidfd_open(0, PIDFD_EMPTY) spelling Christian sketched.
                                                                                                                                                                                                                 
If people think there is a better way to represent the pre-spawn builder
state, or if the preference is to integrate it directly into pidfd_open()
with an explicit empty/future-pidfd state, I would be happy to discuss
that.

Regards,
Li


^ permalink raw reply

* Re: [PATCH] rqspinlock: Fix order in raw_res_spin_(un)lock_irq to allow schedule
From: Kumar Kartikeya Dwivedi @ 2026-06-09 14:42 UTC (permalink / raw)
  To: Gabriele Monaco, Arnd Bergmann, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, bpf, Linux-Arch, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long
In-Reply-To: <defc6b1339dea2f64f936902dd6e1f850cff9f4e.camel@redhat.com>

On Tue Jun 9, 2026 at 3:04 PM CEST, Gabriele Monaco wrote:
> On Tue, 2026-06-09 at 13:22 +0200, Arnd Bergmann wrote:
>> Should this be Cc:stable@vger.kernel.org to get backported?
>
> Not sure if the Fixes: is enough to trigger the automation, I rarely
> remember to Cc:stable@vger.kernel.org and they're usually picked.
>
> In case I guess I'd need to re-submit the patch right?
>
>> Did you see this cause measurable performance problems,
>> or did you find it through inspection?
>
> I noticed it while debugging an ENOMEM issue in the test_maps BPF
> selftest on PREEMPT_RT and this was an obvious culprit (irq_work not
> scheduled during a stress run). Turns out the problem is still there
> after this fix though.

I would imagine this to be the least of your problems, I think there's a bunch
of blockers for complete selftests passing with PREEMPT_RT support in BPF.

>
>>
>> > Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
>>
>> Acked-by: Arnd Bergmann <arnd@arndb.de> # asm-generic

The patch makes sense to me as well.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

^ permalink raw reply

* Re: [PATCH] rqspinlock: Fix order in raw_res_spin_(un)lock_irq to allow schedule
From: Peter Zijlstra @ 2026-06-09 14:35 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Gabriele Monaco, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Kumar Kartikeya Dwivedi, bpf,
	Linux-Arch, linux-kernel, Ingo Molnar, Will Deacon, Boqun Feng,
	Waiman Long
In-Reply-To: <d40ba64d-78d9-45f5-99b9-4bfb1fc27f6c@app.fastmail.com>

On Tue, Jun 09, 2026 at 01:22:35PM +0200, Arnd Bergmann wrote:
> On Tue, Jun 9, 2026, at 11:49, Gabriele Monaco wrote:
> > raw_res_spin_unlock_irqrestore() calls raw_res_spin_unlock() and then
> > restores interrupts, this means preemption is enabled when interrupts
> > are still disabled (as part of raw_res_spin_unlock()) so this cannot
> > trigger an actual preemption.
> > This is inconsistent with other spinlock implementations
> > (raw_spin_unlock_irqrestore() and bpf_res_spin_unlock_irqrestore()
> > itself).
> >
> > Adjust the macro to ensure interrupts are enabled before enabling
> > preemption, allowing to schedule at that point. Make the same
> > modification in the error path of raw_res_spin_lock_irqsave().
> >
> > Fixes: 101acd2e78b1 ("rqspinlock: Add macros for rqspinlock usage")

Yeah, this is right. spinlocks always get one preempt_disable, in
addition they might also get irq or bh disable.
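
For illustration only, the ordering issue can be reproduced in a toy
single-CPU userspace model. Nothing below is kernel code: the model_*()
helpers, the global state and resched_points_hit() are hypothetical
names, and the model assumes that a reschedule can only happen when the
preempt count drops to zero while interrupts are enabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-CPU state; none of this is kernel code. */
static int  preempt_count;	/* > 0 means preemption is disabled */
static bool irqs_off;		/* true while "interrupts" are masked */
static bool need_resched;	/* a wakeup asked for a reschedule */
static int  schedule_calls;	/* how often the model rescheduled */

static void model_preempt_disable(void) { preempt_count++; }

/* Reschedule only if fully preemptible: count is 0 and IRQs are on. */
static void model_preempt_enable(void)
{
	assert(preempt_count > 0);
	if (--preempt_count == 0 && !irqs_off && need_resched) {
		need_resched = false;
		schedule_calls++;
	}
}

static void model_irq_save(bool *flags) { *flags = irqs_off; irqs_off = true; }
static void model_irq_restore(bool flags) { irqs_off = flags; }

/* Old ordering: the preempt count is dropped with IRQs still off. */
static void unlock_irqrestore_old(bool flags)
{
	model_preempt_enable();
	model_irq_restore(flags);
}

/* New ordering: restore IRQs first, then drop the preempt count. */
static void unlock_irqrestore_new(bool flags)
{
	model_irq_restore(flags);
	model_preempt_enable();
}

/* Take the "lock", receive a wakeup while holding it, then unlock. */
static int resched_points_hit(void (*unlock)(bool))
{
	bool flags;

	preempt_count = 0;
	irqs_off = false;
	need_resched = false;
	schedule_calls = 0;

	model_irq_save(&flags);
	model_preempt_disable();
	need_resched = true;
	unlock(flags);
	return schedule_calls;
}
```

With the old ordering the model never reschedules at unlock time; with
the new ordering it does. In the model the pending request is not lost
in the old case, it just stays set until some later preemption point.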

^ permalink raw reply

* Re: [PATCH] rqspinlock: Fix order in raw_res_spin_(un)lock_irq to allow schedule
From: Arnd Bergmann @ 2026-06-09 13:08 UTC (permalink / raw)
  To: Gabriele Monaco, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Kumar Kartikeya Dwivedi, bpf,
	Linux-Arch, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long
In-Reply-To: <defc6b1339dea2f64f936902dd6e1f850cff9f4e.camel@redhat.com>

On Tue, Jun 9, 2026, at 15:04, Gabriele Monaco wrote:
> On Tue, 2026-06-09 at 13:22 +0200, Arnd Bergmann wrote:
>> Should this be Cc:stable@vger.kernel.org to get backported?
>
> Not sure if the Fixes: is enough to trigger the automation, I rarely
> remember to Cc:stable@vger.kernel.org and they're usually picked.

There is always human interaction. If you just have 'Fixes',
this means someone will have to look at the patch carefully
and make a decision, since a lot of bugfix patches either don't
apply to old kernels or don't fall under the rules for stable
backports.

If the patch gets tagged Cc:, this means it is expected to be
backported and needs less manual work.

> In case I guess I'd need to re-submit the patch right?

It can be added by whoever picks up the patch.

>> Did you see this cause measurable performance problems,
>> or did you find it through inspection?
>
> I noticed it while debugging an ENOMEM issue in the test_maps BPF
> selftest on PREEMPT_RT and this was an obvious culprit (irq_work not
> scheduled during a stress run). Turns out the problem is still there
> after this fix though.

Ok

       Arnd

^ permalink raw reply

* Re: [PATCH] rqspinlock: Fix order in raw_res_spin_(un)lock_irq to allow schedule
From: Gabriele Monaco @ 2026-06-09 13:04 UTC (permalink / raw)
  To: Arnd Bergmann, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Kumar Kartikeya Dwivedi, bpf,
	Linux-Arch, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long
In-Reply-To: <d40ba64d-78d9-45f5-99b9-4bfb1fc27f6c@app.fastmail.com>

On Tue, 2026-06-09 at 13:22 +0200, Arnd Bergmann wrote:
> Should this be Cc:stable@vger.kernel.org to get backported?

Not sure if the Fixes: is enough to trigger the automation, I rarely
remember to Cc:stable@vger.kernel.org and they're usually picked.

In case I guess I'd need to re-submit the patch right?

> Did you see this cause measurable performance problems,
> or did you find it through inspection?

I noticed it while debugging an ENOMEM issue in the test_maps BPF
selftest on PREEMPT_RT and this was an obvious culprit (irq_work not
scheduled during a stress run). Turns out the problem is still there
after this fix though.

> 
> > Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> 
> Acked-by: Arnd Bergmann <arnd@arndb.de> # asm-generic

Thanks,
Gabriele

> 
> This should probably get merged through the BPF tree, but I've
> added the kernel/locking maintainers to Cc as well, since I
> feel it's more useful to have them look at it than me.
> 
> Maybe it would be good to update (as a separate patch) the
> MAINTAINERS file so the locking subsystem also includes the
> headers currently missing:
> 
> arch/*/include/asm/*spinlock*.h
> arch/*/include/asm/*rwlock*.h
> include/asm-generic/*spinlock*.h
> include/asm-generic/*rwlock*.h
> 
>        Arnd
> 
> (full patch quoted below)
> 
> > ---
> >  include/asm-generic/rqspinlock.h | 14 +++++++++++---
> >  1 file changed, 11 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/asm-generic/rqspinlock.h 
> > b/include/asm-generic/rqspinlock.h
> > index 151d267a49..4d46643f46 100644
> > --- a/include/asm-generic/rqspinlock.h
> > +++ b/include/asm-generic/rqspinlock.h
> > @@ -243,12 +243,20 @@ static __always_inline void 
> > res_spin_unlock(rqspinlock_t *lock)
> >  	({                                        \
> >  		int __ret;                        \
> >  		local_irq_save(flags);            \
> > -		__ret = raw_res_spin_lock(lock);  \
> > -		if (__ret)                        \
> > +		preempt_disable();                \
> > +		__ret = res_spin_lock(lock);      \
> > +		if (__ret) {                      \
> >  			local_irq_restore(flags); \
> > +			preempt_enable();         \
> > +		}                                 \
> >  		__ret;                            \
> >  	})
> > 
> > -#define raw_res_spin_unlock_irqrestore(lock, flags) ({ 
> > raw_res_spin_unlock(lock); local_irq_restore(flags); })
> > +#define raw_res_spin_unlock_irqrestore(lock, flags) \
> > +	({                                          \
> > +		res_spin_unlock(lock);              \
> > +		local_irq_restore(flags);           \
> > +		preempt_enable();                   \
> > +	})
> > 
> >  #endif /* __ASM_GENERIC_RQSPINLOCK_H */
> > 
> > base-commit: e43ffb69e0438cddd72aaa30898b4dc446f664f8
> > -- 
> > 2.54.0

^ permalink raw reply

* Re: [PATCH v4 6/8] string: introduce memcpy_streaming() helpers
From: Li Zhe @ 2026-06-09 12:01 UTC (permalink / raw)
  To: bp
  Cc: akpm, apopple, arnd, dave.hansen, david, kees, linux-arch,
	linux-hardening, linux-kernel, linux-mm, lizhe.67, mingo, rppt,
	tglx, x86
In-Reply-To: <20260607190804.GAaiXBlGO2eRcfs1oB@fat_crate.local>

On Sun, 7 Jun 2026 12:08:04 -0700, bp@alien8.de wrote:

> On Wed, Jun 03, 2026 at 04:01:50PM +0800, Li Zhe wrote:
> > Introduce a generic memcpy_streaming() interface for write-once copy
> > sites that can fall back to memcpy() when no architecture-specific
> > optimization is available, or when an architecture-specific backend
> > cannot safely handle a given transfer.
> >
> > Add memcpy_streaming_drain() alongside it so callers can separate the
> > copy primitive from any required ordering point. On x86, use
> > memcpy_flushcache() and sfence only for aligned transfers that can stay
> > entirely on the non-temporal store path; otherwise fall back to memcpy()
> 
> So you're throwing "streaming", "non-temporal" and "flush-cache" wildly around
> here and this is adding unnecessary confusion where it shouldn't. I'd suggest
> you stick to "non-temporal" which you can abbreviate short'n'sweet to "nt" and
> that's it. Keep it simple.

Thanks for the review. Will switch to nt-based naming in next revision.

> > so the generic API does not expose flushcache semantics on cached
> > head/tail fragments.
> >
> > Callers are responsible for invoking memcpy_streaming_drain() before
> > later normal stores that must be ordered after the streaming copy.
> >
> > Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> > ---
> >  arch/x86/include/asm/string_64.h | 32 ++++++++++++++++++++++++++++++++
> >  include/linux/string.h           | 20 ++++++++++++++++++++
> >  2 files changed, 52 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
> > index 4635616863f5..aee63108577f 100644
> > --- a/arch/x86/include/asm/string_64.h
> > +++ b/arch/x86/include/asm/string_64.h
> 
> There's arch/x86/include/asm/string.h. Why are those here, in the _64 variant?

The current placement was meant to reflect that the x86 implementation
here is really just a thin wrapper around the existing
memcpy_flushcache() backend, and that backend is x86_64-only today.

On x86, CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE is selected only for X86_64,
so a 32-bit build would still need to fall back to the generic
memcpy()-based implementation anyway. Keeping it in string_64.h made
that backend dependency explicit.

That said, I see your layering point. If arch/x86/include/asm/string.h
is the preferred place for the arch-visible wrapper, I can move the
wrapper there in the next revision while keeping the x86_64-specific
implementation details in string_64.h.

> > @@ -4,6 +4,7 @@
> >
> >  #ifdef __KERNEL__
> >  #include <linux/jump_label.h>
> > +#include <linux/align.h>
> >
> >  /* Written 2002 by Andi Kleen */
> >
> > @@ -100,6 +101,37 @@ static __always_inline void memcpy_flushcache(void *dst, const void *src, size_t
> >  	}
> >  	__memcpy_flushcache(dst, src, cnt);
> >  }
> > +
> > +/*
> > + * Only map memcpy_streaming() to memcpy_flushcache() when the destination
> > + * is already 8-byte aligned and the size can be handled without cached
> > + * head/tail fragments in __memcpy_flushcache().
> > + */
> > +static __always_inline bool memcpy_flushcache_nt_safe(const void *dst,
> > +						      size_t cnt)
> 
> This is checking alignment. Then call it that.
> 
> > +{
> > +	unsigned long d = (unsigned long)dst;
> 
> Useless.
> 
> > +
> > +	return cnt && IS_ALIGNED(d, 8) && IS_ALIGNED(cnt, 4);
> > +}
> 
> AFAICT, this helper is used only once. Zap it completely.

Agreed. That helper is over-factored in its current form.

I'll fold the alignment test into the callsite and drop the temporary
variable in the next revision.
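
As a rough userspace sketch of what folding the test into the callsite
could look like (before any further simplification): memcpy_nt_model()
and nt_backend_copy() are hypothetical names, IS_ALIGNED() is a local
stand-in for the kernel macro, and the non-temporal backend is stubbed
with plain memcpy() so the sketch builds anywhere.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for IS_ALIGNED(); 'a' must be a power of two. */
#define IS_ALIGNED(x, a) ((((uintptr_t)(x)) & ((uintptr_t)(a) - 1)) == 0)

/* Hypothetical stand-in for the non-temporal backend. */
static void nt_backend_copy(void *dst, const void *src, size_t cnt)
{
	memcpy(dst, src, cnt);
}

/* Returns true if the (stubbed) non-temporal path was taken. */
static bool memcpy_nt_model(void *dst, const void *src, size_t cnt)
{
	if (!cnt)
		return false;

	/* Same test as the v4 helper, written inline, no temporary. */
	if (IS_ALIGNED(dst, 8) && IS_ALIGNED(cnt, 4)) {
		nt_backend_copy(dst, src, cnt);
		return true;
	}

	/* Misaligned destination or odd size: ordinary cached copy. */
	memcpy(dst, src, cnt);
	return false;
}
```

The bool return value exists only so the chosen path can be observed in
a test; the real helper returns void.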

> > +
> > +#define __HAVE_ARCH_MEMCPY_STREAMING 1
> > +static __always_inline void memcpy_streaming(void *dst, const void *src,
> 
> memcpy_nt()
> 
> > +					     size_t cnt)
> > +{
> > +	if (!cnt)
> > +		return;
> > +
> > +	if (memcpy_flushcache_nt_safe(dst, cnt))
> 
> That branch can cost. Why is that alignment checking so necessary? Why can't
> you simply DTRT by handling the misaligned parts like __memcpy_flushcache().
> 
> What does this bring you? None of that is explained in the commit message so
> why do I want this patch at all?
> 
> The commit message is basically telling me what the patch does but I can kinda
> read that from the diff itself. What it is not telling me is *why* it exists.

The extra alignment gating was meant to keep this helper narrower than
__memcpy_flushcache(), so patch 8 would not inherit the mixed cached
head/tail handling from that implementation.

Thinking about it more, I agree that this is hard to justify for a
generic helper. For this series, what really matters is that the
struct page copies in patch 8 can use the existing x86
memcpy_flushcache() fastpaths where that is beneficial; I do not need
patch 6 to impose extra selection policy on unrelated callers.

I'll simplify and rework this part in the next revision, rewrite the
changelog to explain the actual motivation more clearly, and respin
patches 6-8 accordingly.

Thanks,
Zhe

^ permalink raw reply

* Re: [PATCH] rqspinlock: Fix order in raw_res_spin_(un)lock_irq to allow schedule
From: Arnd Bergmann @ 2026-06-09 11:22 UTC (permalink / raw)
  To: Gabriele Monaco, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Kumar Kartikeya Dwivedi, bpf,
	Linux-Arch, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long
In-Reply-To: <20260609094941.56122-1-gmonaco@redhat.com>

On Tue, Jun 9, 2026, at 11:49, Gabriele Monaco wrote:
> raw_res_spin_unlock_irqrestore() calls raw_res_spin_unlock() and then
> restores interrupts, this means preemption is enabled when interrupts
> are still disabled (as part of raw_res_spin_unlock()) so this cannot
> trigger an actual preemption.
> This is inconsistent with other spinlock implementations
> (raw_spin_unlock_irqrestore() and bpf_res_spin_unlock_irqrestore()
> itself).
>
> Adjust the macro to ensure interrupts are enabled before enabling
> preemption, allowing to schedule at that point. Make the same
> modification in the error path of raw_res_spin_lock_irqsave().
>
> Fixes: 101acd2e78b1 ("rqspinlock: Add macros for rqspinlock usage")

Should this be Cc:stable@vger.kernel.org to get backported?

Did you see this cause measurable performance problems,
or did you find it through inspection?

> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>

Acked-by: Arnd Bergmann <arnd@arndb.de> # asm-generic

This should probably get merged through the BPF tree, but I've
added the kernel/locking maintainers to Cc as well, since I
feel it's more useful to have them look at it than me.

Maybe it would be good to update (as a separate patch) the
MAINTAINERS file so the locking subsystem also includes the
headers currently missing:

arch/*/include/asm/*spinlock*.h
arch/*/include/asm/*rwlock*.h
include/asm-generic/*spinlock*.h
include/asm-generic/*rwlock*.h

       Arnd

(full patch quoted below)

> ---
>  include/asm-generic/rqspinlock.h | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/include/asm-generic/rqspinlock.h 
> b/include/asm-generic/rqspinlock.h
> index 151d267a49..4d46643f46 100644
> --- a/include/asm-generic/rqspinlock.h
> +++ b/include/asm-generic/rqspinlock.h
> @@ -243,12 +243,20 @@ static __always_inline void 
> res_spin_unlock(rqspinlock_t *lock)
>  	({                                        \
>  		int __ret;                        \
>  		local_irq_save(flags);            \
> -		__ret = raw_res_spin_lock(lock);  \
> -		if (__ret)                        \
> +		preempt_disable();                \
> +		__ret = res_spin_lock(lock);      \
> +		if (__ret) {                      \
>  			local_irq_restore(flags); \
> +			preempt_enable();         \
> +		}                                 \
>  		__ret;                            \
>  	})
> 
> -#define raw_res_spin_unlock_irqrestore(lock, flags) ({ 
> raw_res_spin_unlock(lock); local_irq_restore(flags); })
> +#define raw_res_spin_unlock_irqrestore(lock, flags) \
> +	({                                          \
> +		res_spin_unlock(lock);              \
> +		local_irq_restore(flags);           \
> +		preempt_enable();                   \
> +	})
> 
>  #endif /* __ASM_GENERIC_RQSPINLOCK_H */
>
> base-commit: e43ffb69e0438cddd72aaa30898b4dc446f664f8
> -- 
> 2.54.0

^ permalink raw reply

* [PATCH] rqspinlock: Fix order in raw_res_spin_(un)lock_irq to allow schedule
From: Gabriele Monaco @ 2026-06-09  9:49 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Arnd Bergmann, bpf,
	linux-arch, linux-kernel
  Cc: Gabriele Monaco

raw_res_spin_unlock_irqrestore() calls raw_res_spin_unlock() and then
restores interrupts, this means preemption is enabled when interrupts
are still disabled (as part of raw_res_spin_unlock()) so this cannot
trigger an actual preemption.
This is inconsistent with other spinlock implementations
(raw_spin_unlock_irqrestore() and bpf_res_spin_unlock_irqrestore()
itself).

Adjust the macro to ensure interrupts are enabled before enabling
preemption, allowing to schedule at that point. Make the same
modification in the error path of raw_res_spin_lock_irqsave().

Fixes: 101acd2e78b1 ("rqspinlock: Add macros for rqspinlock usage")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/asm-generic/rqspinlock.h | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/asm-generic/rqspinlock.h b/include/asm-generic/rqspinlock.h
index 151d267a49..4d46643f46 100644
--- a/include/asm-generic/rqspinlock.h
+++ b/include/asm-generic/rqspinlock.h
@@ -243,12 +243,20 @@ static __always_inline void res_spin_unlock(rqspinlock_t *lock)
 	({                                        \
 		int __ret;                        \
 		local_irq_save(flags);            \
-		__ret = raw_res_spin_lock(lock);  \
-		if (__ret)                        \
+		preempt_disable();                \
+		__ret = res_spin_lock(lock);      \
+		if (__ret) {                      \
 			local_irq_restore(flags); \
+			preempt_enable();         \
+		}                                 \
 		__ret;                            \
 	})
 
-#define raw_res_spin_unlock_irqrestore(lock, flags) ({ raw_res_spin_unlock(lock); local_irq_restore(flags); })
+#define raw_res_spin_unlock_irqrestore(lock, flags) \
+	({                                          \
+		res_spin_unlock(lock);              \
+		local_irq_restore(flags);           \
+		preempt_enable();                   \
+	})
 
 #endif /* __ASM_GENERIC_RQSPINLOCK_H */

base-commit: e43ffb69e0438cddd72aaa30898b4dc446f664f8
-- 
2.54.0


^ permalink raw reply related

* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Florian Weimer @ 2026-06-09  6:08 UTC (permalink / raw)
  To: Jann Horn
  Cc: Mateusz Guzik, Christian Brauner, Li Chen, Kees Cook,
	Alexander Viro, linux-fsdevel, linux-api, linux-kernel, linux-mm,
	linux-arch, linux-doc, linux-kselftest, x86, Arnd Bergmann,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Jan Kara, Jonathan Corbet,
	Shuah Khan
In-Reply-To: <CAG48ez38OEE8ZPLyU6nr9=cYx-hMsdoh5WRrv-GMZGMDKyyOTA@mail.gmail.com>

* Jann Horn:

>> Per the above, the primary win would stem from *NOT* messing with mm.
>
> As you write below, I think we have that with CLONE_VM? The C function
> vfork() is kind of a terrible API because of its returns-twice
> behavior, but I think if process cloning with CLONE_VM|CLONE_VFORK was
> wrapped by libc in a way similar to clone() (with the child executing
> a separate handler function), or if it was used in the implementation
> of some higher-level process-spawning API, it would be a perfectly
> fine API?

No, there is still a problem with SIGTSTP handling because we cannot
atomically unmask the signal during execve.  We need to unblock SIGTSTP
before execve in the new process, but this means that it can get
suspended by SIGTSTP.  Consequently, the execve never happens and the
original process is stuck in vfork:

  posix_spawn: parent can get stuck in uninterruptible sleep if child
  receives SIGTSTP early enough
  <https://inbox.sourceware.org/libc-help/2921668c-773e-465d-9480-0abb6f979bf9@www.fastmail.com/>

More on the low-level side, it's difficult to make sure that execve gets
a consistent snapshot of the environ vector.  Both vfork and execve need
to be async-signal-safe.  Any locking or memory allocation (except for
the stack …) persists in the original process after vfork returns.  The
environ vector can be large, so making a copy on the stack is not ideal.
It's even harder for getenv/setenv/unsetenv implementations that use
locking instead of software transactional memory.

In general, I prefer the vfork+execve API over things like posix_spawn
because eventually, you have dependencies between the syslets, or need
control flow.  This introduces a lot of complexity.  Conceptually,
vfork+execve is much simpler, and in many ways quite safe (even mutexes
work as long as they do not need a correct TID).
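
For reference, the clone()-style wrapping discussed above (child runs a
separate handler function instead of returning twice) can be sketched
with the existing glibc clone() wrapper. This is a minimal illustration
only: spawn_vfork_style(), spawn_child(), struct spawn_args and the
64 KiB stack size are assumptions, and the sketch does nothing about the
SIGTSTP or environ snapshot problems described above.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (64 * 1024)	/* arbitrary size for the sketch */

struct spawn_args {
	const char *path;
	char *const *argv;
	char *const *envp;
};

/* Runs in the child, sharing the parent's address space until execve(). */
static int spawn_child(void *data)
{
	const struct spawn_args *args = data;

	execve(args->path, args->argv, args->envp);
	_exit(127);	/* only reached if execve() failed */
}

/* Returns the child's PID, or -1 on error. */
static pid_t spawn_vfork_style(const char *path, char *const argv[],
			       char *const envp[])
{
	struct spawn_args args = { path, argv, envp };
	char *stack = malloc(CHILD_STACK_SIZE);
	pid_t pid;

	if (!stack)
		return -1;

	/* The stack grows down on common architectures: pass its top. */
	pid = clone(spawn_child, stack + CHILD_STACK_SIZE,
		    CLONE_VM | CLONE_VFORK | SIGCHLD, &args);

	/* CLONE_VFORK: the child has called execve() or exited by now. */
	free(stack);
	return pid;
}
```

The caller reaps the child with waitpid() as usual. Because the parent
is suspended until the child execs or exits, both the on-stack args and
the heap-allocated child stack stay valid for as long as the child
needs them.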

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH v2 2/5] lib/bitrev: Introduce GENERIC_BITREVERSE
From: Jinjie Ruan @ 2026-06-09  1:53 UTC (permalink / raw)
  To: Yury Norov, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexandre Ghiti, Yury Norov, Rasmus Villemoes, Arnd Bergmann,
	Eric Biggers, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Andrew Morton, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, linux-kernel, linux-riscv, linux-arch, netdev,
	bpf
In-Reply-To: <20260506175207.110893-3-ynorov@nvidia.com>



On 5/7/2026 1:52 AM, Yury Norov wrote:
> The generic bit reversal implementation is controlled by
> !HAVE_ARCH_BITREVERSE. This makes it difficult for architectures to
> provide a hardware-accelerated implementation while being able to
> fall back to the generic version if needed.
> 
> This patch adds GENERIC_BITREVERSE, so bitreverse API is controlled by
> BITREVERSE, GENERIC_BITREVERSE and HAVE_ARCH_BITREVERSE options. The
> relationship between them is described as follows:
> 
>  - BITREVERSE is selected by user code; it's required to generate the API;
>  - Architectures may select HAVE_ARCH_BITREVERSE and provide an arch
>    implementation in arch/$(ARCH)/include/asm/bitrev.h.
>  - if HAVE_ARCH_BITREVERSE isn't set, BITREVERSE selects GENERIC_BITREVERSE;
>  - if GENERIC_BITREVERSE is set and HAVE_ARCH_BITREVERSE is not, the kernel
>    provides generic implementation only, and wires bitrevXX() to it.
>  - if HAVE_ARCH_BITREVERSE is set and GENERIC_BITREVERSE is not, the arch
>    code provides __arch_bitrevXX(), and it is wired to bitrevXX();
>  - if both GENERIC_BITREVERSE and HAVE_ARCH_BITREVERSE are selected, the kernel
>    generates generic___bitrev(), but wires bitrev() to the __arch_bitrev().
> 
> The last option allows architectures to use generic___bitrev() as a
> fallback option.
> 
> Drivers and core code should never select GENERIC_BITREVERSE or
> HAVE_ARCH_BITREVERSE explicitly.
> 
> Architectures that require generic bitreverse API as a fallback should
> explicitly enable GENERIC_BITREVERSE together with HAVE_ARCH_BITREVERSE.
> 
> Signed-off-by: Yury Norov <ynorov@nvidia.com>
> ---
>  lib/Kconfig  | 12 ++++++++++++
>  lib/Makefile |  2 +-
>  lib/bitrev.c |  3 ---
>  3 files changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/lib/Kconfig b/lib/Kconfig
> index d8e7e89ae320..a33988adfaa3 100644
> --- a/lib/Kconfig
> +++ b/lib/Kconfig
> @@ -54,6 +54,7 @@ config PACKING_KUNIT_TEST
>  
>  config BITREVERSE
>  	tristate
> +	select GENERIC_BITREVERSE if !HAVE_ARCH_BITREVERSE
>  
>  config HAVE_ARCH_BITREVERSE
>  	bool
> @@ -63,6 +64,17 @@ config HAVE_ARCH_BITREVERSE
>  	  This option enables the use of hardware bit-reversal instructions on
>  	  architectures which support such operations.
>  
> +config GENERIC_BITREVERSE
> +	tristate
> +	depends on BITREVERSE
> +	help
> +	  Generic bit reversal implementation. Drivers should never enable
> +	  it explicitly. Instead, enable BITREVERSE.


The later riscv patch forces GENERIC_BITREVERSE on even when
HAVE_ARCH_BITREVERSE=y, which triggers a Kconfig unmet direct
dependency warning:

warning: (RISCV) selects GENERIC_BITREVERSE which has unmet direct
dependencies (BITREVERSE)

This happens because "select" ignores "depends on" clauses and can force
a tristate symbol to y even when its dependency BITREVERSE is only =m.
The warning is a symptom of an invalid dependency chain.

Link:
https://lore.kernel.org/all/20260506214943.1AAE8C2BCB0@smtp.kernel.org/

> +
> +	  Architectures may want to select it as a fall-back option for
> +	  HAVE_ARCH_BITREVERSE, when the hardware-accelerated bit reverse
> +	  instruction set is optional, like RISC-V ZBKB extension.
> +
>  config ARCH_HAS_STRNCPY_FROM_USER
>  	bool
>  
> diff --git a/lib/Makefile b/lib/Makefile
> index f33a24bf1c19..23e07d19d01c 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -145,7 +145,7 @@ obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
>  obj-$(CONFIG_LIST_HARDENED) += list_debug.o
>  obj-$(CONFIG_DEBUG_OBJECTS) += debugobjects.o
>  
> -obj-$(CONFIG_BITREVERSE) += bitrev.o
> +obj-$(CONFIG_GENERIC_BITREVERSE) += bitrev.o
>  obj-$(CONFIG_LINEAR_RANGES) += linear_ranges.o
>  obj-$(CONFIG_PACKING)	+= packing.o
>  obj-$(CONFIG_PACKING_KUNIT_TEST) += packing_test.o
> diff --git a/lib/bitrev.c b/lib/bitrev.c
> index 81b56e0a7f32..05088231f31f 100644
> --- a/lib/bitrev.c
> +++ b/lib/bitrev.c
> @@ -1,5 +1,4 @@
>  // SPDX-License-Identifier: GPL-2.0-only
> -#ifndef CONFIG_HAVE_ARCH_BITREVERSE
>  #include <linux/types.h>
>  #include <linux/module.h>
>  #include <linux/bitrev.h>
> @@ -43,5 +42,3 @@ const u8 byte_rev_table[256] = {
>  	0x1f, 0x9f, 0x5f, 0xdf, 0x3f, 0xbf, 0x7f, 0xff,
>  };
>  EXPORT_SYMBOL_GPL(byte_rev_table);
> -
> -#endif /* CONFIG_HAVE_ARCH_BITREVERSE */


^ permalink raw reply
