Linux Documentation

Linux Documentation
 help / color / mirror / Atom feed

* Re: [PATCH v3] killswitch: add per-function short-circuit mitigation primitive
From: Sasha Levin @ 2026-05-19  0:22 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-kernel, linux-doc, linux-kselftest, bpf, live-patching,
	Greg Kroah-Hartman, Andrew Morton, Jonathan Corbet,
	Mathieu Desnoyers, Joshua Peisach, Florian Weimer, Breno Leitao,
	Anthony Iliopoulos, Michal Hocko, Jiri Olsa
In-Reply-To: <CAPhsuW44UX663Au=WwHz8MVwnQgLkjxOqpJSCKxNiv3=RpZvqw@mail.gmail.com>

On Mon, May 18, 2026 at 04:59:08PM -0700, Song Liu wrote:
>On Mon, May 18, 2026 at 6:33 AM Sasha Levin <sashal@kernel.org> wrote:
>>
>> On Sun, May 17, 2026 at 11:37:36PM -0700, Song Liu wrote:
>> >On Sun, May 17, 2026 at 6:49 AM Sasha Levin <sashal@kernel.org> wrote:
>> >> * fail_function (CONFIG_FUNCTION_ERROR_INJECTION) is disabled in
>> >>   most production kernels. Even where enabled, it only works on
>> >>   functions pre-annotated with ALLOW_ERROR_INJECTION() in source -
>> >>   no help for a freshly-disclosed CVE. The debugfs UI is blocked by
>> >>   lockdown=integrity and the override is probabilistic.
>> >>
>> >> * BPF override (bpf_override_return) honors the same
>> >>   ALLOW_ERROR_INJECTION() whitelist, and BPF itself is off in many
>> >>   production kernels. Even where on, the operator interface is
>> >>   "load a verified BPF program," not a one-line write.
>> >
>> >If it is OK for killswitch to attach to any kernel functions, do we still
>> >need ALLOW_ERROR_INJECTION() for fail_function and BPF
>> >override? Shall we instead also allow fail_function and BPF override
>> >to attach to any kernel functions?
>>
>> I don't think so. ALLOW_ERROR_INJECTION is not a security mechanism, it's an
>> integrity/safety mechanism for both bpf and fault injection.
>>
>> It protects against a "developer or CI script doing legitimate fault injection
>> accidentally panics the box" scenario, not an "attacker gets in" one.
>
>There really isn't a clear boundary between "security mechanism" and
>"non-security mechanism". As we are making killswitch available
>everywhere under root, users will soon learn to use it to do fault injection,
>and potentially much more scary things. (Think about agents with sudo
>access).

Wouldn't the same argument apply to /dev/mem? If you enable that, and you give
whatever tool/agent/etc access to the interface, you're bound to have a really
bad time unless you know what you're doing?

root can already load a killswitch equivalent module, right? there's nothing
really new with killswitch.

-- 
Thanks,
Sasha

^ permalink raw reply

* Re: [PATCH] Documentation: hwmon: ad7314: document sysfs interface
From: Chen-Shi-Hong @ 2026-05-19  0:26 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Jonathan Corbet, Shuah Khan, linux-hwmon, linux-doc, linux-kernel
In-Reply-To: <006d3f24-b1cd-4fad-b8b6-96ddd904c283@roeck-us.net>

Hi Guenter,

Understood. Thank you for the feedback.

I will avoid sending this kind of low-value documentation patch in the future and will be more careful in judging whether a change is worth reviewers' time.

Thanks,
Chen-Shi-Hong

^ permalink raw reply

* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Sasha Levin @ 2026-05-19  0:31 UTC (permalink / raw)
  To: Paul Moore
  Cc: Song Liu, corbet, akpm, skhan, linux-doc, linux-kernel,
	linux-kselftest, gregkh, linux-security-module
In-Reply-To: <CAHC9VhS1DJNs9gDB6gD9WKhL08giSVajBskZ+=mY0AWRCAsw7Q@mail.gmail.com>

On Mon, May 18, 2026 at 05:29:32PM -0400, Paul Moore wrote:
>From my perspective there are two different issues here: should
>killswitch be a LSM, and should killswitch leverage kprobes to be able
>to "kill" security related symbols.  After all, are we okay with
>killswitch killing capable() and friends?

killswitch doesn't do it on it's own. It may be instructed by root to do that,
at which point that is root's problem.

>In my opinion, making killswitch an LSM is more of a procedural item
>that deals with how we view a capability like killswitch.  I
>personally view killswitch as somewhat similar to Lockdown, which is
>why I made the suggestion.

Maybe I'm not all that familiar with LSMs, but we would need to be able to stop
"random" code paths from executing, and I don't think we can create LSM hooks
at that granularity, no?

>The use of kprobes, while an interesting idea, presents problems as
>allowing any kernel symbol to be killed introduces the potential for
>security regressions.  As a reminder, some LSMs, as well as other
>kernel subsystems, have mechanisms in place to restrict root and/or
>enforce one-way configuration locks; while many people equate "root"
>with full control, in many cases today that is not strictly correct.

killswitch "complies" with lockdown. Is there a different scenario which we
should be blocking?

>Yes, kprobes have been around for some time, this is not a new
>problem, but killswitch makes it far more convenient and accessible to
>do dangerous things with kprobes.  If killswitch makes it past the RFC
>stage without any significant changes to its kill mechanism, we may
>need to start considering more liberal usage of NOKPROBE_SYMBOL()
>which I think would be an unfortunate casualty.

Why? If I don't really mind the security impact, I want to be able to have a
killswitch-like interface on my systems. If an attacker is in my systems,
killswitch is the least of my concerns I think.

If you are security concious, just don't enable CONFIG_KILLSWITCH?

-- 
Thanks,
Sasha

^ permalink raw reply

* Re: [PATCH 4/8] drm/panthor: Add support for protected memory allocation in panthor
From: Chia-I Wu @ 2026-05-19  0:36 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Liviu Dudau, Marcin Ślusarz, Ketil Johnsen, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
	Christian König, Steven Price, Daniel Almeida, Alice Ryhl,
	Matthias Brugger, AngeloGioacchino Del Regno, dri-devel,
	linux-doc, linux-kernel, linux-media, linaro-mm-sig,
	linux-arm-kernel, linux-mediatek, Florent Tomasin, nd
In-Reply-To: <20260518091650.5a7a4f4a@fedora>

On Mon, May 18, 2026 at 12:16 AM Boris Brezillon
<boris.brezillon@collabora.com> wrote:
>
> On Wed, 13 May 2026 12:31:32 -0700
> Chia-I Wu <olvaffe@gmail.com> wrote:
>
> > On Tue, May 12, 2026 at 8:39 AM Liviu Dudau <liviu.dudau@arm.com> wrote:
> > >
> > > On Tue, May 12, 2026 at 04:11:11PM +0200, Boris Brezillon wrote:
> > > > On Tue, 12 May 2026 14:47:27 +0100
> > > > Liviu Dudau <liviu.dudau@arm.com> wrote:
> > > >
> > > > > On Thu, May 07, 2026 at 01:53:56PM +0200, Boris Brezillon wrote:
> > > > > > On Thu, 7 May 2026 11:02:26 +0200
> > > > > > Marcin Ślusarz <marcin.slusarz@arm.com> wrote:
> > > > > >
> > > > > > > On Tue, May 05, 2026 at 06:15:23PM +0200, Boris Brezillon wrote:
> > > > > > > > > @@ -277,9 +286,21 @@ int panthor_device_init(struct panthor_device *ptdev)
> > > > > > > > >                     return ret;
> > > > > > > > >     }
> > > > > > > > >
> > > > > > > > > +   /* If a protected heap name is specified but not found, defer the probe until created */
> > > > > > > > > +   if (protected_heap_name && strlen(protected_heap_name)) {
> > > > > > > >
> > > > > > > > Do we really need this strlen() > 0? Won't dma_heap_find() fail is the
> > > > > > > > name is "" already?
> > > > > > >
> > > > > > > If dma_heap_find() will fail, then the whole probe with fail too.
> > > > > > > This check prevents that.
> > > > > >
> > > > > > Yeah, that's also a questionable design choice. I mean, we can
> > > > > > currently probe and boot the FW even though we never setup the
> > > > > > protected FW sections, so why should we defer the probe here? Can't we
> > > > > > just retry the next time a group with the protected bit is created and
> > > > > > fail if we can find a protected heap?
> > > > >
> > > > > The problem we have with the current firmware is that it does a number of setup steps at "boot"
> > > > > time only. One of the steps is preparing its internal structures for when it enters protected
> > > > > mode and it stores them in the buffer passed in at firmware loading. We cannot later run the
> > > > > process when we have a group with protected mode set.
> > > >
> > > > No, but we can force a full/slow reset and have that thing
> > > > re-initialized, can't we? I mean, that's basically what we do when a
> > > > fast reset fails: we re-initialize all the sections and reset again, at
> > > > which point the FW should start from a fresh state, and be able to
> > > > properly initialize the protected-related stuff if protected sections
> > > > are populated. Am I missing something?
> > >
> > > Right, we can do that. For some reason I keep associating the reset with the
> > > error handling and not with "normal" operations.
> > I kind of hope we end up with either
> >
> >  - panthor knows the exact heap to use and fails with EPROBE_DEFER if
> > the heap is missing, or
> >  - panthor gets a dma-buf from userspace and does the full reset
> >    - userspace also needs to provide a dma-buf for each protected
> > group for the suspend buffer
> >
> > than something in-between. The latter is more ad-hoc and basically
> > kicks the issue to the userspace.
>
> Indeed, the second option is more ad-hoc, but when you think about it,
> userspace has to have this knowledge, because it needs to know the
> dma-heap to use for buffer allocation that cross a device boundary
> anyway. Think about frames produced by a video decoder, and composited
> by the GPU into a protected scanout buffer that's passed to the KMS
> device. Why would the GPU driver be source of truth when it comes to
> choosing the heap to use to allocate protected buffers for the video
> decoder or those used for the display?
I don't think the GPU driver is ever the source of truth. If the
system integrator wants to specify the source of truth (SoT) from
kernel space, they should use the device tree (or module params /
config options). If they want to specify the SoT in userspace, then we
don't really care how it is done other than providing an ioctl.
Panthor is always on the receiving end.

If we don't want to delay this functionality, but it takes time to
converge on SoT, maybe a solution that is not a long-term promise can
work? Of the options on the table (dt, module params, kconfig options,
ioctls), a kconfig option, potentially marked as experimental, seems
like a good candidate.

>
> >
> > For the former, expressing the relation in DT seems to be the best,
> > but only if possible :-). Otherwise, a kconfig option (instead of
> > module param) should be easier to work with.
> >
> > Looking at the userspace implementation, can we also have an panthor
> > ioctl to return the heap to userspace?
>
> Yes, it's something we can add, but again, I'm questioning the
> usefulness of this: how can we ensure the heap used by panthor to
> allocate its protected FW buffers is suitable for scanout buffers
> (buffers that can be used by display drivers). There needs to be a glue
> leaving in usersland and taking the decision, and I'm not too sure
> trusting any of the component in the chain (vdec, gpu, display) is the
> right thing to do.
The heap returned by panthor is only for panfrost/panvk. It says
nothing about compatibility with other components on the system.

^ permalink raw reply

* Re: [RFC PATCH 2/5] mm/damon/core: cap effective quota size to total monitored memory
From: SeongJae Park @ 2026-05-19  0:38 UTC (permalink / raw)
  To: Ravi Jonnalagadda
  Cc: SeongJae Park, damon, linux-mm, linux-kernel, linux-doc, akpm,
	corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun
In-Reply-To: <CALa+Y14PXA_anNdvJCzx4RfKoKj6hNmEG39KUvMALtOBznprkw@mail.gmail.com>

On Sun, 17 May 2026 22:22:34 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:

> On Sun, May 17, 2026 at 11:37 AM SeongJae Park <sj@kernel.org> wrote:
> >
> > Hello Ravi,
> >
> > On Sat, 16 May 2026 14:03:54 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:
> >
> > > The DAMOS quota goal tuner can compute an effective size (esz) larger
> > > than the total monitored memory because it integrates over cumulative
> > > deltas without bounding by the actual workload size.  Once esz exceeds
> > > total monitored memory, the per-tick "remaining quota" arithmetic
> > > stops being meaningful: any scheme can apply to the entire monitored
> > > space and "remaining" stays positive indefinitely.
> >
> > Nice finding!
> >
> > >
> > > Cap esz to the total size of all currently monitored regions as a
> > > final bound after all other quota calculations.  Add
> > > damon_ctx_total_monitored_sz() helper that sums region sizes across
> > > all targets.
> >
> > You could also make an arbitrary cap by setting the static size quota.  That
> > is, if there are not only quota goal but also the size quota and/or time quota,
> > and the different types of quotas disagree about the real quota, DAMOS uses
> > smallest quota.  You could read damos_set_effective_quota() code and kernel-doc
> > comment of 'struct damos_quota' for more details.
> >
> > So you could apply the total monitoring region size cap by setting the size
> > quota of the total monitoring region size.  Could that work for you?
> >
> > Adding the total monitoring region size cap makes sense to me, and I think that
> > will make user experience better.  But, if the size quota based cap works, that
> > could also be handled on user space in an easier and even a betetr way.  If so,
> > I'd prefer the direction, to reduce kernel code complexity.  What do you think?
> 
> Hello SJ,
> 
> Agreed.  quota->sz combined with the smallest-quota-wins rule in
> damos_set_effective_quota does express this cap from userspace
> without kernel changes, and keeping the kernel side clean is the
> right call.
> 
> If the UX argument carries weight later, I'm happy to respin v2
> with sashiko fixes addressed.

Makes sense.  I find no change on the weight for now.  If someone else
including myself or you in the future claims again, we could revisit.


Thanks,
SJ

[...]

^ permalink raw reply

* Re: [PATCH v8 0/8] KVM: x86: nSVM: Improve PAT virtualization
From: Sean Christopherson @ 2026-05-19  0:41 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Jonathan Corbet, Shuah Khan,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, kvm, linux-doc, linux-kernel, Yosry Ahmed,
	Jim Mattson
In-Reply-To: <20260407190343.325299-1-jmattson@google.com>

On Tue, 07 Apr 2026 12:03:23 -0700, Jim Mattson wrote:
> Currently, KVM's implementation of nested SVM treats the PAT MSR the same
> way whether or not nested NPT is enabled: L1 and L2 share a single
> PAT. However, the AMD APM specifies that when nested NPT is enabled, the host
> (L1) and the guest (L2) should have independent PATs: hPAT for L1 and gPAT
> for L2.
> 
> This patch series implements independent PATs for L1 and L2 when nested NPT
> is enabled, but only when a new quirk, KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT,
> is disabled. By default, the quirk is enabled, preserving KVM's legacy
> behavior. When the quirk is disabled, KVM correctly virtualizes a separate
> PAT register for L2, using the g_pat field in the VMCB.
> 
> [...]

Applied to kvm-x86 svm.  Yosry and/or Jim, please double check the result, the
goof with patch 5 was slightly more annoying than I was expecting.

Thanks!

[1/8] KVM: x86: Define KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT
      https://github.com/kvm-x86/linux/commit/822790ab0149
[2/8] KVM: x86: nSVM: Clear VMCB_NPT clean bit when updating hPAT from guest mode
      https://github.com/kvm-x86/linux/commit/0a8aeb15848e
[3/8] KVM: x86: nSVM: Cache and validate vmcb12 g_pat
      https://github.com/kvm-x86/linux/commit/4b83e4ba836e
[4/8] KVM: x86: nSVM: Set vmcb02.g_pat correctly for nested NPT
      https://github.com/kvm-x86/linux/commit/02233c73f8ae
[6/8] KVM: x86: nSVM: Save gPAT to vmcb12.g_pat on VMEXIT
      https://github.com/kvm-x86/linux/commit/d65cf222b899
[7/8] KVM: Documentation: document KVM_{GET,SET}_NESTED_STATE for SVM
      https://github.com/kvm-x86/linux/commit/32ebdbce3b23
[8/8] KVM: x86: nSVM: Save/restore gPAT with KVM_{GET,SET}_NESTED_STATE
      https://github.com/kvm-x86/linux/commit/4f256d5770fe

--
https://github.com/kvm-x86/linux/tree/next

^ permalink raw reply

* Re: [PATCH 00/28] mm/damon: introduce data attributes monitoring
From: Andrew Morton @ 2026-05-19  0:54 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Liam R. Howlett, David Hildenbrand, Jonathan Corbet,
	Lorenzo Stoakes, Masami Hiramatsu, Mathieu Desnoyers,
	Michal Hocko, Mike Rapoport, Shuah Khan, Shuah Khan,
	Steven Rostedt, Suren Baghdasaryan, Vlastimil Babka, damon,
	linux-doc, linux-kernel, linux-kselftest, linux-mm,
	linux-trace-kernel
In-Reply-To: <20260518234119.97569-1-sj@kernel.org>

On Mon, 18 May 2026 16:40:48 -0700 SeongJae Park <sj@kernel.org> wrote:

> TL; DR
> ======
> 
> Extend DAMON for monitoring general data attributes other than accesses.
> The short term motivation is lightweight page type (e.g., belonging
> cgroup) aware monitoring.  In long term, this will help extending DAMON
> for multiple access events capture primitives (e.g., page faults and
> PMU) and eventually pivotting DAMON to a "Data Attributes Monitoring and
> Operations eNgine" in long term.

Added, thanks.

> Plan for Dropping RFC tag
> =========================
> 
> Making changes for feedback from myself, humans and Sashiko should be
> the major remaining work.
> 
> I'm currently hoping to drop the RFC tag by 7.2-rc1.
> 

I removed this section.



^ permalink raw reply

* Re: [PATCH v4 04/30] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
From: Dongli Zhang @ 2026-05-19  0:57 UTC (permalink / raw)
  To: David Woodhouse, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Thomas Gleixner,
	Sean Christopherson, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Dave Hansen, Vitaly Kuznetsov, x86, Marc Zyngier, Juergen Gross,
	Boris Ostrovsky, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Jack Allister, Joey Gouly, joe.jin, linux-doc, linux-kernel,
	xen-devel, linux-kselftest
In-Reply-To: <d3c461415e05345a9b82e6f995828c1ae64a4e61.camel@infradead.org>



On 2026-05-18 1:48 AM, David Woodhouse wrote:
> On Mon, 2026-05-18 at 00:52 -0700, Dongli Zhang wrote:
>> On 5/9/26 3:46 PM, David Woodhouse wrote:
> 
> Huh, I didn't write that then; it isn't September yet. Did you mean
> 2026-05-09? We aren't all in the US... 
> 
> Strictly speaking, you just misattributed a quote of mine, which is
> very poor form :)
> 
> What mailer are you using? Can it be fixed?

Thunderbird.

I have fixed the Thunderbird configuration. Does it look better to you?

> 
>>> From: Jack Allister <jalliste@amazon.com>
>>>
>>> Where kvm->arch.use_master_clock is false (because the host TSC is
>>> unreliable, or the guest TSCs are configured strangely), the KVM clock
>>> is *not* defined as a function of the guest TSC so KVM_GET_CLOCK_GUEST
>>> returns an error. In this case, as documented, userspace shall use the
>>> legacy KVM_GET_CLOCK ioctl. The loss of precision is acceptable in this
>>
>> The description here confused me a little. It sounds like userspace should call
>> KVM_SET_CLOCK if KVM_SET_CLOCK_GUEST fails. However, I assume it actually means
>> that userspace should do nothing extra if KVM_SET_CLOCK_GUEST fails, and simply
>> rely on the prior KVM_SET_CLOCK and KVM_VCPU_TSC_OFFSET workflow described in
>> patch 07. Is that correct?
> 
> Yes. If KVM_SET_CLOCK_GUEST doesn't work (which might be because
> KVM_GET_CLOCK_GUEST didn't work so userspace doesn't have the data in
> the first place, or because the actual ioctl returns failure), then
> userspace should rely on the old method using KVM_SET_CLOCK imprecisely
> instead. That includes on a migration from an older kernel that *lacks*
> KVM_GET_CLOCK_GUEST, of course.
> 
> I don't think it strictly matters whether userspace does KVM_SET_CLOCK
> first, then *tries* KVM_SET_CLOCK_GUEST, or whether it tries
> KVM_SET_CLOCK_GUEST and then only calls KVM_SET_CLOCK on failure? I'd
> probably be inclined not to use KVM_SET_CLOCK at all unless it is known
> to be needed?

I really appreciate guidelines like the ones below.

https://lore.kernel.org/all/20240522001817.619072-8-dwmw2@infradead.org

Assuming I am a user of the new API, I feel confused about whether the goal is
to replace KVM_SET_CLOCK with KVM_SET_CLOCK_GUEST, or whether the latter is
meant to supplement the former.


If we are going to use KVM_SET_CLOCK_GUEST when KVM_SET_CLOCK is not needed, I
would appreciate it if the API could carry more data in addition to struct
pvclock_vcpu_time_info.

+#define KVM_SET_CLOCK_GUEST    _IOW(KVMIO, 0xd6, struct pvclock_vcpu_time_info)
+#define KVM_GET_CLOCK_GUEST    _IOR(KVMIO, 0xd7, struct pvclock_vcpu_time_info)


In the future, if we need to carry additional data, we could simply reuse the
padding fields instead of introducing another KVM_SET_CLOCK_GUEST2.

The following is an example of how additional data could be carried.

KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c68dc1b577eabd5605c6c7c08f3e07ae18d30d5d


So far, I believe this guideline resolves most of my concerns.

https://lore.kernel.org/all/20240522001817.619072-8-dwmw2@infradead.org

> 
>>> +4.145 KVM_GET_CLOCK_GUEST
>>> +----------------------------
>>> +
>>> +:Capability: none
>>> +:Architectures: x86_64
>>> +:Type: vcpu ioctl
>>> +:Parameters: struct pvclock_vcpu_time_info (out)
>>> +:Returns: 0 on success, <0 on error
>>> +
>>> +Retrieves the current time information structure used for KVM/PV clocks,
>>> +in precisely the form advertised to the guest vCPU, which gives parameters
>>> +for a direct conversion from a guest TSC value to nanoseconds.
>>> +
>>> +When the KVM clock is not in "master clock" mode, for example because the
>>> +host TSC is unreliable or the guest TSCs are oddly configured, the KVM clock
>>> +is actually defined by the host CLOCK_MONOTONIC_RAW instead of the guest TSC.
>>> +In this case, the KVM_GET_CLOCK_GUEST ioctl returns -EINVAL.
>>> +
>>> +4.146 KVM_SET_CLOCK_GUEST
>>> +----------------------------
>>> +
>>> +:Capability: none
>>
>> Do we need a KVM_CHECK_EXTENSION capability for this? If userspace wants to
>> support the new API, should it detect availability via KVM_CHECK_EXTENSION, or
>> simply try the ioctl and handle failure?
> 
> That might be conventional, I suppose. But I suspect Jack's thinking
> was that userspace is going to have to *try* it anyway, and still might
> have to fall back to what KVM_SET_CLOCK can manage, so userspace
> probably wouldn't even bother to check that capability; it doesn't
> matter.
> 
> Since then, we've added some more attributes in this series though, and
> it probably is worth adding a cap which advertises them *all*?
> Something like KVM_CAP_CLOCK_PRECISION_API?

From an API user's perspective, userspace may need to distinguish between an API
failure and the API not being available.

I don't see any existing "Capability: none" entries in
Documentation/virt/kvm/api.rst.

> 
>>> +#ifdef CONFIG_X86_64
>>> +static int kvm_vcpu_ioctl_get_clock_guest(struct kvm_vcpu *v, void __user *argp)
>>> +{
>>> +	struct pvclock_vcpu_time_info hv_clock = {};
>>> +	struct kvm_vcpu_arch *vcpu = &v->arch;
>>> +	struct kvm_arch *ka = &v->kvm->arch;
>>> +	unsigned int seq;
>>> +
>>> +	/*
>>> +	 * If KVM_REQ_CLOCK_UPDATE is already pending, or if the pvclock
>>> +	 * has never been generated at all, call kvm_guest_time_update().
>>> +	 */
>>> +	if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, v) || !vcpu->hw_tsc_hz) {
>>
>> This was flagged by AI, and I am still checking whether it is a real issue.
>>
>> What happens if KVM_REQ_MASTERCLOCK_UPDATE and KVM_REQ_CLOCK_UPDATE are both
>> pending?
>>
>> From my perspective, I am also curious how we should reason about this in other
>> scenarios in the future. Specifically, when do we need to process
>> KVM_REQ_MASTERCLOCK_UPDATE before KVM_REQ_CLOCK_UPDATE, and when is it
>> acceptable not to? I noticed that kvm_cpuid() already processes only
>> KVM_REQ_CLOCK_UPDATE.
> 
> The way I've been thinking about it — and I'm only two cups of coffee
> into Monday so take those words literally and don't think of them as
> British understatement of something I believe is absolute truth — is
> that MASTERCLOCK_UPDATE is updating the actual clock for the whole VM,
> while CLOCK_UPDATE is about *putting* that information into the per-
> vCPU pvclock structures.
> 
> So after a MASTERCLOCK_UPDATE, we need to do a CLOCK_UPDATE on all
> vCPUs to disseminate the result. Which means that if CLOCK_UPDATE is
> already pending before a MASTERCLOCK_UPDATE, it's probably redundant
> and might as well be cleared because it's only going to get set *again*
> in kvm_end_pvclock_update()? 

Another scenario is when only MASTERCLOCK_UPDATE is pending and there is no
pending CLOCK_UPDATE.

In this scenario, is it fine to skip processing MASTERCLOCK_UPDATE before saving
pvclock_vcpu_time_info?

This should be a very rare scenario. Although it is not mandatory, I think most
users call these APIs only when the VM is already stopped. I am just curious how
I should handle this in the future if I am implementing similar code, that is,
processing a pending request outside vcpu_enter_guest().

> 
> 
>>> +	/*
>>> +	 * Calculate the guest TSC at the new reference point, and the
>>> +	 * corresponding KVM clock value according to user_hv_clock.
>>> +	 * Adjust kvmclock_offset so both definitions agree.
>>> +	 */
>>> +	guest_tsc = kvm_read_l1_tsc(v, ka->master_cycle_now);
>>> +	user_clk_ns = __pvclock_read_cycles(&user_hv_clock, guest_tsc);
>>> +	ka->kvmclock_offset = user_clk_ns - ka->master_kernel_ns;
>>
>> I used to explore adjusting ka->kvmclock_offset in KVM_SET_CLOCK based on the
>> old hv_clock and the new hv_clock long time ago. At that time, my concern was
>> what would happen if userspace provided bogus values. Theoretically, this is
>> possible with any ioctl. My concern may be unnecessary.
>>
>> Would it be helpful to validate that the delta is within a reasonable range,
>> e.g. that the drift can never be more than five minutes (forward or backward)?
> 
> Setting confidential guests aside, which have their own way of trusting
> the TSC and should never even *consider* using kvmclock, surely this is
> supposed to be *entirely* under the control of the VMM? The kernel has
> no business deciding what is 'bogus'?

Yes, I both think and agree that this is supposed to be entirely under the
control of the VMM.

Sometimes security researchers use fuzzing tools to interact with APIs in an
attempt to leak data or crash the hypervisor in order to turn it into a CVE. My
understanding is that, in the worst-case scenario here, the guest clock would
simply get stuck.

> 
> If a guest has been running for months on a previous host and is
> migrated to a new host, don't we expect that the KVM clock of the new
> VM on the new host is tweaked from its default near-zero after
> creation, to some large amount?
> 

Regarding live migration, my own investigation does not show a proportional
relationship between VM uptime and the amount of drift.

Just taking QEMU + KVM as an example: suppose TSC scaling is inactive, the
amount of drift does not depend on how long the VM has been running before live
migration.

Instead, it depends on the delta between when we call MSR_IA32_TSC and
KVM_GET_CLOCK, and between MSR_IA32_TSC and KVM_SET_CLOCK.

The guest TSC stops at P1 and resumes at P3.
The kvmclock stops at P2 and resumes at P4.

We expect P1 == P2 and P3 == P4.

On source host.

- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=0 ===> P1
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=1
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=2
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=3
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=4
... ...
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=N
- KVM_GET_CLOCK                               ===> P2

On target host.

- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=1 ===> P3
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=2
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=3
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=4
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=5
... ...
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=N
- KVM_SET_CLOCK                               ====> P4


Here is my equiation to predict the drift.

T1_ns  = P2 - P1 (nanoseconds)
T2_tsc = P4 - P3 (cycles)
T2_ns  = pvclock_scale_delta(T2_tsc,
                             old_hv_clock_src.tsc_to_system_mul,
                             old_hv_clock_src.tsc_shift)

if (T2_ns > T1_ns)
    backward drift: T2_ns - T1_ns
else if (T1_ns > T2_ns)
    forward drift: T1_ns - T2_ns


Theoretically, if P1 == P2 and P3 == P4, we won't encounter any kvm-clock drift.

Thank you very much!

Dongli Zhang

^ permalink raw reply

* Re: [RFC PATCH 4/5] mm/damon/paddr: skip free pageblocks in migration walk
From: SeongJae Park @ 2026-05-19  1:14 UTC (permalink / raw)
  To: Ravi Jonnalagadda
  Cc: SeongJae Park, damon, linux-mm, linux-kernel, linux-doc, akpm,
	corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun
In-Reply-To: <CALa+Y17nudor22aJvakfos3UegPgEG1M8N7cJPAxWX0Ca=MvfA@mail.gmail.com>

On Sun, 17 May 2026 22:38:51 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:

> On Sun, May 17, 2026 at 4:38 PM SeongJae Park <sj@kernel.org> wrote:
> >
> > On Sat, 16 May 2026 14:03:56 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:
> >
> > > damon_pa_migrate() walks every PFN in a region linearly, calling
> > > damon_get_folio() for each one.  On sparse physical address spaces
> > > (e.g., CXL-attached memory), a single DAMON region can span hundreds
> > > of gigabytes where most memory is free and sitting in the buddy
> > > allocator.  Most page lookups are fruitless and dominate kdamond
> > > tick time.
> >
> > On sparse address spaces, the problem would be large DAMON regions of offlined
> > memory.  The large DAMON regions that nearly all freed memory is another
> > problem that doesn't require the sparse address spaces.  If I'm not wrong, the
> > above paragraph could better clarified in my opinion.
> >
> > >
> > > Check at pageblock boundaries (2MB on x86_64) whether the block is
> > > entirely free.  If the first page of a pageblock is a buddy page at
> > > pageblock_order or higher, the entire block is free and can be
> > > skipped.
> > > Similarly skip pageblocks where pfn_to_online_page() returns
> > > NULL.
> > >
> > > This reduces the iteration from O(region_sz / PAGE_SIZE) to
> > > O(region_sz / pageblock_sz) + O(populated_pages).
> > >
> > > buddy_order_unsafe() is used without zone->lock.  A transient false
> > > positive (block becomes non-free between the PageBuddy and order
> > > checks) costs at most one tick of missed candidates on that block;
> > > the next tick re-scans.  No correctness consequence as DAMON walks
> > > are best-effort.
> >
> > I was initially thinking this is a good and reasonable optimization approach.
> > But on the second thought I get below questions.
> >
> > For large offlined memory space problem, couldn't we simply tune DAMON's
> > monitoring regions boundary to ignore the holes?
> >
> > For large free memory area, is it reasonable to assume such situations?  In
> > production, users will try to utilize as much memory of the system as possible.
> > Then, wouldn't there be such problematically large free memory area?
> >
> > Could you please enlighten me?
> >
> 
> Hi SJ,
> 
> You're right on the first point.  For static offlined memory
> holes (memory hotplug gaps, partial socket population, etc.) the
> right answer is configuring the monitoring region boundaries to
> exclude them upfront, not making the walk skip them at runtime.
> The changelog is clearer if I narrow the patch to the free-but-
> online case.

Thank you for clarifying, Ravi.

> 
> On the free-online case: I agree large free memory areas are
> not the steady state on a fully-utilized system.  The cases I
> had in mind are more limited:
> 
>    - A workload using a small part of a much larger range, with
>       the rest left as headroom (e.g. 64 GB used of a 512 GB
>       range).

Why would the user have that large amount of headroom?

> 
>   - Shared tiers where workloads are allocated and freed on their own
>     timelines.  Any single piece of free memory doesn't last
>     long, but on a busy system there's typically a meaningful
>     free fraction in the range at any point -- especially on a
>     slower tier, where workloads prefer faster memory first
>     when it's available.

I agree there could be reasonable amount of free memory.  But, I'm still not
feeling difficult to know would that be big enough to cause the issue in DAMOS.

> 
> The patch as written is a narrow optimization for those cases:
> the pageblock-aligned check is one extra read per
> pageblock_nr_pages PFNs (about 1 per 512 on x86_64), so it's
> effectively a no-op when the region is fully populated.
> 
> If you don't see those workloads as warranting the change, I'm
> happy to drop the patch.  If the framing is the issue more than
> the change itself, I can respin a v2 with:
> 
>   - the changelog narrowed to the free-but-online case (no
>     offlined-memory framing);
>   - any suggestions from you on sashiko's review comments.

I think your arguments make sense in general.  But I'm still not quite sure
what is the realistic size of the problem, so difficult to judge.  Having a
clearer and detailed use case and backing data would be nice.

I also got a little and trivial concern for this approach.  DAMOS quota system
assumes the cost of applying DAMOS action will be proportional to the size of
memory it is applied for.  After this patch is applied, the cost will depend on
amount of free or offline memory in the memory.  It might make users difficult
to predict the overhead of DAMOS.  I might be too picky and hallucinated, but
to be honest I'm not feeling 100% comfortable with this change.

For long term, we are working on extending DAMON for general data attributes
monitoring.  I pretty sure you also aware of that.  The v1 [1] is just added to
mm-new for more testing.  It is currently supporting anon page and belinging
memory cgroup attributes.  I'm planning to extend that a lot.  In future, DAMOS
might be able to target and filter memory based on the attributes monitoring
results.  Then, we may be able to extend it for monitoring online or freeness
of the memory and ask DAMOS to filter out or de-prioritize memory regions
having high proportion of free or offline memory.

So, long story short, I'd suggest to revisit this after a clear use case and
real problem is found, unless we have it right now.

[1] https://lore.kernel.org/20260518234119.97569-1-sj@kernel.org


Thanks,
SJ

[...]

^ permalink raw reply

* [PATCH v4 0/4] Introduce Per-CPU Work helpers (was QPW)
From: Leonardo Bras @ 2026-05-19  1:27 UTC (permalink / raw)
  To: Jonathan Corbet, Shuah Khan, Leonardo Bras, Peter Zijlstra,
	Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Jann Horn, Pedro Falcato, Brendan Jackman, Johannes Weiner,
	Zi Yan, Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Youngjun Park, Qi Zheng, Shakeel Butt,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Borislav Petkov (AMD),
	Randy Dunlap, Thomas Gleixner, Feng Tang, Dapeng Mi, Kees Cook,
	Marco Elver, Jakub Kicinski, Li RongQing, Eric Biggers,
	Paul E. McKenney, Nathan Chancellor, Miguel Ojeda, Nicolas Schier,
	Thomas Weißschuh, Douglas Anderson, Gary Guo,
	Christian Brauner, Pasha Tatashin, Masahiro Yamada, Coiby Xu,
	Frederic Weisbecker
  Cc: linux-doc, linux-kernel, linux-mm, linux-rt-devel

The problem:
Some places in the kernel implement a parallel programming strategy
consisting on local_locks() for most of the work, and some rare remote
operations are scheduled on target cpu. This keeps cache bouncing low since
cacheline tends to be mostly local, and avoids the cost of locks in non-RT
kernels, even though the very few remote operations will be expensive due
to scheduling overhead.

On the other hand, for RT workloads this can represent a problem: getting
an important workload scheduled out to deal with remote requests is
sure to introduce unexpected deadline misses.

The idea:
Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
In this case, instead of scheduling work on a remote cpu, it should
be safe to grab that remote cpu's per-cpu spinlock and run the required
work locally. That major cost, which is un/locking in every local function,
already happens in PREEMPT_RT.

Also, there is no need to worry about extra cache bouncing:
The cacheline invalidation already happens due to schedule_work_on().

This will avoid schedule_work_on(), and thus avoid scheduling-out an
RT workload.

Proposed solution:
A new interface called PerCPU Work (PW), which should replace
Work Queue in the above mentioned use case.

If CONFIG_PWLOCKS=n this interfaces just wraps the current
local_locks + WorkQueue behavior, so no expected change in runtime.

If CONFIG_PWLOCKS=y, and kernel boot option pwlocks=1,
pw_queue_on(cpu,...) will lock that cpu's per-cpu structure
and perform work on it locally. 

v3->v4:
- Mechanism name changed from QPW to PW/PWLOCKS. Helper funcions / API,
  file names and config options renamed accordingly.
- All members of the Per-CPU Work API now start with the same prefix 
  (Frederic Weisbecker)
- Improved style a bit, reviewed documentation

v2->v3:
- Use preempt_disable/preempt_enable on !CONFIG_PREEMPT_RT (Vlastimil Babka).
- Improve documentation to include local_qpw_lock on operations table
  (Leonardo Bras).
- Enable qpw=1 automatically if CPU isolation is enabled (Vlastimil Babka).

v1->v2:
- Introduce local_qpw_lock and unlock functions, move preempt_disable/
  preempt_enable to it (Leonardo Bras). This reduces performance
  overhead of the patch.
- Documentation and changelog typo fixes (Leonardo Bras).
- Fix places where preempt_disable/preempt_enable was not being
  correctly performed.
- Add performance measurements.

RFC->v1:

- Introduce CONFIG_QPW and qpw= kernel boot option to enable
  remote spinlocking and execution even on !CONFIG_PREEMPT_RT
  kernels (Leonardo Bras).
- Move buffer_head draining to separate workqueue (Marcelo Tosatti).
- Convert mlock per-CPU page lists to QPW (Marcelo Tosatti).
- Drop memcontrol convertion (as isolated CPUs are not targets
  of queue_work_on anymore).
- Rebase SLUB against Vlastimil's slab/next.
- Add basic document for QPW (Waiman Long).

The performance numbers, as measured by the following test program,
are as follows (v3, mechanics not changed since then):

CONFIG_PREEMPT_DYNAMIC=y
Unpatched kernel:                       60 cycles
Patched kernel, CONFIG_QPW=n:           62 cycles
Patched kernel, CONFIG_QPW=y, qpw=0:    62 cycles
Patched kernel, CONFIG_QPW=y, qpw=1:    75 cycles

CONFIG_PREEMPT_RT:
Unpatched kernel:                       95 cycles
Patched kernel, CONFIG_QPW=y, qpw=0:    99 cycles
Patched kernel, CONFIG_QPW=y, qpw=1:    97 cycles

kmalloc_bench.c:
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/timex.h>
#include <linux/preempt.h>
#include <linux/irqflags.h>
#include <linux/vmalloc.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Gemini AI");
MODULE_DESCRIPTION("A simple kmalloc performance benchmark");

static int size = 64; // Default allocation size in bytes
module_param(size, int, 0644);

static int iterations = 9000000; // Default number of iterations
module_param(iterations, int, 0644);

static int __init kmalloc_bench_init(void) {
    void **ptrs;
    cycles_t start, end;
    uint64_t total_cycles;
    int i;
    pr_info("kmalloc_bench: Starting test (size=%d, iterations=%d)\n", size, iterations);

    // Allocate an array to store pointers to avoid immediate kfree-reuse optimization
    ptrs = vmalloc(sizeof(void *) * iterations);
    if (!ptrs) {
        pr_err("kmalloc_bench: Failed to allocate pointer array\n");
        return -ENOMEM;
    }

    preempt_disable();
    start = get_cycles();

    for (i = 0; i < iterations; i++) {
        ptrs[i] = kmalloc(size, GFP_ATOMIC);
    }

    end = get_cycles();

    total_cycles = end - start;
    preempt_enable();

    pr_info("kmalloc_bench: Total cycles for %d allocs: %llu\n", iterations, total_cycles);
    pr_info("kmalloc_bench: Avg cycles per kmalloc: %llu\n", total_cycles / iterations);

    // Cleanup
    for (i = 0; i < iterations; i++) {
        kfree(ptrs[i]);
    }
    vfree(ptrs);

    return 0;
}

static void __exit kmalloc_bench_exit(void) {
    pr_info("kmalloc_bench: Module unloaded\n");
}

module_init(kmalloc_bench_init);
module_exit(kmalloc_bench_exit);

The following testcase triggers lru_add_drain_all on an isolated CPU
(that does sys_write to a file before entering its realtime
loop).

/*
 * Simulates a low latency loop program that is interrupted
 * due to lru_add_drain_all. To trigger lru_add_drain_all, run:
 *
 * blockdev --flushbufs /dev/sdX
 *
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <stdarg.h>
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

int cpu;

static void *run(void *arg)
{
        pthread_t current_thread;
        cpu_set_t cpuset;
        int ret, nrloops;
        struct sched_param sched_p;
        pid_t pid;
        int fd;
        char buf[] = "xxxxxxxxxxx";

        CPU_ZERO(&cpuset);
        CPU_SET(cpu, &cpuset);

        current_thread = pthread_self();   
        ret = pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset);
        if (ret) {
                perror("pthread_setaffinity_np failed\n");
                exit(0);
        }

        memset(&sched_p, 0, sizeof(struct sched_param));
        sched_p.sched_priority = 1;
        pid = gettid();
        ret = sched_setscheduler(pid, SCHED_FIFO, &sched_p);
        if (ret) {
                perror("sched_setscheduler");
                exit(0);
        }

        fd = open("/tmp/tmpfile", O_RDWR|O_CREAT|O_TRUNC);
        if (fd == -1) {
                perror("open");
                exit(0);
        }

        ret = write(fd, buf, sizeof(buf));
        if (ret == -1) {
                perror("write");
                exit(0);
        }

        do {
                nrloops = nrloops+2;
                nrloops--;
        } while (1);
}

int main(int argc, char *argv[])
{
        int fd, ret;
        pthread_t thread;
        long val;
        char *endptr, *str;
        struct sched_param sched_p;
        pid_t pid;

        if (argc != 2) {
                printf("usage: %s cpu-nr\n", argv[0]);
                printf("where CPU number is the CPU to pin thread to\n");
                exit(0);
        }
        str = argv[1];
        cpu = strtol(str, &endptr, 10);
        if (cpu < 0) {
                printf("strtol returns %d\n", cpu);
                exit(0);
        }
        printf("cpunr=%d\n", cpu);

        memset(&sched_p, 0, sizeof(struct sched_param));
        sched_p.sched_priority = 1;
        pid = getpid();
        ret = sched_setscheduler(pid, SCHED_FIFO, &sched_p);
        if (ret) {
                perror("sched_setscheduler");
                exit(0);
        }

        pthread_create(&thread, NULL, run, NULL);

        sleep(5000);

        pthread_join(thread, NULL);
}

Leonardo Bras (3):
  Introducing pw_lock() and per-cpu queue & flush work
  swap: apply new pw_queue_on() interface
  slub: apply new pw_queue_on() interface

Marcelo Tosatti (1):
  mm/swap: move bh draining into a separate workqueue

 MAINTAINERS                                   |   7 +
 .../admin-guide/kernel-parameters.txt         |  10 +
 Documentation/locking/pwlocks.rst             |  76 +++++
 init/Kconfig                                  |  35 +++
 kernel/Makefile                               |   2 +
 include/linux/pwlocks.h                       | 265 ++++++++++++++++++
 mm/internal.h                                 |   4 +-
 kernel/pwlocks.c                              |  47 ++++
 mm/mlock.c                                    |  51 +++-
 mm/page_alloc.c                               |   2 +-
 mm/slub.c                                     | 142 +++++-----
 mm/swap.c                                     | 109 ++++---
 12 files changed, 624 insertions(+), 126 deletions(-)
 create mode 100644 Documentation/locking/pwlocks.rst
 create mode 100644 include/linux/pwlocks.h
 create mode 100644 kernel/pwlocks.c


base-commit: 5200f5f493f79f14bbdc349e402a40dfb32f23c8
-- 
2.54.0

^ permalink raw reply

* Re: [RFC PATCH 5/5] mm/damon/paddr: add time budget to migration page walk
From: SeongJae Park @ 2026-05-19  1:27 UTC (permalink / raw)
  To: Ravi Jonnalagadda
  Cc: SeongJae Park, damon, linux-mm, linux-kernel, linux-doc, akpm,
	corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun
In-Reply-To: <CALa+Y17XTzjAK5ZyKAKZLN1cAE-+c+2DgqpmuHGWgjUAZMgkFg@mail.gmail.com>

On Sun, 17 May 2026 22:54:18 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:

> On Sun, May 17, 2026 at 4:43 PM SeongJae Park <sj@kernel.org> wrote:
> >
> > On Sat, 16 May 2026 14:03:57 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:
> >
> > > On populated physical address ranges the pageblock skip optimization
> > > alone is insufficient — most pageblocks contain at least one allocated
> > > page, so the walk still iterates millions of PFNs.
> >
> > So my questions to the fourth patch of this series are also applied here,
> > especially about the assumption of systems having most memory free.  I will
> > hold digging deep here until the high level discussion is completed.
> >
> Hello SJ,
> 
> Stepping back to look at this with fresh eyes, I think this
> patch is in the same bucket as patches 1 and 3 (full background
> on the patch 3 thread): it came out of the same parallel debug
> effort, where I was seeing long walks during the startup
> transient on a multi-hundred-GB monitored target -- before
> kdamond_split_regions() and damon_apply_min_nr_regions() had
> trimmed the initial regions down -- and was unsure whether
> those long walks were contributing to the NMI-side
> responsiveness issues I was chasing.
> 
> Once the actual NMI problem was fixed and the per-region work
> in steady state is bounded by DAMON's region splitting (and by
> the scheme's quota when one is set), the per-call cost in
> damon_pa_migrate() is already small enough that the budget
> isn't doing useful work.  cond_resched() after damon_migrate_pages()
> covers the preemption case.
> 
> If a real workload later shows a per-region walk long
> enough to matter, I'll re-evaluate then with concrete numbers.

Sounds good!

FYI, many parts of DAMON are designed assuming it will be used on production
environments that have long-running workload and prefer stability.  It helps
making good results in long run, but also make it difficult to understand it in
short term, especially on lab environments.

I learned that by grateful users including you, and therefore recently
developed the multiple quota tuning logics and failed regions charge ratio.  I
feel like such DAMON limitation has contributed to this case to confuse you.
Sorry if that was the case, and please feel free to share your pain points and
improvement ideas.  Every user's use case including yours does matter!


Thanks,
SJ

[...]

^ permalink raw reply

* [PATCH v4 1/4] Introducing pw_lock() and per-cpu queue & flush work
From: Leonardo Bras @ 2026-05-19  1:27 UTC (permalink / raw)
  To: Jonathan Corbet, Shuah Khan, Leonardo Bras, Peter Zijlstra,
	Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Jann Horn, Pedro Falcato, Brendan Jackman, Johannes Weiner,
	Zi Yan, Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Youngjun Park, Qi Zheng, Shakeel Butt,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Borislav Petkov (AMD),
	Randy Dunlap, Feng Tang, Dapeng Mi, Kees Cook, Marco Elver,
	Jakub Kicinski, Li RongQing, Eric Biggers, Paul E. McKenney,
	Nathan Chancellor, Nicolas Schier, Miguel Ojeda,
	Thomas Weißschuh, Thomas Gleixner, Douglas Anderson,
	Gary Guo, Christian Brauner, Pasha Tatashin, Coiby Xu,
	Masahiro Yamada, Frederic Weisbecker
  Cc: linux-doc, linux-kernel, linux-mm, linux-rt-devel,
	Marcelo Tosatti
In-Reply-To: <20260519012754.240804-1-leobras.c@gmail.com>

Some places in the kernel implement a parallel programming strategy
consisting on local_locks() for most of the work, and some rare remote
operations are scheduled on target cpu. This keeps cache bouncing low since
cacheline tends to be mostly local, and avoids the cost of locks in non-RT
kernels, even though the very few remote operations will be expensive due
to scheduling overhead.

On the other hand, for RT workloads this can represent a problem:
scheduling work on remote cpu that are executing low latency tasks
is undesired and can introduce unexpected deadline misses.

It's interesting, though, that local_lock()s in RT kernels become
spinlock(). We can make use of those to avoid scheduling work on a remote
cpu by directly updating another cpu's per_cpu structure, while holding
it's spinlock().

In order to do that, it's necessary to introduce a new set of functions to
make it possible to get another cpu's per-cpu "local" lock (pw_{un,}lock*)
and also do the corresponding queueing (pw_queue_on()) and flushing
(pw_flush()) helpers to run the remote work.

Users of non-RT kernels but with low latency requirements can select
similar functionality by using the CONFIG_PWLOCKS compile time option.

On CONFIG_PWLOCKS disabled kernels, no changes are expected, as every
one of the introduced helpers work the exactly same as the current
implementation:
pw_{un,}lock*()		->  local_{un,}lock*() (ignores cpu parameter)
pw_queue_on()  		->  queue_work_on()
pw_flush()		->  flush_work()

For PWLOCKS enabled kernels, though, pw_{un,}lock*() will use the extra
cpu parameter to select the correct per-cpu structure to work on,
and acquire the spinlock for that cpu.

pw_queue_on() will just call the requested function in the current
cpu, which will operate in another cpu's per-cpu object. Since the
local_locks() become spinlock()s in PWLOCKS enabled kernels, we are
safe doing that.

pw_flush() then becomes a no-op since no work is actually scheduled on a
remote cpu.

Some minimal code rework is needed in order to make this mechanism work:
The calls for local_{un,}lock*() on the functions that are currently
scheduled on remote cpus need to be replaced by either pw_{un,}lock_*(),
PWLOCKS enabled kernels they can reference a different cpu. It's also
necessary to use a pw_struct instead of a work_struct, but it just
contains a work struct and, in CONFIG_PWLOCKS, the target cpu.

This should have almost no impact on non-CONFIG_PWLOCKS kernels: few
this_cpu_ptr() will become per_cpu_ptr(,smp_processor_id()) on non-hotpath
functions.

On CONFIG_PWLOCKS kernels, this should avoid deadlines misses by
removing scheduling noise.

Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
 MAINTAINERS                                   |   7 +
 .../admin-guide/kernel-parameters.txt         |  10 +
 Documentation/locking/pwlocks.rst             |  76 +++++
 init/Kconfig                                  |  35 +++
 kernel/Makefile                               |   2 +
 include/linux/pwlocks.h                       | 265 ++++++++++++++++++
 kernel/pwlocks.c                              |  47 ++++
 7 files changed, 442 insertions(+)
 create mode 100644 Documentation/locking/pwlocks.rst
 create mode 100644 include/linux/pwlocks.h
 create mode 100644 kernel/pwlocks.c

diff --git a/MAINTAINERS b/MAINTAINERS
index c2c6d79275c6..7102031207c9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -21775,20 +21775,27 @@ QORIQ DPAA2 FSL-MC BUS DRIVER
 M:	Ioana Ciornei <ioana.ciornei@nxp.com>
 L:	linuxppc-dev@lists.ozlabs.org
 L:	linux-kernel@vger.kernel.org
 S:	Maintained
 F:	Documentation/ABI/stable/sysfs-bus-fsl-mc
 F:	Documentation/devicetree/bindings/misc/fsl,qoriq-mc.yaml
 F:	Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
 F:	drivers/bus/fsl-mc/
 F:	include/uapi/linux/fsl_mc.h
 
+PW Locks
+M:	Leonardo Bras <leobras.c@gmail.com>
+S:	Supported
+F:	Documentation/locking/pwlocks.rst
+F:	include/linux/pwlocks.h
+F:	kernel/pwlocks.c
+
 QT1010 MEDIA DRIVER
 L:	linux-media@vger.kernel.org
 S:	Orphan
 W:	https://linuxtv.org
 Q:	http://patchwork.linuxtv.org/project/linux-media/list/
 F:	drivers/media/tuners/qt1010*
 
 QUALCOMM ATH12K WIRELESS DRIVER
 M:	Jeff Johnson <jjohnson@kernel.org>
 L:	linux-wireless@vger.kernel.org
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 4d0f545fb3ec..68c8a6f9d227 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2810,20 +2810,30 @@ Kernel parameters
 			  If a queue's affinity mask contains only isolated
 			  CPUs then this parameter has no effect on the
 			  interrupt routing decision, though interrupts are
 			  only delivered when tasks running on those
 			  isolated CPUs submit IO. IO submitted on
 			  housekeeping CPUs has no influence on those
 			  queues.
 
 			The format of <cpu-list> is described above.
 
+	pwlocks=	[KNL,SMP] Select a behavior on per-CPU resource sharing
+			and remote interference mechanism on a kernel built with
+			CONFIG_PWLOCKS.
+			Format: { "0" | "1" }
+			0 - local_lock() + queue_work_on(remote_cpu)
+			1 - spin_lock() for both local and remote operations
+
+			Selecting 1 may be interesting for systems that want
+			to avoid interruption & context switches from IPIs.
+
 	iucv=		[HW,NET]
 
 	ivrs_ioapic	[HW,X86-64]
 			Provide an override to the IOAPIC-ID<->DEVICE-ID
 			mapping provided in the IVRS ACPI table.
 			By default, PCI segment is 0, and can be omitted.
 
 			For example, to map IOAPIC-ID decimal 10 to
 			PCI segment 0x1 and PCI device 00:14.0,
 			write the parameter as:
diff --git a/Documentation/locking/pwlocks.rst b/Documentation/locking/pwlocks.rst
new file mode 100644
index 000000000000..09f4a5417bc1
--- /dev/null
+++ b/Documentation/locking/pwlocks.rst
@@ -0,0 +1,76 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========
+PW (Per-CPU Work) locks
+=========
+
+Some places in the kernel implement a parallel programming strategy
+consisting on local_locks() for most of the work, and some rare remote
+operations are scheduled on target cpu. This keeps cache bouncing low since
+cacheline tends to be mostly local, and avoids the cost of locks in non-RT
+kernels, even though the very few remote operations will be expensive due
+to scheduling overhead.
+
+On the other hand, for RT workloads this can represent a problem:
+scheduling work on remote cpu that are executing low latency tasks
+is undesired and can introduce unexpected deadline misses.
+
+PW locks help to convert sites that use local_locks (for cpu local operations)
+and queue_work_on (for queueing work remotely, to be executed
+locally on the owner cpu of the lock) to a spinlocks.
+
+The lock is declared pw_lock_t type.
+The lock is initialized with pw_lock_init.
+The lock is locked with pw_lock (takes a lock and cpu as a parameter).
+The lock is unlocked with pw_unlock (takes a lock and cpu as a parameter).
+
+The pw_lock_irqsave function disables interrupts and saves current interrupt state,
+cpu as a parameter.
+
+For trylock variant, there is the pw_trylock_t type, initialized with
+pw_trylock_init. Then the corresponding pw_trylock and pw_trylock_irqsave.
+
+work_struct should be replaced by pw_struct, which contains a cpu parameter
+(owner cpu of the lock), initialized by INIT_PW.
+
+The queue work related functions (analogous to queue_work_on and flush_work) are:
+pw_queue_on and pw_flush.
+
+The behaviour of the PW lock functions is as follows:
+
+* !CONFIG_PWLOCKS (or CONFIG_PWLOCKS and pwlocks=off kernel boot parameter):
+        - pw_lock:			local_lock
+        - pw_lock_irqsave:		local_lock_irqsave
+        - pw_trylock:			local_trylock
+        - pw_trylock_irqsave:		local_trylock_irqsave
+        - pw_unlock:			local_unlock
+        - pw_lock_local:		local_lock
+        - pw_trylock_local:		local_trylock
+        - pw_unlock_local:		local_unlock
+        - pw_queue_on:         		queue_work_on
+        - pw_flush:	            	flush_work
+
+* CONFIG_PWLOCKS (and CONFIG_PWLOCKS_DEFAULT=y or pwlocks=on kernel boot parameter),
+        - pw_lock:			spin_lock
+        - pw_lock_irqsave:		spin_lock_irqsave
+        - pw_trylock:			spin_trylock
+        - pw_trylock_irqsave:		spin_trylock_irqsave
+        - pw_unlock:			spin_unlock
+        - pw_lock_local:		preempt_disable OR migrate_disable + spin_lock
+        - pw_trylock_local:		preempt_disable OR migrate_disable + spin_trylock
+        - pw_unlock_local:		preempt_enable OR migrate_enable + spin_unlock
+        - pw_queue_on:         		executes work function on caller cpu
+        - pw_flush:            		empty
+
+pw_get_cpu(work_struct), to be called from within per-cpu work function,
+returns the target cpu.
+
+On the locking functions above, there are the local locking functions
+(pw_lock_local, pw_trylock_local and pw_unlock_local) that must only
+be used to access per-CPU data from the CPU that owns that data,
+and never remotely. They disable preemption/migration and don't require
+a cpu parameter, making them a replacement for local_lock functions that
+does not introduce overhead.
+
+These should only be used when accessing per-CPU data of the local CPU.
+
diff --git a/init/Kconfig b/init/Kconfig
index 2937c4d308ae..3fb751dc4530 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -764,20 +764,55 @@ config CPU_ISOLATION
 	depends on SMP
 	default y
 	help
 	  Make sure that CPUs running critical tasks are not disturbed by
 	  any source of "noise" such as unbound workqueues, timers, kthreads...
 	  Unbound jobs get offloaded to housekeeping CPUs. This is driven by
 	  the "isolcpus=" boot parameter.
 
 	  Say Y if unsure.
 
+config PWLOCKS
+	bool "Per-CPU Work locks"
+	depends on SMP || COMPILE_TEST
+	default n
+	help
+	  Allow changing the behavior on per-CPU resource sharing with cache,
+	  from the regular local_locks() + queue_work_on(remote_cpu) to using
+	  per-CPU spinlocks on both local and remote operations.
+
+	  This is useful to give user the option on reducing IPIs to CPUs, and
+	  thus reduce interruptions and context switches. On the other hand, it
+	  increases generated code and will use atomic operations if spinlocks
+	  are selected.
+
+	  If set, will use the default behavior set in PWLOCKS_DEFAULT unless boot
+	  parameter pwlocks is passed with a different behavior.
+
+	  If unset, will use the local_lock() + queue_work_on() strategy,
+	  regardless of the boot parameter or PWLOCKS_DEFAULT.
+
+	  Say N if unsure.
+
+config PWLOCKS_DEFAULT
+	bool "Use per-CPU spinlocks by default on PWLOCKS"
+	depends on PWLOCKS
+	default n
+	help
+	  If set, will use per-CPU spinlocks as default behavior for per-CPU
+	  remote operations.
+
+	  If unset, will use local_lock() + queue_work_on(cpu) as default
+	  behavior for remote operations.
+
+	  Say N if unsure
+
 source "kernel/rcu/Kconfig"
 
 config IKCONFIG
 	tristate "Kernel .config support"
 	help
 	  This option enables the complete Linux kernel ".config" file
 	  contents to be saved in the kernel. It provides documentation
 	  of which kernel options are used in a running kernel or in an
 	  on-disk kernel.  This information can be extracted from the kernel
 	  image file with the script scripts/extract-ikconfig and used as
diff --git a/kernel/Makefile b/kernel/Makefile
index 6785982013dc..60ccad0699e7 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -135,20 +135,22 @@ obj-$(CONFIG_JUMP_LABEL) += jump_label.o
 obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
 obj-$(CONFIG_TORTURE_TEST) += torture.o
 
 obj-$(CONFIG_HAS_IOMEM) += iomem.o
 obj-$(CONFIG_RSEQ) += rseq.o
 obj-$(CONFIG_WATCH_QUEUE) += watch_queue.o
 
 obj-$(CONFIG_RESOURCE_KUNIT_TEST) += resource_kunit.o
 obj-$(CONFIG_SYSCTL_KUNIT_TEST) += sysctl-test.o
 
+obj-$(CONFIG_PWLOCKS) += pwlocks.o
+
 CFLAGS_kstack_erase.o += $(DISABLE_KSTACK_ERASE)
 CFLAGS_kstack_erase.o += $(call cc-option,-mgeneral-regs-only)
 obj-$(CONFIG_KSTACK_ERASE) += kstack_erase.o
 KASAN_SANITIZE_kstack_erase.o := n
 KCSAN_SANITIZE_kstack_erase.o := n
 KCOV_INSTRUMENT_kstack_erase.o := n
 
 obj-$(CONFIG_SCF_TORTURE_TEST) += scftorture.o
 
 $(obj)/configs.o: $(obj)/config_data.gz
diff --git a/include/linux/pwlocks.h b/include/linux/pwlocks.h
new file mode 100644
index 000000000000..3d79621655f9
--- /dev/null
+++ b/include/linux/pwlocks.h
@@ -0,0 +1,265 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PWLOCKS_H
+#define _LINUX_PWLOCKS_H
+
+#include "linux/spinlock.h"
+#include "linux/local_lock.h"
+#include "linux/workqueue.h"
+
+#ifndef CONFIG_PWLOCKS
+
+typedef local_lock_t pw_lock_t;
+typedef local_trylock_t pw_trylock_t;
+
+struct pw_struct {
+	struct work_struct work;
+};
+
+#define pw_lock_init(lock)				\
+	local_lock_init(lock)
+
+#define pw_trylock_init(lock)				\
+	local_trylock_init(lock)
+
+#define pw_lock(lock, cpu)				\
+	local_lock(lock)
+
+#define pw_lock_local(lock)				\
+	local_lock(lock)
+
+#define pw_lock_irqsave(lock, flags, cpu)		\
+	local_lock_irqsave(lock, flags)
+
+#define pw_lock_local_irqsave(lock, flags)		\
+	local_lock_irqsave(lock, flags)
+
+#define pw_trylock(lock, cpu)				\
+	local_trylock(lock)
+
+#define pw_trylock_local(lock)				\
+	local_trylock(lock)
+
+#define pw_trylock_irqsave(lock, flags, cpu)		\
+	local_trylock_irqsave(lock, flags)
+
+#define pw_unlock(lock, cpu)				\
+	local_unlock(lock)
+
+#define pw_unlock_local(lock)				\
+	local_unlock(lock)
+
+#define pw_unlock_irqrestore(lock, flags, cpu)		\
+	local_unlock_irqrestore(lock, flags)
+
+#define pw_unlock_local_irqrestore(lock, flags)		\
+	local_unlock_irqrestore(lock, flags)
+
+#define pw_lockdep_assert_held(lock)			\
+	lockdep_assert_held(lock)
+
+#define pw_queue_on(c, wq, pw)				\
+	queue_work_on(c, wq, &(pw)->work)
+
+#define pw_flush(pw)					\
+	flush_work(&(pw)->work)
+
+#define pw_get_cpu(pw)	smp_processor_id()
+
+#define pw_is_cpu_remote(cpu)		(false)
+
+#define INIT_PW(pw, func, c)				\
+	INIT_WORK(&(pw)->work, (func))
+
+#else /* CONFIG_PWLOCKS */
+
+DECLARE_STATIC_KEY_MAYBE(CONFIG_PWLOCKS_DEFAULT, pw_sl);
+
+typedef union {
+	spinlock_t sl;
+	local_lock_t ll;
+} pw_lock_t;
+
+typedef union {
+	spinlock_t sl;
+	local_trylock_t ll;
+} pw_trylock_t;
+
+struct pw_struct {
+	struct work_struct work;
+	int cpu;
+};
+
+#ifdef CONFIG_PREEMPT_RT
+#define preempt_or_migrate_disable migrate_disable
+#define preempt_or_migrate_enable migrate_enable
+#else
+#define preempt_or_migrate_disable preempt_disable
+#define preempt_or_migrate_enable preempt_enable
+#endif
+
+#define pw_lock_init(lock)							\
+do {										\
+	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl))		\
+		spin_lock_init(lock.sl);					\
+	else									\
+		local_lock_init(lock.ll);					\
+} while (0)
+
+#define pw_trylock_init(lock)							\
+do {										\
+	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl))		\
+		spin_lock_init(lock.sl);					\
+	else									\
+		local_trylock_init(lock.ll);					\
+} while (0)
+
+#define pw_lock(lock, cpu)							\
+do {										\
+	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl))		\
+		spin_lock(per_cpu_ptr(lock.sl, cpu));				\
+	else									\
+		local_lock(lock.ll);						\
+} while (0)
+
+#define pw_lock_local(lock)							\
+do {										\
+	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) {		\
+		preempt_or_migrate_disable();					\
+		spin_lock(this_cpu_ptr(lock.sl));				\
+	} else {								\
+		local_lock(lock.ll);						\
+	}									\
+} while (0)
+
+#define pw_lock_irqsave(lock, flags, cpu)					\
+do {										\
+	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl))		\
+		spin_lock_irqsave(per_cpu_ptr(lock.sl, cpu), flags);	\
+	else									\
+		local_lock_irqsave(lock.ll, flags);				\
+} while (0)
+
+#define pw_lock_local_irqsave(lock, flags)					\
+do {										\
+	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) {		\
+		preempt_or_migrate_disable();					\
+		spin_lock_irqsave(this_cpu_ptr(lock.sl), flags);		\
+	} else {								\
+		local_lock_irqsave(lock.ll, flags);				\
+	}									\
+} while (0)
+
+#define pw_trylock(lock, cpu)							\
+({										\
+	int t;									\
+	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl))		\
+		t = spin_trylock(per_cpu_ptr(lock.sl, cpu));			\
+	else									\
+		t = local_trylock(lock.ll);					\
+	t;									\
+})
+
+#define pw_trylock_local(lock)							\
+({										\
+	int t;									\
+	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) {		\
+		preempt_or_migrate_disable();					\
+		t = spin_trylock(this_cpu_ptr(lock.sl));			\
+		if (!t)								\
+			preempt_or_migrate_enable();				\
+	} else {								\
+		t = local_trylock(lock.ll);					\
+	}									\
+	t;									\
+})
+
+#define pw_trylock_irqsave(lock, flags, cpu)					\
+({										\
+	int t;									\
+	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl))		\
+		t = spin_trylock_irqsave(per_cpu_ptr(lock.sl, cpu), flags);	\
+	else									\
+		t = local_trylock_irqsave(lock.ll, flags);			\
+	t;									\
+})
+
+#define pw_unlock(lock, cpu)							\
+do {										\
+	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl))		\
+		spin_unlock(per_cpu_ptr(lock.sl, cpu));			\
+	else									\
+		local_unlock(lock.ll);					\
+} while (0)
+
+#define pw_unlock_local(lock)							\
+do {										\
+	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) {		\
+		spin_unlock(this_cpu_ptr(lock.sl));				\
+		preempt_or_migrate_enable();					\
+	} else {								\
+		local_unlock(lock.ll);						\
+	}									\
+} while (0)
+
+#define pw_unlock_irqrestore(lock, flags, cpu)					\
+do {										\
+	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl))		\
+		spin_unlock_irqrestore(per_cpu_ptr(lock.sl, cpu), flags);	\
+	else									\
+		local_unlock_irqrestore(lock.ll, flags);			\
+} while (0)
+
+#define pw_unlock_local_irqrestore(lock, flags)					\
+do {										\
+	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) {		\
+		spin_unlock_irqrestore(this_cpu_ptr(lock.sl), flags);	\
+		preempt_or_migrate_enable();					\
+	} else {								\
+		local_unlock_irqrestore(lock.ll, flags);			\
+	}									\
+} while (0)
+
+#define pw_lockdep_assert_held(lock)						\
+do {										\
+	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl))		\
+		lockdep_assert_held(this_cpu_ptr(lock.sl));			\
+	else									\
+		lockdep_assert_held(this_cpu_ptr(lock.ll));			\
+} while (0)
+
+#define pw_queue_on(c, wq, pw)							\
+do {										\
+	int __c = c;								\
+	struct pw_struct *__pw = (pw);						\
+	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) {		\
+		WARN_ON((__c) != __pw->cpu);					\
+		__pw->work.func(&__pw->work);					\
+	} else {								\
+		queue_work_on(__c, wq, &(__pw)->work);				\
+	}									\
+} while (0)
+
+/*
+ * Does nothing if PWLOCKS is set to use spinlock, as the task is already done at the
+ * time pw_queue_on() returns.
+ */
+#define pw_flush(pw)								\
+do {										\
+	struct pw_struct *__pw = (pw);						\
+	if (!static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl))		\
+		flush_work(&__pw->work);					\
+} while (0)
+
+#define pw_get_cpu(w)			container_of((w), struct pw_struct, work)->cpu
+
+#define pw_is_cpu_remote(cpu)		((cpu) != smp_processor_id())
+
+#define INIT_PW(pw, func, c)							\
+do {										\
+	struct pw_struct *__pw = (pw);						\
+	INIT_WORK(&__pw->work, (func));						\
+	__pw->cpu = (c);							\
+} while (0)
+
+#endif /* CONFIG_PWLOCKS */
+#endif /* LINUX_PWLOCKS_H */
diff --git a/kernel/pwlocks.c b/kernel/pwlocks.c
new file mode 100644
index 000000000000..1ebf5cb979b9
--- /dev/null
+++ b/kernel/pwlocks.c
@@ -0,0 +1,47 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "linux/export.h"
+#include <linux/sched.h>
+#include <linux/pwlocks.h>
+#include <linux/string.h>
+#include <linux/sched/isolation.h>
+
+DEFINE_STATIC_KEY_MAYBE(CONFIG_PWLOCKS_DEFAULT, pw_sl);
+EXPORT_SYMBOL(pw_sl);
+
+static bool pwlocks_param_specified;
+
+static int __init pwlocks_setup(char *str)
+{
+	int opt;
+
+	if (!get_option(&str, &opt)) {
+		pr_warn("PWLOCKS: invalid pwlocks parameter: %s, ignoring.\n", str);
+		return 0;
+	}
+
+	if (opt)
+		static_branch_enable(&pw_sl);
+	else
+		static_branch_disable(&pw_sl);
+
+	pwlocks_param_specified = true;
+
+	return 1;
+}
+__setup("pwlocks=", pwlocks_setup);
+
+/*
+ * Enable PWLOCKS if CPUs want to avoid kernel noise.
+ */
+static int __init pwlocks_init(void)
+{
+	if (pwlocks_param_specified)
+		return 0;
+
+	if (housekeeping_enabled(HK_TYPE_KERNEL_NOISE))
+		static_branch_enable(&pw_sl);
+
+	return 0;
+}
+
+late_initcall(pwlocks_init);
-- 
2.54.0


^ permalink raw reply related

* [PATCH v4 2/4] mm/swap: move bh draining into a separate workqueue
From: Leonardo Bras @ 2026-05-19  1:27 UTC (permalink / raw)
  To: Jonathan Corbet, Shuah Khan, Leonardo Bras, Peter Zijlstra,
	Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Jann Horn, Pedro Falcato, Brendan Jackman, Johannes Weiner,
	Zi Yan, Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Youngjun Park, Qi Zheng, Shakeel Butt,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Borislav Petkov (AMD),
	Randy Dunlap, Feng Tang, Dapeng Mi, Kees Cook, Marco Elver,
	Jakub Kicinski, Li RongQing, Eric Biggers, Paul E. McKenney,
	Nathan Chancellor, Nicolas Schier, Miguel Ojeda,
	Thomas Weißschuh, Thomas Gleixner, Douglas Anderson,
	Gary Guo, Christian Brauner, Pasha Tatashin, Coiby Xu,
	Masahiro Yamada, Frederic Weisbecker
  Cc: Marcelo Tosatti, linux-doc, linux-kernel, linux-mm,
	linux-rt-devel
In-Reply-To: <20260519012754.240804-1-leobras.c@gmail.com>

From: Marcelo Tosatti <mtosatti@redhat.com>

Separate the bh draining into a separate workqueue
(from the mm lru draining), so that its possible to switch
the mm lru draining to QPW.

To switch bh draining to QPW, it would be necessary to add
a spinlock to addition of bhs to percpu cache, and that is a
very hot path.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
---
 mm/swap.c | 52 +++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 37 insertions(+), 15 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 5cc44f0de987..ed9b3d371547 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -744,60 +744,70 @@ void lru_add_drain(void)
 	local_unlock(&cpu_fbatches.lock);
 	mlock_drain_local();
 }
 
 /*
  * It's called from per-cpu workqueue context in SMP case so
  * lru_add_drain_cpu and invalidate_bh_lrus_cpu should run on
  * the same cpu. It shouldn't be a problem in !SMP case since
  * the core is only one and the locks will disable preemption.
  */
-static void lru_add_and_bh_lrus_drain(void)
+static void lru_add_mm_drain(void)
 {
 	local_lock(&cpu_fbatches.lock);
 	lru_add_drain_cpu(smp_processor_id());
 	local_unlock(&cpu_fbatches.lock);
-	invalidate_bh_lrus_cpu();
 	mlock_drain_local();
 }
 
 void lru_add_drain_cpu_zone(struct zone *zone)
 {
 	local_lock(&cpu_fbatches.lock);
 	lru_add_drain_cpu(smp_processor_id());
 	drain_local_pages(zone);
 	local_unlock(&cpu_fbatches.lock);
 	mlock_drain_local();
 }
 
 #ifdef CONFIG_SMP
 
 static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);
 
 static void lru_add_drain_per_cpu(struct work_struct *dummy)
 {
-	lru_add_and_bh_lrus_drain();
+	lru_add_mm_drain();
 }
 
-static bool cpu_needs_drain(unsigned int cpu)
+static DEFINE_PER_CPU(struct work_struct, bh_add_drain_work);
+
+static void bh_add_drain_per_cpu(struct work_struct *dummy)
+{
+	invalidate_bh_lrus_cpu();
+}
+
+static bool cpu_needs_mm_drain(unsigned int cpu)
 {
 	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
 
 	/* Check these in order of likelihood that they're not zero */
 	return folio_batch_count(&fbatches->lru_add) ||
 		folio_batch_count(&fbatches->lru_move_tail) ||
 		folio_batch_count(&fbatches->lru_deactivate_file) ||
 		folio_batch_count(&fbatches->lru_deactivate) ||
 		folio_batch_count(&fbatches->lru_lazyfree) ||
 		folio_batch_count(&fbatches->lru_activate) ||
-		need_mlock_drain(cpu) ||
-		has_bh_in_lru(cpu, NULL);
+		need_mlock_drain(cpu);
+}
+
+static bool cpu_needs_bh_drain(unsigned int cpu)
+{
+	return has_bh_in_lru(cpu, NULL);
 }
 
 /*
  * Doesn't need any cpu hotplug locking because we do rely on per-cpu
  * kworkers being shut down before our page_alloc_cpu_dead callback is
  * executed on the offlined cpu.
  * Calling this function with cpu hotplug locks held can actually lead
  * to obscure indirect dependencies via WQ context.
  */
 static inline void __lru_add_drain_all(bool force_all_cpus)
@@ -806,21 +816,21 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
 	 * lru_drain_gen - Global pages generation number
 	 *
 	 * (A) Definition: global lru_drain_gen = x implies that all generations
 	 *     0 < n <= x are already *scheduled* for draining.
 	 *
 	 * This is an optimization for the highly-contended use case where a
 	 * user space workload keeps constantly generating a flow of pages for
 	 * each CPU.
 	 */
 	static unsigned int lru_drain_gen;
-	static struct cpumask has_work;
+	static struct cpumask has_mm_work, has_bh_work;
 	static DEFINE_MUTEX(lock);
 	unsigned cpu, this_gen;
 
 	/*
 	 * Make sure nobody triggers this path before mm_percpu_wq is fully
 	 * initialized.
 	 */
 	if (WARN_ON(!mm_percpu_wq))
 		return;
 
@@ -869,34 +879,45 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
 	 * along, adds some pages to its per-cpu vectors, then calls
 	 * lru_add_drain_all().
 	 *
 	 * If the paired barrier is done at any later step, e.g. after the
 	 * loop, CPU #x will just exit at (C) and miss flushing out all of its
 	 * added pages.
 	 */
 	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
 	smp_mb();
 
-	cpumask_clear(&has_work);
+	cpumask_clear(&has_mm_work);
+	cpumask_clear(&has_bh_work);
 	for_each_online_cpu(cpu) {
-		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
+		struct work_struct *mm_work = &per_cpu(lru_add_drain_work, cpu);
+		struct work_struct *bh_work = &per_cpu(bh_add_drain_work, cpu);
 
-		if (cpu_needs_drain(cpu)) {
-			INIT_WORK(work, lru_add_drain_per_cpu);
-			queue_work_on(cpu, mm_percpu_wq, work);
-			__cpumask_set_cpu(cpu, &has_work);
+		if (cpu_needs_mm_drain(cpu)) {
+			INIT_WORK(mm_work, lru_add_drain_per_cpu);
+			queue_work_on(cpu, mm_percpu_wq, mm_work);
+			__cpumask_set_cpu(cpu, &has_mm_work);
+		}
+
+		if (cpu_needs_bh_drain(cpu)) {
+			INIT_WORK(bh_work, bh_add_drain_per_cpu);
+			queue_work_on(cpu, mm_percpu_wq, bh_work);
+			__cpumask_set_cpu(cpu, &has_bh_work);
 		}
 	}
 
-	for_each_cpu(cpu, &has_work)
+	for_each_cpu(cpu, &has_mm_work)
 		flush_work(&per_cpu(lru_add_drain_work, cpu));
 
+	for_each_cpu(cpu, &has_bh_work)
+		flush_work(&per_cpu(bh_add_drain_work, cpu));
+
 done:
 	mutex_unlock(&lock);
 }
 
 void lru_add_drain_all(void)
 {
 	__lru_add_drain_all(false);
 }
 #else
 void lru_add_drain_all(void)
@@ -928,21 +949,22 @@ void lru_cache_disable(void)
 	 *
 	 * Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
 	 * preempt_disable() regions of code. So any CPU which sees
 	 * lru_disable_count = 0 will have exited the critical
 	 * section when synchronize_rcu() returns.
 	 */
 	synchronize_rcu_expedited();
 #ifdef CONFIG_SMP
 	__lru_add_drain_all(true);
 #else
-	lru_add_and_bh_lrus_drain();
+	lru_add_mm_drain();
+	invalidate_bh_lrus_cpu();
 #endif
 }
 
 /**
  * folios_put_refs - Reduce the reference count on a batch of folios.
  * @folios: The folios.
  * @refs: The number of refs to subtract from each folio.
  *
  * Like folio_put(), but for a batch of folios.  This is more efficient
  * than writing the loop yourself as it will optimise the locks which need
-- 
2.54.0


^ permalink raw reply related

* [PATCH v4 3/4] swap: apply new pw_queue_on() interface
From: Leonardo Bras @ 2026-05-19  1:27 UTC (permalink / raw)
  To: Jonathan Corbet, Shuah Khan, Leonardo Bras, Peter Zijlstra,
	Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Jann Horn, Pedro Falcato, Brendan Jackman, Johannes Weiner,
	Zi Yan, Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Youngjun Park, Qi Zheng, Shakeel Butt,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Borislav Petkov (AMD),
	Randy Dunlap, Feng Tang, Dapeng Mi, Kees Cook, Marco Elver,
	Jakub Kicinski, Li RongQing, Eric Biggers, Paul E. McKenney,
	Nathan Chancellor, Nicolas Schier, Miguel Ojeda,
	Thomas Weißschuh, Thomas Gleixner, Douglas Anderson,
	Gary Guo, Christian Brauner, Pasha Tatashin, Coiby Xu,
	Masahiro Yamada, Frederic Weisbecker
  Cc: linux-doc, linux-kernel, linux-mm, linux-rt-devel,
	Marcelo Tosatti
In-Reply-To: <20260519012754.240804-1-leobras.c@gmail.com>

Make use of the new pw_{un,}lock*() and pw_queue_on() interface to improve
performance & latency.

For functions that may be scheduled in a different cpu, replace
local_{un,}lock*() by pw_{un,}lock*(), and replace schedule_work_on() by
pw_queue_on(). The same happens for flush_work() and pw_flush().

The change requires allocation of pw_structs instead of a work_structs,
and changing parameters of a few functions to include the cpu parameter.

This should bring no relevant performance impact on non-PWLOCKS kernels:
For functions that may be scheduled in a different cpu, the local_*lock's
this_cpu_ptr() becomes a per_cpu_ptr(smp_processor_id()).

Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
 mm/internal.h   |  4 ++-
 mm/mlock.c      | 51 ++++++++++++++++++++++++++----------
 mm/page_alloc.c |  2 +-
 mm/swap.c       | 69 ++++++++++++++++++++++++++-----------------------
 4 files changed, 79 insertions(+), 47 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 5a2ddcf68e0b..1ec9a11c373b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1209,24 +1209,26 @@ static inline void munlock_vma_folio(struct folio *folio,
 	 * cause folio not fully mapped to VMA.
 	 *
 	 * But it's not easy to confirm that's the situation. So we
 	 * always munlock the folio and page reclaim will correct it
 	 * if it's wrong.
 	 */
 	if (unlikely(vma->vm_flags & VM_LOCKED))
 		munlock_folio(folio);
 }
 
+int __init mlock_init(void);
 void mlock_new_folio(struct folio *folio);
 bool need_mlock_drain(int cpu);
 void mlock_drain_local(void);
-void mlock_drain_remote(int cpu);
+void mlock_drain_cpu(int cpu);
+void mlock_drain_offline(int cpu);
 
 extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 
 /**
  * vma_address - Find the virtual address a page range is mapped at
  * @vma: The vma which maps this object.
  * @pgoff: The page offset within its object.
  * @nr_pages: The number of pages to consider.
  *
  * If any page in this range is mapped by this VMA, return the first address
diff --git a/mm/mlock.c b/mm/mlock.c
index 8c227fefa2df..5d25bbbb09e9 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -18,31 +18,30 @@
 #include <linux/mempolicy.h>
 #include <linux/syscalls.h>
 #include <linux/sched.h>
 #include <linux/export.h>
 #include <linux/rmap.h>
 #include <linux/mmzone.h>
 #include <linux/hugetlb.h>
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h>
 #include <linux/secretmem.h>
+#include <linux/pwlocks.h>
 
 #include "internal.h"
 
 struct mlock_fbatch {
-	local_lock_t lock;
+	pw_lock_t lock;
 	struct folio_batch fbatch;
 };
 
-static DEFINE_PER_CPU(struct mlock_fbatch, mlock_fbatch) = {
-	.lock = INIT_LOCAL_LOCK(lock),
-};
+static DEFINE_PER_CPU(struct mlock_fbatch, mlock_fbatch);
 
 bool can_do_mlock(void)
 {
 	if (rlimit(RLIMIT_MEMLOCK) != 0)
 		return true;
 	if (capable(CAP_IPC_LOCK))
 		return true;
 	return false;
 }
 EXPORT_SYMBOL(can_do_mlock);
@@ -202,32 +201,43 @@ static void mlock_folio_batch(struct folio_batch *fbatch)
 			lruvec = __mlock_new_folio(folio, lruvec);
 		else
 			lruvec = __munlock_folio(folio, lruvec);
 	}
 
 	if (lruvec)
 		lruvec_unlock_irq(lruvec);
 	folios_put(fbatch);
 }
 
+void mlock_drain_cpu(int cpu)
+{
+	struct folio_batch *fbatch;
+
+	pw_lock(&mlock_fbatch.lock, cpu);
+	fbatch = per_cpu_ptr(&mlock_fbatch.fbatch, cpu);
+	if (folio_batch_count(fbatch))
+		mlock_folio_batch(fbatch);
+	pw_unlock(&mlock_fbatch.lock, cpu);
+}
+
 void mlock_drain_local(void)
 {
 	struct folio_batch *fbatch;
 
-	local_lock(&mlock_fbatch.lock);
+	pw_lock_local(&mlock_fbatch.lock);
 	fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
 	if (folio_batch_count(fbatch))
 		mlock_folio_batch(fbatch);
-	local_unlock(&mlock_fbatch.lock);
+	pw_unlock_local(&mlock_fbatch.lock);
 }
 
-void mlock_drain_remote(int cpu)
+void mlock_drain_offline(int cpu)
 {
 	struct folio_batch *fbatch;
 
 	WARN_ON_ONCE(cpu_online(cpu));
 	fbatch = &per_cpu(mlock_fbatch.fbatch, cpu);
 	if (folio_batch_count(fbatch))
 		mlock_folio_batch(fbatch);
 }
 
 bool need_mlock_drain(int cpu)
@@ -236,79 +246,79 @@ bool need_mlock_drain(int cpu)
 }
 
 /**
  * mlock_folio - mlock a folio already on (or temporarily off) LRU
  * @folio: folio to be mlocked.
  */
 void mlock_folio(struct folio *folio)
 {
 	struct folio_batch *fbatch;
 
-	local_lock(&mlock_fbatch.lock);
+	pw_lock_local(&mlock_fbatch.lock);
 	fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
 
 	if (!folio_test_set_mlocked(folio)) {
 		int nr_pages = folio_nr_pages(folio);
 
 		zone_stat_mod_folio(folio, NR_MLOCK, nr_pages);
 		__count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
 	}
 
 	folio_get(folio);
 	if (!folio_batch_add(fbatch, mlock_lru(folio)) ||
 	    !folio_may_be_lru_cached(folio) || lru_cache_disabled())
 		mlock_folio_batch(fbatch);
-	local_unlock(&mlock_fbatch.lock);
+	pw_unlock_local(&mlock_fbatch.lock);
 }
 
 /**
  * mlock_new_folio - mlock a newly allocated folio not yet on LRU
  * @folio: folio to be mlocked, either normal or a THP head.
  */
 void mlock_new_folio(struct folio *folio)
 {
 	struct folio_batch *fbatch;
 	int nr_pages = folio_nr_pages(folio);
 
-	local_lock(&mlock_fbatch.lock);
+	pw_lock_local(&mlock_fbatch.lock);
 	fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
 	folio_set_mlocked(folio);
 
 	zone_stat_mod_folio(folio, NR_MLOCK, nr_pages);
 	__count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
 
 	folio_get(folio);
 	if (!folio_batch_add(fbatch, mlock_new(folio)) ||
 	    !folio_may_be_lru_cached(folio) || lru_cache_disabled())
 		mlock_folio_batch(fbatch);
-	local_unlock(&mlock_fbatch.lock);
+	pw_unlock_local(&mlock_fbatch.lock);
 }
 
 /**
  * munlock_folio - munlock a folio
  * @folio: folio to be munlocked, either normal or a THP head.
  */
 void munlock_folio(struct folio *folio)
 {
 	struct folio_batch *fbatch;
 
-	local_lock(&mlock_fbatch.lock);
+	pw_lock_local(&mlock_fbatch.lock);
 	fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
 	/*
 	 * folio_test_clear_mlocked(folio) must be left to __munlock_folio(),
 	 * which will check whether the folio is multiply mlocked.
 	 */
 	folio_get(folio);
 	if (!folio_batch_add(fbatch, folio) ||
 	    !folio_may_be_lru_cached(folio) || lru_cache_disabled())
 		mlock_folio_batch(fbatch);
-	local_unlock(&mlock_fbatch.lock);
+	pw_unlock_local(&mlock_fbatch.lock);
 }
 
 static inline unsigned int folio_mlock_step(struct folio *folio,
 		pte_t *pte, unsigned long addr, unsigned long end)
 {
 	unsigned int count = (end - addr) >> PAGE_SHIFT;
 	pte_t ptent = ptep_get(pte);
 
 	if (!folio_test_large(folio))
 		return 1;
@@ -822,10 +832,25 @@ int user_shm_lock(size_t size, struct ucounts *ucounts)
 	return allowed;
 }
 
 void user_shm_unlock(size_t size, struct ucounts *ucounts)
 {
 	spin_lock(&shmlock_user_lock);
 	dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, (size + PAGE_SIZE - 1) >> PAGE_SHIFT);
 	spin_unlock(&shmlock_user_lock);
 	put_ucounts(ucounts);
 }
+
+int __init mlock_init(void)
+{
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct mlock_fbatch *fbatch = &per_cpu(mlock_fbatch, cpu);
+
+		pw_lock_init(&fbatch->lock);
+	}
+
+	return 0;
+}
+
+module_init(mlock_init);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 227d58dc3de6..fa768f07f88a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6217,21 +6217,21 @@ void free_reserved_page(struct page *page)
 	__free_page(page);
 	adjust_managed_page_count(page, 1);
 }
 EXPORT_SYMBOL(free_reserved_page);
 
 static int page_alloc_cpu_dead(unsigned int cpu)
 {
 	struct zone *zone;
 
 	lru_add_drain_cpu(cpu);
-	mlock_drain_remote(cpu);
+	mlock_drain_offline(cpu);
 	drain_pages(cpu);
 
 	/*
 	 * Spill the event counters of the dead processor
 	 * into the current processors event counters.
 	 * This artificially elevates the count of the current
 	 * processor.
 	 */
 	vm_events_fold_cpu(cpu);
 
diff --git a/mm/swap.c b/mm/swap.c
index ed9b3d371547..42f51bf4bb71 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -28,54 +28,51 @@
 #include <linux/memremap.h>
 #include <linux/percpu.h>
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/backing-dev.h>
 #include <linux/memcontrol.h>
 #include <linux/gfp.h>
 #include <linux/uio.h>
 #include <linux/hugetlb.h>
 #include <linux/page_idle.h>
-#include <linux/local_lock.h>
+#include <linux/pwlocks.h>
 #include <linux/buffer_head.h>
 
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/pagemap.h>
 
 /* How many pages do we try to swap or page in/out together? As a power of 2 */
 int page_cluster;
 static const int page_cluster_max = 31;
 
 struct cpu_fbatches {
 	/*
 	 * The following folio batches are grouped together because they are protected
 	 * by disabling preemption (and interrupts remain enabled).
 	 */
-	local_lock_t lock;
+	pw_lock_t lock;
 	struct folio_batch lru_add;
 	struct folio_batch lru_deactivate_file;
 	struct folio_batch lru_deactivate;
 	struct folio_batch lru_lazyfree;
 #ifdef CONFIG_SMP
 	struct folio_batch lru_activate;
 #endif
 	/* Protecting the following batches which require disabling interrupts */
-	local_lock_t lock_irq;
+	pw_lock_t lock_irq;
 	struct folio_batch lru_move_tail;
 };
 
-static DEFINE_PER_CPU(struct cpu_fbatches, cpu_fbatches) = {
-	.lock = INIT_LOCAL_LOCK(lock),
-	.lock_irq = INIT_LOCAL_LOCK(lock_irq),
-};
+static DEFINE_PER_CPU(struct cpu_fbatches, cpu_fbatches);
 
 static void __page_cache_release(struct folio *folio, struct lruvec **lruvecp,
 		unsigned long *flagsp)
 {
 	if (folio_test_lru(folio)) {
 		folio_lruvec_relock_irqsave(folio, lruvecp, flagsp);
 		lruvec_del_folio(*lruvecp, folio);
 		__folio_clear_lru_flags(folio);
 	}
 }
@@ -180,32 +177,32 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
 }
 
 static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
 		struct folio *folio, move_fn_t move_fn, bool disable_irq)
 {
 	unsigned long flags;
 
 	folio_get(folio);
 
 	if (disable_irq)
-		local_lock_irqsave(&cpu_fbatches.lock_irq, flags);
+		pw_lock_local_irqsave(&cpu_fbatches.lock_irq, flags);
 	else
-		local_lock(&cpu_fbatches.lock);
+		pw_lock_local(&cpu_fbatches.lock);
 
 	if (!folio_batch_add(this_cpu_ptr(fbatch), folio) ||
 			!folio_may_be_lru_cached(folio) || lru_cache_disabled())
 		folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn);
 
 	if (disable_irq)
-		local_unlock_irqrestore(&cpu_fbatches.lock_irq, flags);
+		pw_unlock_local_irqrestore(&cpu_fbatches.lock_irq, flags);
 	else
-		local_unlock(&cpu_fbatches.lock);
+		pw_unlock_local(&cpu_fbatches.lock);
 }
 
 #define folio_batch_add_and_move(folio, op)		\
 	__folio_batch_add_and_move(			\
 		&cpu_fbatches.op,			\
 		folio,					\
 		op,					\
 		offsetof(struct cpu_fbatches, op) >=	\
 		offsetof(struct cpu_fbatches, lock_irq)	\
 	)
@@ -356,21 +353,21 @@ void folio_activate(struct folio *folio)
 	lruvec_unlock_irq(lruvec);
 	folio_set_lru(folio);
 }
 #endif
 
 static void __lru_cache_activate_folio(struct folio *folio)
 {
 	struct folio_batch *fbatch;
 	int i;
 
-	local_lock(&cpu_fbatches.lock);
+	pw_lock_local(&cpu_fbatches.lock);
 	fbatch = this_cpu_ptr(&cpu_fbatches.lru_add);
 
 	/*
 	 * Search backwards on the optimistic assumption that the folio being
 	 * activated has just been added to this batch. Note that only
 	 * the local batch is examined as a !LRU folio could be in the
 	 * process of being released, reclaimed, migrated or on a remote
 	 * batch that is currently being drained. Furthermore, marking
 	 * a remote batch's folio active potentially hits a race where
 	 * a folio is marked active just after it is added to the inactive
@@ -378,21 +375,21 @@ static void __lru_cache_activate_folio(struct folio *folio)
 	 */
 	for (i = folio_batch_count(fbatch) - 1; i >= 0; i--) {
 		struct folio *batch_folio = fbatch->folios[i];
 
 		if (batch_folio == folio) {
 			folio_set_active(folio);
 			break;
 		}
 	}
 
-	local_unlock(&cpu_fbatches.lock);
+	pw_unlock_local(&cpu_fbatches.lock);
 }
 
 #ifdef CONFIG_LRU_GEN
 
 static void lru_gen_inc_refs(struct folio *folio)
 {
 	unsigned long new_flags, old_flags = READ_ONCE(folio->flags.f);
 
 	if (folio_test_unevictable(folio))
 		return;
@@ -652,23 +649,23 @@ void lru_add_drain_cpu(int cpu)
 
 	if (folio_batch_count(fbatch))
 		folio_batch_move_lru(fbatch, lru_add);
 
 	fbatch = &fbatches->lru_move_tail;
 	/* Disabling interrupts below acts as a compiler barrier. */
 	if (data_race(folio_batch_count(fbatch))) {
 		unsigned long flags;
 
 		/* No harm done if a racing interrupt already did this */
-		local_lock_irqsave(&cpu_fbatches.lock_irq, flags);
+		pw_lock_irqsave(&cpu_fbatches.lock_irq, flags, cpu);
 		folio_batch_move_lru(fbatch, lru_move_tail);
-		local_unlock_irqrestore(&cpu_fbatches.lock_irq, flags);
+		pw_unlock_irqrestore(&cpu_fbatches.lock_irq, flags, cpu);
 	}
 
 	fbatch = &fbatches->lru_deactivate_file;
 	if (folio_batch_count(fbatch))
 		folio_batch_move_lru(fbatch, lru_deactivate_file);
 
 	fbatch = &fbatches->lru_deactivate;
 	if (folio_batch_count(fbatch))
 		folio_batch_move_lru(fbatch, lru_deactivate);
 
@@ -732,56 +729,56 @@ void folio_mark_lazyfree(struct folio *folio)
 	if (!folio_test_anon(folio) || !folio_test_swapbacked(folio) ||
 	    !folio_test_lru(folio) ||
 	    folio_test_swapcache(folio) || folio_test_unevictable(folio))
 		return;
 
 	folio_batch_add_and_move(folio, lru_lazyfree);
 }
 
 void lru_add_drain(void)
 {
-	local_lock(&cpu_fbatches.lock);
+	pw_lock_local(&cpu_fbatches.lock);
 	lru_add_drain_cpu(smp_processor_id());
-	local_unlock(&cpu_fbatches.lock);
+	pw_unlock_local(&cpu_fbatches.lock);
 	mlock_drain_local();
 }
 
 /*
  * It's called from per-cpu workqueue context in SMP case so
  * lru_add_drain_cpu and invalidate_bh_lrus_cpu should run on
  * the same cpu. It shouldn't be a problem in !SMP case since
  * the core is only one and the locks will disable preemption.
  */
-static void lru_add_mm_drain(void)
+static void lru_add_mm_drain(int cpu)
 {
-	local_lock(&cpu_fbatches.lock);
-	lru_add_drain_cpu(smp_processor_id());
-	local_unlock(&cpu_fbatches.lock);
-	mlock_drain_local();
+	pw_lock(&cpu_fbatches.lock, cpu);
+	lru_add_drain_cpu(cpu);
+	pw_unlock(&cpu_fbatches.lock, cpu);
+	mlock_drain_cpu(cpu);
 }
 
 void lru_add_drain_cpu_zone(struct zone *zone)
 {
-	local_lock(&cpu_fbatches.lock);
+	pw_lock_local(&cpu_fbatches.lock);
 	lru_add_drain_cpu(smp_processor_id());
 	drain_local_pages(zone);
-	local_unlock(&cpu_fbatches.lock);
+	pw_unlock_local(&cpu_fbatches.lock);
 	mlock_drain_local();
 }
 
 #ifdef CONFIG_SMP
 
-static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);
+static DEFINE_PER_CPU(struct pw_struct, lru_add_drain_pw);
 
-static void lru_add_drain_per_cpu(struct work_struct *dummy)
+static void lru_add_drain_per_cpu(struct work_struct *w)
 {
-	lru_add_mm_drain();
+	lru_add_mm_drain(pw_get_cpu(w));
 }
 
 static DEFINE_PER_CPU(struct work_struct, bh_add_drain_work);
 
 static void bh_add_drain_per_cpu(struct work_struct *dummy)
 {
 	invalidate_bh_lrus_cpu();
 }
 
 static bool cpu_needs_mm_drain(unsigned int cpu)
@@ -882,38 +879,38 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
 	 * If the paired barrier is done at any later step, e.g. after the
 	 * loop, CPU #x will just exit at (C) and miss flushing out all of its
 	 * added pages.
 	 */
 	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
 	smp_mb();
 
 	cpumask_clear(&has_mm_work);
 	cpumask_clear(&has_bh_work);
 	for_each_online_cpu(cpu) {
-		struct work_struct *mm_work = &per_cpu(lru_add_drain_work, cpu);
+		struct pw_struct *mm_pw = &per_cpu(lru_add_drain_pw, cpu);
 		struct work_struct *bh_work = &per_cpu(bh_add_drain_work, cpu);
 
 		if (cpu_needs_mm_drain(cpu)) {
-			INIT_WORK(mm_work, lru_add_drain_per_cpu);
-			queue_work_on(cpu, mm_percpu_wq, mm_work);
+			INIT_PW(mm_pw, lru_add_drain_per_cpu, cpu);
+			pw_queue_on(cpu, mm_percpu_wq, mm_pw);
 			__cpumask_set_cpu(cpu, &has_mm_work);
 		}
 
 		if (cpu_needs_bh_drain(cpu)) {
 			INIT_WORK(bh_work, bh_add_drain_per_cpu);
 			queue_work_on(cpu, mm_percpu_wq, bh_work);
 			__cpumask_set_cpu(cpu, &has_bh_work);
 		}
 	}
 
 	for_each_cpu(cpu, &has_mm_work)
-		flush_work(&per_cpu(lru_add_drain_work, cpu));
+		pw_flush(&per_cpu(lru_add_drain_pw, cpu));
 
 	for_each_cpu(cpu, &has_bh_work)
 		flush_work(&per_cpu(bh_add_drain_work, cpu));
 
 done:
 	mutex_unlock(&lock);
 }
 
 void lru_add_drain_all(void)
 {
@@ -949,21 +946,21 @@ void lru_cache_disable(void)
 	 *
 	 * Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
 	 * preempt_disable() regions of code. So any CPU which sees
 	 * lru_disable_count = 0 will have exited the critical
 	 * section when synchronize_rcu() returns.
 	 */
 	synchronize_rcu_expedited();
 #ifdef CONFIG_SMP
 	__lru_add_drain_all(true);
 #else
-	lru_add_mm_drain();
+	lru_add_mm_drain(smp_processor_id());
 	invalidate_bh_lrus_cpu();
 #endif
 }
 
 /**
  * folios_put_refs - Reduce the reference count on a batch of folios.
  * @folios: The folios.
  * @refs: The number of refs to subtract from each folio.
  *
  * Like folio_put(), but for a batch of folios.  This is more efficient
@@ -1156,23 +1153,31 @@ static const struct ctl_table swap_sysctl_table[] = {
 		.extra2		= (void *)&page_cluster_max,
 	}
 };
 
 /*
  * Perform any setup for the swap system
  */
 void __init swap_setup(void)
 {
 	unsigned long megs = PAGES_TO_MB(totalram_pages());
+	unsigned int cpu;
 
 	/* Use a smaller cluster for small-memory machines */
 	if (megs < 16)
 		page_cluster = 2;
 	else
 		page_cluster = 3;
 	/*
 	 * Right now other parts of the system means that we
 	 * _really_ don't want to cluster much more
 	 */
 
 	register_sysctl_init("vm", swap_sysctl_table);
+
+	for_each_possible_cpu(cpu) {
+		struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
+
+		pw_lock_init(&fbatches->lock);
+		pw_lock_init(&fbatches->lock_irq);
+	}
 }
-- 
2.54.0


^ permalink raw reply related

* [PATCH v4 4/4] slub: apply new pw_queue_on() interface
From: Leonardo Bras @ 2026-05-19  1:27 UTC (permalink / raw)
  To: Jonathan Corbet, Shuah Khan, Leonardo Bras, Peter Zijlstra,
	Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Jann Horn, Pedro Falcato, Brendan Jackman, Johannes Weiner,
	Zi Yan, Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Youngjun Park, Qi Zheng, Shakeel Butt,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Borislav Petkov (AMD),
	Randy Dunlap, Feng Tang, Dapeng Mi, Kees Cook, Marco Elver,
	Jakub Kicinski, Li RongQing, Eric Biggers, Paul E. McKenney,
	Nathan Chancellor, Nicolas Schier, Miguel Ojeda,
	Thomas Weißschuh, Thomas Gleixner, Douglas Anderson,
	Gary Guo, Christian Brauner, Pasha Tatashin, Coiby Xu,
	Masahiro Yamada, Frederic Weisbecker
  Cc: linux-doc, linux-kernel, linux-mm, linux-rt-devel,
	Marcelo Tosatti
In-Reply-To: <20260519012754.240804-1-leobras.c@gmail.com>

Make use of the new pw_{un,}lock*() and pw_queue_on() interface to improve
performance & latency.

For functions that may be scheduled in a different cpu, replace
local_{un,}lock*() by pw_{un,}lock*(), and replace schedule_work_on() by
pw_queue_on(). The same happens for flush_work() and pw_flush().

This change requires allocation of pw_structs instead of a work_structs,
and changing parameters of a few functions to include the cpu parameter.

This should bring no relevant performance impact on non-PWLOCKS kernels:
For functions that may be scheduled in a different cpu, the local_*lock's
this_cpu_ptr() becomes a per_cpu_ptr(smp_processor_id()).

Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
 mm/slub.c | 142 +++++++++++++++++++++++++++---------------------------
 1 file changed, 72 insertions(+), 70 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 8f9004536729..a154d20e78f7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -43,20 +43,21 @@
 #include <linux/prefetch.h>
 #include <linux/memcontrol.h>
 #include <linux/random.h>
 #include <linux/prandom.h>
 #include <kunit/test.h>
 #include <kunit/test-bug.h>
 #include <linux/sort.h>
 #include <linux/irq_work.h>
 #include <linux/kprobes.h>
 #include <linux/debugfs.h>
+#include <linux/pwlocks.h>
 #include <trace/events/kmem.h>
 
 #include "internal.h"
 
 /*
  * Lock order:
  *   0.  cpu_hotplug_lock
  *   1.  slab_mutex (Global Mutex)
  *   2a. kmem_cache->cpu_sheaves->lock (Local trylock)
  *   2b. barn->lock (Spinlock)
@@ -122,21 +123,21 @@
  *   (Note that the total number of slabs is an atomic value that may be
  *   modified without taking the list lock).
  *
  *   The list_lock is a centralized lock and thus we avoid taking it as
  *   much as possible. As long as SLUB does not have to handle partial
  *   slabs, operations can continue without any centralized lock.
  *
  *   For debug caches, all allocations are forced to go through a list_lock
  *   protected region to serialize against concurrent validation.
  *
- *   cpu_sheaves->lock (local_trylock)
+ *   cpu_sheaves->lock (pw_trylock)
  *
  *   This lock protects fastpath operations on the percpu sheaves. On !RT it
  *   only disables preemption and does no atomic operations. As long as the main
  *   or spare sheaf can handle the allocation or free, there is no other
  *   overhead.
  *
  *   barn->lock (spinlock)
  *
  *   This lock protects the operations on per-NUMA-node barn. It can quickly
  *   serve an empty or full sheaf if available, and avoid more expensive refill
@@ -150,21 +151,21 @@
  *   cmpxchg_double this is done by a lockless update of slab's freelist and
  *   counters, otherwise slab_lock is taken. This only needs to take the
  *   list_lock if it's a first free to a full slab, or when a slab becomes empty
  *   after the free.
  *
  *   irq, preemption, migration considerations
  *
  *   Interrupts are disabled as part of list_lock or barn lock operations, or
  *   around the slab_lock operation, in order to make the slab allocator safe
  *   to use in the context of an irq.
- *   Preemption is disabled as part of local_trylock operations.
+ *   Preemption is disabled as part of pw_trylock operations.
  *   kmalloc_nolock() and kfree_nolock() are safe in NMI context but see
  *   their limitations.
  *
  * SLUB assigns two object arrays called sheaves for caching allocations and
  * frees on each cpu, with a NUMA node shared barn for balancing between cpus.
  * Allocations and frees are primarily served from these sheaves.
  *
  * Slabs with free elements are kept on a partial list and during regular
  * operations no list for full slabs is used. If an object in a full slab is
  * freed then the slab will show up again on the partial lists.
@@ -411,21 +412,21 @@ struct slab_sheaf {
 			bool pfmemalloc;
 		};
 	};
 	struct kmem_cache *cache;
 	unsigned int size;
 	int node; /* only used for rcu_sheaf */
 	void *objects[];
 };
 
 struct slub_percpu_sheaves {
-	local_trylock_t lock;
+	pw_trylock_t lock;
 	struct slab_sheaf *main; /* never NULL when unlocked */
 	struct slab_sheaf *spare; /* empty or full, may be NULL */
 	struct slab_sheaf *rcu_free; /* for batching kfree_rcu() */
 };
 
 /*
  * The slab lists for all objects.
  */
 struct kmem_cache_node {
 	spinlock_t list_lock;
@@ -477,21 +478,21 @@ static nodemask_t slab_nodes;
  * Corresponds to N_ONLINE nodes.
  */
 static nodemask_t slab_barn_nodes;
 
 /*
  * Workqueue used for flushing cpu and kfree_rcu sheaves.
  */
 static struct workqueue_struct *flushwq;
 
 struct slub_flush_work {
-	struct work_struct work;
+	struct pw_struct pw;
 	struct kmem_cache *s;
 	bool skip;
 };
 
 static DEFINE_MUTEX(flush_lock);
 static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
 
 /********************************************************************
  * 			Core slab cache functions
  *******************************************************************/
@@ -2838,74 +2839,74 @@ static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
  * Free all objects from the main sheaf. In order to perform
  * __kmem_cache_free_bulk() outside of cpu_sheaves->lock, work in batches where
  * object pointers are moved to a on-stack array under the lock. To bound the
  * stack usage, limit each batch to PCS_BATCH_MAX.
  *
  * Must be called with s->cpu_sheaves->lock locked, returns with the lock
  * unlocked.
  *
  * Returns how many objects are remaining to be flushed
  */
-static unsigned int __sheaf_flush_main_batch(struct kmem_cache *s)
+static unsigned int __sheaf_flush_main_batch(struct kmem_cache *s, int cpu)
 {
 	struct slub_percpu_sheaves *pcs;
 	unsigned int batch, remaining;
 	void *objects[PCS_BATCH_MAX];
 	struct slab_sheaf *sheaf;
 
-	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
-
-	pcs = this_cpu_ptr(s->cpu_sheaves);
+	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
 	sheaf = pcs->main;
 
 	batch = min(PCS_BATCH_MAX, sheaf->size);
 
 	sheaf->size -= batch;
 	memcpy(objects, sheaf->objects + sheaf->size, batch * sizeof(void *));
 
 	remaining = sheaf->size;
 
-	local_unlock(&s->cpu_sheaves->lock);
+	pw_unlock(&s->cpu_sheaves->lock, cpu);
 
 	__kmem_cache_free_bulk(s, batch, &objects[0]);
 
 	stat_add(s, SHEAF_FLUSH, batch);
 
 	return remaining;
 }
 
-static void sheaf_flush_main(struct kmem_cache *s)
+static void sheaf_flush_main(struct kmem_cache *s, int cpu)
 {
 	unsigned int remaining;
 
 	do {
-		local_lock(&s->cpu_sheaves->lock);
+		pw_lock(&s->cpu_sheaves->lock, cpu);
 
-		remaining = __sheaf_flush_main_batch(s);
+		remaining = __sheaf_flush_main_batch(s, cpu);
 
 	} while (remaining);
 }
 
 /*
  * Returns true if the main sheaf was at least partially flushed.
  */
 static bool sheaf_try_flush_main(struct kmem_cache *s)
 {
 	unsigned int remaining;
 	bool ret = false;
 
 	do {
-		if (!local_trylock(&s->cpu_sheaves->lock))
+		if (!pw_trylock_local(&s->cpu_sheaves->lock))
 			return ret;
 
 		ret = true;
-		remaining = __sheaf_flush_main_batch(s);
+
+		pw_lockdep_assert_held(&s->cpu_sheaves->lock);
+		remaining = __sheaf_flush_main_batch(s, smp_processor_id());
 
 	} while (remaining);
 
 	return ret;
 }
 
 /*
  * Free all objects from a sheaf that's unused, i.e. not linked to any
  * cpu_sheaves, so we need no locking and batching. The locking is also not
  * necessary when flushing cpu's sheaves (both spare and main) during cpu
@@ -2968,45 +2969,45 @@ static void rcu_free_sheaf_nobarn(struct rcu_head *head)
 
 /*
  * Caller needs to make sure migration is disabled in order to fully flush
  * single cpu's sheaves
  *
  * must not be called from an irq
  *
  * flushing operations are rare so let's keep it simple and flush to slabs
  * directly, skipping the barn
  */
-static void pcs_flush_all(struct kmem_cache *s)
+static void pcs_flush_all(struct kmem_cache *s, int cpu)
 {
 	struct slub_percpu_sheaves *pcs;
 	struct slab_sheaf *spare, *rcu_free;
 
-	local_lock(&s->cpu_sheaves->lock);
-	pcs = this_cpu_ptr(s->cpu_sheaves);
+	pw_lock(&s->cpu_sheaves->lock, cpu);
+	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
 
 	spare = pcs->spare;
 	pcs->spare = NULL;
 
 	rcu_free = pcs->rcu_free;
 	pcs->rcu_free = NULL;
 
-	local_unlock(&s->cpu_sheaves->lock);
+	pw_unlock(&s->cpu_sheaves->lock, cpu);
 
 	if (spare) {
 		sheaf_flush_unused(s, spare);
 		free_empty_sheaf(s, spare);
 	}
 
 	if (rcu_free)
 		call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
 
-	sheaf_flush_main(s);
+	sheaf_flush_main(s, cpu);
 }
 
 static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
 {
 	struct slub_percpu_sheaves *pcs;
 
 	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
 
 	/* The cpu is not executing anymore so we don't need pcs->lock */
 	sheaf_flush_unused(s, pcs->main);
@@ -3942,83 +3943,84 @@ static bool has_pcs_used(int cpu, struct kmem_cache *s)
 
 /*
  * Flush percpu sheaves
  *
  * Called from CPU work handler with migration disabled.
  */
 static void flush_cpu_sheaves(struct work_struct *w)
 {
 	struct kmem_cache *s;
 	struct slub_flush_work *sfw;
+	int cpu = pw_get_cpu(w);
 
-	sfw = container_of(w, struct slub_flush_work, work);
-
+	sfw = &per_cpu(slub_flush, cpu);
 	s = sfw->s;
 
 	if (cache_has_sheaves(s))
-		pcs_flush_all(s);
+		pcs_flush_all(s, cpu);
 }
 
 static void flush_all_cpus_locked(struct kmem_cache *s)
 {
 	struct slub_flush_work *sfw;
 	unsigned int cpu;
 
 	lockdep_assert_cpus_held();
 	mutex_lock(&flush_lock);
 
 	for_each_online_cpu(cpu) {
 		sfw = &per_cpu(slub_flush, cpu);
 		if (!has_pcs_used(cpu, s)) {
 			sfw->skip = true;
 			continue;
 		}
-		INIT_WORK(&sfw->work, flush_cpu_sheaves);
+		INIT_PW(&sfw->pw, flush_cpu_sheaves, cpu);
 		sfw->skip = false;
 		sfw->s = s;
-		queue_work_on(cpu, flushwq, &sfw->work);
+		pw_queue_on(cpu, flushwq, &sfw->pw);
 	}
 
 	for_each_online_cpu(cpu) {
 		sfw = &per_cpu(slub_flush, cpu);
 		if (sfw->skip)
 			continue;
-		flush_work(&sfw->work);
+		pw_flush(&sfw->pw);
 	}
 
 	mutex_unlock(&flush_lock);
 }
 
 static void flush_all(struct kmem_cache *s)
 {
 	cpus_read_lock();
 	flush_all_cpus_locked(s);
 	cpus_read_unlock();
 }
 
 static void flush_rcu_sheaf(struct work_struct *w)
 {
 	struct slub_percpu_sheaves *pcs;
 	struct slab_sheaf *rcu_free;
 	struct slub_flush_work *sfw;
 	struct kmem_cache *s;
+	int cpu = pw_get_cpu(w);
 
-	sfw = container_of(w, struct slub_flush_work, work);
+	sfw = &per_cpu(slub_flush, cpu);
 	s = sfw->s;
 
-	local_lock(&s->cpu_sheaves->lock);
-	pcs = this_cpu_ptr(s->cpu_sheaves);
+	pw_lock(&s->cpu_sheaves->lock, cpu);
+	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
 
 	rcu_free = pcs->rcu_free;
 	pcs->rcu_free = NULL;
 
-	local_unlock(&s->cpu_sheaves->lock);
+	pw_unlock(&s->cpu_sheaves->lock, cpu);
 
 	if (rcu_free)
 		call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
 }
 
 
 /* needed for kvfree_rcu_barrier() */
 void flush_rcu_sheaves_on_cache(struct kmem_cache *s)
 {
 	struct slub_flush_work *sfw;
@@ -4029,28 +4031,28 @@ void flush_rcu_sheaves_on_cache(struct kmem_cache *s)
 	for_each_online_cpu(cpu) {
 		sfw = &per_cpu(slub_flush, cpu);
 
 		/*
 		 * we don't check if rcu_free sheaf exists - racing
 		 * __kfree_rcu_sheaf() might have just removed it.
 		 * by executing flush_rcu_sheaf() on the cpu we make
 		 * sure the __kfree_rcu_sheaf() finished its call_rcu()
 		 */
 
-		INIT_WORK(&sfw->work, flush_rcu_sheaf);
+		INIT_PW(&sfw->pw, flush_rcu_sheaf, cpu);
 		sfw->s = s;
-		queue_work_on(cpu, flushwq, &sfw->work);
+		pw_queue_on(cpu, flushwq, &sfw->pw);
 	}
 
 	for_each_online_cpu(cpu) {
 		sfw = &per_cpu(slub_flush, cpu);
-		flush_work(&sfw->work);
+		pw_flush(&sfw->pw);
 	}
 
 	mutex_unlock(&flush_lock);
 }
 
 void flush_all_rcu_sheaves(void)
 {
 	struct kmem_cache *s;
 
 	cpus_read_lock();
@@ -4589,36 +4591,36 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
  * unlocked.
  */
 static struct slub_percpu_sheaves *
 __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs, gfp_t gfp)
 {
 	struct slab_sheaf *empty = NULL;
 	struct slab_sheaf *full;
 	struct node_barn *barn;
 	bool allow_spin;
 
-	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+	pw_lockdep_assert_held(&s->cpu_sheaves->lock);
 
 	/* Bootstrap or debug cache, back off */
 	if (unlikely(!cache_has_sheaves(s))) {
-		local_unlock(&s->cpu_sheaves->lock);
+		pw_unlock_local(&s->cpu_sheaves->lock);
 		return NULL;
 	}
 
 	if (pcs->spare && pcs->spare->size > 0) {
 		swap(pcs->main, pcs->spare);
 		return pcs;
 	}
 
 	barn = get_barn(s);
 	if (!barn) {
-		local_unlock(&s->cpu_sheaves->lock);
+		pw_unlock_local(&s->cpu_sheaves->lock);
 		return NULL;
 	}
 
 	allow_spin = gfpflags_allow_spinning(gfp);
 
 	full = barn_replace_empty_sheaf(barn, pcs->main, allow_spin);
 
 	if (full) {
 		stat(s, BARN_GET);
 		pcs->main = full;
@@ -4629,21 +4631,21 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
 
 	if (allow_spin) {
 		if (pcs->spare) {
 			empty = pcs->spare;
 			pcs->spare = NULL;
 		} else {
 			empty = barn_get_empty_sheaf(barn, true);
 		}
 	}
 
-	local_unlock(&s->cpu_sheaves->lock);
+	pw_unlock_local(&s->cpu_sheaves->lock);
 	pcs = NULL;
 
 	if (!allow_spin)
 		return NULL;
 
 	if (!empty) {
 		empty = alloc_empty_sheaf(s, gfp);
 		if (!empty)
 			return NULL;
 	}
@@ -4655,21 +4657,21 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
 		 */
 		sheaf_flush_unused(s, empty);
 		free_empty_sheaf(s, empty);
 
 		return NULL;
 	}
 
 	full = empty;
 	empty = NULL;
 
-	if (!local_trylock(&s->cpu_sheaves->lock))
+	if (!pw_trylock_local(&s->cpu_sheaves->lock))
 		goto barn_put;
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
 	/*
 	 * If we put any empty or full sheaf to the barn below, it's due to
 	 * racing or being migrated to a different cpu. Breaching the barn's
 	 * sheaf limits should be thus rare enough so just ignore them to
 	 * simplify the recovery.
 	 */
 
@@ -4733,121 +4735,121 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
 
 	/*
 	 * We assume the percpu sheaves contain only local objects although it's
 	 * not completely guaranteed, so we verify later.
 	 */
 	if (unlikely(node_requested && node != numa_mem_id())) {
 		stat(s, ALLOC_NODE_MISMATCH);
 		return NULL;
 	}
 
-	if (!local_trylock(&s->cpu_sheaves->lock))
+	if (!pw_trylock_local(&s->cpu_sheaves->lock))
 		return NULL;
 
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
 	if (unlikely(pcs->main->size == 0)) {
 		pcs = __pcs_replace_empty_main(s, pcs, gfp);
 		if (unlikely(!pcs))
 			return NULL;
 	}
 
 	object = pcs->main->objects[pcs->main->size - 1];
 
 	if (unlikely(node_requested)) {
 		/*
 		 * Verify that the object was from the node we want. This could
 		 * be false because of cpu migration during an unlocked part of
 		 * the current allocation or previous freeing process.
 		 */
 		if (page_to_nid(virt_to_page(object)) != node) {
-			local_unlock(&s->cpu_sheaves->lock);
+			pw_unlock_local(&s->cpu_sheaves->lock);
 			stat(s, ALLOC_NODE_MISMATCH);
 			return NULL;
 		}
 	}
 
 	pcs->main->size--;
 
-	local_unlock(&s->cpu_sheaves->lock);
+	pw_unlock_local(&s->cpu_sheaves->lock);
 
 	stat(s, ALLOC_FASTPATH);
 
 	return object;
 }
 
 static __fastpath_inline
 unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, gfp_t gfp, size_t size,
 				 void **p)
 {
 	struct slub_percpu_sheaves *pcs;
 	struct slab_sheaf *main;
 	unsigned int allocated = 0;
 	unsigned int batch;
 
 next_batch:
-	if (!local_trylock(&s->cpu_sheaves->lock))
+	if (!pw_trylock_local(&s->cpu_sheaves->lock))
 		return allocated;
 
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
 	if (unlikely(pcs->main->size == 0)) {
 
 		struct slab_sheaf *full;
 		struct node_barn *barn;
 
 		if (unlikely(!cache_has_sheaves(s))) {
-			local_unlock(&s->cpu_sheaves->lock);
+			pw_unlock_local(&s->cpu_sheaves->lock);
 			return allocated;
 		}
 
 		if (pcs->spare && pcs->spare->size > 0) {
 			swap(pcs->main, pcs->spare);
 			goto do_alloc;
 		}
 
 		barn = get_barn(s);
 		if (!barn) {
-			local_unlock(&s->cpu_sheaves->lock);
+			pw_unlock_local(&s->cpu_sheaves->lock);
 			return allocated;
 		}
 
 		full = barn_replace_empty_sheaf(barn, pcs->main,
 						gfpflags_allow_spinning(gfp));
 
 		if (full) {
 			stat(s, BARN_GET);
 			pcs->main = full;
 			goto do_alloc;
 		}
 
 		stat(s, BARN_GET_FAIL);
 
-		local_unlock(&s->cpu_sheaves->lock);
+		pw_unlock_local(&s->cpu_sheaves->lock);
 
 		/*
 		 * Once full sheaves in barn are depleted, let the bulk
 		 * allocation continue from slab pages, otherwise we would just
 		 * be copying arrays of pointers twice.
 		 */
 		return allocated;
 	}
 
 do_alloc:
 
 	main = pcs->main;
 	batch = min(size, main->size);
 
 	main->size -= batch;
 	memcpy(p, main->objects + main->size, batch * sizeof(void *));
 
-	local_unlock(&s->cpu_sheaves->lock);
+	pw_unlock_local(&s->cpu_sheaves->lock);
 
 	stat_add(s, ALLOC_FASTPATH, batch);
 
 	allocated += batch;
 
 	if (batch < size) {
 		p += batch;
 		size -= batch;
 		goto next_batch;
 	}
@@ -5017,40 +5019,40 @@ kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
 					     &sheaf->objects[0])) {
 			kfree(sheaf);
 			return NULL;
 		}
 
 		sheaf->size = size;
 
 		return sheaf;
 	}
 
-	local_lock(&s->cpu_sheaves->lock);
+	pw_lock_local(&s->cpu_sheaves->lock);
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
 	if (pcs->spare) {
 		sheaf = pcs->spare;
 		pcs->spare = NULL;
 		stat(s, SHEAF_PREFILL_FAST);
 	} else {
 		barn = get_barn(s);
 
 		stat(s, SHEAF_PREFILL_SLOW);
 		if (barn)
 			sheaf = barn_get_full_or_empty_sheaf(barn);
 		if (sheaf && sheaf->size)
 			stat(s, BARN_GET);
 		else
 			stat(s, BARN_GET_FAIL);
 	}
 
-	local_unlock(&s->cpu_sheaves->lock);
+	pw_unlock_local(&s->cpu_sheaves->lock);
 
 
 	if (!sheaf)
 		sheaf = alloc_empty_sheaf(s, gfp);
 
 	if (sheaf) {
 		sheaf->capacity = s->sheaf_capacity;
 		sheaf->pfmemalloc = false;
 
 		if (sheaf->size < size &&
@@ -5080,31 +5082,31 @@ void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
 	struct slub_percpu_sheaves *pcs;
 	struct node_barn *barn;
 
 	if (unlikely((sheaf->capacity != s->sheaf_capacity)
 		     || sheaf->pfmemalloc)) {
 		sheaf_flush_unused(s, sheaf);
 		kfree(sheaf);
 		return;
 	}
 
-	local_lock(&s->cpu_sheaves->lock);
+	pw_lock_local(&s->cpu_sheaves->lock);
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 	barn = get_barn(s);
 
 	if (!pcs->spare) {
 		pcs->spare = sheaf;
 		sheaf = NULL;
 		stat(s, SHEAF_RETURN_FAST);
 	}
 
-	local_unlock(&s->cpu_sheaves->lock);
+	pw_unlock_local(&s->cpu_sheaves->lock);
 
 	if (!sheaf)
 		return;
 
 	stat(s, SHEAF_RETURN_SLOW);
 
 	/*
 	 * If the barn has too many full sheaves or we fail to refill the sheaf,
 	 * simply flush and free it.
 	 */
@@ -5627,21 +5629,21 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
  * An alternative scenario that gets us here is when we fail
  * barn_replace_full_sheaf(), because there's no empty sheaf available in the
  * barn, so we had to allocate it by alloc_empty_sheaf(). But because we saw the
  * limit on full sheaves was not exceeded, we assume it didn't change and just
  * put the full sheaf there.
  */
 static void __pcs_install_empty_sheaf(struct kmem_cache *s,
 		struct slub_percpu_sheaves *pcs, struct slab_sheaf *empty,
 		struct node_barn *barn)
 {
-	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+	pw_lockdep_assert_held(&s->cpu_sheaves->lock);
 
 	/* This is what we expect to find if nobody interrupted us. */
 	if (likely(!pcs->spare)) {
 		pcs->spare = pcs->main;
 		pcs->main = empty;
 		return;
 	}
 
 	/*
 	 * Unlikely because if the main sheaf had space, we would have just
@@ -5678,31 +5680,31 @@ static void __pcs_install_empty_sheaf(struct kmem_cache *s,
  */
 static struct slub_percpu_sheaves *
 __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
 			bool allow_spin)
 {
 	struct slab_sheaf *empty;
 	struct node_barn *barn;
 	bool put_fail;
 
 restart:
-	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+	pw_lockdep_assert_held(&s->cpu_sheaves->lock);
 
 	/* Bootstrap or debug cache, back off */
 	if (unlikely(!cache_has_sheaves(s))) {
-		local_unlock(&s->cpu_sheaves->lock);
+		pw_unlock_local(&s->cpu_sheaves->lock);
 		return NULL;
 	}
 
 	barn = get_barn(s);
 	if (!barn) {
-		local_unlock(&s->cpu_sheaves->lock);
+		pw_unlock_local(&s->cpu_sheaves->lock);
 		return NULL;
 	}
 
 	put_fail = false;
 
 	if (!pcs->spare) {
 		empty = barn_get_empty_sheaf(barn, allow_spin);
 		if (empty) {
 			pcs->spare = pcs->main;
 			pcs->main = empty;
@@ -5725,107 +5727,107 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
 	}
 
 	/* sheaf_flush_unused() doesn't support !allow_spin */
 	if (PTR_ERR(empty) == -E2BIG && allow_spin) {
 		/* Since we got here, spare exists and is full */
 		struct slab_sheaf *to_flush = pcs->spare;
 
 		stat(s, BARN_PUT_FAIL);
 
 		pcs->spare = NULL;
-		local_unlock(&s->cpu_sheaves->lock);
+		pw_unlock_local(&s->cpu_sheaves->lock);
 
 		sheaf_flush_unused(s, to_flush);
 		empty = to_flush;
 		goto got_empty;
 	}
 
 	/*
 	 * We could not replace full sheaf because barn had no empty
 	 * sheaves. We can still allocate it and put the full sheaf in
 	 * __pcs_install_empty_sheaf(), but if we fail to allocate it,
 	 * make sure to count the fail.
 	 */
 	put_fail = true;
 
 alloc_empty:
-	local_unlock(&s->cpu_sheaves->lock);
+	pw_unlock_local(&s->cpu_sheaves->lock);
 
 	/*
 	 * alloc_empty_sheaf() doesn't support !allow_spin and it's
 	 * easier to fall back to freeing directly without sheaves
 	 * than add the support (and to sheaf_flush_unused() above)
 	 */
 	if (!allow_spin)
 		return NULL;
 
 	empty = alloc_empty_sheaf(s, GFP_NOWAIT);
 	if (empty)
 		goto got_empty;
 
 	if (put_fail)
 		 stat(s, BARN_PUT_FAIL);
 
 	if (!sheaf_try_flush_main(s))
 		return NULL;
 
-	if (!local_trylock(&s->cpu_sheaves->lock))
+	if (!pw_trylock_local(&s->cpu_sheaves->lock))
 		return NULL;
 
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
 	/*
 	 * we flushed the main sheaf so it should be empty now,
 	 * but in case we got preempted or migrated, we need to
 	 * check again
 	 */
 	if (pcs->main->size == s->sheaf_capacity)
 		goto restart;
 
 	return pcs;
 
 got_empty:
-	if (!local_trylock(&s->cpu_sheaves->lock)) {
+	if (!pw_trylock_local(&s->cpu_sheaves->lock)) {
 		barn_put_empty_sheaf(barn, empty);
 		return NULL;
 	}
 
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 	__pcs_install_empty_sheaf(s, pcs, empty, barn);
 
 	return pcs;
 }
 
 /*
  * Free an object to the percpu sheaves.
  * The object is expected to have passed slab_free_hook() already.
  */
 static __fastpath_inline
 bool free_to_pcs(struct kmem_cache *s, void *object, bool allow_spin)
 {
 	struct slub_percpu_sheaves *pcs;
 
-	if (!local_trylock(&s->cpu_sheaves->lock))
+	if (!pw_trylock_local(&s->cpu_sheaves->lock))
 		return false;
 
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
 	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
 
 		pcs = __pcs_replace_full_main(s, pcs, allow_spin);
 		if (unlikely(!pcs))
 			return false;
 	}
 
 	pcs->main->objects[pcs->main->size++] = object;
 
-	local_unlock(&s->cpu_sheaves->lock);
+	pw_unlock_local(&s->cpu_sheaves->lock);
 
 	stat(s, FREE_FASTPATH);
 
 	return true;
 }
 
 static void rcu_free_sheaf(struct rcu_head *head)
 {
 	struct slab_sheaf *sheaf;
 	struct node_barn *barn = NULL;
@@ -5898,63 +5900,63 @@ static DEFINE_WAIT_OVERRIDE_MAP(kfree_rcu_sheaf_map, LD_WAIT_CONFIG);
 bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 {
 	struct slub_percpu_sheaves *pcs;
 	struct slab_sheaf *rcu_sheaf;
 
 	if (WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT)))
 		return false;
 
 	lock_map_acquire_try(&kfree_rcu_sheaf_map);
 
-	if (!local_trylock(&s->cpu_sheaves->lock))
+	if (!pw_trylock_local(&s->cpu_sheaves->lock))
 		goto fail;
 
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
 	if (unlikely(!pcs->rcu_free)) {
 
 		struct slab_sheaf *empty;
 		struct node_barn *barn;
 
 		/* Bootstrap or debug cache, fall back */
 		if (unlikely(!cache_has_sheaves(s))) {
-			local_unlock(&s->cpu_sheaves->lock);
+			pw_unlock_local(&s->cpu_sheaves->lock);
 			goto fail;
 		}
 
 		if (pcs->spare && pcs->spare->size == 0) {
 			pcs->rcu_free = pcs->spare;
 			pcs->spare = NULL;
 			goto do_free;
 		}
 
 		barn = get_barn(s);
 		if (!barn) {
-			local_unlock(&s->cpu_sheaves->lock);
+			pw_unlock_local(&s->cpu_sheaves->lock);
 			goto fail;
 		}
 
 		empty = barn_get_empty_sheaf(barn, true);
 
 		if (empty) {
 			pcs->rcu_free = empty;
 			goto do_free;
 		}
 
-		local_unlock(&s->cpu_sheaves->lock);
+		pw_unlock_local(&s->cpu_sheaves->lock);
 
 		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
 
 		if (!empty)
 			goto fail;
 
-		if (!local_trylock(&s->cpu_sheaves->lock)) {
+		if (!pw_trylock_local(&s->cpu_sheaves->lock)) {
 			barn_put_empty_sheaf(barn, empty);
 			goto fail;
 		}
 
 		pcs = this_cpu_ptr(s->cpu_sheaves);
 
 		if (unlikely(pcs->rcu_free))
 			barn_put_empty_sheaf(barn, empty);
 		else
 			pcs->rcu_free = empty;
@@ -5971,27 +5973,27 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 	rcu_sheaf->objects[rcu_sheaf->size++] = obj;
 
 	if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
 		rcu_sheaf = NULL;
 	} else {
 		pcs->rcu_free = NULL;
 		rcu_sheaf->node = numa_node_id();
 	}
 
 	/*
-	 * we flush before local_unlock to make sure a racing
+	 * we flush before pw_unlock_local to make sure a racing
 	 * flush_all_rcu_sheaves() doesn't miss this sheaf
 	 */
 	if (rcu_sheaf)
 		call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
 
-	local_unlock(&s->cpu_sheaves->lock);
+	pw_unlock_local(&s->cpu_sheaves->lock);
 
 	stat(s, FREE_RCU_SHEAF);
 	lock_map_release(&kfree_rcu_sheaf_map);
 	return true;
 
 fail:
 	stat(s, FREE_RCU_SHEAF_FAIL);
 	lock_map_release(&kfree_rcu_sheaf_map);
 	return false;
 }
@@ -6082,21 +6084,21 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 			continue;
 		}
 
 		i++;
 	}
 
 	if (!size)
 		goto flush_remote;
 
 next_batch:
-	if (!local_trylock(&s->cpu_sheaves->lock))
+	if (!pw_trylock_local(&s->cpu_sheaves->lock))
 		goto fallback;
 
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
 	if (likely(pcs->main->size < s->sheaf_capacity))
 		goto do_free;
 
 	barn = get_barn(s);
 	if (!barn)
 		goto no_empty;
@@ -6125,37 +6127,37 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 	stat(s, BARN_PUT);
 	pcs->main = empty;
 
 do_free:
 	main = pcs->main;
 	batch = min(size, s->sheaf_capacity - main->size);
 
 	memcpy(main->objects + main->size, p, batch * sizeof(void *));
 	main->size += batch;
 
-	local_unlock(&s->cpu_sheaves->lock);
+	pw_unlock_local(&s->cpu_sheaves->lock);
 
 	stat_add(s, FREE_FASTPATH, batch);
 
 	if (batch < size) {
 		p += batch;
 		size -= batch;
 		goto next_batch;
 	}
 
 	if (remote_nr)
 		goto flush_remote;
 
 	return;
 
 no_empty:
-	local_unlock(&s->cpu_sheaves->lock);
+	pw_unlock_local(&s->cpu_sheaves->lock);
 
 	/*
 	 * if we depleted all empty sheaves in the barn or there are too
 	 * many full sheaves, free the rest to slab pages
 	 */
 fallback:
 	__kmem_cache_free_bulk(s, size, p);
 	stat_add(s, FREE_SLOWPATH, size);
 
 flush_remote:
@@ -7554,21 +7556,21 @@ static inline int alloc_kmem_cache_stats(struct kmem_cache *s)
 static int init_percpu_sheaves(struct kmem_cache *s)
 {
 	static struct slab_sheaf bootstrap_sheaf = {};
 	int cpu;
 
 	for_each_possible_cpu(cpu) {
 		struct slub_percpu_sheaves *pcs;
 
 		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
 
-		local_trylock_init(&pcs->lock);
+		pw_trylock_init(&pcs->lock);
 
 		/*
 		 * Bootstrap sheaf has zero size so fast-path allocation fails.
 		 * It has also size == s->sheaf_capacity, so fast-path free
 		 * fails. In the slow paths we recognize the situation by
 		 * checking s->sheaf_capacity. This allows fast paths to assume
 		 * s->cpu_sheaves and pcs->main always exists and are valid.
 		 * It's also safe to share the single static bootstrap_sheaf
 		 * with zero-sized objects array as it's never modified.
 		 *
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH v2 1/7] seg6: add End.MAP behavior
From: Andrea Mayer @ 2026-05-19  1:31 UTC (permalink / raw)
  To: Yuya Kusakabe
  Cc: David S. Miller, Eric Dumazet, David Ahern, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Justin Iurman, Shuah Khan,
	Jonathan Corbet, Shuah Khan, linux-kernel, netdev,
	linux-kselftest, linux-doc, stefano.salsano, ahabdels,
	Andrea Mayer
In-Reply-To: <20260505-seg6-mobile-v2-1-9e8022bdfdb6@gmail.com>

On Tue, 05 May 2026 01:30:11 +0900
Yuya Kusakabe <yuya.kusakabe@gmail.com> wrote:

> Add the End.MAP behavior (RFC 9433 Section 6.2): an endpoint that
> replaces the IPv6 destination address with a configured next SID
> and forwards via IPv6 routing without consuming the SRH.  The new
> nh6 attribute selects the replacement SID.
>
> Add three drop reasons that End.MAP emits to dropreason-core.h, so
> dropped packets show up in the standard skb:kfree_skb tracepoint:
>
>   SEG6_MOBILE_INVALID_SRH_SL
>   SEG6_MOBILE_HOP_LIMIT_EXCEEDED
>   SEG6_MOBILE_NOMEM

The commit message lists three drop reasons including
SEG6_MOBILE_HOP_LIMIT_EXCEEDED, but the code does not add that one.
The likely third reason (SEG6_MOBILE_MTU_EXCEEDED) appears in
patch 2. (Flagged by Sashiko, the Patchwork AI reviewer.)

>
> Configuration:
>
>   ip -6 route add 2001:db8:f::/64 \
>       encap seg6local action End.MAP nh6 2001:db8:1::e \
>       dev <dev>
>
> Link: https://www.rfc-editor.org/rfc/rfc9433.html#section-6.2

Nit: As far as I can see, Link: tags in this tree usually point to mailing
list messages (patch.msgid.link, lore.kernel.org). Other commits that
reference RFCs typically cite them in the commit body instead. Same for
patches 2-7.

> Signed-off-by: Yuya Kusakabe <yuya.kusakabe@gmail.com>
> ---
>  include/net/dropreason-core.h                    |  12 +++
>  include/uapi/linux/seg6_local.h                  |   2 +
>  net/ipv6/seg6_local.c                            |  73 ++++++++++++++++
>  tools/testing/selftests/net/Makefile             |   1 +
>  tools/testing/selftests/net/srv6_end_map_test.sh | 103 +++++++++++++++++++++++
>  5 files changed, 191 insertions(+)
>
> diff --git a/include/net/dropreason-core.h b/include/net/dropreason-core.h
> index e0ca3904ff8e..1be5c54d7605 100644
> --- a/include/net/dropreason-core.h
> +++ b/include/net/dropreason-core.h
> @@ -127,6 +127,8 @@
>  	FN(PSP_INPUT)			\
>  	FN(PSP_OUTPUT)			\
>  	FN(RECURSION_LIMIT)		\
> +	FN(SEG6_MOBILE_INVALID_SRH_SL)	\
> +	FN(SEG6_MOBILE_NOMEM)		\
>  	FNe(MAX)
>
>  /**
> @@ -600,6 +602,16 @@ enum skb_drop_reason {
>  	SKB_DROP_REASON_PSP_OUTPUT,
>  	/** @SKB_DROP_REASON_RECURSION_LIMIT: Dead loop on virtual device. */
>  	SKB_DROP_REASON_RECURSION_LIMIT,
> +	/**
> +	 * @SKB_DROP_REASON_SEG6_MOBILE_INVALID_SRH_SL: invalid Segments Left
> +	 * value or SRH validation failure on an SRv6 Mobile path.
> +	 */
> +	SKB_DROP_REASON_SEG6_MOBILE_INVALID_SRH_SL,

This single reason covers several distinct failure modes across the
patchset: wrong Segments Left, SRH absent, SRH structurally malformed, and
HMAC validation failure. An operator seeing this drop cannot tell which
check failed. Using separate SRv6-level drop reasons seems reasonable. See
my cover letter reply for the broader discussion on drop reasons.

> +	/**
> +	 * @SKB_DROP_REASON_SEG6_MOBILE_NOMEM: skb head/tail expansion or
> +	 * helper allocation failed on an SRv6 Mobile path.
> +	 */
> +	SKB_DROP_REASON_SEG6_MOBILE_NOMEM,

This overlaps with the existing generic SKB_DROP_REASON_NOMEM in
dropreason-core.h. Why not use the generic one?

>  	/**
>  	 * @SKB_DROP_REASON_MAX: the maximum of core drop reasons, which
>  	 * shouldn't be used as a real 'reason' - only for tracing code gen
> diff --git a/include/uapi/linux/seg6_local.h b/include/uapi/linux/seg6_local.h
> index 4fdc424c9cb3..45386fdfa821 100644
> --- a/include/uapi/linux/seg6_local.h
> +++ b/include/uapi/linux/seg6_local.h
> @@ -67,6 +67,8 @@ enum {
>  	SEG6_LOCAL_ACTION_END_BPF	= 15,
>  	/* decap and lookup of DA in v4 or v6 table */
>  	SEG6_LOCAL_ACTION_END_DT46	= 16,
> +	/* swap DA with new SID, leave SRH untouched (RFC 9433 Section 6.2) */
> +	SEG6_LOCAL_ACTION_END_MAP	= 17,
>
>  	__SEG6_LOCAL_ACTION_MAX,
>  };
> diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
> index 2b41e4c0dddd..bd8e3312973f 100644
> --- a/net/ipv6/seg6_local.c
> +++ b/net/ipv6/seg6_local.c
> @@ -1468,6 +1468,73 @@ static int input_action_end_bpf(struct sk_buff *skb,
>  	return -EINVAL;
>  }
>
> +/* SRH validation helper for SRv6 Mobile (RFC 9433) behaviors that may
> + * receive an SRv6 encapsulated packet.  Returns the SRH on success or
> + * NULL on validation failure / when the SRH is absent.  The caller
> + * uses @missing to distinguish the two NULL cases: an SRH-less packet
> + * may be acceptable depending on the behavior.
> + */
> +static struct ipv6_sr_hdr *seg6_mobile_get_validated_srh(struct sk_buff *skb,
> +							 bool *missing)
> +{
> +	struct ipv6_sr_hdr *srh = seg6_get_srh(skb, 0);
> +
> +	if (!srh) {
> +		if (missing)
> +			*missing = true;
> +		return NULL;
> +	}
> +	if (missing)
> +		*missing = false;
> +
> +#ifdef CONFIG_IPV6_SEG6_HMAC
> +	if (!seg6_hmac_validate_skb(skb))
> +		return NULL;
> +#endif
> +	return srh;
> +}

seg6_get_srh() returns NULL both when the SRH is absent and when it is
malformed (seg6_validate_srh fails, e.g. type != 4) or truncated.
seg6_mobile_get_validated_srh() sets *missing = true in all these cases,
so input_action_end_map() treats a malformed SRH the same as an absent
one and continues processing. HMAC validation is also bypassed because
seg6_get_srh() returns NULL before HMAC is reached.

seg6_mobile_get_validated_srh() needs to distinguish "absent" from
"malformed/truncated" so callers can drop on malformed instead of silently
accepting the packet.

> +
> +/* RFC 9433 Section 6.2 -- End.MAP
> + * Replace the outer IPv6 destination address with the configured next
> + * SID, decrement the Hop Limit, and forward via IPv6 routing.  The
> + * SRH is left untouched, so any subsequent End* behavior continues to
> + * see the original Segment List unchanged.
> + */

Nit: the function comment says "decrement the Hop Limit" but the code does not
do it explicitly. The forwarding path handles it (ip6_forward). Maybe
remove that part from the comment or add a note that the forwarding path
handles it?

> +static int input_action_end_map(struct sk_buff *skb,
> +				struct seg6_local_lwt *slwt)
> +{
> +	enum skb_drop_reason reason;
> +	struct ipv6_sr_hdr *srh;
> +	struct ipv6hdr *ip6h;
> +	bool no_srh = false;
> +
> +	reason = SKB_DROP_REASON_SEG6_MOBILE_INVALID_SRH_SL;

Because of the bug described above, the only path that reaches the
drop label with this reason is HMAC validation failure (when HMAC is
enabled). Same drop reason granularity point as above.

> +
> +	/* When an SRH is present it must HMAC-validate before we touch
> +	 * the destination; an SRH-less packet is also accepted because
> +	 * End.MAP does not consume the SRH.
> +	 */
> +	srh = seg6_mobile_get_validated_srh(skb, &no_srh);
> +	if (!srh && !no_srh)
> +		goto drop;

See above: this only catches HMAC failure. A malformed SRH falls through as
if the SRH were absent.

> +
> +	if (skb_ensure_writable(skb, sizeof(*ip6h))) {
> +		reason = SKB_DROP_REASON_SEG6_MOBILE_NOMEM;
> +		goto drop;
> +	}
> +
> +	ip6h = ipv6_hdr(skb);
> +	ip6h->daddr = slwt->nh6;

Sashiko flagged that for SRH-less packets this breaks the ICMPv6 checksum,
because the pseudo-header includes the DA. The AI bot was right, but when I
ran the selftest it passed. Digging a bit further, I noticed why:
2001:db8:f::1 and 2001:db8:2::e have the same 16-bit word sum
(0x000f+0x0001 = 0x0002+0x000e), so the checksum stays valid by
coincidence. Changing nh6 to 2001:db8:2::2 makes the ping fail with
Icmp6InCsumErrors.

> +
> +	skb_dst_drop(skb);
> +	seg6_lookup_nexthop(skb, NULL, 0);
> +	return dst_input(skb);

seg6_lookup_nexthop() calls seg6_lookup_any_nexthop() which already calls
skb_dst_drop() internally. The explicit skb_dst_drop(skb) above is
redundant.

> +
> +drop:
> +	kfree_skb_reason(skb, reason);
> +	return -EINVAL;
> +}
> +
>  static struct seg6_action_desc seg6_action_table[] = {
>  	{
>  		.action		= SEG6_LOCAL_ACTION_END,
> @@ -1565,6 +1632,12 @@ static struct seg6_action_desc seg6_action_table[] = {
>  		.optattrs	= SEG6_F_LOCAL_COUNTERS,
>  		.input		= input_action_end_bpf,
>  	},
> +	{
> +		.action		= SEG6_LOCAL_ACTION_END_MAP,
> +		.attrs		= SEG6_F_ATTR(SEG6_LOCAL_NH6),
> +		.optattrs	= SEG6_F_LOCAL_COUNTERS,
> +		.input		= input_action_end_map,
> +	},

End.MAP reuses SEG6_LOCAL_NH6 to mean "replacement SID", not "next-hop"
as in End.X/End.DX6. This overloads the existing UAPI semantics of
the attribute. The cover letter reply discusses this attribute-semantics
question across the patchset.

>
>  };
>
> diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
> index a275ed584026..4fbb1eff79f8 100644
> --- a/tools/testing/selftests/net/Makefile
> +++ b/tools/testing/selftests/net/Makefile
> @@ -90,6 +90,7 @@ TEST_PROGS := \
>  	srv6_end_dx4_netfilter_test.sh \
>  	srv6_end_dx6_netfilter_test.sh \
>  	srv6_end_flavors_test.sh \
> +	srv6_end_map_test.sh \
>  	srv6_end_next_csid_l3vpn_test.sh \
>  	srv6_end_x_next_csid_l3vpn_test.sh \
>  	srv6_hencap_red_l3vpn_test.sh \
> diff --git a/tools/testing/selftests/net/srv6_end_map_test.sh b/tools/testing/selftests/net/srv6_end_map_test.sh
> new file mode 100755
> index 000000000000..7ee54b4cc97f
> --- /dev/null
> +++ b/tools/testing/selftests/net/srv6_end_map_test.sh
> @@ -0,0 +1,103 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# shellcheck disable=SC2034,SC2154
> +#
> +# Selftest for the SRv6 End.MAP behavior (RFC 9433 Section 6.2).
> +#
> +#   +--------+   2001:db8:1::/64   +--------+   2001:db8:2::/64   +--------+
> +#   | srupf1 | ------------------- | srupf2 | ------------------- | srupf3 |
> +#   +--------+       veth-1        +--------+       veth-2        +--------+
> +#                                (intermediate
> +#                                 SRv6-aware UPF,
> +#                                 End.MAP)
> +#
> +# All three netns are SRv6-aware UPFs in the RFC 9433 sense (not
> +# 3GPP UPFs).  Per RFC 9433 Section 6.2 End.MAP is used by the
> +# intermediate UPF (here srupf2): srupf2 has an End.MAP SID for
> +# locator 2001:db8:f::/64 mapping to the new SID 2001:db8:2::e.
> +# srupf1 sends an IPv6 packet to 2001:db8:f::1; on srupf3 the
> +# destination address is expected to have been replaced by
> +# 2001:db8:2::e.

The selftest only covers the SRH-less path. End.MAP also accepts packets
with an SRH, and that case should be covered as well.

To expose the ICMPv6 checksum issue noted above, the address pair should be
chosen so that the daddr rewrite changes the checksum.

> [snip]

Thanks,

Ciao,
Andrea

^ permalink raw reply

* Re: [PATCH net-next v2 2/2] net: ti: icssg: Add HSR and LRE PA statistics
From: Jakub Kicinski @ 2026-05-19  1:45 UTC (permalink / raw)
  To: MD Danish Anwar, Felix Maurer, Luka Gejak
  Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
	Jonathan Corbet, Shuah Khan, Roger Quadros, Andrew Lunn,
	Meghana Malladi, Jacob Keller, David Carlier, Vadim Fedorenko,
	Kevin Hao, netdev, linux-doc, linux-kernel, linux-arm-kernel,
	Vladimir Oltean
In-Reply-To: <20260514075605.850674-3-danishanwar@ti.com>

On Thu, 14 May 2026 13:26:05 +0530 MD Danish Anwar wrote:
> Add new firmware PA statistics counters for HSR and LRE to the ethtool
> statistics exposed by the ICSSG driver.
> 
> New statistics added:
>  - FW_HSR_FWD_CHECK_FAIL_DROP: Packets dropped on the HSR forwarding path
>  - FW_HSR_HE_CHECK_FAIL_DROP: Packets dropped on the HSR host egress path
>  - FW_HSR_SKIP_HOST_DUP_DISCARD_FRAMES: Frames with duplicate discard
>    skipped
>  - FW_LRE_CNT_UNIQUE/DUPLICATE/MULTIPLE_RX: LRE duplicate detection
>    counters
>  - FW_LRE_CNT_RX/TX: LRE per-port frame counters
>  - FW_LRE_CNT_OWN_RX: Own HSR tagged frames received
>  - FW_LRE_CNT_ERRWRONGLAN: Frames with wrong LAN identifier (PRP)
> 
> Document the new HSR/LRE statistics in icssg_prueth.rst.

To an untrained eye these stats look like stuff that could 
be standardized across drivers. 

Luka, Felix, others on CC, do you think we should expose these
from HSR over netlink as "standard" offload stats different drivers 
can plug into or not worth it?

^ permalink raw reply

* Re: [PATCH net-next v5 0/8] net: devmem: support devmem with netkit devices
From: patchwork-bot+netdevbpf @ 2026-05-19  2:10 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, horms, corbet,
	skhan, alexs, si.yanteng, dzm91, michael.chan, pavan.chebbi,
	joshwash, hramamurthy, saeedm, tariqt, mbloch, leon,
	alexanderduyck, kernel-team, daniel, razor, shuah, dw, sdf.kernel,
	mohsin.bashr, willemb, jiang.kun2, xu.xin16, wang.yaxin, netdev,
	linux-doc, linux-kernel, linux-rdma, bpf, linux-kselftest, sdf,
	almasrymina, bobbyeshleman
In-Reply-To: <20260514-tcp-dm-netkit-v5-0-408c59b91e66@meta.com>

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Thu, 14 May 2026 10:22:27 -0700 you wrote:
> This series enables TCP devmem TX through netkit devices.
> 
> Netkit now supports queue leasing. A physical NIC's RX queue can be
> leased to a netkit guest interface inside a container namespace. This
> gives the container a devmem-capable data path on the RX side (bind-rx,
> etc...). On the TX side, the container process binds to its netkit guest
> interface and sends traffic that netkit redirects (via BPF or ip
> forwarding) to the physical NIC for DMA.
> 
> [...]

Here is the summary with links:
  - [net-next,v5,1/8] net: convert netmem_tx flag to enum
    https://git.kernel.org/netdev/net-next/c/7d3ab852dcd8
  - [net-next,v5,2/8] net: netkit: declare NETMEM_TX_NO_DMA mode
    https://git.kernel.org/netdev/net-next/c/6ce2bb048055
  - [net-next,v5,3/8] net: devmem: support TX over NETMEM_TX_NO_DMA devices
    https://git.kernel.org/netdev/net-next/c/1abe839b34ae
  - [net-next,v5,4/8] selftests: drv-net: ncdevmem: add -n flag to skip NIC configuration
    https://git.kernel.org/netdev/net-next/c/ecbdf3da7813
  - [net-next,v5,5/8] selftests: drv-net: make attr _nk_guest_ifname public
    https://git.kernel.org/netdev/net-next/c/28357ac667d4
  - [net-next,v5,6/8] selftests: drv-net: refactor devmem command builders into lib module
    https://git.kernel.org/netdev/net-next/c/6cac32fc3f1f
  - [net-next,v5,7/8] selftests: drv-net: add primary_rx_redirect support to NetDrvContEnv
    https://git.kernel.org/netdev/net-next/c/886a790b59f9
  - [net-next,v5,8/8] selftests: drv-net: add netkit devmem tests
    https://git.kernel.org/netdev/net-next/c/28c1cc999fbb

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH 1/6] alloc_tag: add ioctl to /proc/allocinfo
From: Hao Ge @ 2026-05-19  2:52 UTC (permalink / raw)
  To: Abhishek Bapat
  Cc: Suren Baghdasaryan, Shuah Khan, Jonathan Corbet, linux-doc,
	linux-kernel, linux-mm, Sourav Panda, Kent Overstreet,
	Andrew Morton
In-Reply-To: <CAL41Mv7zCEFUAD43wBRo+rno2AK-teUUaVSdx2Pd7qDU0uNwsg@mail.gmail.com>

Hi Abhishek


Thanks for the follow-up.


On 2026/5/19 07:41, Abhishek Bapat wrote:
> On Wed, May 13, 2026 at 9:38 PM Hao Ge<hao.ge@linux.dev>  wrote:
>> Hi Suren and Abhishek
>>
>>
>> Thanks for the patch! A couple of minor comments below.
>>
>>
>> On 2026/5/5 07:36, Abhishek Bapat wrote:
>>> From: Suren Baghdasaryan<surenb@google.com>
>>>
>>> Add the following ioctl commands for /proc/allocinfo file:
>>>
>>> ALLOCINFO_IOC_CONTENT_ID - gets content identifier which can be used
>>> to check whether the file content has changed specifically due to module
>>> load/unload. Every time a module is loaded / unloaded, the returned
>>> value will be different. By comparing the identifier value at the
>>> beginning and at the end of the content retrieval operation, users can
>>> validate retrieved information for consistency.
>>>
>>> ALLOCINFO_IOC_GET_AT - gets the record at the specified position. This
>>> is the position of a record in /proc/allocinfo.
>>>
>>> ALLOCINFO_IOC_GET_NEXT - gets the record next to the last retrieved
>>> one. If no records were previously retrieved, returns the first
>>> record.
>>>
>>> Signed-off-by: Suren Baghdasaryan<surenb@google.com>
>>> Signed-off-by: Abhishek Bapat<abhishekbapat@google.com>
>>> ---
>>>    .../userspace-api/ioctl/ioctl-number.rst      |   2 +
>>>    include/linux/codetag.h                       |   1 +
>>>    include/uapi/linux/alloc_tag.h                |  54 ++++++
>>>    lib/alloc_tag.c                               | 178 +++++++++++++++++-
>>>    lib/codetag.c                                 |  11 ++
>>>    5 files changed, 244 insertions(+), 2 deletions(-)
>>>    create mode 100644 include/uapi/linux/alloc_tag.h
>>>
>>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> index 331223761fff..84f6808a8578 100644
>>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> @@ -349,6 +349,8 @@ Code  Seq#    Include File                                             Comments
>>>                                                                           <mailto:luzmaximilian@gmail.com>
>>>    0xA5  20-2F  linux/surface_aggregator/dtx.h                            Microsoft Surface DTX driver
>>>                                                                           <mailto:luzmaximilian@gmail.com>
>>> +0xA6  00-0F  uapi/linux/alloc_tag.h                                    Memory allocation profiling
>>> +<mailto:surenb@google.com>
>>>    0xAA  00-3F  linux/uapi/linux/userfaultfd.h
>>>    0xAB  00-1F  linux/nbd.h
>>>    0xAC  00-1F  linux/raw.h
>>> diff --git a/include/linux/codetag.h b/include/linux/codetag.h
>>> index 8ea2a5f7c98a..2bcd4e7c809e 100644
>>> --- a/include/linux/codetag.h
>>> +++ b/include/linux/codetag.h
>>> @@ -76,6 +76,7 @@ struct codetag_iterator {
>>>
>>>    void codetag_lock_module_list(struct codetag_type *cttype, bool lock);
>>>    bool codetag_trylock_module_list(struct codetag_type *cttype);
>>> +unsigned long codetag_get_content_id(struct codetag_type *cttype);
>>>    struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype);
>>>    struct codetag *codetag_next_ct(struct codetag_iterator *iter);
>>>
>>> diff --git a/include/uapi/linux/alloc_tag.h b/include/uapi/linux/alloc_tag.h
>>> new file mode 100644
>>> index 000000000000..e9a5b55fcc7a
>>> --- /dev/null
>>> +++ b/include/uapi/linux/alloc_tag.h
>>> @@ -0,0 +1,54 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>>> +/*
>>> + *  include/linux/alloc_tag.h
>>> + */
>>> +
>>> +#ifndef _UAPI_ALLOC_TAG_H
>>> +#define _UAPI_ALLOC_TAG_H
>>> +
>>> +#include <linux/types.h>
>>> +
>>> +#define ALLOCINFO_STR_SIZE   64
>>> +
>>> +struct allocinfo_content_id {
>>> +     __u64 id;
>>> +};
>>> +
>>> +struct allocinfo_tag {
>>> +     /* Longer names are trimmed */
>>> +     char modname[ALLOCINFO_STR_SIZE];
>>> +     char function[ALLOCINFO_STR_SIZE];
>>> +     char filename[ALLOCINFO_STR_SIZE];
>>> +     __u64 lineno;
>>> +};
>>> +
>>> +struct allocinfo_counter {
>>> +     __u64 bytes;
>>> +     __u64 calls;
>>> +     __u8 accurate;
>>> +     __u8 pad[7]; /* Add alignment to not break the 32-bit compatible interface */
>>> +};
>>> +
>>> +struct allocinfo_tag_data {
>>> +     struct allocinfo_tag tag;
>>> +     struct allocinfo_counter counter;
>>> +};
>>> +
>>> +struct allocinfo_get_at {
>>> +     __u64 pos;      /* input */
>>> +     struct allocinfo_tag_data data;
>>> +};
>>> +
>>> +#define _ALLOCINFO_IOC_CONTENT_ID    0
>>> +#define _ALLOCINFO_IOC_GET_AT                1
>>> +#define _ALLOCINFO_IOC_GET_NEXT              2
>>> +
>>> +#define ALLOCINFO_IOC_BASE           0xA6
>>> +#define ALLOCINFO_IOC_CONTENT_ID     _IOR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_CONTENT_ID,     \
>>> +                                          struct allocinfo_content_id)
>>> +#define ALLOCINFO_IOC_GET_AT         _IOWR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_GET_AT,        \
>>> +                                           struct allocinfo_get_at)
>>> +#define ALLOCINFO_IOC_GET_NEXT               _IOR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_GET_NEXT,       \
>>> +                                          struct allocinfo_tag_data)
>>> +
>>> +#endif /* _UAPI_ALLOC_TAG_H */
>>> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
>>> index ed1bdcf1f8ab..5c24d2f954d4 100644
>>> --- a/lib/alloc_tag.c
>>> +++ b/lib/alloc_tag.c
>>> @@ -14,6 +14,7 @@
>>>    #include <linux/string_choices.h>
>>>    #include <linux/vmalloc.h>
>>>    #include <linux/kmemleak.h>
>>> +#include <uapi/linux/alloc_tag.h>
>>>
>>>    #define ALLOCINFO_FILE_NAME         "allocinfo"
>>>    #define MODULE_ALLOC_TAG_VMAP_SIZE  (100000UL * sizeof(struct alloc_tag))
>>> @@ -46,6 +47,9 @@ int alloc_tag_ref_offs;
>>>    struct allocinfo_private {
>>>        struct codetag_iterator iter;
>>>        bool print_header;
>>> +     /* ioctl uses a separate iterator not to interfere with reads */
>>> +     struct codetag_iterator ioctl_iter;
>>> +     bool positioned; /* seq_open_private() sets to 0 */
>>>    };
>>>
>>>    static void *allocinfo_start(struct seq_file *m, loff_t *pos)
>>> @@ -125,6 +129,177 @@ static const struct seq_operations allocinfo_seq_op = {
>>>        .show   = allocinfo_show,
>>>    };
>>>
>>> +static int allocinfo_open(struct inode *inode, struct file *file)
>>> +{
>>> +     return seq_open_private(file, &allocinfo_seq_op,
>>> +                             sizeof(struct allocinfo_private));
>>> +}
>>> +
>>> +static int allocinfo_release(struct inode *inode, struct file *file)
>>> +{
>>> +     return seq_release_private(inode, file);
>>> +}
>>> +
>>> +static const char *allocinfo_str(const char *str)
>>> +{
>>> +     size_t len = strlen(str);
>>> +
>>> +     /* Keep an extra space for the trailing NULL. */
>>> +     if (len >= ALLOCINFO_STR_SIZE)
>>> +             str += (len - ALLOCINFO_STR_SIZE) + 1;
>>> +     return str;
>>> +}
>>> +
>>> +/* Copy a string and trim from the beginning if it's too long */
>>> +static void allocinfo_copy_str(char *dest, const char *src)
>>> +{
>>> +     strscpy(dest, allocinfo_str(src), ALLOCINFO_STR_SIZE);
>>> +}
>>> +
>>> +static void allocinfo_to_params(struct codetag *ct,
>>> +                             struct allocinfo_tag_data *data)
>>> +{
>>> +     struct alloc_tag *tag = ct_to_alloc_tag(ct);
>>> +     struct alloc_tag_counters counter = alloc_tag_read(tag);
>>> +
>>> +     if (ct->modname)
>>> +             allocinfo_copy_str(data->tag.modname, ct->modname);
>>> +     else
>>> +             data->tag.modname[0] = '\0';
>> Minor nit about allocinfo_to_params():
>>
>> When modname is NULL (built-in kernel code), the current code sets it
>>
>> to an empty string:
>>
>>       if (ct->modname)
>>
>>           allocinfo_copy_str(data->tag.modname, ct->modname);
>>
>>       else
>>
>>           data->tag.modname[0] = '\0';
>>
>> This is of course workable in userspace by checking for an empty
>>
>> string, but I was wondering if it would be cleaner to use "vmlinux"
>>
>> as a default:
>>
>> else
>>
>>             allocinfo_copy_str(data->tag.modname, "vmlinux");
>>
>>
>> For some context, in our memory analysis workflow we often group
>>
>> allocations by module to get a quick overview of where memory goes,
>>
>> for example:
>>
>> vmlinux:    2.1 GB    (kernel core)
>>
>> nvidia:     1.2 GB    (GPU driver)
>>
>> iwlwifi:    800 MB    (WiFi driver)
>>
>> ext4:       500 MB    (filesystem)
>>
>> Having a consistent identifier for kernel built-in allocations would
>>
>> avoid each userspace tool needing to handle the empty string as a
>>
>> special case. Totally fine if this is intentional though.
>>
> Thanks for bringing this up, I can certainly make this change.
> However, the information is not currently exposed this way through
> /proc/allocinfo. /proc/allocinfo does not categorize kernel non-module
> allocations as vmlinux, so there will a delta between how IOCTL and
> /proc/allocinfo behave. Suren, could you comment on whether this
> recommendation is fine by you?
>
Right, /proc/allocinfo indeed doesn't categorize them as vmlinux currently.

It's just that in practice we often group allocations by module, so 
having "vmlinux" as a default

would be convenient. Let's wait for Suren's input.

>>> +     allocinfo_copy_str(data->tag.function, ct->function);
>>> +     allocinfo_copy_str(data->tag.filename, ct->filename);
>>> +     data->tag.lineno = ct->lineno;
>>> +     data->counter.bytes = counter.bytes;
>>> +     data->counter.calls = counter.calls;
>>> +     data->counter.accurate = !alloc_tag_is_inaccurate(tag);
>>> +}
>>> +
>>> +static int allocinfo_ioctl_get_content_id(struct seq_file *m, void __user *arg)
>>> +{
>>> +     struct allocinfo_content_id params;
>>> +
>>> +     codetag_lock_module_list(alloc_tag_cttype, true);
>>> +     params.id = codetag_get_content_id(alloc_tag_cttype);
>>> +     codetag_lock_module_list(alloc_tag_cttype, false);
>>> +     if (copy_to_user(arg, &params, sizeof(params)))
>>> +             return -EFAULT;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static int allocinfo_ioctl_get_at(struct seq_file *m, void __user *arg)
>>> +{
>>> +     struct allocinfo_private *priv;
>>> +     struct codetag *ct;
>>> +     __u64 pos;
>>> +     struct allocinfo_get_at params = {0};
>>> +
>>> +     if (copy_from_user(&params, arg, sizeof(params)))
>>> +             return -EFAULT;
>>> +
>>> +     priv = (struct allocinfo_private *)m->private;
>>> +     pos = params.pos;
>>> +
>>> +     codetag_lock_module_list(alloc_tag_cttype, true);
>>> +
>>> +     /* Find the codetag */
>>> +     priv->ioctl_iter = codetag_get_ct_iter(alloc_tag_cttype);
>>> +     ct = codetag_next_ct(&priv->ioctl_iter);
>>> +     while (ct && pos--)
>>> +             ct = codetag_next_ct(&priv->ioctl_iter);
>> I noticed that codetag_next_ct(&priv->ioctl_iter) and
>>
>> priv->positioned are accessed without serialization in the ioctl
>>
>> path. Concurrent ioctl calls on the same fd could race on these
>>
>> fields. Just something I spotted while reading the code.
>>
>>
>> Thanks
>>
>> Best Regards
>>
>> Hao
>>
> I believe this should be prevented by `codetag_lock_module_list`; am I
> wrong in my understanding?

Thanks for the explanation! codetag_lock_module_list is designed to 
protect the module list from concurrent load/unload, which it does

correctly. However, it doesn't cover the race between concurrent ioctl 
calls on the same fd, since it acquires cttype->mod_lock via

down_read() and rwsem read locks allow multiple readers to proceed 
concurrently:

Thread A: ALLOCINFO_IOC_GET_AT

down_read(&cttype->mod_lock)              // read lock acquired

priv->ioctl_iter = codetag_get_ct_iter(...)

ct = codetag_next_ct(&priv->ioctl_iter)

priv->positioned = true;

Thread B: ALLOCINFO_IOC_GET_NEXT            // concurrent ioctl on same fd

down_read(&cttype->mod_lock)              // read locks don't exclude 
each other

if (!priv->positioned) {                  // sees partial state from 
Thread A

priv->ioctl_iter = ...                // overwrites Thread A's iterator

}

ct = codetag_next_ct(&priv->ioctl_iter)   // corrupted iterator

priv->ioctl_iter and priv->positioned are per-fd state with no 
serialization in the ioctl path.

Just something I spotted.

Thanks

Best Regards

Hao

>>> +     if (ct) {
>>> +             allocinfo_to_params(ct, &params.data);
>>> +             priv->positioned = true;
>>> +     }
>>> +
>>> +     codetag_lock_module_list(alloc_tag_cttype, false);
>>> +
>>> +     if (!ct)
>>> +             return -ENOENT;
>>> +
>>> +     if (copy_to_user(arg, &params, sizeof(params)))
>>> +             return -EFAULT;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static int allocinfo_ioctl_get_next(struct seq_file *m, void __user *arg)
>>> +{
>>> +     struct allocinfo_private *priv;
>>> +     struct codetag *ct;
>>> +     struct allocinfo_tag_data params = {0};
>>> +     int ret = 0;
>>> +
>>> +     priv = (struct allocinfo_private *)m->private;
>>> +
>>> +     codetag_lock_module_list(alloc_tag_cttype, true);
>>> +
>>> +     if (!priv->positioned) {
>>> +             priv->ioctl_iter = codetag_get_ct_iter(alloc_tag_cttype);
>>> +             priv->positioned = true;
>>> +     }
>>> +
>>> +     ct = codetag_next_ct(&priv->ioctl_iter);
>>> +     if (ct)
>>> +             allocinfo_to_params(ct, &params);
>>> +
>>> +     if (!ct) {
>>> +             priv->positioned = false;
>>> +             ret = -ENOENT;
>>> +     }
>>> +     codetag_lock_module_list(alloc_tag_cttype, false);
>>> +
>>> +     if (ret == 0) {
>>> +             if (copy_to_user(arg, &params, sizeof(params)))
>>> +                     return -EFAULT;
>>> +     }
>>> +     return ret;
>>> +}
>>> +
>>> +static long allocinfo_ioctl(struct file *file, unsigned int cmd,
>>> +                         unsigned long __arg)
>>> +{
>>> +     void __user *arg = (void __user *)__arg;
>>> +     int ret;
>>> +
>>> +     switch (cmd) {
>>> +     case ALLOCINFO_IOC_CONTENT_ID:
>>> +             ret = allocinfo_ioctl_get_content_id(file->private_data, arg);
>>> +             break;
>>> +     case ALLOCINFO_IOC_GET_AT:
>>> +             ret = allocinfo_ioctl_get_at(file->private_data, arg);
>>> +             break;
>>> +     case ALLOCINFO_IOC_GET_NEXT:
>>> +             ret = allocinfo_ioctl_get_next(file->private_data, arg);
>>> +             break;
>>> +     default:
>>> +             ret = -ENOIOCTLCMD;
>>> +             break;
>>> +     }
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +#ifdef CONFIG_COMPAT
>>> +static long allocinfo_compat_ioctl(struct file *file, unsigned int cmd,
>>> +                                unsigned long arg)
>>> +{
>>> +     return allocinfo_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
>>> +}
>>> +#endif
>>> +
>>> +static const struct proc_ops allocinfo_proc_ops = {
>>> +     .proc_open              = allocinfo_open,
>>> +     .proc_read_iter         = seq_read_iter,
>>> +     .proc_lseek             = seq_lseek,
>>> +     .proc_release           = allocinfo_release,
>>> +     .proc_ioctl             = allocinfo_ioctl,
>>> +#ifdef CONFIG_COMPAT
>>> +     .proc_compat_ioctl      = allocinfo_compat_ioctl,
>>> +#endif
>>> +
>>> +};
>>> +
>>>    size_t alloc_tag_top_users(struct codetag_bytes *tags, size_t count, bool can_sleep)
>>>    {
>>>        struct codetag_iterator iter;
>>> @@ -946,8 +1121,7 @@ static int __init alloc_tag_init(void)
>>>                return 0;
>>>        }
>>>
>>> -     if (!proc_create_seq_private(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op,
>>> -                                  sizeof(struct allocinfo_private), NULL)) {
>>> +     if (!proc_create(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_proc_ops)) {
>>>                pr_err("Failed to create %s file\n", ALLOCINFO_FILE_NAME);
>>>                shutdown_mem_profiling(false);
>>>                return -ENOMEM;
>>> diff --git a/lib/codetag.c b/lib/codetag.c
>>> index 304667897ad4..93aa30991563 100644
>>> --- a/lib/codetag.c
>>> +++ b/lib/codetag.c
>>> @@ -48,6 +48,17 @@ bool codetag_trylock_module_list(struct codetag_type *cttype)
>>>        return down_read_trylock(&cttype->mod_lock) != 0;
>>>    }
>>>
>>> +unsigned long codetag_get_content_id(struct codetag_type *cttype)
>>> +{
>>> +     lockdep_assert_held(&cttype->mod_lock);
>>> +
>>> +     /*
>>> +      * next_mod_seq is updated on every load, so can be used to identify
>>> +      * content changes.
>>> +      */
>>> +     return cttype->next_mod_seq;
>>> +}
>>> +
>>>    struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
>>>    {
>>>        struct codetag_iterator iter = {
> Note, I will be following up with a v2 patchset with your feedback
> included. Please bring up any other points you'd want to clarify so
> that I can include all the changes in the v2 patchset. Thanks for
> reviewing!

^ permalink raw reply

* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Paul Moore @ 2026-05-19  2:55 UTC (permalink / raw)
  To: Song Liu
  Cc: Sasha Levin, corbet, akpm, skhan, linux-doc, linux-kernel,
	linux-kselftest, gregkh, linux-security-module
In-Reply-To: <CAPhsuW7Js0Z6tU30RphbUjWsJXETkxaGOArfGZpDjNPmZFUpuQ@mail.gmail.com>

On Mon, May 18, 2026 at 8:01 PM Song Liu <song@kernel.org> wrote:
> On Mon, May 18, 2026 at 4:57 PM Paul Moore <paul@paul-moore.com> wrote:
> > On Mon, May 18, 2026 at 7:23 PM Song Liu <song@kernel.org> wrote:
> > > On Mon, May 18, 2026 at 2:29 PM Paul Moore <paul@paul-moore.com> wrote:
> > > [...]
> > > > In my opinion, making killswitch an LSM is more of a procedural item
> > > > that deals with how we view a capability like killswitch.  I
> > > > personally view killswitch as somewhat similar to Lockdown, which is
> > > > why I made the suggestion.
> > > >
> > > > The use of kprobes, while an interesting idea, presents problems as
> > > > allowing any kernel symbol to be killed introduces the potential for
> > > > security regressions.  As a reminder, some LSMs, as well as other
> > > > kernel subsystems, have mechanisms in place to restrict root and/or
> > > > enforce one-way configuration locks; while many people equate "root"
> > > > with full control, in many cases today that is not strictly correct.
> > > >
> > > > Yes, kprobes have been around for some time, this is not a new
> > > > problem, but killswitch makes it far more convenient and accessible to
> > > > do dangerous things with kprobes.  If killswitch makes it past the RFC
> > > > stage without any significant changes to its kill mechanism, we may
> > > > need to start considering more liberal usage of NOKPROBE_SYMBOL()
> > > > which I think would be an unfortunate casualty.
> > >
> > > I don't think we can use NOKPROBE_SYMBOL(). There are functions
> > > that we don't want to killswitch, but still want to trace.
> >
> > That was exactly my point, but we need to figure something out so
> > killswitch doesn't make it easier to cause a regression.
>
> killswitch is making it easier to fix a CVE. It can surely make it easier
> to cause a regression. AFAICT, the only protection here is "it is only
> for root".

As I mentioned earlier, several LSMs have the ability to restrict root
beyond what is possible with traditional Linux accesscontrols.  For
example, with SELinux one could deny root a specific privilege while
also blocking changes to the SELinux policy; the root user would not
be able to restore that privilege without rebooting the system.

On a killswitch enabled system the ability to restrict root is lost as
root would be able to kill the enforcement of those access controls.
Presumably one could have the LSM block access to killswitch in this
particular case, but that defeats the purpose doesn't it?

The audit subsystem also has a somewhat similar one-way configuration
lock, which when set does not allow even root to unlock it, a reboot
is required.  By a bit of luck with regards to how the code is
written*, it may not be vulnerable to a killswitch regression but I do
wonder if there are other similar things in the kernel which would
have the same type of problem.

* This is probably the first time I think I've ever considered myself
lucky with respect to the audit code implementation.

-- 
paul-moore.com

^ permalink raw reply

* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Paul Moore @ 2026-05-19  3:08 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Song Liu, corbet, akpm, skhan, linux-doc, linux-kernel,
	linux-kselftest, gregkh, linux-security-module
In-Reply-To: <aguvV8QCxK28ZHct@laps>

On Mon, May 18, 2026 at 8:31 PM Sasha Levin <sashal@kernel.org> wrote:
> On Mon, May 18, 2026 at 05:29:32PM -0400, Paul Moore wrote:
> >From my perspective there are two different issues here: should
> >killswitch be a LSM, and should killswitch leverage kprobes to be able
> >to "kill" security related symbols.  After all, are we okay with
> >killswitch killing capable() and friends?
>
> killswitch doesn't do it on it's own. It may be instructed by root to do that,
> at which point that is root's problem.

As I mentioned previously, there are cases where we can restrict
root's privileges today, but a functional killswitch would allow that
restriction to be bypassed.  My last email to Song has an example with
SELinux.

> >In my opinion, making killswitch an LSM is more of a procedural item
> >that deals with how we view a capability like killswitch.  I
> >personally view killswitch as somewhat similar to Lockdown, which is
> >why I made the suggestion.
>
> Maybe I'm not all that familiar with LSMs, but we would need to be able to stop
> "random" code paths from executing, and I don't think we can create LSM hooks
> at that granularity, no?

I don't see any LSM hooks in this revision of killswitch, and as long
as it is based on a kprobes I can't imagine it would ever use any.  As
I mentioned above, my killswitch-as-a-LSM comment is primarily about
killswitch filling a role very similar to Lockdown.

> >The use of kprobes, while an interesting idea, presents problems as
> >allowing any kernel symbol to be killed introduces the potential for
> >security regressions.  As a reminder, some LSMs, as well as other
> >kernel subsystems, have mechanisms in place to restrict root and/or
> >enforce one-way configuration locks; while many people equate "root"
> >with full control, in many cases today that is not strictly correct.
>
> killswitch "complies" with lockdown. Is there a different scenario which we
> should be blocking?

See the SELinux example I mentioned in my email to Song.

> >Yes, kprobes have been around for some time, this is not a new
> >problem, but killswitch makes it far more convenient and accessible to
> >do dangerous things with kprobes.  If killswitch makes it past the RFC
> >stage without any significant changes to its kill mechanism, we may
> >need to start considering more liberal usage of NOKPROBE_SYMBOL()
> >which I think would be an unfortunate casualty.
>
> Why? If I don't really mind the security impact, I want to be able to have a
> killswitch-like interface on my systems. If an attacker is in my systems,
> killswitch is the least of my concerns I think.
>
> If you are security concious, just don't enable CONFIG_KILLSWITCH?

Isn't the whole point of killswitch to have it enabled everywhere
because you never know when you might want/need it?

-- 
paul-moore.com

^ permalink raw reply

* [PATCH] eventpoll: add missing kernel-doc for @ctx function parameters
From: Randy Dunlap @ 2026-05-19  4:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Randy Dunlap, Alexander Viro, Christian Brauner, Jan Kara,
	Jonathan Corbet, Shuah Khan, linux-doc

Add the missing kernel-doc comments to prevent kernel-doc build
warnings while building the documentation.

WARNING: fs/eventpoll.c:1684 function parameter 'ctx' not described in 'reverse_path_check'
WARNING: fs/eventpoll.c:2349 function parameter 'ctx' not described in 'ep_loop_check_proc'

Fixes: e09c77d94003 ("eventpoll: hoist CTL_ADD scratch state into struct ep_ctl_ctx")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
---
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: linux-doc@vger.kernel.org

 fs/eventpoll.c |    2 ++
 1 file changed, 2 insertions(+)

--- linux-next-20260518.orig/fs/eventpoll.c
+++ linux-next-20260518/fs/eventpoll.c
@@ -1677,6 +1677,7 @@ static int reverse_path_check_proc(struc
  *                      anchoring files with newly proposed links; make
  *                      sure those links don't push any path-length bucket
  *                      over its limit in path_limits[].
+ * @ctx: Per-do_epoll_ctl() scratch for the loop / path checks.
  *
  * Return: %zero if the proposed links don't create too many paths,
  *	    %-1 otherwise.
@@ -2339,6 +2340,7 @@ static int ep_poll(struct eventpoll *ep,
  *                      epoll file does not create closed loops, and
  *                      determine the depth of the subtree starting at @ep
  *
+ * @ctx: Per-do_epoll_ctl() scratch for the loop / path checks.
  * @ep: the &struct eventpoll to be currently checked.
  * @depth: Current depth of the path being checked.
  *

^ permalink raw reply

* Re: [PATCH v3 2/4] PCI: endpoint: Add DOE mailbox support for endpoint functions
From: Aksh Garg @ 2026-05-19  5:23 UTC (permalink / raw)
  To: Manivannan Sadhasivam
  Cc: linux-pci, linux-doc, kwilczynski, bhelgaas, corbet, kishon,
	skhan, lukas, cassel, alistair, linux-arm-kernel, linux-kernel,
	s-vadapalli, danishanwar, srk
In-Reply-To: <ies3cbldthjv4vgraibgo642pfuvcr3lsixgxeisqa34ygkpbf@dd2qstv5fiig>



On 15/05/26 18:10, Manivannan Sadhasivam wrote:
> On Fri, May 15, 2026 at 11:05:29AM +0530, Aksh Garg wrote:
>>
>>
>> On 14/05/26 13:33, Manivannan Sadhasivam wrote:
>>> On Mon, Apr 27, 2026 at 10:47:23AM +0530, Aksh Garg wrote:
>>>> DOE (Data Object Exchange) is a standard PCIe extended capability
>>>> feature introduced in the Data Object Exchange (DOE) ECN for
>>>> PCIe r5.0. It provides a communication mechanism primarily used for
>>>> implementing PCIe security features such as device authentication, and
>>>> secure link establishment. Think of DOE as a sophisticated mailbox
>>>> system built into PCIe. The root complex can send structured requests
>>>> to the endpoint device through DOE mailboxes, and the endpoint device
>>>> responds with appropriate data.
>>>>
>>>> Add the DOE support for PCIe endpoint devices, enabling endpoint
>>>> functions to process the DOE requests from the host. The implementation
>>>> provides framework APIs for EPC core driver and controller drivers to
>>>> register mailboxes, and request processing with workqueues ensuring
>>>> sequential handling per mailbox, and parallel handling across mailboxes.
>>>> The Discovery protocol is handled internally by the DOE core.
>>>>
>>>> This implementation complements the existing DOE implementation for
>>>> root complex in drivers/pci/doe.c.
>>>>
>>>> Co-developed-by: Siddharth Vadapalli <s-vadapalli@ti.com>
>>>> Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
>>>> Signed-off-by: Aksh Garg <a-garg7@ti.com>
>>>> ---
>>>> +
>>>> +/*
>>>> + * Global registry of protocol handlers.
>>>> + * When a new DOE protocol, library is added, add an entry to this array.
>>>> + */
>>>> +static const struct pci_doe_protocol pci_doe_protocols[] = {
>>>> +	{
>>>> +		.vid = PCI_VENDOR_ID_PCI_SIG,
>>>> +		.type = PCI_DOE_FEATURE_DISCOVERY,
>>>> +		.handler = pci_ep_doe_handle_discovery,
>>>> +	},
>>>> +};
>>>> +
>>>> +/*
>>>> + * Combines function number and capability offset into a unique lookup key
>>>> + * for storing/retrieving DOE mailboxes in an xarray.
>>>> + */
>>>> +#define PCI_DOE_MB_KEY(func, offset) \
>>>> +	(((unsigned long)(func) << 16) | (offset))
>>>> +#define PCI_DOE_PROTOCOL_COUNT        ARRAY_SIZE(pci_doe_protocols)
>>>> +
>>>> +/**
>>>> + * pci_ep_doe_init() - Initialize the DOE framework for a controller in EP mode
>>>> + * @epc: PCI endpoint controller
>>>> + *
>>>> + * Initialize the DOE framework data structures. This only initializes
>>>> + * the xarray that will hold the mailboxes.
>>>> + *
>>>> + * RETURNS: 0 on success, -errno on failure
>>>
>>> kernel-doc format to describe return value is 'Return:' or 'Returns:".
>>
>> Thanks for pointing this out. I will update this.
>>
>>>
>>>> + */
>>>> +int pci_ep_doe_init(struct pci_epc *epc)
>>>> +{
>>>> +	if (!epc)
>>>> +		return -EINVAL;
>>>> +
>>>> +	xa_init(&epc->doe_mbs);
>>>> +	return 0;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(pci_ep_doe_init);
>>>> +
>>
>> [...]
>>
>>>> +
>>>> +/**
>>>> + * pci_ep_doe_process_request() - Process DOE request on endpoint
>>>> + * @epc: PCI endpoint controller
>>>> + * @func_no: Physical function number
>>>> + * @cap_offset: DOE capability offset
>>>> + * @vendor: Vendor ID from request header
>>>> + * @type: Protocol type from request header
>>>> + * @request: Request payload in CPU-native format
>>>> + * @request_sz: Size of request payload (bytes)
>>>> + * @complete: Callback to invoke upon completion
>>>> + *
>>>> + * Asynchronously process a DOE request received on the endpoint. The request
>>>> + * payload should not include the DOE header (vendor/type/length). The protocol
>>>> + * handler will allocate the response buffer, which the caller (controller driver)
>>>> + * must free after use.
>>>> + *
>>>> + * This function returns immediately after queuing the request. The completion
>>>> + * callback will be invoked asynchronously from workqueue context once the
>>>> + * request is processed. The callback receives the function number and capability
>>>> + * offset to identify the mailbox, along with a status code (0 on success, -errno
>>>> + * on failure), and other required arguments.
>>>> + *
>>>> + * As per DOE specification, a mailbox processes one request at a time.
>>>> + * Therefore, this function will never be called concurrently for the same
>>>> + * mailbox by different callers.
>>>> + *
>>>> + * The caller is responsible for the conversion of the received DOE request
>>>> + * with le32_to_cpu() before calling this function.
>>>> + * Similarly, it is responsible for converting the response payload with
>>>> + * cpu_to_le32() before sending it back over the DOE mailbox.
>>>> + *
>>>> + * The caller is also responsible for ensuring that the request size
>>>> + * is within the limits defined by PCI_DOE_MAX_LENGTH.
>>>> + *
>>>> + * RETURNS: 0 if the request was successfully queued, -errno on failure
>>>> + */
>>>> +int pci_ep_doe_process_request(struct pci_epc *epc, u8 func_no, u16 cap_offset,
>>>> +			       u16 vendor, u8 type, const void *request, size_t request_sz,
>>>> +			       pci_ep_doe_complete_t complete)
>>>> +{
>>>> +	struct pci_ep_doe_mb *doe_mb;
>>>> +	struct pci_ep_doe_task *task;
>>>> +	int rc;
>>>> +
>>>> +	doe_mb = pci_ep_doe_get_mailbox(epc, func_no, cap_offset);
>>>> +	if (!doe_mb) {
>>>> +		kfree(request);
>>>> +		return -ENODEV;
>>>> +	}
>>>> +
>>>> +	task = kzalloc_obj(*task, GFP_KERNEL);
>>>> +	if (!task) {
>>>> +		kfree(request);
>>>> +		return -ENOMEM;
>>>> +	}
>>>> +
>>>> +	task->feat.vid = vendor;
>>>> +	task->feat.type = type;
>>>> +	task->request_pl = request;
>>>> +	task->request_pl_sz = request_sz;
>>>> +	task->response_pl = NULL;
>>>> +	task->response_pl_sz = 0;
>>>> +	task->complete = complete;
>>>> +
>>>> +	rc = pci_ep_doe_submit_task(doe_mb, task);
>>>> +	if (rc) {
>>>> +		kfree(request);
>>>> +		kfree(task);
>>>> +		return rc;
>>>> +	}
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(pci_ep_doe_process_request);
>>>
>>> So who is supposed to call this API? EPC driver that receives the DOE interrupt?
>>
>> Yes, the EPC drivers that receive the DOE interrupts are expected to
>> call this API.
>>
>>> But I don't see the any callers of this and below exported APIs in this series.
>>> Either you should add the callers or limit this series just to adding the DOE
>>> skeleton implementation with a clear follow-up.
>>
>> I currently am working on the EPC driver implementation for a platform
>> which has not been up-streamed yet. I plan to use these APIs to support
>> the DOE feature for that driver. Currently, I am not aware of any
>> platform whose EPC driver supports DOE feature and its interrupts, hence
>> I see no real callers of these APIs to include in this patch series.
>>
>> Would it be appropriate to add a dummy [NOT-FOR-MERGING] demonstration
>> patch over an existing EPC driver, showing how these DOE APIs would be
>> integrated into an EPC driver?
>>
> 
> Usually we don't add APIs without any callers. But if you have a realistic time
> frame and guarantee that you are going to add EPC driver support soon, then we
> can have these APIs merged first.
> 

Hi Mani,

The expected timeline for adding the upstream support for the platform
is by the end of Q3. Once it gets merged, we would post the patches to
add its EPC as well as downstream driver support on top of it.

> For demonstration purpose, you can just show the EPC integration as a snippet in
> cover letter or point to the downstream driver for reference (if it is not a
> secret sauce).

Sure, I will add a dummy EPC integration code in the cover letter to 
demonstrate the usage of those APIs.

Thanks.

> 
> - Mani
> 


^ permalink raw reply

* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Song Liu @ 2026-05-19  5:29 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Paul Moore, corbet, akpm, skhan, linux-doc, linux-kernel,
	linux-kselftest, gregkh, linux-security-module
In-Reply-To: <aguvV8QCxK28ZHct@laps>

On Mon, May 18, 2026 at 5:31 PM Sasha Levin <sashal@kernel.org> wrote:
>
> On Mon, May 18, 2026 at 05:29:32PM -0400, Paul Moore wrote:
> >From my perspective there are two different issues here: should
> >killswitch be a LSM, and should killswitch leverage kprobes to be able
> >to "kill" security related symbols.  After all, are we okay with
> >killswitch killing capable() and friends?
>
> killswitch doesn't do it on it's own. It may be instructed by root to do that,
> at which point that is root's problem.
>
> >In my opinion, making killswitch an LSM is more of a procedural item
> >that deals with how we view a capability like killswitch.  I
> >personally view killswitch as somewhat similar to Lockdown, which is
> >why I made the suggestion.
>
> Maybe I'm not all that familiar with LSMs, but we would need to be able to stop
> "random" code paths from executing, and I don't think we can create LSM hooks
> at that granularity, no?

There are much fewer LSM hooks than ftrace-able (killswitch-able)
functions. In this sense, killswitch is more granular. However, LSM
hooks allow LSM policies to make different decisions for different
arguments. In this sense, LSM hooks are more granular than
killswitch, as killswitch can only set a fixed return value for each
engaged function.

With current LSM solutions, we can mitigate issues like Copy Fail
without breaking other features of the system. In [1], Cloudflare
shared how they mitigate Copy Fail with BPF LSM.

Thanks,
Song

[1] https://blog.cloudflare.com/copy-fail-linux-vulnerability-mitigation/

^ permalink raw reply

* Re: [PATCH v3 3/4] PCI: endpoint: Add API for DOE initialization and setup in EPC core
From: Aksh Garg @ 2026-05-19  5:30 UTC (permalink / raw)
  To: Manivannan Sadhasivam
  Cc: linux-pci, linux-doc, kwilczynski, bhelgaas, corbet, kishon,
	skhan, lukas, cassel, alistair, linux-arm-kernel, linux-kernel,
	s-vadapalli, danishanwar, srk
In-Reply-To: <mn7rnqdunp4mq45a7ypf26rfpzjr2gik7w4p7hpj4x3r3fzfzz@dlrsn27u5mbf>



On 15/05/26 18:17, Manivannan Sadhasivam wrote:
> On Fri, May 15, 2026 at 10:21:52AM +0530, Aksh Garg wrote:
>>
>>
>> On 14/05/26 13:38, Manivannan Sadhasivam wrote:
>>> On Mon, Apr 27, 2026 at 10:47:24AM +0530, Aksh Garg wrote:
>>>> Add pci_epc_setup_doe() API in EPC core driver to initialize and setup
>>>> the DOE framework for an endpoint controller. The API discovers the DOE
>>>> capabilities (extended capability ID 0x2E), and registers each discovered
>>>> DOE mailbox for all the functions in the endpoint controller. This API
>>>> should be invoked by the controller driver during probe based on the
>>>> doe_capable feature.
>>>>
>>>> Add pci_epc_destroy_doe() API in EPC core driver for cleanup of DOE
>>>> resources, which should be invoked by the controller driver during
>>>> controller cleanup based on the doe_capable feature.
>>>>
>>>> Co-developed-by: Siddharth Vadapalli <s-vadapalli@ti.com>
>>>> Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
>>>> Signed-off-by: Aksh Garg <a-garg7@ti.com>
>>>> ---
>>>>
>>>> Changes from v2 to v3:
>>>> - Rebased on 7.1-rc1.
>>>>
>>>> Changes since v1:
>>>> - New patch added to v2 (not present in v1)
>>>>
>>>> v2: https://lore.kernel.org/all/20260401073022.215805-4-a-garg7@ti.com/
>>>>
>>>> This patch is introduced based on the feedback provided by Manivannan
>>>> Sadhasivam at [1].
>>>>
>>>
>>> Sweet! But I was expecting you to add atleast one EPC driver implementation to
>>> make use of these APIs.
>>>
>>> Also, why can't you call these APIs from the EPC core directly? Maybe during
>>> pci_epc_init_notify() once the register accesses become valid.
>>
>> Can we add the DOE initialization API to pci_epc_init_notify()? This
>> API seems to be called to notify the EPF drivers that the EPC device's
>> initialization has been completed, as the name and description suggests.
> 
> That's correct. But there is no harm in calling something like
> pci_epc_init_capabilities() inside its definition. Only concern would be that
> pci_epc_init_notify() is mostly called from threaded IRQ handlers. So loading
> the handler would not be recommended. But since it is threaded anyway and we
> don't have a better place to call, it would be OK.
> 
> We could've called this from pci_epc_{create/start}, but some controllers won't
> allow accessing CSRs without REFCLK. So only after pci_init_notify(), CSRs can
> be accessed.
> 
>> As 'pci_epc_doe_setup' is a part of EPC initialization, I thought the
>> EPC drivers should call this API before calling the pci_epc_init_notify().
>>
>> However, I agree with your suggestion to call the DOE setup API directly
>> from the EPC core instead of sprinkling over the EPC drivers. I would
>> recommend renaming the pci_epc_init_notify() API (and hence the
>> pci_epc_deinit_notify() as well) to something like pci_epc_init_complete(),
>> and add the DOE setup API/logic just before the
>> logic of notifying the EPF devices.
>>
> 
> No need to rename this API. Just use as is:
> 
> 	pci_epc_init_notify()
> 		-> pci_epc_init_capabilities()
> 			-> pci_epc_init_doe()
> 		-> epf->event_ops->epc_init()

Thank you for the suggestion, I will incorporate these changes in v4 series.

Regards,
Aksh Garg

> 
> - Mani
> 


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox