* Re: [PATCH v4 04/30] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
From: David Woodhouse @ 2026-05-19 22:43 UTC (permalink / raw)
To: Dongli Zhang, kvm
Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Thomas Gleixner,
Sean Christopherson, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Dave Hansen, Vitaly Kuznetsov, x86, Marc Zyngier, Juergen Gross,
Boris Ostrovsky, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
Jack Allister, Joey Gouly, joe.jin, linux-doc, linux-kernel,
xen-devel, linux-kselftest
In-Reply-To: <aa68ed10-15da-4368-a986-6864843a3c44@oracle.com>
[-- Attachment #1: Type: text/plain, Size: 7240 bytes --]
On Tue, 2026-05-19 at 14:23 -0700, Dongli Zhang wrote:
> I think I now understand why I feel like I am always asking weird questions. I
> have been thinking about how to account for downtime, so I see
> KVM_SET_CLOCK_GUEST as a supplement to KVM_SET_CLOCK.
I do not believe in "downtime". There is no such thing.
There is only "steal time".
A CPU may be off in the weeds — a vCPU suffering steal time, or even a
pCPU in SMM which is effectively the same thing — but time doesn't
stop, and neither does the TSC.
> Suppose we are not going to account for any downtime. With KVM_SET_CLOCK_GUEST:
>
> 1. The masterclock is active, so gTSC is synchronized across vCPUs. All vCPUs
> share the same kvm_read_l1_tsc(v, ka->master_cycle_now).
Strictly, by the time we get to the end of my series, masterclock is
active *because* all the vCPUs are running at the same TSC rate (even
if the guest set them to different offsets). But OK.
> 2. Migrate the gTSC to the target VM however people want (either ablolute value
> or offset value). (Optional) Account for downtime in gTSC however people want,
> even with KVM_SET_CLOCK/KVM_CLOCK_REALTIME, which you may not like.
>
> 3. Adjust kvm-clock (that is, ka->kvmclock_offset) with KVM_SET_CLOCK_GUEST.
>
> That is why you think KVM_SET_CLOCK is no longer required if we have
> KVM_SET_CLOCK_GUEST. While I think KVM_SET_CLOCK is required because of
> KVM_CLOCK_REALTIME.
If I recall correctly what we described in
https://lore.kernel.org/all/20240522001817.619072-8-dwmw2@infradead.org/
I don't think we actually needed KVM_SET_CLOCK at all, did we?
We *abuse* KVM_GET_CLOCK to give us a tuple of {realtime, host TSC}
because there's actually no other way for *userspace* to get that. We
don't actually *care* about the KVM clock part.
We use the {realtime, host TSC} pair to reconstitute the guest TSC
values to correctly reflect the passing of time while the guest was in
the ether.
> It it isn't required to account any downtime for gTSC or if there is another way
> to do so, only KVM_SET_CLOCK_GUEST is enough.
Right. If you only want the guest to come back with the *same* values
in its TSC as before the migration, as if the TSC was *paused* during
the migration, then you can just restore those values and use
KVM_SET_CLOCK_GUEST. Assuming you are on modern hardware and have set
all vCPUs to the same rate (and are using this series so the *guest*
can't break masterclock for you, and you can trust the
KVM_SET_CLOCK_GUEST will work).
> >
> > > Another scenario is when only MASTERCLOCK_UPDATE is pending and there is no
> > > pending CLOCK_UPDATE.
> > >
> > > In this scenario, is it fine to skip processing MASTERCLOCK_UPDATE before saving
> > > pvclock_vcpu_time_info?
> > >
> >
> > I'm not sure I understand that scenario.
> >
> > MASTERCLOCK_UPDATE means we have to actually recalculate the master
> > clock (which really *should* be rare, now!). And then any time we do
> > that, we also have to do a CLOCK_UPDATE on every vCPU to disseminate
> > the new information. Which is why kvm_end_pvclock_update() does exactly
> > that.
> >
> > So your "MASTERCLOCK_UPDATE is pending and there is no pending
> > CLOCK_UPDATE" doesn't make much sense to me. If MASTERCLOCK_UPDATE is
> > pending, then there *will* be a CLOCK_UPDATE pending.
>
> Suppose the VM is stopped and the master clock is active.
I don't know what it means for a VM to be 'stopped'. Do you mean that
all vCPUs happen to be experiencing steal time at the present moment?
> Suddenly, we change the host clocksource from TSC to HPET. pvclock_gtod_notify()
> may call pvclock_gtod_update_fn() to set a pending KVM_REQ_MASTERCLOCK_UPDATE
> for all vCPUs. Unless the pending KVM_REQ_MASTERCLOCK_UPDATE is processed by
> kvm_update_masterclock(), kvm_end_pvclock_update() will not set a pending
> KVM_REQ_CLOCK_UPDATE.
You say 'Unless'... do you mean 'Until'?
> Therefore, this is a scenario in which only KVM_REQ_MASTERCLOCK_UPDATE is pending.
>
> I do not think this scenario is important. I am just curious about the expected
> way to implement similar code in the future :)
I think that's working correctly. Until the master clock has *actually*
been updated, there's no point in setting CLOCK_UPDATE for each vCPU to
disseminate the new information to its own pvclock?
> >
> >
> > > > >
> > > > > Would it be helpful to validate that the delta is within a reasonable range,
> > > > > e.g. that the drift can never be more than five minutes (forward or backward)?
> > > >
> > > > If a guest has been running for months on a previous host and is
> > > > migrated to a new host, don't we expect that the KVM clock of the new
> > > > VM on the new host is tweaked from its default near-zero after
> > > > creation, to some large amount?
> > > >
> > >
> > > Regarding live migration, my own investigation does not show a proportional
> > > relationship between VM uptime and the amount of drift.
> >
> > You're comparing the VM on the source host, with the VM on the
> > destination post-migration.
>
> Apologies for making it confusing. I was just trying to explain why I think the
> kvm-clock drift will not be large.
Sure, but I don't care. If we have a sane API, the drift should be zero
:)
> We previously discussed the vCPU hotplug and kvm-clock drift issue. The longer
> the time interval between two vCPU hotplug events, the larger the drift.
>
> For live migration (with QEMU), I provided the equation to show that the drift
> will not be large, because it is determined by something else rather than by how
> long the VM has been running on the source server.
>
>
> For the previous vCPU hotplug and kvm-clock bug, if we add more vCPUs to a guest
> that has been running for three months, the drift will be relatively larger.
>
> For QEMU live migration, migrating a guest VM that has been running on the
> source host for *three months* versus one that has been running for *one day*
> will not cause much difference in kvm-clock drift.
Right.
> For the ideal live update case (on the same host), there may be no need to
> adjust gTSC so that it keeps incrementing. In that case, KVM_SET_CLOCK_GUEST can
> be used to adjust kvm-clock based on gTSC.
Right. You restore the gTSC using its *offset* from the host TSC which
hasn't stopped counting on the same host. Then use KVM_SET_CLOCK_GUEST
to restore the kvmclock in terms of the gTSC. And you have an
absolutely cycle-perfect migration.
> For the live migration scenario, the current QEMU implementation not only fails
> to account for downtime, but also has a drift issue. That is what I would like
> to address in QEMU.
Again, restore the gTSC as accurately as possible. Probably by working
out for *yourself* the relationships of the source and destination host
TSCs to real time, and then reconstituting on the destination using TSC
offset just as for live migration.
And then use KVM_SET_CLOCK_GUEST too.
That's what I attempted to document in
https://lore.kernel.org/all/20240522001817.619072-8-dwmw2@infradead.org/
and should probably revive.
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* Re: [PATCH v3] killswitch: add per-function short-circuit mitigation primitive
From: Song Liu @ 2026-05-19 22:00 UTC (permalink / raw)
To: Sasha Levin
Cc: Daniel Borkmann, linux-kernel, linux-doc, linux-kselftest, bpf,
live-patching, Greg Kroah-Hartman, Andrew Morton, Jonathan Corbet,
Mathieu Desnoyers, Joshua Peisach, Florian Weimer, Breno Leitao,
Anthony Iliopoulos, Michal Hocko, Jiri Olsa, John Fastabend,
Christian Brauner
In-Reply-To: <agzAwjKhOhuANz_P@laps>
On Tue, May 19, 2026 at 12:57 PM Sasha Levin <sashal@kernel.org> wrote:
[...]
> >Fully agree with Song here that there is no clear boundary, and that the
> >killswitch could lead to arbitrary, hard to debug breakage if applied to
> >the wrong function.. introducing worse bugs than the one being mitigated
> >or even /short-circuit LSM enforcement/ (engage security_file_open 0,
> >engage cap_capable 0, engage apparmor_* etc).
>
> This is similar to livepatch, right? Do we need guardrails there too?
livepatch has the same guardrails as other kernel modules:
CONFIG_MODULE_SIG, CONFIG_MODULE_SIG_FORCE, etc.
Thanks,
Song
^ permalink raw reply
* Re: [PATCH] Documentation: KVM: Document guest-visible compatibility expectations
From: David Woodhouse @ 2026-05-19 21:58 UTC (permalink / raw)
To: Oliver Upton
Cc: Paolo Bonzini, Marc Zyngier, Will Deacon, Jonathan Corbet,
Shuah Khan, kvm, Linux Doc Mailing List,
Kernel Mailing List, Linux, Sean Christopherson, Jim Mattson,
Joey Gouly, Suzuki K Poulose, Zenghui Yu, Catalin Marinas,
Raghavendra Rao Ananta, Eric Auger, Kees Cook, Arnd Bergmann,
Nathan Chancellor, linux-arm-kernel, kvmarm, linux-kselftest
In-Reply-To: <agzR2kaJsNa8X9lF@kernel.org>
[-- Attachment #1: Type: text/plain, Size: 6861 bytes --]
On Tue, 2026-05-19 at 14:10 -0700, Oliver Upton wrote:
> On Tue, May 19, 2026 at 03:13:30PM +0100, David Woodhouse wrote:
> > On Tue, 2026-05-19 at 15:53 +0200, Paolo Bonzini wrote:
> > > On Tue, May 19, 2026 at 3:00 PM David Woodhouse <dwmw2@infradead.org> wrote:
> > > > Or some guest configurations which have only ever been tested under KVM
> > > > could have a bug where they *rely* on the registers not being writable,
> > > > and write values which are inconsistent with the rest of their
> > > > configuration. Which breaks the moment those registers become writable.
> > >
> > > Yeah, just having guests that worked by utter chance - but you still
> > > don't want to break them - is the case that is most likely. Crappy
> > > code that runs only under emulation/virtualization appears with
> > > probability 1 over time.
> > >
> > > Is this likely in this specific case---probably not, honestly.
> > > Christoffer's patch dates back to 2018 (commit d53c2c29ae0d); *back
> > > then* KVM/Arm was a lot less mature, and people developing for Arm on
> > > vanilla upstream kernels have moved on from Linux 4.19.
> >
> > It's not really 2018 though. EC2 is still running kernels with that
> > older commit reverted because of the breaking change that it
> > introduced.
> >
> > So the behaviour corresponding to GICD_IIDR.implementation_rev=1 is
> > still current for *millions* of guests.
> >
> > I'm now finding that revert in our tree during a *later* kernel upgrade
> > and trying to eliminate it.
>
> Still, as far as upstream is concerned this is damn near a decade old.
> Decisions that you or your peers made in the downstream doesn't change
> that.
>
> > And sure, I have given the engineers responsible for that a very hard
> > stare and suggested that they should have fixed it 'properly' in the
> > first place with a patch like the one we're discussing right now.
> >
> > And they're looking at this thread and saying "haha no, if fixing
> > things properly for Arm is this hard then we'll stick with the crappy
> > approach".
>
> The appropriate time to ask for accomodation was a *very* long time ago.
>
> And in the absence of clear evidence of a guest depending on the broken
> IGROUPR behavior, I don't see how the guest-side changes of Christoffer's
> series are any different from the multitude of bug fixes that we take
> every single release cycle. It is an unfortunate bug and I concur with
> Marc that it doesn't seem like the sort of thing a guest could rely
> upon.
I find this concerning, because I've already explained this.
There is a very real possibility of guests simply not *noticing* that
they had bugs in this area, as it didn't *matter* what they wrote to
these registers since it never worked.
There is an even larger possibility of guests having worked around the
original issue by *detecting* whether the registers were actually
writable before choosing to use the alternative groups. And if such a
guest launches on a new kernel and then needs to be rolled back to an
older kernel, that will also break.
> Because it is very much a bug fix, it should've happened without a
> change to the revision number.
No. Changing the revision number in conjunction with the guest-visible
behaviour change is *absolutely* the right thing to do.
> Now, the handling of GICD_IIDR itself is a separate issue. By my count,
> the series broke UAPI on three separate occasions. Before b489edc36169
> IIDR was RAZ/WI from userspace. And of course dd6251e463d3 and d53c2c29ae0d
> changed the revision with no way of restoring the old value.
>
> And really, IIDR should've *never* been used as a buy in for new UAPI
> because it unnecessarily becomes guest visible. 49a1a2c70a7f ("KVM: arm64:
> vgic-v3: Advertise GICR_CTLR.{IR, CES} as a new GICD_IIDR revision") is
> a much better example for IIDR going forward as it gates *guest-side*
> behavior.
Yes, 49a1a2c70a7f is the exemplar. The guest-visible behaviour changes,
so we get a new IIDR revision and the ability to preserve the previous
behaviour by setting IIDR to the old value. That is exactly how it
should always be done.
> > I do not want them to be right. I don't think any of us want that.
> >
> > > I would still lean towards accepting the code considering the limited
> > > complexity of the addition (in fact I like it more now that it uses
> > > IIDR instead of v2_groups_user_writable, but that's taste).
> >
> > I'm absolutely prepared to have a separate technical discussion about
> > the v2_groups_user_writable thing for GICv2, which is the second part
> > of that series.
> >
> > It seems like the right thing to do, and as far as I can tell, this
> > code *never* worked with QEMU. But I'm not sure who even cares about
> > GICv2 any more. I couldn't find hardware and I had to test the whole
> > thing inside qemu-tcg.
> >
> > But the 'IIDR defaults to 3 but the *behaviour* doesn't match until you
> > explicitly *set* it to 3' thing seemed so *egregiously* wrong to me,
> > that I fixed it anyway.
>
> Wrong or not, this behavior is documented unambiguously. From the VGICv2
> UAPI documentation:
>
> """
> Userspace should set GICD_IIDR before setting any other registers (both
> KVM_DEV_ARM_VGIC_GRP_DIST_REGS and KVM_DEV_ARM_VGIC_GRP_CPU_REGS) to ensure
> the expected behavior. Unless GICD_IIDR has been set from userspace, writes
> to the interrupt group registers (GICD_IGROUPR) are ignored.
> """
>
> I'm not inclined to change that.
That'll all very well... but as far as I can tell, QEMU *doesn't* set
GICD_IIDR, so it still gets the bizarre behaviour where the *guest* can
write the registers, but userspace can't. So it looks like it'll work
except migration will fail. Am I missing something?
But honestly, I don't care one iota about GICv2; I was only trying to
do the cleanup while I was there. Feel free to drop that part entirely.
> As a way out of this whole mess, can we
> instead:
>
> - Allow userspace to set IIDR.Revision to 1
>
> - Drop any bug emulation from the handling of IGROUPR registers
It doesn't make sense to allow setting IIDR.Revision to 1 *without* the
one-liner that actually implements the corresponding behaviour change
in the IGROUPR registers. And as explained at least twice now, it's the
behaviour change that's *important* here.
The fact that it's a long-standing bug in KVM which downstream has been
working around for a long time doesn't matter. The unconditional
behavioural change *is* a bug and we should fix it.
> - Special-case the stupid GICv2 UAPI where IGROUPR are only writable if
> the VMM has written to IIDR and the revision >= 2
That already *is* a special case, right? And you'd rather leave it as it is?
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* Re: [PATCH v4 04/30] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
From: Dongli Zhang @ 2026-05-19 21:23 UTC (permalink / raw)
To: David Woodhouse, kvm
Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Thomas Gleixner,
Sean Christopherson, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Dave Hansen, Vitaly Kuznetsov, x86, Marc Zyngier, Juergen Gross,
Boris Ostrovsky, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
Jack Allister, Joey Gouly, joe.jin, linux-doc, linux-kernel,
xen-devel, linux-kselftest
In-Reply-To: <b9980333f3a310bf05e170e79c40cb2f46485caf.camel@infradead.org>
On 2026-05-19 12:50 AM, David Woodhouse wrote:
> On Mon, 2026-05-18 at 17:57 -0700, Dongli Zhang wrote:
>> On 2026-05-18 1:48 AM, David Woodhouse wrote:
>>> ...
>>
>> I have fixed the Thunderbird configuration. Does it look better to you?
>
> The date is certainly better, thank you. But although I *was* up late
> that night frowning at clocks, I didn't think I was up *quite* as late
> (almost 2am) as it suggests.
>
> But I suspect that getting *that* right is beyond the limit of
> Thunderbird's configurability.
>
> Thanks :)
Let me continue exploring how to add a timezone, such as "17:57 -0700".
>
>> I really appreciate guidelines like the ones below.
>>
>> https://lore.kernel.org/all/20240522001817.619072-8-dwmw2@infradead.org
>>
>> Assuming I am a user of the new API, I feel confused about whether the goal is
>> to replace KVM_SET_CLOCK with KVM_SET_CLOCK_GUEST, or whether the latter is
>> meant to supplement the former.
>
> The issue is that KVM_SET_CLOCK_GUEST can only be used in 'masterclock'
> mode, when the TSC is reliable and the guest TSCs are all in sync.
>
> Which ought to be *all* of the time, on modern hardware and sane
> configurations. And in this series, I don't even let the *guest* screw
> that over by setting different TSC offsets on different vCPUs any more
> (we stay in masterclock mode in that case now). But the VMM can cause
> its guest to come out of masterclock mode, by setting different TSC
> *speeds* on different vCPUs.
>
> So there remain some pathological cases where the kvmclock actually
> still has a justification to exist, and those are the cases where it
> needs to be set in its own right as a function of host time
> (KVM_SET_CLOCK), not purely as a function of the guest TSC
> (KVM_SET_CLOCK_GUEST).
I think I now understand why I feel like I am always asking weird questions. I
have been thinking about how to account for downtime, so I see
KVM_SET_CLOCK_GUEST as a supplement to KVM_SET_CLOCK.
Suppose we are not going to account for any downtime. With KVM_SET_CLOCK_GUEST:
1. The masterclock is active, so gTSC is synchronized across vCPUs. All vCPUs
share the same kvm_read_l1_tsc(v, ka->master_cycle_now).
2. Migrate the gTSC to the target VM however people want (either ablolute value
or offset value). (Optional) Account for downtime in gTSC however people want,
even with KVM_SET_CLOCK/KVM_CLOCK_REALTIME, which you may not like.
3. Adjust kvm-clock (that is, ka->kvmclock_offset) with KVM_SET_CLOCK_GUEST.
That is why you think KVM_SET_CLOCK is no longer required if we have
KVM_SET_CLOCK_GUEST. While I think KVM_SET_CLOCK is required because of
KVM_CLOCK_REALTIME.
It it isn't required to account any downtime for gTSC or if there is another way
to do so, only KVM_SET_CLOCK_GUEST is enough.
>
>
>>
>> If we are going to use KVM_SET_CLOCK_GUEST when KVM_SET_CLOCK is not needed, I
>> would appreciate it if the API could carry more data in addition to struct
>> pvclock_vcpu_time_info.
>>
>> +#define KVM_SET_CLOCK_GUEST _IOW(KVMIO, 0xd6, struct pvclock_vcpu_time_info)
>> +#define KVM_GET_CLOCK_GUEST _IOR(KVMIO, 0xd7, struct pvclock_vcpu_time_info)
>>
>>
[snip]
>>
>> Another scenario is when only MASTERCLOCK_UPDATE is pending and there is no
>> pending CLOCK_UPDATE.
>>
>> In this scenario, is it fine to skip processing MASTERCLOCK_UPDATE before saving
>> pvclock_vcpu_time_info?
>>
>
> I'm not sure I understand that scenario.
>
> MASTERCLOCK_UPDATE means we have to actually recalculate the master
> clock (which really *should* be rare, now!). And then any time we do
> that, we also have to do a CLOCK_UPDATE on every vCPU to disseminate
> the new information. Which is why kvm_end_pvclock_update() does exactly
> that.
>
> So your "MASTERCLOCK_UPDATE is pending and there is no pending
> CLOCK_UPDATE" doesn't make much sense to me. If MASTERCLOCK_UPDATE is
> pending, then there *will* be a CLOCK_UPDATE pending.
Suppose the VM is stopped and the master clock is active.
Suddenly, we change the host clocksource from TSC to HPET. pvclock_gtod_notify()
may call pvclock_gtod_update_fn() to set a pending KVM_REQ_MASTERCLOCK_UPDATE
for all vCPUs. Unless the pending KVM_REQ_MASTERCLOCK_UPDATE is processed by
kvm_update_masterclock(), kvm_end_pvclock_update() will not set a pending
KVM_REQ_CLOCK_UPDATE.
Therefore, this is a scenario in which only KVM_REQ_MASTERCLOCK_UPDATE is pending.
I do not think this scenario is important. I am just curious about the expected
way to implement similar code in the future :)
>
>
>>>>
>>>> Would it be helpful to validate that the delta is within a reasonable range,
>>>> e.g. that the drift can never be more than five minutes (forward or backward)?
>>>
>>> If a guest has been running for months on a previous host and is
>>> migrated to a new host, don't we expect that the KVM clock of the new
>>> VM on the new host is tweaked from its default near-zero after
>>> creation, to some large amount?
>>>
>>
>> Regarding live migration, my own investigation does not show a proportional
>> relationship between VM uptime and the amount of drift.
>
> You're comparing the VM on the source host, with the VM on the
> destination post-migration.
Apologies for making it confusing. I was just trying to explain why I think the
kvm-clock drift will not be large.
We previously discussed the vCPU hotplug and kvm-clock drift issue. The longer
the time interval between two vCPU hotplug events, the larger the drift.
For live migration (with QEMU), I provided the equation to show that the drift
will not be large, because it is determined by something else rather than by how
long the VM has been running on the source server.
For the previous vCPU hotplug and kvm-clock bug, if we add more vCPUs to a guest
that has been running for three months, the drift will be relatively larger.
For QEMU live migration, migrating a guest VM that has been running on the
source host for *three months* versus one that has been running for *one day*
will not cause much difference in kvm-clock drift.
>
> Perhaps I misunderstood, but I thought your suggested validation of a
> 'reasonable range' would also apply when adjusting the kvmclock of the
> nascent VM on the destination host, from "newly created" to "has been
> running for months" while migrating the state of the actual guest onto
> a clean new slate.
>
>> Just taking QEMU + KVM as an example: suppose TSC scaling is inactive, the
>> amount of drift does not depend on how long the VM has been running before live
>> migration.
>>
>> Instead, it depends on the delta between when we call MSR_IA32_TSC and
>> KVM_GET_CLOCK, and between MSR_IA32_TSC and KVM_SET_CLOCK.
>>
>> The guest TSC stops at P1 and resumes at P3.
>> The kvmclock stops at P2 and resumes at P4.
>>
>> We expect P1 == P2 and P3 == P4.
>>
>> On source host.
>>
>> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=0 ===> P1
>
> Here's where it all starts going wrong. Line 1.
>
> Any API which lets you get a single time value in isolation, and thus
> which is already out of date by the time the system call even returns,
> is fundamentally unsuitable for migration.
>
>> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=1
>> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=2
>> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=3
>> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=4
>> ... ...
>> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=N
>> - KVM_GET_CLOCK ===> P2
>>
>> On target host.
>>
>> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=1 ===> P3
>> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=2
>
> At this point, the nasty hack in the kernel steps in, realises that the
> value you're setting on vCPU 2 is within a second or so of the value
> you had previously set on vCPU 1, and snaps it back to be precisely the
> same. To work around the fundamental brokenness of this method.
>
>> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=3
>> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=4
>> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=5
>> ... ...
>> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=N
>> - KVM_SET_CLOCK ====> P4
>>
>>
>> Here is my equiation to predict the drift.
>
> I'm sure you're right, but I didn't get that far when looking at this.
> I'd already thrown up in my mouth a little bit by line one.
>
> Here's my equation to predict the drift of a live update done correctly
> on the same host using the method I've now put in the documentation:
>
> 0.
For the ideal live update case (on the same host), there may be no need to
adjust gTSC so that it keeps incrementing. In that case, KVM_SET_CLOCK_GUEST can
be used to adjust kvm-clock based on gTSC.
For the live migration scenario, the current QEMU implementation not only fails
to account for downtime, but also has a drift issue. That is what I would like
to address in QEMU.
Thank you very much!
Dongli Zhang
^ permalink raw reply
* Re: [PATCH] Documentation: KVM: Document guest-visible compatibility expectations
From: Oliver Upton @ 2026-05-19 21:10 UTC (permalink / raw)
To: David Woodhouse
Cc: Paolo Bonzini, Marc Zyngier, Will Deacon, Jonathan Corbet,
Shuah Khan, kvm, Linux Doc Mailing List,
Kernel Mailing List, Linux, Sean Christopherson, Jim Mattson,
Joey Gouly, Suzuki K Poulose, Zenghui Yu, Catalin Marinas,
Raghavendra Rao Ananta, Eric Auger, Kees Cook, Arnd Bergmann,
Nathan Chancellor, linux-arm-kernel, kvmarm, linux-kselftest
In-Reply-To: <9d0429ddbe4d8c6993e74237c4395697f80092d6.camel@infradead.org>
On Tue, May 19, 2026 at 03:13:30PM +0100, David Woodhouse wrote:
> On Tue, 2026-05-19 at 15:53 +0200, Paolo Bonzini wrote:
> > On Tue, May 19, 2026 at 3:00 PM David Woodhouse <dwmw2@infradead.org> wrote:
> > > Or some guest configurations which have only ever been tested under KVM
> > > could have a bug where they *rely* on the registers not being writable,
> > > and write values which are inconsistent with the rest of their
> > > configuration. Which breaks the moment those registers become writable.
> >
> > Yeah, just having guests that worked by utter chance - but you still
> > don't want to break them - is the case that is most likely. Crappy
> > code that runs only under emulation/virtualization appears with
> > probability 1 over time.
> >
> > Is this likely in this specific case---probably not, honestly.
> > Christoffer's patch dates back to 2018 (commit d53c2c29ae0d); *back
> > then* KVM/Arm was a lot less mature, and people developing for Arm on
> > vanilla upstream kernels have moved on from Linux 4.19.
>
> It's not really 2018 though. EC2 is still running kernels with that
> older commit reverted because of the breaking change that it
> introduced.
>
> So the behaviour corresponding to GICD_IIDR.implementation_rev=1 is
> still current for *millions* of guests.
>
> I'm now finding that revert in our tree during a *later* kernel upgrade
> and trying to eliminate it.
Still, as far as upstream is concerned this is damn near a decade old.
Decisions that you or your peers made in the downstream doesn't change
that.
> And sure, I have given the engineers responsible for that a very hard
> stare and suggested that they should have fixed it 'properly' in the
> first place with a patch like the one we're discussing right now.
>
> And they're looking at this thread and saying "haha no, if fixing
> things properly for Arm is this hard then we'll stick with the crappy
> approach".
The appropriate time to ask for accomodation was a *very* long time ago.
And in the absence of clear evidence of a guest depending on the broken
IGROUPR behavior, I don't see how the guest-side changes of Christoffer's
series are any different from the multitude of bug fixes that we take
every single release cycle. It is an unfortunate bug and I concur with
Marc that it doesn't seem like the sort of thing a guest could rely
upon.
Because it is very much a bug fix, it should've happened without a
change to the revision number.
Now, the handling of GICD_IIDR itself is a separate issue. By my count,
the series broke UAPI on three separate occasions. Before b489edc36169
IIDR was RAZ/WI from userspace. And of course dd6251e463d3 and d53c2c29ae0d
changed the revision with no way of restoring the old value.
And really, IIDR should've *never* been used as a buy in for new UAPI
because it unnecessarily becomes guest visible. 49a1a2c70a7f ("KVM: arm64:
vgic-v3: Advertise GICR_CTLR.{IR, CES} as a new GICD_IIDR revision") is
a much better example for IIDR going forward as it gates *guest-side*
behavior.
> I do not want them to be right. I don't think any of us want that.
>
> > I would still lean towards accepting the code considering the limited
> > complexity of the addition (in fact I like it more now that it uses
> > IIDR instead of v2_groups_user_writable, but that's taste).
>
> I'm absolutely prepared to have a separate technical discussion about
> the v2_groups_user_writable thing for GICv2, which is the second part
> of that series.
>
> It seems like the right thing to do, and as far as I can tell, this
> code *never* worked with QEMU. But I'm not sure who even cares about
> GICv2 any more. I couldn't find hardware and I had to test the whole
> thing inside qemu-tcg.
>
> But the 'IIDR defaults to 3 but the *behaviour* doesn't match until you
> explicitly *set* it to 3' thing seemed so *egregiously* wrong to me,
> that I fixed it anyway.
Wrong or not, this behavior is documented unambiguously. From the VGICv2
UAPI documentation:
"""
Userspace should set GICD_IIDR before setting any other registers (both
KVM_DEV_ARM_VGIC_GRP_DIST_REGS and KVM_DEV_ARM_VGIC_GRP_CPU_REGS) to ensure
the expected behavior. Unless GICD_IIDR has been set from userspace, writes
to the interrupt group registers (GICD_IGROUPR) are ignored.
"""
I'm not inclined to change that. As a way out of this whole mess, can we
instead:
- Allow userspace to set IIDR.Revision to 1
- Drop any bug emulation from the handling of IGROUPR registers
- Special-case the stupid GICv2 UAPI where IGROUPR are only writable if
the VMM has written to IIDR and the revision >= 2
Thanks,
Oliver
^ permalink raw reply
* Re: [PATCH v3 04/12] x86,fs/resctrl: Program PLZA through kmode arch hooks
From: Luck, Tony @ 2026-05-19 20:59 UTC (permalink / raw)
To: Babu Moger
Cc: corbet, reinette.chatre, Dave.Martin, james.morse, tglx, bp,
dave.hansen, skhan, x86, mingo, hpa, akpm, rdunlap,
pawan.kumar.gupta, feng.tang, dapeng1.mi, kees, elver, lirongqing,
paulmck, bhelgaas, seanjc, alexandre.chartre, yazen.ghannam,
peterz, chang.seok.bae, kim.phillips, xin, naveen,
thomas.lendacky, linux-doc, linux-kernel, eranian, peternewman,
sos-linux-ext-patches
In-Reply-To: <0cfd813e10072eefc8f4d84328e83bd9a6220ad4.1777591497.git.babu.moger@amd.com>
On Thu, Apr 30, 2026 at 06:24:49PM -0500, Babu Moger wrote:
> +void resctrl_arch_configure_kmode(cpumask_var_t cpu_mask, u32 closid, u32 rmid, bool enable)
> +{
> + union msr_pqr_plza_assoc plza = { 0 };
> +
> + plza.split.rmid = rmid;
> + plza.split.rmid_en = 1;
Shouldn't there be a parameter for the value of rmid_en?
User asked for global_assign_ctrl_assign_mon_per_cpu set it to '1'
User asked for global_assign_ctrl_inherit_mon_per_cpu set it to '0'
> + plza.split.closid = closid;
> + plza.split.closid_en = 1;
> + plza.split.plza_en = enable;
> +
> + on_each_cpu_mask(cpu_mask, resctrl_kmode_set_one_amd, &plza, 1);
> +}
-Tony
^ permalink raw reply
* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Paul Moore @ 2026-05-19 20:50 UTC (permalink / raw)
To: Sasha Levin
Cc: Song Liu, corbet, akpm, skhan, linux-doc, linux-kernel,
linux-kselftest, gregkh, linux-security-module
In-Reply-To: <agzBd9mMt3Zf7j1j@laps>
On Tue, May 19, 2026 at 4:00 PM Sasha Levin <sashal@kernel.org> wrote:
> On Mon, May 18, 2026 at 11:08:38PM -0400, Paul Moore wrote:
> >On Mon, May 18, 2026 at 8:31 PM Sasha Levin <sashal@kernel.org> wrote:
> >> On Mon, May 18, 2026 at 05:29:32PM -0400, Paul Moore wrote:
> >> >From my perspective there are two different issues here: should
> >> >killswitch be a LSM, and should killswitch leverage kprobes to be able
> >> >to "kill" security related symbols. After all, are we okay with
> >> >killswitch killing capable() and friends?
> >>
> >> killswitch doesn't do it on it's own. It may be instructed by root to do that,
> >> at which point that is root's problem.
> >
> >As I mentioned previously, there are cases where we can restrict
> >root's privileges today, but a functional killswitch would allow that
> >restriction to be bypassed. My last email to Song has an example with
> >SELinux.
>
> This would be handled by just disabling killswitch in those scenarios like how
> we do with lockdown, no?
One could presumably deny access to killswitch, but that pushes the
burden of choice onto the users/admins. Yes, that is the easy way to
solve thorny use case conflicts like this, but it would be nice if we
could do better for those who have to deal with this in the wild.
> >> >In my opinion, making killswitch an LSM is more of a procedural item
> >> >that deals with how we view a capability like killswitch. I
> >> >personally view killswitch as somewhat similar to Lockdown, which is
> >> >why I made the suggestion.
> >>
> >> Maybe I'm not all that familiar with LSMs, but we would need to be able to stop
> >> "random" code paths from executing, and I don't think we can create LSM hooks
> >> at that granularity, no?
> >
> >I don't see any LSM hooks in this revision of killswitch, and as long
> >as it is based on a kprobes I can't imagine it would ever use any. As
> >I mentioned above, my killswitch-as-a-LSM comment is primarily about
> >killswitch filling a role very similar to Lockdown.
>
> My question was more about how to structure killswitch as an LSM. I want to be
> able to poke at pretty much any function in the kernel, rather than restrict
> access to a known list of functions.
Well, like I said in my last reply to you, I can't imagine a kprobes
based killswitch would need to worry about the LSM hooks. Structuring
killswitch as an LSM would be mostly a few lines of code to register
it as an LSM and that's about it. Benefits would be minor, and likely
a matter of opinion, it's mostly about how we view something like
killswitch in the kernel. If we view it as a security mechanism
similar to lockdown, then it makes sense as a LSM, if we view this as
a completely different thing then it can be whatever it wants to be.
--
paul-moore.com
^ permalink raw reply
* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Paul Moore @ 2026-05-19 20:33 UTC (permalink / raw)
To: Song Liu
Cc: Sasha Levin, corbet, akpm, skhan, linux-doc, linux-kernel,
linux-kselftest, gregkh, linux-security-module
In-Reply-To: <CAPhsuW7Rhdh62AoceQpsfm0+kVsvz8zq97fupm4mtBEyVTkkcg@mail.gmail.com>
On Tue, May 19, 2026 at 1:29 AM Song Liu <song@kernel.org> wrote:
> On Mon, May 18, 2026 at 5:31 PM Sasha Levin <sashal@kernel.org> wrote:
> >
> > On Mon, May 18, 2026 at 05:29:32PM -0400, Paul Moore wrote:
> > >From my perspective there are two different issues here: should
> > >killswitch be a LSM, and should killswitch leverage kprobes to be able
> > >to "kill" security related symbols. After all, are we okay with
> > >killswitch killing capable() and friends?
> >
> > killswitch doesn't do it on it's own. It may be instructed by root to do that,
> > at which point that is root's problem.
> >
> > >In my opinion, making killswitch an LSM is more of a procedural item
> > >that deals with how we view a capability like killswitch. I
> > >personally view killswitch as somewhat similar to Lockdown, which is
> > >why I made the suggestion.
> >
> > Maybe I'm not all that familiar with LSMs, but we would need to be able to stop
> > "random" code paths from executing, and I don't think we can create LSM hooks
> > at that granularity, no?
>
> There are much fewer LSM hooks than ftrace-able (killswitch-able)
> functions. In this sense, killswitch is more granular.
I don't know if I would say it is necessarily more granular as its
ability to filter access is limited to a breakpoint set on a symbol,
but killswitch definitely has a larger quantity of control points.
> However, LSM
> hooks allow LSM policies to make different decisions for different
> arguments. In this sense, LSM hooks are more granular than
> killswitch, as killswitch can only set a fixed return value for each
> engaged function.
Yes, I think we agree here.
> With current LSM solutions, we can mitigate issues like Copy Fail
> without breaking other features of the system. In [1], Cloudflare
> shared how they mitigate Copy Fail with BPF LSM.
... and Android has been shown to not be vulnerable in the first place
due to their use of SELinux and a well crafted SELinux policy.
--
paul-moore.com
^ permalink raw reply
* [PATCH net-next V2 2/2] net/mlx5: implement max_sfs parameter
From: Tariq Toukan @ 2026-05-19 20:04 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Jiri Pirko, Simon Horman, Jonathan Corbet, Shuah Khan,
Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Vlad Dumitrescu, Aleksandr Loktionov, Daniel Zahka, David Ahern,
Nikolay Aleksandrov, netdev, linux-doc, linux-kernel, linux-rdma,
Gal Pressman, Dragos Tatulea, Jiri Pirko, Nikolay Aleksandrov
In-Reply-To: <20260519200436.353249-1-tariqt@nvidia.com>
From: Nikolay Aleksandrov <nikolay@nvidia.com>
Implement max_sfs generic parameter to allow users to control the total
light-weight NIC subfunctions that can be created using devlink instead
of external vendor tools. A value of 0 will effectively disable creation
of new subfunction devices. A warning is sent to user-space via extack
(returning extack without error code is interpreted as a warning by
user-space tools).
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
Documentation/networking/devlink/mlx5.rst | 7 +-
.../mellanox/mlx5/core/lib/nv_param.c | 83 ++++++++++++++++++-
2 files changed, 86 insertions(+), 4 deletions(-)
diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst
index 4bba4d780a4a..f5e2dccafa5a 100644
--- a/Documentation/networking/devlink/mlx5.rst
+++ b/Documentation/networking/devlink/mlx5.rst
@@ -45,8 +45,13 @@ Parameters
- The range is between 1 and a device-specific max.
- Applies to each physical function (PF) independently, if the device
supports it. Otherwise, it applies symmetrically to all PFs.
+ * - ``max_sfs``
+ - permanent
+ - The range is between 0 and a device-specific max.
+ - Applies to each physical function (PF) independently.
-Note: permanent parameters such as ``enable_sriov`` and ``total_vfs`` require FW reset to take effect
+Note: permanent parameters such as ``enable_sriov``, ``total_vfs`` and ``max_sfs``
+ require FW reset to take effect
.. code-block:: bash
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/nv_param.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/nv_param.c
index 19bb620b7436..eff3a67e4ca0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/nv_param.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/nv_param.c
@@ -68,7 +68,9 @@ struct mlx5_ifc_mnvda_reg_bits {
struct mlx5_ifc_nv_global_pci_conf_bits {
u8 sriov_valid[0x1];
- u8 reserved_at_1[0x10];
+ u8 reserved_at_1[0xa];
+ u8 per_pf_num_sf[0x1];
+ u8 reserved_at_c[0x5];
u8 per_pf_total_vf[0x1];
u8 reserved_at_12[0xe];
@@ -93,9 +95,11 @@ struct mlx5_ifc_nv_global_pci_cap_bits {
};
struct mlx5_ifc_nv_pf_pci_conf_bits {
- u8 reserved_at_0[0x9];
+ u8 log_sf_bar_size[0x8];
+ u8 pf_total_sf_en[0x1];
u8 pf_total_vf_en[0x1];
- u8 reserved_at_a[0x16];
+ u8 reserved_at_a[0x6];
+ u8 total_sf[0x10];
u8 reserved_at_20[0x20];
@@ -755,6 +759,76 @@ static int mlx5_devlink_total_vfs_validate(struct devlink *devlink, u32 id,
return 0;
}
+static int mlx5_devlink_max_sfs_get(struct devlink *devlink, u32 id,
+ struct devlink_param_gset_ctx *ctx,
+ struct netlink_ext_ack *extack)
+{
+ struct mlx5_core_dev *dev = devlink_priv(devlink);
+ u32 mnvda[MLX5_ST_SZ_DW(mnvda_reg)] = {};
+ void *data;
+ int err;
+
+ err = mlx5_nv_param_read_per_host_pf_conf(dev, mnvda, sizeof(mnvda));
+ if (err) {
+ NL_SET_ERR_MSG_MOD(extack, "Failed to read PF configuration");
+ return err;
+ }
+
+ data = MLX5_ADDR_OF(mnvda_reg, mnvda, configuration_item_data);
+ ctx->val.vu32 = MLX5_GET(nv_pf_pci_conf, data, total_sf);
+
+ return 0;
+}
+
+static int mlx5_devlink_max_sfs_set(struct devlink *devlink, u32 id,
+ struct devlink_param_gset_ctx *ctx,
+ struct netlink_ext_ack *extack)
+{
+ struct mlx5_core_dev *dev = devlink_priv(devlink);
+ u32 mnvda[MLX5_ST_SZ_DW(mnvda_reg)] = {};
+ void *data;
+ int err;
+
+ err = mlx5_nv_param_read_global_pci_conf(dev, mnvda, sizeof(mnvda));
+ if (err) {
+ NL_SET_ERR_MSG_MOD(extack,
+ "Failed to read global PCI configuration");
+ return err;
+ }
+
+ data = MLX5_ADDR_OF(mnvda_reg, mnvda, configuration_item_data);
+ MLX5_SET(nv_global_pci_conf, data, per_pf_num_sf, !!ctx->val.vu32);
+
+ err = mlx5_nv_param_write(dev, mnvda, sizeof(mnvda));
+ if (err) {
+ NL_SET_ERR_MSG_MOD(extack,
+ "Failed to change per_pf_num_sf global PCI configuration");
+ return err;
+ }
+
+ memset(mnvda, 0, sizeof(mnvda));
+ err = mlx5_nv_param_read_per_host_pf_conf(dev, mnvda, sizeof(mnvda));
+ if (err) {
+ NL_SET_ERR_MSG_MOD(extack, "Failed to read PF configuration");
+ return err;
+ }
+
+ data = MLX5_ADDR_OF(mnvda_reg, mnvda, configuration_item_data);
+ MLX5_SET(nv_pf_pci_conf, data, log_sf_bar_size, ctx->val.vu32 ? 12 : 0);
+ MLX5_SET(nv_pf_pci_conf, data, pf_total_sf_en, !!ctx->val.vu32);
+ MLX5_SET(nv_pf_pci_conf, data, total_sf, ctx->val.vu32);
+
+ err = mlx5_nv_param_write(dev, mnvda, sizeof(mnvda));
+ if (err) {
+ NL_SET_ERR_MSG_MOD(extack,
+ "Failed to change PF PCI configuration");
+ return err;
+ }
+ NL_SET_ERR_MSG(extack, "Modifying max_sfs requires a reboot");
+
+ return 0;
+}
+
static const struct devlink_param mlx5_nv_param_devlink_params[] = {
DEVLINK_PARAM_GENERIC(ENABLE_SRIOV, BIT(DEVLINK_PARAM_CMODE_PERMANENT),
mlx5_devlink_enable_sriov_get,
@@ -763,6 +837,9 @@ static const struct devlink_param mlx5_nv_param_devlink_params[] = {
mlx5_devlink_total_vfs_get,
mlx5_devlink_total_vfs_set,
mlx5_devlink_total_vfs_validate),
+ DEVLINK_PARAM_GENERIC(MAX_SFS, BIT(DEVLINK_PARAM_CMODE_PERMANENT),
+ mlx5_devlink_max_sfs_get,
+ mlx5_devlink_max_sfs_set, NULL),
DEVLINK_PARAM_DRIVER(MLX5_DEVLINK_PARAM_ID_CQE_COMPRESSION_TYPE,
"cqe_compress_type", DEVLINK_PARAM_TYPE_STRING,
BIT(DEVLINK_PARAM_CMODE_PERMANENT),
--
2.44.0
^ permalink raw reply related
* [PATCH net-next V2 1/2] devlink: add generic device max_sfs parameter
From: Tariq Toukan @ 2026-05-19 20:04 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Jiri Pirko, Simon Horman, Jonathan Corbet, Shuah Khan,
Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Vlad Dumitrescu, Aleksandr Loktionov, Daniel Zahka, David Ahern,
Nikolay Aleksandrov, netdev, linux-doc, linux-kernel, linux-rdma,
Gal Pressman, Dragos Tatulea, Jiri Pirko, Nikolay Aleksandrov
In-Reply-To: <20260519200436.353249-1-tariqt@nvidia.com>
From: Nikolay Aleksandrov <nikolay@nvidia.com>
Add a new generic devlink device parameter (max_sfs) to control if and
how many light-weight NIC subfunctions can be created. Subfunctions are
a light-weight network functions backed by an underlying PCI function.
Their lifecycle can already be managed by devlink, but currently users
cannot enable them in the device. They can be enabled/disabled only via
external vendor tools. This parameter allows subfunctions to be enabled
(>0) or disabled (0) via devlink. A subsequent patch will add support
for max_sfs to the mlx5 driver.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
Documentation/networking/devlink/devlink-params.rst | 6 ++++++
include/net/devlink.h | 4 ++++
net/devlink/param.c | 5 +++++
3 files changed, 15 insertions(+)
diff --git a/Documentation/networking/devlink/devlink-params.rst b/Documentation/networking/devlink/devlink-params.rst
index ea17756dcda6..29b8a9246fb6 100644
--- a/Documentation/networking/devlink/devlink-params.rst
+++ b/Documentation/networking/devlink/devlink-params.rst
@@ -165,3 +165,9 @@ own name.
- u32
- Controls the maximum number of MAC address filters that can be assigned
to a Virtual Function (VF).
+ * - ``max_sfs``
+ - u32
+ - The maximum number of subfunctions which can be created on the device.
+ Modifying this parameter may require a device restart and PCI bus
+ rescanning because the BAR layout may change. A value of 0 disables
+ subfunction creation.
diff --git a/include/net/devlink.h b/include/net/devlink.h
index bcd31de1f890..4ec455cfe7a4 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -546,6 +546,7 @@ enum devlink_param_generic_id {
DEVLINK_PARAM_GENERIC_ID_TOTAL_VFS,
DEVLINK_PARAM_GENERIC_ID_NUM_DOORBELLS,
DEVLINK_PARAM_GENERIC_ID_MAX_MAC_PER_VF,
+ DEVLINK_PARAM_GENERIC_ID_MAX_SFS,
/* add new param generic ids above here*/
__DEVLINK_PARAM_GENERIC_ID_MAX,
@@ -619,6 +620,9 @@ enum devlink_param_generic_id {
#define DEVLINK_PARAM_GENERIC_MAX_MAC_PER_VF_NAME "max_mac_per_vf"
#define DEVLINK_PARAM_GENERIC_MAX_MAC_PER_VF_TYPE DEVLINK_PARAM_TYPE_U32
+#define DEVLINK_PARAM_GENERIC_MAX_SFS_NAME "max_sfs"
+#define DEVLINK_PARAM_GENERIC_MAX_SFS_TYPE DEVLINK_PARAM_TYPE_U32
+
#define DEVLINK_PARAM_GENERIC(_id, _cmodes, _get, _set, _validate) \
{ \
.id = DEVLINK_PARAM_GENERIC_ID_##_id, \
diff --git a/net/devlink/param.c b/net/devlink/param.c
index cf95268da5b0..523243e49d88 100644
--- a/net/devlink/param.c
+++ b/net/devlink/param.c
@@ -117,6 +117,11 @@ static const struct devlink_param devlink_param_generic[] = {
.name = DEVLINK_PARAM_GENERIC_MAX_MAC_PER_VF_NAME,
.type = DEVLINK_PARAM_GENERIC_MAX_MAC_PER_VF_TYPE,
},
+ {
+ .id = DEVLINK_PARAM_GENERIC_ID_MAX_SFS,
+ .name = DEVLINK_PARAM_GENERIC_MAX_SFS_NAME,
+ .type = DEVLINK_PARAM_GENERIC_MAX_SFS_TYPE,
+ },
};
static int devlink_param_generic_verify(const struct devlink_param *param)
--
2.44.0
^ permalink raw reply related
* [PATCH net-next V2 0/2] devlink: add generic max_sfs parameter and mlx5 support
From: Tariq Toukan @ 2026-05-19 20:04 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Jiri Pirko, Simon Horman, Jonathan Corbet, Shuah Khan,
Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Vlad Dumitrescu, Aleksandr Loktionov, Daniel Zahka, David Ahern,
Nikolay Aleksandrov, netdev, linux-doc, linux-kernel, linux-rdma,
Gal Pressman, Dragos Tatulea, Jiri Pirko
Hi,
This series by Nikolay introduces a new generic devlink device
parameter, max_sfs, to control the number of light-weight NIC
subfunctions (SFs) that can be created on a device.
The first patch adds the generic devlink parameter and infrastructure
support.
The second patch implements support for the parameter in the mlx5
driver.
With this addition, users can enable or disable SF creation directly via
devlink, without relying on external vendor-specific tools.
Regards,
Tariq
V2:
- Add missing ` (Aleksandr Loktionov).
- Add review tag to patch 1.
V1:
https://lore.kernel.org/all/20260517112700.343575-1-tariqt@nvidia.com/
Nikolay Aleksandrov (2):
devlink: add generic device max_sfs parameter
net/mlx5: implement max_sfs parameter
.../networking/devlink/devlink-params.rst | 6 ++
Documentation/networking/devlink/mlx5.rst | 7 +-
.../mellanox/mlx5/core/lib/nv_param.c | 83 ++++++++++++++++++-
include/net/devlink.h | 4 +
net/devlink/param.c | 5 ++
5 files changed, 101 insertions(+), 4 deletions(-)
base-commit: 9bf93cb2e180a58d5984ba13daee95903ff4fc14
--
2.44.0
^ permalink raw reply
* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Sasha Levin @ 2026-05-19 20:00 UTC (permalink / raw)
To: Paul Moore
Cc: Song Liu, corbet, akpm, skhan, linux-doc, linux-kernel,
linux-kselftest, gregkh, linux-security-module
In-Reply-To: <CAHC9VhTEs7rCaoPG7cWAzyVkN3ztdadHAq0g8mEy_MgCiCe=0g@mail.gmail.com>
On Mon, May 18, 2026 at 11:08:38PM -0400, Paul Moore wrote:
>On Mon, May 18, 2026 at 8:31 PM Sasha Levin <sashal@kernel.org> wrote:
>> On Mon, May 18, 2026 at 05:29:32PM -0400, Paul Moore wrote:
>> >From my perspective there are two different issues here: should
>> >killswitch be a LSM, and should killswitch leverage kprobes to be able
>> >to "kill" security related symbols. After all, are we okay with
>> >killswitch killing capable() and friends?
>>
>> killswitch doesn't do it on it's own. It may be instructed by root to do that,
>> at which point that is root's problem.
>
>As I mentioned previously, there are cases where we can restrict
>root's privileges today, but a functional killswitch would allow that
>restriction to be bypassed. My last email to Song has an example with
>SELinux.
This would be handled by just disabling killswitch in those scenarios like how
we do with lockdown, no?
>> >In my opinion, making killswitch an LSM is more of a procedural item
>> >that deals with how we view a capability like killswitch. I
>> >personally view killswitch as somewhat similar to Lockdown, which is
>> >why I made the suggestion.
>>
>> Maybe I'm not all that familiar with LSMs, but we would need to be able to stop
>> "random" code paths from executing, and I don't think we can create LSM hooks
>> at that granularity, no?
>
>I don't see any LSM hooks in this revision of killswitch, and as long
>as it is based on a kprobes I can't imagine it would ever use any. As
>I mentioned above, my killswitch-as-a-LSM comment is primarily about
>killswitch filling a role very similar to Lockdown.
My question was more about how to structure killswitch as an LSM. I want to be
able to poke at pretty much any function in the kernel, rather than restrict
access to a known list of functions.
>> >The use of kprobes, while an interesting idea, presents problems as
>> >allowing any kernel symbol to be killed introduces the potential for
>> >security regressions. As a reminder, some LSMs, as well as other
>> >kernel subsystems, have mechanisms in place to restrict root and/or
>> >enforce one-way configuration locks; while many people equate "root"
>> >with full control, in many cases today that is not strictly correct.
>>
>> killswitch "complies" with lockdown. Is there a different scenario which we
>> should be blocking?
>
>See the SELinux example I mentioned in my email to Song.
>
>> >Yes, kprobes have been around for some time, this is not a new
>> >problem, but killswitch makes it far more convenient and accessible to
>> >do dangerous things with kprobes. If killswitch makes it past the RFC
>> >stage without any significant changes to its kill mechanism, we may
>> >need to start considering more liberal usage of NOKPROBE_SYMBOL()
>> >which I think would be an unfortunate casualty.
>>
>> Why? If I don't really mind the security impact, I want to be able to have a
>> killswitch-like interface on my systems. If an attacker is in my systems,
>> killswitch is the least of my concerns I think.
>>
>> If you are security concious, just don't enable CONFIG_KILLSWITCH?
>
>Isn't the whole point of killswitch to have it enabled everywhere
>because you never know when you might want/need it?
Right. We have different usecases. If you want selinux/lockdown/etc and a
really crippled root, that should be an option. If you choose to allow
something like killswitch, it should be an option too.
--
Thanks,
Sasha
^ permalink raw reply
* Re: [PATCH v3] killswitch: add per-function short-circuit mitigation primitive
From: Sasha Levin @ 2026-05-19 19:57 UTC (permalink / raw)
To: Daniel Borkmann
Cc: Song Liu, linux-kernel, linux-doc, linux-kselftest, bpf,
live-patching, Greg Kroah-Hartman, Andrew Morton, Jonathan Corbet,
Mathieu Desnoyers, Joshua Peisach, Florian Weimer, Breno Leitao,
Anthony Iliopoulos, Michal Hocko, Jiri Olsa, John Fastabend,
Christian Brauner
In-Reply-To: <b342c38b-7323-4b72-a239-8a574d6bc36b@iogearbox.net>
On Tue, May 19, 2026 at 02:13:26PM +0200, Daniel Borkmann wrote:
>On 5/19/26 1:59 AM, Song Liu wrote:
>>On Mon, May 18, 2026 at 6:33 AM Sasha Levin <sashal@kernel.org> wrote:
>>>On Sun, May 17, 2026 at 11:37:36PM -0700, Song Liu wrote:
>>>>On Sun, May 17, 2026 at 6:49 AM Sasha Levin <sashal@kernel.org> wrote:
>>>>>* fail_function (CONFIG_FUNCTION_ERROR_INJECTION) is disabled in
>>>>> most production kernels. Even where enabled, it only works on
>>>>> functions pre-annotated with ALLOW_ERROR_INJECTION() in source -
>>>>> no help for a freshly-disclosed CVE. The debugfs UI is blocked by
>>>>> lockdown=integrity and the override is probabilistic.
>>>>>
>>>>>* BPF override (bpf_override_return) honors the same
>>>>> ALLOW_ERROR_INJECTION() whitelist, and BPF itself is off in many
>>>>> production kernels. Even where on, the operator interface is
>>>>> "load a verified BPF program," not a one-line write.
>>>>
>>>>If it is OK for killswitch to attach to any kernel functions, do we still
>>>>need ALLOW_ERROR_INJECTION() for fail_function and BPF
>>>>override? Shall we instead also allow fail_function and BPF override
>>>>to attach to any kernel functions?
>>>
>>>I don't think so. ALLOW_ERROR_INJECTION is not a security mechanism, it's an
>>>integrity/safety mechanism for both bpf and fault injection.
>>>
>>>It protects against a "developer or CI script doing legitimate fault injection
>>>accidentally panics the box" scenario, not an "attacker gets in" one.
>>
>>There really isn't a clear boundary between "security mechanism" and
>>"non-security mechanism". As we are making killswitch available
>>everywhere under root, users will soon learn to use it to do fault injection,
>>and potentially much more scary things. (Think about agents with sudo
>>access).
>
>Fully agree with Song here that there is no clear boundary, and that the
>killswitch could lead to arbitrary, hard to debug breakage if applied to
>the wrong function.. introducing worse bugs than the one being mitigated
>or even /short-circuit LSM enforcement/ (engage security_file_open 0,
>engage cap_capable 0, engage apparmor_* etc).
This is similar to livepatch, right? Do we need guardrails there too?
Or do we just trust root to do the right thing for it's systems without needing
to be it's babysitter?
>The ALLOW_ERROR_INJECTION() provides a curated white-list where you may
>return with an error without causing more severe damage (assuming the
>error handling code is right). The right thing would be to more widely
>apply ALLOW_ERROR_INJECTION() or to figure out a better way to safely
>enable the latter without explicit function annotation.
Sure, this would also work. How do you see this happening? Can we let a certain
user/pid/etc disable the allowlist if they choose to?
>Wrt BPF:
>
>>>>>* BPF override (bpf_override_return) honors the same
>>>>> ALLOW_ERROR_INJECTION() whitelist, and BPF itself is off in many
>>>>> production kernels. Even where on, the operator interface is
>>>>> "load a verified BPF program," not a one-line write.
>
>The claim that BPF itself is off in many production kernels is not really
>true, where did you get that from? All the major distros and cloud providers
>have BPF enabled these days, and even systemd ships BPF programs for
>custom service firewalling etc.
The world is a bit bigger than home distros and cloud providers, but sure - bpf
is enabled widely enough at this point.
How do you see this working with the allowlist?
--
Thanks,
Sasha
^ permalink raw reply
* Re: [PATCH mm-unstable v17 00/14] khugepaged: mTHP support
From: Nico Pache @ 2026-05-19 19:20 UTC (permalink / raw)
To: Wei Yang
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap, rientjes,
rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260518125007.a4z3pw4r73uuwja4@master>
On Mon, May 18, 2026 at 6:50 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Mon, May 11, 2026 at 12:58:00PM -0600, Nico Pache wrote:
> >The following series provides khugepaged with the capability to collapse
> >anonymous memory regions to mTHPs.
> >
> >To achieve this we generalize the khugepaged functions to no longer depend
> >on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
> >pages that are occupied (!none/zero). After the PMD scan is done, we use
> >the bitmap to find the optimal mTHP sizes for the PMD range. The
> >restriction on max_ptes_none is removed during the scan, to make sure we
> >account for the whole PMD range in the bitmap. When no mTHP size is
> >enabled, the legacy behavior of khugepaged is maintained.
> >
> >We currently only support max_ptes_none values of 0 or HPAGE_PMD_NR - 1
> >(ie 511). If any other value is specified, the kernel will emit a warning
> >and no mTHP collapse will be attempted. If a mTHP collapse is attempted,
> >but contains swapped out, or shared pages, we don't perform the collapse.
> >It is now also possible to collapse to mTHPs without requiring the PMD THP
> >size to be enabled. These limitations are to prevent collapse "creep"
> >behavior. This prevents constantly promoting mTHPs to the next available
> >size, which would occur because a collapse introduces more non-zero pages
> >that would satisfy the promotion condition on subsequent scans.
> >
> >Patch 1-2: Generalize hugepage_vma_revalidate and alloc_charge_folio
> > for arbitrary orders.
> >Patch 3: Rework max_ptes_* handling into helper functions
> >Patch 4: Generalize __collapse_huge_page_* for mTHP support
> >Patch 5: Require collapse_huge_page to enter/exit with the lock dropped
> >Patch 6: Generalize collapse_huge_page for mTHP collapse
> >Patch 7: Skip collapsing mTHP to smaller orders
> >Patch 8-9: Add per-order mTHP statistics and tracepoints
> >Patch 10: Introduce collapse_allowable_orders helper function
> >Patch 11-13: Introduce bitmap and mTHP collapse support, fully enabled
> >Patch 14: Documentation
> >
> >Testing:
> >- Built for x86_64, aarch64, ppc64le, and s390x
> >- ran all arches on test suites provided by the kernel-tests project
> >- internal testing suites: functional testing and performance testing
> >- selftests mm
> >- I created a test script that I used to push khugepaged to its limits
> > while monitoring a number of stats and tracepoints. The code is
> > available here[1] (Run in legacy mode for these changes and set mthp
> > sizes to inherit)
> > The summary from my testings was that there was no significant
> > regression noticed through this test. In some cases my changes had
> > better collapse latencies, and was able to scan more pages in the same
> > amount of time/work, but for the most part the results were consistent.
> >- redis testing. I did some testing with these changes along with my defer
> > changes (see followup [2] post for more details). We've decided to get
> > the mTHP changes merged first before attempting the defer series.
> >- some basic testing on 64k page size.
> >- lots of general use.
> >
>
> Two links are missing. I got them from previous version.
>
> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> [2] - https://lore.kernel.org/lkml/20250515033857.132535-1-npache@redhat.com/
Oh whoops, ill make sure they are there in the followup
>
> And the test in [1] is a performance test. I am thinking whether we want a
> functional test in selftests.
It also works as a functional test in some regards. The reason i never
pursued self-tests is that I naively thought this was getting merged
6(?) months ago and at the time the selftests infrastructure didn't
support it well. Baolin included patches to clean that up in his shmem
mTHP support patches and added tests for both features. Let's repost
and re-merge this first; then, I will follow up in one or two weeks
regarding self-tests. I'm currently on PTO and only have time to
complete, test, and return the v18 changes to Andrew before they
create a huge merge headache and we miss yet another window.
>
> I did a quick try with following change and some hack.
Thanks Ill use that as a base!
>
> @@ -744,6 +765,51 @@ static void collapse_max_ptes_none(struct collapse_context *c, struct mem_ops *o
> ksft_test_result_report(exit_status, "%s\n", __func__);
> }
>
> +static void collapse_mth_ptes(struct collapse_context *c, struct mem_ops *ops)
> +{
> + struct thp_settings settings = *thp_current_settings();
> + void *p;
> + int i;
> +
> + /* Disable mthp on fault */
> + for (i = 0; i < NR_ORDERS; i++) {
> + settings.hugepages[i].enabled = THP_NEVER;
> + }
> + thp_push_settings(&settings);
> +
> + p = ops->setup_area(1);
> +
> + ops->fault(p, 0, hpage_pmd_size);
> +
> + /* Expect all order-0 folio after fault */
> + memset(expected_orders, 0, sizeof(int) * (pmd_order + 1));
> + expected_orders[0] = hpage_pmd_nr;
> + if (check_folio_orders(p, hpage_pmd_size, pagemap_fd,
> + kpageflags_fd, expected_orders,
> + (pmd_order + 1)))
> + ksft_exit_fail_msg("Unexpected huge page at fault\n");
> +
> + /* Enable mthp before collapse */
> + thp_pop_settings();
> + settings.hugepages[2].enabled = THP_ALWAYS;
> + thp_push_settings(&settings);
> +
> + c->collapse("Collapse fully populated PTE table with order 2", p, 1,
> + ops, true);
> +
> + /* Expect all order-2 folio after collapse */
> + memset(expected_orders, 0, sizeof(int) * (pmd_order + 1));
> + expected_orders[2] = 1 << (pmd_order - 2);
> + if (check_folio_orders(p, hpage_pmd_size, pagemap_fd,
> + kpageflags_fd, expected_orders,
> + (pmd_order + 1)))
> + ksft_exit_fail_msg("Unexpected page order\n");
> +
> + ops->cleanup_area(p, hpage_pmd_size);
> + thp_pop_settings();
> + ksft_test_result_report(exit_status, "%s\n", __func__);
> +}
> +
> static void collapse_swapin_single_pte(struct collapse_context *c, struct mem_ops *ops)
> {
> void *p;
>
> This leverage check_after_split_folio_orders() in split_huge_page_test.c to
> check folio order in PMD range.
>
> --
> Wei Yang
> Help you, Help me
>
^ permalink raw reply
* Re: [PATCH mm-unstable v17 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: Nico Pache @ 2026-05-19 19:05 UTC (permalink / raw)
To: Lorenzo Stoakes, David Hildenbrand (Arm), Wei Yang, Lance Yang
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
liam, mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
pfalcato, rakie.kim, raquini, rdunlap, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <agtpK1x27B-E7mMo@lucifer>
On Mon, May 18, 2026 at 1:33 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Mon, May 18, 2026 at 03:16:11PM +0200, David Hildenbrand (Arm) wrote:
> > On 5/14/26 05:10, Wei Yang wrote:
> > > On Tue, May 12, 2026 at 03:42:02PM +0800, Lance Yang wrote:
> > >>
> > >> On Mon, May 11, 2026 at 12:58:04PM -0600, Nico Pache wrote:
> > >>> generalize the order of the __collapse_huge_page_* and collapse_max_*
> > >>> functions to support future mTHP collapse.
> > >>>
> > >>> The current mechanism for determining collapse with the
> > >>> khugepaged_max_ptes_none value is not designed with mTHP in mind. This
> > >>> raises a key design issue: if we support user defined max_pte_none values
> > >>> (even those scaled by order), a collapse of a lower order can introduces
> > >>> an feedback loop, or "creep", when max_ptes_none is set to a value greater
> > >>> than HPAGE_PMD_NR / 2. [1]
> > >>>
> > >>> With this configuration, a successful collapse to order N will populate
> > >>> enough pages to satisfy the collapse condition on order N+1 on the next
> > >>> scan. This leads to unnecessary work and memory churn.
> > >>>
> > >>> To fix this issue introduce a helper function that will limit mTHP
> > >>> collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
> > >>> This effectively supports two modes: [2]
> > >>>
> > >>> - max_ptes_none=0: never collapses if it encounters an empty PTE or a PTE
> > >>> that maps the shared zeropage. Consequently, no memory bloat.
> > >>> - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
> > >>> available mTHP order.
> > >>>
> > >>> This removes the possiblilty of "creep", while not modifying any uAPI
> > >>> expectations. A warning will be emitted if any non-supported
> > >>> max_ptes_none value is configured with mTHP enabled.
> > >>>
> > >>> mTHP collapse will not honor the khugepaged_max_ptes_shared or
> > >>> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
> > >>> shared or swapped entry.
> > >>>
> > >>> No functional changes in this patch; however it defines future behavior
> > >>> for mTHP collapse.
> > >>>
> > >>> [1] - https://lore.kernel.org/all/e46ab3ab-a3d7-4fb7-9970-d0704bd5d05a@arm.com
> > >>> [2] - https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com
> > >>>
> > >>> Co-developed-by: Dev Jain <dev.jain@arm.com>
> > >>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> > >>> Signed-off-by: Nico Pache <npache@redhat.com>
> > >>> ---
> > >>> include/trace/events/huge_memory.h | 3 +-
> > >>> mm/khugepaged.c | 117 ++++++++++++++++++++---------
> > >>> 2 files changed, 85 insertions(+), 35 deletions(-)
> > >>>
> > >>> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > >>> index bcdc57eea270..443e0bd13fdb 100644
> > >>> --- a/include/trace/events/huge_memory.h
> > >>> +++ b/include/trace/events/huge_memory.h
> > >>> @@ -39,7 +39,8 @@
> > >>> EM( SCAN_STORE_FAILED, "store_failed") \
> > >>> EM( SCAN_COPY_MC, "copy_poisoned_page") \
> > >>> EM( SCAN_PAGE_FILLED, "page_filled") \
> > >>> - EMe(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")
> > >>> + EM(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback") \
> > >>> + EMe(SCAN_INVALID_PTES_NONE, "invalid_ptes_none")
> > >>>
> > >>> #undef EM
> > >>> #undef EMe
> > >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > >>> index f68853b3caa7..27465161fa6d 100644
> > >>> --- a/mm/khugepaged.c
> > >>> +++ b/mm/khugepaged.c
> > >>> @@ -61,6 +61,7 @@ enum scan_result {
> > >>> SCAN_COPY_MC,
> > >>> SCAN_PAGE_FILLED,
> > >>> SCAN_PAGE_DIRTY_OR_WRITEBACK,
> > >>> + SCAN_INVALID_PTES_NONE,
> > >>> };
> > >>>
> > >>> #define CREATE_TRACE_POINTS
> > >>> @@ -353,37 +354,60 @@ static bool pte_none_or_zero(pte_t pte)
> > >>> * PTEs for the given collapse operation.
> > >>> * @cc: The collapse control struct
> > >>> * @vma: The vma to check for userfaultfd
> > >>> + * @order: The folio order being collapsed to
> > >>> *
> > >>> * Return: Maximum number of none-page or zero-page PTEs allowed for the
> > >>> * collapse operation.
> > >>> */
> > >>> -static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> > >>> - struct vm_area_struct *vma)
> > >>> +static int collapse_max_ptes_none(struct collapse_control *cc,
> > >>> + struct vm_area_struct *vma, unsigned int order)
> > >>> {
> > >>> + unsigned int max_ptes_none = khugepaged_max_ptes_none;
> > >>> // If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
> > >>
> > >> One thing I still want to call out: kernel code usually uses C-style
> > >> comments :)
> > >>
> > >>> if (vma && userfaultfd_armed(vma))
> > >>> return 0;
> > >>> // for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
> > >>> if (!cc->is_khugepaged)
> > >>> return HPAGE_PMD_NR;
> > >>> - // For all other cases repect the user defined maximum.
> > >>> - return khugepaged_max_ptes_none;
> > >>> + // for PMD collapse, respect the user defined maximum.
> > >>> + if (is_pmd_order(order))
> > >>> + return max_ptes_none;
> > >>> + /* Zero/non-present collapse disabled. */
> > >>> + if (!max_ptes_none)
> > >>> + return 0;
> > >>> + // for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
> > >>> + // scale the maximum number of PTEs to the order of the collapse.
> > >>> + if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
> > >>> + return (1 << order) - 1;
> > >>> +
> > >>> + // We currently only support max_ptes_none values of 0 or KHUGEPAGED_MAX_PTES_LIMIT.
> > >>> + // Emit a warning and return -EINVAL.
> > >>> + pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
> > >>> + KHUGEPAGED_MAX_PTES_LIMIT);
> > >>
> > >> Maybe fallback to 0 instead, as David suggested earlier?
> > >>
> > >
> > > It looks reasonable to fallback to 0.
> > >
> > > But as the updated Document says in patch 14:
> > >
> > > For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1) are supported. Any other
> > > value will emit a warning and no mTHP collapse will be attempted.
> > >
> > > This is why it does like this now.
> > >
> > > mthp_collapse()
> > > max_ptes_none = collapse_max_ptes_none();
> > > if (max_ptes_none < 0)
> > > return collapsed;
> > >
> > >> max_ptes_none is mostly legacy PMD THP behavior. mTHP is new, and any
> > >> intermediate value in (0, KHUGEPAGED_MAX_PTES_LIMIT) would implicitly
> > >> disable it :(
> > >>
> > >
> > > So it depends on what we want to do here :-)
> > >
> > > For me, I would vote for fallback to 0.
> >
> > At this point I'll prefer to not return errors from collapse_max_ptes_none().
> > It's just rather awkward to return an error deep down in collapse code for a
> > configuration problem.
> >
> > For mthp collapse, we only support max_ptes_none==0 and
> > max_ptes_none=="HPAGE_PMD_NR - 1" (default).
> >
> > If another value is specified while collapsing mTHP, print a warning and treat
> > it as 0 (save value, no creep, no memory waste).
> >
> > In a sense, this is similar to how we handle max_ptes_shared + max_ptes_swap:
> > for mTHP: we always treat them as being 0 for mTHP collapse (and don't issue a
> > warning, because we would issue a warning with the default settings).
> >
> > @Lorenzo, fine with you?
>
> Yes 100%, this sounds sensible both in terms of the error and the default. Let's
> keep our lives simple(-ish) please :)
Ok thank you im glad we finally came to consensus on this! phew!
>
> >
> > --
> > Cheers,
> >
> > David
>
> Cheers, Lorenzo
>
^ permalink raw reply
* Re: [PATCH mm-unstable v17 02/14] mm/khugepaged: generalize alloc_charge_folio()
From: Nico Pache @ 2026-05-19 19:03 UTC (permalink / raw)
To: Lance Yang, Usama Arif
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, akpm,
anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <51db205d-77cf-416f-bfe5-fd9d0b12c433@linux.dev>
On Mon, May 18, 2026 at 8:50 AM Lance Yang <lance.yang@linux.dev> wrote:
>
>
>
> On 2026/5/18 19:55, Usama Arif wrote:
> [...]
> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >> index 979885694351..f0e29d5c7b1f 100644
> >> --- a/mm/khugepaged.c
> >> +++ b/mm/khugepaged.c
> >> @@ -1068,21 +1068,26 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> >> }
> >>
> >> static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> >> - struct collapse_control *cc)
> >> + struct collapse_control *cc, unsigned int order)
> >> {
> >> gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> >> GFP_TRANSHUGE);
> >> int node = collapse_find_target_node(cc);
> >> struct folio *folio;
> >>
> >> - folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
> >> + folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
> >> if (!folio) {
> >> *foliop = NULL;
> >> - count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> >> + if (is_pmd_order(order))
> >> + count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> >> + count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
> >> return SCAN_ALLOC_HUGE_PAGE_FAIL;
> >> }
> >>
> >> - count_vm_event(THP_COLLAPSE_ALLOC);
> >> + if (is_pmd_order(order))
> >> + count_vm_event(THP_COLLAPSE_ALLOC);
> >> + count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC);
> >> +
> >
> > The vmstat THP_COLLAPSE_ALLOC counter is pmd order only.
> > But after this we have
> >
> > count_memcg_folio_events(folio, THP_COLLAPSE_ALLOC, 1);
> >
> > which is not being guarded with is_pmd_order().
>
> Good catch!
>
> >
> > I think we want this to be pmd order only as well so that
> > the meaning of the vmstat and cgroup counter remains the same?
>
> Agreed. THP_COLLAPSE_ALLOC should remain PMD order only for
> vmstat and memcg events.
>
> So this should be guarded with is_pmd_order() as well :)
Thanks Usama, I added that.
>
> Cheers, Lance
>
^ permalink raw reply
* Re: [PATCH mm-unstable v17 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions
From: Nico Pache @ 2026-05-19 18:21 UTC (permalink / raw)
To: Lance Yang
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, usama.arif
In-Reply-To: <20260512044444.71798-1-lance.yang@linux.dev>
On Mon, May 11, 2026 at 10:45 PM Lance Yang <lance.yang@linux.dev> wrote:
>
>
> On Mon, May 11, 2026 at 12:58:03PM -0600, Nico Pache wrote:
> >The following cleanup reworks all the max_ptes_* handling into helper
> >functions. This increases the code readability and will later be used to
> >implement the mTHP handling of these variables.
> >
> >With these changes we abstract all the madvise_collapse() special casing
> >(dont respect the sysctls) away from the functions that utilize them. And
>
> Nit: s/dont/do not/
>
> >will be used later in this series to cleanly restrict the mTHP collapse
> >behavior.
> >
> >No functional change is intended; however, we are now only reading the
> >sysfs variables once per scan, whereas before these variables were being
> >read on each loop iteration.
> >
> >Suggested-by: David Hildenbrand <david@kernel.org>
> >Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> >Acked-by: Usama Arif <usama.arif@linux.dev>
> >Signed-off-by: Nico Pache <npache@redhat.com>
> >---
> > mm/khugepaged.c | 118 +++++++++++++++++++++++++++++++++---------------
> > 1 file changed, 82 insertions(+), 36 deletions(-)
> >
> >diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >index f0e29d5c7b1f..f68853b3caa7 100644
> >--- a/mm/khugepaged.c
> >+++ b/mm/khugepaged.c
> >@@ -348,6 +348,62 @@ static bool pte_none_or_zero(pte_t pte)
> > return pte_present(pte) && is_zero_pfn(pte_pfn(pte));
> > }
> >
> >+/**
> >+ * collapse_max_ptes_none - Calculate maximum allowed none-page or zero-page
> >+ * PTEs for the given collapse operation.
> >+ * @cc: The collapse control struct
> >+ * @vma: The vma to check for userfaultfd
> >+ *
> >+ * Return: Maximum number of none-page or zero-page PTEs allowed for the
> >+ * collapse operation.
> >+ */
> >+static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> >+ struct vm_area_struct *vma)
> >+{
> >+ // If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
> >+ if (vma && userfaultfd_armed(vma))
> >+ return 0;
> >+ // for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
> >+ if (!cc->is_khugepaged)
> >+ return HPAGE_PMD_NR;
> >+ // For all other cases repect the user defined maximum.
> >+ return khugepaged_max_ptes_none;
>
> Nit: kernel code usually uses C-style comments. This could be:
>
> /* For all other cases, respect the user-defined maximum. */
>
> Also, s/repect/respect/.
>
> >+}
> >+
> >+/**
> >+ * collapse_max_ptes_shared - Calculate maximum allowed PTEs that map shared
> >+ * anonymous pages for the given collapse operation.
> >+ * @cc: The collapse control struct
> >+ *
> >+ * Return: Maximum number of PTEs that map shared anonymous pages for the
> >+ * collapse operation
> >+ */
> >+static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
> >+{
> >+ // for MADV_COLLAPSE, do not restrict the number of PTEs that map shared
> >+ // anonymous pages.
>
> Ditto.
>
> >+ if (!cc->is_khugepaged)
> >+ return HPAGE_PMD_NR;
> >+ return khugepaged_max_ptes_shared;
> >+}
> >+
> >+/**
> >+ * collapse_max_ptes_swap - Calculate the maximum allowed non-present PTEs or the
> >+ * maximum allowed non-present pagecache entries for the given collapse operation.
> >+ * @cc: The collapse control struct
> >+ *
> >+ * Return: Maximum number of non-present PTEs or the maximum allowed non-present
> >+ * pagecache entries for the collapse operation.
> >+ */
> >+static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
> >+{
> >+ // for MADV_COLLAPSE, do not restrict the number PTEs entries or
> >+ // pagecache entries that are non-present.
>
> Same here.
>
> >+ if (!cc->is_khugepaged)
> >+ return HPAGE_PMD_NR;
> >+ return khugepaged_max_ptes_swap;
> >+}
> >+
> > int hugepage_madvise(struct vm_area_struct *vma,
> > vm_flags_t *vm_flags, int advice)
> > {
> >@@ -546,21 +602,19 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > pte_t *_pte;
> > int none_or_zero = 0, shared = 0, referenced = 0;
> > enum scan_result result = SCAN_FAIL;
> >+ unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
> >+ unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
>
> Nit: could these be const, as David suggested earlier?
>
> Nothing else jumped out at me. LGTM!
>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
Ack on all the above thank you !
>
^ permalink raw reply
* Re: [PATCH mm-unstable v17 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions
From: Nico Pache @ 2026-05-19 18:21 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, Usama Arif
In-Reply-To: <4566170a-7e3d-49e4-baab-ba2790c198db@kernel.org>
On Tue, May 12, 2026 at 1:30 AM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 5/11/26 20:58, Nico Pache wrote:
> > The following cleanup reworks all the max_ptes_* handling into helper
> > functions. This increases the code readability and will later be used to
> > implement the mTHP handling of these variables.
> >
> > With these changes we abstract all the madvise_collapse() special casing
> > (dont respect the sysctls) away from the functions that utilize them. And
> > will be used later in this series to cleanly restrict the mTHP collapse
> > behavior.
> >
> > No functional change is intended; however, we are now only reading the
> > sysfs variables once per scan, whereas before these variables were being
> > read on each loop iteration.
> >
> > Suggested-by: David Hildenbrand <david@kernel.org>
> > Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>
> Some nits when re-reading:
>
> > Acked-by: Usama Arif <usama.arif@linux.dev>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> > mm/khugepaged.c | 118 +++++++++++++++++++++++++++++++++---------------
> > 1 file changed, 82 insertions(+), 36 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index f0e29d5c7b1f..f68853b3caa7 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -348,6 +348,62 @@ static bool pte_none_or_zero(pte_t pte)
> > return pte_present(pte) && is_zero_pfn(pte_pfn(pte));
> > }
> >
> > +/**
> > + * collapse_max_ptes_none - Calculate maximum allowed none-page or zero-page
>
> I know, it's painful, but ...
>
> There is no "none-page".
Yeah I think you mentioned that last review... sorry!
>
> Calculate maximum allowed empty PTEs or PTEs mapping the shared zeropage ... ?
>
> > + * PTEs for the given collapse operation.
>
> We usually indent here (second line of subject), I think. Same applies to the
> other doc below.
Hmm tbh I couldn't find a example of what you meant here. There are
some that put a space between the first sentence and the @ list.
>
> > + * @cc: The collapse control struct
> > + * @vma: The vma to check for userfaultfd
> > + *
> > + * Return: Maximum number of none-page or zero-page PTEs allowed for the
> > + * collapse operation.
>
> Same here.
>
> > + */
> > +static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> > + struct vm_area_struct *vma)
> > +{
> > + // If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
>
> Lance commented on the comment style.
>
> Is this comment really required? It's pretty self-documenting already.
Dropped it, thanks.
>
> > + if (vma && userfaultfd_armed(vma))
> > + return 0;
> > + // for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
> > + if (!cc->is_khugepaged)
> > + return HPAGE_PMD_NR;
> > + // For all other cases repect the user defined maximum.
> > + return khugepaged_max_ptes_none;
> > +}
> > +
> --
> Cheers,
>
> David
>
^ permalink raw reply
* Re: [PATCH v2 0/2] selftests/mm: separate GUP microbenchmarking from functional testing
From: Andrew Morton @ 2026-05-19 18:20 UTC (permalink / raw)
To: Sarthak Sharma
Cc: David Hildenbrand, Jonathan Corbet, Jason Gunthorpe, John Hubbard,
Peter Xu, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-mm, linux-kselftest, linux-kernel, linux-doc
In-Reply-To: <20260519120506.184512-1-sarthak.sharma@arm.com>
On Tue, 19 May 2026 17:35:04 +0530 Sarthak Sharma <sarthak.sharma@arm.com> wrote:
> gup_test.c currently serves two distinct purposes: microbenchmarking
> (GUP_FAST_BENCHMARK, PIN_FAST_BENCHMARK, PIN_LONGTERM_BENCHMARK) and
> functional correctness testing (GUP_BASIC_TEST, PIN_BASIC_TEST,
> DUMP_USER_PAGES_TEST). Mixing these in a single binary means functional
> tests cannot be run or reported individually, and run_vmtests.sh must
> invoke the binary multiple times with different flag combinations to
> cover all configurations. This patch series separates the two concerns:
> tools/mm/gup_bench for benchmarking and tools/testing/selftests/mm/gup_test
> for functional testing.
>
> Patch 1 adds tools/mm/gup_bench.c, a standalone microbenchmark for
> GUP_FAST, PIN_FAST and PIN_LONGTERM via the CONFIG_GUP_TEST debugfs
> interface. It runs the same matrix of configurations as the old
> run_gup_matrix() shell function (all three commands, read/write,
> private/shared, four page counts, THP on/off, hugetlb), but as a
> standalone C program under tools/mm with no dependency on kselftest.
>
> Patch 2 rewrites gup_test.c as a kselftest harness-based selftest. It
> covers all five GUP kernel functions (get_user_pages, get_user_pages_fast,
> pin_user_pages, pin_user_pages_fast, pin_user_pages with FOLL_LONGTERM)
> plus DUMP_USER_PAGES_TEST, across 12 mapping configurations (THP on,
> THP off and hugetlb, each across private/shared and read/write variants)
> and four batch sizes (1, 512, 123, all pages). Results are reported as
> standard TAP output with no command-line arguments required.
Thanks. AI review asked a few things which seem fairly minor to me,
but probably legitimate:
https://sashiko.dev/#/patchset/20260519120506.184512-1-sarthak.sharma@arm.com
^ permalink raw reply
* Re: [PATCH v5 00/14] module: Introduce hash-based integrity checking
From: Thomas Weißschuh @ 2026-05-19 18:19 UTC (permalink / raw)
To: Sami Tolvanen
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Nathan Chancellor,
Nicolas Schier, Arnd Bergmann, Luis Chamberlain, Petr Pavlu,
Daniel Gomez, Paul Moore, James Morris, Serge E. Hallyn,
Jonathan Corbet, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Naveen N Rao, Mimi Zohar, Roberto Sassu,
Dmitry Kasatkin, Eric Snowberg, Nicolas Schier, Daniel Gomez,
Aaron Tomlin, Christophe Leroy (CS GROUP), Nicolas Bouchinet,
Xiu Jianfeng, Martin KaFai Lau, Song Liu, Yonghong Song,
Jiri Olsa, bpf, Fabian Grünbichler, Arnout Engelen,
Mattia Rizzolo, kpcyrd, Christian Heusel, Câju Mihai-Drosi,
Eric Biggers, Sebastian Andrzej Siewior, linux-kbuild,
linux-kernel, linux-arch, linux-modules, linux-security-module,
linux-doc, linuxppc-dev, linux-integrity, debian-kernel
In-Reply-To: <20260518215543.GA1878854@google.com>
Hi Sami,
On 2026-05-18 21:55:43+0000, Sami Tolvanen wrote:
> On Tue, May 05, 2026 at 11:05:04AM +0200, Thomas Weißschuh wrote:
> > The current signature-based module integrity checking has some drawbacks
> > in combination with reproducible builds. Either the module signing key
> > is generated at build time, which makes the build unreproducible, or a
> > static signing key is used, which precludes rebuilds by third parties
> > and makes the whole build and packaging process much more complicated.
> >
> > The goal is to reach bit-for-bit reproducibility. Excluding certain
> > parts of the build output from the reproducibility analysis would be
> > error-prone and force each downstream consumer to introduce new tooling.
> >
> > Introduce a new mechanism to ensure only well-known modules are loaded
> > by embedding a merkle tree root of all modules built as part of the full
> > kernel build into vmlinux.
>
> I noticed Sashiko had a few concerns about the build changes. Would you
> mind taking a look to see if they're valid?
>
> https://sashiko.dev/#/patchset/20260505-module-hashes-v5-0-e174a5a49fce%40weissschuh.net
I definitively have these on my list. Unfortunately I am busy with
something else right now. But this series and the Sashiko comments
are next.
Thomas
^ permalink raw reply
* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: T.J. Mercier @ 2026-05-19 18:07 UTC (permalink / raw)
To: Christian König
Cc: Albert Esteve, Christian Brauner, Tejun Heo, Johannes Weiner,
Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley,
Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc, linux-kernel,
linux-media, dri-devel, linaro-mm-sig, linux-mm,
linux-security-module, selinux, linux-kselftest, mripard,
echanude
In-Reply-To: <01b6eefc-c107-4f8c-9d7c-3b86f54cabaa@amd.com>
On Tue, May 19, 2026 at 12:19 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/19/26 01:39, T.J. Mercier wrote:
> > On Mon, May 18, 2026 at 7:07 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >>
> >> On 5/18/26 14:50, Albert Esteve wrote:
> >>> On Mon, May 18, 2026 at 9:20 AM Christian König
> >>> <christian.koenig@amd.com> wrote:
> >>>>
> >>>> On 5/15/26 19:06, T.J. Mercier wrote:
> >>>>> On Fri, May 15, 2026 at 6:53 AM Christian Brauner <brauner@kernel.org> wrote:
> >>>>>>
> >>>>>> On Tue, May 12, 2026 at 11:10:44AM +0200, Albert Esteve wrote:
> >>>>>>> On embedded platforms a central process often allocates dma-buf
> >>>>>>> memory on behalf of client applications. Without a way to
> >>>>>>> attribute the charge to the requesting client's cgroup, the
> >>>>>>> cost lands on the allocator, making per-cgroup memory limits
> >>>>>>> ineffective for the actual consumers.
> >>>>>>>
> >>>>>>> Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> >>>>>>
> >>>>>> Please be aware that pidfds come in two flavors:
> >>>>>>
> >>>>>> thread-group pidfds and thread-specific pidfds. Make sure that your API
> >>>>>> doesn't implicitly depend on this distinction not existing.
> >>>>>
> >>>>> Hi Christian,
> >>>>>
> >>>>> Memcg is not a controller that supports "thread mode" so all threads
> >>>>> in a group should belong to the same memcg.
> >>>>
> >>>> BTW: Exactly that is the requirement automotive has with their native context use case.
> >>>>
> >>>> The use case is that you have a deamon which has multiple threads were each one is acting on behalve of some other process.
> >>>>
> >>>> At the moment we basically say they are simply not using cgroups for that use case, but it would be really nice if we could handle that as well.
> >>>>
> >>>> Summarizing the requirement of that use case: You need a different cgroup for each thread of a process.
> >>>
> >>> Hi Christian,
> >>>
> >>> Thanks for sharing this atuomotive usecase. If I understand correctly,
> >>> the actual requirement is attributing dma-buf charges to the right
> >>> client, not putting each daemon thread in a different cgroup?
> >>
> >> Nope, exactly that's the difference.
> >>
> >> The thread acts as a filtering agent for both memory allocation and command submission for somebody else, the process on which behalve the daemon does things can even be in a client VM, completely remote over some network or even something like a microcontroller.
> >>
> >> Everything the thread does regarding CPU time, GPU driver memory allocation as well as resources like GPU processing and I/O time etc.. needs to be accounted to one client which can be different for each thread of the process.
> >>
> >> The only thing which is shared with the main process thread is CPU memory resources, e.g. malloc() because that is basically just needed for housekeeping and pretty much irrelevant for this kind of use case.
> >>
> >> The problem is now you can't do that with cgroups at the moment but unfortunately only the kernel has the information you need to know to do this.
> >>
> >> So what you end up with is to define tons of interfaces just to get the necessary information from the kernel into userspace and then essentially duplicate the same infrastructure cgroup provides in the kernel in userspace again.
> >>
> >>> If so,
> >>> the `charge_pid_fd` approach achieves this directly by passing the
> >>> client's `pid_fd`, without needing to add per-thread cgroup
> >>> infrastructure.
> >>
> >> Well it's already a massive improvemt, we could basically stop doing the whole duplication part for the GPU driver stack and just use cgroups for this part.
> >>
> >> Doing that automatically for CPU and I/O time would just be nice to have additionally.
> >>
> >> Regards,
> >> Christian.
> >
> > Hopefully I'm following correctly here.... So you are duplicating the
> > GPU driver stack to achieve remote accounting on a per-thread basis?
>
> Not quite, we are duplicating the handling cgroup provides in the kernel in userspace.
>
> For this memory usage information as well as execution times of the GPU kernel driver is exposed in fdinfo for example.
Oh I see, thanks.
> > Does this mean for GPU allocations you currently have some GFP_ACCOUNT
> > magic in your driver to attribute GPU memory to the correct remote
> > client?
>
> No, we just expose what the kernel driver has allocated for itself. E.g. page tables, buffers etc...
>
> When userspace allocates something using memfd_create() for example we just ignore that.
>
> > So this series would close the gap for dma-buf allocations,
> > but what about private GPU driver memory allocated on behalf of a
> > client?
>
> Well we would need a cgroup which isn't associated with any process were we could charge the GPU driver allocations against.
>
> But good point, charging against a pid wouldn't work in this use case.
It would be pretty low overhead to put a process doing while(1)
pause(); in a separate cgroup for this purpose, but I guess a fd for
the actual cgroup would be a little cleaner in this case.
> Regards,
> Christian.
^ permalink raw reply
* Re: [Linaro-mm-sig] Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: T.J. Mercier @ 2026-05-19 18:06 UTC (permalink / raw)
To: Christian König
Cc: Barry Song, Albert Esteve, Tejun Heo, Johannes Weiner,
Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
linux-kernel, linux-media, dri-, linaro-mm-sig, linux-mm,
linux-security-module, selinux, linux-kselftest, mripard,
echanude
In-Reply-To: <8a13b1ad-f1be-4ef4-905e-0d9828ae8cb5@amd.com>
On Tue, May 19, 2026 at 12:10 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/19/26 01:00, Barry Song wrote:
> > On Mon, May 18, 2026 at 3:34 PM Christian König
> > <christian.koenig@amd.com> wrote:
> >>
> >> On 5/16/26 11:19, Barry Song wrote:
> >>> On Thu, May 14, 2026 at 12:35 AM T.J. Mercier <tjmercier@google.com> wrote:
> >>> [...]
> >>>>>> I have a question about this part. Albert I guess you are interested
> >>>>>> only in accounting dmabuf-heap allocations, or do you expect to add
> >>>>>> __GFP_ACCOUNT or mem_cgroup_charge_dmabuf calls to other
> >>>>>> non-dmabuf-heap exporters?
> >>>>>
> >>>>> We're scoping this to dma-buf heaps for now. CMA heaps and the dmem
> >>>>> controller are on the radar for follow-up/parallel work (there will be
> >>>>> dragons and will surely need discussion). For DRM and V4L2 the
> >>>>> long-term intent is migration to heaps, which would make direct
> >>>>> accounting on those paths unnecessary.
> >>>>
> >>>> Ah I see. GEM buffers exported to dmabufs are what I had in mind. I
> >>>> guess this would only leave the odd non-DRM driver with the need to
> >>>> add their own accounting calls, which I don't expect would be a big
> >>>> problem.
> >>>>
> >>>
> >>> sounds like we still have a long way to go to correctly account for
> >>> various v4l2, drm, GEM, CMA, etc. In patch 1, the charging is done in
> >>> dma_buf_export(), so I guess it covers all dma-buf types except
> >>> dma_heap, but the problem is that it has no remote charging support at
> >>> all?
> >>
> >> No, just the other way around
> >>
> >> DMA-buf heaps can be handled here because we know that it is pure system memory and nothing special so memcg always applies.
> >>
> >> dma_buf_export() on the other hand handles tons of different use cases, ranging from buffer accounted to dmem, over special resources which aren't even memory all the way to buffers which can migrate from dmem to memcg and back during their lifetime.
> >>
> >
> > Hi Christian,
> >
> > Thanks very much for your explanation. So basically it seems that
> > dma_buf_export() is not the proper place to charge, since it may end up
> > mixing in non-system-memory accounting?
>
> Yes, exactly that.
>
> > My question is also about the global view for both heap and non-heap cases.
> > After reading the discussion, I’ve tried to summarize it—please let me know
> > if my understanding is correct.
> >
> > for dma_heap, we have the ioctl DMA_HEAP_IOCTL_ALLOC, where users can pass a
> > remote pidfd or similar information to indicate where the dma-buf should be
> > charged, as in Albert's patchset.
>
> Well that's the current proposal, but I think we need to come up with something more general.
>
> > For non-dma_heap dma-bufs, we don’t have an obvious userspace entry point that
> > triggers the allocation. So we likely need other approaches. We could either
> > move more drivers over to dma-heap, or introduce something like
> > DMA_BUF_IOCTL_XFER_CHARGE, as you are discussing, to let userspace explicitly
> > declare a charge.
>
> Yeah but that's not only for DMA-buf, we need that for file descriptors returned by memfd_create() as well.
memfds get charged on fault, so an allocator shouldn't currently be
charged just for creating the fd. Unlike system/CMA heap buffers, the
shmem backing a memfd / udmabuf is LRU memory, and swapping the memcg
owner of those pages is a more-involved process which is not supported
by memcg v2. There used to be some support in memcg v1, but it was
removed. Commit e548ad4a7cbf ("mm: memcg: move charge migration code
to memcontrol-v1.c ") said, "It's a fairly large and complicated code
which created a number of problems in the past." So I'm not sure how
much appetite there would be to support it in v2 for this.
^ permalink raw reply
* Re: [PATCH 1/6] alloc_tag: add ioctl to /proc/allocinfo
From: Suren Baghdasaryan @ 2026-05-19 17:42 UTC (permalink / raw)
To: Hao Ge
Cc: Abhishek Bapat, Shuah Khan, Jonathan Corbet, linux-doc,
linux-kernel, linux-mm, Sourav Panda, Kent Overstreet,
Andrew Morton
In-Reply-To: <c627136d-8060-4e2d-8473-0fe322ce1e6c@linux.dev>
On Mon, May 18, 2026 at 7:53 PM Hao Ge <hao.ge@linux.dev> wrote:
>
> Hi Abhishek
>
>
> Thanks for the follow-up.
>
>
> On 2026/5/19 07:41, Abhishek Bapat wrote:
> > On Wed, May 13, 2026 at 9:38 PM Hao Ge<hao.ge@linux.dev> wrote:
> >> Hi Suren and Abhishek
> >>
> >>
> >> Thanks for the patch! A couple of minor comments below.
> >>
> >>
> >> On 2026/5/5 07:36, Abhishek Bapat wrote:
> >>> From: Suren Baghdasaryan<surenb@google.com>
> >>>
> >>> Add the following ioctl commands for /proc/allocinfo file:
> >>>
> >>> ALLOCINFO_IOC_CONTENT_ID - gets content identifier which can be used
> >>> to check whether the file content has changed specifically due to module
> >>> load/unload. Every time a module is loaded / unloaded, the returned
> >>> value will be different. By comparing the identifier value at the
> >>> beginning and at the end of the content retrieval operation, users can
> >>> validate retrieved information for consistency.
> >>>
> >>> ALLOCINFO_IOC_GET_AT - gets the record at the specified position. This
> >>> is the position of a record in /proc/allocinfo.
> >>>
> >>> ALLOCINFO_IOC_GET_NEXT - gets the record next to the last retrieved
> >>> one. If no records were previously retrieved, returns the first
> >>> record.
> >>>
> >>> Signed-off-by: Suren Baghdasaryan<surenb@google.com>
> >>> Signed-off-by: Abhishek Bapat<abhishekbapat@google.com>
> >>> ---
> >>> .../userspace-api/ioctl/ioctl-number.rst | 2 +
> >>> include/linux/codetag.h | 1 +
> >>> include/uapi/linux/alloc_tag.h | 54 ++++++
> >>> lib/alloc_tag.c | 178 +++++++++++++++++-
> >>> lib/codetag.c | 11 ++
> >>> 5 files changed, 244 insertions(+), 2 deletions(-)
> >>> create mode 100644 include/uapi/linux/alloc_tag.h
> >>>
> >>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> index 331223761fff..84f6808a8578 100644
> >>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> @@ -349,6 +349,8 @@ Code Seq# Include File Comments
> >>> <mailto:luzmaximilian@gmail.com>
> >>> 0xA5 20-2F linux/surface_aggregator/dtx.h Microsoft Surface DTX driver
> >>> <mailto:luzmaximilian@gmail.com>
> >>> +0xA6 00-0F uapi/linux/alloc_tag.h Memory allocation profiling
> >>> +<mailto:surenb@google.com>
> >>> 0xAA 00-3F linux/uapi/linux/userfaultfd.h
> >>> 0xAB 00-1F linux/nbd.h
> >>> 0xAC 00-1F linux/raw.h
> >>> diff --git a/include/linux/codetag.h b/include/linux/codetag.h
> >>> index 8ea2a5f7c98a..2bcd4e7c809e 100644
> >>> --- a/include/linux/codetag.h
> >>> +++ b/include/linux/codetag.h
> >>> @@ -76,6 +76,7 @@ struct codetag_iterator {
> >>>
> >>> void codetag_lock_module_list(struct codetag_type *cttype, bool lock);
> >>> bool codetag_trylock_module_list(struct codetag_type *cttype);
> >>> +unsigned long codetag_get_content_id(struct codetag_type *cttype);
> >>> struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype);
> >>> struct codetag *codetag_next_ct(struct codetag_iterator *iter);
> >>>
> >>> diff --git a/include/uapi/linux/alloc_tag.h b/include/uapi/linux/alloc_tag.h
> >>> new file mode 100644
> >>> index 000000000000..e9a5b55fcc7a
> >>> --- /dev/null
> >>> +++ b/include/uapi/linux/alloc_tag.h
> >>> @@ -0,0 +1,54 @@
> >>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> >>> +/*
> >>> + * include/linux/alloc_tag.h
> >>> + */
> >>> +
> >>> +#ifndef _UAPI_ALLOC_TAG_H
> >>> +#define _UAPI_ALLOC_TAG_H
> >>> +
> >>> +#include <linux/types.h>
> >>> +
> >>> +#define ALLOCINFO_STR_SIZE 64
> >>> +
> >>> +struct allocinfo_content_id {
> >>> + __u64 id;
> >>> +};
> >>> +
> >>> +struct allocinfo_tag {
> >>> + /* Longer names are trimmed */
> >>> + char modname[ALLOCINFO_STR_SIZE];
> >>> + char function[ALLOCINFO_STR_SIZE];
> >>> + char filename[ALLOCINFO_STR_SIZE];
> >>> + __u64 lineno;
> >>> +};
> >>> +
> >>> +struct allocinfo_counter {
> >>> + __u64 bytes;
> >>> + __u64 calls;
> >>> + __u8 accurate;
> >>> + __u8 pad[7]; /* Add alignment to not break the 32-bit compatible interface */
> >>> +};
> >>> +
> >>> +struct allocinfo_tag_data {
> >>> + struct allocinfo_tag tag;
> >>> + struct allocinfo_counter counter;
> >>> +};
> >>> +
> >>> +struct allocinfo_get_at {
> >>> + __u64 pos; /* input */
> >>> + struct allocinfo_tag_data data;
> >>> +};
> >>> +
> >>> +#define _ALLOCINFO_IOC_CONTENT_ID 0
> >>> +#define _ALLOCINFO_IOC_GET_AT 1
> >>> +#define _ALLOCINFO_IOC_GET_NEXT 2
> >>> +
> >>> +#define ALLOCINFO_IOC_BASE 0xA6
> >>> +#define ALLOCINFO_IOC_CONTENT_ID _IOR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_CONTENT_ID, \
> >>> + struct allocinfo_content_id)
> >>> +#define ALLOCINFO_IOC_GET_AT _IOWR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_GET_AT, \
> >>> + struct allocinfo_get_at)
> >>> +#define ALLOCINFO_IOC_GET_NEXT _IOR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_GET_NEXT, \
> >>> + struct allocinfo_tag_data)
> >>> +
> >>> +#endif /* _UAPI_ALLOC_TAG_H */
> >>> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> >>> index ed1bdcf1f8ab..5c24d2f954d4 100644
> >>> --- a/lib/alloc_tag.c
> >>> +++ b/lib/alloc_tag.c
> >>> @@ -14,6 +14,7 @@
> >>> #include <linux/string_choices.h>
> >>> #include <linux/vmalloc.h>
> >>> #include <linux/kmemleak.h>
> >>> +#include <uapi/linux/alloc_tag.h>
> >>>
> >>> #define ALLOCINFO_FILE_NAME "allocinfo"
> >>> #define MODULE_ALLOC_TAG_VMAP_SIZE (100000UL * sizeof(struct alloc_tag))
> >>> @@ -46,6 +47,9 @@ int alloc_tag_ref_offs;
> >>> struct allocinfo_private {
> >>> struct codetag_iterator iter;
> >>> bool print_header;
> >>> + /* ioctl uses a separate iterator not to interfere with reads */
> >>> + struct codetag_iterator ioctl_iter;
> >>> + bool positioned; /* seq_open_private() sets to 0 */
> >>> };
> >>>
> >>> static void *allocinfo_start(struct seq_file *m, loff_t *pos)
> >>> @@ -125,6 +129,177 @@ static const struct seq_operations allocinfo_seq_op = {
> >>> .show = allocinfo_show,
> >>> };
> >>>
> >>> +static int allocinfo_open(struct inode *inode, struct file *file)
> >>> +{
> >>> + return seq_open_private(file, &allocinfo_seq_op,
> >>> + sizeof(struct allocinfo_private));
> >>> +}
> >>> +
> >>> +static int allocinfo_release(struct inode *inode, struct file *file)
> >>> +{
> >>> + return seq_release_private(inode, file);
> >>> +}
> >>> +
> >>> +static const char *allocinfo_str(const char *str)
> >>> +{
> >>> + size_t len = strlen(str);
> >>> +
> >>> + /* Keep an extra space for the trailing NULL. */
> >>> + if (len >= ALLOCINFO_STR_SIZE)
> >>> + str += (len - ALLOCINFO_STR_SIZE) + 1;
> >>> + return str;
> >>> +}
> >>> +
> >>> +/* Copy a string and trim from the beginning if it's too long */
> >>> +static void allocinfo_copy_str(char *dest, const char *src)
> >>> +{
> >>> + strscpy(dest, allocinfo_str(src), ALLOCINFO_STR_SIZE);
> >>> +}
> >>> +
> >>> +static void allocinfo_to_params(struct codetag *ct,
> >>> + struct allocinfo_tag_data *data)
> >>> +{
> >>> + struct alloc_tag *tag = ct_to_alloc_tag(ct);
> >>> + struct alloc_tag_counters counter = alloc_tag_read(tag);
> >>> +
> >>> + if (ct->modname)
> >>> + allocinfo_copy_str(data->tag.modname, ct->modname);
> >>> + else
> >>> + data->tag.modname[0] = '\0';
> >> Minor nit about allocinfo_to_params():
> >>
> >> When modname is NULL (built-in kernel code), the current code sets it
> >>
> >> to an empty string:
> >>
> >> if (ct->modname)
> >>
> >> allocinfo_copy_str(data->tag.modname, ct->modname);
> >>
> >> else
> >>
> >> data->tag.modname[0] = '\0';
> >>
> >> This is of course workable in userspace by checking for an empty
> >>
> >> string, but I was wondering if it would be cleaner to use "vmlinux"
> >>
> >> as a default:
> >>
> >> else
> >>
> >> allocinfo_copy_str(data->tag.modname, "vmlinux");
> >>
> >>
> >> For some context, in our memory analysis workflow we often group
> >>
> >> allocations by module to get a quick overview of where memory goes,
> >>
> >> for example:
> >>
> >> vmlinux: 2.1 GB (kernel core)
> >>
> >> nvidia: 1.2 GB (GPU driver)
> >>
> >> iwlwifi: 800 MB (WiFi driver)
> >>
> >> ext4: 500 MB (filesystem)
> >>
> >> Having a consistent identifier for kernel built-in allocations would
> >>
> >> avoid each userspace tool needing to handle the empty string as a
> >>
> >> special case. Totally fine if this is intentional though.
> >>
> > Thanks for bringing this up, I can certainly make this change.
> > However, the information is not currently exposed this way through
> > /proc/allocinfo. /proc/allocinfo does not categorize kernel non-module
> > allocations as vmlinux, so there will a delta between how IOCTL and
> > /proc/allocinfo behave. Suren, could you comment on whether this
> > recommendation is fine by you?
> >
> Right, /proc/allocinfo indeed doesn't categorize them as vmlinux currently.
>
> It's just that in practice we often group allocations by module, so
> having "vmlinux" as a default
>
> would be convenient. Let's wait for Suren's input.
Hi Folks,
I would prefer to keep it empty because vmlinux is not really a module
and hardcoding this name also seems suboptimal (in case it ever
changes). Empty string also aligns with how we output /proc/allocinfo
data. If the symbol is in the kernel itself, we do not display the
module name at all. So, all in all, unless there is a strong reason
against it, I think we should keep it empty.
>
> >>> + allocinfo_copy_str(data->tag.function, ct->function);
> >>> + allocinfo_copy_str(data->tag.filename, ct->filename);
> >>> + data->tag.lineno = ct->lineno;
> >>> + data->counter.bytes = counter.bytes;
> >>> + data->counter.calls = counter.calls;
> >>> + data->counter.accurate = !alloc_tag_is_inaccurate(tag);
> >>> +}
> >>> +
> >>> +static int allocinfo_ioctl_get_content_id(struct seq_file *m, void __user *arg)
> >>> +{
> >>> + struct allocinfo_content_id params;
> >>> +
> >>> + codetag_lock_module_list(alloc_tag_cttype, true);
> >>> + params.id = codetag_get_content_id(alloc_tag_cttype);
> >>> + codetag_lock_module_list(alloc_tag_cttype, false);
> >>> + if (copy_to_user(arg, ¶ms, sizeof(params)))
> >>> + return -EFAULT;
> >>> +
> >>> + return 0;
> >>> +}
> >>> +
> >>> +static int allocinfo_ioctl_get_at(struct seq_file *m, void __user *arg)
> >>> +{
> >>> + struct allocinfo_private *priv;
> >>> + struct codetag *ct;
> >>> + __u64 pos;
> >>> + struct allocinfo_get_at params = {0};
> >>> +
> >>> + if (copy_from_user(¶ms, arg, sizeof(params)))
> >>> + return -EFAULT;
> >>> +
> >>> + priv = (struct allocinfo_private *)m->private;
> >>> + pos = params.pos;
> >>> +
> >>> + codetag_lock_module_list(alloc_tag_cttype, true);
> >>> +
> >>> + /* Find the codetag */
> >>> + priv->ioctl_iter = codetag_get_ct_iter(alloc_tag_cttype);
> >>> + ct = codetag_next_ct(&priv->ioctl_iter);
> >>> + while (ct && pos--)
> >>> + ct = codetag_next_ct(&priv->ioctl_iter);
> >> I noticed that codetag_next_ct(&priv->ioctl_iter) and
> >>
> >> priv->positioned are accessed without serialization in the ioctl
> >>
> >> path. Concurrent ioctl calls on the same fd could race on these
> >>
> >> fields. Just something I spotted while reading the code.
> >>
> >>
> >> Thanks
> >>
> >> Best Regards
> >>
> >> Hao
> >>
> > I believe this should be prevented by `codetag_lock_module_list`; am I
> > wrong in my understanding?
>
> Thanks for the explanation! codetag_lock_module_list is designed to
> protect the module list from concurrent load/unload, which it does
>
> correctly. However, it doesn't cover the race between concurrent ioctl
> calls on the same fd, since it acquires cttype->mod_lock via
>
> down_read() and rwsem read locks allow multiple readers to proceed
> concurrently:
>
> Thread A: ALLOCINFO_IOC_GET_AT
>
> down_read(&cttype->mod_lock) // read lock acquired
>
> priv->ioctl_iter = codetag_get_ct_iter(...)
>
> ct = codetag_next_ct(&priv->ioctl_iter)
>
> priv->positioned = true;
>
> Thread B: ALLOCINFO_IOC_GET_NEXT // concurrent ioctl on same fd
>
> down_read(&cttype->mod_lock) // read locks don't exclude
> each other
>
> if (!priv->positioned) { // sees partial state from
> Thread A
>
> priv->ioctl_iter = ... // overwrites Thread A's iterator
>
> }
>
> ct = codetag_next_ct(&priv->ioctl_iter) // corrupted iterator
>
> priv->ioctl_iter and priv->positioned are per-fd state with no
> serialization in the ioctl path.
Yep, you are right. codetag_lock_module_list() is not enough here to
protect from such races. I guess allocinfo_private would need another
lock.
Thanks,
Suren.
>
> Just something I spotted.
>
> Thanks
>
> Best Regards
>
> Hao
>
> >>> + if (ct) {
> >>> + allocinfo_to_params(ct, ¶ms.data);
> >>> + priv->positioned = true;
> >>> + }
> >>> +
> >>> + codetag_lock_module_list(alloc_tag_cttype, false);
> >>> +
> >>> + if (!ct)
> >>> + return -ENOENT;
> >>> +
> >>> + if (copy_to_user(arg, ¶ms, sizeof(params)))
> >>> + return -EFAULT;
> >>> +
> >>> + return 0;
> >>> +}
> >>> +
> >>> +static int allocinfo_ioctl_get_next(struct seq_file *m, void __user *arg)
> >>> +{
> >>> + struct allocinfo_private *priv;
> >>> + struct codetag *ct;
> >>> + struct allocinfo_tag_data params = {0};
> >>> + int ret = 0;
> >>> +
> >>> + priv = (struct allocinfo_private *)m->private;
> >>> +
> >>> + codetag_lock_module_list(alloc_tag_cttype, true);
> >>> +
> >>> + if (!priv->positioned) {
> >>> + priv->ioctl_iter = codetag_get_ct_iter(alloc_tag_cttype);
> >>> + priv->positioned = true;
> >>> + }
> >>> +
> >>> + ct = codetag_next_ct(&priv->ioctl_iter);
> >>> + if (ct)
> >>> + allocinfo_to_params(ct, ¶ms);
> >>> +
> >>> + if (!ct) {
> >>> + priv->positioned = false;
> >>> + ret = -ENOENT;
> >>> + }
> >>> + codetag_lock_module_list(alloc_tag_cttype, false);
> >>> +
> >>> + if (ret == 0) {
> >>> + if (copy_to_user(arg, ¶ms, sizeof(params)))
> >>> + return -EFAULT;
> >>> + }
> >>> + return ret;
> >>> +}
> >>> +
> >>> +static long allocinfo_ioctl(struct file *file, unsigned int cmd,
> >>> + unsigned long __arg)
> >>> +{
> >>> + void __user *arg = (void __user *)__arg;
> >>> + int ret;
> >>> +
> >>> + switch (cmd) {
> >>> + case ALLOCINFO_IOC_CONTENT_ID:
> >>> + ret = allocinfo_ioctl_get_content_id(file->private_data, arg);
> >>> + break;
> >>> + case ALLOCINFO_IOC_GET_AT:
> >>> + ret = allocinfo_ioctl_get_at(file->private_data, arg);
> >>> + break;
> >>> + case ALLOCINFO_IOC_GET_NEXT:
> >>> + ret = allocinfo_ioctl_get_next(file->private_data, arg);
> >>> + break;
> >>> + default:
> >>> + ret = -ENOIOCTLCMD;
> >>> + break;
> >>> + }
> >>> +
> >>> + return ret;
> >>> +}
> >>> +
> >>> +#ifdef CONFIG_COMPAT
> >>> +static long allocinfo_compat_ioctl(struct file *file, unsigned int cmd,
> >>> + unsigned long arg)
> >>> +{
> >>> + return allocinfo_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
> >>> +}
> >>> +#endif
> >>> +
> >>> +static const struct proc_ops allocinfo_proc_ops = {
> >>> + .proc_open = allocinfo_open,
> >>> + .proc_read_iter = seq_read_iter,
> >>> + .proc_lseek = seq_lseek,
> >>> + .proc_release = allocinfo_release,
> >>> + .proc_ioctl = allocinfo_ioctl,
> >>> +#ifdef CONFIG_COMPAT
> >>> + .proc_compat_ioctl = allocinfo_compat_ioctl,
> >>> +#endif
> >>> +
> >>> +};
> >>> +
> >>> size_t alloc_tag_top_users(struct codetag_bytes *tags, size_t count, bool can_sleep)
> >>> {
> >>> struct codetag_iterator iter;
> >>> @@ -946,8 +1121,7 @@ static int __init alloc_tag_init(void)
> >>> return 0;
> >>> }
> >>>
> >>> - if (!proc_create_seq_private(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op,
> >>> - sizeof(struct allocinfo_private), NULL)) {
> >>> + if (!proc_create(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_proc_ops)) {
> >>> pr_err("Failed to create %s file\n", ALLOCINFO_FILE_NAME);
> >>> shutdown_mem_profiling(false);
> >>> return -ENOMEM;
> >>> diff --git a/lib/codetag.c b/lib/codetag.c
> >>> index 304667897ad4..93aa30991563 100644
> >>> --- a/lib/codetag.c
> >>> +++ b/lib/codetag.c
> >>> @@ -48,6 +48,17 @@ bool codetag_trylock_module_list(struct codetag_type *cttype)
> >>> return down_read_trylock(&cttype->mod_lock) != 0;
> >>> }
> >>>
> >>> +unsigned long codetag_get_content_id(struct codetag_type *cttype)
> >>> +{
> >>> + lockdep_assert_held(&cttype->mod_lock);
> >>> +
> >>> + /*
> >>> + * next_mod_seq is updated on every load, so can be used to identify
> >>> + * content changes.
> >>> + */
> >>> + return cttype->next_mod_seq;
> >>> +}
> >>> +
> >>> struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
> >>> {
> >>> struct codetag_iterator iter = {
> > Note, I will be following up with a v2 patchset with your feedback
> > included. Please bring up any other points you'd want to clarify so
> > that I can include all the changes in the v2 patchset. Thanks for
> > reviewing!
>
^ permalink raw reply
* Re: [PATCH 4/8] drm/panthor: Add support for protected memory allocation in panthor
From: Boris Brezillon @ 2026-05-19 17:29 UTC (permalink / raw)
To: Chia-I Wu
Cc: Ketil Johnsen, Liviu Dudau, Marcin Ślusarz, David Airlie,
Simona Vetter, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Jonathan Corbet, Shuah Khan, Sumit Semwal,
Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
Christian König, Steven Price, Daniel Almeida, Alice Ryhl,
Matthias Brugger, AngeloGioacchino Del Regno, dri-devel,
linux-doc, linux-kernel, linux-media, linaro-mm-sig,
linux-arm-kernel, linux-mediatek, Florent Tomasin, nd
In-Reply-To: <CAPaKu7T7JZRmsS+D_3zFZtyhJk9mNXjL=xpAQ-UNGbm0vztyRg@mail.gmail.com>
On Tue, 19 May 2026 10:07:02 -0700
Chia-I Wu <olvaffe@gmail.com> wrote:
> On Tue, May 19, 2026 at 1:49 AM Ketil Johnsen <ketil.johnsen@arm.com> wrote:
> >
> > On 19/05/2026 09:39, Boris Brezillon wrote:
> > > On Mon, 18 May 2026 17:36:40 -0700
> > > Chia-I Wu <olvaffe@gmail.com> wrote:
> > >
> > >> On Mon, May 18, 2026 at 12:16 AM Boris Brezillon
> > >> <boris.brezillon@collabora.com> wrote:
> > >>>
> > >>> On Wed, 13 May 2026 12:31:32 -0700
> > >>> Chia-I Wu <olvaffe@gmail.com> wrote:
> > >>>
> > >>>> On Tue, May 12, 2026 at 8:39 AM Liviu Dudau <liviu.dudau@arm.com> wrote:
> > >>>>>
> > >>>>> On Tue, May 12, 2026 at 04:11:11PM +0200, Boris Brezillon wrote:
> > >>>>>> On Tue, 12 May 2026 14:47:27 +0100
> > >>>>>> Liviu Dudau <liviu.dudau@arm.com> wrote:
> > >>>>>>
> > >>>>>>> On Thu, May 07, 2026 at 01:53:56PM +0200, Boris Brezillon wrote:
> > >>>>>>>> On Thu, 7 May 2026 11:02:26 +0200
> > >>>>>>>> Marcin Ślusarz <marcin.slusarz@arm.com> wrote:
> > >>>>>>>>
> > >>>>>>>>> On Tue, May 05, 2026 at 06:15:23PM +0200, Boris Brezillon wrote:
> > >>>>>>>>>>> @@ -277,9 +286,21 @@ int panthor_device_init(struct panthor_device *ptdev)
> > >>>>>>>>>>> return ret;
> > >>>>>>>>>>> }
> > >>>>>>>>>>>
> > >>>>>>>>>>> + /* If a protected heap name is specified but not found, defer the probe until created */
> > >>>>>>>>>>> + if (protected_heap_name && strlen(protected_heap_name)) {
> > >>>>>>>>>>
> > >>>>>>>>>> Do we really need this strlen() > 0? Won't dma_heap_find() fail is the
> > >>>>>>>>>> name is "" already?
> > >>>>>>>>>
> > >>>>>>>>> If dma_heap_find() will fail, then the whole probe with fail too.
> > >>>>>>>>> This check prevents that.
> > >>>>>>>>
> > >>>>>>>> Yeah, that's also a questionable design choice. I mean, we can
> > >>>>>>>> currently probe and boot the FW even though we never setup the
> > >>>>>>>> protected FW sections, so why should we defer the probe here? Can't we
> > >>>>>>>> just retry the next time a group with the protected bit is created and
> > >>>>>>>> fail if we can find a protected heap?
> > >>>>>>>
> > >>>>>>> The problem we have with the current firmware is that it does a number of setup steps at "boot"
> > >>>>>>> time only. One of the steps is preparing its internal structures for when it enters protected
> > >>>>>>> mode and it stores them in the buffer passed in at firmware loading. We cannot later run the
> > >>>>>>> process when we have a group with protected mode set.
> > >>>>>>
> > >>>>>> No, but we can force a full/slow reset and have that thing
> > >>>>>> re-initialized, can't we? I mean, that's basically what we do when a
> > >>>>>> fast reset fails: we re-initialize all the sections and reset again, at
> > >>>>>> which point the FW should start from a fresh state, and be able to
> > >>>>>> properly initialize the protected-related stuff if protected sections
> > >>>>>> are populated. Am I missing something?
> > >>>>>
> > >>>>> Right, we can do that. For some reason I keep associating the reset with the
> > >>>>> error handling and not with "normal" operations.
> > >>>> I kind of hope we end up with either
> > >>>>
> > >>>> - panthor knows the exact heap to use and fails with EPROBE_DEFER if
> > >>>> the heap is missing, or
> > >>>> - panthor gets a dma-buf from userspace and does the full reset
> > >>>> - userspace also needs to provide a dma-buf for each protected
> > >>>> group for the suspend buffer
> > >>>>
> > >>>> than something in-between. The latter is more ad-hoc and basically
> > >>>> kicks the issue to the userspace.
> > >>>
> > >>> Indeed, the second option is more ad-hoc, but when you think about it,
> > >>> userspace has to have this knowledge, because it needs to know the
> > >>> dma-heap to use for buffer allocation that cross a device boundary
> > >>> anyway. Think about frames produced by a video decoder, and composited
> > >>> by the GPU into a protected scanout buffer that's passed to the KMS
> > >>> device. Why would the GPU driver be source of truth when it comes to
> > >>> choosing the heap to use to allocate protected buffers for the video
> > >>> decoder or those used for the display?
> > >> I don't think the GPU driver is ever the source of truth. If the
> > >> system integrator wants to specify the source of truth (SoT) from
> > >> kernel space, they should use the device tree (or module params /
> > >> config options). If they want to specify the SoT in userspace, then we
> > >> don't really care how it is done other than providing an ioctl.
> > >> Panthor is always on the receiving end.
> > >
> > > Okay, we're on the same page then.
> > >
> > >>
> > >> If we don't want to delay this functionality, but it takes time to
> > >> converge on SoT, maybe a solution that is not a long-term promise can
> > >> work? Of the options on the table (dt, module params, kconfig options,
> > >> ioctls), a kconfig option, potentially marked as experimental, seems
> > >> like a good candidate.
> > >
> > > If Panthor is only a consumer, I actually think it'd be easier to just
> > > let userspace pass the protected FW section as an imported buffer
> > > through an ioctl for now. It means we don't need any of the
> > > modifications to the dma_heap API in this series, and userspace is free
> > > to choose its SoT (efuse, DT, ...) and pass the info back to mesa/GBM
> > > somehow (envvar, driconf, ...). The only thing we need to ensure is if
> > > lazy protected FW section allocation is going to work, but given the
> > > current code purely and simply ignores those sections, and the FW is
> > > still able to boot and act properly (at least on v10-v13), I'm pretty
> > > confident this is okay, unless there's some trick the MCU can do to
> > > detect that the protected section isn't mapped (which I doubt, because
> > > the MCU doesn't know it lives behind an MMU).
> I set up MMU to map non-protected memory to the protected section the
> other day. The FW still booted fine. I didn't get access violation
> until the FW executed PROT_REGION and panthor requested
> GLB_PROTM_ENTER in response.
Ah, thanks for testing! We still don't have a setup with proper
protected heap, but that was on my list of things to test.
>
> This was on v13, but I also doubt it will become an issue. Can ARM help clarify?
>
> > >
> > > Of course, once we have a consensus on how to describe this in the DT,
> > > we can switch Panthor over to "protected dma_heap selection through DT",
> > > and reflect that through the ioctl that exposes whether protected
> > > support is ready or not (would be a DEV_QUERY), such that userspace can
> > > skip this "PROTM initialization" step.
> > >
> > > We're talking about an extra ioctl to set those buffers, and a
> > > DEV_QUERY to query the state (ready or not), the size of the global
> > > protected buffer (protected FW section) and the size of the protected
> > > suspend buffer. The protected suspend buffer would be allocated and
> > > passed at group creation time (extra arg passed to the existing
> > > GROUP_CREATE ioctl). So, overall, I don't consider it a huge liability
> > > in term of maintenance cost.
> >
> > If we can avoid the dma-heap changes, then that would surely help!
> > I can try to implement this in the next version unless someone finds a
> > reason why it is a bad idea.
> Yeah, that sounds good to me too.
>
> Will the extra ioctl require root?
The PROTM_INIT ioctl will certainly require high privilege
CAP_SYS_<something>, dunno yet what that <something> would be though.
> On a system with true protected
> memory, the FW cannot write to non-protected memory. It seems ok to
> allow any client to make the ioctl call. But on systems without true
> protected memory, it can be problematic.
Yep, I agree we shouldn't let random users pretend they initialized
protected mode if the system as a whole doesn't have proper the proper
bit hooked up to set that up.
^ permalink raw reply
* Re: [PATCH 00/12] misc/syncobj: add /dev/syncobj device
From: Xaver Hugl @ 2026-05-19 17:08 UTC (permalink / raw)
To: Christian König
Cc: Julian Orth, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
David Airlie, Simona Vetter, Sumit Semwal, Jonathan Corbet,
Shuah Khan, Arnd Bergmann, Greg Kroah-Hartman, dri-devel,
linux-kernel, linux-media, linaro-mm-sig, linux-doc,
wayland-devel, Michel Dänzer
In-Reply-To: <dff60378-4e47-4753-8878-feec6e1c2690@amd.com>
> > The part where we get this independent of attached hardware is quite
> > important for us though, since we can't just ignore explicit sync once
> > the device we previously imported the syncobj into is disconnected.
>
> Can you elaborate more on this?
In Wayland, the client is allowed to attach dmabuf and syncobj
independently, they don't have to be from the same device (and the
compositor wouldn't be able to verify the opposite anyways). The
compositor will usually import both into the same drm device, but
especially with compositors that render on multiple devices, that's
not necessarily the case either.
If for example we had a system with one internal GPU and one external
GPU, the client renders on the internal GPU and the compositor uses
the external one. Now when the user yanks the USB C cable, afaiu
- the buffers from the client stay valid
- the syncobj stays valid on the client side
- the syncobj becomes invalid on the compositor side
"invalid" there means either
- the acquire point of the client is marked as signaled, before
rendering on the client side is completed
- the acquire point of the client is never signaled. Since the
compositor waits for the acquire point, the Wayland surface is stuck
forever
Afaik the latter is currently the case. The former wouldn't be much
better though, not when it's preventable.
This is admittedly an edge case, but GPU hotunplug is something we try
to support as well as possible in Plasma, and all the edge cases cause
a lot of problems in combination and are a lot of headaches to handle
(or really work around) in the compositor.
Another edge case is when the client asks the compositor to import the
syncobj, which can fail when a hotunplug is in process, and ends up
disconnecting the client for no fault of either client or compositor.
> >>> 3. It removes the need to translate between syncobjs fds and handles.
> >>
> >> That's a pretty big no-go as well. The differentiation between FDs and handles is completely intentional.
> > Could you expand on why it's needed? For compositors, the handle is
> > just an intermediary thing when translating between file descriptors.
>
> Well what we could do is to add an IOCTL to directly attach an syncobj file descriptor to an eventfd.
That would be nice.
- Xaver
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox