Linux Documentation


* Re: [PATCH v7 6/10] security: Hornet LSM
From: Paul Moore @ 2026-05-13 18:36 UTC
  To: Blaise Boscaccy, Blaise Boscaccy, Jonathan Corbet, James Morris,
	Serge E. Hallyn, Mickaël Salaün, Günther Noack,
	Dr. David Alan Gilbert, Andrew Morton, James.Bottomley, dhowells,
	Fan Wu, Ryan Foster, Randy Dunlap, linux-security-module,
	linux-doc, linux-kernel, bpf, Song Liu
In-Reply-To: <20260507191416.2984054-7-bboscaccy@linux.microsoft.com>

On May  7, 2026 Blaise Boscaccy <bboscaccy@linux.microsoft.com> wrote:
> 
> This adds the Hornet Linux Security Module which provides enhanced
> signature verification and data validation for eBPF programs. This
> allows users to continue to maintain an invariant that all code
> running inside of the kernel has actually been signed and verified, by
> the kernel.
> 
> This effort builds upon the currently accepted upstream solution. It
> further hardens it by providing deterministic, in-kernel checking of
> map hashes to solidify auditing along with preventing TOCTOU attacks
> against lskel map hashes.
> 
> Target map hashes are passed in via PKCS#7 signed attributes. Hornet
> determines the extent to which the eBPF program is signed and defers to
> other LSMs for policy decisions.
> 
> Signed-off-by: Blaise Boscaccy <bboscaccy@linux.microsoft.com>
> Nacked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> ---
>  Documentation/admin-guide/LSM/Hornet.rst | 323 +++++++++++++++++++++
>  Documentation/admin-guide/LSM/index.rst  |   1 +
>  MAINTAINERS                              |   9 +
>  include/linux/oid_registry.h             |   3 +
>  include/uapi/linux/lsm.h                 |   1 +
>  security/Kconfig                         |   3 +-
>  security/Makefile                        |   1 +
>  security/hornet/Kconfig                  |  13 +
>  security/hornet/Makefile                 |   7 +
>  security/hornet/hornet.asn1              |  12 +
>  security/hornet/hornet_lsm.c             | 352 +++++++++++++++++++++++
>  11 files changed, 724 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/admin-guide/LSM/Hornet.rst
>  create mode 100644 security/hornet/Kconfig
>  create mode 100644 security/hornet/Makefile
>  create mode 100644 security/hornet/hornet.asn1
>  create mode 100644 security/hornet/hornet_lsm.c

Merged into lsm/dev, thanks.

--
paul-moore.com


* Re: [PATCH v7 5/10] lsm: security: Add additional enum values for bpf integrity checks
From: Paul Moore @ 2026-05-13 18:36 UTC
  To: Blaise Boscaccy, Blaise Boscaccy, Jonathan Corbet, James Morris,
	Serge E. Hallyn, Mickaël Salaün, Günther Noack,
	Dr. David Alan Gilbert, Andrew Morton, James.Bottomley, dhowells,
	Fan Wu, Ryan Foster, Randy Dunlap, linux-security-module,
	linux-doc, linux-kernel, bpf, Song Liu
In-Reply-To: <20260507191416.2984054-6-bboscaccy@linux.microsoft.com>

On May  7, 2026 Blaise Boscaccy <bboscaccy@linux.microsoft.com> wrote:
> 
> First add a generic LSM_INT_VERDICT_FAULT value to indicate a system
> failure during checking. Second, add a LSM_INT_VERDICT_UNKNOWNKEY to
> signal that the payload was signed with a key other than one that
> exists in the secondary keyring. And finally add an
> LSM_INT_VERDICT_UNEXPECTED enum value to indicate that an unexpected
> hash value was encountered at some stage of verification.
> 
> Signed-off-by: Blaise Boscaccy <bboscaccy@linux.microsoft.com>
> ---
>  include/linux/security.h | 3 +++
>  1 file changed, 3 insertions(+)

Merged into lsm/dev, thanks.

--
paul-moore.com


* Re: [PATCH v7 4/10] lsm: framework for BPF integrity verification
From: Paul Moore @ 2026-05-13 18:36 UTC
  To: Blaise Boscaccy, Blaise Boscaccy, Jonathan Corbet, James Morris,
	Serge E. Hallyn, Mickaël Salaün, Günther Noack,
	Dr. David Alan Gilbert, Andrew Morton, James.Bottomley, dhowells,
	Fan Wu, Ryan Foster, Randy Dunlap, linux-security-module,
	linux-doc, linux-kernel, bpf, Song Liu
In-Reply-To: <20260507191416.2984054-5-bboscaccy@linux.microsoft.com>

On May  7, 2026 Blaise Boscaccy <bboscaccy@linux.microsoft.com> wrote:
> 
> Add a new LSM hook and two new LSM hook callbacks to support LSMs that
> perform integrity verification, e.g. digital signature verification,
> of BPF programs.
> 
> While the BPF subsystem does implement a signature verification scheme,
> it does not satisfy a number of existing requirements.  Adding support
> for BPF program integrity verification to the LSM framework allows
> administrators to select additional integrity verification mechanisms
> to meet these needs while also providing a mechanism for future
> expansion.  Additional information on why this is necessary can be
> found at the lore archive link below:
> 
> https://lore.kernel.org/linux-security-module/CAHC9VhTQ_DR=ANzoDBjcCtrimV7XcCZVUsANPt=TjcvM4d-vjg@mail.gmail.com/
> 
> The LSM-based BPF integrity verification mechanism works within the
> existing security_bpf_prog_load() hook called by the BPF subsystem.
> It adds an additional dedicated integrity callback and a new LSM
> hook/callback to be called from within LSMs implementing integrity
> verification.
> 
> The first new callback, bpf_prog_load_integrity(), located within the
> security_bpf_prog_load() hook, is necessary to ensure that the integrity
> verification callbacks are executed before any of the existing LSMs
> are executed via the bpf_prog_load() callback.  Reusing the existing
> bpf_prog_load() callback for integrity verification could result in LSMs
> not having access to the integrity verification results when asked to
> authorize the BPF program load in the bpf_prog_load() callback.
> 
> The new LSM hook, security_bpf_prog_load_post_integrity(), is intended
> to be called from within LSMs performing BPF program integrity
> verification.  It is used to report the verdict of the integrity
> verification to other LSMs enforcing access control policy on BPF
> program loads.  LSMs enforcing such access controls should register a
> bpf_prog_load_post_integrity() callback to receive integrity verdicts.
> 
> More information on these new callbacks and hook can be found in the
> code comments in this patch.
> 
> Signed-off-by: Blaise Boscaccy <bboscaccy@linux.microsoft.com>
> Link: https://lore.kernel.org/linux-security-module/CAHC9VhTQ_DR=ANzoDBjcCtrimV7XcCZVUsANPt=TjcvM4d-vjg@mail.gmail.com/
> Signed-off-by: Paul Moore <paul@paul-moore.com>
> ---
>  include/linux/lsm_hook_defs.h |  5 +++
>  include/linux/security.h      | 25 ++++++++++++
>  security/security.c           | 75 +++++++++++++++++++++++++++++++++--
>  3 files changed, 102 insertions(+), 3 deletions(-)
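
The ordering described in the commit message above can be modelled in
plain userspace C. This is only a sketch, not the kernel implementation:
every name starting with model_ is hypothetical, the two verdict values
are stand-ins for the real enum, and the "signature check" is a toy test
on the first byte. It shows that the integrity callback runs first and
reports its verdict to the policy LSM before that LSM's prog_load
callback is asked to authorize the load, even though the policy LSM is
registered first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical verdicts; the real enum lives in include/linux/security.h. */
enum model_verdict { MODEL_VERDICT_OK, MODEL_VERDICT_UNKNOWNKEY };

/* One entry per modelled LSM; unused callbacks stay NULL. */
struct model_lsm {
	int (*integrity)(const unsigned char *prog, size_t len);
	int (*post_integrity)(enum model_verdict verdict);
	int (*prog_load)(void);
};

static char trace[16];			/* records callback order */
static size_t trace_len;
static enum model_verdict seen_verdict;	/* state kept by the policy LSM */

static void record(char c)
{
	assert(trace_len < sizeof(trace) - 1);
	trace[trace_len++] = c;
}

static int model_post_integrity(enum model_verdict verdict);

/* Integrity LSM: toy check, then report the verdict to other LSMs. */
static int verifier_integrity(const unsigned char *prog, size_t len)
{
	enum model_verdict v;

	record('I');
	v = (len > 0 && prog[0] == 0x01) ? MODEL_VERDICT_OK
					 : MODEL_VERDICT_UNKNOWNKEY;
	return model_post_integrity(v);
}

/* Policy LSM: remember the verdict for the later authorization step. */
static int policy_post_integrity(enum model_verdict verdict)
{
	record('P');
	seen_verdict = verdict;
	return 0;
}

/* Policy LSM: authorize the load only if the verdict was OK. */
static int policy_prog_load(void)
{
	record('L');
	return seen_verdict == MODEL_VERDICT_OK ? 0 : -1;
}

/* The policy LSM is deliberately registered before the verifier. */
static const struct model_lsm model_lsms[] = {
	{ .post_integrity = policy_post_integrity, .prog_load = policy_prog_load },
	{ .integrity = verifier_integrity },
};
#define MODEL_NR_LSMS (sizeof(model_lsms) / sizeof(model_lsms[0]))

/* Models the new hook that integrity LSMs call to publish a verdict. */
static int model_post_integrity(enum model_verdict verdict)
{
	for (size_t i = 0; i < MODEL_NR_LSMS; i++) {
		if (model_lsms[i].post_integrity) {
			int rc = model_lsms[i].post_integrity(verdict);
			if (rc)
				return rc;
		}
	}
	return 0;
}

/* Models the existing load hook: integrity phase, then authorization. */
int model_bpf_prog_load(const unsigned char *prog, size_t len)
{
	memset(trace, 0, sizeof(trace));
	trace_len = 0;
	seen_verdict = MODEL_VERDICT_UNKNOWNKEY;

	for (size_t i = 0; i < MODEL_NR_LSMS; i++) {
		if (model_lsms[i].integrity) {
			int rc = model_lsms[i].integrity(prog, len);
			if (rc)
				return rc;
		}
	}
	for (size_t i = 0; i < MODEL_NR_LSMS; i++) {
		if (model_lsms[i].prog_load) {
			int rc = model_lsms[i].prog_load();
			if (rc)
				return rc;
		}
	}
	return 0;
}
```

Had the verifier reused the prog_load slot instead, the policy LSM at
index 0 would have been asked to authorize the load before any verdict
existed, which is the problem the dedicated callback avoids.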

Merged into lsm/dev, thanks.

--
paul-moore.com


* Re: [PATCH v7 3/10] crypto: pkcs7: add tests for pkcs7_get_authattr
From: Paul Moore @ 2026-05-13 18:36 UTC
  To: Blaise Boscaccy, Blaise Boscaccy, Jonathan Corbet, James Morris,
	Serge E. Hallyn, Mickaël Salaün, Günther Noack,
	Dr. David Alan Gilbert, Andrew Morton, James.Bottomley, dhowells,
	Fan Wu, Ryan Foster, Randy Dunlap, linux-security-module,
	linux-doc, linux-kernel, bpf, Song Liu
In-Reply-To: <20260507191416.2984054-4-bboscaccy@linux.microsoft.com>

On May  7, 2026 Blaise Boscaccy <bboscaccy@linux.microsoft.com> wrote:
> 
> Add example code to the test module pkcs7_key_type.c that verifies a
> message and then pulls out a known authenticated attribute.
> 
> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
> Signed-off-by: Blaise Boscaccy <bboscaccy@linux.microsoft.com>
> Acked-by: David Howells <dhowells@redhat.com>
> ---
>  crypto/asymmetric_keys/pkcs7_key_type.c | 44 ++++++++++++++++++++++++-
>  1 file changed, 43 insertions(+), 1 deletion(-)

Merged into lsm/dev, thanks.

--
paul-moore.com


* Re: [PATCH v7 2/10] crypto: pkcs7: add ability to extract signed attributes by OID
From: Paul Moore @ 2026-05-13 18:36 UTC
  To: Blaise Boscaccy, Blaise Boscaccy, Jonathan Corbet, James Morris,
	Serge E. Hallyn, Mickaël Salaün, Günther Noack,
	Dr. David Alan Gilbert, Andrew Morton, James.Bottomley, dhowells,
	Fan Wu, Ryan Foster, Randy Dunlap, linux-security-module,
	linux-doc, linux-kernel, bpf, Song Liu
In-Reply-To: <20260507191416.2984054-3-bboscaccy@linux.microsoft.com>

On May  7, 2026 Blaise Boscaccy <bboscaccy@linux.microsoft.com> wrote:
> 
> Signers may add any information they like in signed attributes and
> sometimes this information turns out to be relevant to specific
> signing cases, so add an API pkcs7_get_authattr() to extract the value
> of an authenticated attribute by specific OID.  The current
> implementation is designed for the single signer use case and simply
> terminates the search when it finds the relevant OID.
> 
> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
> Signed-off-by: Blaise Boscaccy <bboscaccy@linux.microsoft.com>
> ---
>  crypto/asymmetric_keys/Makefile       |  4 +-
>  crypto/asymmetric_keys/pkcs7_aa.asn1  | 18 ++++++
>  crypto/asymmetric_keys/pkcs7_parser.c | 81 +++++++++++++++++++++++++++
>  include/crypto/pkcs7.h                |  4 ++
>  4 files changed, 106 insertions(+), 1 deletion(-)
>  create mode 100644 crypto/asymmetric_keys/pkcs7_aa.asn1
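
The "terminate the search at the first matching OID" behaviour described
above can be illustrated with a small userspace model. This is a sketch
under stated assumptions: struct model_attr and model_get_authattr() are
hypothetical names, attributes are kept as dotted-OID strings rather than
parsed ASN.1, and the real pkcs7_get_authattr() signature is not
reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flattened view of one signed (authenticated) attribute. */
struct model_attr {
	const char *oid;	/* dotted-decimal OID */
	const char *value;	/* attribute payload, as a string for brevity */
};

/*
 * Return the first attribute whose OID matches, or NULL if none does.
 * The scan stops at the first hit, which is only unambiguous when the
 * message carries a single signer's attributes.
 */
const struct model_attr *model_get_authattr(const struct model_attr *attrs,
					    size_t count, const char *oid)
{
	assert(attrs != NULL && oid != NULL);

	for (size_t i = 0; i < count; i++) {
		if (strcmp(attrs[i].oid, oid) == 0)
			return &attrs[i];
	}
	return NULL;
}

/* Sample data: contentType once, messageDigest twice (two "signers"). */
static const struct model_attr sample_attrs[] = {
	{ "1.2.840.113549.1.9.3", "content-type" },
	{ "1.2.840.113549.1.9.4", "digest-A" },
	{ "1.2.840.113549.1.9.4", "digest-B" },
};
```

With the sample data, a lookup for the messageDigest OID returns the
"digest-A" entry and never reaches "digest-B", which is why the commit
message limits the current design to the single signer case.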

Merged into lsm/dev, thanks.

--
paul-moore.com


* Re: [PATCH v7 1/10] crypto: pkcs7: add flag for validated trust on a signed info block
From: Paul Moore @ 2026-05-13 18:36 UTC
  To: Blaise Boscaccy, Blaise Boscaccy, Jonathan Corbet, James Morris,
	Serge E. Hallyn, Mickaël Salaün, Günther Noack,
	Dr. David Alan Gilbert, Andrew Morton, James.Bottomley, dhowells,
	Fan Wu, Ryan Foster, Randy Dunlap, linux-security-module,
	linux-doc, linux-kernel, bpf, Song Liu
In-Reply-To: <20260507191416.2984054-2-bboscaccy@linux.microsoft.com>

On May  7, 2026 Blaise Boscaccy <bboscaccy@linux.microsoft.com> wrote:
> 
> Allow consumers of struct pkcs7_message to tell if any of the sinfo
> fields has passed a trust validation.  Note that this does not happen
> in parsing, pkcs7_validate_trust() must be explicitly called or called
> via validate_pkcs7_trust().  Since the way to get this trusted pkcs7
> object is via verify_pkcs7_message_sig, export that so modules can use
> it.
> 
> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
> Signed-off-by: Blaise Boscaccy <bboscaccy@linux.microsoft.com>
> ---
>  certs/system_keyring.c                | 1 +
>  crypto/asymmetric_keys/pkcs7_parser.h | 1 +
>  crypto/asymmetric_keys/pkcs7_trust.c  | 1 +
>  3 files changed, 3 insertions(+)

Merged into lsm/dev, thanks.

--
paul-moore.com


* Re: [PATCH] Documentation: KVM: Document guest-visible compatibility expectations
From: David Woodhouse @ 2026-05-13 18:26 UTC
  To: Paolo Bonzini
  Cc: Marc Zyngier, Jonathan Corbet, Shuah Khan, kvm,
	Linux Doc Mailing List, Kernel Mailing List, Linux,
	Sean Christopherson, Jim Mattson, Oliver Upton, Joey Gouly,
	Suzuki K Poulose, Zenghui Yu, Catalin Marinas, Will Deacon,
	Raghavendra Rao Ananta, Eric Auger, Kees Cook, Arnd Bergmann,
	Nathan Chancellor, linux-arm-kernel, kvmarm, linux-kselftest
In-Reply-To: <CABgObfaM-JtNn2MuYXaiadQnLfAhTEaoHAcTG9=J6LkMcQCJ3A@mail.gmail.com>


On Wed, 2026-05-13 at 18:24 +0200, Paolo Bonzini wrote:
> On Wed, 13 May 2026, 15:57 David Woodhouse <dwmw2@infradead.org> wrote:
> > > x86 doesn't do bug-for-bug compatibility, thankfully - we have quirks
> > > but only 11 of them, or about one per year since we started adding them.
> > > We only add quirks, generally speaking, when 1) we change the way file
> > > descriptors are initialized, 2) guests in the wild were relying on it,
> > > or 3) it prevents restoring state saved from an old kernel.  Is there
> > > anything else?
> > > 
> > > https://lore.kernel.org/kvm/e03f092dfbb7d391a6bf2797ba01e122ba080bcd.camel@infradead.org/
> > > is an example of a bug that "no SW can make any reasonable use of".
> > 
> > I actually believe that the focus on ICEBP was triggered by some weird
> > gaming software's anti-DRM mechanism, and that it *did* affect actual
> > guests in the wild?
> > 
> > But yeah, *fixing* it should not have any adverse effects. That's the
> > key.
> 
> Yep, so "bug for bug" is not it.

Of course. I'm not discriminating between 'bugs' and 'features'. In
this context I only care about guest-visible behaviour changes,
whatever the reason.

What I said was:
> > > > Once a behaviour is present in a released version of Linux/KVM, we
> > > > can't just declare it "wrong" and unilaterally impose a change in
> > > > guest-visible behaviour on *running* guests as a side-effect of a
> > > > kernel upgrade.

And yes, you're technically right to challenge that phrasing of it. It
does need the additional caveat of "...unless we are sure that changing
it in either direction underneath running guests cannot cause
problems", as discussed. That's the key for the ICEBP thing.

> > 
> > > And besides, both miss the point of *configurability* which is the basis of
> > > it all.
> > 
> > Hm, configurability *is* the point, I thought.
> 
> Yes, and configurability goes way beyond bugs/quirks, which are to
> some extent a red herring. Configurability for example says that "KVM:
> arm64: vgic: Allow userspace to set IIDR revision 1" shouldn't be
> controversial at all.

Indeed it shouldn't. And yet here we are.

> > > So we have the third case, "restoring state saved from an old kernel".
> > > If this case arises, I do believe that Arm will have to deal with it and
> > > introduce quirks or KVM_GET/SET_REG hacks.  Maybe it hasn't happened
> > > yet, lucky you.
> > 
> > We literally have those mechanisms already.
> 
> I am not talking about guest-visible changes across save/restore here,
> but rather about round-trips through userspace. For example, see the
> effect of KVM_X2APIC_API_USE_32BIT_IDS on KVM_GET/SET_LAPIC: it
> couldn't be made the default, because userspace expects to take old
> data returned by KVM_GET_LAPIC and shove it into KVM_SET_LAPIC. Sucks
> but can't be avoided.

Yes, you're right. And I fully expect and trust x86 to get that right
and not break existing userspace in any way at all.

But honestly, the bar for Arm is so low right now that anything I
physically *can* work around in userspace, I'm prepared to tolerate.

If KVM/arm did the equivalent of just changing the KVM_[SG]ET_LAPIC
data without the KVM_X2APIC_API_USE_32BIT_IDS trick, I wouldn't even
bat an eyelid; I'd just accommodate it and move on.

> > See commit https://git.kernel.org/torvalds/c/49a1a2c70a7f which adds a
> > new guest-visible feature in revision 3, but allowed userspace to
> > restore the old behaviour by setting it to revision 2. All my patch
> > above does is make it possible to set it to revision 1 as
> > well. Because https://git.kernel.org/torvalds/c/d53c2c29ae0d previously
> > changed the behaviour and bumped the default to 2 *without* allowing
> > userspace to restore the prior behaviour, and we've been carrying a
> > *revert* of that patch.
> > 
> > Why would we *not* accept such a patch?
> 
> Agreed. Even ignoring your revert, there's no reason why any upgrade
> past 49a1a2c70a7f has to be from after d53c2c29ae0d.
> 
> > Marc seems terribly insistent that we SHOULD NOT
> > restore the behaviour that older KVM offered to guests, and we MUST
> > change it unconditionally underneath running guests, making these
> > registers writable on upgrade... and reverting them to read-only for
> > running guests on a rollback.
> > 
> > And there we do have a very different viewpoint.
> 
> That's the design decision I mentioned, of not starting the guest
> configuration from a clean slate. I believe it complicates things
> because you have to design from the beginning with the ability to
> rollback to old versions and to potentially detect conflicts
> introduced by the rollback. This is exactly why
> KVM_X86_QUIRK_STUFF_FEATURE_MSRS was introduced: "KVM's initialization
> of feature MSRs during vCPU creation results in a failed save/restore
> of PERF_CAPABILITIES. If userspace configures the VM to _not_ have a
> PMU, because KVM initializes the vCPU's PERF_CAPABILITIES, trying to
> save/restore the non-zero value will be rejected by the destination."
> (https://lkml.org/lkml/2024/8/2/1032)

No, I don't think this is like that. In that case, IIUC it was at least
*possible* for userspace to manually filter out capabilities and adjust
things. But it kind of sucked if we *made* userspace do that and broke
things for existing userspace, so of *course* x86 did better.

I'm not even *dreaming* about a world where KVM/arm meets that bar.

> For Arm, however, it may be too late to change it; if not, I'll
> happily watch you argue with Marc about it. 

I'm not even going to try. You're right that it's the better option,
and it most certainly *isn't* too late for Arm to choose to be a stable
and mature platform providing continuity to userspace like x86 does.

But we are *so* far from that right now; we're fighting even to have
the *possibility* for userspace to remain compatible — even if
userspace *is* updated to know everything that the latest kernel
changed underneath it.

> But even without that,
> this doc patch (and the idea that "Where a new kernel introduces a
> guest-visible change, it provides a mechanism for userspace to select
> the previous behaviour") should be uncontroversial.

Indeed. And again, if you really want then you can add the caveat
discussed above, "unless you're really sure it won't make *any*
difference to the zoo of possible guests running Linux, Windows,
FreeBSD, or any number of random home-grown or network appliance
operating systems".

Although I didn't think it really needed spelling out in the doc, just
as I didn't think it needed spelling out earlier today (although you
called my sentence nonsense purely because it lacked that obvious
caveat, AFAICT).




* [PATCH v3] docs: reporting-issues: replace "these advices" with "all of this advice"
From: Chen-Shi-Hong @ 2026-05-13 17:39 UTC
  To: linux; +Cc: corbet, skhan, linux-doc, linux-kernel, Chen-Shi-Hong
In-Reply-To: <20260512150431.894-1-eric039eric@gmail.com>

"Advice" is an uncountable noun, so "these advices" is grammatically
incorrect.

Replace it with "all of this advice" instead, which keeps the sentence
grammatical while also making it clear that it refers to the full set of
recommendations in the paragraph.

Signed-off-by: Chen-Shi-Hong <eric039eric@gmail.com>

v3:
- resend against the original base as requested
- replace "these advices" directly with "all of this advice"

v2:
- use "all of this advice" based on review feedback

---
 Documentation/admin-guide/reporting-issues.rst | 4 ++--

 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/reporting-issues.rst b/Documentation/admin-guide/reporting-issues.rst
index 16a66a1f1975..87dd874fffcf 100644
--- a/Documentation/admin-guide/reporting-issues.rst
+++ b/Documentation/admin-guide/reporting-issues.rst
@@ -129,7 +129,7 @@ After these preparations you'll now enter the main part:
    situations; during the merge window that actually might be even the best
    approach, but in that development phase it can be an even better idea to
    suspend your efforts for a few days anyway. Whatever version you choose,
-   ideally use a 'vanilla' build. Ignoring these advices will dramatically
+   ideally use a 'vanilla' build. Ignoring all of this advice will dramatically
    increase the risk your report will be rejected or ignored.
 
  * Ensure the kernel you just installed does not 'taint' itself when
@@ -795,7 +795,7 @@ Install a fresh kernel for testing
     situations; during the merge window that actually might be even the best
     approach, but in that development phase it can be an even better idea to
     suspend your efforts for a few days anyway. Whatever version you choose,
-    ideally use a 'vanilla' built. Ignoring these advices will dramatically
+    ideally use a 'vanilla' built. Ignoring all of this advice will dramatically
     increase the risk your report will be rejected or ignored.*
 
 As mentioned in the detailed explanation for the first step already: Like most
-- 
2.53.0



* Re: [PATCH v7 13/20] KVM: arm64: Apply dynamic guest counter reservations
From: Colton Lewis @ 2026-05-13 16:45 UTC
  To: James Clark
  Cc: alexandru.elisei, pbonzini, corbet, linux, catalin.marinas, will,
	maz, oliver.upton, mizhang, joey.gouly, suzuki.poulose, yuzenghui,
	mark.rutland, shuah, gankulkarni, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm, linux-perf-users, linux-kselftest, kvm
In-Reply-To: <e2a7679d-e61a-43ac-a1d7-72f7e815c400@linaro.org>

James Clark <james.clark@linaro.org> writes:

> On 04/05/2026 10:18 pm, Colton Lewis wrote:
>> Apply dynamic guest counter reservations by checking if the requested
>> guest mask collides with any events the host has scheduled and calling
>> perf_pmu_resched_update() with a hook that updates the mask of
>> available counters in between schedule out and schedule in.

>> Signed-off-by: Colton Lewis <coltonlewis@google.com>
>> ---
>>    arch/arm64/kvm/pmu-direct.c  | 69 ++++++++++++++++++++++++++++++++++++
>>    include/linux/perf/arm_pmu.h |  1 +
>>    2 files changed, 70 insertions(+)

>> diff --git a/arch/arm64/kvm/pmu-direct.c b/arch/arm64/kvm/pmu-direct.c
>> index 2252d3b905db9..14cc419dbafad 100644
>> --- a/arch/arm64/kvm/pmu-direct.c
>> +++ b/arch/arm64/kvm/pmu-direct.c
>> @@ -100,6 +100,73 @@ u8 kvm_pmu_hpmn(struct kvm_vcpu *vcpu)
>>    	return *host_data_ptr(nr_event_counters);
>>    }

>> +/* Callback to update counter mask between perf scheduling */
>> +static void kvm_pmu_update_mask(struct pmu *pmu, void *data)
>> +{
>> +	struct arm_pmu *arm_pmu = to_arm_pmu(pmu);
>> +	unsigned long *new_mask = data;
>> +
>> +	bitmap_copy(arm_pmu->cntr_mask, new_mask, ARMPMU_MAX_HWEVENTS);
>> +}
>> +
>> +/**
>> + * kvm_pmu_set_guest_counters() - Handle dynamic counter reservations
>> + * @cpu_pmu: struct arm_pmu to potentially modify
>> + * @guest_mask: new guest mask for the pmu
>> + *
>> + * Check if guest counters will interfere with current host events and
>> + * call into perf_pmu_resched_update if a reschedule is required.
>> + */
>> +static void kvm_pmu_set_guest_counters(struct arm_pmu *cpu_pmu, u64 guest_mask)
>> +{
>> +	struct pmu_hw_events *cpuc = this_cpu_ptr(cpu_pmu->hw_events);
>> +	DECLARE_BITMAP(guest_bitmap, ARMPMU_MAX_HWEVENTS);
>> +	DECLARE_BITMAP(new_mask, ARMPMU_MAX_HWEVENTS);
>> +	bool need_resched = false;
>> +
>> +	bitmap_from_arr64(guest_bitmap, &guest_mask, ARMPMU_MAX_HWEVENTS);
>> +	bitmap_copy(new_mask, cpu_pmu->hw_cntr_mask, ARMPMU_MAX_HWEVENTS);
>> +
>> +	if (guest_mask) {
>> +		/* Subtract guest counters from available host mask */
>> +		bitmap_andnot(new_mask, new_mask, guest_bitmap, ARMPMU_MAX_HWEVENTS);
>> +
>> +		/* Did we collide with an active host event? */
>> +		if (bitmap_intersects(cpuc->used_mask, guest_bitmap, ARMPMU_MAX_HWEVENTS)) {
>> +			int idx;
>> +
>> +			need_resched = true;
>> +			cpuc->host_squeezed = true;
>> +
>> +			/* Look for pinned events that are about to be preempted */
>> +			for_each_set_bit(idx, guest_bitmap, ARMPMU_MAX_HWEVENTS) {
>> +				if (test_bit(idx, cpuc->used_mask) && cpuc->events[idx] &&
>> +				    cpuc->events[idx]->attr.pinned) {
>> +					pr_warn_ratelimited("perf: Pinned host event squeezed out by KVM guest PMU partition\n");

> Hi Colton,

> I get "perf: Pinned host event squeezed out by KVM guest PMU partition"
> even with arm_pmuv3.reserved_host_counters=3 for example. I would have
> expected any non zero value to stop the warning.

> I think armv8pmu_get_single_idx() needs to be changed to allocate from
> the high end host counters first. A more complicated option would be
> checking to see if there are any non-pinned counters in the host
> reserved half when a new pinned counter is opened, then swapping the
> places of the new pinned and existing non-pinned counters so pinned
> always prefer being put into the host half. But it's probably not worth
> doing that.

> James


I agree it makes the most sense to allocate from the top, but I'm happy
the basic idea works.
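
The "allocate from the top" idea can be sketched in userspace C as
follows. This is an illustration only, not a proposed change to
armv8pmu_get_single_idx(): the function name is hypothetical, a plain
64-bit word stands in for the kernel bitmaps, and it assumes the
host-reserved counters sit at the high indices, as the suggestion above
implies.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical allocator: claim the highest-numbered counter that is
 * implemented (set in cntr_mask) and not already in use. Returns the
 * index, or -1 when no counter is free.
 */
int model_get_idx_high_first(uint64_t cntr_mask, uint64_t *used_mask)
{
	assert(used_mask != 0);

	for (int idx = 63; idx >= 0; idx--) {
		uint64_t bit = UINT64_C(1) << idx;

		if ((cntr_mask & bit) && !(*used_mask & bit)) {
			*used_mask |= bit;	/* mark the counter as taken */
			return idx;
		}
	}
	return -1;
}
```

With six counters (mask 0x3f) and a guest that later claims the low
three (mask 0x07), the first host event lands on index 5 and so stays
clear of the guest range until the host needs more than three counters.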

>> +					break;
>> +				}
>> +			}
>> +		}
>> +	} else {
>> +		/*
>> +		 * Restoring to hw_cntr_mask.
>> +		 * Only resched if we previously squeezed an event.
>> +		 */
>> +		if (cpuc->host_squeezed) {
>> +			need_resched = true;
>> +			cpuc->host_squeezed = false;
>> +		}
>> +	}
>> +
>> +	if (need_resched) {
>> +		/* Collision: run full perf reschedule */
>> +		perf_pmu_resched_update(&cpu_pmu->pmu, kvm_pmu_update_mask, new_mask);
>> +	} else {
>> +		/* Host was never using guest counters anyway */
>> +		bitmap_copy(cpu_pmu->cntr_mask, new_mask, ARMPMU_MAX_HWEVENTS);
>> +	}
>> +}
>> +
>>    /**
>>     * kvm_pmu_host_counter_mask() - Compute bitmask of host-reserved counters
>>     * @pmu: Pointer to arm_pmu struct
>> @@ -218,6 +285,7 @@ void kvm_pmu_load(struct kvm_vcpu *vcpu)

>>    	pmu = vcpu->kvm->arch.arm_pmu;
>>    	guest_counters = kvm_pmu_guest_counter_mask(pmu);
>> +	kvm_pmu_set_guest_counters(pmu, guest_counters);
>>    	kvm_pmu_apply_event_filter(vcpu);

>>    	for_each_set_bit(i, &guest_counters, ARMPMU_MAX_HWEVENTS) {
>> @@ -319,5 +387,6 @@ void kvm_pmu_put(struct kvm_vcpu *vcpu)
>>    	val = read_sysreg(pmintenset_el1);
>>    	__vcpu_assign_sys_reg(vcpu, PMINTENSET_EL1, val & mask);

>> +	kvm_pmu_set_guest_counters(pmu, 0);
>>    	preempt_enable();
>>    }
>> diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
>> index f7b000bb3eca8..63f88fec5e80f 100644
>> --- a/include/linux/perf/arm_pmu.h
>> +++ b/include/linux/perf/arm_pmu.h
>> @@ -75,6 +75,7 @@ struct pmu_hw_events {

>>    	/* Active events requesting branch records */
>>    	unsigned int		branch_users;
>> +	bool host_squeezed;
>>    };

>>    enum armpmu_attr_groups {
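
The mask arithmetic in kvm_pmu_set_guest_counters() above can be
followed with a small userspace model. This is a sketch, not the patch
itself: struct model_pmu and model_set_guest_counters() are hypothetical
names, a 64-bit word replaces the kernel bitmap helpers, and the perf
reschedule is reduced to a boolean return value instead of a call to
perf_pmu_resched_update().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the arm_pmu and pmu_hw_events state used. */
struct model_pmu {
	uint64_t hw_cntr_mask;	/* counters implemented in hardware */
	uint64_t cntr_mask;	/* counters the host may allocate right now */
	uint64_t used_mask;	/* counters holding active host events */
	bool host_squeezed;	/* a host event was displaced earlier */
};

/*
 * Apply a guest reservation (guest_mask != 0) or drop it (guest_mask == 0).
 * Returns true when the real code would need a full perf reschedule.
 */
bool model_set_guest_counters(struct model_pmu *pmu, uint64_t guest_mask)
{
	uint64_t new_mask = pmu->hw_cntr_mask;
	bool need_resched = false;

	if (guest_mask) {
		/* Subtract guest counters from the host's available set. */
		new_mask &= ~guest_mask;

		/* Collision: an active host event sits on a guest counter. */
		if (pmu->used_mask & guest_mask) {
			need_resched = true;
			pmu->host_squeezed = true;
		}
	} else if (pmu->host_squeezed) {
		/* Restoring the full mask after a squeeze. */
		need_resched = true;
		pmu->host_squeezed = false;
	}

	/* The model updates the mask directly on both paths. */
	pmu->cntr_mask = new_mask;
	return need_resched;
}
```

One difference worth noting: the model does not move host events, so
used_mask is left unchanged after a collision, whereas the reschedule in
the patch would place displaced events onto the remaining counters.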


* Re: [RFC PATCH v3] bpf: introduce TAINT_UNSAFE_BPF for mutating helpers
From: Steven Rostedt @ 2026-05-13 16:41 UTC
  To: Alexei Starovoitov
  Cc: Aaron Tomlin, Jonathan Corbet, Song Liu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Eduard,
	Kumar Kartikeya Dwivedi, Masami Hiramatsu, Shuah Khan, Jiri Olsa,
	Martin KaFai Lau, Yonghong Song, Mathieu Desnoyers, Randy Dunlap,
	neelx, sean, chjohnst, steve, mproche, nick.lange,
	open list:DOCUMENTATION, LKML, bpf, linux-trace-kernel
In-Reply-To: <CAADnVQLw+_NaOVeaKabuf085wNo_-6MAv8w0EDO3fBz3KCQT5g@mail.gmail.com>

On Wed, 13 May 2026 09:35:29 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Wed, May 13, 2026 at 8:23 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Wed, 13 May 2026 08:16:07 -0700
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >  
> > > It's impossible to track all modifications.
> > > See what sched-ext is doing.
> > > What does it modify? Everything.  
> >
> > What about just having a list of what BPF programs are loaded, what they
> > may be attached to, and what kfuncs they are calling?  
> 
> Ohh. These have been available forever.
> Just bpftool prog, bpftool link, bpftool prog dump xlated

Ah thanks. That is useful.

-- Steve


* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: T.J. Mercier @ 2026-05-13 16:39 UTC
  To: Albert Esteve
  Cc: Christian König, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <CADSE00KZMJFYJ92XZa=r9EeJJRGT=SNChwOW-_jTznc7F79xGw@mail.gmail.com>

On Wed, May 13, 2026 at 5:41 AM Albert Esteve <aesteve@redhat.com> wrote:
>
> On Tue, May 12, 2026 at 12:14 PM Christian König
> <christian.koenig@amd.com> wrote:
> >
> > On 5/12/26 11:10, Albert Esteve wrote:
> > > On embedded platforms a central process often allocates dma-buf
> > > memory on behalf of client applications. Without a way to
> > > attribute the charge to the requesting client's cgroup, the
> > > cost lands on the allocator, making per-cgroup memory limits
> > > ineffective for the actual consumers.
> > >
> > > Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> > > a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
> > > memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
> > > inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
> > > the mem_accounting module parameter enabled, the buffer is charged
> > > to the allocator's own cgroup.
> > >
> > > Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
> > > system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
> > > page allocations. Keeping __GFP_ACCOUNT would charge the same pages
> > > twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
> > > all accounting through a single MEMCG_DMABUF path.
> > >
> > > Usage examples:
> > >
> > >   1. Central allocator charging to a client at allocation time.
> > >      The allocator knows the client's PID (e.g., from binder's
> > >      sender_pid) and uses pidfd to attribute the charge:
> > >
> > >        pid_t client_pid = txn->sender_pid;
> > >        int pidfd = pidfd_open(client_pid, 0);
> > >
> > >        struct dma_heap_allocation_data alloc = {
> > >            .len             = buffer_size,
> > >            .fd_flags        = O_RDWR | O_CLOEXEC,
> > >            .charge_pid_fd   = pidfd,
> > >        };
> > >        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
> > >        close(pidfd);
> > >        /* alloc.fd is now charged to client's cgroup */
> > >
> > >   2. Default allocation (no pidfd, mem_accounting=1).
> > >      When charge_pid_fd is not set and the mem_accounting module
> > >      parameter is enabled, the buffer is charged to the allocator's
> > >      own cgroup:
> > >
> > >        struct dma_heap_allocation_data alloc = {
> > >            .len      = buffer_size,
> > >            .fd_flags = O_RDWR | O_CLOEXEC,
> > >        };
> > >        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
> > >        /* charged to current process's cgroup */
> > >
> > > Current limitations:
> > >
> > > >  - Single-owner model: a dma-buf carries one memcg charge regardless of
> > > >    how many processes share it. This means only the first owner (and
> > > >    exporter) of the shared buffer bears the charge.
> > >  - Only memcg accounting supported. While this makes sense for system
> > >    heap buffers, other heaps (e.g., CMA heaps) will require selectively
> > >    charging also for the dmem controller.
> >
> > Well, that doesn't look so bad; it at least seems to tackle the problem at hand for Android and some other embedded use cases.
> >
> > I'm just not sure if this is future proof and will work for all use cases, e.g. cloud gaming, native context for automotive etc...
> >
> > Essentially the problem boils down to two limitations:
> > 1) a piece of memory can only be charged to one cgroup; the framework doesn't have a concept of charging shared memory to multiple groups
> > 2) when memory references in the form of file descriptors are passed between applications we have no way of changing the accounting to a different cgroup
> >
> > The passing of the memory reference already has a well defined uAPI, and if we could solve those two limitations we would not only solve the problem without introducing new uAPI (with potential new security risks) but also solve it for all other use cases which use file descriptors as well, e.g. memfd, accel and GPU drivers etc...
>
> Honestly, adding a hook to fd-passing uAPI to manage charge transfers
> sounds like a promising solution requiring no uAPI changes. However,
> it still does not cover all paths, e.g., dup() or fork(). And shared
> memory sounds like a hard one to tackle, where deciding the best
> policy is more a per-usecase thing and would probably require
> userspace configuration.

I'm curious whether anyone knows of a use case where FDs aren't involved at
all. It's possible to fork() or clone() while holding only a dmabuf mapping
and no FD. That sounds strange, and I'm not sure there's a real use case
for transferring ownership that way, but I figured I'd at least pose the
question.
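
To make the scenario concrete, here is a minimal userspace sketch. It
assumes an unlinked temporary file as a stand-in for a dma-buf (a real test
would mmap() an fd returned by DMA_HEAP_IOCTL_ALLOC), and the helper name
mapping_survives_without_fd() is made up for illustration. It maps the
file, closes the FD, forks, and checks that the child can still write
through the inherited mapping:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a forked child can write through a
 * MAP_SHARED mapping whose backing fd was closed before fork(), 0 if
 * not, and -1 on setup failure. An unlinked temp file stands in for a
 * dma-buf here.
 */
static int mapping_survives_without_fd(void)
{
	char path[] = "/tmp/dmabuf-stand-in-XXXXXX";
	const size_t len = 4096;
	volatile unsigned char *map;
	int status, ok;
	pid_t pid;
	int fd;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);			/* the file lives on via the fd */

	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return -1;
	}

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);			/* from here on, no FD refers to the buffer */
	if (map == MAP_FAILED)
		return -1;

	map[0] = 0;
	pid = fork();
	if (pid < 0) {
		munmap((void *)map, len);
		return -1;
	}
	if (pid == 0) {
		map[0] = 42;		/* child holds only the inherited mapping */
		_exit(0);
	}

	if (waitpid(pid, &status, 0) < 0) {
		munmap((void *)map, len);
		return -1;
	}
	/* The parent sees the child's write because the mapping is shared. */
	ok = WIFEXITED(status) && WEXITSTATUS(status) == 0 && map[0] == 42;
	munmap((void *)map, len);
	return ok;
}
```

If this behaves the same for a real dma-buf mapping, a hook that only
watches FD passing would never see the child, which is the gap the
question is about.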

> All in all, charge_pid_fd covers a
> well-defined and immediately practical subset. The UAPI cost is small
> and the mechanism is explicit about what it does and doesn't solve. A
> general solution, if it ever converges, would likely supersede
> charge_pid_fd for most cases, which is a fine outcome if it solves the
> problem more completely.
>
> Either way, if you have a specific approach in mind for solving any of
> the above limitations, I'd be happy to look into it further.
>
> BR,
> Albert.
>
> >
> > > On the other hand, it is really nice to finally see this tackled for at least DMA-buf heaps. On the GPU side I saw yet another attempt at a driver doing some kind of special driver-specific accounting to solve this just a few weeks ago. And to be honest, such single-driver island approaches tend to break more often than they work correctly.
> >
> > Regards,
> > Christian.
> >
> > >
> > > Signed-off-by: Albert Esteve <aesteve@redhat.com>
> > > ---
> > >  Documentation/admin-guide/cgroup-v2.rst |  5 ++--
> > >  drivers/dma-buf/dma-buf.c               | 16 ++++---------
> > >  drivers/dma-buf/dma-heap.c              | 42 ++++++++++++++++++++++++++++++---
> > >  drivers/dma-buf/heaps/system_heap.c     |  2 --
> > >  include/uapi/linux/dma-heap.h           |  6 +++++
> > >  5 files changed, 53 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > > index 8bdbc2e866430..824d269531eb1 100644
> > > --- a/Documentation/admin-guide/cgroup-v2.rst
> > > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > > @@ -1636,8 +1636,9 @@ The following nested keys are defined.
> > >               structures.
> > >
> > >         dmabuf (npn)
> > > -             Amount of memory used for exported DMA buffers allocated by the cgroup.
> > > -             Stays with the allocating cgroup regardless of how the buffer is shared.
> > > +             Amount of memory used for exported DMA buffers allocated by or on
> > > +             behalf of the cgroup. Stays with the allocating cgroup regardless
> > > +             of how the buffer is shared.
> > >
> > >         workingset_refault_anon
> > >               Number of refaults of previously evicted anonymous pages.
> > > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > > index ce02377f48908..23fb758b78297 100644
> > > --- a/drivers/dma-buf/dma-buf.c
> > > +++ b/drivers/dma-buf/dma-buf.c
> > > @@ -181,8 +181,11 @@ static void dma_buf_release(struct dentry *dentry)
> > >        */
> > >       BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
> > >
> > > -     mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
> > > -     mem_cgroup_put(dmabuf->memcg);
> > > +     if (dmabuf->memcg) {
> > > +             mem_cgroup_uncharge_dmabuf(dmabuf->memcg,
> > > +                                       PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
> > > +             mem_cgroup_put(dmabuf->memcg);
> > > +     }
> > >
> > >       dmabuf->ops->release(dmabuf);
> > >
> > > @@ -764,13 +767,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
> > >               dmabuf->resv = resv;
> > >       }
> > >
> > > -     dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);
> > > -     if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,
> > > -                                   GFP_KERNEL)) {
> > > -             ret = -ENOMEM;
> > > -             goto err_memcg;
> > > -     }
> > > -
> > >       file->private_data = dmabuf;
> > >       file->f_path.dentry->d_fsdata = dmabuf;
> > >       dmabuf->file = file;
> > > @@ -781,8 +777,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
> > >
> > >       return dmabuf;
> > >
> > > -err_memcg:
> > > -     mem_cgroup_put(dmabuf->memcg);
> > >  err_file:
> > >       fput(file);
> > >  err_module:
> > > diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
> > > index ac5f8685a6494..ff6e259afcdc0 100644
> > > --- a/drivers/dma-buf/dma-heap.c
> > > +++ b/drivers/dma-buf/dma-heap.c
> > > @@ -7,13 +7,17 @@
> > >   */
> > >
> > >  #include <linux/cdev.h>
> > > +#include <linux/cgroup.h>
> > >  #include <linux/device.h>
> > >  #include <linux/dma-buf.h>
> > >  #include <linux/dma-heap.h>
> > > +#include <linux/memcontrol.h>
> > > +#include <linux/sched/mm.h>
> > >  #include <linux/err.h>
> > >  #include <linux/export.h>
> > >  #include <linux/list.h>
> > >  #include <linux/nospec.h>
> > > +#include <linux/pidfd.h>
> > >  #include <linux/syscalls.h>
> > >  #include <linux/uaccess.h>
> > >  #include <linux/xarray.h>
> > > @@ -55,10 +59,12 @@ MODULE_PARM_DESC(mem_accounting,
> > >                "Enable cgroup-based memory accounting for dma-buf heap allocations (default=false).");
> > >
> > >  static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
> > > -                              u32 fd_flags,
> > > -                              u64 heap_flags)
> > > +                              u32 fd_flags, u64 heap_flags,
> > > +                              struct mem_cgroup *charge_to)
> > >  {
> > >       struct dma_buf *dmabuf;
> > > +     unsigned int nr_pages;
> > > +     struct mem_cgroup *memcg = charge_to;
> > >       int fd;
> > >
> > >       /*
> > > @@ -73,6 +79,22 @@ static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
> > >       if (IS_ERR(dmabuf))
> > >               return PTR_ERR(dmabuf);
> > >
> > > +     nr_pages = len / PAGE_SIZE;
> > > +
> > > +     if (memcg)
> > > +             css_get(&memcg->css);
> > > +     else if (mem_accounting)
> > > +             memcg = get_mem_cgroup_from_mm(current->mm);
> > > +
> > > +     if (memcg) {
> > > +             if (!mem_cgroup_charge_dmabuf(memcg, nr_pages, GFP_KERNEL)) {
> > > +                     mem_cgroup_put(memcg);
> > > +                     dma_buf_put(dmabuf);
> > > +                     return -ENOMEM;
> > > +             }
> > > +             dmabuf->memcg = memcg;
> > > +     }
> > > +
> > >       fd = dma_buf_fd(dmabuf, fd_flags);
> > >       if (fd < 0) {
> > >               dma_buf_put(dmabuf);
> > > @@ -102,6 +124,9 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
> > >  {
> > >       struct dma_heap_allocation_data *heap_allocation = data;
> > >       struct dma_heap *heap = file->private_data;
> > > +     struct mem_cgroup *memcg = NULL;
> > > +     struct task_struct *task;
> > > +     unsigned int pidfd_flags;
> > >       int fd;
> > >
> > >       if (heap_allocation->fd)
> > > @@ -113,9 +138,20 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
> > >       if (heap_allocation->heap_flags & ~DMA_HEAP_VALID_HEAP_FLAGS)
> > >               return -EINVAL;
> > >
> > > +     if (heap_allocation->charge_pid_fd) {
> > > +             task = pidfd_get_task(heap_allocation->charge_pid_fd, &pidfd_flags);
> > > +             if (IS_ERR(task))
> > > +                     return PTR_ERR(task);
> > > +
> > > +             memcg = get_mem_cgroup_from_mm(task->mm);
> > > +             put_task_struct(task);
> > > +     }
> > > +
> > >       fd = dma_heap_buffer_alloc(heap, heap_allocation->len,
> > >                                  heap_allocation->fd_flags,
> > > -                                heap_allocation->heap_flags);
> > > +                                heap_allocation->heap_flags,
> > > +                                memcg);
> > > +     mem_cgroup_put(memcg);
> > >       if (fd < 0)
> > >               return fd;
> > >
> > > diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> > > index 03c2b87cb1112..95d7688167b93 100644
> > > --- a/drivers/dma-buf/heaps/system_heap.c
> > > +++ b/drivers/dma-buf/heaps/system_heap.c
> > > @@ -385,8 +385,6 @@ static struct page *alloc_largest_available(unsigned long size,
> > >               if (max_order < orders[i])
> > >                       continue;
> > >               flags = order_flags[i];
> > > -             if (mem_accounting)
> > > -                     flags |= __GFP_ACCOUNT;
> > >               page = alloc_pages(flags, orders[i]);
> > >               if (!page)
> > >                       continue;
> > > diff --git a/include/uapi/linux/dma-heap.h b/include/uapi/linux/dma-heap.h
> > > index a4cf716a49fa6..e02b0f8cbc6a1 100644
> > > --- a/include/uapi/linux/dma-heap.h
> > > +++ b/include/uapi/linux/dma-heap.h
> > > @@ -29,6 +29,10 @@
> > >   *                   handle to the allocated dma-buf
> > >   * @fd_flags:                file descriptor flags used when allocating
> > >   * @heap_flags:              flags passed to heap
> > > + * @charge_pid_fd:   optional pidfd of the process whose cgroup should be
> > > + *                   charged for this allocation; 0 means charge the calling
> > > + *                   process's cgroup
> > > + * @__padding:               reserved, must be zero
> > >   *
> > >   * Provided by userspace as an argument to the ioctl
> > >   */
> > > @@ -37,6 +41,8 @@ struct dma_heap_allocation_data {
> > >       __u32 fd;
> > >       __u32 fd_flags;
> > >       __u64 heap_flags;
> > > +     __u32 charge_pid_fd;
> > > +     __u32 __padding;
> > >  };
> > >
> > >  #define DMA_HEAP_IOC_MAGIC           'H'
> > >
> >
>

* Re: [PATCH v7 10/20] KVM: arm64: Context swap Partitioned PMU guest registers
From: Colton Lewis @ 2026-05-13 16:38 UTC (permalink / raw)
  To: James Clark
  Cc: alexandru.elisei, pbonzini, corbet, linux, catalin.marinas, will,
	maz, oliver.upton, mizhang, joey.gouly, suzuki.poulose, yuzenghui,
	mark.rutland, shuah, gankulkarni, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm, linux-perf-users, linux-kselftest, kvm
In-Reply-To: <ad02327b-01b6-4ed9-b9bb-e2c6ed4b2890@linaro.org>

James Clark <james.clark@linaro.org> writes:

> On 04/05/2026 10:18 pm, Colton Lewis wrote:
>> Save and restore newly untrapped registers that can be directly
>> accessed by the guest when the PMU is partitioned.

>> * PMEVCNTRn_EL0
>> * PMCCNTR_EL0
>> * PMSELR_EL0
>> * PMCR_EL0
>> * PMCNTEN_EL0
>> * PMINTEN_EL1

>> If we know we are not partitioned (that is, using the emulated vPMU),
>> then return immediately. A later patch will make this lazy so the
>> context swaps don't happen unless the guest has accessed the PMU.

>> PMEVTYPER is handled in a following patch since we must apply the KVM
>> event filter before writing values to hardware.

>> PMOVS guest counters are cleared to avoid the possibility of
>> generating spurious interrupts when PMINTEN is written. This is fine
>> because the virtual register for PMOVS is always the canonical value.

>> Signed-off-by: Colton Lewis <coltonlewis@google.com>
>> ---
>>    arch/arm/include/asm/arm_pmuv3.h |   4 +
>>    arch/arm64/kvm/arm.c             |   2 +
>>    arch/arm64/kvm/pmu-direct.c      | 169 +++++++++++++++++++++++++++++++
>>    include/kvm/arm_pmu.h            |  16 +++
>>    4 files changed, 191 insertions(+)

>> diff --git a/arch/arm/include/asm/arm_pmuv3.h b/arch/arm/include/asm/arm_pmuv3.h
>> index 42d62aa48d0a6..eebc89bdab7a1 100644
>> --- a/arch/arm/include/asm/arm_pmuv3.h
>> +++ b/arch/arm/include/asm/arm_pmuv3.h
>> @@ -235,6 +235,10 @@ static inline bool kvm_pmu_is_partitioned(struct arm_pmu *pmu)
>>    {
>>    	return false;
>>    }
>> +static inline u64 kvm_pmu_host_counter_mask(struct arm_pmu *pmu)
>> +{
>> +	return ~0;
>> +}

>>    /* PMU Version in DFR Register */
>>    #define ARMV8_PMU_DFR_VER_NI        0
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index 410ffd41fd73a..a942f2bc13fc4 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -680,6 +680,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>    		kvm_vcpu_load_vhe(vcpu);
>>    	kvm_arch_vcpu_load_fp(vcpu);
>>    	kvm_vcpu_pmu_restore_guest(vcpu);
>> +	kvm_pmu_load(vcpu);
>>    	if (kvm_arm_is_pvtime_enabled(&vcpu->arch))
>>    		kvm_make_request(KVM_REQ_RECORD_STEAL, vcpu);

>> @@ -721,6 +722,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>>    	kvm_timer_vcpu_put(vcpu);
>>    	kvm_vgic_put(vcpu);
>>    	kvm_vcpu_pmu_restore_host(vcpu);
>> +	kvm_pmu_put(vcpu);
>>    	if (vcpu_has_nv(vcpu))
>>    		kvm_vcpu_put_hw_mmu(vcpu);
>>    	kvm_arm_vmid_clear_active();
>> diff --git a/arch/arm64/kvm/pmu-direct.c b/arch/arm64/kvm/pmu-direct.c
>> index 63ac72910e4b5..360d022d918d5 100644
>> --- a/arch/arm64/kvm/pmu-direct.c
>> +++ b/arch/arm64/kvm/pmu-direct.c
>> @@ -9,6 +9,7 @@
>>    #include <linux/perf/arm_pmuv3.h>

>>    #include <asm/arm_pmuv3.h>
>> +#include <asm/kvm_emulate.h>

>>    /**
>>     * has_host_pmu_partition_support() - Determine if partitioning is possible
>> @@ -98,3 +99,171 @@ u8 kvm_pmu_hpmn(struct kvm_vcpu *vcpu)

>>    	return *host_data_ptr(nr_event_counters);
>>    }
>> +
>> +/**
>> + * kvm_pmu_host_counter_mask() - Compute bitmask of host-reserved counters
>> + * @pmu: Pointer to arm_pmu struct
>> + *
>> + * Compute the bitmask that selects the host-reserved counters in the
>> + * {PMCNTEN,PMINTEN,PMOVS}{SET,CLR} registers. These are the counters
>> + * in HPMN..N
>> + *
>> + * Return: Bitmask
>> + */
>> +u64 kvm_pmu_host_counter_mask(struct arm_pmu *pmu)
>> +{
>> +	u8 nr_counters = *host_data_ptr(nr_event_counters);
>> +
>> +	if (kvm_pmu_is_partitioned(pmu))
>> +		return GENMASK(nr_counters - 1, pmu->max_guest_counters);
>> +
>> +	return ARMV8_PMU_CNT_MASK_ALL;
>> +}
>> +
>> +/**
>> + * kvm_pmu_guest_counter_mask() - Compute bitmask of guest-reserved counters
>> + * @pmu: Pointer to arm_pmu struct
>> + *
>> + * Compute the bitmask that selects the guest-reserved counters in the
>> + * {PMCNTEN,PMINTEN,PMOVS}{SET,CLR} registers. These are the counters
>> + * in 0..HPMN and the cycle and instruction counters.
>> + *
>> + * Return: Bitmask
>> + */
>> +u64 kvm_pmu_guest_counter_mask(struct arm_pmu *pmu)
>> +{
>> +	if (kvm_pmu_is_partitioned(pmu))
>> +		return ARMV8_PMU_CNT_MASK_C | GENMASK(pmu->max_guest_counters - 1, 0);
>> +
>> +	return 0;
>> +}

> Minor nit: slightly inconsistent use of types. Returns a u64 but doesn't
> use GENMASK_ULL and is also usually saved into a long when it's called.

Will fix
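
For reference, a userspace sketch of what a consistently 64-bit version
might look like. GENMASK_ULL() and BIT_ULL() are re-implemented locally so
that it builds outside the kernel, the guest_mask()/host_mask() names are
made up for illustration, and the cycle counter sitting at bit 31 is an
assumption about ARMV8_PMU_CNT_MASK_C rather than something taken from
this patch:

```c
#include <stdint.h>

/* Local stand-ins for the kernel macros of the same name. */
#define BIT_ULL(n)		(1ULL << (n))
#define GENMASK_ULL(h, l)	((~0ULL << (l)) & (~0ULL >> (63 - (h))))

/* Assumption: the cycle counter is bit 31 of the PMCNTEN-style registers. */
#define CNT_MASK_C		BIT_ULL(31)

/* Guest owns counters 0..max_guest-1 plus the cycle counter (max_guest >= 1). */
static uint64_t guest_mask(unsigned int max_guest)
{
	return CNT_MASK_C | GENMASK_ULL(max_guest - 1, 0);
}

/* Host owns counters max_guest..nr_counters-1 (max_guest < nr_counters). */
static uint64_t host_mask(unsigned int nr_counters, unsigned int max_guest)
{
	return GENMASK_ULL(nr_counters - 1, max_guest);
}
```

With six event counters and four of them given to the guest, the two
masks come out disjoint, and callers would store them in a u64 rather
than a long so the width matches on 32-bit builds as well.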

* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: T.J. Mercier @ 2026-05-13 16:35 UTC (permalink / raw)
  To: Albert Esteve
  Cc: Christian König, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <CADSE00Jq_uvNgvxgPze0mEdUd+hF4-DPZkHy0KroWHZzygf4WA@mail.gmail.com>

On Wed, May 13, 2026 at 4:39 AM Albert Esteve <aesteve@redhat.com> wrote:
>
> On Tue, May 12, 2026 at 8:53 PM T.J. Mercier <tjmercier@google.com> wrote:
> >
> > On Tue, May 12, 2026 at 3:14 AM Christian König
> > <christian.koenig@amd.com> wrote:
> > >
> > > On 5/12/26 11:10, Albert Esteve wrote:
> > > > On embedded platforms a central process often allocates dma-buf
> > > > memory on behalf of client applications. Without a way to
> > > > attribute the charge to the requesting client's cgroup, the
> > > > cost lands on the allocator, making per-cgroup memory limits
> > > > ineffective for the actual consumers.
> > > >
> > > > Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> > > > a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
> > > > memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
> > > > inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
> > > > the mem_accounting module parameter enabled, the buffer is charged
> > > > to the allocator's own cgroup.
> > > >
> > > > Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
> > > > system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
> > > > page allocations. Keeping __GFP_ACCOUNT would charge the same pages
> > > > twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
> > > > all accounting through a single MEMCG_DMABUF path.
> > > >
> > > > Usage examples:
> > > >
> > > >   1. Central allocator charging to a client at allocation time.
> > > >      The allocator knows the client's PID (e.g., from binder's
> > > >      sender_pid) and uses pidfd to attribute the charge:
> > > >
> > > >        pid_t client_pid = txn->sender_pid;
> > > >        int pidfd = pidfd_open(client_pid, 0);
> > > >
> > > >        struct dma_heap_allocation_data alloc = {
> > > >            .len             = buffer_size,
> > > >            .fd_flags        = O_RDWR | O_CLOEXEC,
> > > >            .charge_pid_fd   = pidfd,
> > > >        };
> > > >        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
> > > >        close(pidfd);
> > > >        /* alloc.fd is now charged to client's cgroup */
> > > >
> > > >   2. Default allocation (no pidfd, mem_accounting=1).
> > > >      When charge_pid_fd is not set and the mem_accounting module
> > > >      parameter is enabled, the buffer is charged to the allocator's
> > > >      own cgroup:
> > > >
> > > >        struct dma_heap_allocation_data alloc = {
> > > >            .len      = buffer_size,
> > > >            .fd_flags = O_RDWR | O_CLOEXEC,
> > > >        };
> > > >        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
> > > >        /* charged to current process's cgroup */
> > > >
> > > > Current limitations:
> > > >
> > > >  - Single-owner model: a dma-buf carries one memcg charge regardless of
> > > >    how many processes share it. This means only the first owner (and
> > > >    exporter) of the shared buffer bears the charge.
> > > >  - Only memcg accounting supported. While this makes sense for system
> > > >    heap buffers, other heaps (e.g., CMA heaps) will require selectively
> > > >    charging also for the dmem controller.
> > >
> > > Well, that doesn't look so bad. It at least seems to tackle the problem at hand for Android and some other embedded use cases.
> >
> > Yeah I think this might work. I know of 3 cases, and it trivially
> > solves the first two. The third requires some work on our end to
> > extend our userspace interfaces to include the pidfd but it seems
> > doable. I'm checking with our graphics folks.
> >
> > 1) Direct allocation from user (e.g. app -> allocation ioctl on
> > /dev/dma_heap/foo)
> > No changes required to userspace. mem_accounting=1 charges the app.
> >
> > 2) Single hop remote allocation (e.g. app -> AHardwareBuffer_allocate
> > -> gralloc)
> > gralloc has the caller's pid as described in the commit message. Open
> > a pidfd and pass it in the dma_heap_allocation_data.
> >
> > 3) Double hop remote allocation (e.g. app -> dequeueBuffer ->
> > SurfaceFlinger -> gralloc)
> > In this case gralloc knows SurfaceFlinger's pid, but not the app's. So
> > we need to add the app's pidfd to the SurfaceFlinger -> gralloc
> > interface, or transfer the memcg charge from SurfaceFlinger to the app
> > after the allocation.
> > It'd be nice to avoid the charge transfer option entirely, but if we
> > need it that doesn't seem so bad in this case because it's a bulk
> > charge for the entire dmabuf rather than per-page. So the exporter
> > doesn't need to get involved (we wouldn't need a new dma_buf_op) and
> > we wouldn't have to worry about looping and locking for each page.
> >
> > > I'm just not sure if this is future-proof and will work for all use cases, e.g. cloud gaming, native context for automotive, etc.
> > >
> > > Essentially the problem boils down to two limitations:
> > > 1) a piece of memory can only be charged to one cgroup; the framework doesn't have a concept of charging shared memory to multiple groups
> >
> > Yup, memcg already has this problem with pagecache and shmem.
> >
> > > 2) when memory references in the form of file descriptors are passed between applications we have no way of changing the accounting to a different cgroup
> > >
> > > The passing of the memory reference already has a well-defined uAPI, and if we could solve those two limitations we would not only solve the problem without introducing new uAPI (with potential new security risks) but also solve it for all other use cases that use file descriptors as well, e.g. memfd, accel and GPU drivers.
> > >
> > > On the other hand it is really nice to finally see this tackled for at least DMA-buf heaps.
> >
> > I have a question about this part. Albert I guess you are interested
> > only in accounting dmabuf-heap allocations, or do you expect to add
> > __GFP_ACCOUNT or mem_cgroup_charge_dmabuf calls to other
> > non-dmabuf-heap exporters?
>
> We're scoping this to dma-buf heaps for now. CMA heaps and the dmem
> controller are on the radar for follow-up/parallel work (there will be
> dragons and will surely need discussion). For DRM and V4L2 the
> long-term intent is migration to heaps, which would make direct
> accounting on those paths unnecessary.

Ah I see. GEM buffers exported to dmabufs are what I had in mind. I
guess this would only leave the odd non-DRM driver with the need to
add their own accounting calls, which I don't expect would be a big
problem.

> udmabufs are already
> memcg-charged, so adding a separate MEMCG_DMABUF would double count.
> Are there any other exporters you had in mind that would benefit from
> this approach?
>
> BR,
> Albert.
>
> >
> > > On the GPU side I saw yet another attempt at a driver doing some kind of special driver-specific accounting to solve this just a few weeks ago. And to be honest, such single-driver island approaches tend to break more often than they work correctly.
> > >
> > > Regards,
> > > Christian.
> > >
> > > >
> > > > Signed-off-by: Albert Esteve <aesteve@redhat.com>
> > > > ---
> > > >  Documentation/admin-guide/cgroup-v2.rst |  5 ++--
> > > >  drivers/dma-buf/dma-buf.c               | 16 ++++---------
> > > >  drivers/dma-buf/dma-heap.c              | 42 ++++++++++++++++++++++++++++++---
> > > >  drivers/dma-buf/heaps/system_heap.c     |  2 --
> > > >  include/uapi/linux/dma-heap.h           |  6 +++++
> > > >  5 files changed, 53 insertions(+), 18 deletions(-)
> > > >
> > > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > > > index 8bdbc2e866430..824d269531eb1 100644
> > > > --- a/Documentation/admin-guide/cgroup-v2.rst
> > > > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > > > @@ -1636,8 +1636,9 @@ The following nested keys are defined.
> > > >               structures.
> > > >
> > > >         dmabuf (npn)
> > > > -             Amount of memory used for exported DMA buffers allocated by the cgroup.
> > > > -             Stays with the allocating cgroup regardless of how the buffer is shared.
> > > > +             Amount of memory used for exported DMA buffers allocated by or on
> > > > +             behalf of the cgroup. Stays with the allocating cgroup regardless
> > > > +             of how the buffer is shared.
> > > >
> > > >         workingset_refault_anon
> > > >               Number of refaults of previously evicted anonymous pages.
> > > > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > > > index ce02377f48908..23fb758b78297 100644
> > > > --- a/drivers/dma-buf/dma-buf.c
> > > > +++ b/drivers/dma-buf/dma-buf.c
> > > > @@ -181,8 +181,11 @@ static void dma_buf_release(struct dentry *dentry)
> > > >        */
> > > >       BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
> > > >
> > > > -     mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
> > > > -     mem_cgroup_put(dmabuf->memcg);
> > > > +     if (dmabuf->memcg) {
> > > > +             mem_cgroup_uncharge_dmabuf(dmabuf->memcg,
> > > > +                                       PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
> > > > +             mem_cgroup_put(dmabuf->memcg);
> > > > +     }
> > > >
> > > >       dmabuf->ops->release(dmabuf);
> > > >
> > > > @@ -764,13 +767,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
> > > >               dmabuf->resv = resv;
> > > >       }
> > > >
> > > > -     dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);
> > > > -     if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,
> > > > -                                   GFP_KERNEL)) {
> > > > -             ret = -ENOMEM;
> > > > -             goto err_memcg;
> > > > -     }
> > > > -
> > > >       file->private_data = dmabuf;
> > > >       file->f_path.dentry->d_fsdata = dmabuf;
> > > >       dmabuf->file = file;
> > > > @@ -781,8 +777,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
> > > >
> > > >       return dmabuf;
> > > >
> > > > -err_memcg:
> > > > -     mem_cgroup_put(dmabuf->memcg);
> > > >  err_file:
> > > >       fput(file);
> > > >  err_module:
> > > > diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
> > > > index ac5f8685a6494..ff6e259afcdc0 100644
> > > > --- a/drivers/dma-buf/dma-heap.c
> > > > +++ b/drivers/dma-buf/dma-heap.c
> > > > @@ -7,13 +7,17 @@
> > > >   */
> > > >
> > > >  #include <linux/cdev.h>
> > > > +#include <linux/cgroup.h>
> > > >  #include <linux/device.h>
> > > >  #include <linux/dma-buf.h>
> > > >  #include <linux/dma-heap.h>
> > > > +#include <linux/memcontrol.h>
> > > > +#include <linux/sched/mm.h>
> > > >  #include <linux/err.h>
> > > >  #include <linux/export.h>
> > > >  #include <linux/list.h>
> > > >  #include <linux/nospec.h>
> > > > +#include <linux/pidfd.h>
> > > >  #include <linux/syscalls.h>
> > > >  #include <linux/uaccess.h>
> > > >  #include <linux/xarray.h>
> > > > @@ -55,10 +59,12 @@ MODULE_PARM_DESC(mem_accounting,
> > > >                "Enable cgroup-based memory accounting for dma-buf heap allocations (default=false).");
> > > >
> > > >  static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
> > > > -                              u32 fd_flags,
> > > > -                              u64 heap_flags)
> > > > +                              u32 fd_flags, u64 heap_flags,
> > > > +                              struct mem_cgroup *charge_to)
> > > >  {
> > > >       struct dma_buf *dmabuf;
> > > > +     unsigned int nr_pages;
> > > > +     struct mem_cgroup *memcg = charge_to;
> > > >       int fd;
> > > >
> > > >       /*
> > > > @@ -73,6 +79,22 @@ static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
> > > >       if (IS_ERR(dmabuf))
> > > >               return PTR_ERR(dmabuf);
> > > >
> > > > +     nr_pages = len / PAGE_SIZE;
> > > > +
> > > > +     if (memcg)
> > > > +             css_get(&memcg->css);
> > > > +     else if (mem_accounting)
> > > > +             memcg = get_mem_cgroup_from_mm(current->mm);
> > > > +
> > > > +     if (memcg) {
> > > > +             if (!mem_cgroup_charge_dmabuf(memcg, nr_pages, GFP_KERNEL)) {
> > > > +                     mem_cgroup_put(memcg);
> > > > +                     dma_buf_put(dmabuf);
> > > > +                     return -ENOMEM;
> > > > +             }
> > > > +             dmabuf->memcg = memcg;
> > > > +     }
> > > > +
> > > >       fd = dma_buf_fd(dmabuf, fd_flags);
> > > >       if (fd < 0) {
> > > >               dma_buf_put(dmabuf);
> > > > @@ -102,6 +124,9 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
> > > >  {
> > > >       struct dma_heap_allocation_data *heap_allocation = data;
> > > >       struct dma_heap *heap = file->private_data;
> > > > +     struct mem_cgroup *memcg = NULL;
> > > > +     struct task_struct *task;
> > > > +     unsigned int pidfd_flags;
> > > >       int fd;
> > > >
> > > >       if (heap_allocation->fd)
> > > > @@ -113,9 +138,20 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
> > > >       if (heap_allocation->heap_flags & ~DMA_HEAP_VALID_HEAP_FLAGS)
> > > >               return -EINVAL;
> > > >
> > > > +     if (heap_allocation->charge_pid_fd) {
> > > > +             task = pidfd_get_task(heap_allocation->charge_pid_fd, &pidfd_flags);
> > > > +             if (IS_ERR(task))
> > > > +                     return PTR_ERR(task);
> > > > +
> > > > +             memcg = get_mem_cgroup_from_mm(task->mm);
> > > > +             put_task_struct(task);
> > > > +     }
> > > > +
> > > >       fd = dma_heap_buffer_alloc(heap, heap_allocation->len,
> > > >                                  heap_allocation->fd_flags,
> > > > -                                heap_allocation->heap_flags);
> > > > +                                heap_allocation->heap_flags,
> > > > +                                memcg);
> > > > +     mem_cgroup_put(memcg);
> > > >       if (fd < 0)
> > > >               return fd;
> > > >
> > > > diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> > > > index 03c2b87cb1112..95d7688167b93 100644
> > > > --- a/drivers/dma-buf/heaps/system_heap.c
> > > > +++ b/drivers/dma-buf/heaps/system_heap.c
> > > > @@ -385,8 +385,6 @@ static struct page *alloc_largest_available(unsigned long size,
> > > >               if (max_order < orders[i])
> > > >                       continue;
> > > >               flags = order_flags[i];
> > > > -             if (mem_accounting)
> > > > -                     flags |= __GFP_ACCOUNT;
> > > >               page = alloc_pages(flags, orders[i]);
> > > >               if (!page)
> > > >                       continue;
> > > > diff --git a/include/uapi/linux/dma-heap.h b/include/uapi/linux/dma-heap.h
> > > > index a4cf716a49fa6..e02b0f8cbc6a1 100644
> > > > --- a/include/uapi/linux/dma-heap.h
> > > > +++ b/include/uapi/linux/dma-heap.h
> > > > @@ -29,6 +29,10 @@
> > > >   *                   handle to the allocated dma-buf
> > > >   * @fd_flags:                file descriptor flags used when allocating
> > > >   * @heap_flags:              flags passed to heap
> > > > + * @charge_pid_fd:   optional pidfd of the process whose cgroup should be
> > > > + *                   charged for this allocation; 0 means charge the calling
> > > > + *                   process's cgroup
> > > > + * @__padding:               reserved, must be zero
> > > >   *
> > > >   * Provided by userspace as an argument to the ioctl
> > > >   */
> > > > @@ -37,6 +41,8 @@ struct dma_heap_allocation_data {
> > > >       __u32 fd;
> > > >       __u32 fd_flags;
> > > >       __u64 heap_flags;
> > > > +     __u32 charge_pid_fd;
> > > > +     __u32 __padding;
> > > >  };
> > > >
> > > >  #define DMA_HEAP_IOC_MAGIC           'H'
> > > >
> > >
> >
>

* Re: [RFC PATCH v3] bpf: introduce TAINT_UNSAFE_BPF for mutating helpers
From: Alexei Starovoitov @ 2026-05-13 16:35 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Aaron Tomlin, Jonathan Corbet, Song Liu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Eduard,
	Kumar Kartikeya Dwivedi, Masami Hiramatsu, Shuah Khan, Jiri Olsa,
	Martin KaFai Lau, Yonghong Song, Mathieu Desnoyers, Randy Dunlap,
	neelx, sean, chjohnst, steve, mproche, nick.lange,
	open list:DOCUMENTATION, LKML, bpf, linux-trace-kernel
In-Reply-To: <20260513112307.53e77312@gandalf.local.home>

On Wed, May 13, 2026 at 8:23 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 13 May 2026 08:16:07 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> > It's impossible to track all modifications.
> > See what sched-ext is doing.
> > What does it modify? Everything.
>
> What about just having a list of what BPF programs are loaded, what they
> may be attached to, and what kfuncs they are calling?

Ohh. These have been available forever.
Just bpftool prog, bpftool link, bpftool prog dump xlated

^ permalink raw reply

* Re: [PATCH v13 3/4] gpio: rpmsg: add generic rpmsg GPIO driver
From: Mathieu Poirier @ 2026-05-13 16:34 UTC (permalink / raw)
  To: tanmay.shah
  Cc: Arnaud POULIQUEN, Beleswar Prasad Padhi, Shenwei Wang,
	Andrew Lunn, Linus Walleij, Bartosz Golaszewski, Jonathan Corbet,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley, Bjorn Andersson,
	Frank Li, Sascha Hauer, Shuah Khan, linux-gpio@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	Pengutronix Kernel Team, Fabio Estevam, Peng Fan,
	devicetree@vger.kernel.org, linux-remoteproc@vger.kernel.org,
	imx@lists.linux.dev, linux-arm-kernel@lists.infradead.org,
	dl-linux-imx, Bartosz Golaszewski
In-Reply-To: <13140ca1-b4bc-4acc-9f7c-d23490e56dbb@amd.com>

On Tue, 12 May 2026 at 11:20, Shah, Tanmay <tanmays@amd.com> wrote:
>
>
>
> On 5/12/2026 10:41 AM, Mathieu Poirier wrote:
> > On Mon, May 11, 2026 at 04:35:46PM -0500, Shah, Tanmay wrote:
> >>
> >>
> >> On 5/11/2026 12:58 PM, Mathieu Poirier wrote:
> >>> On Mon, 11 May 2026 at 10:47, Shah, Tanmay <tanmays@amd.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 5/5/2026 10:52 AM, Shah, Tanmay wrote:
> >>>>>
> >>>>>
> >>>>> On 5/5/2026 4:28 AM, Arnaud POULIQUEN wrote:
> >>>>>> Hi Tanmay,
> >>>>>>
> >>>>>> On 5/4/26 21:19, Shah, Tanmay wrote:
> >>>>>>>
> >>>>>>> Hello all,
> >>>>>>>
> >>>>>>> I have started reviewing this work as well.
> >>>>>>> Thanks Shenwei for this work.
> >>>>>>>
> >>>>>>> I have gone through only the current revision, and would like to provide
> >>>>>>> idea on how to achieve GPIO number multiplexing with the RPMsg protocol.
> >>>>>>> Also, have some bindings related question.
> >>>>>>>
> >>>>>>> Please see below:
> >>>>>>>
> >>>>>>> On 4/30/2026 11:40 AM, Arnaud POULIQUEN wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 4/30/26 14:56, Beleswar Prasad Padhi wrote:
> >>>>>>>>> Hello Arnaud,
> >>>>>>>>>
> >>>>>>>>> On 30/04/26 13:05, Arnaud POULIQUEN wrote:
> >>>>>>>>>> Hello,
> >>>>>>>>>>
> >>>>>>>>>> On 4/29/26 21:20, Mathieu Poirier wrote:
> >>>>>>>>>>> On Wed, 29 Apr 2026 at 12:07, Padhi, Beleswar <b-padhi@ti.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Mathieu,
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 4/29/2026 11:03 PM, Mathieu Poirier wrote:
> >>>>>>>>>>>>> On Wed, 29 Apr 2026 at 10:53, Shenwei Wang <shenwei.wang@nxp.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>>>>> From: Mathieu Poirier <mathieu.poirier@linaro.org>
> >>>>>>>>>>>>>>> Sent: Wednesday, April 29, 2026 10:42 AM
> >>>>>>>>>>>>>>> To: Shenwei Wang <shenwei.wang@nxp.com>
> >>>>>>>>>>>>>>> Cc: Andrew Lunn <andrew@lunn.ch>; Padhi, Beleswar <b-
> >>>>>>>>>>>>>>> padhi@ti.com>; Linus
> >>>>>>>>>>>>>>> Walleij <linusw@kernel.org>; Bartosz Golaszewski
> >>>>>>>>>>>>>>> <brgl@kernel.org>; Jonathan
> >>>>>>>>>>>>>>> Corbet <corbet@lwn.net>; Rob Herring <robh@kernel.org>;
> >>>>>>>>>>>>>>> Krzysztof Kozlowski
> >>>>>>>>>>>>>>> <krzk+dt@kernel.org>; Conor Dooley <conor+dt@kernel.org>; Bjorn
> >>>>>>>>>>>>>>> Andersson
> >>>>>>>>>>>>>>> <andersson@kernel.org>; Frank Li <frank.li@nxp.com>; Sascha Hauer
> >>>>>>>>>>>>>>> <s.hauer@pengutronix.de>; Shuah Khan
> >>>>>>>>>>>>>>> <skhan@linuxfoundation.org>; linux-
> >>>>>>>>>>>>>>> gpio@vger.kernel.org; linux-doc@vger.kernel.org; linux-
> >>>>>>>>>>>>>>> kernel@vger.kernel.org;
> >>>>>>>>>>>>>>> Pengutronix Kernel Team <kernel@pengutronix.de>; Fabio Estevam
> >>>>>>>>>>>>>>> <festevam@gmail.com>; Peng Fan <peng.fan@nxp.com>;
> >>>>>>>>>>>>>>> devicetree@vger.kernel.org; linux-remoteproc@vger.kernel.org;
> >>>>>>>>>>>>>>> imx@lists.linux.dev; linux-arm-kernel@lists.infradead.org; dl-
> >>>>>>>>>>>>>>> linux-imx <linux-
> >>>>>>>>>>>>>>> imx@nxp.com>; Bartosz Golaszewski <brgl@bgdev.pl>
> >>>>>>>>>>>>>>> Subject: [EXT] Re: [PATCH v13 3/4] gpio: rpmsg: add generic
> >>>>>>>>>>>>>>> rpmsg GPIO driver
> >>>>>>>>>>>>>>> On Tue, Apr 28, 2026 at 03:24:59PM +0000, Shenwei Wang wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>>>>>>> From: Andrew Lunn <andrew@lunn.ch>
> >>>>>>>>>>>>>>>>> Sent: Monday, April 27, 2026 3:49 PM
> >>>>>>>>>>>>>>>>> To: Shenwei Wang <shenwei.wang@nxp.com>
> >>>>>>>>>>>>>>>>> Cc: Padhi, Beleswar <b-padhi@ti.com>; Linus Walleij
> >>>>>>>>>>>>>>>>> <linusw@kernel.org>; Bartosz Golaszewski <brgl@kernel.org>;
> >>>>>>>>>>>>>>>>> Jonathan
> >>>>>>>>>>>>>>>>> Corbet <corbet@lwn.net>; Rob Herring <robh@kernel.org>;
> >>>>>>>>>>>>>>>>> Krzysztof
> >>>>>>>>>>>>>>>>> Kozlowski <krzk+dt@kernel.org>; Conor Dooley
> >>>>>>>>>>>>>>>>> <conor+dt@kernel.org>;
> >>>>>>>>>>>>>>>>> Bjorn Andersson <andersson@kernel.org>; Mathieu Poirier
> >>>>>>>>>>>>>>>>> <mathieu.poirier@linaro.org>; Frank Li <frank.li@nxp.com>;
> >>>>>>>>>>>>>>>>> Sascha
> >>>>>>>>>>>>>>>>> Hauer <s.hauer@pengutronix.de>; Shuah Khan
> >>>>>>>>>>>>>>>>> <skhan@linuxfoundation.org>; linux-gpio@vger.kernel.org; linux-
> >>>>>>>>>>>>>>>>> doc@vger.kernel.org; linux-kernel@vger.kernel.org; Pengutronix
> >>>>>>>>>>>>>>>>> Kernel Team <kernel@pengutronix.de>; Fabio Estevam
> >>>>>>>>>>>>>>>>> <festevam@gmail.com>; Peng Fan <peng.fan@nxp.com>;
> >>>>>>>>>>>>>>>>> devicetree@vger.kernel.org; linux- remoteproc@vger.kernel.org;
> >>>>>>>>>>>>>>>>> imx@lists.linux.dev; linux-arm- kernel@lists.infradead.org;
> >>>>>>>>>>>>>>>>> dl-linux-imx <linux-imx@nxp.com>; Bartosz Golaszewski
> >>>>>>>>>>>>>>>>> <brgl@bgdev.pl>
> >>>>>>>>>>>>>>>>> Subject: [EXT] Re: [PATCH v13 3/4] gpio: rpmsg: add generic
> >>>>>>>>>>>>>>>>> rpmsg
> >>>>>>>>>>>>>>>>> GPIO driver
> >>>>>>>>>>>>>>>>>>> struct virtio_gpio_response {
> >>>>>>>>>>>>>>>>>>>             __u8 status;
> >>>>>>>>>>>>>>>>>>>             __u8 value;
> >>>>>>>>>>>>>>>>>>> };
> >>>>>>>>>>>>>>>>>> It is the same message format. Please see the message
> >>>>>>>>>>>>>>>>>> definition
> >>>>>>>>>>>>>>>>> (GET_DIRECTION) below:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> +   +-----+-----+-----+-----+-----+----+
> >>>>>>>>>>>>>>>>>> +   |0x00 |0x01 |0x02 |0x03 |0x04 |0x05|
> >>>>>>>>>>>>>>>>>> +   | 1   | 2   |port |line | err | dir|
> >>>>>>>>>>>>>>>>>> +   +-----+-----+-----+-----+-----+----+
> >>>>>>>>>>>>>>>>> Sorry, but i don't see how two u8 vs six u8 are the same
> >>>>>>>>>>>>>>>>> message format.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Some changes to the message format are necessary.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Virtio uses two communication channels (virtqueues): one for
> >>>>>>>>>>>>>>>> requests and
> >>>>>>>>>>>>>>> replies, and a second one for events.
> >>>>>>>>>>>>>>>> In contrast, rpmsg provides only a single communication
> >>>>>>>>>>>>>>>> channel, so a
> >>>>>>>>>>>>>>>> type field is required to distinguish between different kinds
> >>>>>>>>>>>>>>>> of messages.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Since rpmsg replies and events share the same message format,
> >>>>>>>>>>>>>>>> an additional
> >>>>>>>>>>>>>>> line is introduced to handle both cases.
> >>>>>>>>>>>>>>>> Finally, rpmsg supports multiple GPIO controllers, so a port
> >>>>>>>>>>>>>>>> field is added to
> >>>>>>>>>>>>>>> uniquely identify the target controller.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I have commented on this before - RPMSG is already providing
> >>>>>>>>>>>>>>> multiplexing
> >>>>>>>>>>>>>>> capability by way of endpoints.  There is no need for a port
> >>>>>>>>>>>>>>> field.  One endpoint,
> >>>>>>>>>>>>>>> one GPIO controller.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>> You still need a way to let the remote side know which port the
> >>>>>>>>>>>>>> endpoint maps to, either
> >>>>>>>>>>>>>> by embedding the port information in the message (the current
> >>>>>>>>>>>>>> way), or by sending it
> >>>>>>>>>>>>>> separately.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> An endpoint is created with every namespace request.  There
> >>>>>>>>>>>>> should be
> >>>>>>>>>>>>> one namespace request for every GPIO controller, which yields a
> >>>>>>>>>>>>> unique
> >>>>>>>>>>>>> endpoint for each controller and eliminates the need for an extra
> >>>>>>>>>>>>> field to identify them.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Right, but this can still be done by just having one namespace
> >>>>>>>>>>>> request.
> >>>>>>>>>>>> We can create new endpoints bound to an existing namespace/
> >>>>>>>>>>>> channel by
> >>>>>>>>>>>> invoking rpmsg_create_ept(). This is what I suggested here too:
> >>>>>>>>>>>> https://lore.kernel.org/all/29485742-6e49-482e-
> >>>>>>>>>>>> b73d-228295daaeec@ti.com/
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> I will look at your suggestion (i.e link above) later this week or
> >>>>>>>>>>> next week.
> >>>>>>>>>>>
> >>>>>>>>>>>> My mental model looks like this for the complete picture:
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. namespace/channel#1 = rpmsg-io
> >>>>>>>>>>>>        a. ept1 -> gpio-controller@1
> >>>>>>>>>>>>        b. ept2 -> gpio-controller@2
> >>>>>>>>>>>>
> >>>>>>>
> >>>>>>> If my understanding of what gpio-controller is right, then this won't
> >>>>>>> work. We need one rpmsg channel per gpio-controller, and in most cases
> >>>>>>> there will be only one GPIO-controller on the remote side. If there are
> >>>>>>> multiple or multiple instances of same controller, then we need separate
> >>>>>>> channel name for that controller just like we would have separate device
> >>>>>>> on the Linux.
> >>>>>>
> >>>>>> As done in the rpmsg_tty driver, it could be instantiated several times with
> >>>>>> the same channel/service name. This would imply a specific rpmsg to
> >>>>>> retrieve the gpio controller index from the remote side.
> >>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> I've asked for one endpoint per GPIO controller since the very
> >>>>>>>>>>> beginning.  I don't yet have a strong opinion on whether to use one
> >>>>>>>>>>> namespace request per GPIO controller or a single request that spins
> >>>>>>>>>>> off multiple endpoints.  I'll have to look at your link and
> >>>>>>>>>>> reflect on
> >>>>>>>>>>> that.  Regardless of how we proceed on that front, multiplexing needs
> >>>>>>>>>>> to happen at the endpoint level rather than the packet level.
> >>>>>>>>>>> This is
> >>>>>>>>>>> the only way this work can move forward.
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I would be more in favor of Mathieu’s proposal: “An endpoint is
> >>>>>>>>>> created with every namespace request.”
> >>>>>>>>>>
> >>>>>>>>>> If the endpoint is created only on the Linux side, how do we match
> >>>>>>>>>> the Linux endpoint address with the local port field on the remote
> >>>>>>>>>> side?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Simply by sending a message to the remote containing the newly created
> >>>>>>>>> endpoint and the port idx. Note that this is done just one time; after
> >>>>>>>>> this, Linux need not include the port field every time it sends
> >>>>>>>>> a message.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> With a multi-namespace approach, the namespace could be rpmsg-io-
> >>>>>>>>>> [addr], where [addr] corresponds to the GPIO controller address in
> >>>>>>>>>> the DT. This would:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> You will face the same problem in this case also that you asked above:
> >>>>>>>>> "how do we match the Linux endpoint address with the local port field
> >>>>>>>>> on the remote side?"
> >>>>>>>>
> >>>>>>>> Sorry I probably introduced confusion here
> >>>>>>>> my sentence should be;
> >>>>>>>>   With a multi-namespace approach, the namespace could be rpmsg-io-
> >>>>>>>> [port],
> >>>>>>>>   where [port] corresponds to the GPIO controller port in the DT.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> For instance:
> >>>>>>>>
> >>>>>>>>        rpmsg {
> >>>>>>>>          rpmsg-io {
> >>>>>>>>            #address-cells = <1>;
> >>>>>>>>            #size-cells = <0>;
> >>>>>>>>
> >>>>>>>>            gpio@25 {
> >>>>>>>>              compatible = "rpmsg-gpio";
> >>>>>>>>              reg = <25>;
> >>>>>>>>              gpio-controller;
> >>>>>>>>              #gpio-cells = <2>;
> >>>>>>>>              #interrupt-cells = <2>;
> >>>>>>>>              interrupt-controller;
> >>>>>>>>            };
> >>>>>>>>
> >>>>>>>>            gpio@32 {
> >>>>>>>>              compatible = "rpmsg-gpio";
> >>>>>>>>              reg = <32>;
> >>>>>>>>              gpio-controller;
> >>>>>>>>              #gpio-cells = <2>;
> >>>>>>>>              #interrupt-cells = <2>;
> >>>>>>>>              interrupt-controller;
> >>>>>>>>            };
> >>>>>>>>          };
> >>>>>>>>        };
> >>>>>>>>
> >>>>>>>>   rpmsg-io-25  would match with gpio@25
> >>>>>>>>   rpmsg-io-32  would match with gpio@32
> >>>>>>>>
> >>>>>>>
> >>>>>>> The problem with this approach is, we will end up creating way too many
> >>>>>>> RPMsg devices/channels, i.e. one channel per one GPIO. That limits how
> >>>>>>> many GPIOs can be handled by the remote from a memory perspective. At
> >>>>>>> some point we might just run out of epts & channels that the remote
> >>>>>>> can create. As of now, open-amp library supports 128 epts I think.
> >>>>>>
> >>>>>> Right, I proposed a solution in my previous answer to Beleswar who has
> >>>>>> the same concern.
> >>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Because the endpoint that is created on a namespace request is also
> >>>>>>>>> dynamic in nature. How will the remote know which endpoint addr
> >>>>>>>>> Linux allocated for a namespace that it announced?
> >>>>>>>>>
> >>>>>>>>> As an example/PoC, I created a firmware example which announces
> >>>>>>>>> 2 name services to Linux, one is the standard "rpmsg_chrdev" and
> >>>>>>>>> the other is a TI specific name service "ti.ipc4.ping-pong". You can
> >>>>>>>>> see it created 2 different addresses (0x400 and 0x401) for each of
> >>>>>>>>> the name service request from the same firmware:
> >>>>>>>>>
> >>>>>>>>> root@j784s4-evm:~# dmesg | grep virtio0 | grep -i channel
> >>>>>>>>> [    9.290275] virtio_rpmsg_bus virtio0: creating channel
> >>>>>>>>> ti.ipc4.ping-pong addr 0xd
> >>>>>>>>> [    9.311230] virtio_rpmsg_bus virtio0: creating channel rpmsg_chrdev
> >>>>>>>>> addr 0xe
> >>>>>>>>> [    9.496645] rpmsg_chrdev virtio0.rpmsg_chrdev.-1.14: DEBUG: Channel
> >>>>>>>>> formed from src = 0x400 to dst = 0xe
> >>>>>>>>> [    9.707255] rpmsg_client_sample virtio0.ti.ipc4.ping-pong.-1.13:
> >>>>>>>>> new channel: 0x401 -> 0xd!
> >>>>>>>>>
> >>>>>>>>> So in this case, rpmsg-io-1 can have different ept addr than rpmsg-io-2
> >>>>>>>>> Back to same problem. Simple solution is to reply to remote with the
> >>>>>>>>> created ept addr and the index.
> >>>>>>>>
> >>>>>>>> That why I would like to suggest to use the name service field to
> >>>>>>>> identify the port/controller, instead of the endpoint address.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> - match the RPMsg probe with the DT,
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> We can probe from all controllers with a single name service
> >>>>>>>>> announcement too.
> >>>>>>>>>
> >>>>>>>>>> - provide a simple mapping between the port and the endpoint on both
> >>>>>>>>>> sides,
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> We are trying to get rid of this mapping from Linux side to adapt
> >>>>>>>>> the gpio-virtio design.
> >>>>>>>>>
> >>>>>>>>>> - allow multiple endpoints on the remote side,
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> We can support this as well with single nameservice model.
> >>>>>>>>> There is no limitation. Remote has to send a message with
> >>>>>>>>> its newly created ept that's all.
> >>>>>>>>>
> >>>>>>>>>> - provide a simple discovery mechanism for remote capabilities.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> A single announcement: "rpmsg-io" is also discovery mechanism.
> >>>>>>>>>
> >>>>>>>>> Feel free to let me know if you have concerns with any of the
> >>>>>>>>> suggestions!
> >>>>>>>>
> >>>>>>>> My only concern, whatever the solution, is that we find a smart
> >>>>>>>> solution to associate the correct endpoint with the correct GPIO
> >>>>>>>> port/controller defined in the DT.
> >>>>>>>>
> >>>>>>>> I may have misunderstood your solution. Could you please help me
> >>>>>>>> understand your proposal by explaining how you would handle three
> >>>>>>>> GPIO ports defined in the DT, considering that the endpoint
> >>>>>>>> addresses on the Linux side can be random?
> >>>>>>>> If I assume there is a unique endpoint on the remote side,
> >>>>>>>> I do not understand how you can match, on the firmware side,
> >>>>>>>> the Linux endpoint address to the GPIO port.
> >>>>>>>>
> >>>>>>>> Thanks and Regards,Arnaud
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Beleswar
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Arnaud
> >>>>>>>>>>
> >>>>>>>>>>>> 2. namespace/channel#2 = rpmsg-i2c
> >>>>>>>>>>>>        a. ept1 -> i2c@1
> >>>>>>>>>>>>        b. ept2 -> i2c@2
> >>>>>>>>>>>>        c. ept3 -> i2c@3
> >>>>>>>>>>>>
> >>>>>>>>>>>> etc...
> >>>>>>>>>>>>
> >>>>>>>
> >>>>>>> Just want to clear-up few terms before I jump to the solution:
> >>>>>>>
> >>>>>>> **RPMsg channel/device**:
> >>>>>>>    - These are devices announced by the remote processor, and created by
> >>>>>>> linux. They are created at: /sys/bus/rpmsg/devices
> >>>>>>>    - The channel format: <name>.<src ept>.<dst ept>
> >>>>>>>
> >>>>>>> **RPMsg endpoint**:
> >>>>>>>    - Endpoint is different from channel. A single channel can have multiple
> >>>>>>> endpoints, represented in Linux as /dev/rpmsg? devices.
> >>>>>>>
> >>>>>>> To create endpoint device, we have rpmsg_create_ept API, which takes
> >>>>>>> channel information as input, which has src-ept, dst-ept.
> >>>>>>>
> >>>>>>> Following is proposed solution:
> >>>>>>>
> >>>>>>> 1) Assign RPMsg channel/device per rpmsg-gpio controller (Not per GPIO
> >>>>>>> pin/port).
> >>>>>>>    - In our case that would be, single rpmsg-io node. (That makes me
> >>>>>>> question if bindings are correct or not).
> >>>>>>>
> >>>>>>> 2) Assign GPIO number as src ept.
> >>>>>>>
> >>>>>>> i.e. *rpmsg-io.<GPIO number>.<dst ept>*. Do not randomly assign src
> >>>>>>> endpoint.
> >>>>>>>
> >>>>>>> Now, RPMSG channel by spec reserves first 1024 endpoints [1], so we can
> >>>>>>> add 1024 offset to the GPIO number:
> >>>>>>>
> >>>>>>> so, when calling rpmsg_create_ept() API, we assign src_endpoint as:
> >>>>>>> (GPIO_NUMBER + RPMSG_RESERVED_ADDRESSES)
> >>>>>>>
> >>>>>>> Now on the remote side, there is single channel and only single-endpoint
> >>>>>>> is needed that is mapped to the rpmsg-io channel callback.
> >>>>>>>
> >>>>>>> That callback will receive all the payloads from the Linux, which will
> >>>>>>> have src-ept i.e. (RPMSG_RESERVED_ADDRESSES + GPIO_NUMBER).
> >>>>>>
> >>>>>>
> >>>>>> Interesting approach. I also tried to find a similar solution.
> >>>>>>
> >>>>>> The question here is: how can we guarantee continuous addresses? Given
> >>>>>> the static and dynamic allocation of endpoint addresses that are
> >>>>>> implemented, my conclusion was that it is not reliable enough.
> >>>>>>
> >>>>>> but perhaps I missed something...
> >>>>>>
> >>>>>>>
> >>>>>>> It can retrieve GPIO_NUMBER easily, and convert to appropriate pin based
> >>>>>>> on platform specific logic.
> >>>>>>>
> >>>>>>> This doesn't need PORT information at all. Also it makes sure that
> >>>>>>> remote is using only single-endpoint so not much memory is used.
> >>>>>>>
> >>>>>>> *Example*:
> >>>>>>> If only rpmsg-gpio channel is created by the remote side, then following
> >>>>>>> is the representation of the devices when GPIO 25, 26, 27 is assigned to
> >>>>>>> the rpmsg-io controller:
> >>>>>>>
> >>>>>>> Linux                                                      Remote
> >>>>>>>
> >>>>>>> rpmsg-channel: rpmsg-gpio.0x400.0x400
> >>>>>>>
> >>>>>>> /dev/rpmsg0 - GPIO25 ept (rpmsg-gpio.0x419.0x400)-|
> >>>>>>>                                                    |
> >>>>>>> /dev/rpmsg1 - GPIO26 ept (rpmsg-gpio.0x41a.0x400)-|-> rpmsg-gpio.*.0x400
> >>>>>>>                                                    |
> >>>>>>> /dev/rpmsg2 - GPIO27 ept (rpmsg-gpio.0x41b.0x400)-|  0x400 ept callback.
> >>>>>>>
> >>>>>>>
> >>>>>>> *On remote side*:
> >>>>>>>
> >>>>>>> ept_0x400_callback(..., int src_ept, ...,)
> >>>>>>> {
> >>>>>>>     int gpio_num = src_ept - RPMSG_RESERVED_ADDRESSES;
> >>>>>>>     // platform specific logic to convert gpio num to proper pin,
> >>>>>>>     // just like you would convert gpio num to pin on a linux gpio
> >>>>>>> controller.
> >>>>>>> }
> >>>>>>>
> >>>>>>> My question on the binding:
> >>>>>>>
> >>>>>>> Why each GPIO is represented with the separate node? I think rpmsg-gpio
> >>>>>>> can be represented just any other GPIO controller? Please let me know if
> >>>>>>> I am missing something. So rpmsg channel/rpmsg device is not created per
> >>>>>>> GPIO, but per controller. GPIO number multiplexing should be done with
> >>>>>>> rpmsg src ept, that removes the need of having each GPIO as a separate
> >>>>>>> node.
> >>>>>>>
> >>>>>>>
> >>>>>>> rpmsg_gpio: rpmsg-gpio@0 {
> >>>>>>>         compatible = "rpmsg-gpio";
> >>>>>>>         reg = <0>;
> >>>>>>>         gpio-controller;
> >>>>>>>         #gpio-cells = <2>;
> >>>>>>>         #interrupt-cells = <2>;
> >>>>>>>         interrupt-controller;
> >>>>>>>     };
> >>>>>>>
> >>>>>>> Then in DT, use like regular GPIO, but with the rpmsg-gpio controller:
> >>>>>>>
> >>>>>>> rpmsg-gpios = <&rpmsg_gpio (GPIO NUM) (flags)>;
> >>>>>>>
> >>>>>>> If the intent to create separate gpio nodes was only for the channel
> >>>>>>> creation, then it's not really needed.
> >>>>>>>
> >>>>>>> [1]
> >>>>>>> https://github.com/torvalds/linux/
> >>>>>>> blob/6d35786de28116ecf78797a62b84e6bf3c45aa5a/drivers/rpmsg/
> >>>>>>> virtio_rpmsg_bus.c#L136
> >>>>>>>
> >>>>>>
> >>>>>> It is already the case. bindings declare GPIO controllers, not directly
> >>>>>> GPIOs in:
> >>>>>>
> >>>>>> [PATCH v13 2/4] dt-bindings: remoteproc: imx_rproc: Add "rpmsg" subnode
> >>>>>> support
> >>>>>>
> >>>>>> The discussion is around having an unique RPmsg endpoint for all
> >>>>>> GPIO controller or one RPmsg endpoint per GPIO controller.
> >>>>>>
> >>>>>
> >>>>> Endpoint where remote side or linux side?
> >>>>>
> >>>>> If unique endpoint on remote side per gpio controller then it makes sense.
> >>>>>
> >>>>> Unique endpoint on linux side doesn't make sense. Instead, unique
> >>>>> channel per gpio controller makes sense, and each channel will have
> >>>>> multiple endpoints on linux side. As I replied to Beleswar on the other
> >>>>> email, I will copy-paste my answer here too:
> >>>>>
> >>>>>
> >>>>> To be more specific:
> >>>>>
> >>>>> Linux:                               remote:
> >>>>>
> >>>>> ch1: rpmsg-gpio.-1.1024 ->     gpio-controller@1024
> >>>>>     - gpio-line ept1
> >>>>>     - gpio-line ept2    ->     They all map to same callback_ept_1024.
> >>>>>     - gpio-line ept3
> >>>>>
> >>>>> ch2: rpmsg-gpio.-1.1025 ->     gpio-controller@1025
> >>>>>     - gpio-line ept1
> >>>>>     - gpio-line ept2    ->     They all map to same callback_ept_1025.
> >>>>>     - gpio-line ept3
> >>>>>
> >>>>
> >>>>
> >>>> Hi Mathieu,
> >>>>
> >>>> So upon more brainstorming on this approach I found a limitation:
> >>>>
> >>>> This approach won't work if host OS is any other OS but Linux. For
> >>>> example, if the remote OS is zephyr/baremetal using open-amp, then Only
> >>>> Linux <-> zephyr combination will work, and we won't be able to re-use
> >>>> this approach for zephyr <-> zephyr use case. The concept of rpmsg
> >>>> channel/device exist only in the linux kernel implementation. This
> >>>> brings another question: Should the protocol we decide work on other use
> >>>> cases as well? Or Linux must be the Host OS for this protocol ?
> >>>>
> >>>
> >>> Linux and Zephyr are very distinct OS, each with their own subsystems
> >>> and characteristics.  The design we choose here involves RPMSG and,
> >>> inherently, Linux.  We can't make decisions based on what may
> >>> potentially happen in Zephyr.
> >>>
> >>>>
> >>>> I think your & Arnaud's proposed approach of single endpoint per
> >>>> gpio-controller on both side makes more sense, as it will work
> >>>> regardless of any OS on host or remote side.
> >>>>
> >>>
> >>> Arnaud, Beleswar, Andrew and I are all advocating for one endpoint per
> >>> GPIO controller.  The remaining issue is about the best way to work
> >>> out source and destination addresses between Linux and the remote
> >>> processor.  I'm running out of time for today but I'll return to this
> >>> thread with a final analysis by the end of the week.
> >>>
> >>
> >> Okay. Then that means multiple endpoints on Linux side can be considered.
> >
> > If there are multiple GPIO controllers then yes, there will be more than one
> > endpoint.  At this time I do not want to consider other bus architectures (i2c,
> > spi, ...) to avoid muddying an already difficult conversation.
> >
> >>
> >> If we decide to go single-endpoint per device on both side, then for
> >> that here is the proposal to represent src ept and dst ept:
> >
> > I do not understand what you mean by "per device" - please be more specific.
> >
>
> "per device" I mean, per rpmsg device/channel. In our case that would be
> per gpio-controller.
>
> >>
> >> When we represent any device under rpmsg bus node, I think it should be
> >> considered remote's view of the address space. So ideally we can
> >> convert it to Linux view of the address space, via 'ranges' property.
> >
> > There is no address space to consider since there is no GPIO controller memory
> > space to access.  All that is done by the driver (remote processor) and
> > completely hidden from Linux by rpmsg-virtio-gpio.
> >
>
> So IMHO the dt-binding is the representation of the device hardware and
> is independent of how driver will access it. With any gpio-controller device
> node, we are just representing what the gpio-controller hardware on the
> remote side looks like, and what the corresponding Linux view is.
>
> The rpmsg-gpio driver is different than the platform gpio controller
> driver mainly in two ways:
>
> 1) How the driver is probed: rpmsg-gpio driver will be probed when
> corresponding rpmsg channel/device name-service announcement will happen
> from the remote side.
>

I agree.

> 2) The GPIO Ops are not performed on the hardware directly, but it's
> done via rpmsg commands on the remote side.
>

I agree.
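
To make that point concrete, here is a minimal user-space sketch, not
driver code.  It assumes the request/response layout and message codes of
virtio-gpio (include/uapi/linux/virtio_gpio.h) and ignores endianness.
remote_handle(), gpio_xfer(), line_state[] and NR_LINES are hypothetical
names used only for illustration; the direct call to remote_handle()
stands in for an rpmsg send followed by the reply.

```c
#include <assert.h>
#include <stdint.h>

/* Message types and status codes as defined for virtio-gpio. */
#define VIRTIO_GPIO_MSG_GET_VALUE 0x0004
#define VIRTIO_GPIO_MSG_SET_VALUE 0x0005
#define VIRTIO_GPIO_STATUS_OK     0x0
#define VIRTIO_GPIO_STATUS_ERR    0x1

/* Same field layout as virtio_gpio_request/response (host-endian here). */
struct gpio_req  { uint16_t type; uint16_t gpio; uint32_t value; };
struct gpio_resp { uint8_t status; uint8_t value; };

#define NR_LINES 8                      /* hypothetical line count */
static uint8_t line_state[NR_LINES];    /* hypothetical remote pin state */

/* Hypothetical firmware handler: the only code touching the "hardware". */
static void remote_handle(const struct gpio_req *req, struct gpio_resp *resp)
{
	assert(req && resp);

	resp->status = VIRTIO_GPIO_STATUS_ERR;
	resp->value = 0;
	if (req->gpio >= NR_LINES)
		return;

	switch (req->type) {
	case VIRTIO_GPIO_MSG_SET_VALUE:
		line_state[req->gpio] = req->value ? 1 : 0;
		resp->status = VIRTIO_GPIO_STATUS_OK;
		break;
	case VIRTIO_GPIO_MSG_GET_VALUE:
		resp->value = line_state[req->gpio];
		resp->status = VIRTIO_GPIO_STATUS_OK;
		break;
	}
}

/* Linux-side op: build a request and hand it to the transport. */
static int gpio_xfer(uint16_t type, uint16_t gpio, uint32_t value)
{
	struct gpio_req req = { .type = type, .gpio = gpio, .value = value };
	struct gpio_resp resp;

	remote_handle(&req, &resp);     /* stands in for send + reply */
	if (resp.status != VIRTIO_GPIO_STATUS_OK)
		return -1;
	return resp.value;
}
```

The Linux side never touches line_state[] directly; it only sees the
status and value carried back in the response.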

> However, the GPIO controller hardware remains the same. So bindings
> shouldn't change.
>

That is where I have a different point of view.  There is no need to
have information in the bindings the kernel won't use.  We are
advertising virtio-gpio devices and as such should use virtio-gpio
bindings.  The only thing that changes is the transport method, i.e.,
encapsulated in RPMSG rather than directly over virtqueues.
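
As an illustration of multiplexing at the endpoint level rather than the
packet level, the following is a user-space sketch under stated
assumptions, not an implementation.  The payload has the virtio-gpio
request shape and carries no port field; the destination endpoint alone
selects the controller.  struct ept_binding, ept_deliver() and the
controller objects are hypothetical, and the addresses 1024 and 1025 are
borrowed from the examples earlier in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Payload carries only the line number: no port/controller field. */
struct gpio_msg { uint16_t type; uint16_t gpio; uint32_t value; };

struct gpio_ctrl {
	const char *name;
	unsigned int handled;   /* messages routed to this controller */
	uint16_t last_gpio;     /* line number of the last message */
};

/* Hypothetical binding of one endpoint address to one controller. */
struct ept_binding { uint32_t addr; struct gpio_ctrl *ctrl; };

static struct gpio_ctrl ctrl0 = { .name = "gpio-controller 0" };
static struct gpio_ctrl ctrl1 = { .name = "gpio-controller 1" };

/* One endpoint, one GPIO controller. */
static struct ept_binding bindings[] = {
	{ 1024, &ctrl0 },
	{ 1025, &ctrl1 },
};

/* Route by destination endpoint; returns NULL for an unbound address. */
static struct gpio_ctrl *ept_deliver(uint32_t dst, const struct gpio_msg *msg)
{
	assert(msg);

	for (size_t i = 0; i < sizeof(bindings) / sizeof(bindings[0]); i++) {
		if (bindings[i].addr == dst) {
			bindings[i].ctrl->handled++;
			bindings[i].ctrl->last_gpio = msg->gpio;
			return bindings[i].ctrl;
		}
	}
	return NULL;
}
```

The same message delivered to 1024 and to 1025 reaches two different
controllers, which is why the payload needs no extra field to tell them
apart.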

> IMHO That means, if I want to move any existing GPIO-controller to the
> remote side, and want the rpmsg-gpio driver to handle it then, all I
> need to change is the compatible string of the current gpio-controller
> device node. The rest of the address space should remain the same, and
> leave ranges property empty. If the remote core has different view of
> the address space, then the device should contain remote's view and
> parent bus (rpmsg-io bus) should provide linux view via 'ranges' property.
>
> That is just the device hw representation in the device-tree as rpmsg
> device. Same for any other type of the controller: i2c, spi etc.
>
> Thanks,
> Tanmay
>
>
> >>
> >> So bindings should include 'ranges' property in the parent node. Then
> >> linux view of the start address becomes src ept, and remote view of the
> >> start address becomes dest ept. The remote view of the start address is
> >> expected to be the static src endpoint on the remote side.
> >>
> >> Following representation of the rpmsg devices (gpio, i2c, spi or any other):
> >>
> >> rpmsg {
> >>   #address-cells = <1>;
> >>   #size-cells = <1>;
> >>
> >>   rpmsg-io {
> >>     compatible = "rpmsg-io-bus";
> >>     ranges = <remote_view_addr(dst ept) linux_view_addr(src ept) size>;
> >>     #address-cells = <1>;
> >>     #size-cells = <1>;
> >>
> >>     gpio@remote_view_addr(or dst ept) {
> >>       compatible = "rpmsg-io";
> >>       reg = <remote_view_addr addr_space_size>;
> >>       gpio-controller;
> >>       #gpio-cells = <2>;
> >>       interrupt-controller;
> >>       #interrupt-cells = <2>;
> >>     };
> >>
> >>     ...
> >>
> >>   };
> >>
> >> };
> >>
> >> Example device-tree:
> >>
> >> rpmsg {
> >>   #address-cells = <1>;
> >>   #size-cells = <1>;
> >>
> >>   rpmsg-io {
> >>     compatible = "rpmsg-io-bus";
> >>     ranges = <0x10000 0x50000 0x1000>,
> >>              <0x20000 0x60000 0x1000>;
> >>     #address-cells = <1>;
> >>     #size-cells = <1>;
> >>
> >>     gpio@10000 {
> >>       compatible = "rpmsg-io";
> >>       reg = <0x10000 0x1000>;
> >>       gpio-controller;
> >>       #gpio-cells = <2>;
> >>       interrupt-controller;
> >>       #interrupt-cells = <2>;
> >>     };
> >>
> >>     gpio@20000 {
> >>       compatible = "rpmsg-io";
> >>       reg = <0x20000 0x1000>;
> >>       gpio-controller;
> >>       #gpio-cells = <2>;
> >>       interrupt-controller;
> >>       #interrupt-cells = <2>;
> >>     };
> >>
> >>   };
> >>
> >> };
> >>
> >>
> >> Thanks,
> >> Tanmay
> >>
> >>
> >>>> To be more specific this will look like following:
> >>>>
> >>>> Host (Linux)                       Remote (baremetal/RTOS)
> >>>>
> >>>> rpmsg ch/device 1:
> >>>>     - rpmsg ept 1   <------>     rpmsg ept 1 gpio-controller 0
> >>>>
> >>>> rpmsg ch/device 2:
> >>>>      - rpmsg ept 2   <------>     rpmsg ept 2 gpio-controller 1
> >>>>
> >>>>
> >>>> The question is, how to decide src ept, and dest ept on both sides?
> >>>> I still think it should be static endpoints.
> >>>>
> >>>> I will get back with more reasoning on that.
> >>>>
> >>>>> On the remote side, we have to hardcode Which rpmsg controller is mapped
> >>>>> to which endpoint.
> >>>>>
> >>>>>> Or did I misunderstand your questions?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Arnaud
> >>>>>>
> >>>>>
> >>>>>
> >>>>> I gave this patch more time yesterday, and I think the 'reg' property
> >>>>> should represent remote endpoint, instead of the gpio-controller index.
> >>>>>
> >>>>> So in this approach remote implementation is expected to provide
> >>>>> hard-coded (static) endpoints for each gpio-controller instance, and
> >>>>> that same number should be represented with the 'reg' property.
> >>>>>
> >>>>> On remote side:
> >>>>>
> >>>>> #define RPMSG_GPIO_0_CONTROLLER_EPT (RPMSG_RESERVED_ADDRESSES + 1) // 1024
> >>>>>
> >>>>> ept_1024_callback() {
> >>>>>
> >>>>>       // handle appropriate gpio port ()
> >>>>>
> >>>>> }
> >>>>>
> >>>>> On linux side:
> >>>>>
> >>>>> So new representation of controller:
> >>>>>
> >>>>>  rpmsg_gpio_0:   gpio@1024 {
> >>>>>              compatible = "rpmsg-gpio";
> >>>>>              reg = <1024>;
> >>>>>              gpio-controller;
> >>>>>              #gpio-cells = <2>;
> >>>>>              #interrupt-cells = <2>;
> >>>>>              interrupt-controller;
> >>>>>           };
> >>>>>
> >>>>>  rpmsg_gpio_1:   gpio@1025 {
> >>>>>              compatible = "rpmsg-gpio";
> >>>>>              reg = <1025>;
> >>>>>              gpio-controller;
> >>>>>              #gpio-cells = <2>;
> >>>>>              #interrupt-cells = <2>;
> >>>>>              interrupt-controller;
> >>>>>           };
> >>>>>
> >>>>> gpios = <&rpmsg_gpio_0 (GPIO NUM or PIN) flags>,
> >>>>>       <&rpmsg_gpio_1 (GPIO NUM or PIN) flags>;
> >>>>>
> >>>>> Now in the linux driver:
> >>>>>
> >>>>> You can easily retrieve destination endpoint when we want to send the
> >>>>> command to the gpio controller via device's "reg" property.
> >>>>>
> >>>>> This approach also provides built-in security as well. Because now
> >>>>> gpio-controller instance is hardcoded with the endpoint callback, it
> >>>>> can't be modified/addressed without changing the 'reg' property.
> >>>>>
> >>>>> Just like you wouldn't change device address for the instance of the
> >>>>> gpio-controller right?
> >>>>>
> >>>>> This approach can be easily adapted to all the other rpmsg controllers
> >>>>> as well.
> >>>>>
> >>>>> So, dynamic endpoint allocation doesn't make sense in this case. Dynamic
> >>>>> endpoint allocation makes more sense for user-space apps which don't
> >>>>> really care about endpoints and only payloads.
> >>>>>
> >>>>> But, here we are multiplexing device-addresses with endpoints, and so it
> >>>>> has to be fixed, and presented via 'reg' property. So, firmware can't
> >>>>> change device-address without Linux knowing it.
> >>>>>
> >>>>> Thanks,
> >>>>> Tanmay
> >>>>>
> >>>>>
> >>>>>>
> >>>>>>>>>>>> This way device groups are isolated with each channel/namespace, and
> >>>>>>>>>>>> instances within each device groups are also respected with specific
> >>>>>>>>>>>> endpoints.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> Beleswar
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>
>
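
To make the static-endpoint proposal quoted above concrete, here is a minimal userspace sketch, not driver code. It assumes the two example nodes from the quoted device tree (reg = <1024> and reg = <1025>); the struct and function names are hypothetical. The point it models is that the destination endpoint is the node's 'reg' value, so no dynamic lookup is involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one "rpmsg-gpio" node: its label and 'reg' value. */
struct gpio_ctrl_node {
	const char *label;
	uint32_t reg;	/* static remote endpoint address */
};

/* Values copied from the example nodes in the quoted mail. */
static const struct gpio_ctrl_node gpio_nodes[] = {
	{ "rpmsg_gpio_0", 1024 },
	{ "rpmsg_gpio_1", 1025 },
};

/*
 * Look up the destination endpoint for a controller.
 * Returns 0 and stores the endpoint in *dst, or -1 if the label is unknown.
 */
int gpio_ctrl_dest_ept(const char *label, uint32_t *dst)
{
	assert(label && dst);

	for (size_t i = 0; i < sizeof(gpio_nodes) / sizeof(gpio_nodes[0]); i++) {
		if (strcmp(gpio_nodes[i].label, label) == 0) {
			/* The endpoint is fixed by 'reg'; nothing is negotiated. */
			*dst = gpio_nodes[i].reg;
			return 0;
		}
	}
	return -1;
}
```

In a real driver the value would presumably come from the node's 'reg' property rather than a table; the table only stands in for the device tree here.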

^ permalink raw reply

* Re: [PATCH] Documentation: KVM: Document guest-visible compatibility expectations
From: Paolo Bonzini @ 2026-05-13 16:24 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Marc Zyngier, Jonathan Corbet, Shuah Khan, kvm,
	Linux Doc Mailing List, Kernel Mailing List, Linux,
	Sean Christopherson, Jim Mattson, Oliver Upton, Joey Gouly,
	Suzuki K Poulose, Zenghui Yu, Catalin Marinas, Will Deacon,
	Raghavendra Rao Ananta, Eric Auger, Kees Cook, Arnd Bergmann,
	Nathan Chancellor, linux-arm-kernel, kvmarm, linux-kselftest
In-Reply-To: <d9d4471a7f5ec1e297b3ca07f42a59090aa91e15.camel@infradead.org>

Il mer 13 mag 2026, 15:57 David Woodhouse <dwmw2@infradead.org> ha scritto:
> > x86 doesn't do bug-for-bug compatibility, thankfully - we have quirks
> > but only 11 of them, or about one per year since we started adding them.
> >   We only add quirks, generally speaking, when 1) we change the way file
> > descriptors are initialized, 2) guests in the wild were relying on it,
> > or 3) it prevents restoring state saved from an old kernel.  Is there
> > anything else?
> >
> > https://lore.kernel.org/kvm/e03f092dfbb7d391a6bf2797ba01e122ba080bcd.camel@infradead.org/
> > is an example of a bug that "no SW can make any reasonable use of".
>
> I actually believe that the focus on ICEBP was triggered by some weird
> gaming software's anti-DRM mechanism, and that it *did* affect actual
> guests in the wild?
>
> But yeah, *fixing* it should not have any adverse effects. That's the
> key.

Yep, so "bug for bug" is not it.

> > That is *also* obviously nonsense though, isn't it (see example above)?
> > The truth is in the middle, "once it is in the architecture" is likely
> > too narrow but "once it is in a Linux release" is way too broad.
>
> How about "once it is in a Linux release and guest visible, and unless
> we *know* that changing it in either direction underneath running
> guests cannot cause problems".
>
> > And besides, both miss the point of *configurability* which is the basis of
> > it all.
>
> Hm, configurability *is* the point, I thought.

Yes, and configurability goes way beyond bugs/quirks, which are to
some extent a red herring. Configurability for example says that "KVM:
arm64: vgic: Allow userspace to set IIDR revision 1" shouldn't be
controversial at all.

> > So we have the third case, "restoring state saved from an old kernel".
> > If this case arises, I do believe that Arm will have to deal with it and
> > introduce quirks or KVM_GET/SET_REG hacks.  Maybe it hasn't happened
> > yet, lucky you.
>
> We literally have those mechanisms already.

I am not talking about guest-visible changes across save/restore here,
but rather about round-trips through userspace. For example, see the
effect of KVM_X2APIC_API_USE_32BIT_IDS on KVM_GET/SET_LAPIC: it
couldn't be made the default, because userspace expects to take old
data returned by KVM_GET_LAPIC and shove it into KVM_SET_LAPIC. Sucks
but can't be avoided.

> See commit https://git.kernel.org/torvalds/c/49a1a2c70a7f which adds a
> new guest-visible feature in revision 3, but allowed userspace to
> restore the old behaviour by setting it to revision 2. All my patch
> above does is make it possible to set it to revision 1 as well.
> Because https://git.kernel.org/torvalds/c/d53c2c29ae0d previously
> changed the behaviour and bumped the default to 2 *without* allowing
> userspace to restore the prior behaviour, and we've been carrying a
> *revert* of that patch.
>
> Why would we *not* accept such a patch?

Agreed. Even ignoring your revert, there's no reason why any upgrade
past 49a1a2c70a7f has to be from after d53c2c29ae0d.

> Marc seems terribly insistent that we SHOULD NOT
> restore the behaviour that older KVM offered to guests, and we MUST
> change it unconditionally underneath running guests, making these
> registers writable on upgrade... and reverting them to read-only for
> running guests on a rollback.
>
> And there we do have a very different viewpoint.

That's the design decision I mentioned, of not starting the guest
configuration from a clean slate. I believe it complicates things
because you have to design from the beginning with the ability to
rollback to old versions and to potentially detect conflicts
introduced by the rollback. This is exactly why
KVM_X86_QUIRK_STUFF_FEATURE_MSRS was introduced: "KVM's initialization
of feature MSRs during vCPU creation results in a failed save/restore
of PERF_CAPABILITIES. If userspace configures the VM to _not_ have a
PMU, because KVM initializes the vCPU's PERF_CAPABILITIES, trying to
save/restore the non-zero value will be rejected by the destination."
(https://lkml.org/lkml/2024/8/2/1032)

For Arm, however, it may be too late to change it; if not, I'll
happily watch you argue with Marc about it. But even without that,
this doc patch (and the idea that "Where a new kernel introduces a
guest-visible change, it provides a mechanism for userspace to select
the previous behaviour") should be uncontroversial.
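
As an illustration of that principle, here is a minimal userspace sketch, assuming a single revision number that userspace may lower to any previously shipped value. All names are hypothetical and this is not the vgic code; it only models "new behaviour is keyed off a revision that userspace can select".

```c
#include <assert.h>

/* Hypothetical revision range; only the ordering matters for the sketch. */
enum { MODEL_REV_MIN = 1, MODEL_REV_DEFAULT = 3 };

struct vgic_model {
	unsigned int rev;	/* revision currently presented to the guest */
};

/* A new VM starts at the newest revision the kernel knows about. */
void vgic_model_init(struct vgic_model *v)
{
	assert(v);
	v->rev = MODEL_REV_DEFAULT;
}

/* Userspace may select any revision that was ever shipped. */
int vgic_model_set_rev(struct vgic_model *v, unsigned int rev)
{
	assert(v);
	if (rev < MODEL_REV_MIN || rev > MODEL_REV_DEFAULT)
		return -1;	/* unknown revision: leave state untouched */
	v->rev = rev;
	return 0;
}

/* Guest-visible behaviour is enabled only from the revision that added it. */
int vgic_model_has_behaviour(const struct vgic_model *v,
			     unsigned int introduced_in)
{
	assert(v);
	return v->rev >= introduced_in;
}
```

With this shape, accepting revision 1 is just a wider accepted range in the setter, which is why it reads as uncontroversial.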

Paolo

^ permalink raw reply

* Re: [PATCH 08/23] arm64: topology: Use RCU to protect access to HK_TYPE_TICK cpumask
From: Frederic Weisbecker @ 2026-05-13 16:19 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Thomas Gleixner, Chen Ridong, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, cgroups,
	linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan
In-Reply-To: <20260421030351.281436-9-longman@redhat.com>

Le Mon, Apr 20, 2026 at 11:03:36PM -0400, Waiman Long a écrit :
> As the HK_TYPE_TICK cpumask is going to be changeable at run time, we
> need to use RCU to protect access to the cpumask to prevent it from
> going away in the middle of the operation.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  arch/arm64/kernel/topology.c | 17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index b32f13358fbb..48f150801689 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -173,6 +173,7 @@ void arch_cpu_idle_enter(void)
>  	if (!amu_fie_cpu_supported(cpu))
>  		return;
>  
> +	guard(rcu)();
>  	/* Kick in AMU update but only if one has not happened already */
>  	if (housekeeping_cpu(cpu, HK_TYPE_TICK) &&
>  	    time_is_before_jiffies(per_cpu(cpu_amu_samples.last_scale_update, cpu)))

This is called with IRQs disabled in the current CPU that is online so it's
already guaranteed to be stable.


> @@ -187,11 +188,16 @@ int arch_freq_get_on_cpu(int cpu)
>  	unsigned int start_cpu = cpu;
>  	unsigned long last_update;
>  	unsigned int freq = 0;
> +	bool hk_cpu;
>  	u64 scale;
>  
>  	if (!amu_fie_cpu_supported(cpu) || !arch_scale_freq_ref(cpu))
>  		return -EOPNOTSUPP;
>  
> +	scoped_guard(rcu) {
> +		hk_cpu = housekeeping_cpu(cpu, HK_TYPE_TICK);
> +	}
> +
>  	while (1) {
>  
>  		amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
> @@ -204,16 +210,21 @@ int arch_freq_get_on_cpu(int cpu)
>  		 * (and thus freq scale), if available, for given policy: this boils
>  		 * down to identifying an active cpu within the same freq domain, if any.
>  		 */
> -		if (!housekeeping_cpu(cpu, HK_TYPE_TICK) ||
> +		if (!hk_cpu ||
>  		    time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS))) {
>  			struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
> +			bool hk_intersects;
>  			int ref_cpu;
>  
>  			if (!policy)
>  				return -EINVAL;
>  
> -			if (!cpumask_intersects(policy->related_cpus,
> -						housekeeping_cpumask(HK_TYPE_TICK))) {
> +			scoped_guard(rcu) {
> +				hk_intersects = cpumask_intersects(policy->related_cpus,
> +							housekeeping_cpumask(HK_TYPE_TICK));
> +			}
> +
> +			if (!hk_intersects) {
>  				cpufreq_cpu_put(policy);
>  				return -EOPNOTSUPP;
>  			}

Ok so this is racy but it's fine because:

This function is only used by cpufreq with either cpufreq_policy_write or
cpufreq_policy_read held (that is, struct cpufreq_policy::rwsem).

And that rwsem is write held on cpufreq_online() -> cpufreq_policy_online() and
also offline to guarantee the policy->cpus and policy->cpu stability.

Therefore housekeeping_cpumask() should only deal with stable online CPUs here. So
even if the housekeeping mask can be changed concurrently, those CPUs can't
appear or disappear from it.

Would be worth adding a comment about that.

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply

* Re: [PATCH v7 06/20] perf: arm_pmuv3: Add method to partition the PMU
From: Colton Lewis @ 2026-05-13 16:13 UTC (permalink / raw)
  To: James Clark
  Cc: alexandru.elisei, pbonzini, corbet, linux, catalin.marinas, will,
	maz, oliver.upton, mizhang, joey.gouly, suzuki.poulose, yuzenghui,
	mark.rutland, shuah, gankulkarni, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm, linux-perf-users, linux-kselftest, kvm
In-Reply-To: <485b8846-13d7-4d31-abf1-686d2516f772@linaro.org>

James Clark <james.clark@linaro.org> writes:

> On 04/05/2026 10:17 pm, Colton Lewis wrote:
>> For PMUv3, the register field MDCR_EL2.HPMN partitions the PMU
>> counters into two ranges where counters 0..HPMN-1 are accessible by
>> EL1 and, if allowed, EL0 while counters HPMN..N are only accessible by
>> EL2.

>> Create a module parameter reserved_host_counters to reserve a number
>> of counters for the host. Counters not reserved for the host may be
>> used by a guest VM when the PMU is partitioned.

>> Add the function armv8pmu_partition() to check the validity of the
>> reservation and record a partition has happened and the maximum
>> allowable value for HPMN.

>> Due to the difficulty this feature would create for the driver running
>> in nVHE mode, partitioning is only allowed in VHE mode. In order to
>> support a partitioning on nVHE we'd need to explicitly disable guest
>> counters on every exit and reset HPMN to place all counters in the
>> first range.

>> Signed-off-by: Colton Lewis <coltonlewis@google.com>
>> ---
>>    arch/arm/include/asm/arm_pmuv3.h   |  4 ++
>>    arch/arm64/include/asm/arm_pmuv3.h |  5 ++
>>    arch/arm64/kvm/Makefile            |  2 +-
>>    arch/arm64/kvm/pmu-direct.c        | 22 +++++++++
>>    drivers/perf/arm_pmuv3.c           | 77 ++++++++++++++++++++++++++++--
>>    include/kvm/arm_pmu.h              |  8 ++++
>>    include/linux/perf/arm_pmu.h       |  2 +
>>    7 files changed, 115 insertions(+), 5 deletions(-)
>>    create mode 100644 arch/arm64/kvm/pmu-direct.c

>> diff --git a/arch/arm/include/asm/arm_pmuv3.h  
>> b/arch/arm/include/asm/arm_pmuv3.h
>> index 2ec0e5e83fc98..154503f054886 100644
>> --- a/arch/arm/include/asm/arm_pmuv3.h
>> +++ b/arch/arm/include/asm/arm_pmuv3.h
>> @@ -221,6 +221,10 @@ static inline bool kvm_pmu_counter_deferred(struct  
>> perf_event_attr *attr)
>>    	return false;
>>    }

>> +static inline bool has_host_pmu_partition_support(void)
>> +{
>> +	return false;
>> +}
>>    static inline bool kvm_set_pmuserenr(u64 val)
>>    {
>>    	return false;
>> diff --git a/arch/arm64/include/asm/arm_pmuv3.h  
>> b/arch/arm64/include/asm/arm_pmuv3.h
>> index cf2b2212e00a2..27c4d6d47da31 100644
>> --- a/arch/arm64/include/asm/arm_pmuv3.h
>> +++ b/arch/arm64/include/asm/arm_pmuv3.h
>> @@ -171,6 +171,11 @@ static inline bool pmuv3_implemented(int pmuver)
>>    		 pmuver == ID_AA64DFR0_EL1_PMUVer_NI);
>>    }

>> +static inline bool is_pmuv3p1(int pmuver)
>> +{
>> +	return pmuver >= ID_AA64DFR0_EL1_PMUVer_V3P1;
>> +}
>> +
>>    static inline bool is_pmuv3p4(int pmuver)
>>    {
>>    	return pmuver >= ID_AA64DFR0_EL1_PMUVer_V3P4;
>> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
>> index 3ebc0570345cc..baf0f296c0e53 100644
>> --- a/arch/arm64/kvm/Makefile
>> +++ b/arch/arm64/kvm/Makefile
>> @@ -26,7 +26,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o  
>> pvtime.o \
>>    	 vgic/vgic-its.o vgic/vgic-debug.o vgic/vgic-v3-nested.o \
>>    	 vgic/vgic-v5.o

>> -kvm-$(CONFIG_HW_PERF_EVENTS)  += pmu-emul.o pmu.o
>> +kvm-$(CONFIG_HW_PERF_EVENTS)  += pmu-emul.o pmu-direct.o pmu.o
>>    kvm-$(CONFIG_ARM64_PTR_AUTH)  += pauth.o
>>    kvm-$(CONFIG_PTDUMP_STAGE2_DEBUGFS) += ptdump.o

>> diff --git a/arch/arm64/kvm/pmu-direct.c b/arch/arm64/kvm/pmu-direct.c
>> new file mode 100644
>> index 0000000000000..74e40e4915416
>> --- /dev/null
>> +++ b/arch/arm64/kvm/pmu-direct.c
>> @@ -0,0 +1,22 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright (C) 2025 Google LLC
>> + * Author: Colton Lewis <coltonlewis@google.com>
>> + */
>> +
>> +#include <linux/kvm_host.h>
>> +
>> +#include <asm/arm_pmuv3.h>
>> +
>> +/**
>> + * has_host_pmu_partition_support() - Determine if partitioning is  
>> possible
>> + *
>> + * Partitioning is only supported in VHE mode with PMUv3
>> + *
>> + * Return: True if partitioning is possible, false otherwise
>> + */
>> +bool has_host_pmu_partition_support(void)
>> +{
>> +	return has_vhe() &&
>> +		system_supports_pmuv3();
>> +}
>> diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
>> index 7ff3139dda893..6e447227d801f 100644
>> --- a/drivers/perf/arm_pmuv3.c
>> +++ b/drivers/perf/arm_pmuv3.c
>> @@ -42,6 +42,13 @@
>>    #define ARMV8_THUNDER_PERFCTR_L1I_CACHE_PREF_ACCESS		0xEC
>>    #define ARMV8_THUNDER_PERFCTR_L1I_CACHE_PREF_MISS		0xED

>> +static int reserved_host_counters __read_mostly = -1;
>> +bool armv8pmu_is_partitioned;
>> +
>> +module_param(reserved_host_counters, int, 0);
>> +MODULE_PARM_DESC(reserved_host_counters,
>> +		 "PMU Partition: -1 = No partition; +N = Reserve N counters for the  
>> host");
>> +
>>    /*
>>     * ARMv8 Architectural defined events, not all of these may
>>     * be supported on any given implementation. Unsupported events will
>> @@ -532,6 +539,11 @@ static void armv8pmu_pmcr_write(u64 val)
>>    	write_pmcr(val);
>>    }

>> +static u64 armv8pmu_pmcr_n_read(void)
>> +{
>> +	return FIELD_GET(ARMV8_PMU_PMCR_N, armv8pmu_pmcr_read());
>> +}
>> +
>>    static int armv8pmu_has_overflowed(u64 pmovsr)
>>    {
>>    	return !!(pmovsr & ARMV8_PMU_CNT_MASK_ALL);
>> @@ -1312,6 +1324,54 @@ struct armv8pmu_probe_info {
>>    	bool present;
>>    };

>> +/**
>> + * armv8pmu_reservation_is_valid() - Determine if reservation is allowed
>> + * @host_counters: Number of host counters to reserve
>> + *
>> + * Determine if the number of host counters in the argument is an
>> + * allowed reservation, 0 to NR_COUNTERS inclusive.
>> + *
>> + * Return: True if reservation allowed, false otherwise
>> + */
>> +static bool armv8pmu_reservation_is_valid(int host_counters)
>> +{
>> +	return host_counters >= 0 &&
>> +		host_counters <= armv8pmu_pmcr_n_read();
>> +}
>> +
>> +/**
>> + * armv8pmu_partition() - Partition the PMU
>> + * @pmu: Pointer to pmu being partitioned
>> + * @host_counters: Number of host counters to reserve
>> + *
>> + * Partition the given PMU by taking a number of host counters to
>> + * reserve and, if it is a valid reservation, recording the
>> + * corresponding HPMN value in the max_guest_counters field of the PMU  
>> and
>> + * clearing the guest-reserved counters from the counter mask.
>> + *
>> + * Return: 0 on success, -ERROR otherwise
>> + */
>> +static int armv8pmu_partition(struct arm_pmu *pmu, int host_counters)
>> +{
>> +	u8 nr_counters;
>> +	u8 hpmn;
>> +
>> +	if (!armv8pmu_reservation_is_valid(host_counters)) {
>> +		pr_err("PMU partition reservation of %d host counters is not valid",  
>> host_counters);
>> +		return -EINVAL;
>> +	}
>> +
>> +	nr_counters = armv8pmu_pmcr_n_read();
>> +	hpmn = nr_counters - host_counters;
>> +
>> +	pmu->max_guest_counters = hpmn;
>> +	armv8pmu_is_partitioned = true;
>> +
>> +	pr_info("Partitioned PMU with %d host counters -> %u guest counters",  
>> host_counters, hpmn);
>> +
>> +	return 0;
>> +}
>> +
>>    static void __armv8pmu_probe_pmu(void *info)
>>    {
>>    	struct armv8pmu_probe_info *probe = info;
>> @@ -1326,17 +1386,26 @@ static void __armv8pmu_probe_pmu(void *info)

>>    	cpu_pmu->pmuver = pmuver;
>>    	probe->present = true;
>> +	cpu_pmu->max_guest_counters = -1;

>>    	/* Read the nb of CNTx counters supported from PMNC */
>> -	bitmap_set(cpu_pmu->cntr_mask,
>> -		   0, FIELD_GET(ARMV8_PMU_PMCR_N, armv8pmu_pmcr_read()));
>> +	bitmap_set(cpu_pmu->hw_cntr_mask, 0, armv8pmu_pmcr_n_read());

>>    	/* Add the CPU cycles counter */
>> -	set_bit(ARMV8_PMU_CYCLE_IDX, cpu_pmu->cntr_mask);
>> +	set_bit(ARMV8_PMU_CYCLE_IDX, cpu_pmu->hw_cntr_mask);

>>    	/* Add the CPU instructions counter */
>>    	if (pmuv3_has_icntr())
>> -		set_bit(ARMV8_PMU_INSTR_IDX, cpu_pmu->cntr_mask);
>> +		set_bit(ARMV8_PMU_INSTR_IDX, cpu_pmu->hw_cntr_mask);
>> +
>> +	bitmap_copy(cpu_pmu->cntr_mask, cpu_pmu->hw_cntr_mask,  
>> ARMPMU_MAX_HWEVENTS);
>> +
>> +	if (reserved_host_counters >= 0) {
>> +		if (has_host_pmu_partition_support())
>> +			armv8pmu_partition(cpu_pmu, reserved_host_counters);
>> +		else
>> +			pr_err("PMU partition is not supported");
>> +	}

>>    	pmceid[0] = pmceid_raw[0] = read_pmceid0();
>>    	pmceid[1] = pmceid_raw[1] = read_pmceid1();
>> diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
>> index 24a471cf59d56..95f404cdcb2df 100644
>> --- a/include/kvm/arm_pmu.h
>> +++ b/include/kvm/arm_pmu.h
>> @@ -47,7 +47,10 @@ struct arm_pmu_entry {
>>    	struct arm_pmu *arm_pmu;
>>    };

>> +extern bool armv8pmu_is_partitioned;
>> +
>>    bool kvm_supports_guest_pmuv3(void);
>> +bool has_host_pmu_partition_support(void);
>>    #define kvm_arm_pmu_irq_initialized(v)	((v)->arch.pmu.irq_num >=  
>> VGIC_NR_SGIS)
>>    u64 kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u64 select_idx);
>>    void kvm_pmu_set_counter_value(struct kvm_vcpu *vcpu, u64 select_idx,  
>> u64 val);
>> @@ -117,6 +120,11 @@ static inline bool kvm_supports_guest_pmuv3(void)
>>    	return false;
>>    }

>> +static inline bool has_host_pmu_partition_support(void)
>> +{
>> +	return false;
>> +}
>> +
>>    #define kvm_arm_pmu_irq_initialized(v)	(false)
>>    static inline u64 kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu,
>>    					    u64 select_idx)
>> diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
>> index 52b37f7bdbf9e..f7b000bb3eca8 100644
>> --- a/include/linux/perf/arm_pmu.h
>> +++ b/include/linux/perf/arm_pmu.h
>> @@ -109,6 +109,7 @@ struct arm_pmu {
>>    	 */
>>    	int		(*map_pmuv3_event)(unsigned int eventsel);
>>    	DECLARE_BITMAP(cntr_mask, ARMPMU_MAX_HWEVENTS);
>> +	DECLARE_BITMAP(hw_cntr_mask, ARMPMU_MAX_HWEVENTS);

> I think this needs a comment or a clearer name. Both cntr_mask and
> hw_cntr_mask are used in KVM and the PMU driver and it's not immediately
> obvious what the difference is.

I will clarify this. The goal was for hw_cntr_mask to be the unmodified
reference point to restore cntr_mask later.
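
To show what I mean, here is a rough userspace model, not the driver code. It assumes only general-purpose counters (no cycle or instruction counter), fewer than 64 of them, and hypothetical helper names; the guest_load/guest_put split is an assumption about where the masks would change. hw_cntr_mask is written once and never modified, while cntr_mask shrinks when the guest owns counters 0..HPMN-1 and is restored from hw_cntr_mask afterwards.

```c
#include <assert.h>
#include <stdint.h>

struct pmu_model {
	uint64_t hw_cntr_mask;	/* counters the hardware implements; read-only after init */
	uint64_t cntr_mask;	/* counters the host may use right now */
	int max_guest_counters;	/* HPMN value, or -1 when not partitioned */
};

void pmu_model_init(struct pmu_model *pmu, unsigned int nr_counters)
{
	assert(pmu && nr_counters < 64);
	pmu->hw_cntr_mask = (UINT64_C(1) << nr_counters) - 1;
	pmu->cntr_mask = pmu->hw_cntr_mask;
	pmu->max_guest_counters = -1;
}

/* Same arithmetic as the patch: HPMN = total counters - host reservation. */
int pmu_model_partition(struct pmu_model *pmu, unsigned int nr_counters,
			int host_counters)
{
	assert(pmu);
	if (host_counters < 0 || (unsigned int)host_counters > nr_counters)
		return -1;
	pmu->max_guest_counters = (int)nr_counters - host_counters;
	return 0;
}

/* Guest takes counters 0..HPMN-1; the host keeps HPMN and above. */
void pmu_model_guest_load(struct pmu_model *pmu)
{
	uint64_t guest;

	assert(pmu);
	if (pmu->max_guest_counters < 0)
		return;
	guest = (UINT64_C(1) << pmu->max_guest_counters) - 1;
	pmu->cntr_mask = pmu->hw_cntr_mask & ~guest;
}

/* Restore the host view from the unmodified reference mask. */
void pmu_model_guest_put(struct pmu_model *pmu)
{
	assert(pmu);
	pmu->cntr_mask = pmu->hw_cntr_mask;
}
```

For example, with 6 counters and 2 reserved for the host, HPMN is 4, the host mask while a guest is loaded is 0x30, and it returns to 0x3f on put.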

^ permalink raw reply

* Re: [PATCH v7 00/20] ARM64 PMU Partitioning
From: Colton Lewis @ 2026-05-13 16:10 UTC (permalink / raw)
  To: James Clark
  Cc: alexandru.elisei, pbonzini, corbet, linux, catalin.marinas, will,
	maz, oliver.upton, mizhang, joey.gouly, suzuki.poulose, yuzenghui,
	mark.rutland, shuah, gankulkarni, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm, linux-perf-users, linux-kselftest
In-Reply-To: <18d747ea-660a-4ae6-b8b8-365d745352ce@linaro.org>


Hi James. Thanks for reviewing.

James Clark <james.clark@linaro.org> writes:

> On 04/05/2026 10:17 pm, Colton Lewis wrote:
>> This series creates a new PMU scheme on ARM, a partitioned PMU that
>> allows reserving a subset of counters for more direct guest access,
>> significantly reducing overhead. More details, including performance
>> benchmarks, can be read in the v1 cover letter linked below.

>> An overview of what this series accomplishes was presented at KVM
>> Forum 2025. Slides [1] and video [2] are linked below.

>> After a few false starts, meeting with Will Deacon and Mark Rutland to
>> discuss implementation ideas, and a few more false starts, I finally
>> have an implementation of dynamic counter reservation that works
>> without disrupting host perf too much. Now the host only loses access
>> to the guest counters when a vCPU resides on the CPU.

>> The key was creating perf_pmu_resched_update, which behaves exactly
>> like perf_pmu_resched except it takes a callback to call in between
>> when the perf events are scheduled out and when they are scheduled
>> back in. That allows us to update the PMU's available counters when we
>> know they are not currently in use without needing to expose private
>> perf core functions and triple check they are not being called in a
>> way that violates existing assumptions.

>> Because this introduces a possibility of perf reschedule during vCPU
>> load, I've optimized to only do that operation if there are host
>> events occupying the intended guest counters at the time of the load.

>> The kernel command line parameter for the driver still exists, but now
>> only defines an upper limit of counters the guest might use rather
>> than taking those counters from the host permanently.

>> v7:

>> * Implement dynamic counter reservation as described above. One side
>>     effect is the PMUv3 driver now needs far fewer changes to enforce
>>     the boundary.

>> * Move register accesses out of fast path for non-FGT hardware. The
>>     performance impact was negligible and this moves bloat out of the
>>     fast path and allows a more reliable design with more code sharing.

>> * Make PMCCNTR a special case in the context swap again because trying
>>     to access it with PMXEVCNTR is undefined.

>> * Fix a bug where kvm_pmu_guest_counter_mask was using & instead of |.

>> * Re-expose the dedicated instruction counter to the host since it was
>>     decided the guest will not own it.

>> * Change the global armv8pmu_reserved_host_counters to
>>     armv8pmu_is_partitioned because it was only used in boolean checks.

>> * Fix typo in vcpu attribute commit so the spelling of the flag in the
>>     commit message matches the code.

>> * Rebase to v7.0-rc7

>> v6:
>> https://lore.kernel.org/kvmarm/20260209221414.2169465-1-coltonlewis@google.com/

>> v5:
>> https://lore.kernel.org/kvmarm/20251209205121.1871534-1-coltonlewis@google.com/

>> v4:
>> https://lore.kernel.org/kvmarm/20250714225917.1396543-1-coltonlewis@google.com/

>> v3:
>> https://lore.kernel.org/kvm/20250626200459.1153955-1-coltonlewis@google.com/

>> v2:
>> https://lore.kernel.org/kvm/20250620221326.1261128-1-coltonlewis@google.com/

>> v1:
>> https://lore.kernel.org/kvm/20250602192702.2125115-1-coltonlewis@google.com/

>> [1]  
>> https://gitlab.com/qemu-project/kvm-forum/-/raw/main/_attachments/2025/Optimizing__itvHkhc.pdf
>> [2]  
>> https://www.youtube.com/watch?v=YRzZ8jMIA6M&list=PLW3ep1uCIRfxwmllXTOA2txfDWN6vUOHp&index=9

>> Colton Lewis (19):
>>     arm64: cpufeature: Add cpucap for HPMN0
>>     KVM: arm64: Reorganize PMU functions
>>     perf: arm_pmuv3: Generalize counter bitmasks
>>     perf: arm_pmuv3: Check cntr_mask before using pmccntr
>>     perf: arm_pmuv3: Add method to partition the PMU
>>     KVM: arm64: Set up FGT for Partitioned PMU
>>     KVM: arm64: Add Partitioned PMU register trap handlers
>>     KVM: arm64: Set up MDCR_EL2 to handle a Partitioned PMU
>>     KVM: arm64: Context swap Partitioned PMU guest registers
>>     KVM: arm64: Enforce PMU event filter at vcpu_load()
>>     perf: Add perf_pmu_resched_update()
>>     KVM: arm64: Apply dynamic guest counter reservations
>>     KVM: arm64: Implement lazy PMU context swaps
>>     perf: arm_pmuv3: Handle IRQs for Partitioned PMU guest counters
>>     KVM: arm64: Detect overflows for the Partitioned PMU
>>     KVM: arm64: Add vCPU device attr to partition the PMU
>>     KVM: selftests: Add find_bit to KVM library
>>     KVM: arm64: selftests: Add test case for Partitioned PMU
>>     KVM: arm64: selftests: Relax testing for exceptions when partitioned

>> Marc Zyngier (1):
>>     KVM: arm64: Reorganize PMU includes

>>    arch/arm/include/asm/arm_pmuv3.h              |  18 +
>>    arch/arm64/include/asm/arm_pmuv3.h            |  12 +-
>>    arch/arm64/include/asm/kvm_host.h             |  17 +-
>>    arch/arm64/include/asm/kvm_types.h            |   6 +-
>>    arch/arm64/include/uapi/asm/kvm.h             |   2 +
>>    arch/arm64/kernel/cpufeature.c                |   8 +
>>    arch/arm64/kvm/Makefile                       |   2 +-
>>    arch/arm64/kvm/arm.c                          |   2 +
>>    arch/arm64/kvm/config.c                       |  41 +-
>>    arch/arm64/kvm/debug.c                        |  31 +-
>>    arch/arm64/kvm/pmu-direct.c                   | 494 ++++++++++++
>>    arch/arm64/kvm/pmu-emul.c                     | 674 +----------------
>>    arch/arm64/kvm/pmu.c                          | 701 ++++++++++++++++++
>>    arch/arm64/kvm/sys_regs.c                     | 250 ++++++-
>>    arch/arm64/tools/cpucaps                      |   1 +
>>    arch/arm64/tools/sysreg                       |   6 +-
>>    drivers/perf/arm_pmuv3.c                      | 111 ++-
>>    include/kvm/arm_pmu.h                         | 110 +++
>>    include/linux/perf/arm_pmu.h                  |   3 +
>>    include/linux/perf/arm_pmuv3.h                |  14 +-
>>    include/linux/perf_event.h                    |   3 +
>>    kernel/events/core.c                          |  28 +-
>>    tools/testing/selftests/kvm/Makefile.kvm      |   1 +
>>    .../selftests/kvm/arm64/vpmu_counter_access.c | 112 ++-
>>    tools/testing/selftests/kvm/lib/find_bit.c    |   1 +
>>    25 files changed, 1861 insertions(+), 787 deletions(-)
>>    create mode 100644 arch/arm64/kvm/pmu-direct.c
>>    create mode 100644 tools/testing/selftests/kvm/lib/find_bit.c


>> base-commit: 591cd656a1bf5ea94a222af5ef2ee76df029c1d2
>> --
>> 2.54.0.545.g6539524ca2-goog

> I tested it a bit and ran the kselftests and it all seems to be working

Great to hear you didn't find any obvious problems with your testing!

> ok. Some of the critical sashiko comments look like they are worth
> looking into though:
> https://sashiko.dev/#/patchset/20260504211813.1804997-1-coltonlewis%40google.com
> For example writing to PMCR_EL0.P from EL2 resets the host's counters,
> even if it's KVM doing it after trapping a write from the guest.

I will comb through this and the other sashiko comments and fix.
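
For anyone reading along without the series applied, the perf_pmu_resched_update() idea from the cover letter quoted above can be modelled in userspace as below. This is only a sketch of the pattern (schedule out, run a callback, schedule back in); the struct, the function names, and the callback signature are all hypothetical and do not reflect the perf core API.

```c
#include <assert.h>
#include <stddef.h>

struct sched_model {
	int events_scheduled;		/* nonzero while events occupy counters */
	unsigned long avail_counters;	/* counters events may be placed on */
	int unsafe_updates;		/* updates seen while events were scheduled */
};

typedef void (*resched_update_fn)(struct sched_model *m, void *data);

/* Schedule events out, let the caller change PMU state, schedule back in. */
void resched_with_update(struct sched_model *m, resched_update_fn update,
			 void *data)
{
	assert(m);
	m->events_scheduled = 0;	/* counters are idle from here */
	if (update)
		update(m, data);	/* safe window for changing the mask */
	m->events_scheduled = 1;	/* events are placed using the new mask */
}

/* Example callback: install a new counter mask passed via data. */
void set_avail_counters(struct sched_model *m, void *data)
{
	assert(m && data);
	if (m->events_scheduled)
		m->unsafe_updates++;	/* would indicate a broken caller */
	m->avail_counters = *(unsigned long *)data;
}
```

The callback never observes scheduled events, which is the property that lets the available counters change without reaching into private perf core functions.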

^ permalink raw reply

* Re: [RFC v2 0/2] add kconfirm
From: Julian Braha @ 2026-05-13 16:04 UTC (permalink / raw)
  To: nathan, jani.nikula, akpm, gary, ljs, arnd, gregkh, masahiroy,
	ojeda, corbet, qingfang.deng, linux-kernel, rust-for-linux,
	linux-doc, linux-kbuild
In-Reply-To: <agSXOHvQqTxSsArW@levanger>

On 5/13/26 16:22, Nicolas Schier wrote:
> I guess the github branch is expected to work out of the box, but on my arm64
> system this fails with:
> 
>     kconfirm$ make -j8 kconfirm
>     error: no matching package named `env_logger` found
>     location searched: crates.io index
>     required by package `kconfirm-lib v0.9.0 (/data/kbuild/kbuild-fixes/kconfirm/scripts/kconfirm/kconfirm-lib)`
>     As a reminder, you're using offline mode (--offline) which can sometimes cause surprising resolution failures, if this error is too confusing you may wish to retry without the offline flag.
>     make[2]: *** [Makefile:17: kconfirm] Error 101
>     make[1]: *** [kconfirm/Makefile:2244: kconfirm] Error 2
>     make: *** [Makefile:248: __sub-make] Error 2
>     [exit code 2]

Thanks for giving it a shot! I will look into this.

> and if 'kconfirm' does not need a .config file, you want to add 'kconfirm' to
> the list of 'no-dot-config-targets' in top-level Makefile.
> 
> 
> FTR: the 'kconfirm' and 'kconfirmclean' targets need some love: both do not
> really integrate in kbuild, yet: 'kconfirm' is not working with out-of-source
> builds (O=...), 'kconfirmclean' should not be required if 'make clean' is
> supported correctly, and 'make mrproper' removes the whole scripts/kconfirm
> tree due to the change in 'scripts/Makefile'.  (Tested?)

Also thank you for the makefile feedback, this is exactly what I was
looking for.

> The large amount of changes has been mentioned often enough;  even if all the
> vendored dependencies could be dropped, I am not convinced yet, that it is a
> good idea to maintain kconfirm in-tree due to its project size.

It's true, even though kconfirm only imports a few packages and is
2,000 LoC itself, once you consider the transitive dependencies, it
really adds up.

I'm currently trying to shrink the dependency tree as much as I
can, e.g. taking advantage of expected system packages, as was suggested
by previous reviewers.

> IMO, we need at least someone who steps up for maintaining kconfirm and
> registers in a dedicated MAINTAINERS entry.  (My own rust knowledge is not good
> enough for appropriate review, I can only offer some initial testing and
> frequent use when it is working/integrated.)

I will add myself to the MAINTAINERS for this. These RFCs are only the
beginning for kconfirm, not the finishing of it. My ultimate goal is to
be able to detect _all misusage_ of kconfig, while keeping zero false
alarms. I haven't even added the SMT solving that is needed for
path-sensitive analysis and detecting unmet dependency bugs.

Thanks again for your review!

- Julian Braha

^ permalink raw reply

* Re: [RFC v2 0/2] add kconfirm
From: Miguel Ojeda @ 2026-05-13 15:52 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Julian Braha, nathan, nsc, jani.nikula, akpm, gary, ljs, arnd,
	gregkh, masahiroy, ojeda, corbet, qingfang.deng, linux-kernel,
	rust-for-linux, linux-doc, linux-kbuild
In-Reply-To: <5220poq2-qq9p-27p0-3sq9-50q1845n76n0@vanv.qr>

On Mon, May 11, 2026 at 1:28 AM Jan Engelhardt <ej@inai.de> wrote:
>
> Linux, and many other projects, have run on a "The system version is
> king" model for a long time. If libelf, binutils, gcc, libx11, or
> whatever the dependency in question may be, the project trying to use
> a dependency would add a few-liner patch to broaden the accepted
> range, rather than trying to re-provide the dependency as a whole.

Definitely! That is why I mentioned "even better", i.e. if it can be
done using system packages across a reasonable number of
distributions, then that should be the approach. However, sometimes
that may not be as easy as it is with very well established (and
stable) C libraries/APIs.

Cheers,
Miguel

^ permalink raw reply

* [PATCH v7 6/6] Documentation: document panic_on_unrecoverable_memory_failure sysctl
From: Breno Leitao @ 2026-05-13 15:39 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260513-ecc_panic-v7-0-be2e578e61da@debian.org>

Add documentation for the new vm.panic_on_unrecoverable_memory_failure
sysctl, describing which failures trigger a panic (kernel-owned pages
the handler cannot recover) and which are intentionally left out
(transient allocator races and unclassified pages).

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 Documentation/admin-guide/sysctl/vm.rst | 80 +++++++++++++++++++++++++++++++++
 1 file changed, 80 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 97e12359775c9..452c2ab25b35e 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -67,6 +67,7 @@ Currently, these files are in /proc/sys/vm:
 - page-cluster
 - page_lock_unfairness
 - panic_on_oom
+- panic_on_unrecoverable_memory_failure
 - percpu_pagelist_high_fraction
 - stat_interval
 - stat_refresh
@@ -925,6 +926,85 @@ panic_on_oom=2+kdump gives you very strong tool to investigate
 why oom happens. You can get snapshot.
 
 
+panic_on_unrecoverable_memory_failure
+======================================
+
+When a hardware memory error (e.g. multi-bit ECC) hits a kernel page
+that cannot be recovered by the memory failure handler, the default
+behaviour is to ignore the error and continue operation.  This is
+dangerous because the corrupted data remains accessible to the kernel,
+risking silent data corruption or a delayed crash when the poisoned
+memory is next accessed.
+
+When enabled, this sysctl triggers a panic on memory failure events
+hitting reserved (``PageReserved``) memory: firmware reservations,
+the kernel image, vDSO, the zero page, and similar memblock-reserved
+regions.  These are owned by the kernel, are not managed by the page
+allocator, and cannot be recovered by the memory failure handler.
+
+Other unrecoverable kernel-owned populations (slab, vmalloc, page
+tables, kernel stacks, ...) are not currently covered by this
+sysctl.  The handler cannot reliably distinguish them from a
+userspace folio temporarily off the LRU during migration or
+compaction, and the cost of a false-positive panic on a recoverable
+userspace page is too high.  Such pages still go through the
+standard MF_MSG_GET_HWPOISON path: ``PG_hwpoison`` is set on them
+and a delayed crash on the next access remains possible.  Coverage
+may grow in the future as the handler gains stronger
+kernel-ownership signals.
+
+Recoverable failure paths are also intentionally left out: in-flight
+buddy allocations and other transient races with the page allocator
+can reach the same diagnostic, and panicking on them would risk
+killing the box for a page destined for userspace where the standard
+SIGBUS recovery path applies.  Pages whose state could not be
+classified at all are not covered either, since an unknown state is
+not a sound basis for a panic decision.
+
+For many environments it is preferable to panic immediately with a clean
+crash dump that captures the original error context, rather than to
+continue and face a random crash later whose cause is difficult to
+diagnose.
+
+Use cases
+---------
+
+This option is most useful in environments where unattributed crashes
+are expensive to debug or where data integrity must take precedence
+over availability:
+
+* Large fleets, where multi-bit ECC errors on kernel pages are observed
+  regularly and post-mortem analysis of an unrelated downstream crash
+  (often seconds to minutes after the original error) consumes
+  significant engineering effort.
+
+* Systems configured with kdump, where panicking at the moment of the
+  hardware error produces a vmcore that still contains the faulting
+  address, the affected page state, and the originating MCE/GHES
+  record — context that is typically lost by the time a delayed crash
+  occurs.
+
+* High-availability clusters that rely on fast, deterministic node
+  failure for failover, and prefer an immediate panic over silent data
+  corruption propagating to replicas or persistent storage.
+
+* Kernel and platform developers reproducing hwpoison issues with
+  tools such as ``mce-inject`` or error-injection debugfs interfaces,
+  where panicking on the unrecoverable path makes regressions
+  immediately visible instead of surfacing as later, unrelated
+  failures.
+
+= =====================================================================
+0 Try to continue operation (default).
+1 Panic immediately.  If the ``panic`` sysctl is also non-zero then the
+  machine will be rebooted.
+= =====================================================================
+
+Example::
+
+     echo 1 > /proc/sys/vm/panic_on_unrecoverable_memory_failure
+
+
 percpu_pagelist_high_fraction
 =============================
 

-- 
2.53.0-Meta


^ permalink raw reply related
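
For monitoring agents that prefer to check the setting from C rather than
from a shell, the following is a minimal userspace sketch. It assumes the
sysctl is exposed at the path given in the documentation patch above; the
helper name read_int_sysctl() is hypothetical and not part of the series.
On kernels without this series the file does not exist and the helper
returns -1.

```c
#include <assert.h>
#include <stdio.h>

/* Path taken from the documentation patch; absent on kernels without it. */
#define PANIC_ON_UNRECOVERABLE_MF_PATH \
	"/proc/sys/vm/panic_on_unrecoverable_memory_failure"

/*
 * Hypothetical helper: read a single integer from a procfs-style file.
 * Returns 0 on success, -1 if the file is missing or cannot be parsed.
 */
static int read_int_sysctl(const char *path, int *value)
{
	FILE *f;
	int parsed;

	assert(path != NULL && value != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;

	parsed = (fscanf(f, "%d", value) == 1);
	fclose(f);

	return parsed ? 0 : -1;
}
```

A caller would pass PANIC_ON_UNRECOVERABLE_MF_PATH and treat a -1 return as
"sysctl not available" rather than as "disabled".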

* [PATCH v7 5/6] mm/memory-failure: add panic option for unrecoverable pages
From: Breno Leitao @ 2026-05-13 15:39 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260513-ecc_panic-v7-0-be2e578e61da@debian.org>

Add a sysctl panic_on_unrecoverable_memory_failure that triggers a
kernel panic when memory_failure() encounters pages that cannot be
recovered.  This provides a clean crash with useful debug information
rather than allowing silent data corruption or a delayed crash at an
unrelated code path.

Panic eligibility is intentionally narrow: only MF_MSG_KERNEL with
result == MF_IGNORED panics.  After the previous patch, MF_MSG_KERNEL
covers PG_reserved pages and the kernel-owned pages promoted from
get_hwpoison_page() via -ENOTRECOVERABLE (slab, vmalloc, page tables,
kernel stacks, ...).

All other action types are excluded:

- MF_MSG_GET_HWPOISON and MF_MSG_KERNEL_HIGH_ORDER can be reached by
  transient refcount races with the page allocator (an in-flight buddy
  allocation briefly has refcount 0 while no longer on the buddy free
  list), and panicking on them would risk killing the box for what
  is actually a recoverable userspace page.

- MF_MSG_UNKNOWN means identify_page_state() could not classify the
  page; that is precisely the wrong basis for a panic decision.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 8ba3df21d1270..cb2965c0ec0b4 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -74,6 +74,8 @@ static int sysctl_memory_failure_recovery __read_mostly = 1;
 
 static int sysctl_enable_soft_offline __read_mostly = 1;
 
+static int sysctl_panic_on_unrecoverable_mf __read_mostly;
+
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
 static bool hw_memory_failure __read_mostly = false;
@@ -155,6 +157,15 @@ static const struct ctl_table memory_failure_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
+	},
+	{
+		.procname	= "panic_on_unrecoverable_memory_failure",
+		.data		= &sysctl_panic_on_unrecoverable_mf,
+		.maxlen		= sizeof(sysctl_panic_on_unrecoverable_mf),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
 	}
 };
 
@@ -1267,6 +1278,15 @@ static void update_per_node_mf_stats(unsigned long pfn,
 	++mf_stats->total;
 }
 
+static bool panic_on_unrecoverable_mf(enum mf_action_page_type type,
+				      enum mf_result result)
+{
+	if (!sysctl_panic_on_unrecoverable_mf || result != MF_IGNORED)
+		return false;
+
+	return type == MF_MSG_KERNEL;
+}
+
 /*
  * "Dirty/Clean" indication is not 100% accurate due to the possibility of
  * setting PG_dirty outside page lock. See also comment above set_page_dirty().
@@ -1284,6 +1304,9 @@ static int action_result(unsigned long pfn, enum mf_action_page_type type,
 	pr_err("%#lx: recovery action for %s: %s\n",
 		pfn, action_page_types[type], action_name[result]);
 
+	if (panic_on_unrecoverable_mf(type, result))
+		panic("Memory failure: %#lx: unrecoverable page", pfn);
+
 	return (result == MF_RECOVERED || result == MF_DELAYED) ? 0 : -EBUSY;
 }
 

-- 
2.53.0-Meta


^ permalink raw reply related
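
The eligibility rule from the commit message can be exercised outside the
kernel with a small userspace model. This is only a sketch: the enums are
reduced stand-ins for the kernel definitions (only the values discussed in
this thread are listed), the sysctl is modelled as a plain global, and
check_panic_eligibility() is a hypothetical self-test, not part of the
patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced stand-ins for the kernel enums; not the full kernel lists. */
enum mf_result { MF_IGNORED, MF_FAILED, MF_DELAYED, MF_RECOVERED };

enum mf_action_page_type {
	MF_MSG_KERNEL,
	MF_MSG_KERNEL_HIGH_ORDER,
	MF_MSG_GET_HWPOISON,
	MF_MSG_UNKNOWN,
};

/* Models the sysctl value; 0 is the default (never panic). */
static int sysctl_panic_on_unrecoverable_mf;

/* True only for MF_MSG_KERNEL with MF_IGNORED while the sysctl is set. */
static bool panic_on_unrecoverable_mf(enum mf_action_page_type type,
				      enum mf_result result)
{
	if (!sysctl_panic_on_unrecoverable_mf || result != MF_IGNORED)
		return false;

	return type == MF_MSG_KERNEL;
}

/* Hypothetical self-test walking the cases named in the commit message. */
static void check_panic_eligibility(void)
{
	sysctl_panic_on_unrecoverable_mf = 0;
	assert(!panic_on_unrecoverable_mf(MF_MSG_KERNEL, MF_IGNORED));

	sysctl_panic_on_unrecoverable_mf = 1;
	assert(panic_on_unrecoverable_mf(MF_MSG_KERNEL, MF_IGNORED));
	assert(!panic_on_unrecoverable_mf(MF_MSG_KERNEL, MF_RECOVERED));
	assert(!panic_on_unrecoverable_mf(MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED));
	assert(!panic_on_unrecoverable_mf(MF_MSG_GET_HWPOISON, MF_IGNORED));
	assert(!panic_on_unrecoverable_mf(MF_MSG_UNKNOWN, MF_IGNORED));
}
```

Keeping the predicate as an allow-list of one type means any action type
added later stays non-panicking unless it is opted in explicitly.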

* [PATCH v7 4/6] mm/memory-failure: short-circuit PG_reserved before get_hwpoison_page()
From: Breno Leitao @ 2026-05-13 15:39 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <20260513-ecc_panic-v7-0-be2e578e61da@debian.org>

The previous patch already classifies PG_reserved pages as
MF_MSG_KERNEL through the long path: get_hwpoison_page() calls
__get_hwpoison_page() which fails HWPoisonHandlable(), get_any_page()
exhausts its shake_page() retry budget, and the resulting
-ENOTRECOVERABLE is mapped to MF_MSG_KERNEL by the switch.  The
outcome is correct but the work in between is wasted: shake_page()
cannot turn a reserved page into a handlable one.

Detect PG_reserved up front in memory_failure() and report
MF_MSG_KERNEL directly.  put_ref_page() releases the caller's
reference when MF_COUNT_INCREASED is set, which is important on the
MADV_HWPOISON path where get_user_pages_fast() holds a reference
across the call.

Suggested-by: Lance Yang <lance.yang@linux.dev>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4b3a5d4190a07..8ba3df21d1270 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2398,6 +2398,19 @@ int memory_failure(unsigned long pfn, int flags)
 		goto unlock_mutex;
 	}
 
+	/*
+	 * PG_reserved pages are kernel-owned (memblock reservations,
+	 * driver reservations, ...) and cannot be recovered.  Skip the
+	 * get_hwpoison_page() lifecycle dance and report MF_MSG_KERNEL
+	 * straight away; HWPoisonHandlable() would just keep rejecting
+	 * the page through the retry budget anyway.
+	 */
+	if (PageReserved(p)) {
+		put_ref_page(pfn, flags);
+		res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
+		goto unlock_mutex;
+	}
+
 	/*
 	 * We need/can do nothing about count=0 pages.
 	 * 1) it's a free page, and therefore in safe hand:

-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH v7 3/6] mm/memory-failure: report MF_MSG_KERNEL for unrecoverable kernel pages
From: Breno Leitao @ 2026-05-13 15:39 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260513-ecc_panic-v7-0-be2e578e61da@debian.org>

The previous patch teaches get_any_page() to return -ENOTRECOVERABLE
for stable unhandlable kernel pages (PG_reserved, slab, vmalloc, page
tables, kernel stacks, ...).  memory_failure() still folds every
negative return into MF_MSG_GET_HWPOISON, so callers that want to
react to the unrecoverable cases (a panic option, smarter logging)
cannot tell them apart from transient page-allocator races.

Turn the post-call branch into a switch over the get_hwpoison_page()
return code: map -ENOTRECOVERABLE to MF_MSG_KERNEL and any other
negative return to MF_MSG_GET_HWPOISON.  case 0 keeps the existing
free-buddy / kernel-high-order handling and case 1 falls through to
the rest of memory_failure() unchanged.

The MF_MSG_KERNEL label and tracepoint string are kept as
"reserved kernel page" to avoid breaking userspace tools that match
on those literals; the enum value still adequately tags the failure
even though it now also covers slab, vmalloc, page tables and kernel
stack pages.

Suggested-by: David Hildenbrand <david@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index bae883df3ccb2..4b3a5d4190a07 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2410,7 +2410,8 @@ int memory_failure(unsigned long pfn, int flags)
 	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
 	 */
 	res = get_hwpoison_page(p, flags);
-	if (!res) {
+	switch (res) {
+	case 0:
 		if (is_free_buddy_page(p)) {
 			if (take_page_off_buddy(p)) {
 				page_ref_inc(p);
@@ -2429,7 +2430,19 @@ int memory_failure(unsigned long pfn, int flags)
 			res = action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
 		}
 		goto unlock_mutex;
-	} else if (res < 0) {
+	case 1:
+		/* Got a refcount on a handlable page. */
+		break;
+	case -ENOTRECOVERABLE:
+		/*
+		 * Stable unhandlable kernel-owned page (PG_reserved,
+		 * slab, vmalloc, page tables, kernel stacks, ...).
+		 * No recovery possible.
+		 */
+		res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
+		goto unlock_mutex;
+	default:
+		/* Transient lifecycle race with the page allocator. */
 		res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
 		goto unlock_mutex;
 	}

-- 
2.53.0-Meta

^ permalink raw reply related
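
The return-code dispatch described in the commit message can be sketched as
a standalone userspace function. This is an illustration under stated
assumptions: enum mf_dispatch, dispatch_hwpoison_result() and
check_dispatch() are hypothetical names introduced here, and the function
only models which branch of the switch is taken, not the work done inside
each branch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical labels for the four branches of the switch. */
enum mf_dispatch {
	MF_DISPATCH_FREE_OR_HIGH_ORDER,	/* case 0 */
	MF_DISPATCH_CONTINUE,		/* case 1: refcount taken, carry on */
	MF_DISPATCH_KERNEL,		/* -ENOTRECOVERABLE -> MF_MSG_KERNEL */
	MF_DISPATCH_GET_HWPOISON,	/* other values -> MF_MSG_GET_HWPOISON */
};

/* Models the branch selection on the get_hwpoison_page() return code. */
static enum mf_dispatch dispatch_hwpoison_result(int res)
{
	switch (res) {
	case 0:
		return MF_DISPATCH_FREE_OR_HIGH_ORDER;
	case 1:
		return MF_DISPATCH_CONTINUE;
	case -ENOTRECOVERABLE:
		return MF_DISPATCH_KERNEL;
	default:
		/* Treated as a transient race with the page allocator. */
		return MF_DISPATCH_GET_HWPOISON;
	}
}

/* Hypothetical self-test covering each branch once. */
static void check_dispatch(void)
{
	assert(dispatch_hwpoison_result(0) == MF_DISPATCH_FREE_OR_HIGH_ORDER);
	assert(dispatch_hwpoison_result(1) == MF_DISPATCH_CONTINUE);
	assert(dispatch_hwpoison_result(-ENOTRECOVERABLE) == MF_DISPATCH_KERNEL);
	assert(dispatch_hwpoison_result(-EBUSY) == MF_DISPATCH_GET_HWPOISON);
}
```

Only -ENOTRECOVERABLE is singled out; every other error keeps the
MF_MSG_GET_HWPOISON classification that all negative returns had before
the patch.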
