* Re: [PATCH v9 02/10] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
From: Jim Mattson @ 2026-04-07 18:40 UTC
To: Pawan Gupta
Cc: x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin, Josh Poimboeuf,
David Kaplan, Sean Christopherson, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, KP Singh, Jiri Olsa, David S. Miller,
David Laight, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
David Ahern, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, Stanislav Fomichev, Hao Luo,
Paolo Bonzini, Jonathan Corbet, linux-kernel, kvm, Asit Mallick,
Tao Zhang, bpf, netdev, linux-doc, chao.gao
In-Reply-To: <20260407171151.2gf2idjbmph35ypb@desk>
On Tue, Apr 7, 2026 at 10:12 AM Pawan Gupta
<pawan.kumar.gupta@linux.intel.com> wrote:
>
> On Tue, Apr 07, 2026 at 09:46:07AM -0700, Jim Mattson wrote:
> > On Tue, Apr 7, 2026 at 9:40 AM Pawan Gupta
> > <pawan.kumar.gupta@linux.intel.com> wrote:
> > >
> > > On Mon, Apr 06, 2026 at 07:23:25AM -0700, Jim Mattson wrote:
> > > > Yes, but the guest needs a way to determine whether the hypervisor
> > > > will do what's necessary to make the short sequence effective. And, in
> > > > particular, no KVM hypervisor today is prepared to do that.
> > > >
> > > > When running under a hypervisor, without BHI_CTRL and without any
> > > > evidence to the contrary, the guest must assume that the longer
> > > > sequence is necessary. At the very least, we need a CPUID or MSR bit
> > > > that says, "the short BHB clearing sequence is adequate for this
> > > > vCPU."
> > >
> > > After discussing this internally, the consensus is that the best path
> > > forward is to add virtual SPEC_CTRL support to KVM, which also aligns with
> > > Intel's guidance. In the long term, virtual SPEC_CTRL can benefit future
> > > mitigations as well. As with many other mitigations (e.g. microcode), the
> > > guest would rely on the host to enforce the appropriate protections.
> >
> > I don't think it's reasonable for the guest to rely on a future
> > implementation to enforce the appropriate protections.
> >
> > This is already a problem today. If a guest sees that BHI_CTRL is
> > unavailable, it will deploy the short BHB clearing sequence and
> > declare that the vulnerability is mitigated. That isn't true if the
> > guest is running on Alder Lake or newer.
>
> In any case, a kernel change is required in either the guest or the host;
> both are future implementations. Why not implement the one that is more
> future-proof?
There will always be old hypervisors. True future-proofing requires
that the guest be able to distinguish an old hypervisor from a new
one.
My proposal is as follows:
1. The (advanced) hypervisor can advertise to the guest (via CPUID bit
or MSR bit) that the short BHB clearing sequence is adequate. This may
mean either that the VM will only be hosted on pre-Alder Lake hardware
or that the hypervisor will set BHI_DIS_S behind the back of the
guest. Presumably, this bit would not be reported if BHI_CTRL is
advertised to the guest.
2. If the guest sees this bit, then it can use the short sequence. If
it doesn't see this bit, it must use the long sequence.
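The two steps above can be sketched as a guest-side decision function. This is a minimal illustration, not kernel code: the struct, the enum, and the function name are made up for the example, and short_seq_ok stands in for the proposed CPUID or MSR bit, which does not exist today. It also assumes that a guest that sees BHI_CTRL uses the hardware control rather than a software sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mitigation choices; the names are illustrative only. */
enum bhb_mitigation {
	BHB_USE_BHI_DIS_S,	/* hardware control, available via BHI_CTRL */
	BHB_SHORT_SEQUENCE,	/* short BHB clearing sequence */
	BHB_LONG_SEQUENCE,	/* long BHB clearing sequence */
};

/* What the guest can observe about its vCPU. */
struct guest_view {
	bool has_bhi_ctrl;	/* BHI_CTRL advertised to the guest */
	bool short_seq_ok;	/* proposed "short sequence is adequate" bit */
};

/* Without BHI_CTRL and without the new bit, fall back to the long sequence. */
enum bhb_mitigation guest_pick_bhb_mitigation(const struct guest_view *g)
{
	assert(g != NULL);
	if (g->has_bhi_ctrl)
		return BHB_USE_BHI_DIS_S;
	return g->short_seq_ok ? BHB_SHORT_SEQUENCE : BHB_LONG_SEQUENCE;
}
```

An old hypervisor never sets the new bit, so a guest running on it lands on the long sequence by default, which is the distinction between old and new hypervisors that the proposal asks for.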
* Re: [PATCH] sched_ext: Documentation: Fix scx_bpf_move_to_local kfunc name
From: Tejun Heo @ 2026-04-07 18:18 UTC
To: fangqiurong; +Cc: linux-kernel, linux-doc, corbet
In-Reply-To: <20260407093405.2573184-1-fangqiurong@kylinos.cn>
Hello,
On Tue, Apr 7, 2026 at 05:34:05PM +0800, fangqiurong@kylinos.cn wrote:
> The correct kfunc name is scx_bpf_dsq_move_to_local(), not
> scx_bpf_move_to_local(). Fix the two references in the
> Scheduling Cycle section.
Applied to sched_ext/for-7.1. The patch had the author From: line
set to fangqiurong@kylinos.com while the envelope and Signed-off-by
were @kylinos.cn -- I aligned the recorded author with the
Signed-off-by (.cn). Please let me know if that's wrong.
Thanks.
--
tejun
* Re: [PATCH v9 02/10] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
From: Pawan Gupta @ 2026-04-07 17:52 UTC
To: Jon Kohler
Cc: Jim Mattson, x86@kernel.org, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, David Kaplan, Sean Christopherson,
Borislav Petkov, Dave Hansen, Peter Zijlstra, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, KP Singh, Jiri Olsa,
David S. Miller, David Laight, Andy Lutomirski, Thomas Gleixner,
Ingo Molnar, David Ahern, Martin KaFai Lau, Eduard Zingerman,
Song Liu, Yonghong Song, John Fastabend, Stanislav Fomichev,
Hao Luo, Paolo Bonzini, Jonathan Corbet,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Asit Mallick,
Tao Zhang, bpf@vger.kernel.org, netdev@vger.kernel.org,
linux-doc@vger.kernel.org, chao.gao@intel.com
In-Reply-To: <FAA31092-E1CA-4D79-8CEC-3DB0F6F1C792@nutanix.com>
On Tue, Apr 07, 2026 at 05:12:06PM +0000, Jon Kohler wrote:
>
>
> > On Apr 7, 2026, at 11:46 AM, Jim Mattson <jmattson@google.com> wrote:
> >
> > On Tue, Apr 7, 2026 at 9:40 AM Pawan Gupta
> > <pawan.kumar.gupta@linux.intel.com> wrote:
> >>
> >> On Mon, Apr 06, 2026 at 07:23:25AM -0700, Jim Mattson wrote:
> >>> Yes, but the guest needs a way to determine whether the hypervisor
> >>> will do what's necessary to make the short sequence effective. And, in
> >>> particular, no KVM hypervisor today is prepared to do that.
> >>>
> >>> When running under a hypervisor, without BHI_CTRL and without any
> >>> evidence to the contrary, the guest must assume that the longer
> >>> sequence is necessary. At the very least, we need a CPUID or MSR bit
> >>> that says, "the short BHB clearing sequence is adequate for this
> >>> vCPU."
> >>
> >> After discussing this internally, the consensus is that the best path
> >> forward is to add virtual SPEC_CTRL support to KVM, which also aligns with
> >> Intel's guidance. In the long term, virtual SPEC_CTRL can benefit future
> >> mitigations as well. As with many other mitigations (e.g. microcode), the
> >> guest would rely on the host to enforce the appropriate protections.
>
> Would we have to wait for virtual SPEC_CTRL to get this optimization?
The optimization works with or without virtual-SPEC_CTRL.
> Or would that be a future enhancement to make this more prescriptive?
Virtual-SPEC_CTRL enables safer guest migrations between pre- and post-Alder
Lake CPUs with respect to the native BHI mitigation. It is not related to VMSCAPE.
* Re: [PATCH v2 00/16] fs,x86/resctrl: Add kernel-mode (e.g., PLZA) support to the resctrl subsystem
From: Reinette Chatre @ 2026-04-07 17:48 UTC
To: Babu Moger, corbet, tony.luck, Dave.Martin, james.morse, tglx,
mingo, bp, dave.hansen
Cc: skhan, x86, hpa, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, kas,
rick.p.edgecombe, akpm, pmladek, rdunlap, dapeng1.mi, kees, elver,
paulmck, lirongqing, safinaskar, fvdl, seanjc, pawan.kumar.gupta,
xin, tiala, Neeraj.Upadhyay, chang.seok.bae, thomas.lendacky,
elena.reshetova, linux-doc, linux-kernel, linux-coco, kvm,
eranian, peternewman
In-Reply-To: <5a740f47-d3f3-45af-9d8c-ebcf3dd89c0d@amd.com>
Hi Babu,
On 4/6/26 3:45 PM, Babu Moger wrote:
> Hi Reinette,
>
> Sorry for the late response. I was trying to get confirmation about the use case.
No problem. I appreciate that you did this so that we can make sure resctrl supports
needed use cases.
>
> On 3/31/26 17:24, Reinette Chatre wrote:
>> On 3/30/26 11:46 AM, Babu Moger wrote:
>>> On 3/27/26 17:11, Reinette Chatre wrote:
>>>> On 3/26/26 10:12 AM, Babu Moger wrote:
>>>>> On 3/24/26 17:51, Reinette Chatre wrote:
>>>>>> On 3/12/26 1:36 PM, Babu Moger wrote:
>> can have domains that span different CPUs. There thus seem to be a built in assumption of what a "domain"
>> means for PQR_PLZA_ASSOC so it sounds to me as though, instead of saying that "PQR_PLZA_ASSOC needs
>> to be the same in QoS domain" it may be more accurate to, for example, say that "PQR_PLZA_ASSOC has L3 scope"?
>
> Yes.
Above is about L3 scope ...
>>
>> This seems to be what this implementation does since it hardcodes PQR_PLZA_ASSOC scope to the L3
>> resource but that creates dependency to the L3 resource that would make PLZA unusable if, for example,
>> the user boots with "rdt=!l3cat" while wanting to use PLZA to manage MBA allocations when in kernel?
>
> Yes. that is correct. It should not be attached to one resource. We need to change it to global scope.
Can I interpret "global scope" as "all online CPUs"? Doing so will simplify
supporting this feature. It does not sound practical for a user wanting to assign
different resource groups to kernel work done in different domains ... the guidance should
instead be to just set the allocations of one resource group to what is needed in the different
domains? There may be more flexibility when supporting per-domain RMIDs, but so far
it sounds as though the focus is global. We can consider what needs to be done to support
some type of "per-domain" assignment as an exercise to see whether the current interface could
support it in the future.
...
>>> There are multiple ways this feature can be applied. For simplicity, the discussion below focuses only on CLOSID.
>>>
>>>
>>> 1. Global PLZA enablement
>>>
>>> PLZA can be configured as a global feature by setting |PQR_PLZA_ASSOC.closid = CLOSID| and |PQR_PLZA_ASSOC.plza_en = 1| on all threads in the system. A dedicated CLOSID is reserved for this purpose,
>>
>> Also discussed during v1 is that there is no need to dedicate a CLOSID for this purpose.
>> There could be an "unthrottled" CLOSID to which all high priority user space tasks as
>> well as all kernel work of all tasks are assigned.
>> If user space chooses to dedicate a CLOSID for kernel work then that should supported and
>> interface can allow that, but there is no need for resctrl to enforce this.
(above is comment about dedicated group - please see below)
> Yes. I agree. The changes in context switch code is a concern.
>
> You covered some of the cases I was thinking(xx_set_individual).
>
> How about this idea?
>
> I suggest splitting the PLZA into two distinct aspects:
>
> 1. How PLZA is applied within a resource group
>
> 2. How PLZA is monitored
I think I see where you are going here. While the "How PLZA is monitored" naming
refers to "monitoring" I *think* what you are separating here is (a) how PLZA is configured
(CLOSID and RMID settings) and (b) how that PLZA configuration is assigned to tasks/CPUs,
not just within a resource group but across the system. Please see below.
> Introduce a new file, "info/kmode_type", to describe how kmode applies in the system.
ack. "in the system" as you have above, not "within a resource group" as mentioned
before that.
>
> # cat info/kmode_type
> [global] <- Kernel mode applies to the entire system (all CPUs/tasks)
> cpus <- Kernel mode applies only to the CPUs in the group
> tasks <- Kernel mode applies only to the tasks in the group
>
> The "global" option is the default right now and it is current common use-case.
>
> The "info/kmode_type -> cpus" option introduces new files
> "kmode_cpus" and "kmode_cpus_list" for users to apply kmode to
> specific set of CPUs. This lets users change the CPU set for PLZA.
Where were you thinking about placing these files in the hierarchy?
> The PLZA MSR is updated when user changes the association to the
> file. No context switch code changes are needed. This will be
> dedicated group. The current resctrl group files, "cpus, cpus_list
Why does this have to be a dedicated group? One of the conclusions from v1
discussion was that the "PLZA group" need *not* be a dedicated group. I repeated that
in my earlier response that I left quoted above. You did not respond to these
conclusions and statements in this regard while you keep coming back to this
needing to be a dedicated group without providing a motivation to do so.
Could you please elaborate why a dedicated group is required?
> and tasks" will not be accessible in this mode. This option give
These files can continue to be accessible.
> some flexibility for the user without the context switch overhead.
Dedicating a resource group to PLZA removes flexibility though, no?
>
> The "info/kmode_type -> tasks" option introduces a new file,
> "kmode_tasks", for users to apply kmode to specific set of tasks.
> This requires context switch changes. This will be dedicated group.
> The current resctrl group files, "cpus, cpus_list and tasks" will
> not be accessible in this mode. We currently have no use case for
> this, so it will not be supported now.
Thank you for confirming. This is a relief.
>
>
> Add a file, "info/kmode_monitor", to describe how kmode is monitored.
>
> # cat info/kmode_monitor
> [inherit_ctrl_and_mon] <- Kernel uses the same CLOSID/RMID as user. Default option for the "global"
> assign_ctrl_inherit_mon <- One CLOSID for all kernel work; RMID inherited from user.
> assign_ctrl_assign_mon <- One resource group (CLOSID+RMID) for all kernel work. Default option for "cpu" type.
My first thought is that the naming is confusing. resctrl has a very strong relationship between
"RMID" and "monitoring" so naming a file "monitor" that deals with allocation/ctrl/CLOSID is
potentially confusing.
Apart from that, while I think I understand where you are going by separating the mode into
two files I am concerned about future complications needing to accommodate all different
combinations of the (now) essentially two modes. My preference is thus to keep this simple by
keeping the mode within one file.
Even so, when stepping back, it does not really look like we need to separate the "global"
and "per CPU" modes. We could just have a single "per CPU" mode and the "global" is just
its default of "all CPUs", no?
Consider, for example, the implementation just consisting of:
# cat info/kernel_mode
[inherit_ctrl_and_mon]
global_assign_ctrl_inherit_mon_per_cpu
global_assign_ctrl_assign_mon_per_cpu
>
> Rename “kernel_mode_assignment” to “kmode_group” to assign the specific group to kmode. This file usage is same as before.
>
> #cat info/kmode_groups (Renamed "kernel_mode_assignment")
> //
Please consider the intent of this file when thinking about names. The idea is that "info/kernel_mode"
specifies the "mode" of how kernel work is handled and it determines the configuration files used in that
mode as well as the syntax when interacting with those files. By renaming "kernel_mode_assignment" to
"kmode_groups" it implicitly requires all future kernel mode enhancements to need some data related to "groups".
In summary, I think this can be simplified by introducing just two new files in info/ that enable the
user to (a) select and (b) configure the "kernel mode". To start there can be just two modes,
global_assign_ctrl_inherit_mon_per_cpu and global_assign_ctrl_assign_mon_per_cpu.
global_assign_ctrl_inherit_mon_per_cpu mode requires a control group in kernel_mode_assignment while
global_assign_ctrl_assign_mon_per_cpu requires a control and monitoring group.
The resource group in info/kernel_mode_assignment gets two additional files "kernel_mode_cpus" and
"kernel_mode_cpus_list" that contain the CPUs enabled with the kernel mode configuration; by default
these will be all online CPUs. The resource group can continue to be used to manage allocations of and
monitor user space tasks. Specifically, the "cpus", "cpus_list", and "tasks" files remain.
A user wanting just "global" settings will get just that when writing the group to
info/kernel_mode_assignment. A user wanting "per CPU" settings can follow the
info/kernel_mode_assignment setting with changes to that resource group's kernel_mode_cpus/kernel_mode_cpus_list
files. Any task running on a CPU that is *not* in kernel_mode_cpus/kernel_mode_cpus_list can be
expected to inherit both CLOSID and RMID from user space for all kernel work.
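As a rough illustration of the per-CPU behaviour summarized above, the sketch below picks the CLOSID and RMID that kernel work would run with on a given CPU. It is a userspace model, not resctrl code: the struct layout, the 64-bit CPU mask, and the function name are assumptions made for the example; only the mode names and the inherit/assign semantics come from this discussion.

```c
#include <assert.h>
#include <stdint.h>

/* Mode names follow the proposal; the enum itself is hypothetical. */
enum kernel_mode {
	INHERIT_CTRL_AND_MON,
	GLOBAL_ASSIGN_CTRL_INHERIT_MON_PER_CPU,
	GLOBAL_ASSIGN_CTRL_ASSIGN_MON_PER_CPU,
};

struct ids {
	uint32_t closid;
	uint32_t rmid;
};

struct kernel_mode_cfg {
	enum kernel_mode mode;
	struct ids group;		/* group written to kernel_mode_assignment */
	uint64_t kernel_mode_cpus;	/* bit n set: CPU n uses the kernel mode config */
};

/* Return the IDs for kernel work on @cpu, given the user task's own IDs. */
struct ids kernel_ids_for_cpu(const struct kernel_mode_cfg *cfg,
			      unsigned int cpu, struct ids user)
{
	struct ids out = user;	/* default: inherit both from user space */

	assert(cpu < 64);	/* this toy model only covers 64 CPUs */
	if (cfg->mode == INHERIT_CTRL_AND_MON ||
	    !(cfg->kernel_mode_cpus & (UINT64_C(1) << cpu)))
		return out;

	out.closid = cfg->group.closid;
	if (cfg->mode == GLOBAL_ASSIGN_CTRL_ASSIGN_MON_PER_CPU)
		out.rmid = cfg->group.rmid;
	return out;
}
```

With every bit set in kernel_mode_cpus this degenerates to the "global" case, which is why a separate global mode does not seem necessary.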
Reinette
* Re: [PATCH v9 02/10] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
From: Jon Kohler @ 2026-04-07 17:12 UTC
To: Jim Mattson
Cc: Pawan Gupta, x86@kernel.org, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, David Kaplan, Sean Christopherson,
Borislav Petkov, Dave Hansen, Peter Zijlstra, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, KP Singh, Jiri Olsa,
David S. Miller, David Laight, Andy Lutomirski, Thomas Gleixner,
Ingo Molnar, David Ahern, Martin KaFai Lau, Eduard Zingerman,
Song Liu, Yonghong Song, John Fastabend, Stanislav Fomichev,
Hao Luo, Paolo Bonzini, Jonathan Corbet,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Asit Mallick,
Tao Zhang, bpf@vger.kernel.org, netdev@vger.kernel.org,
linux-doc@vger.kernel.org, chao.gao@intel.com
In-Reply-To: <CALMp9eTA3cXxuOT4dq=6y1hx52gPH1ywwTEmPQ5-fA-vz6r3VQ@mail.gmail.com>
> On Apr 7, 2026, at 11:46 AM, Jim Mattson <jmattson@google.com> wrote:
>
> On Tue, Apr 7, 2026 at 9:40 AM Pawan Gupta
> <pawan.kumar.gupta@linux.intel.com> wrote:
>>
>> On Mon, Apr 06, 2026 at 07:23:25AM -0700, Jim Mattson wrote:
>>> Yes, but the guest needs a way to determine whether the hypervisor
>>> will do what's necessary to make the short sequence effective. And, in
>>> particular, no KVM hypervisor today is prepared to do that.
>>>
>>> When running under a hypervisor, without BHI_CTRL and without any
>>> evidence to the contrary, the guest must assume that the longer
>>> sequence is necessary. At the very least, we need a CPUID or MSR bit
>>> that says, "the short BHB clearing sequence is adequate for this
>>> vCPU."
>>
>> After discussing this internally, the consensus is that the best path
>> forward is to add virtual SPEC_CTRL support to KVM, which also aligns with
>> Intel's guidance. In the long term, virtual SPEC_CTRL can benefit future
>> mitigations as well. As with many other mitigations (e.g. microcode), the
>> guest would rely on the host to enforce the appropriate protections.
Would we have to wait for virtual SPEC_CTRL to get this optimization?
Or would that be a future enhancement to make this more prescriptive?
>
> I don't think it's reasonable for the guest to rely on a future
> implementation to enforce the appropriate protections.
>
> This is already a problem today. If a guest sees that BHI_CTRL is
> unavailable, it will deploy the short BHB clearing sequence and
> declare that the vulnerability is mitigated. That isn't true if the
> guest is running on Alder Lake or newer.
* Re: [PATCH v9 02/10] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
From: Pawan Gupta @ 2026-04-07 17:11 UTC
To: Jim Mattson
Cc: x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin, Josh Poimboeuf,
David Kaplan, Sean Christopherson, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, KP Singh, Jiri Olsa, David S. Miller,
David Laight, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
David Ahern, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, Stanislav Fomichev, Hao Luo,
Paolo Bonzini, Jonathan Corbet, linux-kernel, kvm, Asit Mallick,
Tao Zhang, bpf, netdev, linux-doc, chao.gao
In-Reply-To: <CALMp9eTA3cXxuOT4dq=6y1hx52gPH1ywwTEmPQ5-fA-vz6r3VQ@mail.gmail.com>
On Tue, Apr 07, 2026 at 09:46:07AM -0700, Jim Mattson wrote:
> On Tue, Apr 7, 2026 at 9:40 AM Pawan Gupta
> <pawan.kumar.gupta@linux.intel.com> wrote:
> >
> > On Mon, Apr 06, 2026 at 07:23:25AM -0700, Jim Mattson wrote:
> > > Yes, but the guest needs a way to determine whether the hypervisor
> > > will do what's necessary to make the short sequence effective. And, in
> > > particular, no KVM hypervisor today is prepared to do that.
> > >
> > > When running under a hypervisor, without BHI_CTRL and without any
> > > evidence to the contrary, the guest must assume that the longer
> > > sequence is necessary. At the very least, we need a CPUID or MSR bit
> > > that says, "the short BHB clearing sequence is adequate for this
> > > vCPU."
> >
> > After discussing this internally, the consensus is that the best path
> > forward is to add virtual SPEC_CTRL support to KVM, which also aligns with
> > Intel's guidance. In the long term, virtual SPEC_CTRL can benefit future
> > mitigations as well. As with many other mitigations (e.g. microcode), the
> > guest would rely on the host to enforce the appropriate protections.
>
> I don't think it's reasonable for the guest to rely on a future
> implementation to enforce the appropriate protections.
>
> This is already a problem today. If a guest sees that BHI_CTRL is
> unavailable, it will deploy the short BHB clearing sequence and
> declare that the vulnerability is mitigated. That isn't true if the
> guest is running on Alder Lake or newer.
In any case, a kernel change is required in either the guest or the host;
both are future implementations. Why not implement the one that is more
future-proof?
* Re: [PATCH v7 1/9] KVM: x86: Define KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT
From: Sean Christopherson @ 2026-04-07 17:00 UTC
To: Jim Mattson
Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
kvm, linux-doc, linux-kernel, linux-kselftest, Yosry Ahmed
In-Reply-To: <CALMp9eQR_ZivpcARLyvDK3w+frpwU8bj2Z+ZvA_fLdCtTq3Vhg@mail.gmail.com>
On Tue, Apr 07, 2026, Jim Mattson wrote:
> On Mon, Apr 6, 2026 at 4:27 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Mar 27, 2026, Jim Mattson wrote:
> > > diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> > > index ff1e4b4dc998..74014110b550 100644
> > > --- a/arch/x86/kvm/svm/svm.h
> > > +++ b/arch/x86/kvm/svm/svm.h
> > > @@ -616,6 +616,17 @@ static inline bool nested_npt_enabled(struct vcpu_svm *svm)
> > > return svm->nested.ctl.misc_ctl & SVM_MISC_ENABLE_NP;
> > > }
> > >
> > > +static inline bool l2_has_separate_pat(struct vcpu_svm *svm)
> >
> > Take @vcpu instead of @svm. All of the callers have a "vcpu", but not all have
> > a local "svm". That will shorten the quirk check far enough to let it poke out.
>
> What is the actual line length limit?
There's a "medium-firm" limit at 80 and a "mostly-hard" limit at 100. 100 isn't
a true hard limit to allow for things like pre-formatted strings, and cases where
the only way to stay under 100 chars would (arguably) yield less readable code
overall, e.g. msr-index.h deliberately has this
#define MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI (1ULL << MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI_BIT)
and not
#define MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI \
(1ULL << MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI_BIT)
> > > +{
> > > + /*
> > > + * If KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is disabled while a vCPU
> > > + * is running, the L2 IA32_PAT semantics for that vCPU are undefined.
> > > + */
> > > + return nested_npt_enabled(svm) &&
> > > + !kvm_check_has_quirk(svm->vcpu.kvm,
> > > + KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT);
> >
> > Align indentation. With the @svm => @vcpu change, this becomes:
> >
> > return nested_npt_enabled(to_svm(vcpu)) &&
> > !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT);
>
> You wouldn't happen to know the Emacs configuration for the alignment
> you like, would you? I asked Gemini, but it lied to me.
Heh, no. Any time I unintentionally end up in Emacs, I have to do a search just
to figure out how to save and exit :-)
* Re: [PATCH v7 8/9] KVM: x86: nSVM: Save/restore gPAT with KVM_{GET,SET}_NESTED_STATE
From: Sean Christopherson @ 2026-04-07 16:54 UTC
To: Jim Mattson
Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
kvm, linux-doc, linux-kernel, linux-kselftest, Yosry Ahmed
In-Reply-To: <CALMp9eSysKOVGF_xakbT59tVsgER6oEYpJuK9=hQutjY=ZpM-A@mail.gmail.com>
On Tue, Apr 07, 2026, Jim Mattson wrote:
> On Tue, Apr 7, 2026 at 7:14 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > use_separate_l2_pat = (ctl_cached.misc_ctl & SVM_MISC_ENABLE_NP);
> > > > if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT))
> > > > use_separate_l2_pat = false;
> > >
> > > Wow. I really have no idea how to predict what you're going to want
> > > the code to look like. How is this better than the original?!?
> >
> > It doesn't immediately wrap after the "=". Similar to my view on wrapping before
> > function names[*], I find wrapping immediately after an assignment operator to be
> > unnecessarily difficult to read as it doesn't provide any context for single-line
> > searches.
>
> That's actually a good argument to *never* wrap a line. If a line is
> broken at all, the interesting context might follow the line break.
Don't let perfect be the enemy of good. :-)
> > I'm pretty darn consistent in my dislike for that style: I count 26 instances in
> > arch/x86/kvm that match "\s=\n", and only two of those carry my SoB or R-b. I
> > simply missed the wrap in kvm_vcpu_apicv_activated() that was added by commit
> > 896046474f8d ("KVM: x86: Introduce kvm_x86_call() to simplify static calls of
> > kvm_x86_ops"), and I'll give myself a pass for commit 8764ed55c970 ("KVM: x86:
> > Whitelist port 0x7e for pre-incrementing %rip") as that predates treating
> > checkpatch's 80 char limit as a soft limit.
>
> Might I suggest that you should provide a tool—something like
> checkpatch.pl—that flags style violations?
Or maybe extend checkpatch with an optional "feature"? Or subsystem-specific
rules?
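As a sketch of what such an optional rule might look for, the function below flags a source line that wraps immediately after an assignment, i.e. the "\s=\n" pattern mentioned above. It is a standalone C illustration rather than a checkpatch.pl change (checkpatch is a Perl script), and the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if @line (a single line, with or without its trailing
 * newline) ends in whitespace followed by a lone '=', mirroring the
 * "\s=\n" regex. Compound operators such as "+=" or "==" do not match
 * because the character before the final '=' is not whitespace.
 */
bool wraps_after_assignment(const char *line)
{
	size_t len;

	assert(line != NULL);
	len = strlen(line);
	if (len > 0 && line[len - 1] == '\n')
		len--;
	if (len < 2 || line[len - 1] != '=')
		return false;
	return line[len - 2] == ' ' || line[len - 2] == '\t';
}
```

Like the regex, this does not try to tell code from comments or strings, so a real rule would need more context than a single line.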
* Re: [PATCH v9 02/10] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
From: Jim Mattson @ 2026-04-07 16:46 UTC
To: Pawan Gupta
Cc: x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin, Josh Poimboeuf,
David Kaplan, Sean Christopherson, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, KP Singh, Jiri Olsa, David S. Miller,
David Laight, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
David Ahern, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, Stanislav Fomichev, Hao Luo,
Paolo Bonzini, Jonathan Corbet, linux-kernel, kvm, Asit Mallick,
Tao Zhang, bpf, netdev, linux-doc, chao.gao
In-Reply-To: <20260407163943.y6tkh26z2rfktn3y@desk>
On Tue, Apr 7, 2026 at 9:40 AM Pawan Gupta
<pawan.kumar.gupta@linux.intel.com> wrote:
>
> On Mon, Apr 06, 2026 at 07:23:25AM -0700, Jim Mattson wrote:
> > Yes, but the guest needs a way to determine whether the hypervisor
> > will do what's necessary to make the short sequence effective. And, in
> > particular, no KVM hypervisor today is prepared to do that.
> >
> > When running under a hypervisor, without BHI_CTRL and without any
> > evidence to the contrary, the guest must assume that the longer
> > sequence is necessary. At the very least, we need a CPUID or MSR bit
> > that says, "the short BHB clearing sequence is adequate for this
> > vCPU."
>
> After discussing this internally, the consensus is that the best path
> forward is to add virtual SPEC_CTRL support to KVM, which also aligns with
> Intel's guidance. In the long term, virtual SPEC_CTRL can benefit future
> mitigations as well. As with many other mitigations (e.g. microcode), the
> guest would rely on the host to enforce the appropriate protections.
I don't think it's reasonable for the guest to rely on a future
implementation to enforce the appropriate protections.
This is already a problem today. If a guest sees that BHI_CTRL is
unavailable, it will deploy the short BHB clearing sequence and
declare that the vulnerability is mitigated. That isn't true if the
guest is running on Alder Lake or newer.
* Re: [PATCH v9 02/10] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
From: Pawan Gupta @ 2026-04-07 16:39 UTC
To: Jim Mattson
Cc: x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin, Josh Poimboeuf,
David Kaplan, Sean Christopherson, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, KP Singh, Jiri Olsa, David S. Miller,
David Laight, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
David Ahern, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, Stanislav Fomichev, Hao Luo,
Paolo Bonzini, Jonathan Corbet, linux-kernel, kvm, Asit Mallick,
Tao Zhang, bpf, netdev, linux-doc, chao.gao
In-Reply-To: <CALMp9eR70eE2U63gzNzTiic0PqJVGv3CBBuVUOVbi3nqbWKZkQ@mail.gmail.com>
On Mon, Apr 06, 2026 at 07:23:25AM -0700, Jim Mattson wrote:
> Yes, but the guest needs a way to determine whether the hypervisor
> will do what's necessary to make the short sequence effective. And, in
> particular, no KVM hypervisor today is prepared to do that.
>
> When running under a hypervisor, without BHI_CTRL and without any
> evidence to the contrary, the guest must assume that the longer
> sequence is necessary. At the very least, we need a CPUID or MSR bit
> that says, "the short BHB clearing sequence is adequate for this
> vCPU."
After discussing this internally, the consensus is that the best path
forward is to add virtual SPEC_CTRL support to KVM, which also aligns with
Intel's guidance. In the long term, virtual SPEC_CTRL can benefit future
mitigations as well. As with many other mitigations (e.g. microcode), the
guest would rely on the host to enforce the appropriate protections.
* Re: [PATCH v5 2/3] ima: trim N IMA event log records
From: Roberto Sassu @ 2026-04-07 16:19 UTC
To: steven chen, linux-integrity
Cc: zohar, roberto.sassu, dmitry.kasatkin, eric.snowberg, corbet,
serge, paul, jmorris, linux-security-module, anirudhve,
gregorylumen, nramas, sushring, linux-doc
In-Reply-To: <20260401172956.4581-3-chenste@linux.microsoft.com>
On Wed, 2026-04-01 at 10:29 -0700, steven chen wrote:
> Trim N entries of the IMA event logs. Do not clean the hash table.
The very first change in this patch is the ima_flush_htable kernel
option that I introduced for my use case.
At the bottom of this patch you actually check the ima_flush_htable
boolean, and delete the measurement entries without disconnecting them
from the hash table, so the digest lookup is done on freed memory.
Next, you duplicated my changes regarding the measurements list
counter. But instead of removing the old counter from the hash table,
you keep incrementing both, but use the new one.
In ima_log_trim_open(), you again use a copy of my code to manage the
exclusive-write/concurrent-read scheme for the measurement interfaces.
However, for read, if the process does not have CAP_SYS_ADMIN it falls
back to calling _ima_measurements_open(). I am not sure that was intended.
And, in ima_log_trim_release(), you check CAP_SYS_ADMIN again, which is
redundant: you would not reach this code if the same requirements were
not met at open time. You also return an error on close().
In ima_log_trim_write(), you do a manual string-to-number conversion for
the first number but use kstrtoul() for the second.
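For illustration, both fields of a "T:N" request could go through the same conversion routine. The sketch below is a userspace analogue built on strtoul() (kernel code would use the kstrto*() helpers instead); the function names and the exact error handling are assumptions, not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Convert one unsigned decimal field; *end is set to the first unparsed char. */
static int parse_ulong(const char *s, unsigned long *val, const char **end)
{
	char *e;

	if (*s < '0' || *s > '9')	/* reject empty input, signs, whitespace */
		return -EINVAL;
	errno = 0;
	*val = strtoul(s, &e, 10);
	if (errno == ERANGE)
		return -ERANGE;
	*end = e;
	return 0;
}

/* Parse "T:N", optionally followed by one newline, into @t and @n. */
int parse_trim_request(const char *buf, unsigned long *t, unsigned long *n)
{
	const char *p;
	int ret;

	assert(buf && t && n);
	ret = parse_ulong(buf, t, &p);
	if (ret)
		return ret;
	if (*p != ':')
		return -EINVAL;
	ret = parse_ulong(p + 1, n, &p);
	if (ret)
		return ret;
	if (*p == '\n')
		p++;
	return *p == '\0' ? 0 : -EINVAL;
}
```

Using one helper for both numbers means overflow and trailing-garbage handling cannot diverge between T and N.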
The measurement list and the associated counter are atomically
updated in ima_add_digest_entry(), but not atomically accessed in
ima_delete_event_log(). Also, the measurement list is traversed
without the _rcu variant or a lock.
While this trimming scheme aims at minimizing the kernel space and user
space delay, it also introduces the following problem. If two agents
perform TPM quotes that include different numbers of entries, there
is no guarantee that the one willing to trim fewer entries wins. This
means that one agent could end up not seeing the most recent entries,
as they were already trimmed by the other agent.
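The race can be made concrete with a small model of the T:N handshake from the patch description (trim N only if T equals the running total, then advance the total by N). The struct and function names are hypothetical, the choice of -EAGAIN for a stale T is an assumption (the patch description only says "return error"), and the model ignores locking entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy model of the measurement list: counters only, no actual entries. */
struct ima_log_model {
	unsigned long total_trimmed;	/* T: entries trimmed since boot */
	unsigned long nr_entries;	/* entries currently in the list */
};

/* Trim @n entries if the caller's view @t of the running total is current. */
int model_trim(struct ima_log_model *log, unsigned long t, unsigned long n)
{
	assert(log != NULL);
	if (t != log->total_trimmed)
		return -EAGAIN;		/* stale T: another agent trimmed first */
	if (n > log->nr_entries)
		return -EINVAL;
	log->nr_entries -= n;
	log->total_trimmed += n;
	return 0;
}
```

If agents A and B both read T = 0, and B's request "0:5" lands before A's "0:3", B's trim succeeds and A's is rejected. The two entries that B's quote covered but A's did not are then already gone from the kernel list when A retries.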
My solution is not affected by this problem, since there will be only
one process collecting all the measurements in user space and exposing
them to the agents.
Also, I didn't understand why T and ima_measure_users have to be
preserved on soft reboots. In particular, ima_measure_users reflects the
state of open files for a particular kernel, but on soft reboot a new
kernel is booted.
I personally will not endorse a solution based on the ima_trim_log
interface. I could accept trimming N even more efficiently than we
currently do with a lockless walk to determine the cutting position in
ima_queue_stage(), so that we don't need to splice back entries to the
measurement list. This would be a replacement of patch 11 in my patch
set, but this would be as far as I would like to go.
Roberto
> The values saved in hash table were already used.
>
> Provide a userspace interface ima_trim_log:
> When read this interface, it returns total number T of entries trimmed
> since system boot up.
> When write to this interface need to provide two numbers T:N to let
> kernel to trim N entries of IMA event logs.
>
> Kernel measurement list lock time performance improvement by not
> clean the hash table.
>
> when kernel get log trim request T:N
> - Get the T, compare with the total trimmed number
> - if equal, then do trim N and change T to T+N
> - else return error
>
> Signed-off-by: steven chen <chenste@linux.microsoft.com>
> ---
> .../admin-guide/kernel-parameters.txt | 4 +
> security/integrity/ima/ima.h | 4 +-
> security/integrity/ima/ima_fs.c | 198 +++++++++++++++++-
> security/integrity/ima/ima_kexec.c | 2 +-
> security/integrity/ima/ima_queue.c | 96 +++++++++
> 5 files changed, 296 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index e92c0056e4e0..cd1a1d0bf0e2 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2197,6 +2197,10 @@
> Use the canonical format for the binary runtime
> measurements, instead of host native format.
>
> + ima_flush_htable [IMA]
> + Flush the measurement list hash table when trim all
> + or a part of it for deletion.
> +
> ima_hash= [IMA]
> Format: { md5 | sha1 | rmd160 | sha256 | sha384
> | sha512 | ... }
> diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
> index e3d71d8d56e3..5cbee3a295a0 100644
> --- a/security/integrity/ima/ima.h
> +++ b/security/integrity/ima/ima.h
> @@ -243,11 +243,13 @@ void ima_post_key_create_or_update(struct key *keyring, struct key *key,
> const void *payload, size_t plen,
> unsigned long flags, bool create);
> #endif
> -
> +extern atomic_long_t ima_number_entries;
> #ifdef CONFIG_IMA_KEXEC
> void ima_measure_kexec_event(const char *event_name);
> +long ima_delete_event_log(long req_val);
> #else
> static inline void ima_measure_kexec_event(const char *event_name) {}
> +static inline long ima_delete_event_log(long req_val) { return 0; }
> #endif
>
> /*
> diff --git a/security/integrity/ima/ima_fs.c b/security/integrity/ima/ima_fs.c
> index 87045b09f120..8e26e0f34311 100644
> --- a/security/integrity/ima/ima_fs.c
> +++ b/security/integrity/ima/ima_fs.c
> @@ -21,6 +21,9 @@
> #include <linux/rcupdate.h>
> #include <linux/parser.h>
> #include <linux/vmalloc.h>
> +#include <linux/ktime.h>
> +#include <linux/timekeeping.h>
> +#include <linux/ima.h>
>
> #include "ima.h"
>
> @@ -38,6 +41,17 @@ __setup("ima_canonical_fmt", default_canonical_fmt_setup);
>
> static int valid_policy = 1;
>
> +#define IMA_LOG_TRIM_REQ_NUM_LENGTH 15
> +#define IMA_LOG_TRIM_REQ_TOTAL_LENGTH 32
> +atomic_long_t ima_number_entries = ATOMIC_LONG_INIT(0);
> +static long trimcount;
> +/* mutex protects atomicity of trimming measurement list
> + * and also protects atomicity the measurement list read
> + * write operation.
> + */
> +static DEFINE_MUTEX(ima_measure_lock);
> +static long ima_measure_users;
> +
> static ssize_t ima_show_htable_value(char __user *buf, size_t count,
> loff_t *ppos, atomic_long_t *val)
> {
> @@ -64,8 +78,7 @@ static ssize_t ima_show_measurements_count(struct file *filp,
> char __user *buf,
> size_t count, loff_t *ppos)
> {
> - return ima_show_htable_value(buf, count, ppos, &ima_htable.len);
> -
> + return ima_show_htable_value(buf, count, ppos, &ima_number_entries);
> }
>
> static const struct file_operations ima_measurements_count_ops = {
> @@ -202,16 +215,77 @@ static const struct seq_operations ima_measurments_seqops = {
> .show = ima_measurements_show
> };
>
> +/*
> + * _ima_measurements_open - open the IMA measurements file
> + * @inode: inode of the file being opened
> + * @file: file being opened
> + * @seq_ops: sequence operations for the file
> + *
> + * Returns 0 on success, or negative error code.
> + * Implements mutual exclusion between readers and writer
> + * of the measurements file. Multiple readers are allowed,
> + * but writer get exclusive access only no other readers/writers.
> + * Readers is not allowed when there is a writer.
> + */
> +static int _ima_measurements_open(struct inode *inode, struct file *file,
> + const struct seq_operations *seq_ops)
> +{
> + bool write = !!(file->f_mode & FMODE_WRITE);
> + int ret;
> +
> + if (write && !capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + mutex_lock(&ima_measure_lock);
> + if ((write && ima_measure_users != 0) ||
> + (!write && ima_measure_users < 0)) {
> + mutex_unlock(&ima_measure_lock);
> + return -EBUSY;
> + }
> +
> + ret = seq_open(file, seq_ops);
> + if (ret < 0) {
> + mutex_unlock(&ima_measure_lock);
> + return ret;
> + }
> +
> + if (write)
> + ima_measure_users--;
> + else
> + ima_measure_users++;
> +
> + mutex_unlock(&ima_measure_lock);
> + return ret;
> +}
> +
> static int ima_measurements_open(struct inode *inode, struct file *file)
> {
> - return seq_open(file, &ima_measurments_seqops);
> + return _ima_measurements_open(inode, file, &ima_measurments_seqops);
> +}
> +
> +static int ima_measurements_release(struct inode *inode, struct file *file)
> +{
> + bool write = !!(file->f_mode & FMODE_WRITE);
> + int ret;
> +
> + mutex_lock(&ima_measure_lock);
> + ret = seq_release(inode, file);
> + if (!ret) {
> + if (!write)
> + ima_measure_users--;
> + else
> + ima_measure_users++;
> + }
> +
> + mutex_unlock(&ima_measure_lock);
> + return ret;
> }
>
> static const struct file_operations ima_measurements_ops = {
> .open = ima_measurements_open,
> .read = seq_read,
> .llseek = seq_lseek,
> - .release = seq_release,
> + .release = ima_measurements_release,
> };
>
> void ima_print_digest(struct seq_file *m, u8 *digest, u32 size)
> @@ -279,14 +353,114 @@ static const struct seq_operations ima_ascii_measurements_seqops = {
>
> static int ima_ascii_measurements_open(struct inode *inode, struct file *file)
> {
> - return seq_open(file, &ima_ascii_measurements_seqops);
> + return _ima_measurements_open(inode, file, &ima_ascii_measurements_seqops);
> }
>
> static const struct file_operations ima_ascii_measurements_ops = {
> .open = ima_ascii_measurements_open,
> .read = seq_read,
> .llseek = seq_lseek,
> - .release = seq_release,
> + .release = ima_measurements_release,
> +};
> +
> +static int ima_log_trim_open(struct inode *inode, struct file *file)
> +{
> + bool write = !!(file->f_mode & FMODE_WRITE);
> +
> + if (!write && capable(CAP_SYS_ADMIN))
> + return 0;
> + else if (!capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + return _ima_measurements_open(inode, file, &ima_measurments_seqops);
> +}
> +
> +static ssize_t ima_log_trim_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
> +{
> + char tmpbuf[IMA_LOG_TRIM_REQ_NUM_LENGTH];
> + ssize_t len;
> +
> + len = scnprintf(tmpbuf, sizeof(tmpbuf), "%li\n", trimcount);
> + return simple_read_from_buffer(buf, size, ppos, tmpbuf, len);
> +}
> +
> +static ssize_t ima_log_trim_write(struct file *file,
> + const char __user *buf, size_t datalen, loff_t *ppos)
> +{
> + char tmpbuf[IMA_LOG_TRIM_REQ_TOTAL_LENGTH];
> + char *p = tmpbuf;
> + long count, ret, val = 0, max = LONG_MAX;
> +
> + if (*ppos > 0 || datalen > IMA_LOG_TRIM_REQ_TOTAL_LENGTH || datalen < 2) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + if (copy_from_user(tmpbuf, buf, datalen) != 0) {
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + p = tmpbuf;
> +
> + while (*p && *p != ':') {
> + if (!isdigit((unsigned char)*p))
> + return -EINVAL;
> +
> + /* digit value */
> + int d = *p - '0';
> +
> + /* overflow check: val * 10 + d > max -> (val > (max - d) / 10) */
> + if (val > (max - d) / 10)
> + return -ERANGE;
> +
> + val = val * 10 + d;
> + p++;
> + }
> +
> + if (*p != ':')
> + return -EINVAL;
> +
> + /* verify trim count matches */
> + if (val != trimcount)
> + return -EINVAL;
> +
> + p++; /* skip ':' */
> + ret = kstrtoul(p, 0, &count);
> +
> + if (ret < 0)
> + goto out;
> +
> + ret = ima_delete_event_log(count);
> +
> + if (ret < 0)
> + goto out;
> +
> + trimcount += ret;
> +
> + ret = datalen;
> +out:
> + return ret;
> +}
> +
> +static int ima_log_trim_release(struct inode *inode, struct file *file)
> +{
> + bool write = !!(file->f_mode & FMODE_WRITE);
> +
> + if (!write && capable(CAP_SYS_ADMIN))
> + return 0;
> + else if (!capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + return ima_measurements_release(inode, file);
> +}
> +
> +static const struct file_operations ima_log_trim_ops = {
> + .open = ima_log_trim_open,
> + .read = ima_log_trim_read,
> + .write = ima_log_trim_write,
> + .llseek = generic_file_llseek,
> + .release = ima_log_trim_release
> };
>
> static ssize_t ima_read_policy(char *path)
> @@ -528,6 +702,18 @@ int __init ima_fs_init(void)
> goto out;
> }
>
> + if (IS_ENABLED(CONFIG_IMA_LOG_TRIMMING)) {
> + dentry = securityfs_create_file("ima_trim_log",
> + S_IRUSR | S_IRGRP | S_IWUSR | S_IWGRP,
> + ima_dir, NULL, &ima_log_trim_ops);
> + if (IS_ERR(dentry)) {
> + ret = PTR_ERR(dentry);
> + goto out;
> + }
> + }
> +
> + trimcount = 0;
> +
> dentry = securityfs_create_file("runtime_measurements_count",
> S_IRUSR | S_IRGRP, ima_dir, NULL,
> &ima_measurements_count_ops);
> diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c
> index 7362f68f2d8b..bee997683e03 100644
> --- a/security/integrity/ima/ima_kexec.c
> +++ b/security/integrity/ima/ima_kexec.c
> @@ -41,7 +41,7 @@ void ima_measure_kexec_event(const char *event_name)
> int n;
>
> buf_size = ima_get_binary_runtime_size();
> - len = atomic_long_read(&ima_htable.len);
> + len = atomic_long_read(&ima_number_entries);
>
> n = scnprintf(ima_kexec_event, IMA_KEXEC_EVENT_LEN,
> "kexec_segment_size=%lu;ima_binary_runtime_size=%lu;"
> diff --git a/security/integrity/ima/ima_queue.c b/security/integrity/ima/ima_queue.c
> index 590637e81ad1..07225e19b9b5 100644
> --- a/security/integrity/ima/ima_queue.c
> +++ b/security/integrity/ima/ima_queue.c
> @@ -22,6 +22,14 @@
>
> #define AUDIT_CAUSE_LEN_MAX 32
>
> +bool ima_flush_htable;
> +static int __init ima_flush_htable_setup(char *str)
> +{
> + ima_flush_htable = true;
> + return 1;
> +}
> +__setup("ima_flush_htable", ima_flush_htable_setup);
> +
> /* pre-allocated array of tpm_digest structures to extend a PCR */
> static struct tpm_digest *digests;
>
> @@ -114,6 +122,7 @@ static int ima_add_digest_entry(struct ima_template_entry *entry,
> list_add_tail_rcu(&qe->later, &ima_measurements);
>
> atomic_long_inc(&ima_htable.len);
> + atomic_long_inc(&ima_number_entries);
> if (update_htable) {
> key = ima_hash_key(entry->digests[ima_hash_algo_idx].digest);
> hlist_add_head_rcu(&qe->hnext, &ima_htable.queue[key]);
> @@ -220,6 +229,93 @@ int ima_add_template_entry(struct ima_template_entry *entry, int violation,
> return result;
> }
>
> +/**
> + * ima_delete_event_log - delete IMA event entry
> + * @num_records: number of records to delete
> + *
> + * delete num_records entries off the measurement list.
> + * Returns num_records, or negative error code.
> + */
> +long ima_delete_event_log(long num_records)
> +{
> + long len, cur = num_records, tmp_len = 0;
> + struct ima_queue_entry *qe, *qe_tmp;
> + LIST_HEAD(ima_measurements_to_delete);
> + struct list_head *list_ptr;
> +
> + if (!IS_ENABLED(CONFIG_IMA_LOG_TRIMMING))
> + return -EOPNOTSUPP;
> +
> + if (num_records <= 0)
> + return num_records;
> +
> + list_ptr = &ima_measurements;
> +
> + len = atomic_long_read(&ima_number_entries);
> +
> + if (num_records <= len) {
> + list_for_each_entry(qe, list_ptr, later) {
> + if (cur > 0) {
> + tmp_len += get_binary_runtime_size(qe->entry);
> + --cur;
> + }
> + if (cur == 0) {
> + qe_tmp = qe;
> + break;
> + }
> + }
> + }
> + else {
> + return -ENOENT;
> + }
> +
> +
> + mutex_lock(&ima_extend_list_mutex);
> + len = atomic_long_read(&ima_number_entries);
> +
> + if (num_records == len) {
> + list_replace(&ima_measurements, &ima_measurements_to_delete);
> + INIT_LIST_HEAD(&ima_measurements);
> + atomic_long_set(&ima_number_entries, 0);
> + list_ptr = &ima_measurements_to_delete;
> + }
> + else {
> + __list_cut_position(&ima_measurements_to_delete, &ima_measurements,
> + &qe_tmp->later);
> + atomic_long_sub(num_records, &ima_number_entries);
> + if (IS_ENABLED(CONFIG_IMA_KEXEC))
> + binary_runtime_size -= tmp_len;
> + }
> +
> + mutex_unlock(&ima_extend_list_mutex);
> +
> + if (ima_flush_htable)
> + synchronize_rcu();
> +
> + list_for_each_entry_safe(qe, qe_tmp, &ima_measurements_to_delete, later) {
> + /*
> + * Ok because after list delete qe is only accessed by
> + * ima_lookup_digest_entry().
> + */
> + for (int i = 0; i < qe->entry->template_desc->num_fields; i++) {
> + kfree(qe->entry->template_data[i].data);
> + qe->entry->template_data[i].data = NULL;
> + qe->entry->template_data[i].len = 0;
> + }
> +
> + list_del(&qe->later);
> +
> + /* No leak if !ima_flush_htable, referenced by ima_htable. */
> + if (ima_flush_htable) {
> + kfree(qe->entry->digests);
> + kfree(qe->entry);
> + kfree(qe);
> + }
> + }
> +
> + return num_records;
> +}
> +
> int ima_restore_measurement_entry(struct ima_template_entry *entry)
> {
> int result = 0;
^ permalink raw reply
* Re: [PATCH v7 1/9] KVM: x86: Define KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT
From: Jim Mattson @ 2026-04-07 16:27 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
kvm, linux-doc, linux-kernel, linux-kselftest, Yosry Ahmed
In-Reply-To: <adRBZuqNlBozaDrK@google.com>
On Mon, Apr 6, 2026 at 4:27 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Mar 27, 2026, Jim Mattson wrote:
> > diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> > index ff1e4b4dc998..74014110b550 100644
> > --- a/arch/x86/kvm/svm/svm.h
> > +++ b/arch/x86/kvm/svm/svm.h
> > @@ -616,6 +616,17 @@ static inline bool nested_npt_enabled(struct vcpu_svm *svm)
> > return svm->nested.ctl.misc_ctl & SVM_MISC_ENABLE_NP;
> > }
> >
> > +static inline bool l2_has_separate_pat(struct vcpu_svm *svm)
>
> Take @vcpu instead of @svm. All of the callers have a "vcpu", but not all have
> a local "svm". That will shorten the quirk check far enough to let it poke out.
What is the actual line length limit?
> > +{
> > + /*
> > + * If KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is disabled while a vCPU
> > + * is running, the L2 IA32_PAT semantics for that vCPU are undefined.
> > + */
> > + return nested_npt_enabled(svm) &&
> > + !kvm_check_has_quirk(svm->vcpu.kvm,
> > + KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT);
>
> Align indentation. With the @svm => @vcpu change, this becomes:
>
> return nested_npt_enabled(to_svm(vcpu)) &&
> !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT);
You wouldn't happen to know the Emacs configuration for the alignment
you like, would you? I asked Gemini, but it lied to me.
> > +}
> > +
> > static inline bool nested_vnmi_enabled(struct vcpu_svm *svm)
> > {
> > return guest_cpu_cap_has(&svm->vcpu, X86_FEATURE_VNMI) &&
> > --
> > 2.53.0.1018.g2bb0e51243-goog
> >
^ permalink raw reply
* Re: [PATCH] hwmon: (yogafan) various markup improvements
From: Sergio Melas @ 2026-04-07 16:12 UTC (permalink / raw)
To: Guenter Roeck
Cc: Randy Dunlap, linux-kernel, linux-hwmon, Jonathan Corbet,
Shuah Khan, linux-doc
In-Reply-To: <7752cce3-3362-42c0-becd-96dbc7b17cab@roeck-us.net>
Hi Guenter,
My apologies for the confusion—I am still learning the standard
workflow. I understand now why applying Randy’s patch immediately is
the correct move.
When I mentioned the "next version," I was thinking about a major
expansion I am currently preparing (v1, second round). It expands
support to nearly all Lenovo and Xiaoxin models. Because the database
has grown so much, I’ve had to significantly change the table format
in the .rst file to keep it readable. So I was referring to this new
table (see below). I am fully open to modifying the format if you think
it is not OK.
As an automation engineer, this process is quite new to me, so I
appreciate your patience as I learn the proper terms and procedures. I
will ensure my next submission is rebased on your current tree with
Randy's improvements.
Best regards, Sergio
::
================================================
LENOVO FAN CONTROLLER Hardware Abstraction Layer
================================================
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| MODEL       | FAMILY / SERIES   | OFFSET  | FULL ACPI OBJECT PATH          | WIDTH  | NMAX  | RMAX  | MULT |
+=============+===================+=========+================================+========+=======+=======+======+
| 82N7        | Yoga 14cACN       | 0x06    | _SB.PCI0.LPC0.EC0.FANS         | 8-bit  | 0     | 5500  | 100  |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 83E2        | Yoga Pro 9i       | 0xFE/FF | _SB.PCI0.LPC0.EC0.FANS (Fan1)  | 16-bit | 0     | 8000  | 1    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 83E2        | Yoga Pro 9i       | 0xFE/FF | _SB.PCI0.LPC0.EC0.FA2S (Fan2)  | 16-bit | 0     | 8000  | 1    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 83CV        | Yoga Pro 9 (Aura) | 0xFE    | _SB.PCI0.LPC0.EC0.FANS         | 8-bit  | 0     | 6000  | 100  |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 83DN        | Yoga Pro 7        | 0xFE    | _SB.PCI0.LPC0.EC0.FANS         | 8-bit  | 0     | 6000  | 100  |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 82A2 / 82A3 | Yoga Slim 7       | 0x06    | _SB.PCI0.LPC0.EC0.FANS         | 8-bit  | 0     | 5500  | 100  |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 83JC / 83DX | Xiaoxin Pro 14/16 | 0xFE    | _SB.PCI0.LPC0.EC0.FANS         | 8-bit  | 80    | 5000  | 0    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 83FD / 83DE | Xiaoxin Pro       | 0xFE/FF | _SB.PCI0.LPC0.EC0.FAN0/.FANS   | 8-bit  | 0     | 5000  | 100  |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 81YM / 82FG | IdeaPad 5         | 0x06    | _SB.PCI0.LPC0.EC0.FAN0         | 8-bit  | 0     | 4500  | 100  |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 83AK        | ThinkBook G7      | 0x06    | _SB.PCI0.LPC0.EC0.FAN0         | 8-bit  | 0     | 5400  | 100  |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 81X1        | Flex 5            | 0x06    | _SB.PCI0.LPC0.EC0.FAN0         | 8-bit  | 0     | 4500  | 100  |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| Legion 9    | Legion 9i / Extr  | 0xFE/FF | _SB.PCI0.LPC0.EC0.FANS (Fan1)  | 16-bit | 0     | 8000  | 1    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| Legion 9    | Legion 9i / Extr  | 0xFE/FF | _SB.PCI0.LPC0.EC0.FA2S (Fan2)  | 16-bit | 0     | 8000  | 1    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| Legion 9    | Legion 9i / Extr  | 0xFE/FF | _SB.PCI0.LPC0.EC0.FA3S (Fan3)  | 16-bit | 0     | 8000  | 1    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 82WQ        | Legion 7i (Int)   | 0xFE/FF | _SB.PCI0.LPC0.EC0.FANS (Fan1)  | 16-bit | 0     | 8000  | 1    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 82WQ        | Legion 7i (Int)   | 0xFE/FF | _SB.PCI0.LPC0.EC0.FA2S (Fan2)  | 16-bit | 0     | 8000  | 1    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 82JW / 82JU | Legion 5 (AMD)    | 0xFE/FF | _SB.PCI0.LPC0.EC0.FANS (Fan1)  | 16-bit | 0     | 6500  | 1    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 82JW / 82JU | Legion 5 (AMD)    | 0xFE/FF | _SB.PCI0.LPC0.EC0.FA2S (Fan2)  | 16-bit | 0     | 6500  | 1    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| GeekPro     | GeekPro G5000/6k  | 0xFE/FF | _SB.PCI0.LPC0.EC0.FANS (Fan1)  | 16-bit | 0     | 6500  | 1    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 82XV / 83DV | LOQ 15/16         | 0xFE/FF | _SB.PCI0.LPC0.EC0.FANS (Fan1)  | 16-bit | 0     | 6500  | 1    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 82XV / 83DV | LOQ 15/16         | 0xFE/FF | _SB.PCI0.LPC0.EC0.FA2S (Fan2)  | 16-bit | 0     | 6500  | 1    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 80V2 / 81C3 | Yoga 710/720      | 0x06    | _SB.PCI0.LPC0.EC0.FAN0         | 8-bit  | 59    | 4500  | 0    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 80S7        | Yoga 510          | 0x06    | _SB.PCI0.LPC0.EC0.FAN0         | 8-bit  | 41    | 4500  | 0    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 80JH        | Yoga 3 14         | 0x06    | _SB.PCI0.LPC0.EC0.FAN0/.FANS   | 8-bit  | 80    | 5000  | 0    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 20344       | Yoga 2 13         | 0xAB    | _SB.PCI0.LPC0.EC0.FANS         | 8-bit  | 8     | 4200  | 0    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 2191 / 20191| Yoga 13           | 0xF2/F3 | _SB.PCI0.LPC0.EC0.FAN1/2       | 8-bit  | 0     | 5000  | 100  |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| Legacy      | Yoga 11s          | 0x56    | _SB.PCI0.LPC0.EC0.FAN0/.FANS   | 8-bit  | 80    | 4500  | 0    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 20GJ / 20GK | ThinkPad 13       | 0x85    | _SB.PCI0.LPC0.EC0.FAN0         | 8-bit  | 7     | 5500  | 0    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 1143        | ThinkPad E520     | 0x95    | _SB.PCI0.LPC0.EC0.FAN0         | 8-bit  | 0     | 4200  | 100  |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 3698        | ThinkPad Helix    | 0x2F    | _SB.PCI0.LPC0.EC0.FANS         | 8-bit  | 7     | 4500  | 0    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 20M7 / 20M8 | ThinkPad L380     | 0x95    | _SB.PCI0.LPC0.EC0.FAN1         | 8-bit  | 0     | 4600  | 100  |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 20NR / 20NS | ThinkPad L390     | 0x95    | _SB.PCI0.LPC0.EC0.FAN0         | 8-bit  | 0     | 5500  | 100  |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 2464 / 2468 | ThinkPad L530     | 0x95    | _SB.PCI0.LPC0.EC0.FAN0         | 8-bit  | 0     | 4400  | 100  |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 2356        | ThinkPad T430s    | 0x2F    | _SB.PCI0.LPC0.EC0.FANS         | 8-bit  | 7     | 5000  | 0    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 20AQ / 20AR | ThinkPad T440s    | 0x4E    | _SB.PCI0.LPC0.EC0.FANS         | 8-bit  | 7     | 5200  | 0    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 20BE / 20BF | ThinkPad T540p    | 0x2F    | _SB.PCI0.LPC0.EC0.FANS         | 8-bit  | 7     | 5500  | 0    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 3051        | ThinkPad x121e    | 0x2F    | _SB.PCI0.LPC0.EC0.FANS         | 8-bit  | 7     | 4500  | 0    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 4290        | ThinkPad x220i    | 0x2F    | _SB.PCI0.LPC0.EC0.FANS         | 8-bit  | 7     | 5000  | 0    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 2324 / 2325 | ThinkPad x230     | 0x2F    | _SB.PCI0.LPC0.EC0.FANS         | 8-bit  | 7     | 5000  | 0    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 81AX        | V330-15IKB        | 0x95    | _SB.PCI0.LPC0.EC0.FAN0         | 8-bit  | 0     | 5100  | 100  |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| Legacy      | IdeaPad Y580      | 0x06    | _SB.PCI0.LPC0.EC0.FAN0         | 8-bit  | 35    | 4800  | 0    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| Legacy      | IdeaPad V580      | 0x95    | _SB.PCI0.LPC0.EC0.FAN0         | 8-bit  | 0     | 5000  | 100  |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 80SR / 80SX | IdeaPad 500S-13   | 0x06    | _SB.PCI0.LPC0.EC0.FAN0         | 8-bit  | 44    | 5500  | 0    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 80S1        | IdeaPad 500S-14   | 0x95    | _SB.PCI0.LPC0.EC0.FAN0         | 8-bit  | 116   | 5000  | 0    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 80TK        | IdeaPad 510S      | 0x06    | _SB.PCI0.LPC0.EC0.FAN0         | 8-bit  | 41    | 5100  | 0    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 80S9        | IdeaPad 710S      | 0x95/98 | _SB.PCI0.LPC0.EC0.FAN1/2       | 8-bit  | 0     | 5200  | 100  |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 80KU        | U31-70            | 0x06    | _SB.PCI0.LPC0.EC0.FAN0         | 8-bit  | 44    | 5500  | 0    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| 80S1        | U41-70            | 0x95    | _SB.PCI0.LPC0.EC0.FAN0         | 8-bit  | 116   | 5000  | 0    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| Legacy      | U160              | 0x95    | _SB.PCI0.LPC0.EC0.FAN0         | 8-bit  | 64    | 4500  | 0    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
| Legacy      | U330p/U430p       | 0x92    | _SB.PCI0.LPC0.EC0.FAN0         | 16-bit | 768   | 5000  | 0    |
+-------------+-------------------+---------+--------------------------------+--------+-------+-------+------+
Note: raw_RPM is computed in one of two ways:
* Discrete Level Estimation
**If Nmax > 0, then raw_RPM = (Rmax * IN) / Nmax**
* Continuous Unit Mapping
**If Nmax = 0, then raw_RPM = IN * Multiplier**
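As a rough sketch of the two cases, assuming the per-model values are
kept in a small struct (struct fan_profile and raw_rpm() are
hypothetical names for illustration, not the driver's actual code):

```c
#include <stdint.h>

/* Hypothetical per-model parameters, mirroring the NMAX/RMAX/MULT columns. */
struct fan_profile {
	uint32_t nmax;	/* highest discrete level, 0 for continuous readings */
	uint32_t rmax;	/* RPM reported at the highest level */
	uint32_t mult;	/* RPM per raw unit, used only when nmax == 0 */
};

/* Convert the raw EC reading "in" to RPM using the two documented cases. */
static uint32_t raw_rpm(const struct fan_profile *p, uint32_t in)
{
	/* Discrete Level Estimation: scale the level linearly up to rmax. */
	if (p->nmax > 0)
		return (uint32_t)(((uint64_t)p->rmax * in) / p->nmax);

	/* Continuous Unit Mapping: the reading is already in RPM units. */
	return in * p->mult;
}
```

For example, a Yoga 710/720 entry (Nmax = 59, Rmax = 4500) reading
IN = 59 gives 4500 RPM, and a Yoga 14cACN entry (Nmax = 0,
Multiplier = 100) reading IN = 55 gives 5500 RPM.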
^ permalink raw reply
* Re: (sashiko review) [PATCH v6 1/1] mm/damon: add node_eligible_mem_bp and node_ineligible_mem_bp goal metrics
From: SeongJae Park @ 2026-04-07 16:05 UTC (permalink / raw)
To: SeongJae Park
Cc: Ravi Jonnalagadda, damon, linux-mm, linux-kernel, linux-doc, akpm,
corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun
In-Reply-To: <20260407001310.78557-1-sj@kernel.org>
Adding another thought at the end of the mail without cutting the previous
unrelated questions, so that Ravi can answer all my questions at once.
On Mon, 6 Apr 2026 17:13:08 -0700 SeongJae Park <sj@kernel.org> wrote:
> On Mon, 6 Apr 2026 12:47:56 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:
>
> > On Sun, Apr 5, 2026 at 3:45 PM SeongJae Park <sj@kernel.org> wrote:
> > >
> > >
> > > Ravi, thank you for reposting this patch after the rebase. This time sashiko
> > > was able to review this, and found good points including things that deserve
> > > another revision of this patch.
> > >
> > > Forwarding full sashiko review in a reply format with my inline comments below,
> > > for sharing details of my view and doing followup discussions via mails. Ravi,
> > > could you please reply?
> > >
> >
> > Thanks SJ, providing your comments on top of sashiko's review is very helpful.
>
> I'm glad to hear that it is working for you :)
>
> [...]
> > > > +static unsigned long damos_calc_eligible_bytes(struct damon_ctx *c,
> > > > > + struct damos *s, int nid, unsigned long *total)
> > > > > +{
> [...]
> > > > > + struct folio *folio;
> > > > > + unsigned long folio_sz, counted;
> > > > > +
> > > > > + folio = damon_get_folio(PHYS_PFN(addr));
> > > >
> > > > What happens if this metric is assigned to a DAMON context configured for
> > > > virtual address space monitoring? If the context uses DAMON_OPS_VADDR,
> > > > passing a user-space virtual address to PHYS_PFN() might cause invalid
> > > > memory accesses or out-of-bounds page struct reads. Should this code
> > > > explicitly verify the operations type first?
> > >
> > > Good finding. We intend to support only paddr ops. But there is no guard for
> > > using this on vaddr ops configuration. Ravi, could we add underlying ops
> > > check? I think damon_commit_ctx() is a good place to add that. The check
> > > could be something like below?
> > >
> >
> > I plan to add the ops type check directly in the metric functions
> > (damos_get_node_eligible_mem_bp and its counterpart) rather than in
> > damon_commit_ctx(). The functions will return 0 early
> > if c->ops.id != DAMON_OPS_PADDR.
> >
> > That said, if you prefer the damon_commit_ctx() validation approach to
> > reject the configuration outright, I can implement it that way instead.
> > Please let me know your preference.
>
> I'd prefer damon_commit_ctx() validation approach since it would give users
> more clear message of the failure.
>
> >
> > > '''
> > > --- a/mm/damon/core.c
> > > +++ b/mm/damon/core.c
> > > @@ -1515,10 +1515,23 @@ static int damon_commit_sample_control(
> > > int damon_commit_ctx(struct damon_ctx *dst, struct damon_ctx *src)
> > > {
> > > int err;
> > > + struct damos *scheme;
> > > + struct damos_quota_goal *goal;
> > >
> > > dst->maybe_corrupted = true;
> > > if (!is_power_of_2(src->min_region_sz))
> > > return -EINVAL;
> > > + if (src->ops.id != DAMON_OPS_PADDR) {
> > > + damon_for_each_scheme(scheme, src) {
> > > + damos_for_each_quota_goal(goal, &scheme->quota) {
> > > + switch (goal->metric) {
> > > + case DAMOS_QUOTA_NODE_ELIGIBLE_MEM_BP:
> > > + case DAMOS_QUOTA_NODE_INELIGIBLE_MEM_BP:
> > > + return -EINVAL;
> > > + }
> > > + }
> > > + }
> > > + }
> > >
> > > err = damon_commit_schemes(dst, src);
> > > if (err)
> > > '''
> [...]
> > > > > + /* Compute ineligible ratio directly: 10000 - eligible_bp */
> > > > > + return 10000 - mult_frac(node_eligible, 10000, total_eligible);
> > > > > +}
> > > >
> > > > Does this return value match the documented metric? The formula computes the
> > > > percentage of the system's eligible memory located on other NUMA nodes,
> > > > rather than the amount of actual ineligible (filtered out) memory residing
> > > > on the target node. Could this semantic mismatch cause confusion when
> > > > configuring quota policies?
> > >
> > > Nice catch. The name and the documentation are confusing. We actually
> > > confused a few times in previous revisions, and I'm again confused now. IIUC,
> > > the current implementation is the intended and right one for the given use
> > > case, though. If my understanding is correct, how about renaming
> > > DAMOS_QUOTA_NODE_INELIGIBLE_MEM_BP to
> > > DAMOS_QUOTA_NODE_ELIGIBLE_MEM_BP_COMPLEMENT, and updating the documentation
> > > together? Ravi, what do you think?
> > >
> >
> > Agreed, the current name is confusing. How about
> > DAMOS_QUOTA_NODE_ELIGIBLE_MEM_BP_OFFNODE?
> >
> > The rationale is that this metric measures "eligible memory that is off
> > this node" (i.e., on other nodes).
> >
> > I think "offnode" conveys the physical meaning more directly than "complement".
> > That said, I'm happy to go with "complement" if you prefer.
> > both are clearer than "ineligible".
>
> Thank you for the nice suggestion. I like the "offnode" term. But I think having
> "node" twice in the name is not really efficient for people who print code on
> paper. What about DAMOS_QUOTA_OFFNODE_ELIGIBLE_MEM_BP?
>
> But... Maybe more importantly... Now I realize this means that
> offnode_eligible_mem_bp with target nid 0 is just the same as node_eligible_mem_bp
> with target nid 1, on your test setup. Maybe we don't really need
> offnode_eligible_mem_bp? That is, your test setup could be like below.
>
> '''
> For maintaining hot memory on DRAM (node 0) and CXL (node 1) in a 7:3
> ratio:
>
> PUSH scheme: migrate_hot from node 0 -> node 1
> goal: node_eligible_mem_bp, nid=1, target=3000
> "Move hot pages from DRAM to CXL if less than 30% of hot data is
> in CXL"
>
> PULL scheme: migrate_hot from node 1 -> node 0
> goal: node_eligible_mem_bp, nid=0, target=7000
> "Move hot pages from CXL to DRAM if less than 70% of hot data is
> in DRAM"
> '''
>
> And the schemes are easier for me to read and understand. This also seems
> straightforward to scale to >2 nodes. For example, if we want hot memory
> distribution of 5:3:2 to nodes 0:1:2,
>
> Two schemes for migrating hot pages out of node 0
> - migrate_hot from node 0 -> node 1
> - goal: node_eligible_mem_bp, nid=1, target=3000
> - migrate_hot from node 0 -> node 2
> - goal: node_eligible_mem_bp, nid=2, target=2000
>
> Two schemes for migrating hot pages out of node 1
> - migrate_hot from node 1 -> node 0
> - goal: node_eligible_mem_bp, nid=0, target=5000
> - migrate_hot from node 1 -> node 2
> - goal: node_eligible_mem_bp, nid=2, target=2000
>
> Two schemes for migrating hot pages out of node 2
> - migrate_hot from node 2 -> node 0
> - goal: node_eligible_mem_bp, nid=0, target=5000
> - migrate_hot from node 2 -> node 1
> - goal: node_eligible_mem_bp, nid=1, target=3000
>
> Do you think this makes sense? If it makes sense and works for your use case,
> what about dropping the offnode goal type?
Now I recall that I suggested the offnode metric because I had suggested running
a kdamond per node. That is, having one kdamond that monitors only node 0 and
migrates hot memory to node 1, and another kdamond that monitors only node 1 and
migrates hot memory to node 0. I suggested doing so because I knew it was
suboptimal to run DAMOS schemes with a node filter.
We made a change [1] to make that more efficient, though. The change is now
in mm-stable, so hopefully it will be available from 7.1-rc1. So I believe the
single quota goal metric should work now. Ravi, could you share what you
think?
[1] commit e1ace69c33ec ("mm/damon/core: set quota-score histogram with core filters")
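To make the equivalence discussed above concrete, here is a minimal C sketch of the arithmetic only. The helper names are hypothetical and not part of DAMON; the plain multiply-and-divide stands in for the kernel's mult_frac() and assumes inputs small enough not to overflow.

```c
#include <stdint.h>

/*
 * Hypothetical helper: share of all eligible bytes that sit on one node,
 * in basis points (1/10000). Not a DAMON function.
 */
static uint64_t eligible_bp(uint64_t node_eligible, uint64_t total_eligible)
{
	if (total_eligible == 0)
		return 0;
	/* Assumes node_eligible * 10000 does not overflow 64 bits. */
	return node_eligible * 10000 / total_eligible;
}

/*
 * Off-node share, following the formula in the quoted patch:
 * 10000 - eligible_bp.
 */
static uint64_t offnode_eligible_bp(uint64_t node_eligible,
				    uint64_t total_eligible)
{
	return 10000 - eligible_bp(node_eligible, total_eligible);
}
```

With two nodes holding 700 and 300 eligible units, offnode_eligible_bp(700, 1000) and eligible_bp(300, 1000) both give 3000. When the division is not exact, the two can differ by one basis point because of integer rounding.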
Thanks,
SJ
[...]
* Re: [PATCH v7 8/9] KVM: x86: nSVM: Save/restore gPAT with KVM_{GET,SET}_NESTED_STATE
From: Jim Mattson @ 2026-04-07 15:47 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
kvm, linux-doc, linux-kernel, linux-kselftest, Yosry Ahmed
In-Reply-To: <adURPZJEDs50NPkB@google.com>
On Tue, Apr 7, 2026 at 7:14 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Apr 06, 2026, Jim Mattson wrote:
> > On Mon, Apr 6, 2026 at 4:47 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Fri, Mar 27, 2026, Jim Mattson wrote:
> > > > @@ -1918,6 +1921,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
> > > > struct vmcb_save_area_cached save_cached;
> > > > struct vmcb_ctrl_area_cached ctl_cached;
> > > > unsigned long cr0;
> > > > + bool use_separate_l2_pat;
> > >
> > > Land this above "cr0" to preserve the inverted fir tree.
> > >
> > > > int ret;
> > > >
> > > > BUILD_BUG_ON(sizeof(struct vmcb_control_area) + sizeof(struct vmcb_save_area) >
> > > > @@ -1993,6 +1997,18 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
> > > > !nested_vmcb_check_save(vcpu, &save_cached, false))
> > > > goto out_free;
> > > >
> > > > + /*
> > > > + * Validate gPAT when the shared PAT quirk is disabled (i.e. L2
> > > > + * has its own gPAT). This is done separately from the
> > > > + * vmcb_save_area_cached validation above, because gPAT is L2
> > > > + * state, but the vmcb_save_area_cached is populated with L1 state.
> > > > + */
> > > > + use_separate_l2_pat =
> > > > + (ctl_cached.misc_ctl & SVM_MISC_ENABLE_NP) &&
> > > > + !kvm_check_has_quirk(vcpu->kvm,
> > > > + KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT);
> > >
> > > I vote for either:
> > >
> > > use_separate_l2_pat = (ctl_cached.misc_ctl & SVM_MISC_ENABLE_NP) &&
> > > !kvm_check_has_quirk(vcpu->kvm,
> > > KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT);
> > >
> > LOL! Aren't you the one who keeps complaining that my indentation
> > doesn't line up? Are you schizophrenic?
>
> Huh? That is aligned. Perhaps it's whitespace damaged by your MUA?
Indeed. It was.
> > > or
> > >
> > > use_separate_l2_pat = (ctl_cached.misc_ctl & SVM_MISC_ENABLE_NP);
> > > if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT))
> > > use_separate_l2_pat = false;
> >
> > Wow. I really have no idea how to predict what you're going to want
> > the code to look like. How is this better than the original?!?
>
> It doesn't immediately wrap after the "=". Similar to my view on wrapping before
> function names[*], I find wrapping immediately after an assignment operator to be
> unnecessarily difficult to read as it doesn't provide any context for single-line
> searches.
That's actually a good argument to *never* wrap a line. If a line is
broken at all, the interesting context might follow the line break.
> I'm pretty darn consistent in my dislike for that style: I count 26 instances in
> arch/x86/kvm that match "\s=\n", and only two of those carry my SoB or R-b. I
> simply missed the wrap in kvm_vcpu_apicv_activated() that was added by commit
> 896046474f8d ("KVM: x86: Introduce kvm_x86_call() to simplify static calls of
> kvm_x86_ops"), and I'll give myself a pass for commit 8764ed55c970 ("KVM: x86:
> Whitelist port 0x7e for pre-incrementing %rip") as that predates treating
> checkpatch's 80 char limit as a soft limit.
Might I suggest that you should provide a tool—something like
checkpatch.pl—that flags style violations?
* Re: [PATCH] docs: proc: document ProtectionKey in smaps
From: Kevin Brodsky @ 2026-04-07 15:12 UTC (permalink / raw)
To: Dave Hansen, linux-doc
Cc: linux-kernel, Yury Khrustalev, Jonathan Corbet, Shuah Khan,
Dave Hansen, Andrew Morton, Lorenzo Stoakes, Vlastimil Babka,
David Hildenbrand, Mark Rutland, linux-fsdevel, linux-mm
In-Reply-To: <98880cc2-09be-4bd8-b8f4-f0f0845f939e@intel.com>
On 07/04/2026 16:42, Dave Hansen wrote:
> On 4/7/26 05:51, Kevin Brodsky wrote:
>> +If both the kernel and the system support protection keys (pkeys),
>> +"ProtectionKey" indicates the memory protection key associated with the
>> +virtual memory area.
> I think you're trying to get across the point here that the kernel needs
> to know about protection keys, have it enabled, and be running on a CPU
> with pkey support.
Indeed.
> To me "system" is a bit ambiguous here but _can_ refer to the whole
> hardware/software system as a whole. To avoid redundancy, I'd say either:
>
> If both the kernel and the processor support protection keys...
>
> or
>
> If the system supports protection keys...
I see your point. By "system" I essentially mean the hardware (the SoC).
In general I would tend to avoid "processor" because not all CPUs in a
system necessarily have the same features, and some features require
hardware support beyond the CPU itself. Terminology is hard...
Happy to replace "system" with "hardware" if that's clearer :)
> But I'm ok with what you have in any case. Folks will understand what
> you're saying:
Hopefully!
- Kevin
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
* Re: [PATCH v5 0/3] PCI Controller event and LTSSM tracepoint support
From: Manivannan Sadhasivam @ 2026-04-07 14:43 UTC (permalink / raw)
To: Bjorn Helgaas, Shawn Lin
Cc: linux-rockchip, linux-pci, linux-trace-kernel, linux-doc,
Steven Rostedt
In-Reply-To: <1774403912-210670-1-git-send-email-shawn.lin@rock-chips.com>
On Wed, 25 Mar 2026 09:58:29 +0800, Shawn Lin wrote:
> This patch-set adds new pci controller event and LTSSM tracepoint used by host drivers
> which provide LTSSM trace functionality. The first user is pcie-dw-rockchip with a 256
> Bytes FIFO for recording LTSSM transition.
>
> Testing
> =========
>
> [...]
Applied, thanks!
[1/3] PCI: trace: Add PCI controller LTSSM transition tracepoint
commit: d1b7add89c004295cd48d7cd49946ed5cb5cbb55
[2/3] Documentation: tracing: Add PCI controller event documentation
commit: a3966a6f915ea7d1af0941ea26848d921e574c45
[3/3] PCI: dw-rockchip: Add pcie_ltssm_state_transition trace support
commit: a276c0d802d8d2a22088b7919d9e82e936995cf4
Best regards,
--
Manivannan Sadhasivam <mani@kernel.org>
* Re: [PATCH] docs: proc: document ProtectionKey in smaps
From: Dave Hansen @ 2026-04-07 14:42 UTC (permalink / raw)
To: Kevin Brodsky, linux-doc
Cc: linux-kernel, Yury Khrustalev, Jonathan Corbet, Shuah Khan,
Dave Hansen, Andrew Morton, Lorenzo Stoakes, Vlastimil Babka,
David Hildenbrand, Mark Rutland, linux-fsdevel, linux-mm
In-Reply-To: <20260407125133.564182-1-kevin.brodsky@arm.com>
On 4/7/26 05:51, Kevin Brodsky wrote:
> +If both the kernel and the system support protection keys (pkeys),
> +"ProtectionKey" indicates the memory protection key associated with the
> +virtual memory area.
I think you're trying to get across the point here that the kernel needs
to know about protection keys, have it enabled, and be running on a CPU
with pkey support.
To me "system" is a bit ambiguous here but _can_ refer to the whole
hardware/software system as a whole. To avoid redundancy, I'd say either:
If both the kernel and the processor support protection keys...
or
If the system supports protection keys...
But I'm ok with what you have in any case. Folks will understand what
you're saying:
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
* Re: [PATCH] docs: proc: document ProtectionKey in smaps
From: Lorenzo Stoakes @ 2026-04-07 14:33 UTC (permalink / raw)
To: Kevin Brodsky
Cc: linux-doc, linux-kernel, Yury Khrustalev, Jonathan Corbet,
Shuah Khan, Dave Hansen, Andrew Morton, Vlastimil Babka,
David Hildenbrand, Mark Rutland, linux-fsdevel, linux-mm
In-Reply-To: <20260407125133.564182-1-kevin.brodsky@arm.com>
On Tue, Apr 07, 2026 at 01:51:33PM +0100, Kevin Brodsky wrote:
> The ProtectionKey entry was added in v4.9; back then it was
> x86-specific, but it now lives in generic code and applies to all
> architectures supporting pkeys (currently x86, power, arm64).
>
> Time to document it: add a paragraph to proc.rst about the
> ProtectionKey entry.
>
> Reported-by: Yury Khrustalev <yury.khrustalev@arm.com>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
LGTM, So:
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
> ---
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Shuah Khan <skhan@linuxfoundation.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Lorenzo Stoakes <ljs@kernel.org>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
> Documentation/filesystems/proc.rst | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index b0c0d1b45b99..d673cad7dbe4 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -549,6 +549,10 @@ does not take into account swapped out page of underlying shmem objects.
> naturally aligned THP pages of any currently enabled size. 1 if true, 0
> otherwise.
>
> +If both the kernel and the system support protection keys (pkeys),
> +"ProtectionKey" indicates the memory protection key associated with the
> +virtual memory area.
> +
> "VmFlags" field deserves a separate description. This member represents the
> kernel flags associated with the particular virtual memory area in two letter
> encoded manner. The codes are the following:
> --
> 2.51.2
>
* [PATCH v4 6/6] arm64: hw_breakpoint: Enable FEAT_Debugv8p9
From: Rob Herring (Arm) @ 2026-04-07 14:29 UTC (permalink / raw)
To: Will Deacon, Mark Rutland, Catalin Marinas, Jonathan Corbet,
Shuah Khan
Cc: Anshuman Khandual, linux-arm-kernel, linux-perf-users,
linux-kernel, linux-doc
In-Reply-To: <20260407-arm-debug-8-9-v4-0-a4864e69b0ea@kernel.org>
From: Anshuman Khandual <anshuman.khandual@arm.com>
Currently, there can be a maximum of 16 breakpoints and 16 watchpoints available
on a given platform - as detected from ID_AA64DFR0_EL1.[BRPs|WRPs] register
fields. These breakpoints and watchpoints can be extended further up to
64 via a new arch feature FEAT_Debugv8p9.
Checking for FEAT_Debugv8p9 alone is not enough to enable the support.
It is also necessary to determine if there are more than 16 breakpoints
or watchpoints. The behavior with FEAT_Debugv8p9 and <=16 breakpoints
and watchpoints is IMPDEF.
The addition of the MDSELR_EL1 to set the bank index makes the register
accesses non-atomic. However, the combination of all the breakpoint code
being in the kprobe blacklist and breakpoint install/uninstall being
protected by perf locking (IRQs disabled and context lock) will prevent
debug exceptions during accesses and serialize the accesses.
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
---
v4:
- Update commit message.
- Configure MDSCR_EL1_EMBWE on CPU reset/hotplug instead of every time
breakpoints are enabled/disabled.
- Drop unnecessary IRQ save and restore on register accesses.
- Stash checking whether FEAT_Debugv8p9 is used rather than reading
feature register on every register access.
- Check that we're greater than or equal to Debug_v8p9 not just equal
to.
- Use is_debug_v8p9_enabled() in get_num_brps/get_num_wrps(). Handle
the case when FEAT_Debugv8p9 is present, but the number of BP/WP
are <16. It is IMPDEF if ID_AA64DFR1_EL1 is used in this case. It is
also IMPDEF if MDSELR_EL1 is accessible. TF-A doesn't enable access
to MDSELR_EL1 in this case.
- Mark register access functions nokprobe.
---
arch/arm64/include/asm/hw_breakpoint.h | 47 ++++++++++++++++++++++++++--------
arch/arm64/kernel/debug-monitors.c | 16 ++++++++----
arch/arm64/kernel/hw_breakpoint.c | 41 +++++++++++++++++++++++++++--
3 files changed, 87 insertions(+), 17 deletions(-)
diff --git a/arch/arm64/include/asm/hw_breakpoint.h b/arch/arm64/include/asm/hw_breakpoint.h
index bd81cf17744a..c5624a906f3c 100644
--- a/arch/arm64/include/asm/hw_breakpoint.h
+++ b/arch/arm64/include/asm/hw_breakpoint.h
@@ -79,8 +79,9 @@ static inline void decode_ctrl_reg(u32 reg,
* Limits.
* Changing these will require modifications to the register accessors.
*/
-#define ARM_MAX_BRP 16
-#define ARM_MAX_WRP 16
+#define ARM_MAX_BRP 64
+#define ARM_MAX_WRP 64
+#define MAX_PER_BANK 16
/* Virtual debug register bases. */
#define AARCH64_DBG_REG_BVR 0
@@ -94,6 +95,14 @@ static inline void decode_ctrl_reg(u32 reg,
#define AARCH64_DBG_REG_NAME_WVR wvr
#define AARCH64_DBG_REG_NAME_WCR wcr
+static inline bool is_debug_v8p9_enabled(void)
+{
+ u64 dfr0 = read_sanitised_ftr_reg(SYS_ID_AA64DFR0_EL1);
+ int dver = cpuid_feature_extract_unsigned_field(dfr0, ID_AA64DFR0_EL1_DebugVer_SHIFT);
+
+ return dver >= ID_AA64DFR0_EL1_DebugVer_V8P9;
+}
+
/* Accessor macros for the debug registers. */
#define AARCH64_DBG_READ(N, REG, VAL) do {\
VAL = read_sysreg(dbg##REG##N##_el1);\
@@ -138,19 +147,37 @@ static inline void ptrace_hw_copy_thread(struct task_struct *task)
/* Determine number of BRP registers available. */
static inline int get_num_brps(void)
{
- u64 dfr0 = read_sanitised_ftr_reg(SYS_ID_AA64DFR0_EL1);
- return 1 +
- cpuid_feature_extract_unsigned_field(dfr0,
- ID_AA64DFR0_EL1_BRPs_SHIFT);
+ u64 dfr0, dfr1;
+ int brps;
+
+ dfr0 = read_sanitised_ftr_reg(SYS_ID_AA64DFR0_EL1);
+ brps = cpuid_feature_extract_unsigned_field(dfr0, ID_AA64DFR0_EL1_BRPs_SHIFT);
+ if (is_debug_v8p9_enabled() && brps == 15) {
+ dfr1 = read_sanitised_ftr_reg(SYS_ID_AA64DFR1_EL1);
+ brps = cpuid_feature_extract_unsigned_field_width(dfr1,
+ ID_AA64DFR1_EL1_BRPs_SHIFT, 8);
+ if (!brps)
+ return 16;
+ }
+ return 1 + brps;
}
/* Determine number of WRP registers available. */
static inline int get_num_wrps(void)
{
- u64 dfr0 = read_sanitised_ftr_reg(SYS_ID_AA64DFR0_EL1);
- return 1 +
- cpuid_feature_extract_unsigned_field(dfr0,
- ID_AA64DFR0_EL1_WRPs_SHIFT);
+ u64 dfr0, dfr1;
+ int wrps;
+
+ dfr0 = read_sanitised_ftr_reg(SYS_ID_AA64DFR0_EL1);
+ wrps = cpuid_feature_extract_unsigned_field(dfr0, ID_AA64DFR0_EL1_WRPs_SHIFT);
+ if (is_debug_v8p9_enabled() && wrps == 15) {
+ dfr1 = read_sanitised_ftr_reg(SYS_ID_AA64DFR1_EL1);
+ wrps = cpuid_feature_extract_unsigned_field_width(dfr1,
+ ID_AA64DFR1_EL1_WRPs_SHIFT, 8);
+ if (!wrps)
+ return 16;
+ }
+ return 1 + wrps;
}
#ifdef CONFIG_CPU_PM
diff --git a/arch/arm64/kernel/debug-monitors.c b/arch/arm64/kernel/debug-monitors.c
index 29307642f4c9..8ff74432d0c3 100644
--- a/arch/arm64/kernel/debug-monitors.c
+++ b/arch/arm64/kernel/debug-monitors.c
@@ -22,6 +22,7 @@
#include <asm/daifflags.h>
#include <asm/debug-monitors.h>
#include <asm/exception.h>
+#include <asm/hw_breakpoint.h>
#include <asm/kgdb.h>
#include <asm/kprobes.h>
#include <asm/system_misc.h>
@@ -123,11 +124,16 @@ void disable_debug_monitors(enum dbg_active_el el)
}
NOKPROBE_SYMBOL(disable_debug_monitors);
-/*
- * OS lock clearing.
- */
-static int clear_os_lock(unsigned int cpu)
+static int debug_monitors_reset(unsigned int cpu)
{
+ if (is_debug_v8p9_enabled()) {
+ u64 mdscr = mdscr_read();
+
+ mdscr |= MDSCR_EL1_EMBWE;
+ mdscr_write(mdscr);
+ }
+
+ /* Clear OS lock */
write_sysreg(0, osdlr_el1);
write_sysreg(0, oslar_el1);
isb();
@@ -138,7 +144,7 @@ static int __init debug_monitors_init(void)
{
return cpuhp_setup_state(CPUHP_AP_ARM64_DEBUG_MONITORS_STARTING,
"arm64/debug_monitors:starting",
- clear_os_lock, NULL);
+ debug_monitors_reset, NULL);
}
postcore_initcall(debug_monitors_init);
diff --git a/arch/arm64/kernel/hw_breakpoint.c b/arch/arm64/kernel/hw_breakpoint.c
index a9266dc710b4..ea48c1562bee 100644
--- a/arch/arm64/kernel/hw_breakpoint.c
+++ b/arch/arm64/kernel/hw_breakpoint.c
@@ -40,6 +40,7 @@ static DEFINE_PER_CPU(int, stepping_kernel_bp);
/* Number of BRP/WRP registers on this CPU. */
static int core_num_brps;
static int core_num_wrps;
+static bool has_debug_v8p9;
int hw_breakpoint_slots(int type)
{
@@ -104,7 +105,7 @@ int hw_breakpoint_slots(int type)
WRITE_WB_REG_CASE(OFF, 14, REG, VAL); \
WRITE_WB_REG_CASE(OFF, 15, REG, VAL)
-static u64 read_wb_reg(int reg, int n)
+static nokprobe_inline u64 __read_wb_reg(int reg, int n)
{
u64 val = 0;
@@ -119,9 +120,27 @@ static u64 read_wb_reg(int reg, int n)
return val;
}
+
+static u64 read_wb_reg(int reg, int n)
+{
+ u64 val;
+
+ /*
+ * Bank selection in MDSELR_EL1, followed by an indexed read from
+ * breakpoint (or watchpoint) registers cannot be interrupted, as
+ * that might cause misread from the wrong targets instead. Hence
+ * this requires mutual exclusion.
+ */
+ if (has_debug_v8p9) {
+ write_sysreg_s(SYS_FIELD_PREP(MDSELR_EL1, BANK, n / MAX_PER_BANK), SYS_MDSELR_EL1);
+ isb();
+ }
+ val = __read_wb_reg(reg, n % MAX_PER_BANK);
+ return val;
+}
NOKPROBE_SYMBOL(read_wb_reg);
-static void write_wb_reg(int reg, int n, u64 val)
+static nokprobe_inline void __write_wb_reg(int reg, int n, u64 val)
{
switch (reg + n) {
GEN_WRITE_WB_REG_CASES(AARCH64_DBG_REG_BVR, AARCH64_DBG_REG_NAME_BVR, val);
@@ -133,6 +152,21 @@ static void write_wb_reg(int reg, int n, u64 val)
}
isb();
}
+
+static void write_wb_reg(int reg, int n, u64 val)
+{
+ /*
+ * Bank selection in MDSELR_EL1, followed by an indexed write to
+ * breakpoint (or watchpoint) registers cannot be interrupted, as
+ * that might cause a write to the wrong targets instead. Hence
+ * this requires mutual exclusion.
+ */
+ if (has_debug_v8p9) {
+ write_sysreg_s(SYS_FIELD_PREP(MDSELR_EL1, BANK, n / MAX_PER_BANK), SYS_MDSELR_EL1);
+ isb();
+ }
+ __write_wb_reg(reg, n % MAX_PER_BANK, val);
+}
NOKPROBE_SYMBOL(write_wb_reg);
/*
@@ -990,6 +1024,7 @@ static int __init arch_hw_breakpoint_init(void)
core_num_brps = get_num_brps();
core_num_wrps = get_num_wrps();
+ has_debug_v8p9 = (core_num_brps > 16) || (core_num_wrps > 16);
pr_info("found %d breakpoint and %d watchpoint registers.\n",
core_num_brps, core_num_wrps);
@@ -1006,6 +1041,8 @@ static int __init arch_hw_breakpoint_init(void)
/* Register cpu_suspend hw breakpoint restore hook */
cpu_suspend_set_dbg_restorer(hw_breakpoint_reset);
+ BUILD_BUG_ON((ARM_MAX_BRP % MAX_PER_BANK) != 0);
+ BUILD_BUG_ON((ARM_MAX_WRP % MAX_PER_BANK) != 0);
return ret;
}
--
2.53.0
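The slot counting and bank arithmetic in the patch above can be sketched in plain C as follows. This is an illustration only: the ID register fields are passed in as already-extracted integers, num_slots(), bank_of() and index_in_bank() are hypothetical names, and no system registers are touched.

```c
#include <stdbool.h>

#define MAX_PER_BANK 16	/* same value as in the patch */

/* Bank number that would be programmed into MDSELR_EL1.BANK for slot n. */
static int bank_of(int n)
{
	return n / MAX_PER_BANK;
}

/* Register index within the selected bank for slot n. */
static int index_in_bank(int n)
{
	return n % MAX_PER_BANK;
}

/*
 * Mirrors the decision in get_num_brps()/get_num_wrps(): the wide
 * ID_AA64DFR1_EL1 field is consulted only when Debug v8.9 is present and
 * the 4-bit ID_AA64DFR0_EL1 field is saturated at 15. A zero wide field
 * falls back to 16 slots.
 */
static int num_slots(int dfr0_field, int dfr1_field, bool v8p9)
{
	if (v8p9 && dfr0_field == 15) {
		if (!dfr1_field)
			return 16;
		return 1 + dfr1_field;
	}
	return 1 + dfr0_field;
}
```

For example, slot 37 maps to bank 2, index 5, and num_slots(15, 63, true) yields 64, matching ARM_MAX_BRP.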
* [PATCH v4 5/6] arm64/boot: Enable EL2 requirements for FEAT_Debugv8p9
From: Rob Herring (Arm) @ 2026-04-07 14:29 UTC (permalink / raw)
To: Will Deacon, Mark Rutland, Catalin Marinas, Jonathan Corbet,
Shuah Khan
Cc: Anshuman Khandual, linux-arm-kernel, linux-perf-users,
linux-kernel, linux-doc, Marc Zyngier, kvmarm, Oliver Upton
In-Reply-To: <20260407-arm-debug-8-9-v4-0-a4864e69b0ea@kernel.org>
From: Anshuman Khandual <anshuman.khandual@arm.com>
Fine grained trap control for MDSELR_EL1 register needs to be configured in
HDFGRTR2_EL2, and HDFGWTR2_EL2 registers when kernel enters at EL1, but EL2
is also present.
MDCR_EL2.EBWE needs to be enabled for additional (beyond 16) breakpoint and
watchpoint exceptions when kernel enters at EL1, but EL2 is also present.
While here, also update booting.rst with MDCR_EL3 and SCR_EL3 requirements.
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: kvmarm@lists.linux.dev
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
---
v4:
- Add that the requirements only apply when there are >16
breakpoints/watchpoints
- Adapt to changes in v7.0-rc1
---
Documentation/arch/arm64/booting.rst | 13 +++++++++++++
arch/arm64/include/asm/el2_setup.h | 14 ++++++++++++++
2 files changed, 27 insertions(+)
diff --git a/Documentation/arch/arm64/booting.rst b/Documentation/arch/arm64/booting.rst
index 13ef311dace8..00ba91bbd278 100644
--- a/Documentation/arch/arm64/booting.rst
+++ b/Documentation/arch/arm64/booting.rst
@@ -369,6 +369,19 @@ Before jumping into the kernel, the following conditions must be met:
- ZCR_EL2.LEN must be initialised to the same value for all CPUs the
kernel will execute on.
+ For CPUs with FEAT_Debugv8p9 extension present and >16 breakpoints or
+ watchpoints:
+
+ - If the kernel is entered at EL1 and EL2 is present:
+
+ - HDFGRTR2_EL2.nMDSELR_EL1 (bit 5) must be initialized to 0b1
+ - HDFGWTR2_EL2.nMDSELR_EL1 (bit 5) must be initialized to 0b1
+ - MDCR_EL2.EBWE (bit 43) must be initialized to 0b1
+
+ - If EL3 is present:
+
+ - MDCR_EL3.EBWE (bit 43) must be initialized to 0b1
+
For CPUs with the Scalable Matrix Extension (FEAT_SME):
- If EL3 is present:
diff --git a/arch/arm64/include/asm/el2_setup.h b/arch/arm64/include/asm/el2_setup.h
index 85f4c1615472..b51a280c18c0 100644
--- a/arch/arm64/include/asm/el2_setup.h
+++ b/arch/arm64/include/asm/el2_setup.h
@@ -174,6 +174,13 @@
// to own it.
.Lskip_trace_\@:
+ mrs x1, id_aa64dfr0_el1
+ ubfx x1, x1, #ID_AA64DFR0_EL1_DebugVer_SHIFT, #4
+ cmp x1, #ID_AA64DFR0_EL1_DebugVer_V8P9
+ b.lt .Lskip_dbg_v8p9_\@
+
+ orr x2, x2, #MDCR_EL2_EBWE
+.Lskip_dbg_v8p9_\@:
msr mdcr_el2, x2 // Configure debug traps
.endm
@@ -438,6 +445,13 @@
orr x0, x0, #HDFGRTR2_EL2_nPMSDSFR_EL1
.Lskip_spefds_\@:
+ mrs x1, id_aa64dfr0_el1
+ ubfx x1, x1, #ID_AA64DFR0_EL1_DebugVer_SHIFT, #4
+ cmp x1, #ID_AA64DFR0_EL1_DebugVer_V8P9
+ b.lt .Lskip_dbg_v8p9_\@
+
+ mov_q x0, HDFGWTR2_EL2_nMDSELR_EL1
+.Lskip_dbg_v8p9_\@:
msr_s SYS_HDFGRTR2_EL2, x0
msr_s SYS_HDFGWTR2_EL2, x0
msr_s SYS_HFGRTR2_EL2, xzr
--
2.53.0
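The ubfx/cmp/b.lt sequence in the patch above extracts a 4-bit field and compares it against the v8.9 value. A rough C equivalent is sketched below; the shift of 0 and the value 0xb for DEBUGVER_V8P9 are assumptions made for this example, and the macro and function names are hypothetical rather than the kernel's generated sysreg definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed for illustration; the kernel takes these from its sysreg headers. */
#define DEBUGVER_SHIFT	0
#define DEBUGVER_V8P9	0xb

/* Equivalent of "ubfx xN, xN, #shift, #4": unsigned 4-bit field extract. */
static unsigned int extract_field4(uint64_t reg, unsigned int shift)
{
	return (unsigned int)((reg >> shift) & 0xf);
}

/* True when the DebugVer field reports Debug v8.9 or later. */
static bool debug_is_v8p9_or_later(uint64_t dfr0)
{
	return extract_field4(dfr0, DEBUGVER_SHIFT) >= DEBUGVER_V8P9;
}
```

Comparing with ">=" rather than "==" follows the v4 changelog note about accepting versions later than Debug v8.9.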
* [PATCH v4 4/6] arm64/cpufeature: Add field details for ID_AA64DFR1_EL1 register
From: Rob Herring (Arm) @ 2026-04-07 14:29 UTC (permalink / raw)
To: Will Deacon, Mark Rutland, Catalin Marinas, Jonathan Corbet,
Shuah Khan
Cc: Anshuman Khandual, linux-arm-kernel, linux-perf-users,
linux-kernel, linux-doc
In-Reply-To: <20260407-arm-debug-8-9-v4-0-a4864e69b0ea@kernel.org>
From: Anshuman Khandual <anshuman.khandual@arm.com>
This adds required field details for ID_AA64DFR1_EL1, and also drops dummy
ftr_raz[] array which is now redundant. These register fields will be used
to enable increased breakpoint and watchpoint registers via FEAT_Debugv8p9
later. The register fields have been marked as FTR_STRICT, unless there is
a known variation in practice.
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
---
arch/arm64/kernel/cpufeature.c | 21 ++++++++++++++++-----
1 file changed, 16 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index c31f8e17732a..24c8e9147e35 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -570,6 +570,21 @@ static const struct arm64_ftr_bits ftr_id_aa64dfr0[] = {
ARM64_FTR_END,
};
+static const struct arm64_ftr_bits ftr_id_aa64dfr1[] = {
+ ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64DFR1_EL1_ABL_CMPs_SHIFT, 8, 0),
+ ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64DFR1_EL1_DPFZS_SHIFT, 4, 0),
+ ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64DFR1_EL1_EBEP_SHIFT, 4, 0),
+ ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64DFR1_EL1_ITE_SHIFT, 4, 0),
+ ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64DFR1_EL1_ABLE_SHIFT, 4, 0),
+ ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64DFR1_EL1_PMICNTR_SHIFT, 4, 0),
+ ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64DFR1_EL1_SPMU_SHIFT, 4, 0),
+ ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64DFR1_EL1_CTX_CMPs_SHIFT, 8, 0),
+ ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64DFR1_EL1_WRPs_SHIFT, 8, 0),
+ ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64DFR1_EL1_BRPs_SHIFT, 8, 0),
+ ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64DFR1_EL1_SYSPMUID_SHIFT, 8, 0),
+ ARM64_FTR_END,
+};
+
static const struct arm64_ftr_bits ftr_mvfr0[] = {
ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR0_EL1_FPRound_SHIFT, 4, 0),
ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR0_EL1_FPShVec_SHIFT, 4, 0),
@@ -756,10 +771,6 @@ static const struct arm64_ftr_bits ftr_single32[] = {
ARM64_FTR_END,
};
-static const struct arm64_ftr_bits ftr_raz[] = {
- ARM64_FTR_END,
-};
-
#define __ARM64_FTR_REG_OVERRIDE(id_str, id, table, ovr) { \
.sys_id = id, \
.reg = &(struct arm64_ftr_reg){ \
@@ -832,7 +843,7 @@ static const struct __ftr_reg_entry {
/* Op1 = 0, CRn = 0, CRm = 5 */
ARM64_FTR_REG(SYS_ID_AA64DFR0_EL1, ftr_id_aa64dfr0),
- ARM64_FTR_REG(SYS_ID_AA64DFR1_EL1, ftr_raz),
+ ARM64_FTR_REG(SYS_ID_AA64DFR1_EL1, ftr_id_aa64dfr1),
/* Op1 = 0, CRn = 0, CRm = 6 */
ARM64_FTR_REG(SYS_ID_AA64ISAR0_EL1, ftr_id_aa64isar0),
--
2.53.0
* [PATCH v4 3/6] arm64: hw_breakpoint: Add lockdep_assert_irqs_disabled() on install/uninstall
From: Rob Herring (Arm) @ 2026-04-07 14:29 UTC (permalink / raw)
To: Will Deacon, Mark Rutland, Catalin Marinas, Jonathan Corbet,
Shuah Khan
Cc: Anshuman Khandual, linux-arm-kernel, linux-perf-users,
linux-kernel, linux-doc
In-Reply-To: <20260407-arm-debug-8-9-v4-0-a4864e69b0ea@kernel.org>
The breakpoint install/uninstall/restore code depends on interrupts
being disabled. Make this requirement explicit with a
lockdep_assert_irqs_disabled() assertion.
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
---
arch/arm64/kernel/hw_breakpoint.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/arm64/kernel/hw_breakpoint.c b/arch/arm64/kernel/hw_breakpoint.c
index bb39bc759810..a9266dc710b4 100644
--- a/arch/arm64/kernel/hw_breakpoint.c
+++ b/arch/arm64/kernel/hw_breakpoint.c
@@ -231,6 +231,8 @@ static int hw_breakpoint_control(struct perf_event *bp,
enum dbg_active_el dbg_el = debug_exception_level(info->ctrl.privilege);
u32 ctrl;
+ lockdep_assert_irqs_disabled();
+
if (info->ctrl.type == ARM_BREAKPOINT_EXECUTE) {
/* Breakpoint */
ctrl_reg = AARCH64_DBG_REG_BCR;
--
2.53.0
* [PATCH v4 2/6] arm64: hw_breakpoint: Add additional kprobe excluded functions
From: Rob Herring (Arm) @ 2026-04-07 14:29 UTC (permalink / raw)
To: Will Deacon, Mark Rutland, Catalin Marinas, Jonathan Corbet,
Shuah Khan
Cc: Anshuman Khandual, linux-arm-kernel, linux-perf-users,
linux-kernel, linux-doc
In-Reply-To: <20260407-arm-debug-8-9-v4-0-a4864e69b0ea@kernel.org>
Everything that either runs during exceptions or touches the
breakpoint/watchpoint registers should be excluded from kprobes and
breakpoints.
The static functions may or may not end up in the no kprobe section
depending on whether the compiler inlines them or not. They are likely
inlined, but make it explicit to ensure that they always are.
Unfortunately, it is not possible to leave the inlining decision up to
the compiler and place code within the no kprobes section.
Parts of what hw_breakpoint_control() calls are excluded already. Just
exclude all of it to be safe.
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
---
arch/arm64/kernel/hw_breakpoint.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/kernel/hw_breakpoint.c b/arch/arm64/kernel/hw_breakpoint.c
index 38fbd67b2a6e..bb39bc759810 100644
--- a/arch/arm64/kernel/hw_breakpoint.c
+++ b/arch/arm64/kernel/hw_breakpoint.c
@@ -187,9 +187,9 @@ static int is_compat_bp(struct perf_event *bp)
* -ENOSPC if no slot is available/matches
* -EINVAL on wrong operations parameter
*/
-static int hw_breakpoint_slot_setup(struct perf_event **slots, int max_slots,
- struct perf_event *bp,
- enum hw_breakpoint_ops ops)
+static nokprobe_inline int
+hw_breakpoint_slot_setup(struct perf_event **slots, int max_slots,
+ struct perf_event *bp, enum hw_breakpoint_ops ops)
{
int i;
struct perf_event **slot;
@@ -283,6 +283,7 @@ static int hw_breakpoint_control(struct perf_event *bp,
return 0;
}
+NOKPROBE_SYMBOL(hw_breakpoint_control);
/*
* Install a perf counter breakpoint.
@@ -718,8 +719,8 @@ NOKPROBE_SYMBOL(do_breakpoint);
* The function returns the distance of the address from the bytes watched by
* the watchpoint. In case of an exact match, it returns 0.
*/
-static u64 get_distance_from_watchpoint(unsigned long addr, u64 val,
- struct arch_hw_breakpoint_ctrl *ctrl)
+static nokprobe_inline u64 get_distance_from_watchpoint(unsigned long addr, u64 val,
+ struct arch_hw_breakpoint_ctrl *ctrl)
{
u64 wp_low, wp_high;
u32 lens, lene;
@@ -739,8 +740,8 @@ static u64 get_distance_from_watchpoint(unsigned long addr, u64 val,
return 0;
}
-static int watchpoint_report(struct perf_event *wp, unsigned long addr,
- struct pt_regs *regs)
+static nokprobe_inline int watchpoint_report(struct perf_event *wp, unsigned long addr,
+ struct pt_regs *regs)
{
int step = is_default_overflow_handler(wp);
struct arch_hw_breakpoint *info = counter_arch_bp(wp);
--
2.53.0
* [PATCH v4 1/6] arm64: hw_breakpoint: Disallow breakpoints in no kprobe code
From: Rob Herring (Arm) @ 2026-04-07 14:29 UTC (permalink / raw)
To: Will Deacon, Mark Rutland, Catalin Marinas, Jonathan Corbet,
Shuah Khan
Cc: Anshuman Khandual, linux-arm-kernel, linux-perf-users,
linux-kernel, linux-doc
In-Reply-To: <20260407-arm-debug-8-9-v4-0-a4864e69b0ea@kernel.org>
Taking debug exceptions while manipulating the breakpoints is likely to
be unsafe. Setting kprobes in the breakpoint code is already
forbidden, but setting h/w breakpoints there is not. Copy what x86 does
and exclude breakpoints that fall within the kprobe section.
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
---
arch/arm64/kernel/hw_breakpoint.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/arm64/kernel/hw_breakpoint.c b/arch/arm64/kernel/hw_breakpoint.c
index ab76b36dce82..38fbd67b2a6e 100644
--- a/arch/arm64/kernel/hw_breakpoint.c
+++ b/arch/arm64/kernel/hw_breakpoint.c
@@ -418,6 +418,16 @@ static int arch_build_bp_info(struct perf_event *bp,
/* Type */
switch (attr->bp_type) {
case HW_BREAKPOINT_X:
+ /*
+ * We don't allow kernel breakpoints in places that are not
+ * acceptable for kprobes. On non-kprobes kernels, we don't
+ * allow kernel breakpoints at all.
+ */
+ if (attr->bp_addr >= TASK_SIZE_MAX) {
+ if (within_kprobe_blacklist(attr->bp_addr))
+ return -EINVAL;
+ }
+
hw->ctrl.type = ARM_BREAKPOINT_EXECUTE;
break;
case HW_BREAKPOINT_R:
--
2.53.0
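The check added by the patch above can be modelled in user-space C as a range lookup. This is only a sketch under stated assumptions: struct addr_range, example_blacklist, EXAMPLE_TASK_SIZE_MAX and check_exec_bp() are hypothetical stand-ins, the addresses are made up, and the real within_kprobe_blacklist() also rejects every kernel address when kprobes are not built in, which this model does not cover.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical half-open address range [start, end). */
struct addr_range {
	uint64_t start;
	uint64_t end;
};

/* Made-up values standing in for TASK_SIZE_MAX and the kprobe blacklist. */
#define EXAMPLE_TASK_SIZE_MAX 0x0001000000000000ULL
static const struct addr_range example_blacklist[] = {
	{ 0xffff800000100000ULL, 0xffff800000200000ULL },
};

static bool in_ranges(const struct addr_range *r, size_t n, uint64_t addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr >= r[i].start && addr < r[i].end)
			return true;
	return false;
}

/*
 * Returns 0 if an execute breakpoint at addr would be accepted, -EINVAL
 * if addr is a kernel address inside a blacklisted range. User addresses
 * are never checked against the blacklist.
 */
static int check_exec_bp(uint64_t addr)
{
	size_t n = sizeof(example_blacklist) / sizeof(example_blacklist[0]);

	if (addr >= EXAMPLE_TASK_SIZE_MAX &&
	    in_ranges(example_blacklist, n, addr))
		return -EINVAL;
	return 0;
}
```

The ordering matters: the cheap user/kernel split is tested first, so user-space breakpoints skip the range lookup entirely.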