* Re: [PATCH] doc: watchdog: fix typos etc.
From: Randy Dunlap @ 2026-04-08 21:28 UTC (permalink / raw)
To: Björn Persson
Cc: Andrew Morton, Jonathan Corbet, Shuah Khan, linux-doc,
linux-kernel
In-Reply-To: <20260408205611.0f7e38de@tag.xn--rombobjrn-67a.se>
On 4/8/26 11:56 AM, Björn Persson wrote:
> Randy Dunlap wrote:
>> -Similarly to the softlockup case, the current stack trace is displayed
>> +Similar to the softlockup case, the current stack trace is displayed
>
> "Similarly" modifies "is displayed", so the adverbial form is correct.
>
>> -The core of the detectors in a hrtimer. It servers multiple purpose:
>> +The core of the detectors is an hrtimer. It servers multiple purposes:
>
> And "servers" should be "serves".
Thank you.
Andrew, I'll send a v2 patch.
--
~Randy
^ permalink raw reply
* Re: [PATCH v2 00/16] fs,x86/resctrl: Add kernel-mode (e.g., PLZA) support to the resctrl subsystem
From: Reinette Chatre @ 2026-04-08 21:24 UTC (permalink / raw)
To: Babu Moger, corbet@lwn.net, tony.luck@intel.com,
Dave.Martin@arm.com, james.morse@arm.com, tglx@kernel.org,
mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
Cc: skhan@linuxfoundation.org, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, kas@kernel.org, rick.p.edgecombe@intel.com,
akpm@linux-foundation.org, pmladek@suse.com,
rdunlap@infradead.org, dapeng1.mi@linux.intel.com,
kees@kernel.org, elver@google.com, paulmck@kernel.org,
lirongqing@baidu.com, safinaskar@gmail.com, fvdl@google.com,
seanjc@google.com, pawan.kumar.gupta@linux.intel.com,
xin@zytor.com, tiala@microsoft.com, chang.seok.bae@intel.com,
Lendacky, Thomas, elena.reshetova@intel.com,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-coco@lists.linux.dev, kvm@vger.kernel.org,
eranian@google.com, peternewman@google.com
In-Reply-To: <0ae2b267-4527-4251-9136-6afdc3fc97a5@amd.com>
Hi Babu,
On 4/8/26 1:45 PM, Babu Moger wrote:
> On 4/7/26 23:45, Reinette Chatre wrote:
>> On 4/7/26 6:01 PM, Babu Moger wrote:
>>> That said, I’m open to not having a dedicated group if we can still support all the features that PLZA provides without it.
>>
>> I find that enabling user space to share CLOSID/RMID between user space
>> and kernel space to indeed support what PLZA provides. I think I am missing
>> something here since below proposal again attempts to isolate a resource group
>> (CLOSID) for kernel work.
>
> No. I dont want to isolate a group just for PLZA. All I am saying
> is, we should provide option to create a dedicated group if the user
> wants to do it.
I agree. I do not see resctrl needing to do anything to accomplish this though. If
the user wants a group dedicated to kernel mode/PLZA then all that is needed is for the
user not to assign any tasks to this group, either via changes to the group's tasks file
or via the group's cpus/cpus_list files.
>>>
>>> The mode can simply be determined on a per-group basis. We can
>>> introduce two new files—kernel_mode_cpus and
>>> kernel_mode_cpus_list—within each resctrl group when kmode (or
>>> PLZA) is supported.
>>
>> I think having these files in every resource group is confusing since user can only interact
>> with these files in one resource group for current PLZA. Why not *just* have the files in the
>> resource group that matches the group in info/kernel_mode_assignment?
>
> The default group can also serve as the PLZA group.
>
> #cat info/kernel_mode_assignment
> //
>
> At this point, the (kmode_cpus / kmode_cpus_list) files will exist in the default group:
>
> Then user changes the PLZA group to "test".
>
> #echo "test//" > info/kernel_mode_assignment
>
> At this point, we expect the files "(kmode_cpus/kmode_cpus_list)" to be visible in "test//" group.
>
> One open question is whether we should remove the visibility of these files from the default group. It’s unclear if we can safely do this dynamically.
>
> An alternative approach would be to always keep the files present, but allow access to them only for groups that are listed in "info/kernel_mode_assignment".
The files appearing/disappearing is just how the user experiences the resctrl fs interface.
Within resctrl the files could indeed always exist but resctrl can use the kernfs_show()
API to show/hide them as needed. Similar to resctrl_bmec_files_show() that you created.
Allowing/removing access becomes complicated because user space can always do a chmod
to change permissions that resctrl would need to handle.
I do not know if there are sharp corners here when thinking about strange scenarios where
user opens a file before resctrl changes visibility or permissions and then user space
interacts with the file. This may be worthwhile to test to matter which mechanism is used.
>>> Files and behavior:
>>> - cpus / cpus_list:
>>>
>>> CPUs listed here use the same allocation for both user and kernel space.
>>
>> Both user and kernel space?
>
> As it stands today, the CPU list is written to MSR_PQR_ASSOC, resulting in the same allocation for both user and kernel within a given CLOS.
>
> Kernel-mode allocation changes only if specific CPUs are included in the kmode_cpus list.
ack.
>>> There is no change to the current semantics of these files.
>>> If these files are empty, the group effectively becomes a PLZA-dedicated group.
>>
>> I do not see it this way. If the cpu/cpus_list files are empty then it means that the
>> tasks in the group will use their own CLOSID/RMID for user space allocation and
>> monitoring. What allocations/monitoring is used by tasks when in kernel mode depends
>> on whether the CPU the task is running on can be found in a kernel_mode_cpus/kernel_mode_cpuslist
>> file. If the CPU the task is running on can be found in a kernel_mode_cpus/kernel_mode_cpuslist
>> file then it will inherit whatever the PQR_PLZA setting of that CPU which is the allocation
>> associated with the resource group to which that kernel_mode_cpus/kernel_mode_cpuslist belongs.
>> If the CPU the task is running on cannot be found in kernel_mode_cpus/kernel_mode_cpuslist
>> then its kernel work will inherit its user space allocations and monitoring.
>>
>
> Yes. that is correct. I think our understanding is correct, but our implementation ideas are different it seems.
While we have been sharing different ideas I have tried to be clear on *why* I made
certain choices and attempted to provide specific feedback to your ideas. If you find
your plan to be better then please respond to my feedback about it to help me understand
why that may be the better solution. If you find your solution is better then could you please
describe it with detail? At this time I do not have a clear understanding of what you propose.
...
>
> Let me make sure I understand what you mentioned earlier. Copied the text below from the thread for the context:
>
> https://lore.kernel.org/lkml/3305c18e-9e50-4df0-b9f1-c61028628967@intel.com/
> =====================================================================
>
> Please consider the intent of this file when thinking about names. The idea is that "info/kernel_mode"
> specifies the "mode" of how kernel work is handled and it determines the configuration files used in that
> mode as well as the syntax when interacting with those files. By renaming "kernel_mode_assignment" to
> "kmode_groups" it implicitly requires all future kernel mode enhancements to need some data related to "groups".
>
> In summary, I think this can be simplified by introducing just two new files in info/ that enables the
> user to (a) select and (b) configure the "kernel mode". To start there can be just two modes,
> global_assign_ctrl_inherit_mon_per_cpu and global_assign_ctrl_assign_mon_per_cpu.
> global_assign_ctrl_inherit_mon_per_cpu mode requires a control group in kernel_mode_assignment while
> global_assign_ctrl_assign_mon_per_cpu requires a control and monitoring group.
>
> The resource group in info/kernel_mode_assignment gets two additional files "kernel_mode_cpus" and
> "kernel_mode_cpus_list" that contains the CPUs enabled with the kernel mode configuration, by default
> it will be all online CPUs. The resource group can continue to be used to manage allocations of and
> monitor user space tasks. Specifically, the "cpus", "cpus_list", and "tasks" files remain.
>
> A user wanting just "global" settings will get just that when writing the group to
> info/kernel_mode_assignment. A user wanting "per CPU" settings can follow the
> info/kernel_mode_assignment setting with changes to that resource group's kernel_mode_cpus/kernel_mode_cpus_list
> files. Any task running on a CPU that is *not* in kernel_mode_cpus/kernel_mode_cpus_list can be
> expected to inherit both CLOSID and RMID from user space for all kernel work.
>
> ======================================================================
>
> Let me try to get few clarification on things here.
>
> # cat info/kernel_mode
> [inherit_ctrl_and_mon]
> global_assign_ctrl_inherit_mon_per_cpu
> global_assign_ctrl_assign_mon_per_cpu
>
> My understanding of "inherit_ctrl_and_mon" is that the kernel
> inherits both the CLOS and the RMID from user space. Basically both
> user and kernel uses same CLOSID and RMID. This reflects the current
> behavior (without PLZA) correct? This would correspond to the
Correct.
> default group when resctrl is mounted.
>
> The modes "global_assign_ctrl_inherit_mon_per_cpu" and "global_assign_ctrl_assign_mon_per_cpu" represent the actual PLZA modes.
>
> Both of these modes introduce new files kernel_mode_cpus/ and kernel_mode_cpus_list in the resctrl group.
Right. To be specific when the user changes the mode to either "global_assign_ctrl_inherit_mon_per_cpu" or
"global_assign_ctrl_assign_mon_per_cpu" the new files will be created in the default resource group with
associated setting applied globally at that time.
>
> When the user echoes a group name into info/kernel_mode_assignment, PLZA is applied globally across all CPUs. This is default behavior.
>
> If the user wants PLZA to apply only to a specific subset of CPUs, then the kernel_mode_cpus or kernel_mode_cpus_list files need to be updated accordingly.
>
> global_assign_ctrl_inherit_mon_per_cpu : The group needs to be CTLR_MON group. This mode uses rmid_en=0 when writing PLZA MSR.
>
> global_assign_ctrl_assign_mon_per_cpu: The group needs to be CTLR_MON/MON group. This mode uses rmid_en=1 when writing PLZA MSR.
>
> Did I get it right?
This is my understanding also, yes.
Reinette
^ permalink raw reply
* Re: [PATCH v2 00/16] fs,x86/resctrl: Add kernel-mode (e.g., PLZA) support to the resctrl subsystem
From: Babu Moger @ 2026-04-08 20:45 UTC (permalink / raw)
To: Reinette Chatre, corbet@lwn.net, tony.luck@intel.com,
Dave.Martin@arm.com, james.morse@arm.com, tglx@kernel.org,
mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
Cc: skhan@linuxfoundation.org, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, kas@kernel.org, rick.p.edgecombe@intel.com,
akpm@linux-foundation.org, pmladek@suse.com,
rdunlap@infradead.org, dapeng1.mi@linux.intel.com,
kees@kernel.org, elver@google.com, paulmck@kernel.org,
lirongqing@baidu.com, safinaskar@gmail.com, fvdl@google.com,
seanjc@google.com, pawan.kumar.gupta@linux.intel.com,
xin@zytor.com, tiala@microsoft.com, chang.seok.bae@intel.com,
Lendacky, Thomas, elena.reshetova@intel.com,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-coco@lists.linux.dev, kvm@vger.kernel.org,
eranian@google.com, peternewman@google.com
In-Reply-To: <efc269f8-bf98-4f12-8d76-1fee564be84c@intel.com>
Hi Reinette,
On 4/7/26 23:45, Reinette Chatre wrote:
> Hi Babu,
>
> On 4/7/26 6:01 PM, Babu Moger wrote:
>> Hi Reinette,
>>
>> On 4/7/26 12:48, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 4/6/26 3:45 PM, Babu Moger wrote:
>>>> Hi Reinette,
>>>>
>>>> Sorry for the late response. I was trying to get confirmation about the use case.
>>>
>>> No problem. I appreciate that you did this so that we can make sure resctrl supports
>>> needed use cases.
>>>
>>>>
>>>> On 3/31/26 17:24, Reinette Chatre wrote:
>>>>> On 3/30/26 11:46 AM, Babu Moger wrote:
>>>>>> On 3/27/26 17:11, Reinette Chatre wrote:
>>>>>>> On 3/26/26 10:12 AM, Babu Moger wrote:
>>>>>>>> On 3/24/26 17:51, Reinette Chatre wrote:
>>>>>>>>> On 3/12/26 1:36 PM, Babu Moger wrote:
>>>
>>>>> can have domains that span different CPUs. There thus seem to be a built in assumption of what a "domain"
>>>>> means for PQR_PLZA_ASSOC so it sounds to me as though, instead of saying that "PQR_PLZA_ASSOC needs
>>>>> to be the same in QoS domain" it may be more accurate to, for example, say that "PQR_PLZA_ASSOC has L3 scope"?
>>>>
>>>> Yes.
>>>
>>> Above is about L3 scope ...
>>
>> Yes. The scope for PQR_PLZA_ASSOC is L3.
>>
>> Is that what you are asking here?
>
> I was trying to point out that there appears to be a mismatch between the actual scope and
> the planned implementation. As highlighted below during the discussion about "global" this is
> fine with me and I just wanted to confirm that this matches your intentions.
Ack.
>
>>
>>>
>>>>>
>>>>> This seems to be what this implementation does since it hardcodes PQR_PLZA_ASSOC scope to the L3
>>>>> resource but that creates dependency to the L3 resource that would make PLZA unusable if, for example,
>>>>> the user boots with "rdt=!l3cat" while wanting to use PLZA to manage MBA allocations when in kernel?
>>>>
>>>> Yes. that is correct. It should not be attached to one resource. We need to change it to global scope.
>>>
>>> Can I interpret "global scope" as "all online CPUs"? Doing so will simplify
>>
>> Yes. That is correct.
>>
>>
>>> supporting this feature. It does not sound practical for a user wanting to assign
>>> different resource groups to kernel work done in different domains ... the guidance should
>>> instead be to just set the allocations of one resource group to what is needed in the different
>>> domains? There may be more flexibility when supporting per-domain RMIDs though but so far
>>> it sounds as though the focus is global. We can consider what needs to be done to support
>>> some type of "per-domain" assignment as exercise whether current interface could support it
>>> in the future.
>>
>> Yes. Makes sense.
>>
>>>
>
> ...
>
>>>> The PLZA MSR is updated when user changes the association to the
>>>> file. No context switch code changes are needed. This will be
>>>> dedicated group. The current resctrl group files, "cpus, cpus_list
>>>
>>> Why does this have to be a dedicated group? One of the conclusions from v1
>>> discussion was that the "PLZA group" need *not* be a dedicated group. I repeated that
>>> in my earlier response that I left quoted above. You did not respond to these
>>> conclusions and statements in this regard while you keep coming back to this
>>> needing to be a dedicated group without providing a motivation to do so.
>>> Could you please elaborate why a dedicated group is required?
>>
>> If the same group applies identical limits to both user and kernel
>> space, it essentially behaves like a current resctrl group. In that
>> sense, it’s not really a PLZA group. PLZA’s key value is the ability
>> to separate allocations between user space and kernel space. A
>
> The plan has never been to force identical allocations for user and kernel
> space since that would go against this feature entirely. Even so, just as
> user and kernel space cannot be forced to have identical allocations they
> also cannot be forced to have different allocations. Specifically,
> a task *can* use the same CLOSID for user and kernel space work just as easily
> as it can use *different* CLOSID for user and kernel space work. There
> should not be any CLOSID reserved just for kernel work. Or am I missing something?
No. You are not missing anything.
>
>> single CPU can belong to two groups: one group manages the user-
>> space allocation for that CPU, while another manages the kernel-mode
>> allocation.
>
> Exactly. This is why it is important to have two files for this CPU association
> within a resource group. The cpus/cpus_list file continues to be used as today
> while the new kernel_mode_cpus/kernel_mode_cpus_list is used for kernel work.
> With this a task can be associated with any resource group for its user space
> allocations but when it runs on one of the CPUs within kernel_mode_cpus then
> its kernel work will be done with allocations of the resource group the
> kernel_mode_cpus file belongs to, which may or may not be the same
> resource group that the user space task belongs to.
Yes. Exactly.
>
>> This approach also simplifies file handling, which is another reason
>> I prefer it.
>
> I *think* we have different interpretations of "dedicated group":
> It sounds as though you interpret "dedicated group" as a way that enforces
> the same allocations to user space and kernel work.
> I interpret "dedicated group" essentially as a CLOSID reserved for kernel
> work. Since I do not see that resctrl should dedicate a CLOSID/resource group
> for kernel work I have been pushing against such "dedicated group".
Actually, our understanding is same. Probably, I am not explaining it
right. Hope we get there soon.
>
>> That said, I’m open to not having a dedicated group if we can still support all the features that PLZA provides without it.
>
> I find that enabling user space to share CLOSID/RMID between user space
> and kernel space to indeed support what PLZA provides. I think I am missing
> something here since below proposal again attempts to isolate a resource group
> (CLOSID) for kernel work.
No. I dont want to isolate a group just for PLZA. All I am saying is, we
should provide option to create a dedicated group if the user wants to
do it.
>
>>>> Add a file, "info/kmode_monitor", to describe how kmode is monitored.
>>>>
>>>> # cat info/kmode_monitor
>>>> [inherit_ctrl_and_mon] <- Kernel uses the same CLOSID/RMID as user. Default option for the "global"
>>>> assign_ctrl_inherit_mon <- One CLOSID for all kernel work; RMID inherited from user.
>>>> assign_ctrl_assign_mon <- One resource group (CLOSID+RMID) for all kernel work. Default option for "cpu" type.
>>>
>>> My first thought is that the naming is confusing. resctrl has a very strong relationship between
>>> "RMID" and "monitoring" so naming a file "monitor" that deals with allocation/ctrl/CLOSID is
>>> potentially confusion.
>>>
>>> Apart from that, while I think I understand where you are going by separating the mode into
>>> two files I am concerned about future complications needing to accommodate all different
>>> combinations of the (now) essentially two modes. My preference is thus to keep this simple by
>>> keeping the mode within one file.
>>>
>>> Even so, when stepping back, it does not really look like we need to separate the "global"
>>> and "per CPU" modes. We could just have a single "per CPU" mode and the "global" is just
>>> its default of "all CPUs", no?
>>
>> Yes. That correct.
>>
>>>
>>> Consider, for example, the implementation just consisting of:
>>>
>>> # cat info/kernel_mode
>>> [inherit_ctrl_and_mon]
>>> global_assign_ctrl_inherit_mon_per_cpu
>>> global_assign_ctrl_assign_mon_per_cpu
>>>
>>>>
>>>> Rename “kernel_mode_assignment” to “kmode_group” to assign the specific group to kmode. This file usage is same as before.
>>>>
>>>> #cat info/kmode_groups (Renamed "kernel_mode_assignment")
>>>> //
>>>
>>> Please consider the intent of this file when thinking about names. The idea is that "info/kernel_mode"
>>> specifies the "mode" of how kernel work is handled and it determines the configuration files used in that
>>> mode as well as the syntax when interacting with those files. By renaming "kernel_mode_assignment" to
>>> "kmode_groups" it implicitly requires all future kernel mode enhancements to need some data related to "groups".
>>>
>>> In summary, I think this can be simplified by introducing just two new files in info/ that enables the
>>> user to (a) select and (b) configure the "kernel mode". To start there can be just two modes,
>>> global_assign_ctrl_inherit_mon_per_cpu and global_assign_ctrl_assign_mon_per_cpu.
>>> global_assign_ctrl_inherit_mon_per_cpu mode requires a control group in kernel_mode_assignment while
>>> global_assign_ctrl_assign_mon_per_cpu requires a control and monitoring group.
>>>
>>> The resource group in info/kernel_mode_assignment gets two additional files "kernel_mode_cpus" and
>>> "kernel_mode_cpus_list" that contains the CPUs enabled with the kernel mode configuration, by default
>>> it will be all online CPUs. The resource group can continue to be used to manage allocations of and
>>> monitor user space tasks. Specifically, the "cpus", "cpus_list", and "tasks" files remain.
>>>
>>> A user wanting just "global" settings will get just that when writing the group to
>>> info/kernel_mode_assignment. A user wanting "per CPU" settings can follow the
>>> info/kernel_mode_assignment setting with changes to that resource group's kernel_mode_cpus/kernel_mode_cpus_list
>>> files. Any task running on a CPU that is *not* in kernel_mode_cpus/kernel_mode_cpus_list can be
>>> expected to inherit both CLOSID and RMID from user space for all kernel work.
>>
>> After further consideration, I don’t think the info/kernel_mode file
>> is necessary. There’s no need to enforce a specific mode for all the
>> PLZA groups. Avoiding this constraint makes the design more
>> flexible, particularly as we move toward supporting multiple PLZA
>> groups in the future. MPAM already appears capable of handling more
>> than one group—for example, one group could use
>> inherit_ctrl_and_mon, while another could use
>> global_assign_ctrl_inherit_mon_per_cpu.
>
> You are looking ahead at future capabilities for which we do not know all requirements
> at this time. I think it is very good to consider how things may progress and your example
> of MPAM is of course on point. I believe the current design does consider this progression.
> Please see https://lore.kernel.org/lkml/2ab556af-095b-422b-9396-f845c6fd0342@intel.com/
> (search for "per_group_assign_ctrl_assign_mon"). In that exploration per-group assignment
> is actually accomplished with global files. I thus think we should not make such a big
> architectural decision that does not benefit the immediate feature using partial information.
> As it is, a "info/kernel_mode" gives the flexibility to expand to, if needed, configuration
> files within a resource group. That is why the intention is to associate the mode within
> info/kernel_mode with the presence/absence of info/kernel_mode_assignment (search for
> "Visibility depends on active mode in info/kernel_mode" in linked email) since in the
> future resctrl may need to enable a mode that needs configuration files within each
> resource group and when enabling such mode the per-resource group files will appear
> instead of the global info/kernel_mode_assignment.
>
>>
>> The mode can simply be determined on a per-group basis. We can introduce two new files—kernel_mode_cpus and kernel_mode_cpus_list—within each resctrl group when kmode (or PLZA) is supported.
>
> I think having these files in every resource group is confusing since user can only interact
> with these files in one resource group for current PLZA. Why not *just* have the files in the
> resource group that matches the group in info/kernel_mode_assignment?
The default group can also serve as the PLZA group.
#cat info/kernel_mode_assignment
//
At this point, the (kmode_cpus / kmode_cpus_list) files will exist in
the default group:
Then user changes the PLZA group to "test".
#echo "test//" > info/kernel_mode_assignment
At this point, we expect the files "(kmode_cpus/kmode_cpus_list)" to be
visible in "test//" group.
One open question is whether we should remove the visibility of these
files from the default group. It’s unclear if we can safely do this
dynamically.
An alternative approach would be to always keep the files present, but
allow access to them only for groups that are listed in
"info/kernel_mode_assignment".
>>
>> The info/kernel_mode_assignment file would indicate which resctrl
>> group(or groups) is used for PLZA. The files—kernel_mode_cpus and
>> kernel_mode_cpus_list would indicate how the plza is applied which
>> each group.
>
> The "how PLZA is applied" should be learned from info/kernel_mode where user
> space learns whether RMID is inherited or not. While I find kernel_mode_cpus
> and kernel_mode_cpus_list to be just for configuration and just found in the
> resource group listed in info/kernel_mode_assignment.
ok.
>
>>
>> Files and behavior:
>> - cpus / cpus_list:
>>
>> CPUs listed here use the same allocation for both user and kernel space.
>
> Both user and kernel space?
As it stands today, the CPU list is written to MSR_PQR_ASSOC, resulting
in the same allocation for both user and kernel within a given CLOS.
Kernel-mode allocation changes only if specific CPUs are included in the
kmode_cpus list.
> Monitoring would depend on info/kernel_mode_assignment ("inherit_mon")
> and kernel space allocation would depend on whether the CPU on which the task runs
> can be found in kernel_mode_cpus, no?
Yes. that is correct.
>
>
>> There is no change to the current semantics of these files.
>> If these files are empty, the group effectively becomes a PLZA-dedicated group.
>
> I do not see it this way. If the cpu/cpus_list files are empty then it means that the
> tasks in the group will use their own CLOSID/RMID for user space allocation and
> monitoring. What allocations/monitoring is used by tasks when in kernel mode depends
> on whether the CPU the task is running on can be found in a kernel_mode_cpus/kernel_mode_cpuslist
> file. If the CPU the task is running on can be found in a kernel_mode_cpus/kernel_mode_cpuslist
> file then it will inherit whatever the PQR_PLZA setting of that CPU which is the allocation
> associated with the resource group to which that kernel_mode_cpus/kernel_mode_cpuslist belongs.
> If the CPU the task is running on cannot be found in kernel_mode_cpus/kernel_mode_cpuslist
> then its kernel work will inherit its user space allocations and monitoring.
>
Yes. that is correct. I think our understanding is correct, but our
implementation ideas are different it seems.
>>
>> - kernel_mode_cpus / kernel_mode_cpus_list:
>>
>> These files determine whether a separate kernel allocation is applied.
>> If empty, user and kernel share the same allocation.
>> If non-empty, the kernel uses a separate allocation.
>>
>> The group can be CTL_MON or MON group. Based on type the group the CLOSID and RMID will be used to enable PLZA. If it is MON, then rmid_en = 1 when writing PLZA MSR.
>
> This will be difficult to get right since CTRL_MON groups also have RMID assigned.
>
>> Here’s the proposed flow:
>>
>> # mount -t resctrl resctrl /sys/fs/resctrl/
>> # cd /sys/fs/resctrl/
>> # cat info/kernel_mode_assignment
>> //
>>
>> By default, the root (default) group is PLZA-enabled when resctrl is mounted. All CPUs use CLOSID 0 for both user and kernel-mode allocation.
>>
>> # cat cpus_list
>> 1-64
>> # cat kmode_cpus_list
>> 1-64
>>
>> Next, create a new group for PLZA:
>>
>> # mkdir plza_group
>>
>> # echo "plza_group//" > info/kernel_mode_assignment
>>
>> At this point, plza_group becomes the new PLZA-enabled group, and the PLZA-related MSRs are updated accordingly.
>
> It really looks like you are getting back to trying to dedicate a resource group to
> kernel work and that is not something that resctrl should enforce.
>
>>
>> # cat plza_group/cpus_list
>> <empty>
>>
>> # cat plza_group/kmode_cpus_list
>> 1-64
>>
>> The user can then update kmode_cpus_list to apply PLZA only to a specific subset of CPUs, if desired.
>>
>>
>> What do you think of this approach?
>
> It is difficult to predict how the "next" PLZA will actually end up looking like and I find resctrl creating a complicated
> interface to support this to be risky. Instead I would prefer to focus on efficiently supporting what PLZA can do today
> and make it extensible. Apart from that I find the implicit interface, "If it is MON, then rmid_en = 1" to be too
> architecture specific for a generic interface while also not able to accurately capture user's intent (i.e. user may
> indeed, for example, want "a CTRL_MON group to have rmid_en = 1"). Finally, I am just so confused about why the implementations
> keep needing to dedicate a resource group/CLOSID to kernel work.
Let me make sure I understand what you mentioned earlier. Copied the
text below from the thread for the context:
https://lore.kernel.org/lkml/3305c18e-9e50-4df0-b9f1-c61028628967@intel.com/
=====================================================================
Please consider the intent of this file when thinking about names. The
idea is that "info/kernel_mode"
specifies the "mode" of how kernel work is handled and it determines the
configuration files used in that
mode as well as the syntax when interacting with those files. By
renaming "kernel_mode_assignment" to
"kmode_groups" it implicitly requires all future kernel mode
enhancements to need some data related to "groups".
In summary, I think this can be simplified by introducing just two new
files in info/ that enables the
user to (a) select and (b) configure the "kernel mode". To start there
can be just two modes,
global_assign_ctrl_inherit_mon_per_cpu and
global_assign_ctrl_assign_mon_per_cpu.
global_assign_ctrl_inherit_mon_per_cpu mode requires a control group in
kernel_mode_assignment while
global_assign_ctrl_assign_mon_per_cpu requires a control and monitoring
group.
The resource group in info/kernel_mode_assignment gets two additional
files "kernel_mode_cpus" and
"kernel_mode_cpus_list" that contains the CPUs enabled with the kernel
mode configuration, by default
it will be all online CPUs. The resource group can continue to be used
to manage allocations of and
monitor user space tasks. Specifically, the "cpus", "cpus_list", and
"tasks" files remain.
A user wanting just "global" settings will get just that when writing
the group to
info/kernel_mode_assignment. A user wanting "per CPU" settings can
follow the
info/kernel_mode_assignment setting with changes to that resource
group's kernel_mode_cpus/kernel_mode_cpus_list
files. Any task running on a CPU that is *not* in
kernel_mode_cpus/kernel_mode_cpus_list can be
expected to inherit both CLOSID and RMID from user space for all kernel
work.
======================================================================
Let me try to get few clarification on things here.
# cat info/kernel_mode
[inherit_ctrl_and_mon]
global_assign_ctrl_inherit_mon_per_cpu
global_assign_ctrl_assign_mon_per_cpu
My understanding of "inherit_ctrl_and_mon" is that the kernel inherits
both the CLOS and the RMID from user space. Basically both user and
kernel uses same CLOSID and RMID. This reflects the current behavior
(without PLZA) correct? This would correspond to the default group when
resctrl is mounted.
The modes "global_assign_ctrl_inherit_mon_per_cpu" and
"global_assign_ctrl_assign_mon_per_cpu" represent the actual PLZA modes.
Both of these modes introduce new files kernel_mode_cpus/ and
kernel_mode_cpus_list in the resctrl group.
When the user echoes a group name into info/kernel_mode_assignment, PLZA
is applied globally across all CPUs. This is default behavior.
If the user wants PLZA to apply only to a specific subset of CPUs, then
the kernel_mode_cpus or kernel_mode_cpus_list files need to be updated
accordingly.
global_assign_ctrl_inherit_mon_per_cpu : The group needs to be CTLR_MON
group. This mode uses rmid_en=0 when writing PLZA MSR.
global_assign_ctrl_assign_mon_per_cpu: The group needs to be
CTLR_MON/MON group. This mode uses rmid_en=1 when writing PLZA MSR.
Did I get it right?
Thanks
Babu
^ permalink raw reply
* Re: [PATCH] hwmon: (asus-ec-sensors) add ROG STRIX B650E-E GAMING WIFI
From: Veronika Kossmann @ 2026-04-08 20:28 UTC (permalink / raw)
To: Eugene Shalygin, Guenter Roeck
Cc: Veronika Kossmann, Veronika Kossmann, Jonathan Corbet, Shuah Khan,
linux-hwmon, linux-doc, linux-kernel
In-Reply-To: <CAB95QATxrJa0koMq=BCjnXvLHJ5boRBUA+76FwqWJhmhEi-Tqg@mail.gmail.com>
On 4/4/26 10:12, Eugene Shalygin wrote:
> On Sat, 4 Apr 2026 at 06:38, Guenter Roeck <linux@roeck-us.net> wrote:
>> Sashiko has a problem with this patch:
> I must admit now, that these _SET macros were a bad idea, it turned
> out to be too easy to misread. I'm going to remove them.
>
> Veronika, could you, please, show us the output from sensors with this
> version of the code?
>
> Cheers,
> Eugene
Of course:
$sensors asusec-isa-000a
asusec-isa-000a
Adapter: ISA adapter
CPU: +37.0°C
Motherboard: +38.0°C
VRM: +51.0°C
These are relevant to actual temperatures.
Best wishes,
Veronika
^ permalink raw reply
* [PATCH 1/2] KVM: arm64: Add KVM_CAP_ARM_DISABLE_EXITS for WFI/WFE passthrough
From: David Woodhouse @ 2026-04-08 20:23 UTC (permalink / raw)
To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Marc Zyngier,
Oliver Upton, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
Catalin Marinas, Will Deacon, kvm, linux-doc, linux-kernel,
linux-arm-kernel, kvmarm, linux-kselftest, Colton Lewis,
Jing Zhang, David Woodhouse
In-Reply-To: <20260408202557.2102476-1-dwmw2@infradead.org>
From: David Woodhouse <dwmw@amazon.co.uk>
Add a per-VM capability to allow userspace to disable WFI and/or WFE
trapping, modelled after x86's KVM_CAP_X86_DISABLE_EXITS. When the
corresponding flag is set, the trap is unconditionally cleared
regardless of the global kvm-arm.wf{i,e}_trap_policy setting.
The existing kernel command line parameters provide a system-wide
override, but a per-VM capability allows the VMM to make the decision
per guest.
This is useful for hypervisors running a combination of dedicated
pinned vCPUs which want to avoid the cost of trapping WFI/WFE, as
well as overcommitted floating instances where it is necessary.
As with the x86 equivalent, KVM_CHECK_EXTENSION returns the bitmask of
supported exit disables.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
Documentation/virt/kvm/api.rst | 28 ++++++++++++++++++++++++++++
arch/arm64/include/asm/kvm_host.h | 4 ++++
arch/arm64/kvm/arm.c | 20 ++++++++++++++++++++
include/uapi/linux/kvm.h | 6 ++++++
4 files changed, 58 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 032516783e96..e3b3bd9edeec 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8902,6 +8902,34 @@ helpful if user space wants to emulate instructions which are not
This capability can be enabled dynamically even if VCPUs were already
created and are running.
+7.47 KVM_CAP_ARM_DISABLE_EXITS
+------------------------------
+
+:Architecture: arm64
+:Target: VM
+:Parameters: args[0] is a bitmask of exits to disable
+:Returns: 0 on success, -EINVAL if unsupported bits are set.
+
+Valid bits in args[0]:
+
+ - ``KVM_ARM_DISABLE_EXITS_WFI``: Disable trapping of WFI (Wait For
+ Interrupt) instructions. The guest WFI will execute natively instead
+ of causing a VM exit.
+
+ - ``KVM_ARM_DISABLE_EXITS_WFE``: Disable trapping of WFE (Wait For
+ Event) instructions. The guest WFE will execute natively instead of
+ causing a VM exit.
+
+When a bit is set, the corresponding trap is unconditionally cleared for
+all vCPUs in the VM, overriding the system-wide ``kvm-arm.wfi_trap_policy``
+and ``kvm-arm.wfe_trap_policy`` kernel parameters.
+
+Disabling exits is a one-way operation: once an exit type is disabled for
+a VM, it cannot be re-enabled. Calling this ioctl with args[0] = 0 is a
+no-op.
+
+``KVM_CHECK_EXTENSION`` returns the bitmask of exits that can be disabled.
+
8. Other capabilities.
======================
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 70cb9cfd760a..a1bb025c641f 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -312,6 +312,10 @@ struct kvm_arch {
size_t nested_mmus_size;
int nested_mmus_next;
+ /* Per-VM WFI trap override; set via KVM_CAP_ARM_DISABLE_EXITS */
+ bool wfi_in_guest;
+ bool wfe_in_guest;
+
/* Interrupt controller */
struct vgic_dist vgic;
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 410ffd41fd73..326a99fea753 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -178,6 +178,17 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
}
mutex_unlock(&kvm->lock);
break;
+ case KVM_CAP_ARM_DISABLE_EXITS:
+ if (cap->args[0] & ~KVM_ARM_DISABLE_VALID_EXITS) {
+ r = -EINVAL;
+ break;
+ }
+ if (cap->args[0] & KVM_ARM_DISABLE_EXITS_WFI)
+ kvm->arch.wfi_in_guest = true;
+ if (cap->args[0] & KVM_ARM_DISABLE_EXITS_WFE)
+ kvm->arch.wfe_in_guest = true;
+ r = 0;
+ break;
case KVM_CAP_ARM_SEA_TO_USER:
r = 0;
set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
@@ -379,6 +390,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_ARM_SEA_TO_USER:
r = 1;
break;
+ case KVM_CAP_ARM_DISABLE_EXITS:
+ r = KVM_ARM_DISABLE_VALID_EXITS;
+ break;
case KVM_CAP_SET_GUEST_DEBUG2:
return KVM_GUESTDBG_VALID_MASK;
case KVM_CAP_ARM_SET_DEVICE_ADDR:
@@ -610,6 +624,9 @@ static void vcpu_set_pauth_traps(struct kvm_vcpu *vcpu)
static bool kvm_vcpu_should_clear_twi(struct kvm_vcpu *vcpu)
{
+ if (vcpu->kvm->arch.wfi_in_guest)
+ return true;
+
if (unlikely(kvm_wfi_trap_policy != KVM_WFX_NOTRAP_SINGLE_TASK))
return kvm_wfi_trap_policy == KVM_WFX_NOTRAP;
@@ -621,6 +638,9 @@ static bool kvm_vcpu_should_clear_twi(struct kvm_vcpu *vcpu)
static bool kvm_vcpu_should_clear_twe(struct kvm_vcpu *vcpu)
{
+ if (vcpu->kvm->arch.wfe_in_guest)
+ return true;
+
if (unlikely(kvm_wfe_trap_policy != KVM_WFX_NOTRAP_SINGLE_TASK))
return kvm_wfe_trap_policy == KVM_WFX_NOTRAP;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 80364d4dbebb..694cf699ed0a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -669,6 +669,11 @@ struct kvm_ioeventfd {
#define KVM_X86_DISABLE_EXITS_CSTATE (1 << 3)
#define KVM_X86_DISABLE_EXITS_APERFMPERF (1 << 4)
+#define KVM_ARM_DISABLE_EXITS_WFI (1 << 0)
+#define KVM_ARM_DISABLE_EXITS_WFE (1 << 1)
+#define KVM_ARM_DISABLE_VALID_EXITS (KVM_ARM_DISABLE_EXITS_WFI | \
+ KVM_ARM_DISABLE_EXITS_WFE)
+
/* for KVM_ENABLE_CAP */
struct kvm_enable_cap {
/* in */
@@ -989,6 +994,7 @@ struct kvm_enable_cap {
#define KVM_CAP_ARM_SEA_TO_USER 245
#define KVM_CAP_S390_USER_OPEREXEC 246
#define KVM_CAP_S390_KEYOP 247
+#define KVM_CAP_ARM_DISABLE_EXITS 248
struct kvm_irq_routing_irqchip {
__u32 irqchip;
--
2.51.0
^ permalink raw reply related
* [PATCH 2/2] KVM: arm64: selftests: Add KVM_CAP_ARM_DISABLE_EXITS UAPI test
From: David Woodhouse @ 2026-04-08 20:23 UTC (permalink / raw)
To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Marc Zyngier,
Oliver Upton, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
Catalin Marinas, Will Deacon, kvm, linux-doc, linux-kernel,
linux-arm-kernel, kvmarm, linux-kselftest, Colton Lewis,
Jing Zhang, David Woodhouse
In-Reply-To: <20260408202557.2102476-1-dwmw2@infradead.org>
From: David Woodhouse <dwmw@amazon.co.uk>
Test the KVM_CAP_ARM_DISABLE_EXITS capability interface:
- KVM_CHECK_EXTENSION reports KVM_ARM_DISABLE_EXITS_WFI
- KVM_ENABLE_CAP succeeds with valid flags (WFI, zero)
- KVM_ENABLE_CAP fails with EINVAL for unknown flags
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../selftests/kvm/arm64/disable_exits.c | 48 +++++++++++++++++++
2 files changed, 49 insertions(+)
create mode 100644 tools/testing/selftests/kvm/arm64/disable_exits.c
diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index 878d7cb92555..d8e7ff122445 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -179,6 +179,7 @@ TEST_GEN_PROGS_arm64 += arm64/vgic_irq
TEST_GEN_PROGS_arm64 += arm64/vgic_lpi_stress
TEST_GEN_PROGS_arm64 += arm64/vgic_group_iidr
TEST_GEN_PROGS_arm64 += arm64/vgic_group_v2
+TEST_GEN_PROGS_arm64 += arm64/disable_exits
TEST_GEN_PROGS_arm64 += arm64/vpmu_counter_access
TEST_GEN_PROGS_arm64 += arm64/no-vgic-v3
TEST_GEN_PROGS_arm64 += arm64/idreg-idst
diff --git a/tools/testing/selftests/kvm/arm64/disable_exits.c b/tools/testing/selftests/kvm/arm64/disable_exits.c
new file mode 100644
index 000000000000..27fe6c9297b2
--- /dev/null
+++ b/tools/testing/selftests/kvm/arm64/disable_exits.c
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * disable_exits.c - Test KVM_CAP_ARM_DISABLE_EXITS UAPI
+ *
+ * Verify that KVM_CHECK_EXTENSION reports the valid exit disable mask
+ * and that KVM_ENABLE_CAP accepts valid flags and rejects invalid ones.
+ */
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+
+int main(int argc, char *argv[])
+{
+ struct kvm_vm *vm;
+ int r;
+
+ TEST_REQUIRE(kvm_has_cap(KVM_CAP_ARM_DISABLE_EXITS));
+
+ r = kvm_check_cap(KVM_CAP_ARM_DISABLE_EXITS);
+ TEST_ASSERT(r & KVM_ARM_DISABLE_EXITS_WFI,
+ "KVM_CHECK_EXTENSION should report WFI: got 0x%x", r);
+ TEST_ASSERT(r & KVM_ARM_DISABLE_EXITS_WFE,
+ "KVM_CHECK_EXTENSION should report WFE: got 0x%x", r);
+
+ vm = vm_create(1);
+
+ /* Valid: disable WFI trapping */
+ vm_enable_cap(vm, KVM_CAP_ARM_DISABLE_EXITS, KVM_ARM_DISABLE_EXITS_WFI);
+
+ /* Valid: disable WFE trapping */
+ vm_enable_cap(vm, KVM_CAP_ARM_DISABLE_EXITS, KVM_ARM_DISABLE_EXITS_WFE);
+
+ /* Valid: disable both */
+ vm_enable_cap(vm, KVM_CAP_ARM_DISABLE_EXITS,
+ KVM_ARM_DISABLE_EXITS_WFI | KVM_ARM_DISABLE_EXITS_WFE);
+
+ /* Valid: no exits disabled (no-op) */
+ vm_enable_cap(vm, KVM_CAP_ARM_DISABLE_EXITS, 0);
+
+ /* Invalid: unknown bit set */
+ r = __vm_enable_cap(vm, KVM_CAP_ARM_DISABLE_EXITS, 1ULL << 31);
+ TEST_ASSERT(r == -1 && errno == EINVAL,
+ "Unknown flags should fail with EINVAL: got %d errno %d",
+ r, errno);
+
+ kvm_vm_free(vm);
+ return 0;
+}
--
2.51.0
^ permalink raw reply related
* [PATCH 0/2] KVM: arm64: KVM: arm64: Add per-VM WFI/WFE exit disable capability
From: David Woodhouse @ 2026-04-08 20:23 UTC (permalink / raw)
To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Marc Zyngier,
Oliver Upton, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
Catalin Marinas, Will Deacon, kvm, linux-doc, linux-kernel,
linux-arm-kernel, kvmarm, linux-kselftest, Colton Lewis,
Jing Zhang, David Woodhouse
Add KVM_CAP_ARM_DISABLE_EXITS, modelled after the existing x86
KVM_CAP_X86_DISABLE_EXITS, to allow userspace to disable WFI and/or
WFE trapping on a per-VM basis.
KVM already has system-wide kernel command line parameters
(kvm-arm.wfi_trap_policy and kvm-arm.wfe_trap_policy, added in
0b5afe05377d) to control WFx trapping. However, these are global and
set at boot time. A per-VM capability allows the VMM to make the
decision per guest — for example, disabling WFI trapping for
latency-sensitive VMs with pinned vCPUs while keeping it enabled for
overcommitted guests on the same host.
When a flag is set via KVM_ENABLE_CAP, the corresponding trap is
unconditionally cleared, overriding the system-wide policy. When the
flag is not set, the system policy (including the default
single-task heuristic) applies as before.
As with the x86 equivalent, disabling exits is a one-way operation
per VM.
Tested on Graviton 3 (Neoverse-V1) metal.
David Woodhouse (2):
KVM: arm64: Add KVM_CAP_ARM_DISABLE_EXITS for WFI/WFE passthrough
KVM: arm64: selftests: Add KVM_CAP_ARM_DISABLE_EXITS UAPI test
Documentation/virt/kvm/api.rst | 28 +++++++++++++
arch/arm64/include/asm/kvm_host.h | 4 ++
arch/arm64/kvm/arm.c | 20 ++++++++++
include/uapi/linux/kvm.h | 6 +++
tools/testing/selftests/kvm/Makefile.kvm | 1 +
tools/testing/selftests/kvm/arm64/disable_exits.c | 48 +++++++++++++++++++++++
6 files changed, 107 insertions(+)
create mode 100644 tools/testing/selftests/kvm/arm64/disable_exits.c
^ permalink raw reply
* Re: [PATCH v10 12/21] gpu: nova-core: mm: Add unified page table entry wrapper enums
From: Joel Fernandes @ 2026-04-08 20:19 UTC (permalink / raw)
To: Alexandre Courbot, Eliot Courtney, Danilo Krummrich
Cc: linux-kernel, Miguel Ojeda, Boqun Feng, Gary Guo, Bjorn Roy Baron,
Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
Dave Airlie, Daniel Almeida, Koen Koning, dri-devel,
rust-for-linux, Nikola Djukic, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Simona Vetter, Jonathan Corbet,
Alex Deucher, Christian Koenig, Jani Nikula, Joonas Lahtinen,
Rodrigo Vivi, Tvrtko Ursulin, Huang Rui, Matthew Auld,
Matthew Brost, Lucas De Marchi, Thomas Hellstrom, Helge Deller,
Alex Gaynor, Boqun Feng, John Hubbard, Alistair Popple,
Timur Tabi, Edwin Peer, Andrea Righi, Andy Ritger, Zhi Wang,
Balbir Singh, Philipp Stanner, Elle Rhumsaa, alexeyi, joel,
linux-doc, amd-gfx, intel-gfx, intel-xe, linux-fbdev
In-Reply-To: <DHNKYBM159T9.2UUQ7CU0RN0BU@nvidia.com>
Hi Alex, Eliot, Danilo,
Thanks for taking a look. Let me respond to the specific points below.
On Wed, 08 Apr 2026, Alexandre Courbot wrote:
> After a quick look I'd say that having a trait here would actually be
> *good* for correctness and maintainability.
>
> The current design implies that every operation on a page table (most
> likely using the walker) goes through a branching point. Just looking at
> `PtWalk::read_pte_at_level`, there are already at least 6
> `if version == 2 { } else { }` branches that all resolve to the same
> result. Include walking down the PDEs and you have at least a dozen of
> these just to resolve a virtual address. I know CPUs are fast, but this
> is still wasted cycles for no good reason.
I did some measurements and there is no notieceable difference in both
approaches. I ran perf and loaded nova with self-tests running. The extra
potential branching is lost in the noise. In both cases, loading nova and
running the self-tests has ~119.7M branch instructions on my Ampere. The total
instruction count is also identical (~615M).
I measured like this:
perf stat -e
branches,branch-misses,cache-references,cache-misses,instructions,cycles --
modprobe nova_core
So I think the branching argument is not a strong one. I also did more
measurements and the dominant time taken is MMIO. During the map prep and
execute, page table walks are done. A TLB flush alone costs ~1.4 microseconds.
And PRAMIN BAR0 writes to write the PTE is also about 1 microsecond. Considering
this, I don't think the extra branching argument holds (even without branch
prediction and speculation).
Also some branches cannot be eliminated even with parameterization:
if level == self.mmu_version.dual_pde_level() {
// 128-bit dual PDE read
} else {
// Regular 64-bit PDE read
}
This isn't really a version branch -- it's a structural branch that
distinguishes between 64-bit PDE and 128-bit dual PDE entries. Any MMU
version with a dual PDE level would need this same distinction.
I also did code-generation size analysis (see diff of code used below):
Code generation analysis:
Module .ko size: Before: 511,792 bytes After: 524,464 bytes (+2.5%)
.text section: Before: 112,620 bytes After: 116,628 bytes (+4,008 bytes)
The +4K .text growth is the monomorphization cost: every generic function
is compiled twice (once for MmuV2, once for MmuV3).
> If you use a trait here, and make `PtWalk` generic against it, you can
> optimize this away. We had a similar situation when we introduced Turing
> support and the v2 ucode header, and tried both approaches: the
> trait-based one was slightly shorter, and arguably more readable.
Actually I was the one who suggested traits for Falcon ucode descriptor if you
see this thread [1]. So basically you and Eliot are telling me to do what I
suggested in [1]. :-) However, I disagree that it is the right choice for this code.
[1] https://lore.kernel.org/all/20251117231028.GA1095236@joelbox2/
I think the two cases are quite different in complexity:
The falcon ucode descriptor is essentially a set of flat field accessors
and a few params (imem_sec_load_params, dmem_load_params).
The trait has ~10 simple getter methods. There's no multi-level hierarchy,
no walker, and no generic propagation.
The MMU page table case is structurally different. Making PtWalk generic
over an Mmu trait would require:
- PtWalk<M: Mmu> (the walker)
- Plus all the associated types: M::Pte, M::Pde, M::DualPde each
needing their own trait bounds
And we would also need:
- Vmm<M: Mmu> (which creates PtWalk)
- BarUser<M: Mmu> (which creates Vmm)
I am also against making Vmm an enum as Eliot suggested:
enum Vmm {
V2(VmmInner<MmuV2>),
V3(VmmInner<MmuV3>),
}
That moves the version complexity up to the reader. Code complexity IMO should
decrease as we go up abstractions, making it easier for users (Vmm/Bar).
If you look at the the changes in vmm.rs to handle version dispatch there [2]:
Added: +109
Removed: -28
[2]
https://github.com/Edgeworth/linux/commit/3627af550b61256184d589e7ec666c1108971f0e
The main benefit of my approach is version-specific dispatch complexity is
completely isolated inside MmuVersion thus making the code outside of
pagetable.rs much more readable, without having to parametrize anything, and
without code size increase. I think that is worth considering.
> But the main argument to use a trait here IMO is that it enables
> associated types and constants. That's particularly critical since some
> equivalent fields have different lengths between v2 and v3. An
> associated `Bounded` type for these would force the caller to validate
> the length of these fields before calling a non-fallible operation,
> which is exactly the level of caution that we want when dealing with
> page tables.
I think Bounded validation is orthogonal to the dispatch model.
We can add Bounded to the current design without restructuring
into traits. For example:
// In ver2::Pte
pub fn new_vram(pfn: Bounded<Pfn, 25>, writable: bool) -> Self { ... }
// In ver3::Pte
pub fn new_vram(pfn: Bounded<Pfn, 40>, writable: bool) -> Self { ... }
The unified Pte enum wrapper already dispatches to the correct
version-specific constructor, which would enforce the correct Bounded
constraint for that version.
> In order to fully benefit from it, we will need the bitfield macro from
> the `kernel` crate so the PDE/PTE fields can be `Bounded`, I will try to
> make it available quickly in a patch that you can depend on.
That would be great, and I'd be happy to integrate Bounded validation once
the macro is available. I just don't think we need to restructure the
dispatch model in order to benefit from it.
> But long story short, and although I need to dive deeper into the code,
> this looks like a good candidate for using a trait and associated types.
The walker code (walk.rs) is already version-agnostic and reads cleanly.
The version dispatch is encapsulated behind method calls, not exposed as
inline if/else blocks.
Generic propagation (or version-specific dispatch at higher levels) adds more
complexity at higher layers.
Enclosed below [3] is the diff I used for my testing with the data, I don't
really see a net readability win there (IMO, it is a net-loss in readability).
[3]
https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=trait-pt-dispatch&id=5eb0e98af11ba608ff4d0f7a06065ee863f5066a
thanks,
--
Joel Fernandes
^ permalink raw reply
* [PATCH v13 35/36] docs/dyndbg: add classmap info to howto
From: Jim Cromie @ 2026-04-08 20:02 UTC (permalink / raw)
To: linux-kernel; +Cc: gregkh, jbaron, louis.chauvet, Jim Cromie, linux-doc
In-Reply-To: <20260408200211.43821-1-jim.cromie@gmail.com>
Describe the 3 API macros providing dynamic_debug's classmaps
DYNAMIC_DEBUG_CLASSMAP_DEFINE - create & export a classmap
DYNAMIC_DEBUG_CLASSMAP_USE - refer to exported map
DYNAMIC_DEBUG_CLASSMAP_PARAM - bind control param to the classmap
DYNAMIC_DEBUG_CLASSMAP_PARAM_REF + use module's storage - __drm_debug
NB: The _DEFINE & _USE model makes the user dependent on the definer,
just like EXPORT_SYMBOL(__drm_debug) already does.
cc: linux-doc@vger.kernel.org
Reviewed-by: Louis Chauvet <louis.chauvet@bootlin.com>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
---
.../admin-guide/dynamic-debug-howto.rst | 132 ++++++++++++++++--
1 file changed, 122 insertions(+), 10 deletions(-)
diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst b/Documentation/admin-guide/dynamic-debug-howto.rst
index 0a42b9de55ac..734be0b5fe9a 100644
--- a/Documentation/admin-guide/dynamic-debug-howto.rst
+++ b/Documentation/admin-guide/dynamic-debug-howto.rst
@@ -146,6 +146,9 @@ keywords are::
"1-30" is valid range but "1 - 30" is not.
+Keywords
+--------
+
The meanings of each keyword are:
func
@@ -194,16 +197,6 @@ format
format "nfsd: SETATTR" // a neater way to match a format with whitespace
format 'nfsd: SETATTR' // yet another way to match a format with whitespace
-class
- The given class_name is validated against each module, which may
- have declared a list of known class_names. If the class_name is
- found for a module, callsite & class matching and adjustment
- proceeds. Examples::
-
- class DRM_UT_KMS # a DRM.debug category
- class JUNK # silent non-match
- // class TLD_* # NOTICE: no wildcard in class names
-
line
The given line number or range of line numbers is compared
against the line number of each ``pr_debug()`` callsite. A single
@@ -218,6 +211,25 @@ line
line -1605 // the 1605 lines from line 1 to line 1605
line 1600- // all lines from line 1600 to the end of the file
+class
+
+ The given class_name is validated against each module, which may
+ have declared a list of class_names it accepts. If the class_name
+ accepted by a module, callsite & class matching and adjustment
+ proceeds. Examples::
+
+ class DRM_UT_KMS # a drm.debug category
+ class JUNK # silent non-match
+ // class TLD_* # NOTICE: no wildcard in class names
+
+.. note::
+
+ Unlike other keywords, classes are "name-to-change", not
+ "omitting-constraint-allows-change". See Dynamic Debug Classmaps
+
+Flags
+-----
+
The flags specification comprises a change operation followed
by one or more flag characters. The change operation is one
of the characters::
@@ -239,6 +251,11 @@ The flags are::
l Include line number
d Include call trace
+.. note::
+
+ * To query without changing ``+_`` or ``-_``.
+ * To clear all flags ``=_`` or ``-fslmpt``.
+
For ``print_hex_dump_debug()`` and ``print_hex_dump_bytes()``, only
the ``p`` flag has meaning, other flags are ignored.
@@ -395,3 +412,98 @@ just a shortcut for ``print_hex_dump(KERN_DEBUG)``.
For ``print_hex_dump_debug()``/``print_hex_dump_bytes()``, format string is
its ``prefix_str`` argument, if it is constant string; or ``hexdump``
in case ``prefix_str`` is built dynamically.
+
+.. _dyndbg-classmaps:
+
+Dynamic Debug Classmaps
+=======================
+
+The "class" keyword selects prdbgs based on author supplied,
+domain-oriented names. This complements the nested-scope keywords:
+module, file, function, line.
+
+The main difference from the others: classes must be named to be
+changed. This protects them from unintended overwrite::
+
+ # IOW this cannot undo any drm.debug settings
+ :#> ddcmd -p
+
+This protection is needed; /sys/module/drm/parameters/debug is ABI.
+drm.debug is authoritative when dyndbg is not used, dyndbg-under-DRM
+is an implementation detail, and must not behave erratically, just
+because another admin fed >control something unrelated.
+
+So each class must be enabled individually (no wildcards)::
+
+ :#> ddcmd class DRM_UT_CORE +p
+ :#> ddcmd class DRM_UT_KMS +p
+ # or more selectively
+ :#> ddcmd class DRM_UT_CORE module drm +p
+
+That makes direct >control wordy and annoying, but it is a secondary
+interface; it is not intended to replace the ABI, just slide in
+underneath and reimplement the guaranteed behavior. So DRM would keep
+using the convenient way, and be able to trust it::
+
+ :#> echo 0x1ff > /sys/module/drm/parameters/debug
+
+That said, since the sysfs/kparam is the ABI, if the author omits the
+CLASSMAP_PARAM, theres no ABI to guard, and he probably wants a less
+pedantic >control interface. In this case, protection is dropped.
+
+Dynamic Debug Classmap API
+==========================
+
+DYNAMIC_DEBUG_CLASSMAP_DEFINE(clname,type,_base,classnames) - this maps
+classnames (a list of strings) onto class-ids consecutively, starting
+at _base.
+
+DYNAMIC_DEBUG_CLASSMAP_USE(clname) & _USE_(clname,_base) - modules
+call this to refer to the var _DEFINEd elsewhere (and exported).
+
+DYNAMIC_DEBUG_CLASSMAP_PARAM(clname) - creates the sysfs/kparam,
+maps/exposes bits 0..N as class-names.
+
+Classmaps are opt-in: modules invoke _DEFINE or _USE to authorize
+dyndbg to update those named classes. "class FOO" queries are
+validated against the classes defined or used by the module, this
+finds the classid to alter; classes are not directly selectable by
+their classid.
+
+Classnames are global in scope, so subsystems (module-groups) should
+prepend a subsystem name; unqualified names like "CORE" are discouraged.
+
+NB: It is an inherent API limitation (due to class_id's int type) that
+the following are possible:
+
+ // these errors should be caught in review
+ __pr_debug_cls(0, "fake DRM_UT_CORE msg"); // this works
+ __pr_debug_cls(62, "un-known classid msg"); // this compiles, does nothing
+
+There are 2 types of classmaps:
+
+* DD_CLASS_TYPE_DISJOINT_BITS: classes are independent, like drm.debug
+* DD_CLASS_TYPE_LEVEL_NUM: classes are relative, ordered (V3 > V2)
+
+DYNAMIC_DEBUG_CLASSMAP_PARAM - modelled after module_param_cb, it
+refers to a DEFINEd classmap, and associates it to the param's
+data-store. This state is then applied to DEFINEr and USEr modules
+when they're modprobed.
+
+The PARAM interface also enforces the DD_CLASS_TYPE_LEVEL_NUM relation
+amongst the contained classnames; all classes are independent in the
+control parser itself. There is no implied meaning in names like "V4"
+or "PL_ERROR" vs "PL_WARNING".
+
+Modules or subsystems (drm & drivers) can define multiple classmaps,
+as long as they (all the classmaps) share the limited 0..62
+per-module-group _class_id range, without overlap.
+
+If a module encounters a conflict between 2 classmaps it is _USEing or
+_DEFINEing, it can invoke the extended _USE_(name,_base) macro to
+de-conflict the respective ranges.
+
+``#define DEBUG`` will enable all pr_debugs in scope, including any
+class'd ones. This won't be reflected in the PARAM readback value,
+but the class'd pr_debug callsites can be forced off by toggling the
+classmap-kparam all-on then all-off.
--
2.53.0
^ permalink raw reply related
* [PATCH v13 29/36] dyndbg-API: promote DYNAMIC_DEBUG_CLASSMAP_PARAM to API
From: Jim Cromie @ 2026-04-08 20:02 UTC (permalink / raw)
To: linux-kernel; +Cc: gregkh, jbaron, louis.chauvet, Jim Cromie, linux-doc
In-Reply-To: <20260408200211.43821-1-jim.cromie@gmail.com>
move the DYNAMIC_DEBUG_CLASSMAP_PARAM macro from test-dynamic-debug.c into
the header, and refine it, by distinguishing the 2 use cases:
1.DYNAMIC_DEBUG_CLASSMAP_PARAM_REF
for DRM, to pass in extern __drm_debug by name.
dyndbg keeps bits in it, so drm can still use it as before
2.DYNAMIC_DEBUG_CLASSMAP_PARAM
new user (test_dynamic_debug) doesn't need to share state,
decls a static long unsigned int to store the bitvec.
__DYNAMIC_DEBUG_CLASSMAP_PARAM
bottom layer - allocate,init a ddebug-class-param, module-param-cb.
Modify ddebug_sync_classbits() argtype deref inside the fn, to give
access to all kp members.
Also add stub macros, clean up and improve comments in test-code, and
add MODULE_DESCRIPTIONs.
cc: linux-doc@vger.kernel.org
Reviewed-by: Louis Chauvet <louis.chauvet@bootlin.com>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
---
include/linux/dynamic_debug.h | 40 ++++++++++++++++++++++
lib/dynamic_debug.c | 60 ++++++++++++++++++++++-----------
lib/test_dynamic_debug.c | 47 ++++++++++----------------
lib/test_dynamic_debug_submod.c | 9 ++++-
4 files changed, 106 insertions(+), 50 deletions(-)
diff --git a/include/linux/dynamic_debug.h b/include/linux/dynamic_debug.h
index a1c75237abaa..1cae9a2f32d7 100644
--- a/include/linux/dynamic_debug.h
+++ b/include/linux/dynamic_debug.h
@@ -273,6 +273,44 @@ struct _ddebug_class_param {
.offset = _offset \
}
+/**
+ * DYNAMIC_DEBUG_CLASSMAP_PARAM - control a ddebug-classmap from a sys-param
+ * @_name: sysfs node name
+ * @_var: name of the classmap var defining the controlled classes/bits
+ * @_flags: flags to be toggled, typically just 'p'
+ *
+ * Creates a sysfs-param to control the classes defined by the
+ * exported classmap, with bits 0..N-1 mapped to the classes named.
+ * This version keeps class-state in a private long int.
+ */
+#define DYNAMIC_DEBUG_CLASSMAP_PARAM(_name, _var, _flags) \
+ static unsigned long _name##_bvec; \
+ __DYNAMIC_DEBUG_CLASSMAP_PARAM(_name, _name##_bvec, _var, _flags)
+
+/**
+ * DYNAMIC_DEBUG_CLASSMAP_PARAM_REF - wrap a classmap with a controlling sys-param
+ * @_name: sysfs node name
+ * @_bits: name of the module's unsigned long bit-vector, ex: __drm_debug
+ * @_var: name of the (exported) classmap var defining the classes/bits
+ * @_flags: flags to be toggled, typically just 'p'
+ *
+ * Creates a sysfs-param to control the classes defined by the
+ * exported clasmap, with bits 0..N-1 mapped to the classes named.
+ * This version keeps class-state in user @_bits. This lets drm check
+ * __drm_debug elsewhere too.
+ */
+#define DYNAMIC_DEBUG_CLASSMAP_PARAM_REF(_name, _bits, _var, _flags) \
+ __DYNAMIC_DEBUG_CLASSMAP_PARAM(_name, _bits, _var, _flags)
+
+#define __DYNAMIC_DEBUG_CLASSMAP_PARAM(_name, _bits, _var, _flags) \
+ static struct _ddebug_class_param _name##_##_flags = { \
+ .bits = &(_bits), \
+ .flags = #_flags, \
+ .map = &(_var), \
+ }; \
+ module_param_cb(_name, ¶m_ops_dyndbg_classes, \
+ &_name##_##_flags, 0600)
+
extern __printf(2, 3)
void __dynamic_pr_debug(struct _ddebug *descriptor, const char *fmt, ...);
@@ -455,6 +493,8 @@ void __dynamic_ibdev_dbg(struct _ddebug *descriptor,
#define DYNAMIC_DEBUG_CLASSMAP_DEFINE(_var, _mapty, _base, ...)
#define DYNAMIC_DEBUG_CLASSMAP_USE(_var)
+#define DYNAMIC_DEBUG_CLASSMAP_PARAM(_name, _var, _flags)
+#define DYNAMIC_DEBUG_CLASSMAP_PARAM_REF(_name, _var, _flags)
#define DEFINE_DYNAMIC_DEBUG_METADATA(name, fmt)
#define DYNAMIC_DEBUG_BRANCH(descriptor) false
#define DECLARE_DYNDBG_CLASSMAP(...)
diff --git a/lib/dynamic_debug.c b/lib/dynamic_debug.c
index a7f67ecbc4d7..c6f3b0452dfa 100644
--- a/lib/dynamic_debug.c
+++ b/lib/dynamic_debug.c
@@ -687,6 +687,30 @@ static int ddebug_apply_class_bitmap(const struct _ddebug_class_param *dcp,
#define CLASSMAP_BITMASK(width) ((1UL << (width)) - 1)
+static void ddebug_class_param_clamp_input(unsigned long *inrep, const struct kernel_param *kp)
+{
+ const struct _ddebug_class_param *dcp = kp->arg;
+ const struct _ddebug_class_map *map = dcp->map;
+
+ switch (map->map_type) {
+ case DD_CLASS_TYPE_DISJOINT_BITS:
+ /* expect bits. mask and warn if too many */
+ if (*inrep & ~CLASSMAP_BITMASK(map->length)) {
+ pr_warn("%s: input: 0x%lx exceeds mask: 0x%lx, masking\n",
+ KP_NAME(kp), *inrep, CLASSMAP_BITMASK(map->length));
+ *inrep &= CLASSMAP_BITMASK(map->length);
+ }
+ break;
+ case DD_CLASS_TYPE_LEVEL_NUM:
+ /* input is bitpos, of highest verbosity to be enabled */
+ if (*inrep > map->length) {
+ pr_warn("%s: level:%ld exceeds max:%d, clamping\n",
+ KP_NAME(kp), *inrep, map->length);
+ *inrep = map->length;
+ }
+ break;
+ }
+}
static int param_set_dyndbg_module_classes(const char *instr,
const struct kernel_param *kp,
const char *mod_name)
@@ -705,26 +729,15 @@ static int param_set_dyndbg_module_classes(const char *instr,
pr_err("expecting numeric input, not: %s > %s\n", instr, KP_NAME(kp));
return -EINVAL;
}
+ ddebug_class_param_clamp_input(&inrep, kp);
switch (map->map_type) {
case DD_CLASS_TYPE_DISJOINT_BITS:
- /* expect bits. mask and warn if too many */
- if (inrep & ~CLASSMAP_BITMASK(map->length)) {
- pr_warn("%s: input: 0x%lx exceeds mask: 0x%lx, masking\n",
- KP_NAME(kp), inrep, CLASSMAP_BITMASK(map->length));
- inrep &= CLASSMAP_BITMASK(map->length);
- }
v2pr_info("bits:0x%lx > %s.%s\n", inrep, mod_name ?: "*", KP_NAME(kp));
totct += ddebug_apply_class_bitmap(dcp, &inrep, *dcp->bits, mod_name);
*dcp->bits = inrep;
break;
case DD_CLASS_TYPE_LEVEL_NUM:
- /* input is bitpos, of highest verbosity to be enabled */
- if (inrep > map->length) {
- pr_warn("%s: level:%ld exceeds max:%d, clamping\n",
- KP_NAME(kp), inrep, map->length);
- inrep = map->length;
- }
old_bits = CLASSMAP_BITMASK(*dcp->lvl);
new_bits = CLASSMAP_BITMASK(inrep);
v2pr_info("lvl:%ld bits:0x%lx > %s\n", inrep, new_bits, KP_NAME(kp));
@@ -1200,15 +1213,24 @@ static const struct proc_ops proc_fops = {
static void ddebug_sync_classbits(const struct kernel_param *kp, const char *modname)
{
const struct _ddebug_class_param *dcp = kp->arg;
+ unsigned long new_bits;
- /* clamp initial bitvec, mask off hi-bits */
- if (*dcp->bits & ~CLASSMAP_BITMASK(dcp->map->length)) {
- *dcp->bits &= CLASSMAP_BITMASK(dcp->map->length);
- v2pr_info("preset classbits: %lx\n", *dcp->bits);
+ ddebug_class_param_clamp_input(dcp->bits, kp);
+
+ switch (dcp->map->map_type) {
+ case DD_CLASS_TYPE_DISJOINT_BITS:
+ v2pr_info(" %s: classbits: 0x%lx\n", KP_NAME(kp), *dcp->bits);
+ ddebug_apply_class_bitmap(dcp, dcp->bits, 0UL, modname);
+ break;
+ case DD_CLASS_TYPE_LEVEL_NUM:
+ new_bits = CLASSMAP_BITMASK(*dcp->lvl);
+ v2pr_info(" %s: lvl:%ld bits:0x%lx\n", KP_NAME(kp), *dcp->lvl, new_bits);
+ ddebug_apply_class_bitmap(dcp, &new_bits, 0UL, modname);
+ break;
+ default:
+ pr_err("bad map type %d\n", dcp->map->map_type);
+ return;
}
- /* force class'd prdbgs (in USEr module) to match (DEFINEr module) class-param */
- ddebug_apply_class_bitmap(dcp, dcp->bits, ~0, modname);
- ddebug_apply_class_bitmap(dcp, dcp->bits, 0, modname);
}
static void ddebug_match_apply_kparam(const struct kernel_param *kp,
diff --git a/lib/test_dynamic_debug.c b/lib/test_dynamic_debug.c
index 5de839f13c8b..d65fa3f3ef9e 100644
--- a/lib/test_dynamic_debug.c
+++ b/lib/test_dynamic_debug.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0-only
/*
- * Kernel module for testing dynamic_debug
+ * Kernel module to test/demonstrate dynamic_debug features,
+ * particularly classmaps and their support for subsystems like DRM.
*
* Authors:
* Jim Cromie <jim.cromie@gmail.com>
@@ -57,24 +58,6 @@ module_param_cb(do_prints, ¶m_ops_do_prints, NULL, 0600);
#define CLASSMAP_BITMASK(width, base) (((1UL << (width)) - 1) << (base))
-/* sysfs param wrapper, proto-API */
-#define DYNAMIC_DEBUG_CLASSMAP_PARAM_(_model, _flags, _init) \
- static unsigned long bits_##_model = _init; \
- static struct _ddebug_class_param _flags##_##_model = { \
- .bits = &bits_##_model, \
- .flags = #_flags, \
- .map = &map_##_model, \
- }; \
- module_param_cb(_flags##_##_model, ¶m_ops_dyndbg_classes, \
- &_flags##_##_model, 0600)
-#ifdef DEBUG
-#define DYNAMIC_DEBUG_CLASSMAP_PARAM(_model, _flags) \
- DYNAMIC_DEBUG_CLASSMAP_PARAM_(_model, _flags, ~0)
-#else
-#define DYNAMIC_DEBUG_CLASSMAP_PARAM(_model, _flags) \
- DYNAMIC_DEBUG_CLASSMAP_PARAM_(_model, _flags, 0)
-#endif
-
/*
* Demonstrate/test DISJOINT & LEVEL typed classmaps with a sys-param.
*
@@ -105,12 +88,15 @@ enum cat_disjoint_bits {
/* numeric verbosity, V2 > V1 related. V0 is > D2_DRMRES */
enum cat_level_num { V0 = 16, V1, V2, V3, V4, V5, V6, V7 };
-/* recapitulate DRM's multi-classmap setup */
+/*
+ * use/demonstrate multi-module-group classmaps, as for DRM
+ */
#if !defined(TEST_DYNAMIC_DEBUG_SUBMOD)
/*
- * In single user, or parent / coordinator (drm.ko) modules, define
- * classmaps on the client enums above, and then declares the PARAMS
- * ref'g the classmaps. Each is exported.
+ * For module-groups of 1+, define classmaps with names (stringified
+ * enum-symbols) copied from above. 1-to-1 mapping is recommended.
+ * The classmap is exported, so that other modules in the group can
+ * link to it and control their prdbgs.
*/
DYNAMIC_DEBUG_CLASSMAP_DEFINE(map_disjoint_bits, DD_CLASS_TYPE_DISJOINT_BITS,
D2_CORE,
@@ -129,11 +115,13 @@ DYNAMIC_DEBUG_CLASSMAP_DEFINE(map_level_num, DD_CLASS_TYPE_LEVEL_NUM,
V0, "V0", "V1", "V2", "V3", "V4", "V5", "V6", "V7");
/*
- * now add the sysfs-params
+ * for use-cases that want it, provide a sysfs-param to set the
+ * classes in the classmap. It is at this interface where the
+ * "v3>v2" property is applied to DD_CLASS_TYPE_LEVEL_NUM inputs.
*/
-DYNAMIC_DEBUG_CLASSMAP_PARAM(disjoint_bits, p);
-DYNAMIC_DEBUG_CLASSMAP_PARAM(level_num, p);
+DYNAMIC_DEBUG_CLASSMAP_PARAM(p_disjoint_bits, map_disjoint_bits, p);
+DYNAMIC_DEBUG_CLASSMAP_PARAM(p_level_num, map_level_num, p);
#ifdef FORCE_CLASSID_CONFLICT
/*
@@ -144,12 +132,10 @@ DYNAMIC_DEBUG_CLASSMAP_DEFINE(classid_range_conflict, 0, D2_CORE + 1, "D3_CORE")
#endif
#else /* TEST_DYNAMIC_DEBUG_SUBMOD */
-
/*
- * in submod/drm-drivers, use the classmaps defined in top/parent
- * module above.
+ * the +1 members of a multi-module group refer to the classmap
+ * DEFINEd (and exported) above.
*/
-
DYNAMIC_DEBUG_CLASSMAP_USE(map_disjoint_bits);
DYNAMIC_DEBUG_CLASSMAP_USE(map_level_num);
@@ -228,6 +214,7 @@ static void __exit test_dynamic_debug_exit(void)
module_init(test_dynamic_debug_init);
module_exit(test_dynamic_debug_exit);
+MODULE_DESCRIPTION("test/demonstrate dynamic-debug features");
MODULE_AUTHOR("Jim Cromie <jim.cromie@gmail.com>");
MODULE_DESCRIPTION("Kernel module for testing dynamic_debug");
MODULE_LICENSE("GPL");
diff --git a/lib/test_dynamic_debug_submod.c b/lib/test_dynamic_debug_submod.c
index 672aabf40160..3adf3925fb86 100644
--- a/lib/test_dynamic_debug_submod.c
+++ b/lib/test_dynamic_debug_submod.c
@@ -1,6 +1,9 @@
// SPDX-License-Identifier: GPL-2.0
/*
- * Kernel module for testing dynamic_debug
+ * Kernel module to test/demonstrate dynamic_debug features,
+ * particularly classmaps and their support for subsystems, like DRM,
+ * which defines its drm_debug classmap in drm module, and uses it in
+ * helpers & drivers.
*
* Authors:
* Jim Cromie <jim.cromie@gmail.com>
@@ -12,3 +15,7 @@
*/
#define TEST_DYNAMIC_DEBUG_SUBMOD
#include "test_dynamic_debug.c"
+
+MODULE_DESCRIPTION("test/demonstrate dynamic-debug subsystem support");
+MODULE_AUTHOR("Jim Cromie <jim.cromie@gmail.com>");
+MODULE_LICENSE("GPL");
--
2.53.0
^ permalink raw reply related
* [PATCH v13 25/36] dyndbg-API: replace DECLARE_DYNDBG_CLASSMAP
From: Jim Cromie @ 2026-04-08 20:02 UTC (permalink / raw)
To: linux-kernel; +Cc: gregkh, jbaron, louis.chauvet, Jim Cromie, linux-doc
In-Reply-To: <20260408200211.43821-1-jim.cromie@gmail.com>
commit aad0214f3026 ("dyndbg: add DECLARE_DYNDBG_CLASSMAP macro")
DECLARE_DYNDBG_CLASSMAP() has a design error; its usage fails a
basic K&R rule: "define once, refer many times".
When CONFIG_DRM_USE_DYNAMIC_DEBUG=y, it is used across DRM core &
drivers; each invocation allocates/inits the classmap understood by
that module. They *all* must match for the DRM modules to respond
consistently when drm.debug categories are enabled. This is at least
a maintenance hassle.
Worse, its the root-cause of the CONFIG_DRM_USE_DYNAMIC_DEBUG=Y
regression; its use in both core & drivers obfuscates the 2 roles,
muddling the design, yielding an incomplete initialization when
modprobing drivers:
1st drm.ko loads, and dyndbg initializes its drm.debug callsites, then
a drm-driver loads, but too late for the drm.debug enablement.
And that led to:
commit bb2ff6c27bc9 ("drm: Disable dynamic debug as broken")
So retire it, replace with 2 macros:
DYNAMIC_DEBUG_CLASSMAP_DEFINE - invoked once from core - drm.ko
DYNAMIC_DEBUG_CLASSMAP_USE* - from all drm drivers and helpers.
NB: name-space de-noise
DYNAMIC_DEBUG_CLASSMAP_DEFINE: this reworks DECLARE_DYNDBG_CLASSMAP,
basically by dropping the static qualifier on the classmap, and
exporting it instead.
DYNAMIC_DEBUG_CLASSMAP_USE: then refers to the exported var by name:
used from drivers, helper-mods
lets us drop the repetitive "classname" declarations
fixes 2nd-defn problem
creates a ddebug_class_user record in new __dyndbg_class_users section
new section is scanned "differently"
DECLARE_DYNDBG_CLASSMAP is preserved temporarily, to decouple DRM
adaptation work and avoid compile-errs before its done.
The DEFINE,USE distinction, and the separate classmap-use record,
allows dyndbg to initialize the driver's & helper's drm.debug
callsites separately after each is modprobed.
Basically, the classmap initial scan is repeated for classmap-users.
dyndbg's existing __dyndbg_classes[] section does:
. catalogs the module's classmaps
. tells dyndbg about them, allowing >control
. DYNAMIC_DEBUG_CLASSMAP_DEFINE creates section records.
. we rename it to: __dyndbg_class_maps[]
this patch adds __dyndbg_class_users[] section:
. catalogs users of classmap definitions from elsewhere
. authorizes dyndbg to >control user's class'd prdbgs
. DYNAMIC_DEBUG_CLASSMAP_USE() creates section records.
Now ddebug_add_module(etal) can handle classmap-uses similar to (and
after) classmaps; when a dependent module is loaded, if it has
classmap-uses (to a classmap-def in another module), that module's
kernel params are scanned to find if it has a kparam that is wired to
dyndbg's param-ops, and whose classmap is the one being ref'd.
To support this, theres a few data/header changes:
new struct ddebug_class_user
contains: user-module-name, &classmap-defn
it records drm-driver's use of a classmap in the section, allowing lookup
struct ddebug_info gets 2 new fields for the new sections:
class_users, num_class_users.
set by dynamic_debug_init() for builtins.
or by kernel/module/main:load_info() for loadable modules.
vmlinux.lds.h: Add a new BOUNDED_SECTION for __dyndbg_class_users.
this creates start,stop C symbol-names for the section.
TLDR ?
dynamic_debug.c: 2 changes from ddebug_add_module() & ddebug_change():
ddebug_add_module():
ddebug_attach_module_classes() is reworked/renamed/split into
debug_apply_class_maps(), ddebug_apply_class_users(), which both call
ddebug_apply_params().
ddebug_apply_params(new fn):
It scans module's/builtin kernel-params, calls ddebug_match_apply_kparam
for each to find any params/sysfs-nodes which may be wired to a classmap.
ddebug_match_apply_kparam(new fn):
1st, it tests the kernel-param.ops is dyndbg's; this guarantees that
the attached arg is a struct ddebug_class_param, which has a ref to
the param's state, and to the classmap defining the param's handling.
2nd, it requires that the classmap ref'd by the kparam is the one
we've been called for; modules can use many separate classmaps (as
test_dynamic_debug does).
Then apply the "parent" kparam's setting to the dependent module,
using ddebug_apply_class_bitmap().
ddebug_change(and callees) also gets adjustments:
ddebug_find_valid_class(): This does a search over the module's
classmaps, looking for the class FOO echo'd to >control. So now it
searches over __dyndbg_class_users[] after __dyndbg_classes[].
ddebug_class_name(): return class-names for defined OR used classes.
test_dynamic_debug.c, test_dynamic_debug_submod.c:
This demonstrates the 2 types of classmaps & sysfs-params, following
the 4-part recipe:
0. define an enum for the classmap's class_ids
drm.debug gives us DRM_UT_<*> (aka <T>)
multiple classmaps in a module(s) must share 0-62 classid space.
1. DYNAMIC_DEBUG_CLASSMAP_DEFINE(classmap_name, .. "<T>")
names the classes, maps them to consecutive class-ids.
convention here is stringified ENUM_SYMBOLS
these become API/ABI if 2 is done.
2. DYNAMIC_DEBUG_CLASSMAP_PARAM* (classmap_name)
adds a controlling kparam to the class
3. DYNAMIC_DEBUG_CLASSMAP_USE(classmap_name)
for subsystem/group/drivers to use extern created by 1.
Move all the enum declarations together, to better explain how they
share the 0..62 class-id space available to a module (non-overlapping
subranges).
reorg macros 2,3 by name. This gives a tabular format, making it easy
to see the pattern of repetition, and the points of change.
And extend the test to replicate the 2-module (parent & dependent)
scenario which caused the CONFIG_DRM_USE_DYNAMIC_DEBUG=y regression
seen in drm & drivers.
The _submod.c is a 2-line file: #define _SUBMOD, #include parent.
This gives identical complements of prdbgs in parent & _submod, and
thus identical print behavior when all of: >control, >params, and
parent->_submod propagation are working correctly.
It also puts all the parent/_submod declarations together in the same
source; the new ifdef _SUBMOD block invokes DYNAMIC_DEBUG_CLASSMAP_USE
for the 2 test-interfaces. I think this is clearer.
These 2 modules are both tristate, allowing 3 super/sub combos: Y/Y,
Y/M, M/M (not N/Y, since this is disallowed by dependence).
Y/Y, Y/M testing once exposed a missing __align(8) in the _METADATA
macro, which M/M didn't see, probably because the module-loader memory
placement constrained it from misalignment.
Fixes: aad0214f3026 ("dyndbg: add DECLARE_DYNDBG_CLASSMAP macro")
cc: linux-doc@vger.kernel.org
Reviewed-by: Louis Chauvet <louis.chauvet@bootlin.com>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
---
-v?
replace di with &dt->info, since di becomes stale
fix dd_mark_vector_subrange macro param ordering to match kdoc
s/base/offset/ in _ddebug_class_user, to reduce later churn
-v12 - squash in _USE_ and refinements.
A: dyndbg: add DYNAMIC_DEBUG_CLASSMAP_USE_(dd_class_name, offset)
Allow a module to use 2 classmaps together that would otherwise have a
class_id range conflict.
Suppose a drm-driver does:
DYNAMIC_DEBUG_CLASSMAP_USE(drm_debug_classes);
DYNAMIC_DEBUG_CLASSMAP_USE(drm_accel_xfer_debug);
If (for some reason) drm-accel cannot define their constants to avoid
DRM's drm_debug_category 0..10 reservations, we would have a conflict
with reserved-ids.
In this case a driver needing to use both would _USE_ one of them with
an offset to avoid the conflict. This will handle most forseeable
cases; perhaps a 3-X-3 of classmap-defns X classmap-users would get
too awkward and fiddly.
B: dyndbg: refine DYNAMIC_DEBUG_CLASSMAP_USE_ macro
The struct _ddebug_class_user _varname construct is needlessly
permissive; it has a static qualifier, and a unique name. Together,
these allow a module to have 2 or more _USE(foo)s, which is contrary
to its purpose, and therefore potentially confusing.
So drop the unique name, and the static qualifier, and replace it with
an extern pre-declaration. Construct the name by pasting together the
_var (which is the name of the exported ddebug_class_map), and
__KBUILD_MODNAME (which is the user module name). This allows only a
single USE() reference to the exported record, which is all that is
required.
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
---
MAINTAINERS | 2 +-
include/asm-generic/dyndbg.lds.h | 7 +-
include/linux/dynamic_debug.h | 159 +++++++++++++++++++++++---
kernel/module/main.c | 3 +
lib/Kconfig.debug | 24 +++-
lib/Makefile | 5 +
lib/dynamic_debug.c | 185 ++++++++++++++++++++++++-------
lib/test_dynamic_debug.c | 132 ++++++++++++++++------
lib/test_dynamic_debug_submod.c | 14 +++
9 files changed, 433 insertions(+), 98 deletions(-)
create mode 100644 lib/test_dynamic_debug_submod.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 0f4c2f182d63..31c945228fab 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9034,7 +9034,7 @@ M: Jim Cromie <jim.cromie@gmail.com>
S: Maintained
F: include/linux/dynamic_debug.h
F: lib/dynamic_debug.c
-F: lib/test_dynamic_debug.c
+F: lib/test_dynamic_debug*.c
F: tools/testing/selftests/dynamic_debug/*
DYNAMIC INTERRUPT MODERATION
diff --git a/include/asm-generic/dyndbg.lds.h b/include/asm-generic/dyndbg.lds.h
index 8345ac6c52b7..6e38d0f1d00b 100644
--- a/include/asm-generic/dyndbg.lds.h
+++ b/include/asm-generic/dyndbg.lds.h
@@ -6,7 +6,8 @@
#define DYNDBG_SECTIONS() \
. = ALIGN(8); \
BOUNDED_SECTION_BY(__dyndbg_descriptors, ___dyndbg_descs) \
- BOUNDED_SECTION_BY(__dyndbg_class_maps, ___dyndbg_class_maps)
+ BOUNDED_SECTION_BY(__dyndbg_class_maps, ___dyndbg_class_maps) \
+ BOUNDED_SECTION_BY(__dyndbg_class_users, ___dyndbg_class_users)
#define MOD_DYNDBG_SECTIONS() \
__dyndbg_descriptors : { \
@@ -16,6 +17,10 @@
__dyndbg_class_maps : { \
BOUNDED_SECTION_BY(__dyndbg_class_maps, \
___dyndbg_class_maps) \
+ } \
+ __dyndbg_class_users : { \
+ BOUNDED_SECTION_BY(__dyndbg_class_users, \
+ ___dyndbg_class_users) \
}
#endif /* __ASM_GENERIC_DYNDBG_LDS_H */
diff --git a/include/linux/dynamic_debug.h b/include/linux/dynamic_debug.h
index 33291abd8971..71c91bc8d3a6 100644
--- a/include/linux/dynamic_debug.h
+++ b/include/linux/dynamic_debug.h
@@ -72,19 +72,30 @@ enum ddebug_class_map_type {
*/
};
+/*
+ * map @class_names 0..N to consecutive constants starting at @base.
+ */
struct _ddebug_class_map {
- struct module *mod; /* NULL for builtins */
- const char *mod_name; /* needed for builtins */
+ const struct module *mod; /* NULL for builtins */
+ const char *mod_name; /* needed for builtins */
const char **class_names;
const int length;
const int base; /* index of 1st .class_id, allows split/shared space */
enum ddebug_class_map_type map_type;
};
+struct _ddebug_class_user {
+ char *mod_name;
+ struct _ddebug_class_map *map;
+ const int offset; /* offset from map->base */
+};
+
/*
- * @_ddebug_info: gathers module/builtin dyndbg_* __sections together.
+ * @_ddebug_info: gathers module/builtin __dyndbg_<T> __sections
+ * together, each is a vec_<T>: a struct { struct T start[], int len }.
+ *
* For builtins, it is used as a cursor, with the inner structs
- * marking sub-vectors of the builtin __sections in DATA.
+ * marking sub-vectors of the builtin __sections in DATA_DATA
*/
struct _ddebug_descs {
struct _ddebug *start;
@@ -96,10 +107,16 @@ struct _ddebug_class_maps {
int len;
};
+struct _ddebug_class_users {
+ struct _ddebug_class_user *start;
+ int len;
+};
+
struct _ddebug_info {
const char *mod_name;
struct _ddebug_descs descs;
struct _ddebug_class_maps maps;
+ struct _ddebug_class_users users;
};
struct _ddebug_class_param {
@@ -118,12 +135,81 @@ struct _ddebug_class_param {
#if defined(CONFIG_DYNAMIC_DEBUG) || \
(defined(CONFIG_DYNAMIC_DEBUG_CORE) && defined(DYNAMIC_DEBUG_MODULE))
+/*
+ * dyndbg classmaps is modelled closely upon drm.debug:
+ *
+ * 1. run-time control via sysfs node (api/abi)
+ * 2. each bit 0..N controls a single "category"
+ * 3. a pr_debug can have only 1 category, not several.
+ * 4. "kind" is a compile-time constant: 0..N or BIT() thereof
+ * 5. macro impls - give compile-time resolution or fail.
+ *
+ * dyndbg classmaps design axioms/constraints:
+ *
+ * . optimizing compilers use 1-5 above, so preserve them.
+ * . classmaps.class_id *is* the category.
+ * . classmap definers/users are modules.
+ * . every user wants 0..N
+ * . 0..N exposes as ABI
+ * . no 1 use-case wants N > 32, 16 is more usable
+ * . N <= 64 in *all* cases
+ * . modules/subsystems make category/classmap decisions
+ * . ie an enum: DRM has DRM_UT_CORE..DRM_UT_DRMRES
+ * . some categories are exposed to user: ABI
+ * . making modules change their numbering is bogus, avoid if possible
+ *
+ * We can solve for all these at once:
+ * A: map class-names to a .class_id range at compile-time
+ * B: allow only "class NAME" changes to class'd callsites at run-time
+ * C: users/modules must manage 0..62 hardcoded .class_id range limit.
+ * D: existing pr_debugs get CLASS_DFLT=63
+ *
+ * By mapping class-names at >control to class-ids underneath, and
+ * responding only to class-names DEFINEd or USEd by the module, we
+ * can private-ize the class-id, and adjust class'd pr_debugs only by
+ * their names.
+ *
+ * This give us:
+ * E: class_ids without classnames are unreachable
+ * F: user modules opt-in by DEFINEing a classmap and/or USEing another
+ *
+ * Multi-classmap modules/groups are supported, if the classmaps share
+ * the class_id space [0..62] without overlap/conflict.
+ *
+ * NOTE: Due to the integer class_id, this api cannot disallow these:
+ * __pr_debug_cls(0, "fake CORE msg"); works only if a classmap maps 0.
+ * __pr_debug_cls(22, "no such class"); compiles but is not reachable
+ */
+
/**
- * DECLARE_DYNDBG_CLASSMAP - declare classnames known by a module
- * @_var: a struct ddebug_class_map, passed to module_param_cb
- * @_type: enum class_map_type, chooses bits/verbose, numeric/symbolic
- * @_base: offset of 1st class-name. splits .class_id space
- * @classes: class-names used to control class'd prdbgs
+ * DYNAMIC_DEBUG_CLASSMAP_DEFINE - define debug classes used by a module.
+ * @_var: name of the classmap, exported for other modules coordinated use.
+ * @_mapty: enum ddebug_class_map_type: 0:DISJOINT - independent, 1:LEVEL - v2>v1
+ * @_base: reserve N classids starting at _base, to split 0..62 classid space
+ * @classes: names of the N classes.
+ *
+ * This tells dyndbg what class_ids the module is using: _base..+N, by
+ * mapping names onto them. This qualifies "class NAME" >controls on
+ * the defining module, ignoring unknown names.
+ */
+#define DYNAMIC_DEBUG_CLASSMAP_DEFINE(_var, _mapty, _base, ...) \
+ static const char *_var##_classnames[] = { __VA_ARGS__ }; \
+ extern struct _ddebug_class_map _var; \
+ struct _ddebug_class_map __aligned(8) __used \
+ __section("__dyndbg_class_maps") _var = { \
+ .mod = THIS_MODULE, \
+ .mod_name = KBUILD_MODNAME, \
+ .base = (_base), \
+ .map_type = (_mapty), \
+ .length = ARRAY_SIZE(_var##_classnames), \
+ .class_names = _var##_classnames, \
+ }; \
+ EXPORT_SYMBOL(_var)
+
+/*
+ * XXX: keep this until DRM adapts to use the DEFINE/USE api, it
+ * differs from DYNAMIC_DEBUG_CLASSMAP_DEFINE by the lack of the
+ * extern/EXPORT on the struct init, and cascading thinkos.
*/
#define DECLARE_DYNDBG_CLASSMAP(_var, _maptype, _base, ...) \
static const char *_var##_classnames[] = { __VA_ARGS__ }; \
@@ -137,6 +223,44 @@ struct _ddebug_class_param {
.class_names = _var##_classnames, \
}
+/**
+ * DYNAMIC_DEBUG_CLASSMAP_USE - refer to a classmap, DEFINEd elsewhere.
+ * @_var: name of the exported classmap var
+ *
+ * This tells dyndbg that the module has prdbgs with classids defined
+ * in the named classmap. This qualifies "class NAME" >controls on
+ * the user module, and ignores unknown names. This is a wrapper for
+ * DYNAMIC_DEBUG_CLASSMAP_USE_() with a base offset of 0.
+ */
+#define DYNAMIC_DEBUG_CLASSMAP_USE(_var) \
+ DYNAMIC_DEBUG_CLASSMAP_USE_(_var, 0)
+
+/**
+ * DYNAMIC_DEBUG_CLASSMAP_USE_ - refer to a classmap with a manual offset.
+ * @_var: name of the exported classmap var to use.
+ * @_offset: an integer offset to add to the class IDs of the used map.
+ *
+ * This is an extended version of DYNAMIC_DEBUG_CLASSMAP_USE(). It should
+ * only be used to resolve class ID conflicts when a module uses multiple
+ * classmaps that have overlapping ID ranges.
+ *
+ * The final class IDs for the used map will be calculated as:
+ * original_map_base + class_index + @_offset.
+ */
+#define DYNAMIC_DEBUG_CLASSMAP_USE_(_var, _offset) \
+ extern struct _ddebug_class_map _var; \
+ static_assert((_offset) >= 0 && (_offset) < _DPRINTK_CLASS_DFLT, \
+ "classmap use offset must be in 0..62"); \
+ extern struct _ddebug_class_user __aligned(8) \
+ __PASTE(_var ## _, __KBUILD_MODNAME); \
+ struct _ddebug_class_user __aligned(8) __used \
+ __section("__dyndbg_class_users") \
+ __PASTE(_var ## _, __KBUILD_MODNAME) = { \
+ .mod_name = KBUILD_MODNAME, \
+ .map = &(_var), \
+ .offset = _offset \
+ }
+
extern __printf(2, 3)
void __dynamic_pr_debug(struct _ddebug *descriptor, const char *fmt, ...);
@@ -298,12 +422,18 @@ void __dynamic_ibdev_dbg(struct _ddebug *descriptor,
KERN_DEBUG, prefix_str, prefix_type, \
rowsize, groupsize, buf, len, ascii)
-/* for test only, generally expect drm.debug style macro wrappers */
-#define __pr_debug_cls(cls, fmt, ...) do { \
+/*
+ * This is the "model" class variant of pr_debug. It is not really
+ * intended for direct use; I'd encourage DRM-style drm_dbg_<T>
+ * macros for the interface, along with an enum for the <T>
+ *
+ * __printf(2, 3) would apply.
+ */
+#define __pr_debug_cls(cls, fmt, ...) ({ \
BUILD_BUG_ON_MSG(!__builtin_constant_p(cls), \
"expecting constant class int/enum"); \
dynamic_pr_debug_cls(cls, fmt, ##__VA_ARGS__); \
- } while (0)
+})
#else /* !(CONFIG_DYNAMIC_DEBUG || (CONFIG_DYNAMIC_DEBUG_CORE && DYNAMIC_DEBUG_MODULE)) */
@@ -311,6 +441,8 @@ void __dynamic_ibdev_dbg(struct _ddebug *descriptor,
#include <linux/errno.h>
#include <linux/printk.h>
+#define DYNAMIC_DEBUG_CLASSMAP_DEFINE(_var, _mapty, _base, ...)
+#define DYNAMIC_DEBUG_CLASSMAP_USE(_var)
#define DEFINE_DYNAMIC_DEBUG_METADATA(name, fmt)
#define DYNAMIC_DEBUG_BRANCH(descriptor) false
#define DECLARE_DYNDBG_CLASSMAP(...)
@@ -357,8 +489,7 @@ static inline int param_set_dyndbg_classes(const char *instr, const struct kerne
static inline int param_get_dyndbg_classes(char *buffer, const struct kernel_param *kp)
{ return 0; }
-#endif
-
+#endif /* !CONFIG_DYNAMIC_DEBUG_CORE */
extern const struct kernel_param_ops param_ops_dyndbg_classes;
diff --git a/kernel/module/main.c b/kernel/module/main.c
index a0fe6c7aab75..8a3fc98d8a4c 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2723,6 +2723,9 @@ static int find_module_sections(struct module *mod, struct load_info *info)
mod->dyndbg_info.maps.start = section_objs(info, "__dyndbg_class_maps",
sizeof(*mod->dyndbg_info.maps.start),
&mod->dyndbg_info.maps.len);
+ mod->dyndbg_info.users.start = section_objs(info, "__dyndbg_class_users",
+ sizeof(*mod->dyndbg_info.users.start),
+ &mod->dyndbg_info.users.len);
#endif
return 0;
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 93f356d2b3d9..302bb2656682 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -3106,12 +3106,26 @@ config TEST_STATIC_KEYS
If unsure, say N.
config TEST_DYNAMIC_DEBUG
- tristate "Test DYNAMIC_DEBUG"
- depends on DYNAMIC_DEBUG
+ tristate "Build test-dynamic-debug module"
+ depends on DYNAMIC_DEBUG || DYNAMIC_DEBUG_CORE
help
- This module registers a tracer callback to count enabled
- pr_debugs in a 'do_debugging' function, then alters their
- enablements, calls the function, and compares counts.
+ This module exercises/demonstrates dyndbg's classmap API, by
+ creating 2 classes: a DISJOINT classmap (supporting DRM.debug)
+ and a LEVELS/VERBOSE classmap (like verbose2 > verbose1).
+
+ If unsure, say N.
+
+config TEST_DYNAMIC_DEBUG_SUBMOD
+ tristate "Build test-dynamic-debug submodule"
+ default m
+ depends on DYNAMIC_DEBUG || DYNAMIC_DEBUG_CORE
+ depends on TEST_DYNAMIC_DEBUG
+ help
+ This sub-module uses a classmap defined and exported by the
+ parent module, recapitulating drm & driver's shared use of
+ drm.debug to control enabled debug-categories.
+ It is tristate, independent of parent, to allow testing all
+ proper combinations of parent=y/m submod=y/m.
If unsure, say N.
diff --git a/lib/Makefile b/lib/Makefile
index 1b9ee167517f..19ab40903436 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -83,6 +83,9 @@ obj-$(CONFIG_TEST_RHASHTABLE) += test_rhashtable.o
obj-$(CONFIG_TEST_STATIC_KEYS) += test_static_keys.o
obj-$(CONFIG_TEST_STATIC_KEYS) += test_static_key_base.o
obj-$(CONFIG_TEST_DYNAMIC_DEBUG) += test_dynamic_debug.o
+obj-$(CONFIG_TEST_DYNAMIC_DEBUG_SUBMOD) += test_dynamic_debug_submod.o
+obj-$(CONFIG_TEST_PRINTF) += test_printf.o
+obj-$(CONFIG_TEST_SCANF) += test_scanf.o
obj-$(CONFIG_TEST_BITMAP) += test_bitmap.o
ifeq ($(CONFIG_CC_IS_CLANG)$(CONFIG_KASAN),yy)
@@ -206,6 +209,8 @@ obj-$(CONFIG_ARCH_NEED_CMPXCHG_1_EMU) += cmpxchg-emu.o
obj-$(CONFIG_DYNAMIC_DEBUG_CORE) += dynamic_debug.o
#ensure exported functions have prototypes
CFLAGS_dynamic_debug.o := -DDYNAMIC_DEBUG_MODULE
+CFLAGS_test_dynamic_debug.o := -DDYNAMIC_DEBUG_MODULE
+CFLAGS_test_dynamic_debug_submod.o := -DDYNAMIC_DEBUG_MODULE
obj-$(CONFIG_SYMBOLIC_ERRNAME) += errname.o
diff --git a/lib/dynamic_debug.c b/lib/dynamic_debug.c
index b8983e095e60..ce512efaeffd 100644
--- a/lib/dynamic_debug.c
+++ b/lib/dynamic_debug.c
@@ -29,6 +29,7 @@
#include <linux/string_helpers.h>
#include <linux/uaccess.h>
#include <linux/dynamic_debug.h>
+
#include <linux/debugfs.h>
#include <linux/slab.h>
#include <linux/jump_label.h>
@@ -43,6 +44,8 @@ extern struct _ddebug __start___dyndbg_descs[];
extern struct _ddebug __stop___dyndbg_descs[];
extern struct _ddebug_class_map __start___dyndbg_class_maps[];
extern struct _ddebug_class_map __stop___dyndbg_class_maps[];
+extern struct _ddebug_class_user __start___dyndbg_class_users[];
+extern struct _ddebug_class_user __stop___dyndbg_class_users[];
struct ddebug_table {
struct list_head link;
@@ -168,20 +171,37 @@ static void vpr_info_dq(const struct ddebug_query *query, const char *msg)
query->first_lineno, query->last_lineno, query->class_string);
}
-static struct _ddebug_class_map *ddebug_find_valid_class(struct ddebug_table const *dt,
- const char *class_string,
- int *class_id)
+#define vpr_di_info(di_p, msg_p, ...) ({ \
+ struct _ddebug_info const *_di = di_p; \
+ v2pr_info(msg_p "module:%s nd:%d nc:%d nu:%d\n", ##__VA_ARGS__, \
+ _di->mod_name, _di->descs.len, _di->maps.len, \
+ _di->users.len); \
+ })
+
+static struct _ddebug_class_map *
+ddebug_find_valid_class(struct _ddebug_info const *di, const char *query_class, int *class_id)
{
struct _ddebug_class_map *map;
+ struct _ddebug_class_user *cli;
int i, idx;
- for_subvec(i, map, &dt->info, maps) {
- idx = match_string(map->class_names, map->length, class_string);
+ for_subvec(i, map, di, maps) {
+ idx = match_string(map->class_names, map->length, query_class);
if (idx >= 0) {
+ vpr_di_info(di, "good-class: %s.%s ", map->mod_name, query_class);
*class_id = idx + map->base;
return map;
}
}
+ for_subvec(i, cli, di, users) {
+ idx = match_string(cli->map->class_names, cli->map->length, query_class);
+ if (idx >= 0) {
+ vpr_di_info(di, "class-ref: %s -> %s.%s ",
+ cli->mod_name, cli->map->mod_name, query_class);
+ *class_id = idx + cli->map->base - cli->offset;
+ return cli->map;
+ }
+ }
*class_id = -ENOENT;
return NULL;
}
@@ -238,8 +258,7 @@ static bool ddebug_match_desc(const struct ddebug_query *query,
return true;
}
-static int ddebug_change(const struct ddebug_query *query,
- struct flag_settings *modifiers)
+static int ddebug_change(const struct ddebug_query *query, struct flag_settings *modifiers)
{
int i;
struct ddebug_table *dt;
@@ -260,7 +279,8 @@ static int ddebug_change(const struct ddebug_query *query,
continue;
if (query->class_string) {
- map = ddebug_find_valid_class(dt, query->class_string, &valid_class);
+ map = ddebug_find_valid_class(&dt->info, query->class_string,
+ &valid_class);
if (!map)
continue;
} else {
@@ -590,7 +610,7 @@ static int ddebug_exec_query(char *query_string, const char *modname)
/* handle multiple queries in query string, continue on error, return
last error or number of matching callsites. Module name is either
- in param (for boot arg) or perhaps in query string.
+ in the modname arg (for boot args) or perhaps in query string.
*/
static int ddebug_exec_queries(char *query, const char *modname)
{
@@ -721,7 +741,7 @@ static int param_set_dyndbg_module_classes(const char *instr,
/**
* param_set_dyndbg_classes - classmap kparam setter
* @instr: string echo>d to sysfs, input depends on map_type
- * @kp: kp->arg has state: bits/lvl, map, map_type
+ * @kp: kp->arg has state: bits/lvl, classmap, map_type
*
* enable/disable all class'd pr_debugs in the classmap. For LEVEL
* map-types, enforce * relative levels by bitpos.
@@ -758,6 +778,7 @@ int param_get_dyndbg_classes(char *buffer, const struct kernel_param *kp)
default:
return -1;
}
+ return 0;
}
EXPORT_SYMBOL(param_get_dyndbg_classes);
@@ -1073,12 +1094,17 @@ static bool ddebug_class_in_range(const int class_id, const struct _ddebug_class
static const char *ddebug_class_name(struct _ddebug_info *di, struct _ddebug *dp)
{
struct _ddebug_class_map *map;
+ struct _ddebug_class_user *cli;
int i;
for_subvec(i, map, di, maps)
if (ddebug_class_in_range(dp->class_id, map))
return map->class_names[dp->class_id - map->base];
+ for_subvec(i, cli, di, users)
+ if (ddebug_class_in_range(dp->class_id, cli->map))
+ return cli->map->class_names[dp->class_id - cli->map->base - cli->offset];
+
return NULL;
}
@@ -1159,32 +1185,87 @@ static const struct proc_ops proc_fops = {
.proc_write = ddebug_proc_write
};
-static void ddebug_attach_module_classes(struct ddebug_table *dt, struct _ddebug_info *di)
+#define vpr_cm_info(cm_p, msg_fmt, ...) ({ \
+ struct _ddebug_class_map const *_cm = cm_p; \
+ v2pr_info(msg_fmt "%s [%d..%d] %s..%s\n", ##__VA_ARGS__, \
+ _cm->mod_name, _cm->base, _cm->base + _cm->length, \
+ _cm->class_names[0], _cm->class_names[_cm->length - 1]); \
+ })
+
+static void ddebug_sync_classbits(const struct kernel_param *kp, const char *modname)
{
- struct _ddebug_class_map *cm;
- int i, nc = 0;
+ const struct _ddebug_class_param *dcp = kp->arg;
- /*
- * Find this module's classmaps in a subrange/wholerange of
- * the builtin/modular classmap vector/section. Save the start
- * and length of the subrange at its edges.
- */
- for_subvec(i, cm, di, maps) {
- if (!strcmp(cm->mod_name, dt->info.mod_name)) {
- if (!nc) {
- v2pr_info("start subrange, class[%d]: module:%s base:%d len:%d ty:%d\n",
- i, cm->mod_name, cm->base, cm->length, cm->map_type);
- dt->info.maps.start = cm;
- }
- nc++;
- }
+ /* clamp initial bitvec, mask off hi-bits */
+ if (*dcp->bits & ~CLASSMAP_BITMASK(dcp->map->length)) {
+ *dcp->bits &= CLASSMAP_BITMASK(dcp->map->length);
+ v2pr_info("preset classbits: %lx\n", *dcp->bits);
+ }
+ /* force class'd prdbgs (in USEr module) to match (DEFINEr module) class-param */
+ ddebug_apply_class_bitmap(dcp, dcp->bits, ~0, modname);
+ ddebug_apply_class_bitmap(dcp, dcp->bits, 0, modname);
+}
+
+static void ddebug_match_apply_kparam(const struct kernel_param *kp,
+ const struct _ddebug_class_map *map,
+ const char *mod_name)
+{
+ struct _ddebug_class_param *dcp;
+
+ if (kp->ops != ¶m_ops_dyndbg_classes)
+ return;
+
+ dcp = (struct _ddebug_class_param *)kp->arg;
+
+ if (map == dcp->map) {
+ v2pr_info(" kp:%s.%s =0x%lx", mod_name, kp->name, *dcp->bits);
+ vpr_cm_info(map, " %s maps ", mod_name);
+ ddebug_sync_classbits(kp, mod_name);
+ }
+}
+
+static void ddebug_apply_params(const struct _ddebug_class_map *cm, const char *mod_name)
+{
+ const struct kernel_param *kp;
+#if IS_ENABLED(CONFIG_MODULES)
+ int i;
+
+ if (cm->mod) {
+ vpr_cm_info(cm, "loaded classmap: %s ", mod_name);
+ /* ifdef protects the cm->mod->kp deref */
+ for (i = 0, kp = cm->mod->kp; i < cm->mod->num_kp; i++, kp++)
+ ddebug_match_apply_kparam(kp, cm, mod_name);
}
- if (nc) {
- dt->info.maps.len = nc;
- vpr_info("module:%s attached %d classes\n", dt->info.mod_name, nc);
+#endif
+ if (!cm->mod) {
+ vpr_cm_info(cm, "builtin classmap: %s ", mod_name);
+ for (kp = __start___param; kp < __stop___param; kp++)
+ ddebug_match_apply_kparam(kp, cm, mod_name);
}
}
+static void ddebug_apply_class_maps(const struct _ddebug_info *di)
+{
+ struct _ddebug_class_map *cm;
+ int i;
+
+ for_subvec(i, cm, di, maps)
+ ddebug_apply_params(cm, cm->mod_name);
+
+ vpr_di_info(di, "attached %d class-maps to ", i);
+}
+
+static void ddebug_apply_class_users(const struct _ddebug_info *di)
+{
+ struct _ddebug_class_user *cli;
+ int i;
+
+ for_subvec(i, cli, di, users)
+ ddebug_apply_params(cli->map, cli->mod_name);
+
+ vpr_di_info(di, "attached %d class-users to ", i);
+}
+
/*
* Narrow a _ddebug_info's vector (@_vec) to the contiguous subrange
* of elements where ->mod_name matches @__di->mod_name.
@@ -1214,6 +1295,22 @@ static void ddebug_attach_module_classes(struct ddebug_table *dt, struct _ddebug
__di->_vec.len = __nc; \
})
+static int __maybe_unused
+ddebug_class_range_overlap(struct _ddebug_class_map *cm,
+ u64 *reserved_ids)
+{
+ u64 range = (((1ULL << cm->length) - 1) << cm->base);
+
+ if (range & *reserved_ids) {
+ pr_err("[%d..%d] on %s conflicts with %llx\n", cm->base,
+ cm->base + cm->length - 1, cm->class_names[0],
+ *reserved_ids);
+ return -EINVAL;
+ }
+ *reserved_ids |= range;
+ return 0;
+}
+
/*
* Allocate a new ddebug_table for the given module
* and add it to the global list.
@@ -1222,6 +1319,7 @@ static int ddebug_add_module(struct _ddebug_info *di)
{
struct ddebug_table *dt;
struct _ddebug_class_map *cm;
+ struct _ddebug_class_user *cli;
int i;
if (!di->descs.len)
@@ -1234,26 +1332,29 @@ static int ddebug_add_module(struct _ddebug_info *di)
pr_err("error adding module: %s\n", di->mod_name);
return -ENOMEM;
}
+ INIT_LIST_HEAD(&dt->link);
/*
- * For built-in modules, name (as supplied in di by its
- * callers) lives in .rodata and is immortal. For loaded
- * modules, name points at the name[] member of struct module,
- * which lives at least as long as this struct ddebug_table.
+ * For built-in modules, di-> referents live in .*data and are
+ * immortal. For loaded modules, di points at the dyndbg_info
+ * member of its struct module, which lives at least as
+ * long as this struct ddebug_table.
*/
dt->info = *di;
-
- INIT_LIST_HEAD(&dt->link);
-
dd_set_module_subrange(i, cm, &dt->info, maps);
-
- if (di->maps.len)
- ddebug_attach_module_classes(dt, di);
+ dd_set_module_subrange(i, cli, &dt->info, users);
+ /* now di is stale */
mutex_lock(&ddebug_lock);
list_add_tail(&dt->link, &ddebug_tables);
mutex_unlock(&ddebug_lock);
- vpr_info("%3u debug prints in module %s\n", di->descs.len, di->mod_name);
+ if (dt->info.maps.len)
+ ddebug_apply_class_maps(&dt->info);
+ if (dt->info.users.len)
+ ddebug_apply_class_users(&dt->info);
+
+ vpr_info("%3u debug prints in module %s\n",
+ dt->info.descs.len, dt->info.mod_name);
return 0;
}
@@ -1403,8 +1504,10 @@ static int __init dynamic_debug_init(void)
struct _ddebug_info di = {
.descs.start = __start___dyndbg_descs,
.maps.start = __start___dyndbg_class_maps,
+ .users.start = __start___dyndbg_class_users,
.descs.len = __stop___dyndbg_descs - __start___dyndbg_descs,
.maps.len = __stop___dyndbg_class_maps - __start___dyndbg_class_maps,
+ .users.len = __stop___dyndbg_class_users - __start___dyndbg_class_users,
};
#ifdef CONFIG_MODULES
diff --git a/lib/test_dynamic_debug.c b/lib/test_dynamic_debug.c
index 9c3e53cd26bd..6c4548f63512 100644
--- a/lib/test_dynamic_debug.c
+++ b/lib/test_dynamic_debug.c
@@ -6,11 +6,30 @@
* Jim Cromie <jim.cromie@gmail.com>
*/
-#define pr_fmt(fmt) "test_dd: " fmt
+/*
+ * This file is built 2x, also making test_dynamic_debug_submod.ko,
+ * whose 2-line src file #includes this file. This gives us a _submod
+ * clone with identical pr_debugs, without further maintenance.
+ *
+ * If things are working properly, they should operate identically
+ * when printed or adjusted by >control. This eases visual perusal of
+ * the logs, and simplifies testing, by easing the proper accounting
+ * of expectations.
+ *
+ * It also puts both halves of the subsystem _DEFINE & _USE use case
+ * together, and integrates the common ENUM providing both class_ids
+ * and class-names to both _DEFINErs and _USERs. I think this makes
+ * the usage clearer.
+ */
+#if defined(TEST_DYNAMIC_DEBUG_SUBMOD)
+ #define pr_fmt(fmt) "test_dd_submod: " fmt
+#else
+ #define pr_fmt(fmt) "test_dd: " fmt
+#endif
#include <linux/module.h>
-/* run tests by reading or writing sysfs node: do_prints */
+/* re-gen output by reading or writing sysfs node: do_prints */
static void do_prints(void); /* device under test */
static int param_set_do_prints(const char *instr, const struct kernel_param *kp)
@@ -29,24 +48,39 @@ static const struct kernel_param_ops param_ops_do_prints = {
};
module_param_cb(do_prints, ¶m_ops_do_prints, NULL, 0600);
-/*
- * Using the CLASSMAP api:
- * - classmaps must have corresponding enum
- * - enum symbols must match/correlate with class-name strings in the map.
- * - base must equal enum's 1st value
- * - multiple maps must set their base to share the 0-30 class_id space !!
- * (build-bug-on tips welcome)
- * Additionally, here:
- * - tie together sysname, mapname, bitsname, flagsname
- */
-#define DD_SYS_WRAP(_model, _flags) \
- static unsigned long bits_##_model; \
- static struct _ddebug_class_param _flags##_model = { \
+#define CLASSMAP_BITMASK(width, base) (((1UL << (width)) - 1) << (base))
+
+/* sysfs param wrapper, proto-API */
+#define DYNAMIC_DEBUG_CLASSMAP_PARAM_(_model, _flags, _init) \
+ static unsigned long bits_##_model = _init; \
+ static struct _ddebug_class_param _flags##_##_model = { \
.bits = &bits_##_model, \
.flags = #_flags, \
.map = &map_##_model, \
}; \
- module_param_cb(_flags##_##_model, ¶m_ops_dyndbg_classes, &_flags##_model, 0600)
+ module_param_cb(_flags##_##_model, ¶m_ops_dyndbg_classes, \
+ &_flags##_##_model, 0600)
+#ifdef DEBUG
+#define DYNAMIC_DEBUG_CLASSMAP_PARAM(_model, _flags) \
+ DYNAMIC_DEBUG_CLASSMAP_PARAM_(_model, _flags, ~0)
+#else
+#define DYNAMIC_DEBUG_CLASSMAP_PARAM(_model, _flags) \
+ DYNAMIC_DEBUG_CLASSMAP_PARAM_(_model, _flags, 0)
+#endif
+
+/*
+ * Demonstrate/test DISJOINT & LEVEL typed classmaps with a sys-param.
+ *
+ * To comport with DRM debug-category (an int), classmaps map names to
+ * ids (also an int). So a classmap starts with an enum; DRM has enum
+ * debug_category: with DRM_UT_<CORE,DRIVER,KMS,etc>. We use the enum
+ * values as class-ids, and stringified enum-symbols as classnames.
+ *
+ * Modules with multiple CLASSMAPS must have enums with distinct
+ * value-ranges, as arranged below with explicit enum_sym = X inits.
+ * To clarify this sharing, declare the 2 enums now, for the 2
+ * different classmap types
+ */
/* numeric input, independent bits */
enum cat_disjoint_bits {
@@ -60,26 +94,51 @@ enum cat_disjoint_bits {
D2_LEASE,
D2_DP,
D2_DRMRES };
-DECLARE_DYNDBG_CLASSMAP(map_disjoint_bits, DD_CLASS_TYPE_DISJOINT_BITS, 0,
- "D2_CORE",
- "D2_DRIVER",
- "D2_KMS",
- "D2_PRIME",
- "D2_ATOMIC",
- "D2_VBL",
- "D2_STATE",
- "D2_LEASE",
- "D2_DP",
- "D2_DRMRES");
-DD_SYS_WRAP(disjoint_bits, p);
-DD_SYS_WRAP(disjoint_bits, T);
-
-/* numeric verbosity, V2 > V1 related */
-enum cat_level_num { V0 = 14, V1, V2, V3, V4, V5, V6, V7 };
-DECLARE_DYNDBG_CLASSMAP(map_level_num, DD_CLASS_TYPE_LEVEL_NUM, 14,
- "V0", "V1", "V2", "V3", "V4", "V5", "V6", "V7");
-DD_SYS_WRAP(level_num, p);
-DD_SYS_WRAP(level_num, T);
+
+/* numeric verbosity, V2 > V1 related. V0 is > D2_DRMRES */
+enum cat_level_num { V0 = 16, V1, V2, V3, V4, V5, V6, V7 };
+
+/* recapitulate DRM's multi-classmap setup */
+#if !defined(TEST_DYNAMIC_DEBUG_SUBMOD)
+/*
+ * In single user, or parent / coordinator (drm.ko) modules, define
+ * classmaps on the client enums above, and then declares the PARAMS
+ * ref'g the classmaps. Each is exported.
+ */
+DYNAMIC_DEBUG_CLASSMAP_DEFINE(map_disjoint_bits, DD_CLASS_TYPE_DISJOINT_BITS,
+ D2_CORE,
+ "D2_CORE",
+ "D2_DRIVER",
+ "D2_KMS",
+ "D2_PRIME",
+ "D2_ATOMIC",
+ "D2_VBL",
+ "D2_STATE",
+ "D2_LEASE",
+ "D2_DP",
+ "D2_DRMRES");
+
+DYNAMIC_DEBUG_CLASSMAP_DEFINE(map_level_num, DD_CLASS_TYPE_LEVEL_NUM,
+ V0, "V0", "V1", "V2", "V3", "V4", "V5", "V6", "V7");
+
+/*
+ * now add the sysfs-params
+ */
+
+DYNAMIC_DEBUG_CLASSMAP_PARAM(disjoint_bits, p);
+DYNAMIC_DEBUG_CLASSMAP_PARAM(level_num, p);
+
+#else /* TEST_DYNAMIC_DEBUG_SUBMOD */
+
+/*
+ * in submod/drm-drivers, use the classmaps defined in top/parent
+ * module above.
+ */
+
+DYNAMIC_DEBUG_CLASSMAP_USE(map_disjoint_bits);
+DYNAMIC_DEBUG_CLASSMAP_USE(map_level_num);
+
+#endif
/* stand-in for all pr_debug etc */
#define prdbg(SYM) __pr_debug_cls(SYM, #SYM " msg\n")
@@ -115,6 +174,7 @@ static void do_levels(void)
static void do_prints(void)
{
+ pr_debug("do_prints:\n");
do_cats();
do_levels();
}
diff --git a/lib/test_dynamic_debug_submod.c b/lib/test_dynamic_debug_submod.c
new file mode 100644
index 000000000000..672aabf40160
--- /dev/null
+++ b/lib/test_dynamic_debug_submod.c
@@ -0,0 +1,14 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Kernel module for testing dynamic_debug
+ *
+ * Authors:
+ * Jim Cromie <jim.cromie@gmail.com>
+ */
+
+/*
+ * clone the parent, inherit all the properties, for consistency and
+ * simpler accounting in test expectations.
+ */
+#define TEST_DYNAMIC_DEBUG_SUBMOD
+#include "test_dynamic_debug.c"
--
2.53.0
^ permalink raw reply related
* [PATCH v13 08/36] docs/dyndbg: explain flags parse 1st
From: Jim Cromie @ 2026-04-08 20:01 UTC (permalink / raw)
To: linux-kernel; +Cc: gregkh, jbaron, louis.chauvet, Jim Cromie, linux-doc
In-Reply-To: <20260408200211.43821-1-jim.cromie@gmail.com>
When writing queries to >control, flags are parsed 1st, since they are
the only required field, and they require specific compositions. So
if the flags draw an error (on those specifics), then keyword errors
aren't reported. This can be mildly confusing/annoying, so explain it
instead.
cc: linux-doc@vger.kernel.org
Reviewed-by: Louis Chauvet <louis.chauvet@bootlin.com>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
---
.../admin-guide/dynamic-debug-howto.rst | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst b/Documentation/admin-guide/dynamic-debug-howto.rst
index 4b14d9fd0300..9c2f096ed1d8 100644
--- a/Documentation/admin-guide/dynamic-debug-howto.rst
+++ b/Documentation/admin-guide/dynamic-debug-howto.rst
@@ -109,10 +109,19 @@ The match-spec's select *prdbgs* from the catalog, upon which to apply
the flags-spec, all constraints are ANDed together. An absent keyword
is the same as keyword "*".
-
-A match specification is a keyword, which selects the attribute of
-the callsite to be compared, and a value to compare against. Possible
-keywords are:::
+Note that since the match-spec can be empty, the flags are checked 1st,
+then the pairs of keyword and value. Flag errs will hide keyword errs::
+
+ bash-5.2# ddcmd mod bar +foo
+ dyndbg: read 13 bytes from userspace
+ dyndbg: query 0: "mod bar +foo" mod:*
+ dyndbg: unknown flag 'o'
+ dyndbg: flags parse failed
+ dyndbg: processed 1 queries, with 0 matches, 1 errs
+
+So a match-spec is a keyword, which selects the attribute of the
+callsite to be compared, and a value to compare against. Possible
+keywords are::
match-spec ::= 'func' string |
'file' string |
--
2.53.0
^ permalink raw reply related
* Re: [PATCH 12/24] nfsd: add data structures for handling CB_NOTIFY
From: Jeff Layton @ 2026-04-08 19:53 UTC (permalink / raw)
To: Chuck Lever, Alexander Viro, Christian Brauner, Jan Kara,
Chuck Lever, Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs
In-Reply-To: <8a6dbbee-5d59-4833-b0f6-22b1e46dfd11@app.fastmail.com>
On Wed, 2026-04-08 at 14:39 -0400, Chuck Lever wrote:
> On Tue, Apr 7, 2026, at 9:21 AM, Jeff Layton wrote:
> > Add the data structures, allocation helpers, and callback operations
> > needed for directory delegation CB_NOTIFY support:
> >
> > - struct nfsd_notify_event: carries fsnotify events for CB_NOTIFY
> > - struct nfsd4_cb_notify: per-delegation state for notification handling
> > - Union dl_cb_fattr with dl_cb_notify in nfs4_delegation since a
> > delegation is either a regular file delegation or a directory
> > delegation, never both
> >
> > Refactor alloc_init_deleg() into a common __alloc_init_deleg() base
> > with a pluggable sc_free callback, and add alloc_init_dir_deleg() which
> > allocates the page array and notify4 buffer needed for CB_NOTIFY
> > encoding.
> >
> > Add skeleton nfsd4_cb_notify_ops with done/release handlers that will
> > be filled in when the notification path is wired up.
> >
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
>
> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index 4afe7e68fb51..b2b8c454fc0f 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
>
> > @@ -3381,6 +3440,30 @@ nfsd4_cb_getattr_release(struct nfsd4_callback
> > *cb)
> > nfs4_put_stid(&dp->dl_stid);
> > }
> >
> > +static int
> > +nfsd4_cb_notify_done(struct nfsd4_callback *cb,
> > + struct rpc_task *task)
> > +{
> > + switch (task->tk_status) {
> > + case -NFS4ERR_DELAY:
> > + rpc_delay(task, 2 * HZ);
> > + return 0;
> > + default:
> > + return 1;
> > + }
> > +}
> > +
> > +static void
> > +nfsd4_cb_notify_release(struct nfsd4_callback *cb)
> > +{
> > + struct nfsd4_cb_notify *ncn =
> > + container_of(cb, struct nfsd4_cb_notify, ncn_cb);
> > + struct nfs4_delegation *dp =
> > + container_of(ncn, struct nfs4_delegation, dl_cb_notify);
> > +
> > + nfs4_put_stid(&dp->dl_stid);
> > +}
> > +
> > static const struct nfsd4_callback_ops nfsd4_cb_recall_any_ops = {
> > .done = nfsd4_cb_recall_any_done,
> > .release = nfsd4_cb_recall_any_release,
>
> So when a client responds with NFS4ERR_DELAY, the RPC framework retries
> after 2s. On retry, prepare() is called again, but ncn_evt_cnt is
> already 0 (drained in the first prepare). prepare returns false, which
> destroys the callback.
>
This is actually not a problem. When ->done() returns 0,
rpc_restart_call_prepare retries the RPC through the RPC-level prepare
(nfsd4_cb_prepare at nfs4callback.c:1479), which just acquires a
session slot and calls rpc_call_start. It does not call cb_ops->prepare
again — the same encoded XDR is re-sent in that case.
> Events arriving during the retry window are dropped because
> nfsd4_run_cb_notify() returns early when NFSD4_CALLBACK_RUNNING is set.
> After the callback is destroyed, future events can queue a new CB_NOTIFY,
> but the window's events are lost.
>
> The result is that the client misses notifications. Does this impact
> behavioral correctness or spec compliance? Is there a way for that
> client to detect the loss and recover?
>
There _is_ a problem, however. Events that arrive while the job is
running queue up, but won't get sent until the _next_ event arrives. We
need to make the ->release handler check for new events and requeue the
callback if there are any. I'll plan to fix that up.
Thanks for all the review! I'll plan to send another version soon (but
probably not until after next week's testing event).
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply
* Re: [PATCH mm-unstable v15 03/13] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: Nico Pache @ 2026-04-08 19:48 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <c4e80668-9018-48fc-883c-5d52a5950065@kernel.org>
On Thu, Mar 12, 2026 at 2:56 PM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 3/12/26 21:36, David Hildenbrand (Arm) wrote:
> > On 3/12/26 21:32, David Hildenbrand (Arm) wrote:
> >> On 2/26/26 04:23, Nico Pache wrote:
> >>> generalize the order of the __collapse_huge_page_* functions
> >>> to support future mTHP collapse.
> >>>
> >>> mTHP collapse will not honor the khugepaged_max_ptes_shared or
> >>> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
> >>> shared or swapped entry.
> >>>
> >>> No functional changes in this patch.
> >>>
> >>> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
> >>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> >>> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >>> Co-developed-by: Dev Jain <dev.jain@arm.com>
> >>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> >>> Signed-off-by: Nico Pache <npache@redhat.com>
> >>> ---
> >>> mm/khugepaged.c | 73 +++++++++++++++++++++++++++++++------------------
> >>> 1 file changed, 47 insertions(+), 26 deletions(-)
> >>>
> >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >>> index a9b645402b7f..ecdbbf6a01a6 100644
> >>> --- a/mm/khugepaged.c
> >>> +++ b/mm/khugepaged.c
> >>> @@ -535,7 +535,7 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
> >>>
> >>> static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >>> unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
> >>> - struct list_head *compound_pagelist)
> >>> + unsigned int order, struct list_head *compound_pagelist)
> >>> {
> >>> struct page *page = NULL;
> >>> struct folio *folio = NULL;
> >>> @@ -543,15 +543,17 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >>> pte_t *_pte;
> >>> int none_or_zero = 0, shared = 0, referenced = 0;
> >>> enum scan_result result = SCAN_FAIL;
> >>> + const unsigned long nr_pages = 1UL << order;
> >>> + int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> >>
> >> It might be a bit more readable to move "const unsigned long
> >> nr_pages = 1UL << order;" all the way to the top.
> >>
> >> Then, have here
> >>
> >> int max_ptes_none = 0;
> >>
> >> and do at the beginning of the function:
> >>
> >> /* For MADV_COLLAPSE, we always collapse ... */
> >> if (!cc->is_khugepaged)
> >> max_ptes_none = HPAGE_PMD_NR;
> >> /* ... except if userfaultf relies on MISSING faults. */
> >> if (!userfaultfd_armed(vma))
> >> max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> >>
> >> (but see below regarding helper function)
> >>
> >> then the code below becomes ...
> >>
> >>>
> >>> - for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> >>> + for (_pte = pte; _pte < pte + nr_pages;
> >>> _pte++, addr += PAGE_SIZE) {
> >>> pte_t pteval = ptep_get(_pte);
> >>> if (pte_none_or_zero(pteval)) {
> >>> ++none_or_zero;
> >>> if (!userfaultfd_armed(vma) &&
> >>> (!cc->is_khugepaged ||
> >>> - none_or_zero <= khugepaged_max_ptes_none)) {
> >>> + none_or_zero <= max_ptes_none)) {
> >>
> >> ...
> >>
> >> if (none_or_zero <= max_ptes_none) {
> >>
> >>
> >> I see that you do something like that (but slightly different) in the next
> >> patch. You could easily extend the above by it.
> >>
> >> Or go one step further and move all of that conditional into collapse_max_ptes_none(), whereby
> >> you simply also pass the cc and the vma.
> >>
> >> Then this all gets cleaned up and you'd end up above with
> >>
> >> max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> >> if (max_ptes_none < 0)
> >> return result;
> >>
> >> I'd do all that in this patch here, getting rid of #4.
> >>
> >>
> >>> continue;
> >>> } else {
> >>> result = SCAN_EXCEED_NONE_PTE;
> >>> @@ -585,8 +587,14 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >>> /* See collapse_scan_pmd(). */
> >>> if (folio_maybe_mapped_shared(folio)) {
> >>> ++shared;
> >>> - if (cc->is_khugepaged &&
> >>> - shared > khugepaged_max_ptes_shared) {
> >>> + /*
> >>> + * TODO: Support shared pages without leading to further
> >>> + * mTHP collapses. Currently bringing in new pages via
> >>> + * shared may cause a future higher order collapse on a
> >>> + * rescan of the same range.
> >>> + */
> >>> + if (!is_pmd_order(order) || (cc->is_khugepaged &&
> >>> + shared > khugepaged_max_ptes_shared)) {
> >>
> >> That's not how we indent within a nested ().
> >>
> >> To make this easier to read, what about similarly having at the beginning
> >> of the function:
> >>
> >> int max_ptes_shared = 0;
> >>
> >> /* For MADV_COLLAPSE, we always collapse. */
> >> if (cc->is_khugepaged)
> >> max_ptes_none = HPAGE_PMD_NR;
> >> /* TODO ... */
> >> if (is_pmd_order(order))
> >> max_ptes_none = khugepaged_max_ptes_shared;
> >>
> >> to turn this code into a
> >>
> >> if (shared > khugepaged_max_ptes_shared)
> >>
> >> Also, here, might make sense to have a collapse_max_ptes_swap(cc, order)
> >> to do that and clean it up.
> >>
> >>
> >>> result = SCAN_EXCEED_SHARED_PTE;
> >>> count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> >>> goto out;
> >>> @@ -679,18 +687,18 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >>> }
> >>>
> >>> static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> >>> - struct vm_area_struct *vma,
> >>> - unsigned long address,
> >>> - spinlock_t *ptl,
> >>> - struct list_head *compound_pagelist)
> >>> + struct vm_area_struct *vma, unsigned long address,
> >>> + spinlock_t *ptl, unsigned int order,
> >>> + struct list_head *compound_pagelist)
> >>> {
> >>> - unsigned long end = address + HPAGE_PMD_SIZE;
> >>> + unsigned long end = address + (PAGE_SIZE << order);
> >>> struct folio *src, *tmp;
> >>> pte_t pteval;
> >>> pte_t *_pte;
> >>> unsigned int nr_ptes;
> >>> + const unsigned long nr_pages = 1UL << order;
> >>
> >> Move it further to the top.
> >>
> >>>
> >>> - for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
> >>> + for (_pte = pte; _pte < pte + nr_pages; _pte += nr_ptes,
> >>> address += nr_ptes * PAGE_SIZE) {
> >>> nr_ptes = 1;
> >>> pteval = ptep_get(_pte);
> >>> @@ -743,13 +751,11 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> >>> }
> >>>
> >>> static void __collapse_huge_page_copy_failed(pte_t *pte,
> >>> - pmd_t *pmd,
> >>> - pmd_t orig_pmd,
> >>> - struct vm_area_struct *vma,
> >>> - struct list_head *compound_pagelist)
> >>> + pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> >>> + unsigned int order, struct list_head *compound_pagelist)
> >>> {
> >>> spinlock_t *pmd_ptl;
> >>> -
> >>> + const unsigned long nr_pages = 1UL << order;
> >>> /*
> >>> * Re-establish the PMD to point to the original page table
> >>> * entry. Restoring PMD needs to be done prior to releasing
> >>> @@ -763,7 +769,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> >>> * Release both raw and compound pages isolated
> >>> * in __collapse_huge_page_isolate.
> >>> */
> >>> - release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
> >>> + release_pte_pages(pte, pte + nr_pages, compound_pagelist);
> >>> }
> >>>
> >>> /*
> >>> @@ -783,16 +789,16 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> >>> */
> >>> static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
> >>> pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> >>> - unsigned long address, spinlock_t *ptl,
> >>> + unsigned long address, spinlock_t *ptl, unsigned int order,
> >>> struct list_head *compound_pagelist)
> >>> {
> >>> unsigned int i;
> >>> enum scan_result result = SCAN_SUCCEED;
> >>> -
> >>> + const unsigned long nr_pages = 1UL << order;
> >>
> >> Same here, all the way to the top.
> >>
> >>> /*
> >>> * Copying pages' contents is subject to memory poison at any iteration.
> >>> */
> >>> - for (i = 0; i < HPAGE_PMD_NR; i++) {
> >>> + for (i = 0; i < nr_pages; i++) {
> >>> pte_t pteval = ptep_get(pte + i);
> >>> struct page *page = folio_page(folio, i);
> >>> unsigned long src_addr = address + i * PAGE_SIZE;
> >>> @@ -811,10 +817,10 @@ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *foli
> >>>
> >>> if (likely(result == SCAN_SUCCEED))
> >>> __collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
> >>> - compound_pagelist);
> >>> + order, compound_pagelist);
> >>> else
> >>> __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> >>> - compound_pagelist);
> >>> + order, compound_pagelist);
> >>>
> >>> return result;
> >>> }
> >>> @@ -985,12 +991,12 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm,
> >>> * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
> >>> */
> >>> static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> >>> - struct vm_area_struct *vma, unsigned long start_addr, pmd_t *pmd,
> >>> - int referenced)
> >>> + struct vm_area_struct *vma, unsigned long start_addr,
> >>> + pmd_t *pmd, int referenced, unsigned int order)
> >>> {
> >>> int swapped_in = 0;
> >>> vm_fault_t ret = 0;
> >>> - unsigned long addr, end = start_addr + (HPAGE_PMD_NR * PAGE_SIZE);
> >>> + unsigned long addr, end = start_addr + (PAGE_SIZE << order);
> >>> enum scan_result result;
> >>> pte_t *pte = NULL;
> >>> spinlock_t *ptl;
> >>> @@ -1022,6 +1028,19 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> >>> pte_present(vmf.orig_pte))
> >>> continue;
> >>>
> >>> + /*
> >>> + * TODO: Support swapin without leading to further mTHP
> >>> + * collapses. Currently bringing in new pages via swapin may
> >>> + * cause a future higher order collapse on a rescan of the same
> >>> + * range.
> >>> + */
> >>> + if (!is_pmd_order(order)) {
> >>> + pte_unmap(pte);
> >>> + mmap_read_unlock(mm);
> >>> + result = SCAN_EXCEED_SWAP_PTE;
> >>> + goto out;
> >>> + }
> >>> +
> >>
> >> Interesting, we just swapin everything we find :)
> >>
> >> But do we really need this check here? I mean, we just found it to be present.
> >>
> >> In the rare event that there was a race, do we really care? It was just
> >> present, now it's swapped. Bad luck. Just swap it in.
> >>
> >
> > Okay, now I am confused. Why are you not taking care of
> > collapse_scan_pmd() in the same context?
> >
> > Because if you make sure that we properly check against a max_ptes_swap
> > similar as in the style above, we'd rule out swapin right from the start?
> >
> > Also, I would expect that all other parameters in there are similarly
> > handled?
> >
>
> Okay, I think you should add the following:
Hey! Thanks for all your reviews here.
For multiple reasons, here is the solution I developed:
Add a patch before the generalize __collapse.. patch that reworks the
max_ptes* handling and introduces the helpers (no functional changes).
I later updated these functions to follow the specific mthp rules in
the generalization patch. Honestly, refactoring much of this has been
very hard without one large patch, which is why we split it up
initially.
How does that sound?
-- Nico
>
> From 17bce81ab93f3b16e044ac2f4f62be19aac38180 Mon Sep 17 00:00:00 2001
> From: "David Hildenbrand (Arm)" <david@kernel.org>
> Date: Thu, 12 Mar 2026 21:54:22 +0100
> Subject: [PATCH] tmp
>
> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
> ---
> mm/khugepaged.c | 89 +++++++++++++++++++++++++++++--------------------
> 1 file changed, 53 insertions(+), 36 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b7b4680d27ab..6a3773bfa0a2 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -318,6 +318,34 @@ static ssize_t max_ptes_shared_store(struct kobject *kobj,
> return count;
> }
>
> +static int collapse_max_ptes_none(struct collapse_control *cc,
> + struct vm_area_struct *vma)
> +{
> + /* We don't mess with MISSING faults. */
> + if (vma && userfaultfd_armed(vma))
> + return 0;
> + /* MADV_COLLAPSE always collapses. */
> + if (!cc->is_khugepaged)
> + return HPAGE_PMD_NR;
> + return khugepaged_max_ptes_none;
> +}
> +
> +static int collapse_max_ptes_shared(struct collapse_control *cc)
> +{
> + /* MADV_COLLAPSE always collapses. */
> + if (!cc->is_khugepaged)
> + return HPAGE_PMD_NR;
> + return khugepaged_max_ptes_shared;
> +}
> +
> +static int collapse_max_ptes_swap(struct collapse_control *cc)
> +{
> + /* MADV_COLLAPSE always collapses. */
> + if (!cc->is_khugepaged)
> + return HPAGE_PMD_NR;
> + return khugepaged_max_ptes_swap;
> +}
> +
> static struct kobj_attribute khugepaged_max_ptes_shared_attr =
> __ATTR_RW(max_ptes_shared);
>
> @@ -539,6 +567,8 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
> struct list_head *compound_pagelist)
> {
> + const int max_ptes_none = collapse_max_ptes_none(cc, vma);
> + const int max_ptes_shared = collapse_max_ptes_shared(cc);
> struct page *page = NULL;
> struct folio *folio = NULL;
> unsigned long addr = start_addr;
> @@ -550,16 +580,12 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> _pte++, addr += PAGE_SIZE) {
> pte_t pteval = ptep_get(_pte);
> if (pte_none_or_zero(pteval)) {
> - ++none_or_zero;
> - if (!userfaultfd_armed(vma) &&
> - (!cc->is_khugepaged ||
> - none_or_zero <= khugepaged_max_ptes_none)) {
> - continue;
> - } else {
> + if (++none_or_zero > max_ptes_none) {
> result = SCAN_EXCEED_NONE_PTE;
> count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> goto out;
> }
> + continue;
> }
> if (!pte_present(pteval)) {
> result = SCAN_PTE_NON_PRESENT;
> @@ -586,9 +612,7 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>
> /* See hpage_collapse_scan_pmd(). */
> if (folio_maybe_mapped_shared(folio)) {
> - ++shared;
> - if (cc->is_khugepaged &&
> - shared > khugepaged_max_ptes_shared) {
> + if (++shared > max_ptes_shared) {
> result = SCAN_EXCEED_SHARED_PTE;
> count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> goto out;
> @@ -1247,6 +1271,9 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> struct vm_area_struct *vma, unsigned long start_addr,
> bool *mmap_locked, struct collapse_control *cc)
> {
> + const int max_ptes_none = collapse_max_ptes_none(cc, vma);
> + const int max_ptes_swap = collapse_max_ptes_swap(cc);
> + const int max_ptes_shared = collapse_max_ptes_shared(cc);
> pmd_t *pmd;
> pte_t *pte, *_pte;
> int none_or_zero = 0, shared = 0, referenced = 0;
> @@ -1280,36 +1307,28 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>
> pte_t pteval = ptep_get(_pte);
> if (pte_none_or_zero(pteval)) {
> - ++none_or_zero;
> - if (!userfaultfd_armed(vma) &&
> - (!cc->is_khugepaged ||
> - none_or_zero <= khugepaged_max_ptes_none)) {
> - continue;
> - } else {
> + if (++none_or_zero > max_ptes_none) {
> result = SCAN_EXCEED_NONE_PTE;
> count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> goto out_unmap;
> }
> + continue;
> }
> if (!pte_present(pteval)) {
> - ++unmapped;
> - if (!cc->is_khugepaged ||
> - unmapped <= khugepaged_max_ptes_swap) {
> - /*
> - * Always be strict with uffd-wp
> - * enabled swap entries. Please see
> - * comment below for pte_uffd_wp().
> - */
> - if (pte_swp_uffd_wp_any(pteval)) {
> - result = SCAN_PTE_UFFD_WP;
> - goto out_unmap;
> - }
> - continue;
> - } else {
> + if (++unmapped > max_ptes_swap) {
> result = SCAN_EXCEED_SWAP_PTE;
> count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> goto out_unmap;
> }
> + /*
> + * Always be strict with uffd-wp enabled swap entries.
> + * See the comment below for pte_uffd_wp().
> + */
> + if (pte_swp_uffd_wp_any(pteval)) {
> + result = SCAN_PTE_UFFD_WP;
> + goto out_unmap;
> + }
> + continue;
> }
> if (pte_uffd_wp(pteval)) {
> /*
> @@ -1348,9 +1367,7 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> * is shared.
> */
> if (folio_maybe_mapped_shared(folio)) {
> - ++shared;
> - if (cc->is_khugepaged &&
> - shared > khugepaged_max_ptes_shared) {
> + if (++shared > max_ptes_shared) {
> result = SCAN_EXCEED_SHARED_PTE;
> count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> goto out_unmap;
> @@ -2305,6 +2322,8 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
> unsigned long addr, struct file *file, pgoff_t start,
> struct collapse_control *cc)
> {
> + const int max_ptes_none = collapse_max_ptes_none(cc, NULL);
> + const int max_ptes_swap = collapse_max_ptes_swap(cc);
> struct folio *folio = NULL;
> struct address_space *mapping = file->f_mapping;
> XA_STATE(xas, &mapping->i_pages, start);
> @@ -2323,8 +2342,7 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
>
> if (xa_is_value(folio)) {
> swap += 1 << xas_get_order(&xas);
> - if (cc->is_khugepaged &&
> - swap > khugepaged_max_ptes_swap) {
> + if (swap > max_ptes_swap) {
> result = SCAN_EXCEED_SWAP_PTE;
> count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> break;
> @@ -2395,8 +2413,7 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
> cc->progress += HPAGE_PMD_NR;
>
> if (result == SCAN_SUCCEED) {
> - if (cc->is_khugepaged &&
> - present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
> + if (present < HPAGE_PMD_NR - max_ptes_none) {
> result = SCAN_EXCEED_NONE_PTE;
> count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> } else {
> --
> 2.43.0
>
>
> Then extend it by passing an order + return value check in this patch here. You can
> directly squash changes from patch #4 in here then.
>
> --
> Cheers,
>
> David
>
^ permalink raw reply
* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Sean Christopherson @ 2026-04-08 19:48 UTC (permalink / raw)
To: Ackerley Tng
Cc: Michael Roth, Vishal Annapurve, aik, andrew.jones, binbin.wu,
brauner, chao.p.peng, david, ira.weiny, jmattson, jthoughton,
oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
forkloop, pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm
In-Reply-To: <CAEvNRgEcKnHkgC2iK8hRMybruY5LrxWH+RK94M7MddUqgGChHQ@mail.gmail.com>
On Wed, Apr 08, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> > On Tue, Apr 07, 2026, Michael Roth wrote:
> >> On Tue, Apr 07, 2026 at 02:50:58PM -0700, Vishal Annapurve wrote:
> >> > > So I agree with Ackerley's proposal (which I guess is the same as what's
> >> > > in this series).
> >> > >
> >> > > However, 1 other alternative would be to do what was suggested on the
> >> > > call, but require userspace to subsequently handle the shared->private
> >> > > conversion. I think that would be workable too.
> >> >
> >> > IIUC, Converting memory ranges to private after it essentially is
> >> > treated as private by the KVM CC backend will expose the
> >> > implementation to the same risk of userspace being able to access
> >> > private memory and compromise host safety which guest_memfd was
> >> > invented to address.
> >>
> >> Doh, fair point. Doing conversion as part of the populate call would allow
> >> us to use the filemap write-lock to avoid userspace being able to fault
> >> in private (as tracked by trusted entity) pages before they are
> >> transitioned to private (as tracked by KVM), so it's safer than having
> >> userspace drive it.
> >>
> >> But obviously I still think Ackerley's original proposal has more
> >> upsides than the alternatives mentioned so far.
> >
> > I'm a bit lost. What exactly is/was Ackerley's original proposal? If the answer
> > is "convert pages from shared=>private when populating via in-place conversion",
> > then I agree, because AFAICT, that's the only sane option.
>
> Discussed this at PUCK today 2026-04-08.
>
> The update is that the KVM_SET_MEMORY_ATTRIBUTES2 guest_memfd ioctl will
> now support the PRESERVE flag for TDX and SNP only if the setup for the
> VM in question hasn't yet been completed (KVM_TDX_FINALIZE_VM or
> KVM_SEV_SNP_LAUNCH_FINISH hasn't completed yet).
>
> The populate flow will be
>
> 1a. Get contents to be loaded in guest_memfd (src_addr: NULL) as shared
> OR
> 1b. Provide contents from some other userspace address (src_addr:
> userspace address)
>
> 2. KVM_SET_MEMORY_ATTRIBUTES2(attribute: PRIVATE and flags: PRESERVE)
> 3. KVM_SEV_SNP_LAUNCH_UPDATE() or KVM_TDX_INIT_MEM_REGION()
> ...
> 4. KVM_SEV_SNP_LAUNCH_FINISH() or KVM_TDX_FINALIZE_VM()
>
> This applies whether src_addr is some userspace address that is shared
> or NULL, so the non-in-place loading flow is not considered legacy. ARM
> CCA can still use that flow :)
>
> Other than supporting PRESERVE only if the setup for the VM in question
> hasn't yet been completed, KVM's fault path will also not permit faults
> if the setup hasn't been completed. (Some exception setup will be used
> for TDX to be able to perform the required fault.)
Nit: as Mike (or Rick?) called out in PUCK, TDX's flow is now a separate path
thanks to commit 3ab3283dbb2c ("KVM: x86/mmu: Add dedicated API to map guest_memfd
pfn into TDP MMU"). I.e. it is NOT a fault in any way, shape, or form.
tdx_mem_page_add() already asserts pre_fault_allowed=false:
if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm) ||
KVM_BUG_ON(!kvm_tdx->page_add_src, kvm))
return -EIO;
so I think we just to add similar checks in SEV and the MMU. This can even be
done today as a hardening measure, as the rules aren't changing, we're just
doubling down on disallowing (pre-)faulting during pre-boot.
E.g.
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 73cdcbccc89e..99f070cf2480 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -363,6 +363,9 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
};
int r;
+ if (KVM_BUG_ON(!vcpu->kvm->arch.pre_fault_allowed, vcpu->kvm))
+ return -EIO;
+
if (vcpu->arch.mmu->root_role.direct) {
/*
* Things like memslots don't understand the concept of a shared
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 2010b157e288..f0bbbda6e9c4 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2419,6 +2419,9 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
if (!sev_snp_guest(kvm) || !sev->snp_context)
return -EINVAL;
+ if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm)
+ return -EIO;
+
if (copy_from_user(¶ms, u64_to_user_ptr(argp->data), sizeof(params)))
return -EFAULT;
^ permalink raw reply related
* Re: [PATCH] riscv: enable HAVE_CMPXCHG_{DOUBLE,LOCAL}
From: Miquel Sabaté Solà @ 2026-04-08 19:44 UTC (permalink / raw)
To: pjw; +Cc: palmer, alex, corbet, linux-riscv, linux-doc, linux-kernel
In-Reply-To: <87sea63hmh.fsf@>
[-- Attachment #1: Type: text/plain, Size: 2344 bytes --]
Miquel Sabaté Solà @ 2026-03-11 09:17 +01:
> Miquel Sabaté Solà @ 2026-02-20 08:44 +01:
>
>> Support for atomic Compare-And-Swap instructions has been in the RISC-V
>> port of the Linux kernel for a long time. That being said, we apparently
>> never bothered to set HAVE_CMPXCHG_DOUBLE and HAVE_CMPXCHG_LOCAL in the
>> Kconfig, despite having all the framework to support them.
>>
>> Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
>> ---
>> I have built this patch with multiple configurations and ran it with KVM
>> (the VisionFive2 board that I have lacks the needed extensions). All seems
>> to work, but I do wonder if we did not enable these for a reason or this
>> just slipped through. So far in the code I believe everything is in place,
>> and I haven't seen any commit in the git log stating otherwise.
>>
>> Documentation/features/locking/cmpxchg-local/arch-support.txt | 2 +-
>> arch/riscv/Kconfig | 2 ++
>> 2 files changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/features/locking/cmpxchg-local/arch-support.txt b/Documentation/features/locking/cmpxchg-local/arch-support.txt
>> index 2c3a4b91f16d..28d5fa8c3b4f 100644
>> --- a/Documentation/features/locking/cmpxchg-local/arch-support.txt
>> +++ b/Documentation/features/locking/cmpxchg-local/arch-support.txt
>> @@ -20,7 +20,7 @@
>> | openrisc: | TODO |
>> | parisc: | TODO |
>> | powerpc: | TODO |
>> - | riscv: | TODO |
>> + | riscv: | ok |
>> | s390: | ok |
>> | sh: | TODO |
>> | sparc: | TODO |
>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>> index 7e76b6316425..7c6726a8d738 100644
>> --- a/arch/riscv/Kconfig
>> +++ b/arch/riscv/Kconfig
>> @@ -151,6 +151,8 @@ config RISCV
>> select HAVE_ARCH_USERFAULTFD_WP if 64BIT && MMU && USERFAULTFD && RISCV_ISA_SVRSW60T59B
>> select HAVE_ARCH_VMAP_STACK if MMU && 64BIT
>> select HAVE_ASM_MODVERSIONS
>> + select HAVE_CMPXCHG_DOUBLE if RISCV_ISA_ZACAS && RISCV_ISA_ZABHA
>> + select HAVE_CMPXCHG_LOCAL if RISCV_ISA_ZACAS && RISCV_ISA_ZABHA
>> select HAVE_CONTEXT_TRACKING_USER
>> select HAVE_DEBUG_KMEMLEAK
>> select HAVE_DMA_CONTIGUOUS if MMU
>
> Gentle ping :)
Another gentle ping :)
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 897 bytes --]
^ permalink raw reply
* Re: [PATCH 13/24] nfsd: add notification handlers for dir events
From: Jeff Layton @ 2026-04-08 19:40 UTC (permalink / raw)
To: Chuck Lever, Alexander Viro, Christian Brauner, Jan Kara,
Chuck Lever, Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs
In-Reply-To: <f60c8d0f-45b6-4d4e-8944-dfd69f2cf9bf@app.fastmail.com>
On Wed, 2026-04-08 at 14:34 -0400, Chuck Lever wrote:
>
> On Tue, Apr 7, 2026, at 9:21 AM, Jeff Layton wrote:
> > Add the necessary parts to accept a fsnotify callback for directory
> > change event and create a CB_NOTIFY request for it. When a dir nfsd_file
> > is created set a handle_event callback to handle the notification.
> >
> > Use that to allocate a nfsd_notify_event object and then hand off a
> > reference to each delegation's CB_NOTIFY. If anything fails along the
> > way, recall any affected delegations.
> >
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
>
> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index b2b8c454fc0f..339c3d0bb575 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
>
> > @@ -9796,3 +9887,118 @@ nfsd_get_dir_deleg(struct nfsd4_compound_state
> > *cstate,
> > put_nfs4_file(fp);
> > return ERR_PTR(status);
> > }
> > +
> > +static void
> > +nfsd4_run_cb_notify(struct nfsd4_cb_notify *ncn)
> > +{
> > + struct nfs4_delegation *dp = container_of(ncn, struct
> > nfs4_delegation, dl_cb_notify);
> > +
> > + if (test_and_set_bit(NFSD4_CALLBACK_RUNNING, &ncn->ncn_cb.cb_flags))
> > + return;
> > +
> > + if (!refcount_inc_not_zero(&dp->dl_stid.sc_count))
> > + clear_bit(NFSD4_CALLBACK_RUNNING, &ncn->ncn_cb.cb_flags);
> > + else
> > + nfsd4_run_cb(&ncn->ncn_cb);
> > +}
> > +
> > +static struct nfsd_notify_event *
> > +alloc_nfsd_notify_event(u32 mask, const struct qstr *q, struct dentry
> > *dentry)
> > +{
> > + struct nfsd_notify_event *ne;
> > +
> > + ne = kmalloc(sizeof(*ne) + q->len + 1, GFP_KERNEL);
> > + if (!ne)
> > + return NULL;
> > +
> > + memcpy(&ne->ne_name, q->name, q->len);
> > + refcount_set(&ne->ne_ref, 1);
> > + ne->ne_mask = mask;
> > + ne->ne_name[q->len] = '\0';
> > + ne->ne_namelen = q->len;
> > + ne->ne_dentry = dget(dentry);
> > + return ne;
> > +}
> > +
> > +static bool
> > +should_notify_deleg(u32 mask, struct file_lease *fl)
> > +{
> > + /* Only nfsd leases */
> > + if (fl->fl_lmops != &nfsd_lease_mng_ops)
> > + return false;
> > +
> > + /* Skip if this event wasn't ignored by the lease */
> > + if ((mask & FS_DELETE) && !(fl->c.flc_flags & FL_IGN_DIR_DELETE))
> > + return false;
> > + if ((mask & FS_CREATE) && !(fl->c.flc_flags & FL_IGN_DIR_CREATE))
> > + return false;
> > + if ((mask & FS_RENAME) && !(fl->c.flc_flags & FL_IGN_DIR_RENAME))
> > + return false;
> > +
> > + return true;
> > +}
>
> For a cross-directory rename, vfs_rename calls try_break_deleg(old_dir,
> LEASE_BREAK_DIR_DELETE, ...). A delegation with FL_IGN_DIR_DELETE
> (subscribed to NOTIFY4_REMOVE_ENTRY) suppresses the lease break, which
> is correct.
>
> But fsnotify delivers FS_RENAME on old_dir, not FS_DELETE. In
> should_notify_deleg(), the check (mask & FS_RENAME) &&
> !(fl->c.flc_flags & FL_IGN_DIR_RENAME) fails, because the delegation
> has FL_IGN_DIR_DELETE but not FL_IGN_DIR_RENAME. No notification is
> sent.
>
Actually, it delivers FS_RENAME_FROM to the old directory and
FS_RENAME_TO to the new one in this case, AFAICT. Basically, we need to
watch for those in addition to FS_CREATE/DELETE and treat them the same
way. I'm looking at a fix now. I'll also see about adding a pynfs test
to do a cross directory rename with notifications. I don't think I have
that now.
Thanks!
> IIUC, a client subscribed to NOTIFY4_REMOVE_ENTRY for old_dir sees
> neither a lease break nor a CB_NOTIFY when a child is renamed out of
> the directory. Is that behavior correct?
>
>
> > +
> > +static void
> > +nfsd_recall_all_dir_delegs(const struct inode *dir)
> > +{
> > + struct file_lock_context *ctx = locks_inode_context(dir);
> > + struct file_lock_core *flc;
> > +
> > + spin_lock(&ctx->flc_lock);
> > + list_for_each_entry(flc, &ctx->flc_lease, flc_list) {
> > + struct file_lease *fl = container_of(flc, struct file_lease, c);
> > +
> > + if (fl->fl_lmops == &nfsd_lease_mng_ops)
> > + nfsd_break_deleg_cb(fl);
> > + }
> > + spin_unlock(&ctx->flc_lock);
> > +}
> > +
> > +int
> > +nfsd_handle_dir_event(u32 mask, const struct inode *dir, const void
> > *data,
> > + int data_type, const struct qstr *name)
> > +{
> > + struct dentry *dentry = fsnotify_data_dentry(data, data_type);
> > + struct file_lock_context *ctx;
> > + struct file_lock_core *flc;
> > + struct nfsd_notify_event *evt;
> > +
> > + /* Don't do anything if this is not an expected event */
> > + if (!(mask & (FS_CREATE|FS_DELETE|FS_RENAME)))
> > + return 0;
> > +
> > + ctx = locks_inode_context(dir);
> > + if (!ctx || list_empty(&ctx->flc_lease))
> > + return 0;
> > +
> > + evt = alloc_nfsd_notify_event(mask, name, dentry);
> > + if (!evt) {
> > + nfsd_recall_all_dir_delegs(dir);
> > + return 0;
> > + }
> > +
> > + spin_lock(&ctx->flc_lock);
> > + list_for_each_entry(flc, &ctx->flc_lease, flc_list) {
> > + struct file_lease *fl = container_of(flc, struct file_lease, c);
> > + struct nfs4_delegation *dp = flc->flc_owner;
> > + struct nfsd4_cb_notify *ncn = &dp->dl_cb_notify;
> > +
> > + if (!should_notify_deleg(mask, fl))
> > + continue;
> > +
> > + spin_lock(&ncn->ncn_lock);
> > + if (ncn->ncn_evt_cnt >= NOTIFY4_EVENT_QUEUE_SIZE) {
> > + /* We're generating notifications too fast. Recall. */
> > + spin_unlock(&ncn->ncn_lock);
> > + nfsd_break_deleg_cb(fl);
> > + continue;
> > + }
> > + ncn->ncn_evt[ncn->ncn_evt_cnt++] = nfsd_notify_event_get(evt);
> > + spin_unlock(&ncn->ncn_lock);
> > +
> > + nfsd4_run_cb_notify(ncn);
> > + }
> > + spin_unlock(&ctx->flc_lock);
> > + nfsd_notify_event_put(evt);
> > + return 0;
> > +}
>
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply
* Re: [PATCH 08/24] nfsd: update the fsnotify mark when setting or removing a dir delegation
From: Jeff Layton @ 2026-04-08 19:29 UTC (permalink / raw)
To: Chuck Lever, Alexander Viro, Christian Brauner, Jan Kara,
Chuck Lever, Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs
In-Reply-To: <412e143e-1e99-42a6-a959-654bde4e7038@app.fastmail.com>
On Wed, 2026-04-08 at 14:24 -0400, Chuck Lever wrote:
> On Tue, Apr 7, 2026, at 9:21 AM, Jeff Layton wrote:
> > Add a new helper function that will update the mask on the nfsd_file's
> > fsnotify_mark to be a union of all current directory delegations on an
> > inode. Call that when directory delegations are added or removed.
> >
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
>
> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index c8fb84c38637..9a4cff08c67d 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
>
> > @@ -1266,6 +1297,7 @@ static void nfs4_unlock_deleg_lease(struct
> > nfs4_delegation *dp)
> > WARN_ON_ONCE(!fp->fi_delegees);
> >
> > nfsd4_finalize_deleg_timestamps(dp, nf->nf_file);
> > + nfsd_fsnotify_recalc_mask(nf);
> > kernel_setlease(nf->nf_file, F_UNLCK, NULL, (void **)&dp);
> > put_deleg_file(fp);
> > }
>
> The grant path in nfsd_get_dir_deleg() uses a different ordering
> (setlease first, recalc_mask after).
>
> Here, since the delegation being removed is still in flc_lease,
> inode_lease_ignore_mask() includes its ignore flags. The mask is
> computed as if the delegation is still present.
>
> The result is that stale FS_CREATE/FS_DELETE/FS_RENAME bits remain
> in the fsnotify mark. It might be harmless in practice since the
> handler finds no leases and returns early, but it creates
> unnecessary work.
>
> Should nfs4_unlock_deleg_lease call nfsd_fsnotify_recalc_mask()
> after kernel_setlease(F_UNLCK)?
>
Good catch. Will fix.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply
* Re: [PATCH 01/24] filelock: add support for ignoring deleg breaks for dir change events
From: Jeff Layton @ 2026-04-08 19:25 UTC (permalink / raw)
To: Chuck Lever, Alexander Viro, Christian Brauner, Jan Kara,
Chuck Lever, Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs
In-Reply-To: <bb4fc3da-765d-4284-9282-dd04d286f671@app.fastmail.com>
On Wed, 2026-04-08 at 14:16 -0400, Chuck Lever wrote:
> On Tue, Apr 7, 2026, at 9:21 AM, Jeff Layton wrote:
> > If a NFS client requests a directory delegation with a notification
> > bitmask covering directory change events, the server shouldn't recall
> > the delegation. Instead the client will be notified of the change after
> > the fact.
> >
> > Add support for ignoring lease breaks on directory changes. Add a new
> > flags parameter to try_break_deleg() and teach __break_lease how to
> > ignore certain types of delegation break events.
> >
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
>
> > diff --git a/fs/locks.c b/fs/locks.c
> > index 8e44b1f6c15a..dafa0752fdce 100644
> > --- a/fs/locks.c
> > +++ b/fs/locks.c
>
> > @@ -1670,7 +1709,7 @@ int __break_lease(struct inode *inode, unsigned int flags)
> > locks_delete_lock_ctx(&fl->c, &dispose);
> > }
> >
> > - if (list_empty(&ctx->flc_lease))
> > + if (!visible_leases_remaining(inode, flags))
> > goto out;
> >
> > if (flags & LEASE_BREAK_NONBLOCK) {
>
> After breaking visible leases, the restart: label calls any_leases_conflict()
> which does not filter ignored dir-delegation leases. When only ignored leases
> remain, any_leases_conflict returns true, but visible_leases_remaining also
> returned true (triggering the wait). The code picks the first lease (possibly
> ignored), computes break_time = 1 jiffy, blocks, then loops.
>
> For example, suppose you have two directory delegations on a directory, one
> with FL_IGN_DIR_DELETE and one without. After the non-ignored one is broken
> and removed, the ignored one keeps any_leases_conflict returning true. The
> loop spins at 1-jiffy intervals until the ignored delegation is released.
>
> Should the restart: block skip ignored leases?
>
Yep. The right fix is to switch the any_leases_conflict() call to
visible_leases_remaining(). I'll code that up and test it.
Thanks for the review!
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply
* Re: [PATCH v2 1/1] leds: Introduce the multi_max_intensity sysfs attribute
From: Armin Wolf @ 2026-04-08 19:12 UTC (permalink / raw)
To: lee, pavel
Cc: linux-kernel, corbet, skhan, linux-leds, linux-doc, wse,
jacek.anaszewski, pobrn, m.tretter
In-Reply-To: <20260331191619.3729-2-W_Armin@gmx.de>
Am 31.03.26 um 21:16 schrieb Armin Wolf:
> Some multicolor LEDs support global brightness control in hardware,
> meaning that the maximum intensity of the color components is not
> connected to the maximum global brightness. Such LEDs cannot be
> described properly by the current multicolor LED class interface,
> because it assumes that the maximum intensity of each color component
> is described by the maximum global brightness of the LED.
>
> Fix this by introducing a new sysfs attribute called
> "multi_max_intensity" holding the maximum intensity values for the
> color components of a multicolor LED class device. Drivers can use
> the new max_intensity field inside struct mc_subled to tell the
> multicolor LED class code about those values. Intensity values written
> by userspace applications will be limited to this maximum value.
>
> Drivers for multicolor LEDs that do not support global brightness
> control in hardware might still want to use the maximum global LED
> brightness supplied via devicetree as the maximum intensity of each
> individual color component. Such drivers should set max_intensity
> to 0 so that the multicolor LED core can act accordingly.
>
> The lp50xx and ncp5623 LED drivers already use hardware-based control
> for the global LED brightness. Modify those drivers to correctly
> initalize .max_intensity to avoid being limited to the maximum global
> brightness supplied via devicetree.
Any thoughts on this?
Thanks,
Armin Wolf
> Signed-off-by: Armin Wolf <W_Armin@gmx.de>
> ---
> .../ABI/testing/sysfs-class-led-multicolor | 19 ++++++--
> Documentation/leds/leds-class-multicolor.rst | 21 ++++++++-
> drivers/leds/led-class-multicolor.c | 47 ++++++++++++++++++-
> drivers/leds/leds-lp50xx.c | 1 +
> drivers/leds/rgb/leds-ncp5623.c | 4 +-
> include/linux/led-class-multicolor.h | 30 +++++++++++-
> 6 files changed, 113 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-class-led-multicolor b/Documentation/ABI/testing/sysfs-class-led-multicolor
> index 16fc827b10cb..197da3e775b4 100644
> --- a/Documentation/ABI/testing/sysfs-class-led-multicolor
> +++ b/Documentation/ABI/testing/sysfs-class-led-multicolor
> @@ -16,9 +16,22 @@ Date: March 2020
> KernelVersion: 5.9
> Contact: Dan Murphy <dmurphy@ti.com>
> Description: read/write
> - This file contains array of integers. Order of components is
> - described by the multi_index array. The maximum intensity should
> - not exceed /sys/class/leds/<led>/max_brightness.
> + This file contains an array of integers. The order of components
> + is described by the multi_index array. The maximum intensity value
> + supported by each color component is described by the multi_max_intensity
> + file. Writing intensity values larger than the maximum value of a
> + given color component will result in those values being clamped.
> +
> + For additional details please refer to
> + Documentation/leds/leds-class-multicolor.rst.
> +
> +What: /sys/class/leds/<led>/multi_max_intensity
> +Date: March 2026
> +KernelVersion: 7.1
> +Contact: Armin Wolf <W_Armin@gmx.de>
> +Description: read
> + This file contains an array of integers describing the maximum
> + intensity value for each intensity component.
>
> For additional details please refer to
> Documentation/leds/leds-class-multicolor.rst.
> diff --git a/Documentation/leds/leds-class-multicolor.rst b/Documentation/leds/leds-class-multicolor.rst
> index c6b47b4093c4..68340644f80b 100644
> --- a/Documentation/leds/leds-class-multicolor.rst
> +++ b/Documentation/leds/leds-class-multicolor.rst
> @@ -25,10 +25,14 @@ color name to indexed value.
> The ``multi_index`` file is an array that contains the string list of the colors as
> they are defined in each ``multi_*`` array file.
>
> -The ``multi_intensity`` is an array that can be read or written to for the
> +The ``multi_intensity`` file is an array that can be read or written to for the
> individual color intensities. All elements within this array must be written in
> order for the color LED intensities to be updated.
>
> +The ``multi_max_intensity`` file is an array that contains the maximum intensity
> +value supported by each color intensity. Intensity values above this will be
> +automatically clamped into the supported range.
> +
> Directory Layout Example
> ========================
> .. code-block:: console
> @@ -38,6 +42,7 @@ Directory Layout Example
> -r--r--r-- 1 root root 4096 Oct 19 16:16 max_brightness
> -r--r--r-- 1 root root 4096 Oct 19 16:16 multi_index
> -rw-r--r-- 1 root root 4096 Oct 19 16:16 multi_intensity
> + -r--r--r-- 1 root root 4096 Oct 19 16:16 multi_max_intensity
>
> ..
>
> @@ -104,3 +109,17 @@ the color LED group.
> 128
>
> ..
> +
> +Writing intensity values larger than the maximum specified in ``multi_max_intensity``
> +will result in those values being clamped into the supported range.
> +
> +.. code-block:: console
> +
> + # cat /sys/class/leds/multicolor:status/multi_max_intensity
> + 255 255 255
> +
> + # echo 512 512 512 > /sys/class/leds/multicolor:status/multi_intensity
> + # cat /sys/class/leds/multicolor:status/multi_intensity
> + 255 255 255
> +
> +..
> diff --git a/drivers/leds/led-class-multicolor.c b/drivers/leds/led-class-multicolor.c
> index 6b671f3f9c61..8d763b1ae76f 100644
> --- a/drivers/leds/led-class-multicolor.c
> +++ b/drivers/leds/led-class-multicolor.c
> @@ -7,10 +7,28 @@
> #include <linux/init.h>
> #include <linux/led-class-multicolor.h>
> #include <linux/math.h>
> +#include <linux/minmax.h>
> #include <linux/module.h>
> #include <linux/slab.h>
> #include <linux/uaccess.h>
>
> +static unsigned int led_mc_get_max_intensity(struct led_classdev_mc *mcled_cdev, size_t index)
> +{
> + unsigned int max_intensity;
> +
> + /* The maximum global brightness value might still be changed by
> + * led_classdev_register_ext() using devicetree properties. This
> + * prevents us from changing subled_info[X].max_intensity when
> + * registering a multicolor LED class device, so we have to do
> + * this during runtime.
> + */
> + max_intensity = mcled_cdev->subled_info[index].max_intensity;
> + if (max_intensity)
> + return max_intensity;
> +
> + return mcled_cdev->led_cdev.max_brightness;
> +}
> +
> int led_mc_calc_color_components(struct led_classdev_mc *mcled_cdev,
> enum led_brightness brightness)
> {
> @@ -27,6 +45,27 @@ int led_mc_calc_color_components(struct led_classdev_mc *mcled_cdev,
> }
> EXPORT_SYMBOL_GPL(led_mc_calc_color_components);
>
> +static ssize_t multi_max_intensity_show(struct device *dev,
> + struct device_attribute *intensity_attr, char *buf)
> +{
> + struct led_classdev *led_cdev = dev_get_drvdata(dev);
> + struct led_classdev_mc *mcled_cdev = lcdev_to_mccdev(led_cdev);
> + unsigned int max_intensity;
> + int len = 0;
> + int i;
> +
> + for (i = 0; i < mcled_cdev->num_colors; i++) {
> + max_intensity = led_mc_get_max_intensity(mcled_cdev, i);
> + len += sysfs_emit_at(buf, len, "%u", max_intensity);
> + if (i < mcled_cdev->num_colors - 1)
> + len += sprintf(buf + len, " ");
> + }
> +
> + buf[len++] = '\n';
> + return len;
> +}
> +static DEVICE_ATTR_RO(multi_max_intensity);
> +
> static ssize_t multi_intensity_store(struct device *dev,
> struct device_attribute *intensity_attr,
> const char *buf, size_t size)
> @@ -35,6 +74,7 @@ static ssize_t multi_intensity_store(struct device *dev,
> struct led_classdev_mc *mcled_cdev = lcdev_to_mccdev(led_cdev);
> int nrchars, offset = 0;
> unsigned int intensity_value[LED_COLOR_ID_MAX];
> + unsigned int max_intensity;
> int i;
> ssize_t ret;
>
> @@ -56,8 +96,10 @@ static ssize_t multi_intensity_store(struct device *dev,
> goto err_out;
> }
>
> - for (i = 0; i < mcled_cdev->num_colors; i++)
> - mcled_cdev->subled_info[i].intensity = intensity_value[i];
> + for (i = 0; i < mcled_cdev->num_colors; i++) {
> + max_intensity = led_mc_get_max_intensity(mcled_cdev, i);
> + mcled_cdev->subled_info[i].intensity = min(intensity_value[i], max_intensity);
> + }
>
> if (!test_bit(LED_BLINK_SW, &led_cdev->work_flags))
> led_set_brightness(led_cdev, led_cdev->brightness);
> @@ -111,6 +153,7 @@ static ssize_t multi_index_show(struct device *dev,
> static DEVICE_ATTR_RO(multi_index);
>
> static struct attribute *led_multicolor_attrs[] = {
> + &dev_attr_multi_max_intensity.attr,
> &dev_attr_multi_intensity.attr,
> &dev_attr_multi_index.attr,
> NULL,
> diff --git a/drivers/leds/leds-lp50xx.c b/drivers/leds/leds-lp50xx.c
> index e2a9c8592953..69c3550f1a31 100644
> --- a/drivers/leds/leds-lp50xx.c
> +++ b/drivers/leds/leds-lp50xx.c
> @@ -525,6 +525,7 @@ static int lp50xx_probe_dt(struct lp50xx *priv)
> }
>
> mc_led_info[multi_index].color_index = color_id;
> + mc_led_info[multi_index].max_intensity = 255;
> num_colors++;
> }
>
> diff --git a/drivers/leds/rgb/leds-ncp5623.c b/drivers/leds/rgb/leds-ncp5623.c
> index 85d6be6fff2b..f2528f06507d 100644
> --- a/drivers/leds/rgb/leds-ncp5623.c
> +++ b/drivers/leds/rgb/leds-ncp5623.c
> @@ -56,8 +56,7 @@ static int ncp5623_brightness_set(struct led_classdev *cdev,
> for (int i = 0; i < mc_cdev->num_colors; i++) {
> ret = ncp5623_write(ncp->client,
> NCP5623_PWM_REG(mc_cdev->subled_info[i].channel),
> - min(mc_cdev->subled_info[i].intensity,
> - NCP5623_MAX_BRIGHTNESS));
> + mc_cdev->subled_info[i].intensity);
> if (ret)
> return ret;
> }
> @@ -190,6 +189,7 @@ static int ncp5623_probe(struct i2c_client *client)
> goto release_led_node;
>
> subled_info[ncp->mc_dev.num_colors].channel = reg;
> + subled_info[ncp->mc_dev.num_colors].max_intensity = NCP5623_MAX_BRIGHTNESS;
> subled_info[ncp->mc_dev.num_colors++].color_index = color_index;
> }
>
> diff --git a/include/linux/led-class-multicolor.h b/include/linux/led-class-multicolor.h
> index db9f34c6736e..6f89d92566b2 100644
> --- a/include/linux/led-class-multicolor.h
> +++ b/include/linux/led-class-multicolor.h
> @@ -9,10 +9,31 @@
> #include <linux/leds.h>
> #include <dt-bindings/leds/common.h>
>
> +/**
> + * struct mc_subled - Color component description.
> + * @color_index: Color ID.
> + * @brightness: Scaled intensity.
> + * @intensity: Current intensity.
> + * @max_intensity: Maximum supported intensity value.
> + * @channel: Channel index.
> + *
> + * Describes a color component of a multicolor LED. Many multicolor LEDs
> + * do no support gobal brightness control in hardware, so they use
> + * the brightness field in connection with led_mc_calc_color_components()
> + * to perform the intensity scaling in software.
> + * Such drivers should set max_intensity to 0 to signal the multicolor LED core
> + * that the maximum global brightness of the LED class device should be used for
> + * limiting incoming intensity values.
> + *
> + * Multicolor LEDs that do support global brightness control in hardware
> + * should instead set max_intensity to the maximum intensity value supported
> + * by the hardware for a given color component.
> + */
> struct mc_subled {
> unsigned int color_index;
> unsigned int brightness;
> unsigned int intensity;
> + unsigned int max_intensity;
> unsigned int channel;
> };
>
> @@ -53,7 +74,14 @@ int led_classdev_multicolor_register_ext(struct device *parent,
> */
> void led_classdev_multicolor_unregister(struct led_classdev_mc *mcled_cdev);
>
> -/* Calculate brightness for the monochrome LED cluster */
> +/**
> + * led_mc_calc_color_components() - Calculates component brightness values of a LED cluster.
> + * @mcled_cdev - Multicolor LED class device of the LED cluster.
> + * @brightness - Global brightness of the LED cluster.
> + *
> + * Calculates the brightness values for each color component of a monochrome LED cluster,
> + * see Documentation/leds/leds-class-multicolor.rst for details.
> + */
> int led_mc_calc_color_components(struct led_classdev_mc *mcled_cdev,
> enum led_brightness brightness);
>
^ permalink raw reply
* Re: [PATCH v10 12/21] gpu: nova-core: mm: Add unified page table entry wrapper enums
From: Joel Fernandes @ 2026-04-08 19:04 UTC (permalink / raw)
To: Danilo Krummrich
Cc: Eliot Courtney, linux-kernel, Miguel Ojeda, Boqun Feng, Gary Guo,
Bjorn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
Trevor Gross, Dave Airlie, Daniel Almeida, Koen Koning, dri-devel,
rust-for-linux, Nikola Djukic, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Simona Vetter, Jonathan Corbet,
Alex Deucher, Christian Koenig, Jani Nikula, Joonas Lahtinen,
Vivi Rodrigo, Tvrtko Ursulin, Rui Huang, Matthew Auld,
Matthew Brost, Lucas De Marchi, Thomas Hellstrom, Helge Deller,
Alex Gaynor, Boqun Feng, John Hubbard, Alistair Popple,
Timur Tabi, Edwin Peer, Alexandre Courbot, Andrea Righi,
Andy Ritger, Zhi Wang, Balbir Singh, Philipp Stanner,
Elle Rhumsaa, alexeyi, joel, linux-doc, amd-gfx, intel-gfx,
intel-xe, linux-fbdev
In-Reply-To: <DHNYXGAVLNVM.3TV1VDZ9YXA9U@kernel.org>
On Wed, Apr 08, 2026 at 08:01:01PM +0200, Danilo Krummrich <dakr@kernel.org> wrote:
> On Wed Apr 8, 2026 at 6:58 PM CEST, Joel Fernandes wrote:
>> So you're making the code much much worse than before actually. We don't
>> new traits and types pointlessly.
>
> I had a look at both approaches and yes, the traits can be considered
> boilerplate. But, they are not complex and they just list method signatures that
> each version's types already implement functionality wise and they get us rid of
> a lot of dispatch sites. The implementation turns out cleaner as there is less
> parameter threading throughout call chains, etc. Overall, it seems more
> scalable.
>
> On the other hand, there are indeed more abstractions and type indirections to
> understand in Eliot's code. I.e. there are advantages and disadvantages to both
> approaches.
>
> That said, please engage with Eliot's proposal, it is not as off as your reply
> implies and dismissing it right away is not what I'd like to see in this
> situation.
No one is “dismissing it right away”. If you read all the threads, you will see
that I already came up with the same approach and have spent time coding
it up myself to learn its potential advantages and drawbacks. :-) I would never
dismiss anyone’s suggestion just for the sake of it, without evaluating. Hence
most of my morning went in collecting data with both approaches (more on that in
another upcoming reply).
Also yes of course we have to look at all the pros and cons and carefully make a
choice collectively, there are tradeoffs in both approaches.
thanks,
--
Joel Fernandes
^ permalink raw reply
* Re: [PATCH] doc: watchdog: fix typos etc.
From: Björn Persson @ 2026-04-08 18:56 UTC (permalink / raw)
To: Randy Dunlap
Cc: Andrew Morton, Jonathan Corbet, Shuah Khan, linux-doc,
linux-kernel
In-Reply-To: <20260408031212.2510235-1-rdunlap@infradead.org>
[-- Attachment #1: Type: text/plain, Size: 448 bytes --]
Randy Dunlap wrote:
> -Similarly to the softlockup case, the current stack trace is displayed
> +Similar to the softlockup case, the current stack trace is displayed
"Similarly" modifies "is displayed", so the adverbial form is correct.
> -The core of the detectors in a hrtimer. It servers multiple purpose:
> +The core of the detectors is an hrtimer. It servers multiple purposes:
And "servers" should be "serves".
Björn Persson
[-- Attachment #2: OpenPGP digital signatur --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH 12/24] nfsd: add data structures for handling CB_NOTIFY
From: Chuck Lever @ 2026-04-08 18:39 UTC (permalink / raw)
To: Jeff Layton, Alexander Viro, Christian Brauner, Jan Kara,
Chuck Lever, Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs
In-Reply-To: <20260407-dir-deleg-v1-12-aaf68c478abd@kernel.org>
On Tue, Apr 7, 2026, at 9:21 AM, Jeff Layton wrote:
> Add the data structures, allocation helpers, and callback operations
> needed for directory delegation CB_NOTIFY support:
>
> - struct nfsd_notify_event: carries fsnotify events for CB_NOTIFY
> - struct nfsd4_cb_notify: per-delegation state for notification handling
> - Union dl_cb_fattr with dl_cb_notify in nfs4_delegation since a
> delegation is either a regular file delegation or a directory
> delegation, never both
>
> Refactor alloc_init_deleg() into a common __alloc_init_deleg() base
> with a pluggable sc_free callback, and add alloc_init_dir_deleg() which
> allocates the page array and notify4 buffer needed for CB_NOTIFY
> encoding.
>
> Add skeleton nfsd4_cb_notify_ops with done/release handlers that will
> be filled in when the notification path is wired up.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 4afe7e68fb51..b2b8c454fc0f 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -3381,6 +3440,30 @@ nfsd4_cb_getattr_release(struct nfsd4_callback
> *cb)
> nfs4_put_stid(&dp->dl_stid);
> }
>
> +static int
> +nfsd4_cb_notify_done(struct nfsd4_callback *cb,
> + struct rpc_task *task)
> +{
> + switch (task->tk_status) {
> + case -NFS4ERR_DELAY:
> + rpc_delay(task, 2 * HZ);
> + return 0;
> + default:
> + return 1;
> + }
> +}
> +
> +static void
> +nfsd4_cb_notify_release(struct nfsd4_callback *cb)
> +{
> + struct nfsd4_cb_notify *ncn =
> + container_of(cb, struct nfsd4_cb_notify, ncn_cb);
> + struct nfs4_delegation *dp =
> + container_of(ncn, struct nfs4_delegation, dl_cb_notify);
> +
> + nfs4_put_stid(&dp->dl_stid);
> +}
> +
> static const struct nfsd4_callback_ops nfsd4_cb_recall_any_ops = {
> .done = nfsd4_cb_recall_any_done,
> .release = nfsd4_cb_recall_any_release,
So when a client responds with NFS4ERR_DELAY, the RPC framework retries
after 2s. On retry, prepare() is called again, but ncn_evt_cnt is
already 0 (drained in the first prepare). prepare returns false, which
destroys the callback.
Events arriving during the retry window are dropped because
nfsd4_run_cb_notify() returns early when NFSD4_CALLBACK_RUNNING is set.
After the callback is destroyed, future events can queue a new CB_NOTIFY,
but the window's events are lost.
The result is that the client misses notifications. Does this impact
behavioral correctness or spec compliance? Is there a way for that
client to detect the loss and recover?
--
Chuck Lever
^ permalink raw reply
* Re: [PATCH 13/24] nfsd: add notification handlers for dir events
From: Chuck Lever @ 2026-04-08 18:34 UTC (permalink / raw)
To: Jeff Layton, Alexander Viro, Christian Brauner, Jan Kara,
Chuck Lever, Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs
In-Reply-To: <20260407-dir-deleg-v1-13-aaf68c478abd@kernel.org>
On Tue, Apr 7, 2026, at 9:21 AM, Jeff Layton wrote:
> Add the necessary parts to accept a fsnotify callback for directory
> change event and create a CB_NOTIFY request for it. When a dir nfsd_file
> is created set a handle_event callback to handle the notification.
>
> Use that to allocate a nfsd_notify_event object and then hand off a
> reference to each delegation's CB_NOTIFY. If anything fails along the
> way, recall any affected delegations.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index b2b8c454fc0f..339c3d0bb575 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -9796,3 +9887,118 @@ nfsd_get_dir_deleg(struct nfsd4_compound_state
> *cstate,
> put_nfs4_file(fp);
> return ERR_PTR(status);
> }
> +
> +static void
> +nfsd4_run_cb_notify(struct nfsd4_cb_notify *ncn)
> +{
> + struct nfs4_delegation *dp = container_of(ncn, struct
> nfs4_delegation, dl_cb_notify);
> +
> + if (test_and_set_bit(NFSD4_CALLBACK_RUNNING, &ncn->ncn_cb.cb_flags))
> + return;
> +
> + if (!refcount_inc_not_zero(&dp->dl_stid.sc_count))
> + clear_bit(NFSD4_CALLBACK_RUNNING, &ncn->ncn_cb.cb_flags);
> + else
> + nfsd4_run_cb(&ncn->ncn_cb);
> +}
> +
> +static struct nfsd_notify_event *
> +alloc_nfsd_notify_event(u32 mask, const struct qstr *q, struct dentry
> *dentry)
> +{
> + struct nfsd_notify_event *ne;
> +
> + ne = kmalloc(sizeof(*ne) + q->len + 1, GFP_KERNEL);
> + if (!ne)
> + return NULL;
> +
> + memcpy(&ne->ne_name, q->name, q->len);
> + refcount_set(&ne->ne_ref, 1);
> + ne->ne_mask = mask;
> + ne->ne_name[q->len] = '\0';
> + ne->ne_namelen = q->len;
> + ne->ne_dentry = dget(dentry);
> + return ne;
> +}
> +
> +static bool
> +should_notify_deleg(u32 mask, struct file_lease *fl)
> +{
> + /* Only nfsd leases */
> + if (fl->fl_lmops != &nfsd_lease_mng_ops)
> + return false;
> +
> + /* Skip if this event wasn't ignored by the lease */
> + if ((mask & FS_DELETE) && !(fl->c.flc_flags & FL_IGN_DIR_DELETE))
> + return false;
> + if ((mask & FS_CREATE) && !(fl->c.flc_flags & FL_IGN_DIR_CREATE))
> + return false;
> + if ((mask & FS_RENAME) && !(fl->c.flc_flags & FL_IGN_DIR_RENAME))
> + return false;
> +
> + return true;
> +}
For a cross-directory rename, vfs_rename calls try_break_deleg(old_dir,
LEASE_BREAK_DIR_DELETE, ...). A delegation with FL_IGN_DIR_DELETE
(subscribed to NOTIFY4_REMOVE_ENTRY) suppresses the lease break, which
is correct.
But fsnotify delivers FS_RENAME on old_dir, not FS_DELETE. In
should_notify_deleg(), the check (mask & FS_RENAME) &&
!(fl->c.flc_flags & FL_IGN_DIR_RENAME) fails, because the delegation
has FL_IGN_DIR_DELETE but not FL_IGN_DIR_RENAME. No notification is
sent.
IIUC, a client subscribed to NOTIFY4_REMOVE_ENTRY for old_dir sees
neither a lease break nor a CB_NOTIFY when a child is renamed out of
the directory. Is that behavior correct?
> +
> +static void
> +nfsd_recall_all_dir_delegs(const struct inode *dir)
> +{
> + struct file_lock_context *ctx = locks_inode_context(dir);
> + struct file_lock_core *flc;
> +
> + spin_lock(&ctx->flc_lock);
> + list_for_each_entry(flc, &ctx->flc_lease, flc_list) {
> + struct file_lease *fl = container_of(flc, struct file_lease, c);
> +
> + if (fl->fl_lmops == &nfsd_lease_mng_ops)
> + nfsd_break_deleg_cb(fl);
> + }
> + spin_unlock(&ctx->flc_lock);
> +}
> +
> +int
> +nfsd_handle_dir_event(u32 mask, const struct inode *dir, const void
> *data,
> + int data_type, const struct qstr *name)
> +{
> + struct dentry *dentry = fsnotify_data_dentry(data, data_type);
> + struct file_lock_context *ctx;
> + struct file_lock_core *flc;
> + struct nfsd_notify_event *evt;
> +
> + /* Don't do anything if this is not an expected event */
> + if (!(mask & (FS_CREATE|FS_DELETE|FS_RENAME)))
> + return 0;
> +
> + ctx = locks_inode_context(dir);
> + if (!ctx || list_empty(&ctx->flc_lease))
> + return 0;
> +
> + evt = alloc_nfsd_notify_event(mask, name, dentry);
> + if (!evt) {
> + nfsd_recall_all_dir_delegs(dir);
> + return 0;
> + }
> +
> + spin_lock(&ctx->flc_lock);
> + list_for_each_entry(flc, &ctx->flc_lease, flc_list) {
> + struct file_lease *fl = container_of(flc, struct file_lease, c);
> + struct nfs4_delegation *dp = flc->flc_owner;
> + struct nfsd4_cb_notify *ncn = &dp->dl_cb_notify;
> +
> + if (!should_notify_deleg(mask, fl))
> + continue;
> +
> + spin_lock(&ncn->ncn_lock);
> + if (ncn->ncn_evt_cnt >= NOTIFY4_EVENT_QUEUE_SIZE) {
> + /* We're generating notifications too fast. Recall. */
> + spin_unlock(&ncn->ncn_lock);
> + nfsd_break_deleg_cb(fl);
> + continue;
> + }
> + ncn->ncn_evt[ncn->ncn_evt_cnt++] = nfsd_notify_event_get(evt);
> + spin_unlock(&ncn->ncn_lock);
> +
> + nfsd4_run_cb_notify(ncn);
> + }
> + spin_unlock(&ctx->flc_lock);
> + nfsd_notify_event_put(evt);
> + return 0;
> +}
--
Chuck Lever
^ permalink raw reply
* Re: [PATCH 08/24] nfsd: update the fsnotify mark when setting or removing a dir delegation
From: Chuck Lever @ 2026-04-08 18:24 UTC (permalink / raw)
To: Jeff Layton, Alexander Viro, Christian Brauner, Jan Kara,
Chuck Lever, Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs
In-Reply-To: <20260407-dir-deleg-v1-8-aaf68c478abd@kernel.org>
On Tue, Apr 7, 2026, at 9:21 AM, Jeff Layton wrote:
> Add a new helper function that will update the mask on the nfsd_file's
> fsnotify_mark to be a union of all current directory delegations on an
> inode. Call that when directory delegations are added or removed.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index c8fb84c38637..9a4cff08c67d 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -1266,6 +1297,7 @@ static void nfs4_unlock_deleg_lease(struct
> nfs4_delegation *dp)
> WARN_ON_ONCE(!fp->fi_delegees);
>
> nfsd4_finalize_deleg_timestamps(dp, nf->nf_file);
> + nfsd_fsnotify_recalc_mask(nf);
> kernel_setlease(nf->nf_file, F_UNLCK, NULL, (void **)&dp);
> put_deleg_file(fp);
> }
The grant path in nfsd_get_dir_deleg() uses a different ordering
(setlease first, recalc_mask after).
Here, since the delegation being removed is still in flc_lease,
inode_lease_ignore_mask() includes its ignore flags. The mask is
computed as if the delegation is still present.
The result is that stale FS_CREATE/FS_DELETE/FS_RENAME bits remain
in the fsnotify mark. It might be harmless in practice since the
handler finds no leases and returns early, but it creates
unnecessary work.
Should nfs4_unlock_deleg_lease call nfsd_fsnotify_recalc_mask()
after kernel_setlease(F_UNLCK)?
--
Chuck Lever
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox