* Re: [PATCH v2 06/31] x86/virt/tdx: Read global metadata for TDX Module Extensions/Connect
From: Xu Yilun @ 2026-04-08 6:17 UTC
To: Huang, Kai
Cc: Williams, Dan J, linux-pci@vger.kernel.org,
linux-coco@lists.linux.dev, x86@kernel.org, Gao, Chao,
Edgecombe, Rick P, Xu, Yilun, Jiang, Dave,
dave.hansen@linux.intel.com, baolu.lu@linux.intel.com,
Duan, Zhenzhong, kas@kernel.org, Verma, Vishal L, Li, Xiaoyao,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <123290f2bd1fb9c98bf494650c5912a3bf080114.camel@intel.com>
On Wed, Apr 01, 2026 at 09:36:18PM +0000, Huang, Kai wrote:
> On Sat, 2026-03-28 at 00:01 +0800, Xu Yilun wrote:
> > Add reading of the global metadata for TDX Module Extensions & TDX
> > Connect. Add them in a batch as TDX Connect is currently the only user
> > of TDX Module Extensions and no way to initialize TDX Module Extensions
> > without firstly enabling TDX Connect.
> >
> > TDX Module Extensions & TDX Connect are optional features enumerated by
> > TDX_FEATURES0. Check the TDX_FEATURES0 before reading these metadata to
> > avoid failing the whole TDX initialization.
>
> Maybe it's better to split this patch into two, one to read generic "TDX
> Module Extension" related global metadata, and the other to read TDX Connect
> specific ones?
>
> They are logically two separate things anyway. And there are other features
> also need to enable TDX Module Extensions (e.g., NRX for migration), and we
> can just reuse the generic metadata patch from this series.
Will do.
* Re: [PATCH v2 05/31] x86/virt/tdx: Extend tdx_page_array to support IOMMU_MT
From: Xu Yilun @ 2026-04-08 6:16 UTC
To: Huang, Kai
Cc: Williams, Dan J, linux-pci@vger.kernel.org,
linux-coco@lists.linux.dev, x86@kernel.org, Gao, Chao,
Edgecombe, Rick P, Xu, Yilun, Jiang, Dave,
dave.hansen@linux.intel.com, baolu.lu@linux.intel.com,
Duan, Zhenzhong, kas@kernel.org, Verma, Vishal L, Li, Xiaoyao,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <e290bd999c5cec93bf0b611f5506ba1e8a23c81e.camel@intel.com>
On Thu, Apr 02, 2026 at 12:05:43AM +0000, Huang, Kai wrote:
> On Sat, 2026-03-28 at 00:01 +0800, Xu Yilun wrote:
> > IOMMU_MT is another TDX Module defined structure similar to HPA_ARRAY_T
> > and HPA_LIST_INFO. The difference is it requires multi-order contiguous
> > pages for some entries. It adds an additional NUM_PAGES field for every
> > multi-order page entry.
> >
> > Add a dedicated allocation helper for IOMMU_MT. Fortunately put_page()
> > works well for both single pages and multi-order folios, simplifying the
> > cleanup logic for all allocation methods.
>
> Well I guess you can have a 'free_fn' to free the pages you allocated via
> 'alloc_fn'? Will this simplify the code and at least keep tdx_page_array
> implementation cleaner?
mm.. I think the code would be simpler with fewer callbacks.
But anyway, the need for alloc_fn is a sign that I should think
about separating the memory allocation from struct tdx_page_array
construction, especially since IOMMU_MT needs a specialized memory
layout that is better managed by the kernel driver that actually uses IOMMU_MT.
>
> It's strange that you only have a 'alloc_fn' but doesn't have a 'free_fn'
> anyway.
* Re: [PATCH v2 05/31] x86/virt/tdx: Extend tdx_page_array to support IOMMU_MT
From: Xu Yilun @ 2026-04-08 4:29 UTC
To: Edgecombe, Rick P
Cc: Gao, Chao, Xu, Yilun, x86@kernel.org, kas@kernel.org,
baolu.lu@linux.intel.com, dave.hansen@linux.intel.com,
Li, Xiaoyao, Williams, Dan J, Jiang, Dave,
linux-pci@vger.kernel.org, linux-coco@lists.linux.dev,
linux-kernel@vger.kernel.org, Duan, Zhenzhong, Verma, Vishal L,
kvm@vger.kernel.org
In-Reply-To: <f38d0a080aee052937cb6721683d55155c657717.camel@intel.com>
On Wed, Apr 01, 2026 at 12:17:45AM +0000, Edgecombe, Rick P wrote:
> On Tue, 2026-03-31 at 22:19 +0800, Xu Yilun wrote:
> > > Consider the amount of tricks that are needed to coax the tdx_page_array to
> > > populate the handoff page as needed. It adds 2 pages here, then subtracts
> > > them
> > > later in the callback. Then tweaks the pa in tdx_page_array_populate() to
> > > add
> > > the length...
> >
> > mm.. The tricky part is the specific memory requirement/allocation, the
> > common part is the pa list contained in a root page. Maybe we only model
> > the later, let the specific user does the memory allocation. Is that
> > closer to your "break concepts apart" idea?
>
> I haven't wrapped my head around this enough to suggest anything is definitely
> the right approach.
>
> But yes, the idea would be that the allocation of the list of pages to give to
> the TDX module would be a separate allocation and set of management functions.
> And the the allocation of the pages that are used to communicate the list of
> pages (and in this case other args) with the module would be another set. So
> each type of TDX module arg page format (IOMMU_MT, etc) would be separable, but
> share the page list allocation part only. It looks like Nikolay was probing
> along the same path. Not sure if he had the same solution in mind.
>
> So for this:
> 1. Allocate a list or array of pages using a generic method.
> 2. Allocate these two IOMMU special pages.
> 3. Allocate memory needed for the seamcall (root pages)
>
> Hand all three to the wrapper and have it shove them all through in the special
> way it prefers.
>
> Maybe... Can you write something about the similarities and differences with the
> three types of lists in that series? Like in a compact form?
The common part:
64bit obj type root page
+----------+----------------+ +---------------------+
| ... | ... | | page0 HPA(bit12-51) |--> page0
+----------+----------------+ +---------------------+
|bit 12-51 | root page HPA |--->| page1 HPA |--> page1
+----------+----------------+ +---------------------+
| ... | ... | | pageX HPA |--> pageX
+----------+----------------+ +---------------------+
The specific objects:
HPA_LIST_INFO root page
+----------+----------------+ +---------------------+
|bit 3-11 | first entry | | page0 HPA(bit12-51) |
+----------+----------------+ +---------------------+
|bit 12-51 | root page HPA |--->| page1 HPA |
+----------+----------------+ +---------------------+
|bit 55-63 | last entry | | pageX HPA |
+----------+----------------+ +---------------------+
HPA_ARRAY_T root page HPA_ARRAY_T(singleton mode)
+----------+----------------+ +---------------------+ +----------+----------------+
|bit 3-11 | Reserved 0 | | page0 HPA(bit12-51) | |bit 3-11 | Reserved 0 |
+----------+----------------+ +---------------------+ +----------+----------------+
|bit 12-51 | root page HPA |--->| page1 HPA | |bit 12-51 | page0 HPA |--> page0
+----------+----------------+ +---------------------+ +----------+----------------+
|bit 55-63 | last entry | | pageX HPA | |bit 55-63 | last entry |
+----------+----------------+ +---------------------+ +----------+----------------+
IOMMU_MT root page
+----------+----------------+ +-----------------------------+-------------------+
|bit 3-11 | Reserved 0 | | 2^order page0 HPA(bit12-51) |num pages(bit 0-11)|
+----------+----------------+ +-----------------------------+-------------------+
|bit 12-51 | root page HPA |--->| 2^order page1 HPA |num pages |
+----------+----------------+ +-----------------------------+-------------------+
|bit 55-63 | Reserved 0 | | page2 HPA |0 |
+----------+----------------+ +-----------------------------+-------------------+
| page3 HPA |0 |
+-----------------------------+-------------------+
| pageX HPA |0 |
+-----------------------------+-------------------+
What they have in common is the root_page_hpa -> root_page -> page_hpa_list structure.
The differences:
                HPA_LIST_INFO   HPA_ARRAY_T   IOMMU_MT   Note
first entry     Y               N             N          start entry in root page
last entry      Y               Y             N          last entry in root page
num pages       always 0        always 0      Y          for multi-order page
singleton       N               Y             N          try to save a root page
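A minimal sketch of the three encodings, written as a Python model rather than kernel code, may make the comparison concrete. It only follows the bit positions drawn in the diagrams (bits 3-11 first entry, bits 12-51 HPA, bits 55-63 last entry, bits 0-11 num pages in an IOMMU_MT root page entry). All function and constant names are hypothetical, and storing num_pages as a raw count is an assumption; the TDX Module spec is the authority on the exact encoding.

```python
# Illustrative model only; names are hypothetical, not from the patch series.
PAGE_SHIFT = 12
HPA_MASK = ((1 << 52) - 1) & ~((1 << PAGE_SHIFT) - 1)  # bits 12-51
ENTRY_MAX = 0x1FF      # 9-bit entry index (bits 3-11 or bits 55-63)
NUM_PAGES_MAX = 0xFFF  # 12-bit num pages field (bits 0-11)


def _check_hpa(hpa):
    """Reject addresses that are not 4K aligned or exceed bit 51."""
    if hpa & ~HPA_MASK:
        raise ValueError("HPA must be 4K aligned and fit in bits 12-51")
    return hpa


def _check_entry(idx):
    """Reject entry indexes that do not fit in 9 bits."""
    if not 0 <= idx <= ENTRY_MAX:
        raise ValueError("entry index must fit in 9 bits")
    return idx


def pack_hpa_list_info(root_hpa, first_entry, last_entry):
    """HPA_LIST_INFO: first entry, root page HPA and last entry."""
    return ((_check_entry(first_entry) << 3)
            | _check_hpa(root_hpa)
            | (_check_entry(last_entry) << 55))


def pack_hpa_array(hpa, last_entry):
    """HPA_ARRAY_T: bits 3-11 stay reserved 0. In singleton mode 'hpa'
    is the data page itself instead of a root page."""
    return _check_hpa(hpa) | (_check_entry(last_entry) << 55)


def pack_iommu_mt(root_hpa):
    """IOMMU_MT: only the root page HPA, both index fields reserved 0."""
    return _check_hpa(root_hpa)


def iommu_mt_entry(page_hpa, num_pages=0):
    """IOMMU_MT root page entry: page HPA plus num pages in bits 0-11.
    Assumption: num_pages is stored as-is, 0 for a single page."""
    if not 0 <= num_pages <= NUM_PAGES_MAX:
        raise ValueError("num_pages must fit in 12 bits")
    return _check_hpa(page_hpa) | num_pages
```

For example, iommu_mt_entry(0x200000, 4) evaluates to 0x200004, while the same page in an HPA_LIST_INFO or HPA_ARRAY_T root page would simply be 0x200000. The only per-type differences are which of the side fields are populated, which is what the table above summarizes.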
>
> Also, how much of the earlier code duplication you wanted to avoid was the
> leaking and special error handling stuff?
This is indeed a large part, and now we don't need it anymore.
Others are:
- the root_page allocation/population/free
- Too many parameters (struct page **, num_pages, struct page *root...)
for seamcall wrappers. Or 3 newly defined structures which look
pretty much the same and need the same implementations like
tdx_clflush_page().
Thanks,
Yilun
* Re: [PATCH v2 00/16] fs,x86/resctrl: Add kernel-mode (e.g., PLZA) support to the resctrl subsystem
From: Reinette Chatre @ 2026-04-08 4:45 UTC
To: Babu Moger, corbet@lwn.net, tony.luck@intel.com,
Dave.Martin@arm.com, james.morse@arm.com, tglx@kernel.org,
mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
Cc: skhan@linuxfoundation.org, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, kas@kernel.org, rick.p.edgecombe@intel.com,
akpm@linux-foundation.org, pmladek@suse.com,
rdunlap@infradead.org, dapeng1.mi@linux.intel.com,
kees@kernel.org, elver@google.com, paulmck@kernel.org,
lirongqing@baidu.com, safinaskar@gmail.com, fvdl@google.com,
seanjc@google.com, pawan.kumar.gupta@linux.intel.com,
xin@zytor.com, tiala@microsoft.com, Neeraj.Upadhyay@amd.com,
chang.seok.bae@intel.com, Lendacky, Thomas,
elena.reshetova@intel.com, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
kvm@vger.kernel.org, eranian@google.com, peternewman@google.com
In-Reply-To: <c6f574b7-fe5f-49ae-9865-0e4dbb2f9803@amd.com>
Hi Babu,
On 4/7/26 6:01 PM, Babu Moger wrote:
> Hi Reinette,
>
> On 4/7/26 12:48, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 4/6/26 3:45 PM, Babu Moger wrote:
>>> Hi Reinette,
>>>
>>> Sorry for the late response. I was trying to get confirmation about the use case.
>>
>> No problem. I appreciate that you did this so that we can make sure resctrl supports
>> needed use cases.
>>
>>>
>>> On 3/31/26 17:24, Reinette Chatre wrote:
>>>> On 3/30/26 11:46 AM, Babu Moger wrote:
>>>>> On 3/27/26 17:11, Reinette Chatre wrote:
>>>>>> On 3/26/26 10:12 AM, Babu Moger wrote:
>>>>>>> On 3/24/26 17:51, Reinette Chatre wrote:
>>>>>>>> On 3/12/26 1:36 PM, Babu Moger wrote:
>>
>>>> can have domains that span different CPUs. There thus seem to be a built in assumption of what a "domain"
>>>> means for PQR_PLZA_ASSOC so it sounds to me as though, instead of saying that "PQR_PLZA_ASSOC needs
>>>> to be the same in QoS domain" it may be more accurate to, for example, say that "PQR_PLZA_ASSOC has L3 scope"?
>>>
>>> Yes.
>>
>> Above is about L3 scope ...
>
> Yes. The scope for PQR_PLZA_ASSOC is L3.
>
> Is that what you are asking here?
I was trying to point out that there appears to be a mismatch between the actual scope and
the planned implementation. As highlighted below during the discussion about "global" this is
fine with me and I just wanted to confirm that this matches your intentions.
>
>>
>>>>
>>>> This seems to be what this implementation does since it hardcodes PQR_PLZA_ASSOC scope to the L3
>>>> resource but that creates dependency to the L3 resource that would make PLZA unusable if, for example,
>>>> the user boots with "rdt=!l3cat" while wanting to use PLZA to manage MBA allocations when in kernel?
>>>
>>> Yes. that is correct. It should not be attached to one resource. We need to change it to global scope.
>>
>> Can I interpret "global scope" as "all online CPUs"? Doing so will simplify
>
> Yes. That is correct.
>
>
>> supporting this feature. It does not sound practical for a user wanting to assign
>> different resource groups to kernel work done in different domains ... the guidance should
>> instead be to just set the allocations of one resource group to what is needed in the different
>> domains? There may be more flexibility when supporting per-domain RMIDs though but so far
>> it sounds as though the focus is global. We can consider what needs to be done to support
>> some type of "per-domain" assignment as exercise whether current interface could support it
>> in the future.
>
> Yes. Makes sense.
>
>>
...
>>> The PLZA MSR is updated when user changes the association to the
>>> file. No context switch code changes are needed. This will be
>>> dedicated group. The current resctrl group files, "cpus, cpus_list
>>
>> Why does this have to be a dedicated group? One of the conclusions from v1
>> discussion was that the "PLZA group" need *not* be a dedicated group. I repeated that
>> in my earlier response that I left quoted above. You did not respond to these
>> conclusions and statements in this regard while you keep coming back to this
>> needing to be a dedicated group without providing a motivation to do so.
>> Could you please elaborate why a dedicated group is required?
>
> If the same group applies identical limits to both user and kernel
> space, it essentially behaves like a current resctrl group. In that
> sense, it’s not really a PLZA group. PLZA’s key value is the ability
> to separate allocations between user space and kernel space. A
The plan has never been to force identical allocations for user and kernel
space since that would go against this feature entirely. Even so, just as
user and kernel space cannot be forced to have identical allocations they
also cannot be forced to have different allocations. Specifically,
a task *can* use the same CLOSID for user and kernel space work just as easily
as it can use *different* CLOSID for user and kernel space work. There
should not be any CLOSID reserved just for kernel work. Or am I missing something?
> single CPU can belong to two groups: one group manages the user-
> space allocation for that CPU, while another manages the kernel-mode
> allocation.
Exactly. This is why it is important to have two files for this CPU association
within a resource group. The cpus/cpus_list file continues to be used as today
while the new kernel_mode_cpus/kernel_mode_cpus_list is used for kernel work.
With this a task can be associated with any resource group for its user space
allocations but when it runs on one of the CPUs within kernel_mode_cpus then
its kernel work will be done with allocations of the resource group the
kernel_mode_cpus file belongs to, which may or may not be the same
resource group that the user space task belongs to.
> This approach also simplifies file handling, which is another reason
> I prefer it.
I *think* we have different interpretations of "dedicated group":
It sounds as though you interpret "dedicated group" as a way that enforces
the same allocations to user space and kernel work.
I interpret "dedicated group" essentially as a CLOSID reserved for kernel
work. Since I do not see that resctrl should dedicate a CLOSID/resource group
for kernel work I have been pushing against such "dedicated group".
> That said, I’m open to not having a dedicated group if we can still support all the features that PLZA provides without it.
I find that enabling user space to share CLOSID/RMID between user space
and kernel space does indeed support what PLZA provides. I think I am missing
something here since the proposal below again attempts to isolate a resource group
(CLOSID) for kernel work.
>>> Add a file, "info/kmode_monitor", to describe how kmode is monitored.
>>>
>>> # cat info/kmode_monitor
>>> [inherit_ctrl_and_mon] <- Kernel uses the same CLOSID/RMID as user. Default option for the "global"
>>> assign_ctrl_inherit_mon <- One CLOSID for all kernel work; RMID inherited from user.
>>> assign_ctrl_assign_mon <- One resource group (CLOSID+RMID) for all kernel work. Default option for "cpu" type.
>>
>> My first thought is that the naming is confusing. resctrl has a very strong relationship between
>> "RMID" and "monitoring" so naming a file "monitor" that deals with allocation/ctrl/CLOSID is
>> potentially confusion.
>>
>> Apart from that, while I think I understand where you are going by separating the mode into
>> two files I am concerned about future complications needing to accommodate all different
>> combinations of the (now) essentially two modes. My preference is thus to keep this simple by
>> keeping the mode within one file.
>>
>> Even so, when stepping back, it does not really look like we need to separate the "global"
>> and "per CPU" modes. We could just have a single "per CPU" mode and the "global" is just
>> its default of "all CPUs", no?
>
> Yes. That correct.
>
>>
>> Consider, for example, the implementation just consisting of:
>>
>> # cat info/kernel_mode
>> [inherit_ctrl_and_mon]
>> global_assign_ctrl_inherit_mon_per_cpu
>> global_assign_ctrl_assign_mon_per_cpu
>>
>>>
>>> Rename “kernel_mode_assignment” to “kmode_group” to assign the specific group to kmode. This file usage is same as before.
>>>
>>> #cat info/kmode_groups (Renamed "kernel_mode_assignment")
>>> //
>>
>> Please consider the intent of this file when thinking about names. The idea is that "info/kernel_mode"
>> specifies the "mode" of how kernel work is handled and it determines the configuration files used in that
>> mode as well as the syntax when interacting with those files. By renaming "kernel_mode_assignment" to
>> "kmode_groups" it implicitly requires all future kernel mode enhancements to need some data related to "groups".
>>
>> In summary, I think this can be simplified by introducing just two new files in info/ that enables the
>> user to (a) select and (b) configure the "kernel mode". To start there can be just two modes,
>> global_assign_ctrl_inherit_mon_per_cpu and global_assign_ctrl_assign_mon_per_cpu.
>> global_assign_ctrl_inherit_mon_per_cpu mode requires a control group in kernel_mode_assignment while
>> global_assign_ctrl_assign_mon_per_cpu requires a control and monitoring group.
>>
>> The resource group in info/kernel_mode_assignment gets two additional files "kernel_mode_cpus" and
>> "kernel_mode_cpus_list" that contains the CPUs enabled with the kernel mode configuration, by default
>> it will be all online CPUs. The resource group can continue to be used to manage allocations of and
>> monitor user space tasks. Specifically, the "cpus", "cpus_list", and "tasks" files remain.
>>
>> A user wanting just "global" settings will get just that when writing the group to
>> info/kernel_mode_assignment. A user wanting "per CPU" settings can follow the
>> info/kernel_mode_assignment setting with changes to that resource group's kernel_mode_cpus/kernel_mode_cpus_list
>> files. Any task running on a CPU that is *not* in kernel_mode_cpus/kernel_mode_cpus_list can be
>> expected to inherit both CLOSID and RMID from user space for all kernel work.
>
> After further consideration, I don’t think the info/kernel_mode file
> is necessary. There’s no need to enforce a specific mode for all the
> PLZA groups. Avoiding this constraint makes the design more
> flexible, particularly as we move toward supporting multiple PLZA
> groups in the future. MPAM already appears capable of handling more
> than one group—for example, one group could use
> inherit_ctrl_and_mon, while another could use
> global_assign_ctrl_inherit_mon_per_cpu.
You are looking ahead at future capabilities for which we do not know all requirements
at this time. I think it is very good to consider how things may progress and your example
of MPAM is of course on point. I believe the current design does consider this progression.
Please see https://lore.kernel.org/lkml/2ab556af-095b-422b-9396-f845c6fd0342@intel.com/
(search for "per_group_assign_ctrl_assign_mon"). In that exploration per-group assignment
is actually accomplished with global files. I thus think we should not make such a big
architectural decision that does not benefit the immediate feature using partial information.
As it is, an "info/kernel_mode" file gives the flexibility to expand to, if needed, configuration
files within a resource group. That is why the intention is to associate the mode within
info/kernel_mode with the presence/absence of info/kernel_mode_assignment (search for
"Visibility depends on active mode in info/kernel_mode" in linked email) since in the
future resctrl may need to enable a mode that needs configuration files within each
resource group and when enabling such mode the per-resource group files will appear
instead of the global info/kernel_mode_assignment.
>
> The mode can simply be determined on a per-group basis. We can introduce two new files—kernel_mode_cpus and kernel_mode_cpus_list—within each resctrl group when kmode (or PLZA) is supported.
I think having these files in every resource group is confusing since user can only interact
with these files in one resource group for current PLZA. Why not *just* have the files in the
resource group that matches the group in info/kernel_mode_assignment?
>
> The info/kernel_mode_assignment file would indicate which resctrl
> group(or groups) is used for PLZA. The files—kernel_mode_cpus and
> kernel_mode_cpus_list would indicate how the plza is applied which
> each group.
The "how PLZA is applied" should be learned from info/kernel_mode where user
space learns whether RMID is inherited or not. I see kernel_mode_cpus
and kernel_mode_cpus_list as being just for configuration and found only in the
resource group listed in info/kernel_mode_assignment.
>
> Files and behavior:
> - cpus / cpus_list:
>
> CPUs listed here use the same allocation for both user and kernel space.
Both user and kernel space?
Monitoring would depend on info/kernel_mode_assignment ("inherit_mon")
and kernel space allocation would depend on whether the CPU on which the task runs
can be found in kernel_mode_cpus, no?
> There is no change to the current semantics of these files.
> If these files are empty, the group effectively becomes a PLZA-dedicated group.
I do not see it this way. If the cpus/cpus_list files are empty then it means that the
tasks in the group will use their own CLOSID/RMID for user space allocation and
monitoring. What allocations/monitoring is used by tasks when in kernel mode depends
on whether the CPU the task is running on can be found in a kernel_mode_cpus/kernel_mode_cpus_list
file. If it can, then the task's kernel work will inherit the PQR_PLZA setting of that CPU, which is
the allocation associated with the resource group to which that kernel_mode_cpus/kernel_mode_cpus_list belongs.
If the CPU the task is running on cannot be found in kernel_mode_cpus/kernel_mode_cpus_list
then its kernel work will inherit its user space allocations and monitoring.
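The lookup described above can be sketched as a small Python model. This is illustrative only and not resctrl code: the Group class and kernel_work_groups() are hypothetical names, the mode strings are the ones proposed for info/kernel_mode earlier in this thread, and the model assumes that only the group written to info/kernel_mode_assignment carries a kernel_mode_cpus mask.

```python
# Illustrative model only; class and function names are hypothetical.
from dataclasses import dataclass, field

MODES = (
    "inherit_ctrl_and_mon",
    "global_assign_ctrl_inherit_mon_per_cpu",
    "global_assign_ctrl_assign_mon_per_cpu",
)


@dataclass
class Group:
    name: str
    # CPUs whose kernel work uses this group (kernel_mode_cpus file).
    kernel_mode_cpus: set = field(default_factory=set)


def kernel_work_groups(mode, user_group, assigned, cpu):
    """Return (ctrl, mon): the group names providing the allocation
    (CLOSID) and the monitoring (RMID) for a task's kernel work.

    user_group: group used for the task's user space work
    assigned:   group listed in info/kernel_mode_assignment, or None
    cpu:        CPU the task is currently running on
    """
    if mode not in MODES:
        raise ValueError("unknown kernel mode: " + mode)
    # CPU not covered by kernel_mode_cpus: inherit everything from user space.
    if (mode == "inherit_ctrl_and_mon" or assigned is None
            or cpu not in assigned.kernel_mode_cpus):
        return (user_group.name, user_group.name)
    if mode == "global_assign_ctrl_inherit_mon_per_cpu":
        return (assigned.name, user_group.name)
    # global_assign_ctrl_assign_mon_per_cpu
    return (assigned.name, assigned.name)
```

In this model nothing prevents user_group and assigned from being the same Group, which is the point made above: the group named in info/kernel_mode_assignment does not have to be dedicated to kernel work, and its cpus, cpus_list and tasks files keep their current meaning.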
>
> - kernel_mode_cpus / kernel_mode_cpus_list:
>
> These files determine whether a separate kernel allocation is applied.
> If empty, user and kernel share the same allocation.
> If non-empty, the kernel uses a separate allocation.
>
> The group can be CTL_MON or MON group. Based on type the group the CLOSID and RMID will be used to enable PLZA. If it is MON, then rmid_en = 1 when writing PLZA MSR.
This will be difficult to get right since CTRL_MON groups also have RMID assigned.
> Here’s the proposed flow:
>
> # mount -t resctrl resctrl /sys/fs/resctrl/
> # cd /sys/fs/resctrl/
> # cat info/kernel_mode_assignment
> //
>
> By default, the root (default) group is PLZA-enabled when resctrl is mounted. All CPUs use CLOSID 0 for both user and kernel-mode allocation.
>
> # cat cpus_list
> 1-64
> # cat kmode_cpus_list
> 1-64
>
> Next, create a new group for PLZA:
>
> # mkdir plza_group
>
> # echo "plza_group//" > info/kernel_mode_assignment
>
> At this point, plza_group becomes the new PLZA-enabled group, and the PLZA-related MSRs are updated accordingly.
It really looks like you are getting back to trying to dedicate a resource group to
kernel work and that is not something that resctrl should enforce.
>
> # cat plza_group/cpus_list
> <empty>
>
> # cat plza_group/kmode_cpus_list
> 1-64
>
> The user can then update kmode_cpus_list to apply PLZA only to a specific subset of CPUs, if desired.
>
>
> What do you think of this approach?
It is difficult to predict what the "next" PLZA will actually end up looking like and I find resctrl creating a complicated
interface to support this to be risky. Instead I would prefer to focus on efficiently supporting what PLZA can do today
and make it extensible. Apart from that I find the implicit interface, "If it is MON, then rmid_en = 1" to be too
architecture specific for a generic interface while also not able to accurately capture the user's intent (the user may
indeed, for example, want "a CTRL_MON group to have rmid_en = 1"). Finally, I am just so confused about why the implementations
keep needing to dedicate a resource group/CLOSID to kernel work.
Reinette
* Re: [PATCH v2 00/16] fs,x86/resctrl: Add kernel-mode (e.g., PLZA) support to the resctrl subsystem
From: Babu Moger @ 2026-04-08 1:01 UTC
To: Reinette Chatre, corbet@lwn.net, tony.luck@intel.com,
Dave.Martin@arm.com, james.morse@arm.com, tglx@kernel.org,
mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
Cc: skhan@linuxfoundation.org, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, kas@kernel.org, rick.p.edgecombe@intel.com,
akpm@linux-foundation.org, pmladek@suse.com,
rdunlap@infradead.org, dapeng1.mi@linux.intel.com,
kees@kernel.org, elver@google.com, paulmck@kernel.org,
lirongqing@baidu.com, safinaskar@gmail.com, fvdl@google.com,
seanjc@google.com, pawan.kumar.gupta@linux.intel.com,
xin@zytor.com, tiala@microsoft.com, Neeraj.Upadhyay@amd.com,
chang.seok.bae@intel.com, Lendacky, Thomas,
elena.reshetova@intel.com, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
kvm@vger.kernel.org, eranian@google.com, peternewman@google.com
In-Reply-To: <3305c18e-9e50-4df0-b9f1-c61028628967@intel.com>
Hi Reinette,
On 4/7/26 12:48, Reinette Chatre wrote:
> Hi Babu,
>
> On 4/6/26 3:45 PM, Babu Moger wrote:
>> Hi Reinette,
>>
>> Sorry for the late response. I was trying to get confirmation about the use case.
>
> No problem. I appreciate that you did this so that we can make sure resctrl supports
> needed use cases.
>
>>
>> On 3/31/26 17:24, Reinette Chatre wrote:
>>> On 3/30/26 11:46 AM, Babu Moger wrote:
>>>> On 3/27/26 17:11, Reinette Chatre wrote:
>>>>> On 3/26/26 10:12 AM, Babu Moger wrote:
>>>>>> On 3/24/26 17:51, Reinette Chatre wrote:
>>>>>>> On 3/12/26 1:36 PM, Babu Moger wrote:
>
>>> can have domains that span different CPUs. There thus seem to be a built in assumption of what a "domain"
>>> means for PQR_PLZA_ASSOC so it sounds to me as though, instead of saying that "PQR_PLZA_ASSOC needs
>>> to be the same in QoS domain" it may be more accurate to, for example, say that "PQR_PLZA_ASSOC has L3 scope"?
>>
>> Yes.
>
> Above is about L3 scope ...
Yes. The scope for PQR_PLZA_ASSOC is L3.
Is that what you are asking here?
>
>>>
>>> This seems to be what this implementation does since it hardcodes PQR_PLZA_ASSOC scope to the L3
>>> resource but that creates dependency to the L3 resource that would make PLZA unusable if, for example,
>>> the user boots with "rdt=!l3cat" while wanting to use PLZA to manage MBA allocations when in kernel?
>>
>> Yes. that is correct. It should not be attached to one resource. We need to change it to global scope.
>
> Can I interpret "global scope" as "all online CPUs"? Doing so will simplify
Yes. That is correct.
> supporting this feature. It does not sound practical for a user wanting to assign
> different resource groups to kernel work done in different domains ... the guidance should
> instead be to just set the allocations of one resource group to what is needed in the different
> domains? There may be more flexibility when supporting per-domain RMIDs though but so far
> it sounds as though the focus is global. We can consider what needs to be done to support
> some type of "per-domain" assignment as exercise whether current interface could support it
> in the future.
Yes. Makes sense.
>
> ...
>
>>>> There are multiple ways this feature can be applied. For simplicity, the discussion below focuses only on CLOSID.
>>>>
>>>>
>>>> 1. Global PLZA enablement
>>>>
>>>> PLZA can be configured as a global feature by setting |PQR_PLZA_ASSOC.closid = CLOSID| and |PQR_PLZA_ASSOC.plza_en = 1| on all threads in the system. A dedicated CLOSID is reserved for this purpose,
>>>
>>> Also discussed during v1 is that there is no need to dedicate a CLOSID for this purpose.
>>> There could be an "unthrottled" CLOSID to which all high priority user space tasks as
>>> well as all kernel work of all tasks are assigned.
>>> If user space chooses to dedicate a CLOSID for kernel work then that should supported and
>>> interface can allow that, but there is no need for resctrl to enforce this.
>
> (above is comment about dedicated group - please see below)
>
>
>> Yes. I agree. The changes in context switch code is a concern.
>>
>> You covered some of the cases I was thinking(xx_set_individual).
>>
>> How about this idea?
>>
>> I suggest splitting the PLZA into two distinct aspects:
>>
>> 1. How PLZA is applied within a resource group
>>
>> 2. How PLZA is monitored
>
> I think I see where you are going here. While the "How PLZA is monitored" naming
> refers to "monitoring" I *think* what you are separating here is (a) how PLZA is configured
> (CLOSID and RMID settings) and (b) how that PLZA configuration is assigned to tasks/CPUs,
> not just within a resource group but across the system. Please see below.
>
>
>> Introduce a new file, "info/kmode_type", to describe how kmode applies in the system.
>
> ack. "in the system" as you have above, not "within a resource group" as mentioned
> before that.
>
>>
>> # cat info/kmode_type
>> [global] <- Kernel mode applies to the entire system (all CPUs/tasks)
>> cpus <- Kernel mode applies only to the CPUs in the group
>> tasks <- Kernel mode applies only to the tasks in the group
>>
>> The "global" option is the default right now and it is current common use-case.
>>
>> The "info/kmode_type -> cpus" option introduces new files
>> "kmode_cpus" and "kmode_cpus_list" for users to apply kmode to
>> specific set of CPUs. This lets users change the CPU set for PLZA.
> Where were you thinking about placing these files in the hierarchy?
It needs to be inside the resctrl group (in struct rdtgroup).
>
>> The PLZA MSR is updated when user changes the association to the
>> file. No context switch code changes are needed. This will be
>> dedicated group. The current resctrl group files, "cpus, cpus_list
>
> Why does this have to be a dedicated group? One of the conclusions from v1
> discussion was that the "PLZA group" need *not* be a dedicated group. I repeated that
> in my earlier response that I left quoted above. You did not respond to these
> conclusions and statements in this regard while you keep coming back to this
> needing to be a dedicated group without providing a motivation to do so.
> Could you please elaborate why a dedicated group is required?
If the same group applies identical limits to both user and kernel
space, it essentially behaves like a current resctrl group. In that
sense, it’s not really a PLZA group. PLZA’s key value is the ability to
separate allocations between user space and kernel space. A single CPU
can belong to two groups: one group manages the user-space allocation
for that CPU, while another manages the kernel-mode allocation.
This approach also simplifies file handling, which is another reason I
prefer it.
That said, I’m open to not having a dedicated group if we can still
support all the features that PLZA provides without it.
>
>
>> and tasks" will not be accessible in this mode. This option give
>
> These files can continue to be accessible.
ok.
>
>> some flexibility for the user without the context switch overhead.
>
> Dedicating a resource group to PLZA removes flexibility though, no?
Yes, but it makes the files easier to handle, as I mentioned above.
>
>>
>> The "info/kmode_type -> tasks" option introduces a new file,
>> "kmode_tasks", for users to apply kmode to specific set of tasks.
>> This requires context switch changes. This will be dedicated group.
>> The current resctrl group files, "cpus, cpus_list and tasks" will
>> not be accessible in this mode. We currently have no use case for
>> this, so it will not be supported now.
>
> Thank you for confirming. This is a relief.
>
>>
>>
>> Add a file, "info/kmode_monitor", to describe how kmode is monitored.
>>
>> # cat info/kmode_monitor
>> [inherit_ctrl_and_mon] <- Kernel uses the same CLOSID/RMID as user. Default option for the "global"
>> assign_ctrl_inherit_mon <- One CLOSID for all kernel work; RMID inherited from user.
>> assign_ctrl_assign_mon <- One resource group (CLOSID+RMID) for all kernel work. Default option for "cpu" type.
>
> My first thought is that the naming is confusing. resctrl has a very strong relationship between
> "RMID" and "monitoring" so naming a file "monitor" that deals with allocation/ctrl/CLOSID is
> potentially confusion.
>
> Apart from that, while I think I understand where you are going by separating the mode into
> two files I am concerned about future complications needing to accommodate all different
> combinations of the (now) essentially two modes. My preference is thus to keep this simple by
> keeping the mode within one file.
>
> Even so, when stepping back, it does not really look like we need to separate the "global"
> and "per CPU" modes. We could just have a single "per CPU" mode and the "global" is just
> its default of "all CPUs", no?
Yes. That is correct.
>
> Consider, for example, the implementation just consisting of:
>
> # cat info/kernel_mode
> [inherit_ctrl_and_mon]
> global_assign_ctrl_inherit_mon_per_cpu
> global_assign_ctrl_assign_mon_per_cpu
>
>>
>> Rename “kernel_mode_assignment” to “kmode_group” to assign the specific group to kmode. This file usage is same as before.
>>
>> #cat info/kmode_groups (Renamed "kernel_mode_assignment")
>> //
>
> Please consider the intent of this file when thinking about names. The idea is that "info/kernel_mode"
> specifies the "mode" of how kernel work is handled and it determines the configuration files used in that
> mode as well as the syntax when interacting with those files. By renaming "kernel_mode_assignment" to
> "kmode_groups" it implicitly requires all future kernel mode enhancements to need some data related to "groups".
>
> In summary, I think this can be simplified by introducing just two new files in info/ that enables the
> user to (a) select and (b) configure the "kernel mode". To start there can be just two modes,
> global_assign_ctrl_inherit_mon_per_cpu and global_assign_ctrl_assign_mon_per_cpu.
> global_assign_ctrl_inherit_mon_per_cpu mode requires a control group in kernel_mode_assignment while
> global_assign_ctrl_assign_mon_per_cpu requires a control and monitoring group.
>
> The resource group in info/kernel_mode_assignment gets two additional files "kernel_mode_cpus" and
> "kernel_mode_cpus_list" that contains the CPUs enabled with the kernel mode configuration, by default
> it will be all online CPUs. The resource group can continue to be used to manage allocations of and
> monitor user space tasks. Specifically, the "cpus", "cpus_list", and "tasks" files remain.
>
> A user wanting just "global" settings will get just that when writing the group to
> info/kernel_mode_assignment. A user wanting "per CPU" settings can follow the
> info/kernel_mode_assignment setting with changes to that resource group's kernel_mode_cpus/kernel_mode_cpus_list
> files. Any task running on a CPU that is *not* in kernel_mode_cpus/kernel_mode_cpus_list can be
> expected to inherit both CLOSID and RMID from user space for all kernel work.
After further consideration, I don’t think the info/kernel_mode file is
necessary. There’s no need to enforce a specific mode for all the PLZA
groups. Avoiding this constraint makes the design more flexible,
particularly as we move toward supporting multiple PLZA groups in the
future. MPAM already appears capable of handling more than one group—for
example, one group could use inherit_ctrl_and_mon, while another could
use global_assign_ctrl_inherit_mon_per_cpu.
The mode can simply be determined on a per-group basis. We can introduce
two new files—kernel_mode_cpus and kernel_mode_cpus_list—within each
resctrl group when kmode (or PLZA) is supported.
The info/kernel_mode_assignment file would indicate which resctrl
group (or groups) is used for PLZA. The kernel_mode_cpus and
kernel_mode_cpus_list files would indicate how PLZA is applied within
each group.
Files and behavior:
- cpus / cpus_list:
CPUs listed here use the same allocation for both user and kernel space.
There is no change to the current semantics of these files.
If these files are empty, the group effectively becomes a PLZA-dedicated
group.
- kernel_mode_cpus / kernel_mode_cpus_list:
These files determine whether a separate kernel allocation is applied.
If empty, user and kernel share the same allocation.
If non-empty, the kernel uses a separate allocation.
The group can be a CTL_MON or a MON group. The CLOSID and RMID used to
enable PLZA depend on the group type. If it is a MON group, then
rmid_en = 1 when writing the PLZA MSR.
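To make the per-group rules above concrete, here is a minimal sketch in plain C. It is only an illustration: the type and function names are hypothetical, the CPU mask is limited to 64 CPUs, and no MSR bit layout is implied. Leaving rmid_en clear for a CTL_MON group is an assumption, since only the MON case is spelled out above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical types for illustration; not taken from the patch series. */
enum group_type { GROUP_CTL_MON, GROUP_MON };

struct plza_fields {
	bool plza_en;		/* kernel work uses a separate allocation */
	bool rmid_en;		/* kernel work also uses the group's RMID */
	uint32_t closid;
	uint32_t rmid;
};

/*
 * Decide the PLZA settings for one CPU of a group.
 * kernel_mode_cpus: bit n set means CPU n is listed in kernel_mode_cpus.
 */
static struct plza_fields plza_for_cpu(enum group_type type, uint32_t closid,
				       uint32_t rmid, uint64_t kernel_mode_cpus,
				       unsigned int cpu)
{
	struct plza_fields f = { false, false, 0, 0 };

	/* Empty mask or CPU not listed: user and kernel share the allocation. */
	if (cpu >= 64 || !((kernel_mode_cpus >> cpu) & 1))
		return f;

	f.plza_en = true;
	f.closid = closid;

	/* Assumption: only a MON group sets rmid_en. */
	if (type == GROUP_MON) {
		f.rmid_en = true;
		f.rmid = rmid;
	}

	return f;
}
```

With an empty kernel_mode_cpus mask the function returns plza_en = false for every CPU, which matches the "user and kernel share the same allocation" case.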
Here’s the proposed flow:
# mount -t resctrl resctrl /sys/fs/resctrl/
# cd /sys/fs/resctrl/
# cat info/kernel_mode_assignment
//
By default, the root (default) group is PLZA-enabled when resctrl is
mounted. All CPUs use CLOSID 0 for both user and kernel-mode allocation.
# cat cpus_list
1-64
# cat kmode_cpus_list
1-64
Next, create a new group for PLZA:
# mkdir plza_group
# echo "plza_group//" > info/kernel_mode_assignment
At this point, plza_group becomes the new PLZA-enabled group, and the
PLZA-related MSRs are updated accordingly.
# cat plza_group/cpus_list
<empty>
# cat plza_group/kmode_cpus_list
1-64
The user can then update kmode_cpus_list to apply PLZA only to a
specific subset of CPUs, if desired.
What do you think of this approach?
Thanks
Babu
^ permalink raw reply
* RE: [EXTERNAL] SVSM Development Call April 8th, 2026
From: Jon Lange @ 2026-04-07 22:24 UTC (permalink / raw)
To: Stefano Garzarella, coconut-svsm@lists.linux.dev,
linux-coco@lists.linux.dev
In-Reply-To: <CAGxU2F6OApB3K61_sPujnvK_gx_K8zFWyOSTPV9mzWOCyNkBJg@mail.gmail.com>
> Here is the call for agenda items for this week's SVSM development
> call. Please send any agenda items you have in mind as a reply to this
> email or raise them in the meeting.
As a reminder, we previously scheduled an agenda topic for this week to discuss IGVM measurements. The IGVM community is working on a plan to add CoRIM support to the IGVM crate, which would include automatic generation of expected measurements. It will be helpful to have a discussion among consumers of IGVM measurements to ensure that the CoRIM work can account for the needs of the community.
-Jon
^ permalink raw reply
* Re: [PATCH v2 00/16] fs,x86/resctrl: Add kernel-mode (e.g., PLZA) support to the resctrl subsystem
From: Reinette Chatre @ 2026-04-07 17:48 UTC (permalink / raw)
To: Babu Moger, corbet, tony.luck, Dave.Martin, james.morse, tglx,
mingo, bp, dave.hansen
Cc: skhan, x86, hpa, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, kas,
rick.p.edgecombe, akpm, pmladek, rdunlap, dapeng1.mi, kees, elver,
paulmck, lirongqing, safinaskar, fvdl, seanjc, pawan.kumar.gupta,
xin, tiala, Neeraj.Upadhyay, chang.seok.bae, thomas.lendacky,
elena.reshetova, linux-doc, linux-kernel, linux-coco, kvm,
eranian, peternewman
In-Reply-To: <5a740f47-d3f3-45af-9d8c-ebcf3dd89c0d@amd.com>
Hi Babu,
On 4/6/26 3:45 PM, Babu Moger wrote:
> Hi Reinette,
>
> Sorry for the late response. I was trying to get confirmation about the use case.
No problem. I appreciate that you did this so that we can make sure resctrl supports
needed use cases.
>
> On 3/31/26 17:24, Reinette Chatre wrote:
>> On 3/30/26 11:46 AM, Babu Moger wrote:
>>> On 3/27/26 17:11, Reinette Chatre wrote:
>>>> On 3/26/26 10:12 AM, Babu Moger wrote:
>>>>> On 3/24/26 17:51, Reinette Chatre wrote:
>>>>>> On 3/12/26 1:36 PM, Babu Moger wrote:
>> can have domains that span different CPUs. There thus seem to be a built in assumption of what a "domain"
>> means for PQR_PLZA_ASSOC so it sounds to me as though, instead of saying that "PQR_PLZA_ASSOC needs
>> to be the same in QoS domain" it may be more accurate to, for example, say that "PQR_PLZA_ASSOC has L3 scope"?
>
> Yes.
Above is about L3 scope ...
>>
>> This seems to be what this implementation does since it hardcodes PQR_PLZA_ASSOC scope to the L3
>> resource but that creates dependency to the L3 resource that would make PLZA unusable if, for example,
>> the user boots with "rdt=!l3cat" while wanting to use PLZA to manage MBA allocations when in kernel?
>
> Yes. that is correct. It should not be attached to one resource. We need to change it to global scope.
Can I interpret "global scope" as "all online CPUs"? Doing so will simplify
supporting this feature. It does not sound practical for a user wanting to assign
different resource groups to kernel work done in different domains ... the guidance should
instead be to just set the allocations of one resource group to what is needed in the different
domains? There may be more flexibility when supporting per-domain RMIDs though but so far
it sounds as though the focus is global. We can consider what needs to be done to support
some type of "per-domain" assignment as an exercise to check whether the current interface
could support it in the future.
...
>>> There are multiple ways this feature can be applied. For simplicity, the discussion below focuses only on CLOSID.
>>>
>>>
>>> 1. Global PLZA enablement
>>>
>>> PLZA can be configured as a global feature by setting |PQR_PLZA_ASSOC.closid = CLOSID| and |PQR_PLZA_ASSOC.plza_en = 1| on all threads in the system. A dedicated CLOSID is reserved for this purpose,
>>
>> Also discussed during v1 is that there is no need to dedicate a CLOSID for this purpose.
>> There could be an "unthrottled" CLOSID to which all high priority user space tasks as
>> well as all kernel work of all tasks are assigned.
>> If user space chooses to dedicate a CLOSID for kernel work then that should supported and
>> interface can allow that, but there is no need for resctrl to enforce this.
(above is comment about dedicated group - please see below)
> Yes. I agree. The changes in context switch code is a concern.
>
> You covered some of the cases I was thinking(xx_set_individual).
>
> How about this idea?
>
> I suggest splitting the PLZA into two distinct aspects:
>
> 1. How PLZA is applied within a resource group
>
> 2. How PLZA is monitored
I think I see where you are going here. While the "How PLZA is monitored" naming
refers to "monitoring" I *think* what you are separating here is (a) how PLZA is configured
(CLOSID and RMID settings) and (b) how that PLZA configuration is assigned to tasks/CPUs,
not just within a resource group but across the system. Please see below.
> Introduce a new file, "info/kmode_type", to describe how kmode applies in the system.
ack. "in the system" as you have above, not "within a resource group" as mentioned
before that.
>
> # cat info/kmode_type
> [global] <- Kernel mode applies to the entire system (all CPUs/tasks)
> cpus <- Kernel mode applies only to the CPUs in the group
> tasks <- Kernel mode applies only to the tasks in the group
>
> The "global" option is the default right now and it is current common use-case.
>
> The "info/kmode_type -> cpus" option introduces new files
> "kmode_cpus" and "kmode_cpus_list" for users to apply kmode to
> specific set of CPUs. This lets users change the CPU set for PLZA.
Where were you thinking about placing these files in the hierarchy?
> The PLZA MSR is updated when user changes the association to the
> file. No context switch code changes are needed. This will be
> dedicated group. The current resctrl group files, "cpus, cpus_list
Why does this have to be a dedicated group? One of the conclusions from v1
discussion was that the "PLZA group" need *not* be a dedicated group. I repeated that
in my earlier response that I left quoted above. You did not respond to these
conclusions and statements in this regard while you keep coming back to this
needing to be a dedicated group without providing a motivation to do so.
Could you please elaborate why a dedicated group is required?
> and tasks" will not be accessible in this mode. This option give
These files can continue to be accessible.
> some flexibility for the user without the context switch overhead.
Dedicating a resource group to PLZA removes flexibility though, no?
>
> The "info/kmode_type -> tasks" option introduces a new file,
> "kmode_tasks", for users to apply kmode to specific set of tasks.
> This requires context switch changes. This will be dedicated group.
> The current resctrl group files, "cpus, cpus_list and tasks" will
> not be accessible in this mode. We currently have no use case for
> this, so it will not be supported now.
Thank you for confirming. This is a relief.
>
>
> Add a file, "info/kmode_monitor", to describe how kmode is monitored.
>
> # cat info/kmode_monitor
> [inherit_ctrl_and_mon] <- Kernel uses the same CLOSID/RMID as user. Default option for the "global"
> assign_ctrl_inherit_mon <- One CLOSID for all kernel work; RMID inherited from user.
> assign_ctrl_assign_mon <- One resource group (CLOSID+RMID) for all kernel work. Default option for "cpu" type.
My first thought is that the naming is confusing. resctrl has a very strong relationship between
"RMID" and "monitoring" so naming a file "monitor" that deals with allocation/ctrl/CLOSID is
potentially confusing.
Apart from that, while I think I understand where you are going by separating the mode into
two files I am concerned about future complications needing to accommodate all different
combinations of the (now) essentially two modes. My preference is thus to keep this simple by
keeping the mode within one file.
Even so, when stepping back, it does not really look like we need to separate the "global"
and "per CPU" modes. We could just have a single "per CPU" mode and the "global" is just
its default of "all CPUs", no?
Consider, for example, the implementation just consisting of:
# cat info/kernel_mode
[inherit_ctrl_and_mon]
global_assign_ctrl_inherit_mon_per_cpu
global_assign_ctrl_assign_mon_per_cpu
>
> Rename “kernel_mode_assignment” to “kmode_group” to assign the specific group to kmode. This file usage is same as before.
>
> #cat info/kmode_groups (Renamed "kernel_mode_assignment")
> //
Please consider the intent of this file when thinking about names. The idea is that "info/kernel_mode"
specifies the "mode" of how kernel work is handled and it determines the configuration files used in that
mode as well as the syntax when interacting with those files. By renaming "kernel_mode_assignment" to
"kmode_groups" it implicitly requires all future kernel mode enhancements to need some data related to "groups".
In summary, I think this can be simplified by introducing just two new files in info/ that enable the
user to (a) select and (b) configure the "kernel mode". To start there can be just two modes,
global_assign_ctrl_inherit_mon_per_cpu and global_assign_ctrl_assign_mon_per_cpu.
global_assign_ctrl_inherit_mon_per_cpu mode requires a control group in kernel_mode_assignment while
global_assign_ctrl_assign_mon_per_cpu requires a control and monitoring group.
The resource group in info/kernel_mode_assignment gets two additional files "kernel_mode_cpus" and
"kernel_mode_cpus_list" that contain the CPUs enabled with the kernel mode configuration; by default
this will be all online CPUs. The resource group can continue to be used to manage allocations of and
monitor user space tasks. Specifically, the "cpus", "cpus_list", and "tasks" files remain.
A user wanting just "global" settings will get just that when writing the group to
info/kernel_mode_assignment. A user wanting "per CPU" settings can follow the
info/kernel_mode_assignment setting with changes to that resource group's kernel_mode_cpus/kernel_mode_cpus_list
files. Any task running on a CPU that is *not* in kernel_mode_cpus/kernel_mode_cpus_list can be
expected to inherit both CLOSID and RMID from user space for all kernel work.
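As a rough illustration of the semantics described above, the following plain C sketch picks the CLOSID/RMID used for kernel work on a given CPU. It is not resctrl code: the enum, struct, and function names are hypothetical, and the CPU mask is limited to 64 CPUs for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Modes as listed in info/kernel_mode; enum names are hypothetical. */
enum kernel_mode {
	INHERIT_CTRL_AND_MON,
	GLOBAL_ASSIGN_CTRL_INHERIT_MON_PER_CPU,
	GLOBAL_ASSIGN_CTRL_ASSIGN_MON_PER_CPU,
};

struct ids {
	uint32_t closid;
	uint32_t rmid;
};

struct kmode_cfg {
	enum kernel_mode mode;
	struct ids group;		/* group named in kernel_mode_assignment */
	uint64_t kernel_mode_cpus;	/* bit n set means CPU n is enabled */
};

/* Return the IDs used for kernel work on 'cpu' for a task with 'user' IDs. */
static struct ids kernel_ids(const struct kmode_cfg *cfg, unsigned int cpu,
			     struct ids user)
{
	struct ids k = user;
	bool enabled = cpu < 64 && ((cfg->kernel_mode_cpus >> cpu) & 1);

	/* CPUs outside kernel_mode_cpus inherit both CLOSID and RMID. */
	if (cfg->mode == INHERIT_CTRL_AND_MON || !enabled)
		return k;

	k.closid = cfg->group.closid;
	if (cfg->mode == GLOBAL_ASSIGN_CTRL_ASSIGN_MON_PER_CPU)
		k.rmid = cfg->group.rmid;

	return k;
}
```

The "global" case is then just kernel_mode_cpus covering all online CPUs, with no separate mode needed.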
Reinette
^ permalink raw reply
* SVSM Development Call April 8th, 2026
From: Stefano Garzarella @ 2026-04-07 16:36 UTC (permalink / raw)
To: coconut-svsm, linux-coco
Hi,
Here is the call for agenda items for this week's SVSM development
call. Please send any agenda items you have in mind as a reply to this
email or raise them in the meeting.
We will use the LF Zoom instance. Details of the meeting can be found
in our governance repository at:
https://github.com/coconut-svsm/governance
The link to the COCONUT-SVSM calendar is:
https://zoom-lfx.platform.linuxfoundation.org/meetings/coconut-svsm?view=week
The meeting will be recorded and the recording eventually published.
Regards,
Stefano
^ permalink raw reply
* Re: [PATCH v2 01/19] PCI/TSM: Report active IDE streams per host bridge
From: Xu Yilun @ 2026-04-07 16:02 UTC (permalink / raw)
To: Dan Williams
Cc: linux-coco, linux-pci, gregkh, aik, aneesh.kumar, bhelgaas,
alistair23, lukas, jgg
In-Reply-To: <20260303000207.1836586-2-dan.j.williams@intel.com>
On Mon, Mar 02, 2026 at 04:01:49PM -0800, Dan Williams wrote:
> The first attempt at an ABI for this failed to account for naming
> collisions across host bridges:
>
> Commit a4438f06b1db ("PCI/TSM: Report active IDE streams")
>
> Revive this ABI with a per host bridge link that appears at first stream
> creation for a given host bridge and disappears after the last stream is
> removed.
>
> For systems with many host bridge objects it allows:
>
> ls /sys/class/tsm/tsmN/pci*/stream*
>
> ...to find all the host bridges with active streams without first iterating
> over all host bridges. Yilun notes that is handy to have this short cut [1]
> and from an administrator perspective it helps with inventory for
> constrained stream resources.
>
> Link: http://lore.kernel.org/aXLtILY85oMU5qlb@yilunxu-OptiPlex-7050 [1]
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com>
^ permalink raw reply
* Re: [PATCH v7 09/22] x86/virt/seamldr: Introduce skeleton for TDX module updates
From: Dave Hansen @ 2026-04-07 15:55 UTC (permalink / raw)
To: Chao Gao, linux-kernel, linux-coco, kvm
Cc: binbin.wu, dan.j.williams, dave.hansen, ira.weiny, kai.huang, kas,
nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, x86, H. Peter Anvin
In-Reply-To: <adTvNKVRxsp9Vz10@intel.com>
On 4/7/26 04:49, Chao Gao wrote:
> Applying Dave's feedback to simplify comments across the series. I will change
> this to:
>
> /* The lockstep update needs a stable set of online CPUs. */
Try to speak in imperative voice, please:
/* Ensure a stable set of online CPUs for ...
^ permalink raw reply
* Re: [PATCH v7 16/22] x86/virt/tdx: Update tdx_sysinfo and check features post-update
From: Dave Hansen @ 2026-04-07 15:53 UTC (permalink / raw)
To: Chao Gao, linux-kernel, linux-coco, kvm
Cc: binbin.wu, dan.j.williams, dave.hansen, ira.weiny, kai.huang, kas,
nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, x86, H. Peter Anvin
In-Reply-To: <adT1Tkz+/ysSZ1Ua@intel.com>
On 4/7/26 05:15, Chao Gao wrote:
> Dave's comment on another patch applies here too: don't preemptively handle
> errors that never occur. The custom error message is unnecessary, and
> propagating the error isn't worth it. Will simplify it to:
>
> /* Shouldn't fail as the update has succeeded. */
> WARN_ON_ONCE(get_tdx_sys_info(info));
This is nit territory, but I don't like that either.
Actual, important, normal-program-flow logic should stand on its own,
separate from warnings.
OK:
ret = foo();
WARN_ON(ret);
Not OK:
WARN_ON(foo());
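For anyone who wants to try the pattern outside the kernel, here is a small userspace sketch. The WARN_ON() below is a stand-in, not the kernel macro, and foo() and refresh() are hypothetical names used only for illustration.

```c
#include <stdio.h>

/* Userspace stand-in for the kernel's WARN_ON(); illustration only. */
static inline int warn_on(int cond, const char *file, int line)
{
	if (cond)
		fprintf(stderr, "WARNING at %s:%d\n", file, line);
	return cond;
}
#define WARN_ON(cond) warn_on(!!(cond), __FILE__, __LINE__)

static int calls;

/* Hypothetical operation with a side effect and an error return. */
static int foo(void)
{
	calls++;
	return 0;
}

/* The call stands on its own in normal flow; the warning only checks it. */
static int refresh(void)
{
	int ret;

	ret = foo();
	WARN_ON(ret);

	return ret;
}
```

Keeping the call on its own line means the real work stays visible even if the warning is later removed or changed.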
^ permalink raw reply
* Re: [PATCH 0/7] KVM: x86: APX reg prep work
From: Sean Christopherson @ 2026-04-07 13:20 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Chang S. Bae, Kiryl Shutsemau, kvm, the arch/x86 maintainers,
linux-coco, Kernel Mailing List, Linux, Andrew Cooper
In-Reply-To: <CABgObfaFqrSENS=_eNgkyxebqL1vFauNqG3XAgZm0EHfkbQ_gw@mail.gmail.com>
On Tue, Apr 07, 2026, Paolo Bonzini wrote:
> Il mar 7 apr 2026, 00:00 Sean Christopherson <seanjc@google.com> ha scritto:
> >
> > > > . So unless I'm missing something (or hardware is flawed and lets the
> > > > guest speculative consume R16-R31, which would be sad), it's perfectly safe to
> > > > run the guest with host state in R16-R31.
> > > >
> > > > That would avoid pointlessly context switching 16 registers when APX is not being
> > > > used by the guest, and would avoid having to write XCR0 in the fastpath.
> > >
> > > For now yes, but once/if the kernel starts using the registers there's
> > > no way out of writing XCR0 for APX-disabled guests in the fast path.
> >
> > Why's that? So long as KVM uses vcpu->arch.regs[R16-R31] as the source of truth
> > when emulating anything, there's no danger of taking a #UD in the host due to
> > accessing R16-R31 with XCR0.APX=0.
>
> Yes I agree with that. But the unavoidable part is the XSETBV because
> only the assembly code can run with XCR0.APX=0. As soon as you go back
> to C, including during the fast path, you have to ensure XCR0.APX=1
> again if the kernel is compiled with -mapxf.
/facepalm
I got so focused on register state that I completely forgot about actually
using the registers...
^ permalink raw reply
* Re: [PATCH v7 16/22] x86/virt/tdx: Update tdx_sysinfo and check features post-update
From: Chao Gao @ 2026-04-07 12:15 UTC (permalink / raw)
To: linux-kernel, linux-coco, kvm
Cc: binbin.wu, dan.j.williams, dave.hansen, ira.weiny, kai.huang, kas,
nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, x86, H. Peter Anvin
In-Reply-To: <20260331124214.117808-17-chao.gao@intel.com>
>+int tdx_module_post_update(struct tdx_sys_info *info)
>+{
>+ struct tdx_sys_info_version *old, *new;
>+ int ret;
>+
>+ /* Shouldn't fail as the update has succeeded. */
>+ ret = get_tdx_sys_info(info);
>+ if (WARN_ONCE(ret, "version retrieval failed after update, replace the TDX module\n"))
>+ return ret;
Dave's comment on another patch applies here too: don't preemptively handle
errors that never occur. The custom error message is unnecessary, and
propagating the error isn't worth it. Will simplify it to:
/* Shouldn't fail as the update has succeeded. */
WARN_ON_ONCE(get_tdx_sys_info(info));
>+
>+ old = &tdx_sysinfo.version;
>+ new = &info->version;
>+ pr_info("version " TDX_VERSION_FMT " -> " TDX_VERSION_FMT "\n",
>+ old->major_version, old->minor_version, old->update_version,
>+ new->major_version, new->minor_version, new->update_version);
>+
>+ /*
>+ * Blindly refreshing the entire tdx_sysinfo could disrupt running
>+ * software, as it may subtly rely on the previous state unless
>+ * proven otherwise.
>+ *
>+ * Only refresh update_version and handoff version. They don't
>+ * affect TDX functionality. Major/minor versions do not change
>+ * across updates, so no refresh is needed.
>+ */
>+ tdx_sysinfo.version.update_version = info->version.update_version;
>+ tdx_sysinfo.handoff = info->handoff;
>+
>+ if (!memcmp(&tdx_sysinfo, info, sizeof(*info)))
>+ return 0;
>+
>+ pr_info("TDX module features have changed after updates, but might not take effect.\n");
>+ pr_info("Please consider updating your BIOS to install the TDX module.\n");
>+ return 0;
>+}
^ permalink raw reply
* Re: [PATCH v7 15/22] x86/virt/tdx: Restore TDX module state
From: Chao Gao @ 2026-04-07 12:07 UTC (permalink / raw)
To: linux-kernel, linux-coco, kvm
Cc: binbin.wu, dan.j.williams, dave.hansen, ira.weiny, kai.huang, kas,
nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, x86, H. Peter Anvin
In-Reply-To: <20260331124214.117808-16-chao.gao@intel.com>
>+int tdx_module_run_update(void)
>+{
>+ struct tdx_module_args args = {};
>+ int ret;
>+
>+ ret = seamcall_prerr(TDH_SYS_UPDATE, &args);
>+ if (ret) {
>+ pr_err("update failed (%d)\n", ret);
>+ tdx_module_status = TDX_MODULE_ERROR;
>+ return ret;
>+ }
The pr_err() isn't needed as seamcall_prerr() will emit a
message. There is also no need to set tdx_module_status to ERROR on
failure as that is already done during shutdown.
So, this can be simplified to:
ret = seamcall_prerr(TDH_SYS_UPDATE, &args);
if (ret)
return ret;
^ permalink raw reply
* Re: [PATCH v7 12/22] x86/virt/tdx: Reset software states during TDX module shutdown
From: Chao Gao @ 2026-04-07 12:02 UTC (permalink / raw)
To: linux-kernel, linux-coco, kvm
Cc: binbin.wu, dan.j.williams, dave.hansen, ira.weiny, kai.huang, kas,
nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, x86, H. Peter Anvin
In-Reply-To: <20260331124214.117808-13-chao.gao@intel.com>
> int tdx_module_shutdown(void)
> {
> struct tdx_module_args args = {};
>+ int ret, cpu;
>
> /*
> * Shut down the TDX module and prepare handoff data for the next
>@@ -1188,7 +1189,31 @@ int tdx_module_shutdown(void)
> * modules as new modules likely have higher handoff version.
> */
> args.rcx = tdx_sysinfo.handoff.module_hv;
>- return seamcall_prerr(TDH_SYS_SHUTDOWN, &args);
>+ ret = seamcall_prerr(TDH_SYS_SHUTDOWN, &args);
>+ if (ret)
>+ return ret;
>+
>+ /*
>+ * Mark the module is unavailable (in ERROR status) to prevent
>+ * re-initialization and tdx_sysinfo reporting. Note the status
>+ * will be restored after a successful update.
>+ *
>+ * No need to acquire tdx_module_lock here since this runs in
>+ * stop_machine() where no concurrent initialization can occur.
>+ */
>+ tdx_module_status = TDX_MODULE_ERROR;
>+ sysinit_done = false;
>+ sysinit_ret = 0;
>+
>+ /*
>+ * Since the TDX module is shut down and gone, mark all CPUs
>+ * (including offlined ones) as uninitialized. This is called in
>+ * stop_machine() (where CPU hotplug is disabled), preventing
>+ * races with other tdx_lp_initialized accesses.
>+ */
>+ for_each_possible_cpu(cpu)
>+ per_cpu(tdx_lp_initialized, cpu) = false;
I would like to merge the two comments and make them more concise:
/*
* Clear global and per-CPU initialization flags so the new module
* can be fully re-initialized after a successful update. The ERROR
* status prevents re-init if the update ultimately fails.
*
* No locks needed as no concurrent accesses can occur here.
*/
tdx_module_status = TDX_MODULE_ERROR;
sysinit_done = false;
sysinit_ret = 0;
for_each_possible_cpu(cpu)
per_cpu(tdx_lp_initialized, cpu) = false;
>+ return 0;
> }
>
> static bool is_pamt_page(unsigned long phys)
>--
>2.47.3
>
^ permalink raw reply
* Re: [PATCH v7 11/22] x86/virt/seamldr: Shut down the current TDX module
From: Chao Gao @ 2026-04-07 11:51 UTC (permalink / raw)
To: linux-kernel, linux-coco, kvm
Cc: binbin.wu, dan.j.williams, dave.hansen, ira.weiny, kai.huang, kas,
nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, x86, H. Peter Anvin
In-Reply-To: <20260331124214.117808-12-chao.gao@intel.com>
>+int tdx_module_shutdown(void)
>+{
>+ struct tdx_module_args args = {};
>+
>+ /*
>+ * Shut down the TDX module and prepare handoff data for the next
>+ * TDX module. This SEAMCALL requires a handoff version. Use the
>+ * module's handoff version, as it is the highest version the
>+ * module can produce and is more likely to be supported by new
>+ * modules as new modules likely have higher handoff version.
>+ */
Will change this comment to:
/*
* Use the module's handoff version as it is the highest the
* module can produce and most likely supported by newer modules.
*/
>+ args.rcx = tdx_sysinfo.handoff.module_hv;
>+ return seamcall_prerr(TDH_SYS_SHUTDOWN, &args);
>+}
^ permalink raw reply
* Re: [PATCH v7 09/22] x86/virt/seamldr: Introduce skeleton for TDX module updates
From: Chao Gao @ 2026-04-07 11:49 UTC (permalink / raw)
To: linux-kernel, linux-coco, kvm
Cc: binbin.wu, dan.j.williams, dave.hansen, ira.weiny, kai.huang, kas,
nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, x86, H. Peter Anvin
In-Reply-To: <20260331124214.117808-10-chao.gao@intel.com>
>@@ -214,7 +287,14 @@ int seamldr_install_module(const u8 *data, u32 size)
> if (IS_ERR(params))
> return PTR_ERR(params);
>
>- /* TODO: Update TDX module here */
>- return 0;
>+ /*
>+ * Prevent CPU hotplug. If a CPU goes offline after thread_ack
>+ * initialization, thread_ack will exceed the online count and
>+ * never decrement to zero, causing all CPUs spinning forever
>+ * with IRQs disabled.
>+ */
Applying Dave's feedback to simplify comments across the series. I will change
this to:
/* The lockstep update needs a stable set of online CPUs. */
>+ guard(cpus_read_lock)();
>+ set_target_state(MODULE_UPDATE_START + 1);
>+ return stop_machine_cpuslocked(do_seamldr_install_module, params, cpu_online_mask);
> }
> EXPORT_SYMBOL_FOR_MODULES(seamldr_install_module, "tdx-host");
>--
>2.47.3
>
^ permalink raw reply
* [PATCH v2] dma-buf: heaps: system: document system_cc_shared heap
From: Jiri Pirko @ 2026-04-07 9:26 UTC (permalink / raw)
To: dri-devel, linaro-mm-sig, iommu, linux-media
Cc: sumit.semwal, benjamin.gaignard, Brian.Starkey, jstultz,
tjmercier, christian.koenig, m.szyprowski, robin.murphy, jgg,
leon, ptesarik, catalin.marinas, aneesh.kumar, suzuki.poulose,
steven.price, thomas.lendacky, john.allen, ashish.kalra,
suravee.suthikulpanit, linux-coco
From: Jiri Pirko <jiri@nvidia.com>
Document the system_cc_shared dma-buf heap that was introduced
recently. Describe its purpose, availability conditions and
relation to confidential computing VMs.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: T.J.Mercier <tjmercier@google.com>
---
Documentation/userspace-api/dma-buf-heaps.rst | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/Documentation/userspace-api/dma-buf-heaps.rst b/Documentation/userspace-api/dma-buf-heaps.rst
index 05445c83b79a..f56b743cdb36 100644
--- a/Documentation/userspace-api/dma-buf-heaps.rst
+++ b/Documentation/userspace-api/dma-buf-heaps.rst
@@ -16,6 +16,13 @@ following heaps:
- The ``system`` heap allocates virtually contiguous, cacheable, buffers.
+ - The ``system_cc_shared`` heap allocates virtually contiguous, cacheable,
+ buffers using shared (decrypted) memory. It is only present on
+ confidential computing (CoCo) VMs where memory encryption is active
+ (e.g., AMD SEV, Intel TDX). The allocated pages have the encryption
+ bit cleared, making them accessible for device DMA without TDISP
+ support. On non-CoCo VM configurations, this heap is not registered.
+
- The ``default_cma_region`` heap allocates physically contiguous,
cacheable, buffers. Only present if a CMA region is present. Such a
region is usually created either through the kernel commandline
--
2.51.1
^ permalink raw reply related
* Re: [PATCH] dma-buf: heaps: system: document system_cc_shared heap
From: Jiri Pirko @ 2026-04-07 9:25 UTC (permalink / raw)
To: T.J. Mercier
Cc: dri-devel, linaro-mm-sig, iommu, linux-media, sumit.semwal,
benjamin.gaignard, Brian.Starkey, jstultz, christian.koenig,
m.szyprowski, robin.murphy, jgg, leon, sean.anderson, ptesarik,
catalin.marinas, aneesh.kumar, suzuki.poulose, steven.price,
thomas.lendacky, john.allen, ashish.kalra, suravee.suthikulpanit,
linux-coco
In-Reply-To: <CABdmKX3N70j8ZZs5DNhx6fhRi=Aa_+2xY1JHcW+uDoaV2+Sngw@mail.gmail.com>
Mon, Apr 06, 2026 at 10:20:33PM +0200, tjmercier@google.com wrote:
>On Thu, Apr 2, 2026 at 7:11 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> From: Jiri Pirko <jiri@nvidia.com>
>>
>> Document the system_cc_shared dma-buf heap that was introduced
>> recently. Describe its purpose, availability conditions and
>> relation to confidential computing VMs.
>>
>> Signed-off-by: Jiri Pirko <jiri@nvidia.com>
>> ---
>> Documentation/userspace-api/dma-buf-heaps.rst | 8 ++++++++
>> 1 file changed, 8 insertions(+)
>>
>> diff --git a/Documentation/userspace-api/dma-buf-heaps.rst b/Documentation/userspace-api/dma-buf-heaps.rst
>> index 05445c83b79a..591732393e7d 100644
>> --- a/Documentation/userspace-api/dma-buf-heaps.rst
>> +++ b/Documentation/userspace-api/dma-buf-heaps.rst
>> @@ -16,6 +16,14 @@ following heaps:
>>
>> - The ``system`` heap allocates virtually contiguous, cacheable, buffers.
>>
>> + - The ``system_cc_shared`` heap allocates virtually contiguous, cacheable,
>> + buffers using shared (decrypted) memory. It is only present on
>> + confidential computing (CoCo) VMs where memory encryption is active
>> + (e.g., AMD SEV, Intel TDX). The allocated pages have the encryption
>> + bit cleared, making them accessible for device DMA without TDISP
>> + support. On non-CoCo VMs configurations, this heap is
>
>"non-CoCo VM configurations"
>
>> + not registered.
>
>Doesn't seem like you need to wrap this line.
>
>with that: Reviewed-by: T.J.Mercier <tjmercier@google.com>
Okay. Thanks!
>
>> +
>> - The ``default_cma_region`` heap allocates physically contiguous,
>> cacheable, buffers. Only present if a CMA region is present. Such a
>> region is usually created either through the kernel commandline
>
>Each paragraph starting with '-' confused me for a second there. Those
>aren't part of the diff. :)
^ permalink raw reply
* Re: [PATCH 0/7] KVM: x86: APX reg prep work
From: Paolo Bonzini @ 2026-04-07 7:18 UTC (permalink / raw)
To: Sean Christopherson
Cc: Chang S. Bae, Kiryl Shutsemau, kvm, the arch/x86 maintainers,
linux-coco, Kernel Mailing List, Linux, Andrew Cooper
In-Reply-To: <adQs4LQgy3mS2t89@google.com>
On Tue, Apr 7, 2026, 00:00 Sean Christopherson <seanjc@google.com> wrote:
>
> > > . So unless I'm missing something (or hardware is flawed and lets the
> > > guest speculatively consume R16-R31, which would be sad), it's perfectly safe to
> > > run the guest with host state in R16-R31.
> > >
> > > That would avoid pointlessly context switching 16 registers when APX is not being
> > > used by the guest, and would avoid having to write XCR0 in the fastpath.
> >
> > For now yes, but once/if the kernel starts using the registers there's
> > no way out of writing XCR0 for APX-disabled guests in the fast path.
>
> Why's that? So long as KVM uses vcpu->arch.regs[R16-R31] as the source of truth
> when emulating anything, there's no danger of taking a #UD in the host due to
> accessing R16-R31 with XCR0.APX=0.
Yes I agree with that. But the unavoidable part is the XSETBV because
only the assembly code can run with XCR0.APX=0. As soon as you go back
to C, including during the fast path, you have to ensure XCR0.APX=1
again if the kernel is compiled with -mapxf.
For now, I agree that early_xcr0 isn't needed and you can run all the
time with XCR0.APX=0.
Paolo
^ permalink raw reply
* Re: [PATCH 2/2] x86/virt/tdx: Use PFN directly for unmapping guest private memory
From: Yan Zhao @ 2026-04-07 0:44 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Xiaoyao Li, seanjc, dave.hansen, tglx, mingo, bp, kas, x86,
linux-kernel, kvm, linux-coco, kai.huang, rick.p.edgecombe,
yilun.xu, vannapurve, ackerleytng, sagis, binbin.wu,
isaku.yamahata
In-Reply-To: <5b3110f4-4e46-4573-b68e-54e220ae1c19@redhat.com>
On Sat, Apr 04, 2026 at 08:39:00AM +0200, Paolo Bonzini wrote:
> On 3/19/26 09:56, Yan Zhao wrote:
> > On Thu, Mar 19, 2026 at 04:56:10PM +0800, Xiaoyao Li wrote:
> > > So why not consider option 2?
> > >
> > > 2. keep tdx_quirk_reset_page() as-is for the cases of
> > > tdx_reclaim_page() and tdx_reclaim_td_control_pages() that have the
> > > struct page. But only change tdx_sept_remove_private_spte() to use
> > > tdx_quirk_reset_paddr() directly.
> > >
> > > It will need to export tdx_quirk_reset_paddr() for KVM. I think it will be OK?
> > I don't think it's necessary. But if we have to export an extra API, IMHO,
> > tdx_quirk_reset_pfn() is better than tdx_quirk_reset_paddr(). Otherwise,
> > why not only expose tdx_quirk_reset_paddr()?
>
> That works for me, it seems the cleanest.
Hi Paolo,
To avoid misunderstanding: you think only exporting tdx_quirk_reset_paddr() is
the cleanest, right? :)
Thanks
Yan
^ permalink raw reply
* Re: [PATCH] KVM: TDX: Fix APIC MSR ranges in tdx_has_emulated_msr()
From: Sean Christopherson @ 2026-04-06 23:07 UTC (permalink / raw)
To: Rick P Edgecombe
Cc: kvm@vger.kernel.org, Dave Hansen, Isaku Yamahata,
dmaluka@chromium.org, x86@kernel.org, kas@kernel.org,
bp@alien8.de, linux-kernel@vger.kernel.org, mingo@redhat.com,
dave.hansen@linux.intel.com, binbin.wu@linux.intel.com,
linux-coco@lists.linux.dev, hpa@zytor.com, tglx@kernel.org,
pbonzini@redhat.com
In-Reply-To: <faac37c7ba9e3a2e8d996e18c74301d9cefad3dc.camel@intel.com>
On Sat, Apr 04, 2026, Rick P Edgecombe wrote:
> On Fri, 2026-04-03 at 16:07 -0700, Sean Christopherson wrote:
> > > So... "Reduced #VE" (also called "VE reduction") reduces which things cause
> > > a #VE. The guest opts into it and the TDX module starts behaving
> > > differently. It's kind of a grab bag of changes including changing CPUID
> > > behavior, which is another wrinkle. It was intended to fixup guest side TDX
> > > arch issues.
> >
> > And KVM has no visibility into which mode the guest has selected? That's
> > awful.
>
> Yea, on both accounts. So where we are at with this is, starting to reject
> changes that build on the pattern. We haven't gone so far as to ask for a
> feature to notify the host of the guest opt-ins. But I wouldn't say we have a
> grand design in mind either. If you have any clarity, please feel free to drop a
> quotable.
I got nothing, probably best to just deal with things on a case-by-case basis
unless we end up with a recurring theme.
> > If KVM has no visibility, then I don't see an option other than for KVM to
> > advertise and emulate what it can at all times, and it becomes the guest's
> > responsibility to not screw up. I guess it's not really any different from
> > not trying to use MMIO accesses after switching to x2APIC mode.
>
> Like your diff? Expose any MSRs that might be emulated in the TDX paradigm. But
> don't expose all MSRs that KVM supports.
Yep.
^ permalink raw reply
* Re: [PATCH v2 00/16] fs,x86/resctrl: Add kernel-mode (e.g., PLZA) support to the resctrl subsystem
From: Babu Moger @ 2026-04-06 22:45 UTC (permalink / raw)
To: Reinette Chatre, corbet, tony.luck, Dave.Martin, james.morse,
tglx, mingo, bp, dave.hansen
Cc: skhan, x86, hpa, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, kas,
rick.p.edgecombe, akpm, pmladek, rdunlap, dapeng1.mi, kees, elver,
paulmck, lirongqing, safinaskar, fvdl, seanjc, pawan.kumar.gupta,
xin, tiala, Neeraj.Upadhyay, chang.seok.bae, thomas.lendacky,
elena.reshetova, linux-doc, linux-kernel, linux-coco, kvm,
eranian, peternewman
In-Reply-To: <83ae0c18-5c5e-4b52-901d-4126fe7c141b@intel.com>
Hi Reinette,
Sorry for the late response. I was trying to get confirmation about the
use case.
On 3/31/26 17:24, Reinette Chatre wrote:
> Hi Babu,
>
> On 3/30/26 11:46 AM, Babu Moger wrote:
>> On 3/27/26 17:11, Reinette Chatre wrote:
>>> On 3/26/26 10:12 AM, Babu Moger wrote:
>>>> On 3/24/26 17:51, Reinette Chatre wrote:
>>>>> On 3/12/26 1:36 PM, Babu Moger wrote:
>
>>>>>> Tony suggested using global variables to store the kernel mode
>>>>>> CLOSID and RMID. However, the kernel mode CLOSID and RMID are
>>>>>> coming from rdtgroup structure with the new interface. Accessing
>>>>>> them requires holding the associated lock, which would make the
>>>>>> context switch path unnecessarily expensive. So, dropped the idea.
>>>>>> https://lore.kernel.org/lkml/aXuxVSbk1GR2ttzF@agluck-desk3/
>>>>>> Let me know if there are other ways to optimize this.
>>>>> I do not see why the context switch path needs to be touched at all with this
>>>>> implementation. Since PLZA only supports global assignment does it not mean that resctrl
>>>>> only needs to update PQR_PLZA_ASSOC when user writes to info/kernel_mode and
>>>>> info/kernel_mode_assignment?
>>>> Each thread has an MSR to configure whether to associate privilege level zero execution with a separate COS and/or RMID, and the value of the COS and/or RMID. PLZA may be enabled or disabled on a per-thread basis. However, the COS and RMID association and configuration must be the same for all threads in the QOS Domain.
>>> Based on previous comment in https://lore.kernel.org/lkml/abb049fa-3a3d-4601-9ae3-61eeb7fd8fcf@amd.com/
>>> and this implementation all fields of PQR_PLZA_ASSOC except PQR_PLZA_ASSOC.plza_en must be the
>>> same for all CPUs on the system, not just per QoS domain. Could you please confirm?
>>
>> Sorry for the confusion. It is "per QoS domain".
>>
>> All the fields of PQR_PLZA_ASSOC except PQR_PLZA_ASSOC.plza_en must be set to the same value for all HW threads in the QOS domain for consistent operation (Per-QosDomain).
>
> Thank you for clarifying. To build on this, what would be the best way for resctrl to interpret this?
> As I see it all values in PQR_PLZA_ASSOC apply to *all* resources yet (theoretically?) every resource
Yes. That is correct. PLZA applies to all the resources.
> can have domains that span different CPUs. There thus seems to be a built-in assumption of what a "domain"
> means for PQR_PLZA_ASSOC so it sounds to me as though, instead of saying that "PQR_PLZA_ASSOC needs
> to be the same in QoS domain" it may be more accurate to, for example, say that "PQR_PLZA_ASSOC has L3 scope"?
Yes.
>
> This seems to be what this implementation does since it hardcodes PQR_PLZA_ASSOC scope to the L3
> resource but that creates dependency to the L3 resource that would make PLZA unusable if, for example,
> the user boots with "rdt=!l3cat" while wanting to use PLZA to manage MBA allocations when in kernel?
Yes. That is correct. It should not be attached to one resource. We need
to change it to global scope.
>
> ...
>
>> Yes, I agree with your concerns. The goal here is to make the interface less disruptive while still addressing the different use cases.
>
> I consider changing resctrl behavior when values are written to existing resctrl files
> to be disruptive. This is something we explicitly discussed during v1 as something to
> be avoided so this implementation that overloads the tasks file again is unexpected.
Yes. Agree. If required we need to introduce new files (kmode_cpus,
kmode_cpu_list or kmode_tasks) to handle these cases.
>
>> Background: Customers have identified an issue with the QoS
>> Bandwidth Control feature: when a CLOS is aggressively throttled
>> and execution transitions into kernel mode, kernel operations are
>> also subject to the same aggressive throttling.
>>
>> Privilege-Level Zero Association (PLZA) allows a user to specify a
>> COS and/or RMID to be used during execution at Privilege Level Zero.
>> When PLZA is enabled on a hardware thread, any execution that enters
>> Privilege Level Zero will have its transactions associated with the
>> PLZA COS and/or RMID. Otherwise, the thread continues to use the COS
>> and RMID specified by |PQR_ASSOC|. In other words, the hardware
>> provides a dedicated COS and/or RMID specifically for kernel-mode
>> execution.
> ack.
>
>>
>> There are multiple ways this feature can be applied. For simplicity, the discussion below focuses only on CLOSID.
>>
>>
>> 1. Global PLZA enablement
>>
>> PLZA can be configured as a global feature by setting |PQR_PLZA_ASSOC.closid = CLOSID| and |PQR_PLZA_ASSOC.plza_en = 1| on all threads in the system. A dedicated CLOSID is reserved for this purpose,
>
> Also discussed during v1 is that there is no need to dedicate a CLOSID for this purpose.
> There could be an "unthrottled" CLOSID to which all high priority user space tasks as
> well as all kernel work of all tasks are assigned.
> If user space chooses to dedicate a CLOSID for kernel work then that should be supported and
> interface can allow that, but there is no need for resctrl to enforce this.
>
>> and all CPU threads use its allocations whenever they enter Privilege Level Zero. This CLOSID does not need to be associated with any resctrl group.
I misspoke here.
>
> The CLOSID has to be associated with a resource group to be able to manage its
> resource allocations, no?
Yes. We need to have resource group schemata to enforce the limits.
>
>> The user can explicitly enable or disable this feature.
> ack.
>
>> There is no context switch overhead but there is no flexibility with this approach.
>
> Flexibility is subjective. As I understand this supports the only use case we learned about so far:
> https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
>
>> 2. Group based PLZA allocation : PLZA is managed via a dedicated
>> resctrl group. A separate resctrl group can be created
>> specifically for PLZA, with a dedicated CLOSID used exclusively
>> for kernel mode execution. This approach can be further divided
>> into two association models:
>
> So far this sounds like global allocation since both need a dedicated resource group.
> Whether this group is dedicated to kernel work or shared between kernel and user space work
> is up to the user. There is no motivation why CLOSID should ever be enforced to be
> exclusive for kernel mode execution.
Yes. That is fine.
>
>>
>> i) CPU based association
>> CPUs are assigned to the PLZA group, and PLZA is enabled only on
>> those CPUs. This effectively creates a dedicated PLZA group. MSRs
>> (|PQR_PLZA_ASSOC|) are programmed only when the user changes CPU
>> assignments. This approach requires no changes to the context switch
>> code and introduces no additional context switch overhead.
>>
>> ii) Task based association
>> Tasks are explicitly assigned by the user to the PLZA group. Tasks
>> need to be updated when the user adds a new task. Also, this requires
>> updates during task scheduling so that the MSRs (|PQR_PLZA_ASSOC|)
>> are programmed on each context switch, which introduces additional
>> context switch overhead.
>
> As discussed during v1 any changes needed to support per task assignment would
> need to be done with new files dedicated to this purpose. Do not overload the
> existing resctrl tasks/cpus/cpus_list files.
Yes. Sure.
>
>> I tried to fit these requirements into the interface files in
>> /sys/fs/resctrl/info/. I may have missed a few things while trying to
>> achieve it. As usual, I am open to discussion and
>> recommendations.
>
> Many of these items were already discussed as part of v1 so I think we may be
> talking past each other here. I tried to highlight the relevant points raised
> during v1 discussion that I thought there already was agreement on.
>
> The one new aspect is that I assumed this implementation will only be for
> global configuration and assignment. It looks like you want to support both
> global configuration and per-task assignment. In the original I did not consider
> configuration and assignment to occur at different scope so we may need to come up
> with new modes to distinguish. Consider the addition of two modes as below:
>
> # cat info/kernel_mode
> [inherit_ctrl_and_mon]
> global_assign_ctrl_inherit_mon_set_all
> global_assign_ctrl_assign_mon_set_all
> global_assign_ctrl_inherit_mon_set_individual
> global_assign_ctrl_assign_mon_set_individual
>
> Above introduces a "set_all" and "set_individual" suffix to the original two
> modes.
>
> global_assign_ctrl_inherit_mon_set_all
> global_assign_ctrl_assign_mon_set_all:
>
> Above are the original two modes but makes it clear that when this mode is
> activated _all_ tasks run with the assignment.
>
> global_assign_ctrl_inherit_mon_set_individual
> global_assign_ctrl_assign_mon_set_individual:
>
> Above are two new modes. In this mode user space also assigns a resource
> group globally but then needs to follow that up by activating every task
> separately to run with this assignment.
> One way in which this can be accomplished could be to have "kernel_mode_tasks",
> "kernel_mode_cpus", and "kernel_mode_cpus_list" files become visible (or be
> created) in the resource group found in info/kernel_mode_assignment. User
> space interacts with the new files to set which tasks and/or CPUs run with
> PLZA enabled.
>
> Even so, as I understand global_assign_ctrl_inherit_mon_set_all and
> global_assign_ctrl_assign_mon_set_all address the only known use case. Do you know
> if there are use cases for global_assign_ctrl_inherit_mon_set_individual and
> global_assign_ctrl_assign_mon_set_individual? The latter two add significant
> complexity to resctrl while I have not heard about any use case for them.
>
Yes. I agree. The changes in the context switch code are a concern.
You covered some of the cases I was thinking of (xx_set_individual).
How about this idea?
I suggest splitting the PLZA into two distinct aspects:
1. How PLZA is applied within a resource group
2. How PLZA is monitored
Introduce a new file, "info/kmode_type", to describe how kmode applies
in the system.
# cat info/kmode_type
[global] <- Kernel mode applies to the entire system (all CPUs/tasks)
cpus <- Kernel mode applies only to the CPUs in the group
tasks <- Kernel mode applies only to the tasks in the group
The "global" option is the default right now and it is the current common
use case.
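
As a rough illustration of the listing convention above (the active mode
is the entry in square brackets, the same style as info/kernel_mode), here
is a minimal sketch of how a userspace tool might pick out the active
entry. Note that info/kmode_type is only proposed in this thread, and the
helper name active_mode() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the bracketed (currently active) entry of a
 * resctrl-style mode listing into out as a NUL-terminated string.
 * Returns 0 on success, or -1 if there is no complete "[...]" entry or
 * the entry does not fit in out.
 */
static int active_mode(const char *listing, char *out, size_t outlen)
{
	const char *lb, *rb;
	size_t len;

	assert(listing != NULL && out != NULL);

	lb = strchr(listing, '[');
	if (!lb)
		return -1;
	rb = strchr(lb + 1, ']');
	if (!rb)
		return -1;

	len = (size_t)(rb - lb - 1);	/* characters between the brackets */
	if (len + 1 > outlen)		/* need room for the terminating NUL */
		return -1;

	memcpy(out, lb + 1, len);
	out[len] = '\0';
	return 0;
}
```

With the listing shown above, the helper would yield "global"; after a
user switches modes, it would yield whichever entry carries the brackets.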
The "info/kmode_type -> cpus" option introduces new files "kmode_cpus"
and "kmode_cpus_list" for users to apply kmode to a specific set of CPUs.
This lets users change the CPU set for PLZA. The PLZA MSR is updated
when the user changes the association through these files. No context
switch code changes are needed. This will be a dedicated group. The
current resctrl group files, "cpus, cpus_list and tasks", will not be
accessible in this mode. This option gives the user some flexibility
without the context switch overhead.
The "info/kmode_type -> tasks" option introduces a new file,
"kmode_tasks", for users to apply kmode to a specific set of tasks. This
requires context switch changes. This will be a dedicated group. The
current resctrl group files, "cpus, cpus_list and tasks", will not be
accessible in this mode. We currently have no use case for this, so it
will not be supported now.
Add a file, "info/kmode_monitor", to describe how kmode is monitored.
# cat info/kmode_monitor
[inherit_ctrl_and_mon] <- Kernel uses the same CLOSID/RMID as user.
Default option for the "global"
assign_ctrl_inherit_mon <- One CLOSID for all kernel work; RMID
inherited from user.
assign_ctrl_assign_mon <- One resource group (CLOSID+RMID) for all
kernel work. Default option for "cpus" type.
Rename "kernel_mode_assignment" to "kmode_group" to assign the specific
group to kmode. The usage of this file is the same as before.
# cat info/kmode_groups (Renamed "kernel_mode_assignment")
//
Thoughts?
thanks
Babu
^ permalink raw reply
* Re: [PATCH v7 17/22] x86/virt/tdx: Avoid updates during update-sensitive operations
From: Sean Christopherson @ 2026-04-06 22:29 UTC (permalink / raw)
To: Chao Gao
Cc: linux-kernel, linux-coco, kvm, binbin.wu, dan.j.williams,
dave.hansen, ira.weiny, kai.huang, kas, nik.borisov, paulmck,
pbonzini, reinette.chatre, rick.p.edgecombe, sagis, tony.lindgren,
vannapurve, vishal.l.verma, yilun.xu, xiaoyao.li, yan.y.zhao,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
H. Peter Anvin
In-Reply-To: <20260331124214.117808-18-chao.gao@intel.com>
On Tue, Mar 31, 2026, Chao Gao wrote:
> A runtime TDX module update can conflict with TD lifecycle operations that
> are update-sensitive.
>
> Today, update-sensitive operations include:
>
> - TD build: TD measurement is accumulated across multiple
> TDH.MEM.PAGE.ADD, TDH.MR.EXTEND, and TDH.MR.FINALIZE calls.
>
> - TD migration: intermediate crypto state is saved/restored across
> interrupted/resumed TDH.EXPORT.STATE.* and TDH.IMPORT.STATE.* flows.
>
> If an update races TD build, for example, TD measurement can become
> incorrect and attestation can fail.
>
> The TDX architecture exposes two approaches:
>
> 1) Avoid updates during update-sensitive operations.
> 2) Detect incompatibility after update and recover.
>
> Post-update detection (option #2) is not a good fit: as discussed in [1],
> future module behavior may expand update-sensitive operations in ways that
> make KVM ABIs unstable and will break userspace.
>
> "Do nothing" is also not preferred: while it keeps kernel code simple, it
> lets the issue leak into the broader stack, where both detection and
> recovery require significantly more effort.
>
> So, use option #1. Specifically, request "avoid update-sensitive" behavior
> during TDX module shutdown and map the resulting failure to -EBUSY so
> userspace can distinguish an update race from other failures.
>
> When the "avoid update-sensitive" feature isn't supported, proceed with
> updates. If a race occurs between module update and update-sensitive
> operations, failures happen at a later stage (e.g., incorrect TD
> measurements in attestation reports for TD build). Effectively, this
> means "let userspace update at their own risk". Userspace can check if
> the feature is supported or not. The alternative of blocking updates
> entirely is rejected [2] as it introduces permanent kernel complexity to
> accommodate limitations in early TDX module releases that userspace can
> handle.
>
> Note: this implementation is based on a reference patch by Vishal [3].
> Note2: moving "NO_RBP_MOD" is just to centralize bit definitions.
>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
> Link: https://lore.kernel.org/linux-coco/aQIbM5m09G0FYTzE@google.com/ # [1]
> Link: https://lore.kernel.org/kvm/699fe97dc212f_2f4a100b@dwillia2-mobl4.notmuch/ # [2]
> Link: https://lore.kernel.org/linux-coco/CAGtprH_oR44Vx9Z0cfxvq5-QbyLmy_+Gn3tWm3wzHPmC1nC0eg@mail.gmail.com/ # [3]
> ---
For the STATUS_MASK movement:
Acked-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/include/asm/tdx.h | 11 +++++++++--
> arch/x86/kvm/vmx/tdx_errno.h | 2 --
> arch/x86/virt/vmx/tdx/tdx.c | 25 +++++++++++++++++++++----
> arch/x86/virt/vmx/tdx/tdx.h | 3 ---
> 4 files changed, 30 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 79733fdb35c6..00751506dd3c 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -26,11 +26,18 @@
> #define TDX_SEAMCALL_GP (TDX_SW_ERROR | X86_TRAP_GP)
> #define TDX_SEAMCALL_UD (TDX_SW_ERROR | X86_TRAP_UD)
>
> +#define TDX_SEAMCALL_STATUS_MASK 0xFFFFFFFF00000000ULL
> +
...
> diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
> index 6ff4672c4181..215c00d76a94 100644
> --- a/arch/x86/kvm/vmx/tdx_errno.h
> +++ b/arch/x86/kvm/vmx/tdx_errno.h
> @@ -4,8 +4,6 @@
> #ifndef __KVM_X86_TDX_ERRNO_H
> #define __KVM_X86_TDX_ERRNO_H
>
> -#define TDX_SEAMCALL_STATUS_MASK 0xFFFFFFFF00000000ULL
> -
> /*
> * TDX SEAMCALL Status Codes (returned in RAX)
> */
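
For context on why the mask is wanted outside KVM, the following is a
minimal sketch of the "map the resulting failure to -EBUSY" idea from the
changelog. It assumes the upper 32 bits of the SEAMCALL return value hold
the status class and the low 32 bits hold call-specific detail. The status
value EXAMPLE_STATUS_UPDATE_BUSY and the helper name are hypothetical
placeholders, not the constants or code from this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define TDX_SEAMCALL_STATUS_MASK	0xFFFFFFFF00000000ULL

/* Hypothetical placeholder; the real status code is defined by the TDX module ABI. */
#define EXAMPLE_STATUS_UPDATE_BUSY	0x8000FF0000000000ULL

/* A status class must not overlap the detail bits that the mask strips. */
static_assert((EXAMPLE_STATUS_UPDATE_BUSY & ~TDX_SEAMCALL_STATUS_MASK) == 0,
	      "status class must live in the upper 32 bits");

/*
 * Hypothetical helper: translate a shutdown SEAMCALL return value into an
 * errno so that callers can tell an update race apart from other failures.
 */
static int example_shutdown_status_to_errno(uint64_t seamcall_ret)
{
	if (seamcall_ret == 0)
		return 0;

	/* Compare only the status class; ignore the low 32 detail bits. */
	if ((seamcall_ret & TDX_SEAMCALL_STATUS_MASK) == EXAMPLE_STATUS_UPDATE_BUSY)
		return -EBUSY;

	return -EIO;
}
```

Userspace that sees -EBUSY from an update attempt can retry later, while
any other error is treated as a real failure.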
^ permalink raw reply
* Re: [PATCH v2 09/19] PCI/TSM: Support creating encrypted MMIO descriptors via TDISP Report
From: Jason Gunthorpe @ 2026-04-06 22:21 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: Xu Yilun, Aneesh Kumar K.V, Dan Williams, linux-coco, linux-pci,
gregkh, bhelgaas, alistair23, lukas, Arnd Bergmann
In-Reply-To: <70912675-0737-4ebf-8ba0-ab9a2e493bbe@amd.com>
On Tue, Apr 07, 2026 at 08:08:51AM +1000, Alexey Kardashevskiy wrote:
>
>
> On 4/4/26 01:08, Jason Gunthorpe wrote:
> > On Fri, Apr 03, 2026 at 11:41:25PM +1100, Alexey Kardashevskiy wrote:
> > >
> > >
> > > On 30/3/26 22:49, Jason Gunthorpe wrote:
> > > > On Mon, Mar 30, 2026 at 04:47:44PM +1100, Alexey Kardashevskiy wrote:
> > > >
> > > > > What do I miss? Thanks,
> > > >
> > > > You can't tell where things start so there is no way to relate the
> > > > offsets to something the kernel can understand.
> > >
> > > Reported ranges have BAR indexes and start addresses (with the
> > > reported MMIO offset added), and the first reported range starts at
> > > the first 4K of that BAR.
> >
> > I was told this is not the case, the first reported range can start
> > anywhere in the BAR?
>
> This is what I am trying to clarify - if all ranges must be reported
> (as some think this is what the PCIe spec says), then no, not
> anywhere.
>
> pcie r7, Table 11-16 TDI Report Structure, MMIO_RANGE:
>
> "Each MMIO Range of the TDI is reported with the MMIO reporting offset added."
I think the argument was something like it didn't have to report
non-secure ranges? But I don't know, it was hashed out in some thread
for ARM and then I know our folks looked at it and nobody pushed back
to insist that every single byte of the BAR had to be covered by a
reported range.
I wouldn't take the sentence you quoted as confirmation, you need a
sentence that says every single byte of the BAR is covered by a single
reported range.
Jason
^ permalink raw reply