Linux Confidential Computing Development

Linux Confidential Computing Development
 help / color / mirror / Atom feed

* Re: [PATCH v2 00/16] fs,x86/resctrl: Add kernel-mode (e.g., PLZA) support to the resctrl subsystem
From: Reinette Chatre @ 2026-04-20 22:03 UTC (permalink / raw)
  To: Babu Moger, Moger, Babu, corbet@lwn.net, tony.luck@intel.com,
	Dave.Martin@arm.com, james.morse@arm.com, tglx@kernel.org,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
  Cc: skhan@linuxfoundation.org, x86@kernel.org, hpa@zytor.com,
	peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	vschneid@redhat.com, kas@kernel.org, rick.p.edgecombe@intel.com,
	akpm@linux-foundation.org, pmladek@suse.com,
	rdunlap@infradead.org, dapeng1.mi@linux.intel.com,
	kees@kernel.org, elver@google.com, paulmck@kernel.org,
	lirongqing@baidu.com, safinaskar@gmail.com, fvdl@google.com,
	seanjc@google.com, pawan.kumar.gupta@linux.intel.com,
	xin@zytor.com, tiala@microsoft.com, chang.seok.bae@intel.com,
	Lendacky, Thomas, elena.reshetova@intel.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-coco@lists.linux.dev, kvm@vger.kernel.org,
	eranian@google.com, peternewman@google.com
In-Reply-To: <39e0c786-cc35-4555-bfb9-ff7cd758c423@amd.com>

Hi Babu,

On 4/20/26 12:38 PM, Babu Moger wrote:
> On 4/9/26 22:41, Reinette Chatre wrote:
>> On 4/9/26 4:42 PM, Moger, Babu wrote:
>>> On 4/9/2026 3:50 PM, Reinette Chatre wrote:
>>>> Hi Babu,
>>>>
>>>> On 4/9/26 11:05 AM, Moger, Babu wrote:
>>>>> On 4/9/2026 12:26 PM, Reinette Chatre wrote:
>>>>>> On 4/9/26 10:19 AM, Moger, Babu wrote:
>>>>>>> On 4/8/2026 6:41 PM, Reinette Chatre wrote:
>>>>>>
>>>>>>>> When the user switches to either "global_assign_ctrl_inherit_mon_per_cpu" or
>>>>>>>> 'global_assign_ctrl_assign_mon_per_cpu" then "info/kernel_mode_assignment" is created
>>>>>>>> (or made visible to user space) and is expected to point to default group.
>>>>>>>> User can change the group using "info/kernel_mode_assignment" at this point.
>>>>>>>>
>>>>>>>> If the current scenario is below ...
>>>>>>>>        # cat info/kernel_mode
>>>>>>>>        [global_assign_ctrl_inherit_mon_per_cpu]
>>>>>>>>        inherit_ctrl_and_mon
>>>>>>>>        global_assign_ctrl_assign_mon_per_cpu
>>>>>>>>
>>>>>>>> ... then "info/kernel_mode_assignment" will exist but what it should contain if
>>>>>>>> user switches mode at this point may be up for discussion.
>>>>>>>>
>>>>>>>> option 1)
>>>>>>>> When user switches mode to "global_assign_ctrl_assign_mon_per_cpu" then
>>>>>>>> the resource group in "info/kernel_mode_assignment" is reset to the
>>>>>>>> default group and all CPUs PLZA state reset to match. The kernel_mode_cpus
>>>>>>>> and kernel_mode_cpuslist files become visible in default resource group
>>>>>>>> and they contain "all online CPUs".
>>>>>>>>
>>>>>>>> option 2)
>>>>>>>> When user switches mode to "global_assign_ctrl_assign_mon_per_cpu" then
>>>>>>>> the resource group in "info/kernel_mode_assignment" is kept and all
>>>>>>>> CPUs PLZA state set to match it while also keeping the current
>>>>>>>> values of that resource group's kernel_mode_cpus and kernel_mode_cpuslist
>>>>>>>> files.
>>>>>>>>
>>>>>>>> I am leaning towards "option 1" to keep it consistent with a switch from
>>>>>>>> "inherit_ctrl_and_mon" and being deterministic about how a mode is started with
>>>>>>>
>>>>>>> Yes. The "option 1" seems appropriate.
>>>>>>>
>>>>>>>> a clean slate. What are your thoughts? What would be use case where a user would
>>>>>>>> want to switch between "global_assign_ctrl_inherit_mon_per_cpu" and
>>>>>>>> "global_assign_ctrl_assign_mon_per_cpu" to just switch rmid_en on and off?
>>>>>>>
>>>>>>>
>>>>>>> This is a bit tricky.
>>>>>>>
>>>>>>> Currently, our requirement is to have a CTRL_MON group for
>>>>>>> global_assign_ctrl_inherit_mon_per_cpu. In this scenario, we use the
>>>>>>> group’s CLOSID for PLZA configuration, and RMID is not used (rmid_en
>>>>>>> = 0) when setting up PLZA.
>>>>>>>
>>>>>>> Our requirement is also to have a CTRL_MON/MON group for
>>>>>>> global_assign_ctrl_assign_mon_per_cpu. In this case as well, the
>>>>>>> group’s CLOSID and RMID (rmid_en = 1)  both are used configure PLZA.
>>>>>>
>>>>>> ah, right. Good catch.
>>>>>>
>>>>>>>
>>>>>>> Actually, we should not allow these changes from
>>>>>>> global_assign_ctrl_inherit_mon_per_cpu  to
>>>>>>> global_assign_ctrl_assign_mon_per_cpu or visa versa.
>>>>>>
>>>>>> resctrl could allow it but as part of the switch it resets the "kernel mode group" to
>>>>>> be the default group every time? This would be the "option 1" above.
>>>>>
>>>>> Other options.
>>>>>
>>>>> Allow global_assign_ctrl_inherit_mon_per_cpu -> global_assign_ctrl_assign_mon_per_cpu. As part of the switch, reset the "kernel mode group" to the default group.
>>>>>
>>>>> Allow global_assign_ctrl_assign_mon_per_cpu -> global_assign_ctrl_inherit_mon_per_cpu. In this case switch
>>>>> to CTRL_MON/MON -> CTRL_MON.
>>>>>
>>>>
>>>> ok. Could you please return the courtesy of providing feedback on the
>>>> suggestion you are responding to and also include the motivation why your
>>>> suggestion is the better option?
>>>
>>> Yea. Sure.
>>>
>>> We need to allow the switch between the modes. Otherwise only way to reset is to remount the resctrl filesystem. That is not a good option.
>>>
>>> Allow global_assign_ctrl_inherit_mon_per_cpu -> global_assign_ctrl_assign_mon_per_cpu. As part of the switch, reset the "kernel mode group" to the default group.
>>>
>>> This option is same as you suggested.
>>>
>>> Allow global_assign_ctrl_assign_mon_per_cpu -> global_assign_ctrl_inherit_mon_per_cpu. In this case switch
>>> to CTRL_MON/MON -> CTRL_MON. This option basically disables monitor (rmid_en=0). It is less disruptive. Move is between child group to parent group.
>>
>> ok. I am concerned that this creates an inconsistent interface. Specifically, sometimes
>> when switching the mode the kernel group will reset and sometimes it won't. This inconsistency
>> may be more apparent when writing the user documentation as part of this work. If you are
>> able to clearly explain how this resctrl fs interface behaves (this cannot be about PLZA
>> internals as above) then this could work.
> Started working on these changes. May be it is better to discuss this before to avoid one more revision.
> 
> 
> The current mode change behavior is very restrictive.
> 
> For example:
> 
> # cat info/kernel_mode
>       inherit_ctrl_and_mon
>       [global_assign_ctrl_assign_mon_per_cpu]
>        global_assign_ctrl_inherit_mon_per_cpu
> 
> 
> # cat info/kernel_mode_assignment
>      ctrl1/mon1/
> 
> In this state, we cannot change kernel_mode to inherit_ctrl_and_mon. The expectation, however, is that inherit_ctrl_and_mon should always map to the RDTCTRL_GROUP.

Could you please provide details behind the "we cannot change kernel_mode to
inherit_ctrl_and_mon" statement? Why is this not possible?

I do not see "inherit_ctrl_and_mon" to map to *any* group though. Expectation is
that when user changes mode to "inherit_ctrl_and_mon" then 
info/kernel_mode_assignment would become invisible to user space.

> 
> 
> A similar issue exists when switching between
> global_assign_ctrl_inherit_mon_per_cpu and
> global_assign_ctrl_assign_mon_per_cpu (in either direction).

What similar issue? Could you please provide some detail to help me understand what the
issue is? Isn't this what we just discussed in thread you are replying to? That is, you were
looking at developing that interface that I viewed as "inconsistent"?

> 
> The same problem also occurs when modifying the kernel_mode_assignment group. If the current group is an RDTMON_GROUP, we can't assign another
> RDTCTRL_GROUP without changing both mode and group together.

Same problem? Still unclear what the problem is. So far three problems are mentioned but I am
not able to decipher what the problems are. Could you please elaborate?
When modifying the kernel_mode_assignment group I expect that the interface
will only accept a MON group when in "assign_mon" mode and a CTRL group when
in "inherit_mon" mode. 
I do not understand what you mean with *another* RDTCTRL_GROUP. Only one group
can be assigned at any time, no?

Reinette

^ permalink raw reply

* Re: [PATCH v2 00/16] fs,x86/resctrl: Add kernel-mode (e.g., PLZA) support to the resctrl subsystem
From: Babu Moger @ 2026-04-20 19:38 UTC (permalink / raw)
  To: Reinette Chatre, Moger, Babu, corbet@lwn.net, tony.luck@intel.com,
	Dave.Martin@arm.com, james.morse@arm.com, tglx@kernel.org,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
  Cc: skhan@linuxfoundation.org, x86@kernel.org, hpa@zytor.com,
	peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	vschneid@redhat.com, kas@kernel.org, rick.p.edgecombe@intel.com,
	akpm@linux-foundation.org, pmladek@suse.com,
	rdunlap@infradead.org, dapeng1.mi@linux.intel.com,
	kees@kernel.org, elver@google.com, paulmck@kernel.org,
	lirongqing@baidu.com, safinaskar@gmail.com, fvdl@google.com,
	seanjc@google.com, pawan.kumar.gupta@linux.intel.com,
	xin@zytor.com, tiala@microsoft.com, chang.seok.bae@intel.com,
	Lendacky, Thomas, elena.reshetova@intel.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-coco@lists.linux.dev, kvm@vger.kernel.org,
	eranian@google.com, peternewman@google.com
In-Reply-To: <68a551ea-d9f0-436a-9bef-e35fd027bb95@intel.com>

Hi Reinette,

On 4/9/26 22:41, Reinette Chatre wrote:
> Hi Babu,
> 
> On 4/9/26 4:42 PM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 4/9/2026 3:50 PM, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 4/9/26 11:05 AM, Moger, Babu wrote:
>>>> On 4/9/2026 12:26 PM, Reinette Chatre wrote:
>>>>> On 4/9/26 10:19 AM, Moger, Babu wrote:
>>>>>> On 4/8/2026 6:41 PM, Reinette Chatre wrote:
>>>>>
>>>>>>> When the user switches to either "global_assign_ctrl_inherit_mon_per_cpu" or
>>>>>>> 'global_assign_ctrl_assign_mon_per_cpu" then "info/kernel_mode_assignment" is created
>>>>>>> (or made visible to user space) and is expected to point to default group.
>>>>>>> User can change the group using "info/kernel_mode_assignment" at this point.
>>>>>>>
>>>>>>> If the current scenario is below ...
>>>>>>>        # cat info/kernel_mode
>>>>>>>        [global_assign_ctrl_inherit_mon_per_cpu]
>>>>>>>        inherit_ctrl_and_mon
>>>>>>>        global_assign_ctrl_assign_mon_per_cpu
>>>>>>>
>>>>>>> ... then "info/kernel_mode_assignment" will exist but what it should contain if
>>>>>>> user switches mode at this point may be up for discussion.
>>>>>>>
>>>>>>> option 1)
>>>>>>> When user switches mode to "global_assign_ctrl_assign_mon_per_cpu" then
>>>>>>> the resource group in "info/kernel_mode_assignment" is reset to the
>>>>>>> default group and all CPUs PLZA state reset to match. The kernel_mode_cpus
>>>>>>> and kernel_mode_cpuslist files become visible in default resource group
>>>>>>> and they contain "all online CPUs".
>>>>>>>
>>>>>>> option 2)
>>>>>>> When user switches mode to "global_assign_ctrl_assign_mon_per_cpu" then
>>>>>>> the resource group in "info/kernel_mode_assignment" is kept and all
>>>>>>> CPUs PLZA state set to match it while also keeping the current
>>>>>>> values of that resource group's kernel_mode_cpus and kernel_mode_cpuslist
>>>>>>> files.
>>>>>>>
>>>>>>> I am leaning towards "option 1" to keep it consistent with a switch from
>>>>>>> "inherit_ctrl_and_mon" and being deterministic about how a mode is started with
>>>>>>
>>>>>> Yes. The "option 1" seems appropriate.
>>>>>>
>>>>>>> a clean slate. What are your thoughts? What would be use case where a user would
>>>>>>> want to switch between "global_assign_ctrl_inherit_mon_per_cpu" and
>>>>>>> "global_assign_ctrl_assign_mon_per_cpu" to just switch rmid_en on and off?
>>>>>>
>>>>>>
>>>>>> This is a bit tricky.
>>>>>>
>>>>>> Currently, our requirement is to have a CTRL_MON group for
>>>>>> global_assign_ctrl_inherit_mon_per_cpu. In this scenario, we use the
>>>>>> group’s CLOSID for PLZA configuration, and RMID is not used (rmid_en
>>>>>> = 0) when setting up PLZA.
>>>>>>
>>>>>> Our requirement is also to have a CTRL_MON/MON group for
>>>>>> global_assign_ctrl_assign_mon_per_cpu. In this case as well, the
>>>>>> group’s CLOSID and RMID (rmid_en = 1)  both are used configure PLZA.
>>>>>
>>>>> ah, right. Good catch.
>>>>>
>>>>>>
>>>>>> Actually, we should not allow these changes from
>>>>>> global_assign_ctrl_inherit_mon_per_cpu  to
>>>>>> global_assign_ctrl_assign_mon_per_cpu or visa versa.
>>>>>
>>>>> resctrl could allow it but as part of the switch it resets the "kernel mode group" to
>>>>> be the default group every time? This would be the "option 1" above.
>>>>
>>>> Other options.
>>>>
>>>> Allow global_assign_ctrl_inherit_mon_per_cpu -> global_assign_ctrl_assign_mon_per_cpu. As part of the switch, reset the "kernel mode group" to the default group.
>>>>
>>>> Allow global_assign_ctrl_assign_mon_per_cpu -> global_assign_ctrl_inherit_mon_per_cpu. In this case switch
>>>> to CTRL_MON/MON -> CTRL_MON.
>>>>
>>>
>>> ok. Could you please return the courtesy of providing feedback on the
>>> suggestion you are responding to and also include the motivation why your
>>> suggestion is the better option?
>>
>> Yea. Sure.
>>
>> We need to allow the switch between the modes. Otherwise only way to reset is to remount the resctrl filesystem. That is not a good option.
>>
>> Allow global_assign_ctrl_inherit_mon_per_cpu -> global_assign_ctrl_assign_mon_per_cpu. As part of the switch, reset the "kernel mode group" to the default group.
>>
>> This option is same as you suggested.
>>
>> Allow global_assign_ctrl_assign_mon_per_cpu -> global_assign_ctrl_inherit_mon_per_cpu. In this case switch
>> to CTRL_MON/MON -> CTRL_MON. This option basically disables monitor (rmid_en=0). It is less disruptive. Move is between child group to parent group.
> 
> ok. I am concerned that this creates an inconsistent interface. Specifically, sometimes
> when switching the mode the kernel group will reset and sometimes it won't. This inconsistency
> may be more apparent when writing the user documentation as part of this work. If you are
> able to clearly explain how this resctrl fs interface behaves (this cannot be about PLZA
> internals as above) then this could work.
Started working on these changes. May be it is better to discuss this 
before to avoid one more revision.


The current mode change behavior is very restrictive.

For example:

# cat info/kernel_mode
       inherit_ctrl_and_mon
       [global_assign_ctrl_assign_mon_per_cpu]
        global_assign_ctrl_inherit_mon_per_cpu


# cat info/kernel_mode_assignment
      ctrl1/mon1/

In this state, we cannot change kernel_mode to inherit_ctrl_and_mon. The 
expectation, however, is that inherit_ctrl_and_mon should always map to 
the RDTCTRL_GROUP.


A similar issue exists when switching between
global_assign_ctrl_inherit_mon_per_cpu and
global_assign_ctrl_assign_mon_per_cpu (in either direction).

The same problem also occurs when modifying the kernel_mode_assignment 
group. If the current group is an RDTMON_GROUP, we can't assign another
RDTCTRL_GROUP without changing both mode and group together.


To address this, I propose changing the mode and the group together.

System boots up with following defaults:

# cat info/kernel_mode
       [inherit_ctrl_and_mon]
       global_assign_ctrl_assign_mon_per_cpu
       global_assign_ctrl_inherit_mon_per_cpu

# cat info/kernel_mode_assignment
      inherit_ctrl_and_mon://


# echo "global_assign_ctrl_assign_mon_per_cpu:ctrl1/mon1/" > 
info/kernel_mode_assignment

# cat info/kernel_mode_assignment
      global_assign_ctrl_assign_mon_per_cpu:ctrl1/mon1/

# cat info/kernel_mode
       inherit_ctrl_and_mon
       [global_assign_ctrl_assign_mon_per_cpu]
       global_assign_ctrl_inherit_mon_per_cpu


# echo "inherit_ctrl_and_mon://" > info/kernel_mode_assignment

# cat info/kernel_mode_assignment
    inherit_ctrl_and_mon://


# cat info/kernel_mode
       [inherit_ctrl_and_mon]
       global_assign_ctrl_assign_mon_per_cpu
        global_assign_ctrl_inherit_mon_per_cpu


# echo "global_assign_ctrl_inherit_mon_per_cpu:ctrl1//"

# cat info/kernel_mode_assignment
    global_assign_ctrl_inherit_mon_per_cpu:ctrl1//

# cat info/kernel_mode
       inherit_ctrl_and_mon
       global_assign_ctrl_assign_mon_per_cpu
       [global_assign_ctrl_inherit_mon_per_cpu]


The interface "info/kernel_mode" becomes read-only,

The mode change and group change will be done with 
"info/kernel_mode_assignment"


I’m also planning to rename the kernel modes as follows:

inherit_ctrl_and_mon → shared_alloc_mon
global_assign_ctrl_inherit_mon_per_cpu → global_alloc_per_cpu
global_assign_ctrl_assign_mon_per_cpu → global_alloc_mon_per_cpu

What do you think?

Thanks
Babu


^ permalink raw reply

* Re: [PATCH v5 1/2] dma-mapping: introduce DMA_ATTR_CC_SHARED for shared memory
From: Jiri Pirko @ 2026-04-20  9:29 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: dri-devel, linaro-mm-sig, iommu, linux-media, sumit.semwal,
	benjamin.gaignard, Brian.Starkey, jstultz, tjmercier,
	christian.koenig, m.szyprowski, robin.murphy, jgg, leon,
	sean.anderson, ptesarik, catalin.marinas, suzuki.poulose,
	steven.price, thomas.lendacky, john.allen, ashish.kalra,
	suravee.suthikulpanit, linux-coco
In-Reply-To: <yq5atst6ywbl.fsf@kernel.org>

Mon, Apr 20, 2026 at 08:34:06AM +0200, aneesh.kumar@kernel.org wrote:
>Jiri Pirko <jiri@resnulli.us> writes:
>
>> From: Jiri Pirko <jiri@nvidia.com>
>>
>> Current CC designs don't place a vIOMMU in front of untrusted devices.
>> Instead, the DMA API forces all untrusted device DMA through swiotlb
>> bounce buffers (is_swiotlb_force_bounce()) which copies data into
>> shared memory on behalf of the device.
>>
>> When a caller has already arranged for the memory to be shared
>> via set_memory_decrypted(), the DMA API needs to know so it can map
>> directly using the unencrypted physical address rather than bounce
>> buffering. Following the pattern of DMA_ATTR_MMIO, add
>> DMA_ATTR_CC_SHARED for this purpose. Like the MMIO case, only the
>> caller knows what kind of memory it has and must inform the DMA API
>> for it to work correctly.
>>
>> Signed-off-by: Jiri Pirko <jiri@nvidia.com>
>> ---
>> v4->v5:
>> - rebased on top od dma-mapping-for-next
>> - s/decrypted/shared/
>> v3->v4:
>> - added some sanity checks to dma_map_phys and dma_unmap_phys
>> - enhanced documentation of DMA_ATTR_CC_DECRYPTED attr
>> v1->v2:
>> - rebased on top of recent dma-mapping-fixes
>> ---
>>  include/linux/dma-mapping.h | 10 ++++++++++
>>  include/trace/events/dma.h  |  3 ++-
>>  kernel/dma/direct.h         | 14 +++++++++++---
>>  kernel/dma/mapping.c        | 13 +++++++++++--
>>  4 files changed, 34 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
>> index 677c51ab7510..db8ab24a54f4 100644
>> --- a/include/linux/dma-mapping.h
>> +++ b/include/linux/dma-mapping.h
>> @@ -92,6 +92,16 @@
>>   * flushing.
>>   */
>>  #define DMA_ATTR_REQUIRE_COHERENT	(1UL << 12)
>> +/*
>> + * DMA_ATTR_CC_SHARED: Indicates the DMA mapping is shared (decrypted) for
>> + * confidential computing guests. For normal system memory the caller must have
>> + * called set_memory_decrypted(), and pgprot_decrypted must be used when
>> + * creating CPU PTEs for the mapping. The same shared semantic may be passed
>> + * to the vIOMMU when it sets up the IOPTE. For MMIO use together with
>> + * DMA_ATTR_MMIO to indicate shared MMIO. Unless DMA_ATTR_MMIO is provided
>> + * a struct page is required.
>> + */
>> +#define DMA_ATTR_CC_SHARED	(1UL << 13)
>>  
>>  /*
>>   * A dma_addr_t can hold any valid DMA or bus address for the platform.  It can
>> diff --git a/include/trace/events/dma.h b/include/trace/events/dma.h
>> index 63597b004424..31c9ddf72c9d 100644
>> --- a/include/trace/events/dma.h
>> +++ b/include/trace/events/dma.h
>> @@ -34,7 +34,8 @@ TRACE_DEFINE_ENUM(DMA_NONE);
>>  		{ DMA_ATTR_PRIVILEGED, "PRIVILEGED" }, \
>>  		{ DMA_ATTR_MMIO, "MMIO" }, \
>>  		{ DMA_ATTR_DEBUGGING_IGNORE_CACHELINES, "CACHELINES_OVERLAP" }, \
>> -		{ DMA_ATTR_REQUIRE_COHERENT, "REQUIRE_COHERENT" })
>> +		{ DMA_ATTR_REQUIRE_COHERENT, "REQUIRE_COHERENT" }, \
>> +		{ DMA_ATTR_CC_SHARED, "CC_SHARED" })
>>  
>>  DECLARE_EVENT_CLASS(dma_map,
>>  	TP_PROTO(struct device *dev, phys_addr_t phys_addr, dma_addr_t dma_addr,
>> diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
>> index b86ff65496fc..7140c208c123 100644
>> --- a/kernel/dma/direct.h
>> +++ b/kernel/dma/direct.h
>> @@ -89,16 +89,24 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
>>  	dma_addr_t dma_addr;
>>  
>>  	if (is_swiotlb_force_bounce(dev)) {
>> -		if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
>> -			return DMA_MAPPING_ERROR;
>> +		if (!(attrs & DMA_ATTR_CC_SHARED)) {
>> +			if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
>> +				return DMA_MAPPING_ERROR;
>>  
>> -		return swiotlb_map(dev, phys, size, dir, attrs);
>> +			return swiotlb_map(dev, phys, size, dir, attrs);
>> +		}
>> +	} else if (attrs & DMA_ATTR_CC_SHARED) {
>> +		return DMA_MAPPING_ERROR;
>>  	}
>>
>
>What is this check for? If we are requesting a DMA mapping with
>DMA_ATTR_CC_SHARED, shouldn’t it be allowed? If not, how would we reach

This is defensive. Only allows to map with DMA_ATTR_CC_SHARED set to
dev dev that does not support CC natively. This can be of course lifted,
if you have a case.


>the conditional below where we convert the physical address to a DMA
>address using phys_to_dma_unencrypted()?. Also, how is this supposed to
>interact with is_swiotlb_force_bounce()?”

You reach there when is_swiotlb_force_bounce(dev) is true and
DMA_ATTR_CC_SHARED is set. What am I missing?



>
>>  
>>  	if (attrs & DMA_ATTR_MMIO) {
>>  		dma_addr = phys;
>>  		if (unlikely(!dma_capable(dev, dma_addr, size, false)))
>>  			goto err_overflow;
>> +	} else if (attrs & DMA_ATTR_CC_SHARED) {
>> +		dma_addr = phys_to_dma_unencrypted(dev, phys);
>> +		if (unlikely(!dma_capable(dev, dma_addr, size, false)))
>> +			goto err_overflow;
>>  	} else {
>>  		dma_addr = phys_to_dma(dev, phys);
>>  		if (unlikely(!dma_capable(dev, dma_addr, size, true)) ||
>
>-aneesh

^ permalink raw reply

* Re: [PATCH v5 1/2] dma-mapping: introduce DMA_ATTR_CC_SHARED for shared memory
From: Aneesh Kumar K.V @ 2026-04-20  6:34 UTC (permalink / raw)
  To: Jiri Pirko, dri-devel, linaro-mm-sig, iommu, linux-media
  Cc: sumit.semwal, benjamin.gaignard, Brian.Starkey, jstultz,
	tjmercier, christian.koenig, m.szyprowski, robin.murphy, jgg,
	leon, sean.anderson, ptesarik, catalin.marinas, suzuki.poulose,
	steven.price, thomas.lendacky, john.allen, ashish.kalra,
	suravee.suthikulpanit, linux-coco
In-Reply-To: <20260325192352.437608-2-jiri@resnulli.us>

Jiri Pirko <jiri@resnulli.us> writes:

> From: Jiri Pirko <jiri@nvidia.com>
>
> Current CC designs don't place a vIOMMU in front of untrusted devices.
> Instead, the DMA API forces all untrusted device DMA through swiotlb
> bounce buffers (is_swiotlb_force_bounce()) which copies data into
> shared memory on behalf of the device.
>
> When a caller has already arranged for the memory to be shared
> via set_memory_decrypted(), the DMA API needs to know so it can map
> directly using the unencrypted physical address rather than bounce
> buffering. Following the pattern of DMA_ATTR_MMIO, add
> DMA_ATTR_CC_SHARED for this purpose. Like the MMIO case, only the
> caller knows what kind of memory it has and must inform the DMA API
> for it to work correctly.
>
> Signed-off-by: Jiri Pirko <jiri@nvidia.com>
> ---
> v4->v5:
> - rebased on top od dma-mapping-for-next
> - s/decrypted/shared/
> v3->v4:
> - added some sanity checks to dma_map_phys and dma_unmap_phys
> - enhanced documentation of DMA_ATTR_CC_DECRYPTED attr
> v1->v2:
> - rebased on top of recent dma-mapping-fixes
> ---
>  include/linux/dma-mapping.h | 10 ++++++++++
>  include/trace/events/dma.h  |  3 ++-
>  kernel/dma/direct.h         | 14 +++++++++++---
>  kernel/dma/mapping.c        | 13 +++++++++++--
>  4 files changed, 34 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 677c51ab7510..db8ab24a54f4 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -92,6 +92,16 @@
>   * flushing.
>   */
>  #define DMA_ATTR_REQUIRE_COHERENT	(1UL << 12)
> +/*
> + * DMA_ATTR_CC_SHARED: Indicates the DMA mapping is shared (decrypted) for
> + * confidential computing guests. For normal system memory the caller must have
> + * called set_memory_decrypted(), and pgprot_decrypted must be used when
> + * creating CPU PTEs for the mapping. The same shared semantic may be passed
> + * to the vIOMMU when it sets up the IOPTE. For MMIO use together with
> + * DMA_ATTR_MMIO to indicate shared MMIO. Unless DMA_ATTR_MMIO is provided
> + * a struct page is required.
> + */
> +#define DMA_ATTR_CC_SHARED	(1UL << 13)
>  
>  /*
>   * A dma_addr_t can hold any valid DMA or bus address for the platform.  It can
> diff --git a/include/trace/events/dma.h b/include/trace/events/dma.h
> index 63597b004424..31c9ddf72c9d 100644
> --- a/include/trace/events/dma.h
> +++ b/include/trace/events/dma.h
> @@ -34,7 +34,8 @@ TRACE_DEFINE_ENUM(DMA_NONE);
>  		{ DMA_ATTR_PRIVILEGED, "PRIVILEGED" }, \
>  		{ DMA_ATTR_MMIO, "MMIO" }, \
>  		{ DMA_ATTR_DEBUGGING_IGNORE_CACHELINES, "CACHELINES_OVERLAP" }, \
> -		{ DMA_ATTR_REQUIRE_COHERENT, "REQUIRE_COHERENT" })
> +		{ DMA_ATTR_REQUIRE_COHERENT, "REQUIRE_COHERENT" }, \
> +		{ DMA_ATTR_CC_SHARED, "CC_SHARED" })
>  
>  DECLARE_EVENT_CLASS(dma_map,
>  	TP_PROTO(struct device *dev, phys_addr_t phys_addr, dma_addr_t dma_addr,
> diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
> index b86ff65496fc..7140c208c123 100644
> --- a/kernel/dma/direct.h
> +++ b/kernel/dma/direct.h
> @@ -89,16 +89,24 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
>  	dma_addr_t dma_addr;
>  
>  	if (is_swiotlb_force_bounce(dev)) {
> -		if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
> -			return DMA_MAPPING_ERROR;
> +		if (!(attrs & DMA_ATTR_CC_SHARED)) {
> +			if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
> +				return DMA_MAPPING_ERROR;
>  
> -		return swiotlb_map(dev, phys, size, dir, attrs);
> +			return swiotlb_map(dev, phys, size, dir, attrs);
> +		}
> +	} else if (attrs & DMA_ATTR_CC_SHARED) {
> +		return DMA_MAPPING_ERROR;
>  	}
>

What is this check for? If we are requesting a DMA mapping with
DMA_ATTR_CC_SHARED, shouldn’t it be allowed? If not, how would we reach
the conditional below where we convert the physical address to a DMA
address using phys_to_dma_unencrypted()?. Also, how is this supposed to
interact with is_swiotlb_force_bounce()?”

>  
>  	if (attrs & DMA_ATTR_MMIO) {
>  		dma_addr = phys;
>  		if (unlikely(!dma_capable(dev, dma_addr, size, false)))
>  			goto err_overflow;
> +	} else if (attrs & DMA_ATTR_CC_SHARED) {
> +		dma_addr = phys_to_dma_unencrypted(dev, phys);
> +		if (unlikely(!dma_capable(dev, dma_addr, size, false)))
> +			goto err_overflow;
>  	} else {
>  		dma_addr = phys_to_dma(dev, phys);
>  		if (unlikely(!dma_capable(dev, dma_addr, size, true)) ||

-aneesh

^ permalink raw reply

* Re: [PATCH v2 03/31] x86/virt/tdx: Add tdx_page_array helpers for new TDX Module objects
From: Xu Yilun @ 2026-04-19  9:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-coco, linux-pci, dan.j.williams, x86, chao.gao, dave.jiang,
	baolu.lu, yilun.xu, zhenzhong.duan, kvm, rick.p.edgecombe,
	dave.hansen, kas, xiaoyao.li, vishal.l.verma, linux-kernel
In-Reply-To: <69e2c4134f1ef_147c801001a@djbw-dev.notmuch>

> > +#define HPA_LIST_INFO_FIRST_ENTRY	GENMASK_U64(11, 3)
> > +#define HPA_LIST_INFO_PFN		GENMASK_U64(51, 12)
> > +#define HPA_LIST_INFO_LAST_ENTRY	GENMASK_U64(63, 55)
> > +
> > +static u64 __maybe_unused hpa_list_info_assign_raw(struct tdx_page_array *array)
> 
> 2 quick comments:
> 
> * I do not understand shipping a __maybe_unused helper in patch 3 that
>   does not get used until patch10.

You once had a comment wanting to see how a tdx_page_array collapses to
a 64-bit raw value for SEAMCALLs in the same patch. So I move the
helpers earlier. Do you want to change them back?

Personally, I'd like to keep them here, to better align with the
illustration in commit log about why we need the tdx_page_array.

> 
> * The "assign_raw" verb feels strange. I think this probably just want
>   to be called: to_hpa_list_info(struct tdx_page_array *)

It's a better name, thanks.

[...]

> > +{
> > +	unsigned long pfn;
> > +
> > +	if (array->nents == 1)
> > +		pfn = page_to_pfn(array->pages[array->offset]);
> > +	else
> > +		pfn = PFN_DOWN(virt_to_phys(array->root));
> > +
> > +	return FIELD_PREP(HPA_ARRAY_T_PFN, pfn) |
> > +	       FIELD_PREP(HPA_ARRAY_T_SIZE, array->nents - 1);
> > +}
> > +
> > +static u64 __maybe_unused hpa_array_t_release_raw(struct tdx_page_array *array)
> 
> It seems too subtle that this function sometimes returns zero and
> sometimes returns a page that the TDX module will clobber with data that
> we do not care about.
> 
> It is also not clear that "0" is what the module considers a valid value
> that meets "checks its validity for forward compatibility". I guess we

It is the TDX Module's requirement, which is 'too subtle'. TDX Module
tries to keep align with the singleton definition for its output
hpa_array_t. If TDX Module wants to output multiple released pages, it
requires VMM to provide a root page HPA (in input register) so it can
write HPA list on the root page. But if it outputs one released page, it
directly writes page0 HPA in output register, and doesn't need a root
page HPA in input register and enforce its value 0.

That's why we return 0 on singleton mode, otherwise a root page HPA.

> get lucky because all of the calls that need this presently are
> multi-page cases?

Let me experiment, see if we have chance to simplify things.

> 
> I would feel better if this always returned the root HPA and was called

No, we can't. We must provide 0 for singleton mode. So I think maybe

      to_hpa_array_t_released()

Anyway, Linux doesn't need the output hpa_array_t. I've already raised
to Module team that don't enforce the medium page input. If VMM doesn't
provide the page, don't bother fill it.

> something like:
> 
>     to_output_clobber(), or to_aux_clobber()
> 
> ...to make it clear that whatever was there before gets destroyed.
> 

^ permalink raw reply

* Re: [PATCH v2 05/31] x86/virt/tdx: Extend tdx_page_array to support IOMMU_MT
From: Xu Yilun @ 2026-04-19  8:33 UTC (permalink / raw)
  To: Dan Williams
  Cc: Edgecombe, Rick P, Gao, Chao, Xu, Yilun, x86@kernel.org,
	kas@kernel.org, baolu.lu@linux.intel.com,
	dave.hansen@linux.intel.com, Li, Xiaoyao, Williams, Dan J,
	Jiang, Dave, linux-pci@vger.kernel.org,
	linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org,
	Duan, Zhenzhong, Verma, Vishal L, kvm@vger.kernel.org
In-Reply-To: <69e2c9334cbf7_147c8010040@djbw-dev.notmuch>

On Fri, Apr 17, 2026 at 04:58:43PM -0700, Dan Williams wrote:
> Xu Yilun wrote:
> [..]
> > > 
> > > I'm drafting some changes and make the tdx_page_array look like:
> > > 
> > >   struct tdx_page_array {
> > > 	/* public: */
> > > 	unsigned int nr_pages;
> > > 	struct page **pages;
> > > 
> > > 	/* private: */
> > > 	u64 *root;
> > > 	bool flush_on_free;
> 
> How about "need_phymem_page_wbinvd"?

Yes.

> 
> That makes it a bit more greppable and not to be confused with other
> flushing.
> 
> [..]
> > Hi, I end up made the following changes on top of this series:
> > 
> > -------8<--------
> > 
> >  arch/x86/include/asm/tdx.h            |  32 +-
> >  arch/x86/virt/vmx/tdx/tdx.c           | 561 ++++++++------------------
> >  drivers/virt/coco/tdx-host/tdx-host.c | 179 ++++++--
> >  3 files changed, 316 insertions(+), 456 deletions(-)
> > 
> > +		ret = tdx_ext_mem_setup(nr_pages, &ext_mem);
> >  		if (ret)
> > +			return ret;
> >  	}
> >  
> > +	ret = tdx_ext_init();
> > +	if (ret)
> > +		goto out_remove_ext_mem;
> > +
> >  	/*
> > +	 * Extensions memory is never reclaimed once assigned, stop tracking it
> > +	 * and free the tracking structures.
> >  	 */
> > +	tdx_page_array_free(ext_mem.chunk);
> 
> Wait, these pages belong to the module now, they can't be freed, or I am
> missing something?

With this new solution, tdx_page_array is downgraded to a descriptor,
doesn't manage the actual data pages/memory any more. So
tdx_page_array_free() will not free data pages, only frees the
tdx_page_array descriptor.

> 
> > +	kfree(ext_mem.pages);
> 
> Releasing this makes sense.
> 
> >  
> >  	pr_info("%lu KB allocated for TDX Module Extensions\n",
> >  		nr_pages * PAGE_SIZE / 1024);
> >  
> >  	return 0;
> >  
> > -out_flush:
> > -	if (ext_mem)
> > +out_remove_ext_mem:
> > +	if (nr_pages) {
> > +		/*
> > +		 * TDH.EXT.MEM.ADD only collects required memory. TDX.EXT.INIT
> > +		 * does the actual initialization so if it fails some pages may
> > +		 * have been touched by the TDX module, flush cache before
> > +		 * returning these pages to kernel.
> > +		 */
> >  		wbinvd_on_all_cpus();
> > +		tdx_ext_mem_remove(&ext_mem);
> 
> This only releases the last populated chunk, not all previous chunks,
> right?

Not true. ext_mem stores all the data pages and the reusable descriptor
'chunk' for SEAMCALL. tdx_ext_mem_remove() removes all the data pages
and the 'chunk'.

^ permalink raw reply

* Re: [PATCH v2 04/31] x86/virt/tdx: Support allocating contiguous pages for tdx_page_array
From: Dan Williams @ 2026-04-18  0:05 UTC (permalink / raw)
  To: Xu Yilun, linux-coco, linux-pci, dan.j.williams, x86
  Cc: chao.gao, dave.jiang, baolu.lu, yilun.xu, yilun.xu,
	zhenzhong.duan, kvm, rick.p.edgecombe, dave.hansen, kas,
	xiaoyao.li, vishal.l.verma, linux-kernel
In-Reply-To: <20260327160132.2946114-5-yilun.xu@linux.intel.com>

Xu Yilun wrote:
> The current tdx_page_array implementation allocates scattered order-0
> pages. However, some TDX Module operations benefit from contiguous
> physical memory. E.g. Enabling TDX Module Extensions (an optional TDX
> feature) requires ~50MB memory and never returns. Such allocation
> would at worst cause ~25GB permanently fragmented memory if each
> allocated page is from a different 2M region.
> 
> Support allocating contiguous pages for tdx_page_array by making the
> allocation method configurable. Change the tdx_page_array_alloc() to
> accept a custom allocation function pointer and a context parameter.
> Wrap the specific allocation into a tdx_page_array_alloc_contig()
> helper.
> 
> The foreseeable caller will allocate ~50MB memory with this helper,
> exceeding the maximum HPAs (512) a root page can hold, the typical usage
> will be:
> 
>  - struct tdx_page_array *array = tdx_page_array_alloc_contig(nr_pages);
>  - for each 512-page bulk
>    - tdx_page_array_populate(array, offset);
>    - seamcall(TDH_XXX_ADD, array, ...);
> 
> The configurable allocation method would also benefit more
> tdx_page_array usages. TDX Module may require more specific memory
> layouts encoded in the root page. Will introduce them in following
> patches.

Will skip this one as I see it gets massively reworked in your
incremental update. The change to make the caller responsible for
creating the page array is great.

^ permalink raw reply

* Re: [PATCH v2 05/31] x86/virt/tdx: Extend tdx_page_array to support IOMMU_MT
From: Dan Williams @ 2026-04-17 23:58 UTC (permalink / raw)
  To: Xu Yilun, Edgecombe, Rick P
  Cc: Gao, Chao, Xu, Yilun, x86@kernel.org, kas@kernel.org,
	baolu.lu@linux.intel.com, dave.hansen@linux.intel.com,
	Li, Xiaoyao, Williams, Dan J, Jiang, Dave,
	linux-pci@vger.kernel.org, linux-coco@lists.linux.dev,
	linux-kernel@vger.kernel.org, Duan, Zhenzhong, Verma, Vishal L,
	kvm@vger.kernel.org
In-Reply-To: <aeBufjwO3t7XzVle@yilunxu-OptiPlex-7050>

Xu Yilun wrote:
[..]
> > 
> > I'm drafting some changes and make the tdx_page_array look like:
> > 
> >   struct tdx_page_array {
> > 	/* public: */
> > 	unsigned int nr_pages;
> > 	struct page **pages;
> > 
> > 	/* private: */
> > 	u64 *root;
> > 	bool flush_on_free;

How about "need_phymem_page_wbinvd"?

That makes it a bit more greppable and not to be confused with other
flushing.

[..]
> Hi, I end up made the following changes on top of this series:
> 
> -------8<--------
> 
>  arch/x86/include/asm/tdx.h            |  32 +-
>  arch/x86/virt/vmx/tdx/tdx.c           | 561 ++++++++------------------
>  drivers/virt/coco/tdx-host/tdx-host.c | 179 ++++++--
>  3 files changed, 316 insertions(+), 456 deletions(-)
> 
> +		ret = tdx_ext_mem_setup(nr_pages, &ext_mem);
>  		if (ret)
> +			return ret;
>  	}
>  
> +	ret = tdx_ext_init();
> +	if (ret)
> +		goto out_remove_ext_mem;
> +
>  	/*
> +	 * Extensions memory is never reclaimed once assigned, stop tracking it
> +	 * and free the tracking structures.
>  	 */
> +	tdx_page_array_free(ext_mem.chunk);

Wait, these pages belong to the module now, they can't be freed, or I am
missing something?

> +	kfree(ext_mem.pages);

Releasing this makes sense.

>  
>  	pr_info("%lu KB allocated for TDX Module Extensions\n",
>  		nr_pages * PAGE_SIZE / 1024);
>  
>  	return 0;
>  
> -out_flush:
> -	if (ext_mem)
> +out_remove_ext_mem:
> +	if (nr_pages) {
> +		/*
> +		 * TDH.EXT.MEM.ADD only collects required memory. TDX.EXT.INIT
> +		 * does the actual initialization so if it fails some pages may
> +		 * have been touched by the TDX module, flush cache before
> +		 * returning these pages to kernel.
> +		 */
>  		wbinvd_on_all_cpus();
> +		tdx_ext_mem_remove(&ext_mem);

This only releases the last populated chunk, not all previous chunks,
right?

^ permalink raw reply

* Re: [PATCH v2 03/31] x86/virt/tdx: Add tdx_page_array helpers for new TDX Module objects
From: Dan Williams @ 2026-04-17 23:36 UTC (permalink / raw)
  To: Xu Yilun, linux-coco, linux-pci, dan.j.williams, x86
  Cc: chao.gao, dave.jiang, baolu.lu, yilun.xu, yilun.xu,
	zhenzhong.duan, kvm, rick.p.edgecombe, dave.hansen, kas,
	xiaoyao.li, vishal.l.verma, linux-kernel
In-Reply-To: <20260327160132.2946114-4-yilun.xu@linux.intel.com>

Xu Yilun wrote:
[..]
> The usage of tdx_page_array will be in following patches.
> 
> Co-developed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
[..]
> +static struct tdx_page_array *
> +tdx_page_array_alloc(unsigned int nr_pages)
> +{
> +	struct tdx_page_array *array = NULL;
> +	struct page **pages = NULL;
> +	u64 *root = NULL;
> +	int ret;
> +
> +	if (!nr_pages)
> +		return NULL;
> +
> +	array = kzalloc_obj(*array);
> +	if (!array)
> +		goto out_free;
> +
> +	root = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +	if (!root)
> +		goto out_free;
> +
> +	pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);

This should now be:

    kzalloc_objs(struct page *, nr_pages);

...oh nevermind you caught that in your incremental fixup. ...but a
couple more comments below:

[..]
> +
> +#define HPA_LIST_INFO_FIRST_ENTRY	GENMASK_U64(11, 3)
> +#define HPA_LIST_INFO_PFN		GENMASK_U64(51, 12)
> +#define HPA_LIST_INFO_LAST_ENTRY	GENMASK_U64(63, 55)
> +
> +static u64 __maybe_unused hpa_list_info_assign_raw(struct tdx_page_array *array)

2 quick comments:

* I do not understand shipping a __maybe_unused helper in patch 3 that
  does not get used until patch10.

* The "assign_raw" verb feels strange. I think this probably just want
  to be called: to_hpa_list_info(struct tdx_page_array *)

> +{
> +	return FIELD_PREP(HPA_LIST_INFO_FIRST_ENTRY, 0) |
> +	       FIELD_PREP(HPA_LIST_INFO_PFN,
> +			  PFN_DOWN(virt_to_phys(array->root))) |
> +	       FIELD_PREP(HPA_LIST_INFO_LAST_ENTRY, array->nents - 1);
> +}
> +
> +#define HPA_ARRAY_T_PFN		GENMASK_U64(51, 12)
> +#define HPA_ARRAY_T_SIZE	GENMASK_U64(63, 55)
> +
> +static u64 __maybe_unused hpa_array_t_assign_raw(struct tdx_page_array *array)

to_hpa_array_t()

> +{
> +	unsigned long pfn;
> +
> +	if (array->nents == 1)
> +		pfn = page_to_pfn(array->pages[array->offset]);
> +	else
> +		pfn = PFN_DOWN(virt_to_phys(array->root));
> +
> +	return FIELD_PREP(HPA_ARRAY_T_PFN, pfn) |
> +	       FIELD_PREP(HPA_ARRAY_T_SIZE, array->nents - 1);
> +}
> +
> +static u64 __maybe_unused hpa_array_t_release_raw(struct tdx_page_array *array)

It seems too subtle that this function sometimes returns zero and
sometimes returns a page that the TDX module will clobber with data that
we do not care about.

It is also not clear that "0" is what the module considers a valid value
that meets "checks its validity for forward compatibility". I guess we
get lucky because all of the calls that need this presently are
multi-page cases?

I would feel better if this always returned the root HPA and was called
something like:

    to_output_clobber(), or to_aux_clobber()

...to make it clear that whatever was there before gets destroyed.

^ permalink raw reply

* Re: [RFC PATCH v5 00/45] TDX: Dynamic PAMT + S-EPT Hugepage
From: Edgecombe, Rick P @ 2026-04-17 20:01 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: kvm@vger.kernel.org, linux-coco@lists.linux.dev, Huang, Kai,
	Li, Xiaoyao, Zhao, Yan Y, dave.hansen@linux.intel.com,
	linux-kernel@vger.kernel.org, kas@kernel.org, mingo@redhat.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Tng, Ackerley,
	Yamahata, Isaku, Shahar, Sagi, tglx@kernel.org, bp@alien8.de,
	Annapurve, Vishal, x86@kernel.org
In-Reply-To: <aeJnAfIryxf5Bzpe@google.com>

On Fri, 2026-04-17 at 09:59 -0700, Sean Christopherson wrote:
> > Ya.  I'll apply 1-4 and create a stable tag for patch 1's commit for 7.2. 
> > If it won't throw a wrench in things for you, I'll wait until 7.1-rc2 to get
> > 'em applied.  I've been burned more than a few times when using rc1 as the
> > base for topic branches.

Sounds great from my side. Thanks!

^ permalink raw reply

* Re: [RFC PATCH v5 00/45] TDX: Dynamic PAMT + S-EPT Hugepage
From: Sean Christopherson @ 2026-04-17 16:59 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: x86@kernel.org, dave.hansen@linux.intel.com, kas@kernel.org,
	bp@alien8.de, mingo@redhat.com, pbonzini@redhat.com,
	tglx@kernel.org, Kai Huang, Ackerley Tng, Sagi Shahar,
	Vishal Annapurve, linux-kernel@vger.kernel.org, Yan Y Zhao,
	Xiaoyao Li, kvm@vger.kernel.org, linux-coco@lists.linux.dev,
	Isaku Yamahata, binbin.wu@linux.intel.com
In-Reply-To: <0f25c1c247c08f19d36ca774608bc8162d2454a4.camel@intel.com>

On Wed, Apr 15, 2026, Rick P Edgecombe wrote:
> On Wed, 2026-01-28 at 17:14 -0800, Sean Christopherson wrote:
> > As for landing these series, I think the fastest overall approach would be
> > to land patches 1-4 asap (tangentially related cleanups and fixes), agree
> > on a design (hopefully), and then hand control back to Rick and Yan to polish
> > their respective series for merge.
> 
> Sean, were you still planning to do this? I see you're, I think, asking for Dave
> to pull patch 1 here:
> https://lore.kernel.org/kvm/ac_bMKD7YGhMwUCf@google.com/
> 
> But actually he acked the version in this series,

Oh, right.  I forgot about that.

> so another option is you could just take patches 1-4?

Yeah, I think that makes the most sense.

> Please let me know if you are ok with that plan. We are growing stacks ahead of
> DPAMT and the other splintered pre-req work and looking to reduce it. It could
> be nice if we could take all the pre-req work through the same tree for that
> reason, and most of it is on the KVM side. Is it reasonable?

Ya.  I'll apply 1-4 and create a stable tag for patch 1's commit for 7.2.  If
it won't throw a wrench in things for you, I'll wait until 7.1-rc2 to get 'em
applied.  I've been burned more than a few times when using rc1 as the base for
topic branches.

Dave, holler if this doesn't work for you. 

^ permalink raw reply

* Re: [PATCH v13 00/48] arm64: Support for Arm CCA in KVM
From: Alper Gun @ 2026-04-16 17:44 UTC (permalink / raw)
  To: Suzuki K Poulose
  Cc: Steven Price, kvm, kvmarm, Catalin Marinas, Marc Zyngier,
	Will Deacon, James Morse, Oliver Upton, Zenghui Yu,
	linux-arm-kernel, linux-kernel, Joey Gouly, Alexandru Elisei,
	Christoffer Dall, Fuad Tabba, linux-coco, Ganapatrao Kulkarni,
	Gavin Shan, Shanker Donthineni, Aneesh Kumar K . V, Emi Kisanuki,
	Vishal Annapurve
In-Reply-To: <97d26e6e-b565-447f-95eb-2ece0755fe57@arm.com>

On Thu, Apr 16, 2026 at 4:05 AM Suzuki K Poulose <suzuki.poulose@arm.com> wrote:
>
> On 16/04/2026 00:27, Alper Gun wrote:
> > On Wed, Apr 15, 2026 at 4:01 AM Steven Price <steven.price@arm.com> wrote:
> >>
> >> On 14/04/2026 22:40, Alper Gun wrote:
> >>> On Wed, Mar 18, 2026 at 8:54 AM Steven Price <steven.price@arm.com> wrote:
> >>>>
> >>>> This series adds support for running protected VMs using KVM under the
> >>>> Arm Confidential Compute Architecture (CCA).
> >>>>
> >>>> New major version number! This now targets RMM v2.0-bet0[1]. And unlike
> >>>> for Linux this represents a significant change.
> >>>>
> >>>> RMM v2.0 brings with it the ability to configure the RMM to have the
> >>>> same page size as the host (so no more RMM_PAGE_SIZE and dealing with
> >>>> granules being different from host pages). It also introduces range
> >>>> based APIs for many operations which should be more efficient and
> >>>> simplifies the code in places.
> >>>>
> >>>> The handling of the GIC has changed, so the system registers are used to
> >>>> pass the GIC state rather than memory. This means fewer changes to the
> >>>> KVM code as it looks much like a normal VM in this respect.
> >>>>
> >>>> And of course the new uAPI introduced in the previous v12 posting is
> >>>> retained so that also remains simplified compared to earlier postings.
> >>>>
> >>>> The RMM support for v2.0 is still early and so this series includes a
> >>>> few hacks to ease the integration. Of note are that there are some RMM
> >>>> v1.0 SMCs added to paper over areas where the RMM implementation isn't
> >>>> quite ready for v2.0, and "SROs" (see below) are deferred to the final
> >>>> patch in the series.
> >>>>
> >>>> The PMU in RMM v2.0 requires more handling on the RMM-side (and
> >>>> therefore simplifies the implementation on Linux), but this isn't quite
> >>>> ready yet. The Linux side is implemented (but untested).
> >>>>
> >>>> PSCI still requires the VMM to provide the "target" REC for operations
> >>>> that affect another vCPU. This is likely to change in a future version
> >>>> of the specification. There's also a desire to force PSCI to be handled
> >>>> in the VMM for realm guests - this isn't implemented yet as I'm waiting
> >>>> for the dust to settle on the RMM interface first.
> >>>>
> >>>> Stateful RMI Operations
> >>>> -----------------------
> >>>>
> >>>> The RMM v2.0 spec brings a new concept of Stateful RMI Operations (SROs)
> >>>> which allow the RMM to complete an operation over several SMC calls and
> >>>> requesting/returning memory to the host. This has the benefit of
> >>>> allowing interrupts to be handled in the middle of an operation (by
> >>>> returning to the host to handle the interrupt without completing the
> >>>> operation) and enables the RMM to dynamically allocate memory for
> >>>> internal tracking purposes. One example of this is RMI_REC_CREATE no
> >>>> longer needs "auxiliary granules" provided upfront but can request the
> >>>> memory needed during the RMI_REC_CREATE operation.
> >>>>
> >>>> There are a fairly large number of operations that are defined as SROs
> >>>> in the specification, but current both Linux and RMM only have support
> >>>> for RMI_REC_CREATE and RMI_REC_DESTROY. There a number of TODOs/FIXMEs
> >>>> in the code where support is missing.
> >>>>
> >>>> Given the early stage support for this, the SRO handling is all confined
> >>>> to the final patch. This patch can be dropped to return to a pre-SRO
> >>>> state (albeit a mixture of RMM v1.0 and v2.0 APIs) for testing purposes.
> >>>>
> >>>> A future posting will reorder the series to move the generic SRO support
> >>>> to an early patch and will implement the proper support for this in all
> >>>> RMI SMCs.
> >>>>
> >>>> One aspect of SROs which is not yet well captured is that in some
> >>>> circumstances the Linux kernel will need to call an SRO call in a
> >>>> context where memory allocation is restricted (e.g. because a spinlock
> >>>> is held). In this case the intention is that the SRO will be cancelled,
> >>>> the spinlock dropped so the memory allocation can be completed, and then
> >>>> the SRO restarted (obviously after rechecking the state that the
> >>>> spinlock was protecting). For this reason the code stores the memory
> >>>> allocations within a struct rmi_sro_state object - see the final patch
> >>>> for more details.
> >>>>
> >>>> This series is based on v7.0-rc1. It is also available as a git
> >>>> repository:
> >>>>
> >>>> https://gitlab.arm.com/linux-arm/linux-cca cca-host/v13
> >>>>
> >>>>
> >>>
> >>> Hi Steven,
> >>>
> >>> I have a question regarding host kexec and kdump scenarios, and
> >>> whether there is any plan to make them work in this initial series.
> >>>
> >>> Intel TDX and AMD SEV-SNP both have a firmware shutdown command that
> >>> is invoked during the kexec or panic code paths to safely bypass
> >>> hardware memory protections and boot into the new kernel. As far as
> >>> I know, there is no similar global teardown command available for
> >>> the RMM.
> >>
> >> Correct, the RMM specification as it stands doesn't provide a mechanism
> >> for the host to do this. The host would have to identify all the realm
> >> guests in the system: specifically the address of the RDs (Realm
> >> Descriptors) and RECs (Realm Execution Contexts). It needs this to tear
> >> down the guests and be able to undelegate the memory.
> >>
> >> It's an interesting point and I'll raise the idea of a "firmware
> >> shutdown command" to make this more possible.
> >>
> >>> What is the roadmap for supporting both general kexec and
> >>> more specifically kdump (panic) scenarios with CCA?
> >>
> >> I don't have a roadmap I'm afraid for these. kexec in theory would be
> >> possible with KVM gracefully terminating all realms. For kdump/panic
> >> that sort of graceful shutdown isn't really appropriate (or likely to
> >> succeed).
> >>
> >
> > Thanks Steven for the clarification.
> >
> > For us, kdump is highly critical as it is our primary diagnostic tool
> > for host crashes. Without it, monitoring and debugging at fleet scale
> > would become unmanageable.
> >
> > To confirm my understanding of the current architecture: if a host
> > panics while no Realms are actively running (and therefore no pages
> > are currently in the delegated state), the standard kdump extraction
> > should work perfectly fine without any modifications, correct?
>
> This may not be true. We could have pages donated to RMM for GPT,
> Tracking etc. So, unless Linux keeps track of them, it may be
> unsafe for a crash kernel to access them.
>
> >
> > Regarding the KVM tracking structures (RDs, RECs, RTTs, etc.) when VMs
> > are running, perhaps we could use `vmcoreinfo` to export the physical
> > addresses of these delegated pages. This would allow tools like
>
> Thinking of this, do we really need to ? We could access the pages from
> "vmcore" read and handle the GPFs for such accesses and give out 0s
> for the Granules. Anyways, we can't get access to the data on those
> pages that are still in Realm PAS.
>

I like the idea of handling the GPFs directly during vmcore reads for
kdump case. That's much simpler\cleaner solution.

> > `makedumpfile` to explicitly filter them out. I assume these pages must
> > remain hardware-locked while the VMs are active.
>
>
>
> >
> > Long-term, having an architectural shutdown command - similar to the
> > TDH.SYS.DISABLE command in Intel TDX - would be incredibly useful. It
> > would allow the kdump kernel to safely bypass these hardware security
> > checks, especially when extracting host-side KVM state.
>

> For kexec, may be we could do this. Alternatively we could try to
> reclaim everything back, (GPTs, Tracking) before kexec-reboot.
>

Agreed. Reclaiming all delegated memory prior to the kexec reboot
makes perfect sense.

> >
> > As for the protected realm memory, I assume that is an easier problem.
> > We naturally want to exclude guest pages from a host dump regardless
> > of whether they are Realm pages or not. However, accidental touches
> > are still fatal.
> >
> >> There is also some RMM configuration which cannot be repeated (see
> >> RMI_RMM_CONFIG_SET) - which implies that the kexec kernel must be
> >> similar to the first kernel (i.e. same page size).
>
> That is true, the page sizes must match. RMM spec is updated to probe
> the state of the RMM and detect if it can do the CONFIG_SET
>
> Suzuki
>
> >>
> >> Thanks,
> >> Steve
>

^ permalink raw reply

* Re: [PATCH v13 00/48] arm64: Support for Arm CCA in KVM
From: Suzuki K Poulose @ 2026-04-16 11:04 UTC (permalink / raw)
  To: Alper Gun, Steven Price
  Cc: kvm, kvmarm, Catalin Marinas, Marc Zyngier, Will Deacon,
	James Morse, Oliver Upton, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
	Shanker Donthineni, Aneesh Kumar K . V, Emi Kisanuki,
	Vishal Annapurve
In-Reply-To: <CABpDEumAVf02uOS5Bj07EDyuU=z9FV-iocQU1j7gFM5z0BeV_w@mail.gmail.com>

On 16/04/2026 00:27, Alper Gun wrote:
> On Wed, Apr 15, 2026 at 4:01 AM Steven Price <steven.price@arm.com> wrote:
>>
>> On 14/04/2026 22:40, Alper Gun wrote:
>>> On Wed, Mar 18, 2026 at 8:54 AM Steven Price <steven.price@arm.com> wrote:
>>>>
>>>> This series adds support for running protected VMs using KVM under the
>>>> Arm Confidential Compute Architecture (CCA).
>>>>
>>>> New major version number! This now targets RMM v2.0-bet0[1]. And unlike
>>>> for Linux this represents a significant change.
>>>>
>>>> RMM v2.0 brings with it the ability to configure the RMM to have the
>>>> same page size as the host (so no more RMM_PAGE_SIZE and dealing with
>>>> granules being different from host pages). It also introduces range
>>>> based APIs for many operations which should be more efficient and
>>>> simplifies the code in places.
>>>>
>>>> The handling of the GIC has changed, so the system registers are used to
>>>> pass the GIC state rather than memory. This means fewer changes to the
>>>> KVM code as it looks much like a normal VM in this respect.
>>>>
>>>> And of course the new uAPI introduced in the previous v12 posting is
>>>> retained so that also remains simplified compared to earlier postings.
>>>>
>>>> The RMM support for v2.0 is still early and so this series includes a
>>>> few hacks to ease the integration. Of note are that there are some RMM
>>>> v1.0 SMCs added to paper over areas where the RMM implementation isn't
>>>> quite ready for v2.0, and "SROs" (see below) are deferred to the final
>>>> patch in the series.
>>>>
>>>> The PMU in RMM v2.0 requires more handling on the RMM-side (and
>>>> therefore simplifies the implementation on Linux), but this isn't quite
>>>> ready yet. The Linux side is implemented (but untested).
>>>>
>>>> PSCI still requires the VMM to provide the "target" REC for operations
>>>> that affect another vCPU. This is likely to change in a future version
>>>> of the specification. There's also a desire to force PSCI to be handled
>>>> in the VMM for realm guests - this isn't implemented yet as I'm waiting
>>>> for the dust to settle on the RMM interface first.
>>>>
>>>> Stateful RMI Operations
>>>> -----------------------
>>>>
>>>> The RMM v2.0 spec brings a new concept of Stateful RMI Operations (SROs)
>>>> which allow the RMM to complete an operation over several SMC calls and
>>>> requesting/returning memory to the host. This has the benefit of
>>>> allowing interrupts to be handled in the middle of an operation (by
>>>> returning to the host to handle the interrupt without completing the
>>>> operation) and enables the RMM to dynamically allocate memory for
>>>> internal tracking purposes. One example of this is RMI_REC_CREATE no
>>>> longer needs "auxiliary granules" provided upfront but can request the
>>>> memory needed during the RMI_REC_CREATE operation.
>>>>
>>>> There are a fairly large number of operations that are defined as SROs
>>>> in the specification, but current both Linux and RMM only have support
>>>> for RMI_REC_CREATE and RMI_REC_DESTROY. There a number of TODOs/FIXMEs
>>>> in the code where support is missing.
>>>>
>>>> Given the early stage support for this, the SRO handling is all confined
>>>> to the final patch. This patch can be dropped to return to a pre-SRO
>>>> state (albeit a mixture of RMM v1.0 and v2.0 APIs) for testing purposes.
>>>>
>>>> A future posting will reorder the series to move the generic SRO support
>>>> to an early patch and will implement the proper support for this in all
>>>> RMI SMCs.
>>>>
>>>> One aspect of SROs which is not yet well captured is that in some
>>>> circumstances the Linux kernel will need to call an SRO call in a
>>>> context where memory allocation is restricted (e.g. because a spinlock
>>>> is held). In this case the intention is that the SRO will be cancelled,
>>>> the spinlock dropped so the memory allocation can be completed, and then
>>>> the SRO restarted (obviously after rechecking the state that the
>>>> spinlock was protecting). For this reason the code stores the memory
>>>> allocations within a struct rmi_sro_state object - see the final patch
>>>> for more details.
>>>>
>>>> This series is based on v7.0-rc1. It is also available as a git
>>>> repository:
>>>>
>>>> https://gitlab.arm.com/linux-arm/linux-cca cca-host/v13
>>>>
>>>>
>>>
>>> Hi Steven,
>>>
>>> I have a question regarding host kexec and kdump scenarios, and
>>> whether there is any plan to make them work in this initial series.
>>>
>>> Intel TDX and AMD SEV-SNP both have a firmware shutdown command that
>>> is invoked during the kexec or panic code paths to safely bypass
>>> hardware memory protections and boot into the new kernel. As far as
>>> I know, there is no similar global teardown command available for
>>> the RMM.
>>
>> Correct, the RMM specification as it stands doesn't provide a mechanism
>> for the host to do this. The host would have to identify all the realm
>> guests in the system: specifically the address of the RDs (Realm
>> Descriptors) and RECs (Realm Execution Contexts). It needs this to tear
>> down the guests and be able to undelegate the memory.
>>
>> It's an interesting point and I'll raise the idea of a "firmware
>> shutdown command" to make this more possible.
>>
>>> What is the roadmap for supporting both general kexec and
>>> more specifically kdump (panic) scenarios with CCA?
>>
>> I don't have a roadmap I'm afraid for these. kexec in theory would be
>> possible with KVM gracefully terminating all realms. For kdump/panic
>> that sort of graceful shutdown isn't really appropriate (or likely to
>> succeed).
>>
> 
> Thanks Steven for the clarification.
> 
> For us, kdump is highly critical as it is our primary diagnostic tool
> for host crashes. Without it, monitoring and debugging at fleet scale
> would become unmanageable.
> 
> To confirm my understanding of the current architecture: if a host
> panics while no Realms are actively running (and therefore no pages
> are currently in the delegated state), the standard kdump extraction
> should work perfectly fine without any modifications, correct?

This may not be true. We could have pages donated to RMM for GPT,
Tracking etc. So, unless Linux keeps track of them, it may be
unsafe for a crash kernel to access them.

> 
> Regarding the KVM tracking structures (RDs, RECs, RTTs, etc.) when VMs
> are running, perhaps we could use `vmcoreinfo` to export the physical
> addresses of these delegated pages. This would allow tools like

Thinking of this, do we really need to ? We could access the pages from
"vmcore" read and handle the GPFs for such accesses and give out 0s
for the Granules. Anyways, we can't get access to the data on those
pages that are still in Realm PAS.

> `makedumpfile` to explicitly filter them out. I assume these pages must
> remain hardware-locked while the VMs are active.



> 
> Long-term, having an architectural shutdown command - similar to the
> TDH.SYS.DISABLE command in Intel TDX - would be incredibly useful. It
> would allow the kdump kernel to safely bypass these hardware security
> checks, especially when extracting host-side KVM state.

For kexec, may be we could do this. Alternatively we could try to
reclaim everything back, (GPTs, Tracking) before kexec-reboot.

> 
> As for the protected realm memory, I assume that is an easier problem.
> We naturally want to exclude guest pages from a host dump regardless
> of whether they are Realm pages or not. However, accidental touches
> are still fatal.
> 
>> There is also some RMM configuration which cannot be repeated (see
>> RMI_RMM_CONFIG_SET) - which implies that the kexec kernel must be
>> similar to the first kernel (i.e. same page size).

That is true, the page sizes must match. RMM spec is updated to probe 
the state of the RMM and detect if it can do the CONFIG_SET

Suzuki

>>
>> Thanks,
>> Steve


^ permalink raw reply

* Re: [PATCH v2 03/31] x86/virt/tdx: Add tdx_page_array helpers for new TDX Module objects
From: Xu Yilun @ 2026-04-16  9:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-coco, linux-pci, x86, chao.gao, dave.jiang, baolu.lu,
	yilun.xu, zhenzhong.duan, kvm, rick.p.edgecombe, dave.hansen, kas,
	xiaoyao.li, vishal.l.verma, linux-kernel
In-Reply-To: <69db0916ba63c_fe08310034@djbw-dev.notmuch>

> > A main difference is HPA_ARRAY_T requires singleton mode when
> > containing just 1 functional page (page0). In this mode the root page is
> > not needed and the HPA field of the raw value directly points to the
> > page0. But in this patch, root page is always allocated for user
> > friendly kAPIs.
> 
> I think this undersells the fact that "singleton mode" is a premature
> optimization that requires complication to take advantage of the benefit
> (sometimes save a single page allocation). The Linux implementation
> forfeits that small benefit for the larger gain of cleaner kAPIs.
> 
> > Another small difference is HPA_LIST_INFO contains a 'first entry' field
> > which could be filled by TDX Module. This simplifies host by providing
> > the same structure when re-invoke the interrupted SEAMCALL. No need for
> > host to touch this field.
> > 
> > Typical usages of the tdx_page_array:
> > 
> > 1. Add control pages:
> >  - struct tdx_page_array *array = tdx_page_array_create(nr_pages);
> >  - seamcall(TDH_XXX_CREATE, array, ...);
> > 
> > 2. Release control pages:
> >  - seamcall(TDX_XXX_DELETE, array, &nr_released, &released_hpa);
> 
> It is simply a bug if TDH_XXX_DELETE does not return every resource
> passed to TDH_XXX_CREATE. The only "leak" case to worry about is that
> TDH_XXX_DELETE fails. In that case it should be "fatal" (TDX_BUG_ON,
> system can keep hobbling along, but panic_on_warn() would not be
> unreasonable). If TDH_XXX_DELETE fails it indicates some catastrophic
> misunderstanding between Linux and the TDX Module.
> 
> So the seamcall in this case has no need for @nr_released or
> @released_hpa, those should already be known to the kernel.
> 
> What is missing is an architectural guarantee that TDH_XXX_DELETE
> success == "all resources you arranged at TDH_XXX_CREATE time are free".
> I would hope that is already the case and AUX_PAGE_PA is only an

Yes, it is already the case.

> unfortunate distraction. If it can ever be the case that CREATE and
> DELETE are asymmetric on success then that needs to be corrected and
> Linux will wait for a future module that can make that guarantee.

Agree. Thanks for this summary as well as the singleton-mode one.

> 
> I think that cleans up a bulk of the logic here to abandon caring that
> the module tries to remind us what we are releasing.

Yes, I'll delete all these "released page matching" logic.

^ permalink raw reply

* Re: [PATCH v2 05/31] x86/virt/tdx: Extend tdx_page_array to support IOMMU_MT
From: Xu Yilun @ 2026-04-16  5:07 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Gao, Chao, Xu, Yilun, x86@kernel.org, kas@kernel.org,
	baolu.lu@linux.intel.com, dave.hansen@linux.intel.com,
	Li, Xiaoyao, Williams, Dan J, Jiang, Dave,
	linux-pci@vger.kernel.org, linux-coco@lists.linux.dev,
	linux-kernel@vger.kernel.org, Duan, Zhenzhong, Verma, Vishal L,
	kvm@vger.kernel.org
In-Reply-To: <ad4Pj0cqlprvNUSj@yilunxu-OptiPlex-7050>

On Tue, Apr 14, 2026 at 05:57:35PM +0800, Xu Yilun wrote:
> On Wed, Apr 01, 2026 at 12:17:45AM +0000, Edgecombe, Rick P wrote:
> > On Tue, 2026-03-31 at 22:19 +0800, Xu Yilun wrote:
> > > > Consider the amount of tricks that are needed to coax the tdx_page_array to
> > > > populate the handoff page as needed. It adds 2 pages here, then subtracts
> > > > them
> > > > later in the callback. Then tweaks the pa in tdx_page_array_populate() to
> > > > add
> > > > the length...
> > > 
> > > mm.. The tricky part is the specific memory requirement/allocation, the
> > > common part is the pa list contained in a root page. Maybe we only model
> > > the later, let the specific user does the memory allocation. Is that
> > > closer to your "break concepts apart" idea?
> > 
> > I haven't wrapped my head around this enough to suggest anything is definitely
> > the right approach.
> > 
> > But yes, the idea would be that the allocation of the list of pages to give to
> > the TDX module would be a separate allocation and set of management functions.
> > And the the allocation of the pages that are used to communicate the list of
> > pages (and in this case other args) with the module would be another set. So
> > each type of TDX module arg page format (IOMMU_MT, etc) would be separable, but
> > share the page list allocation part only. It looks like Nikolay was probing
> > along the same path. Not sure if he had the same solution in mind.
> > 
> > So for this:
> > 1. Allocate a list or array of pages using a generic method.
> > 2. Allocate these two IOMMU special pages.
> > 3. Allocate memory needed for the seamcall (root pages)
> > 
> > Hand all three to the wrapper and have it shove them all through in the special
> > way it prefers.
> 
> I'm drafting some changes and make the tdx_page_array look like:
> 
>   struct tdx_page_array {
> 	/* public: */
> 	unsigned int nr_pages;
> 	struct page **pages;
> 
> 	/* private: */
> 	u64 *root;
> 	bool flush_on_free;
>   };
> 
>   - I removed the page allocations for tdx_page_array kAPIs. Now the
>     caller needs to allocate the struct page **pages and the page list,
>     then create the tdx_page_array by providing these pages.
> 
>     struct tdx_page_array *tdx_page_array_create(struct page **pages,
> 						 unsigned int nr_pages)
> 
>     This also means tdx_page_array doesn't have to hold more than 512
>     pages anymore, it now an exact descriptor for the TDX Module's
>     definitions rather than a manager. It's a chunk of the required
>     memory when we need more than 512 pages. This eliminates the need
>     for 'offset' field and the slide window operations so make the
>     helpers simpler.
> 
>   - I still keep the generic struct tdx_page_array to represent all
>     kinds of object types (HPA_ARRAY_T, HPA_LIST_INFO, IOMMU_MT), and
>     provide the tdx_page_array to SEAMCALL helpers as parameters. I
>     think this structure is generally good enough to represent a list of
>     pages, keeps type safety compared to a list of HPAs.
> 
>   - I still record both the page list (struct page **pages) and the HPA
>     list (in u64 *root). struct page **pages works with kernel memory
>     management (e.g. vmap) well while the populated root works with
>     SEAMCALLs.
> 
>   - I'm not introducing more structures each for an object type, like 
>     struct hpa_array, struct hpa_list_info, struct iommu_metadata. They
>     are conceptually the same thing. The iommu_mt supports multi-order
>     pages, hpa_array_t & hpa_list_info don't support. But their bit
>     definitions don't conflict. I can use the same piece of code to
>     populate their root page content.
> 
>   - Add a flush_on_free field to mark if a cache write back is needed on
>     tdx_page_array_free(), then we don't need 2 free APIs.
> 
> I want to clean up my code, then post an incremental patch for preview.

Hi, I end up made the following changes on top of this series:

-------8<--------

 arch/x86/include/asm/tdx.h            |  32 +-
 arch/x86/virt/vmx/tdx/tdx.c           | 561 ++++++++------------------
 drivers/virt/coco/tdx-host/tdx-host.c | 179 ++++++--
 3 files changed, 316 insertions(+), 456 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 7bdd66acda5b..31d1101a4f45 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -143,15 +143,12 @@ void tdx_quirk_reset_page(struct page *page);
 
 /**
  * struct tdx_page_array - Represents a list of pages for TDX Module access
- * @nr_pages: Total number of data pages in the collection
- * @pages: Array of data page pointers containing all the data
+ * @nr_pages: Number of data pages in the collection (up to 512)
+ * @pages: Array of data page pointers
  *
- * @offset: Internal: The starting index in @pages, positions the currently
- *	    populated page window in @root.
- * @nents: Internal: Number of valid HPAs for the page window in @root
- * @root: Internal: A single 4KB page holding the 8-byte HPAs of the page
- *	  window. The page window max size is constrained by the root page,
- *	  which is 512 HPAs.
+ * @root: Internal: A single 4KB page holding the 8-byte HPAs of the @pages
+ * @flush_on_free: Internal: whether to flush cache when @pages are to be
+ *		   freed.
  *
  * This structure abstracts several TDX Module defined object types, e.g.,
  * HPA_ARRAY_T and HPA_LIST_INFO. Typically they all use a "root page" as the
@@ -165,20 +162,13 @@ struct tdx_page_array {
 	struct page **pages;
 
 	/* private: */
-	unsigned int offset;
-	unsigned int nents;
 	u64 *root;
+	bool flush_on_free;
 };
 
 void tdx_page_array_free(struct tdx_page_array *array);
-DEFINE_FREE(tdx_page_array_free, struct tdx_page_array *, if (_T) tdx_page_array_free(_T))
-struct tdx_page_array *tdx_page_array_create(unsigned int nr_pages);
-void tdx_page_array_ctrl_leak(struct tdx_page_array *array);
-int tdx_page_array_ctrl_release(struct tdx_page_array *array,
-				unsigned int nr_released,
-				u64 released_hpa);
-struct tdx_page_array *
-tdx_page_array_create_iommu_mt(unsigned int iq_order, unsigned int nr_mt_pages);
+struct tdx_page_array *tdx_page_array_create(struct page **pages,
+					     unsigned int nr_pages);
 
 struct tdx_td {
 	/* TD root structure: */
@@ -248,8 +238,7 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page);
 u64 tdh_iommu_setup(u64 vtbar, struct tdx_page_array *iommu_mt, u64 *iommu_id);
 u64 tdh_iommu_clear(u64 iommu_id, struct tdx_page_array *iommu_mt);
 u64 tdh_spdm_create(u64 func_id, struct tdx_page_array *spdm_mt, u64 *spdm_id);
-u64 tdh_spdm_delete(u64 spdm_id, struct tdx_page_array *spdm_mt,
-		    unsigned int *nr_released, u64 *released_hpa);
+u64 tdh_spdm_delete(u64 spdm_id, struct tdx_page_array *spdm_mt);
 u64 tdh_exec_spdm_connect(u64 spdm_id, struct page *spdm_conf,
 			  struct page *spdm_rsp, struct page *spdm_req,
 			  struct tdx_page_array *spdm_out,
@@ -269,8 +258,7 @@ u64 tdh_ide_stream_create(u64 stream_info, u64 spdm_id,
 			  u64 *rp_ide_id);
 u64 tdh_ide_stream_block(u64 spdm_id, u64 stream_id);
 u64 tdh_ide_stream_delete(u64 spdm_id, u64 stream_id,
-			  struct tdx_page_array *stream_mt,
-			  unsigned int *nr_released, u64 *released_hpa);
+			  struct tdx_page_array *stream_mt);
 u64 tdh_ide_stream_km(u64 spdm_id, u64 stream_id, u64 operation,
 		      struct page *spdm_rsp, struct page *spdm_req,
 		      u64 *spdm_req_len);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 72d836b25bd6..04f47c5eb2a5 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -262,21 +262,27 @@ static int build_tdx_memlist(struct list_head *tmb_list)
 #define TDX_PAGE_ARRAY_MAX_NENTS	(PAGE_SIZE / sizeof(u64))
 
 static int tdx_page_array_populate(struct tdx_page_array *array,
-				   unsigned int offset)
+				   struct page **pages, unsigned int nr_pages)
 {
-	u64 *entries;
+	u64 *entries = array->root;
 	int i;
 
-	if (offset >= array->nr_pages)
-		return 0;
+	if (!pages || !nr_pages || nr_pages > TDX_PAGE_ARRAY_MAX_NENTS)
+		return -EINVAL;
+
+	/*
+	 * When re-populating, the old pages are no longer tracked.
+	 * Theoretically they require cache flushing similar to
+	 * tdx_page_array_free(). Since there is no use case for this yet,
+	 * just warn to prompt future improvement.
+	 */
+	WARN_ON_ONCE(array->pages && array->flush_on_free);
 
-	array->offset = offset;
-	array->nents = umin(array->nr_pages - offset,
-			    TDX_PAGE_ARRAY_MAX_NENTS);
+	for (i = 0; i < nr_pages; i++) {
+		struct page *page = pages[i];
 
-	entries = array->root;
-	for (i = 0; i < array->nents; i++) {
-		struct page *page = array->pages[offset + i];
+		if (!page)
+			return -EINVAL;
 
 		entries[i] = page_to_phys(page);
 
@@ -285,359 +291,96 @@ static int tdx_page_array_populate(struct tdx_page_array *array,
 			entries[i] |= compound_nr(page);
 	}
 
-	return array->nents;
-}
-
-static void tdx_free_pages_bulk(unsigned int nr_pages, struct page **pages)
-{
-	int i;
-
-	for (i = 0; i < nr_pages; i++)
-		put_page(pages[i]);
-}
-
-static int tdx_alloc_pages_bulk(unsigned int nr_pages, struct page **pages,
-				void *data)
-{
-	unsigned int filled, done = 0;
-
-	do {
-		filled = alloc_pages_bulk(GFP_KERNEL, nr_pages - done,
-					  pages + done);
-		if (!filled) {
-			tdx_free_pages_bulk(done, pages);
-			return -ENOMEM;
-		}
-
-		done += filled;
-	} while (done != nr_pages);
+	array->pages = pages;
+	array->nr_pages = nr_pages;
 
 	return 0;
 }
 
 /**
- * tdx_page_array_free() - Free all memory for a tdx_page_array
+ * tdx_page_array_free() - Free the tdx_page_array
  * @array: The tdx_page_array to be freed.
  *
- * Free all associated pages and the container itself.
+ * Free this page array decriptor. Note the associated pages are not
+ * freed, their lifecycles are not controlled by tdx_page_array.
+ *
+ * TDX Module may consume page array for private accessing, flush cache before
+ * this tracking decriptor is freed, to avoid private cache write back
+ * damages these pages which may further be returned to kernel and reused.
+ * Specific SEAMCALL helpers should indicate the flushing by setting this flag.
  */
 void tdx_page_array_free(struct tdx_page_array *array)
 {
 	if (!array)
 		return;
 
-	tdx_free_pages_bulk(array->nr_pages, array->pages);
-	kfree(array->pages);
-	kfree(array->root);
-	kfree(array);
-}
-EXPORT_SYMBOL_GPL(tdx_page_array_free);
-
-static struct tdx_page_array *
-tdx_page_array_alloc(unsigned int nr_pages,
-		     int (*alloc_fn)(unsigned int nr_pages,
-				     struct page **pages, void *data),
-		     void *data)
-{
-	struct tdx_page_array *array = NULL;
-	struct page **pages = NULL;
-	u64 *root = NULL;
-	int ret;
-
-	if (!nr_pages)
-		return NULL;
-
-	array = kzalloc_obj(*array);
-	if (!array)
-		goto out_free;
-
-	root = kzalloc(PAGE_SIZE, GFP_KERNEL);
-	if (!root)
-		goto out_free;
-
-	pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
-	if (!pages)
-		goto out_free;
-
-	ret = alloc_fn(nr_pages, pages, data);
-	if (ret)
-		goto out_free;
+	if (array->flush_on_free) {
+		int i;
 
-	array->nr_pages = nr_pages;
-	array->pages = pages;
-	array->root = root;
+		for (i = 0; i < array->nr_pages; i++) {
+			u64 r;
 
-	return array;
+			r = tdh_phymem_page_wbinvd_hkid(tdx_global_keyid,
+							array->pages[i]);
+			WARN_ON_ONCE(r);
+		}
+	}
 
-out_free:
-	kfree(pages);
-	kfree(root);
+	kfree(array->root);
 	kfree(array);
-
-	return NULL;
 }
+EXPORT_SYMBOL_GPL(tdx_page_array_free);
 
-/**
- * tdx_page_array_create() - Create a small tdx_page_array (up to 512 pages)
- * @nr_pages: Number of pages to allocate (must be <= 512).
- *
- * Allocate and populate a tdx_page_array in a single step. This is intended
- * for small collections that fit within a single root page. The allocated
- * pages are all order-0 pages. This is the most common use case for a list of
- * TDX control pages.
- *
- * If more pages are required, use tdx_page_array_alloc() and
- * tdx_page_array_populate() to build tdx_page_array chunk by chunk.
- *
- * Return: Fully populated tdx_page_array or NULL on failure.
- */
-struct tdx_page_array *tdx_page_array_create(unsigned int nr_pages)
+static struct tdx_page_array *tdx_page_array_alloc(void)
 {
 	struct tdx_page_array *array;
-	int populated;
 
-	if (nr_pages > TDX_PAGE_ARRAY_MAX_NENTS)
-		return NULL;
-
-	array = tdx_page_array_alloc(nr_pages, tdx_alloc_pages_bulk, NULL);
+	array = kzalloc_obj(*array);
 	if (!array)
 		return NULL;
 
-	populated = tdx_page_array_populate(array, 0);
-	if (populated != nr_pages)
-		goto out_free;
-
-	return array;
-
-out_free:
-	tdx_page_array_free(array);
-	return NULL;
-}
-EXPORT_SYMBOL_GPL(tdx_page_array_create);
-
-/**
- * tdx_page_array_ctrl_leak() - Leak data pages and free the container
- * @array: The tdx_page_array to be leaked.
- *
- * Call this function when failed to reclaim the control pages. Free the root
- * page and the holding structures, but orphan the data pages, to prevent the
- * host from re-allocating and accessing memory that the hardware may still
- * consider private.
- */
-void tdx_page_array_ctrl_leak(struct tdx_page_array *array)
-{
-	if (!array)
-		return;
-
-	kfree(array->pages);
-	kfree(array->root);
-	kfree(array);
-}
-EXPORT_SYMBOL_GPL(tdx_page_array_ctrl_leak);
-
-static bool tdx_page_array_validate_release(struct tdx_page_array *array,
-					    unsigned int offset,
-					    unsigned int nr_released,
-					    u64 released_hpa)
-{
-	unsigned int nents;
-
-	if (offset >= array->nr_pages)
-		return false;
-
-	nents = umin(array->nr_pages - offset, TDX_PAGE_ARRAY_MAX_NENTS);
-
-	if (nents != nr_released) {
-		pr_err("%s nr_released [%d] doesn't match page array nents [%d]\n",
-		       __func__, nr_released, nents);
-		return false;
-	}
-
-	/*
-	 * Unfortunately TDX has multiple page allocation protocols, check the
-	 * "singleton" case required for HPA_ARRAY_T.
-	 */
-	if (page_to_phys(array->pages[0]) == released_hpa &&
-	    array->nr_pages == 1)
-		return true;
-
-	/* Then check the "non-singleton" case */
-	if (virt_to_phys(array->root) == released_hpa) {
-		u64 *entries = array->root;
-		int i;
-
-		for (i = 0; i < nents; i++) {
-			struct page *page = array->pages[offset + i];
-			u64 val = page_to_phys(page);
-
-			/* Now only for iommu_mt */
-			if (compound_nr(page) > 1)
-				val |= compound_nr(page);
-
-			if (val != entries[i]) {
-				pr_err("%s entry[%d] [0x%llx] doesn't match page hpa [0x%llx]\n",
-				       __func__, i, entries[i], val);
-				return false;
-			}
-		}
-
-		return true;
+	array->root = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!array->root) {
+		kfree(array);
+		return NULL;
 	}
 
-	pr_err("%s failed to validate, released_hpa [0x%llx], root page hpa [0x%llx], page0 hpa [%#llx], number pages %u\n",
-	       __func__, released_hpa, virt_to_phys(array->root),
-	       page_to_phys(array->pages[0]), array->nr_pages);
-
-	return false;
+	return array;
 }
 
 /**
- * tdx_page_array_ctrl_release() - Verify and release TDX control pages
- * @array: The tdx_page_array used to originally create control pages.
- * @nr_released: Number of HPAs the TDX Module reported as released.
- * @released_hpa: The HPA list the TDX Module reported as released.
+ * tdx_page_array_create() - Create a populated tdx_page_array (up to 512 pages)
+ * @pages: Pointer to struct page array for tdx_page_array populating
+ * @nr_pages: Size of @pages array.
  *
- * TDX Module can at most release 512 control pages, so this function only
- * accepts small tdx_page_array (up to 512 pages), usually created by
- * tdx_page_array_create().
+ * Create a populated tdx_page_array in a single step. This is intended for
+ * small collections that fit within a single root page. This is the most
+ * common use case for a list of TDX control pages.
  *
- * Return: 0 on success, -errno on page release protocol error.
- */
-int tdx_page_array_ctrl_release(struct tdx_page_array *array,
-				unsigned int nr_released,
-				u64 released_hpa)
-{
-	int i;
-
-	/*
-	 * The only case where ->nr_pages is allowed to be >
-	 * TDX_PAGE_ARRAY_MAX_NENTS is a case where those pages are never
-	 * expected to be released by this function.
-	 */
-	if (WARN_ON(array->nr_pages > TDX_PAGE_ARRAY_MAX_NENTS))
-		return -EINVAL;
-
-	if (WARN_ONCE(!tdx_page_array_validate_release(array, 0, nr_released,
-						       released_hpa),
-		      "page release protocol error, consider reboot and replace TDX Module.\n"))
-		return -EFAULT;
-
-	for (i = 0; i < array->nr_pages; i++) {
-		u64 r;
-
-		r = tdh_phymem_page_wbinvd_hkid(tdx_global_keyid,
-						array->pages[i]);
-		if (WARN_ON(r))
-			return -EFAULT;
-	}
-
-	tdx_page_array_free(array);
-	return 0;
-}
-EXPORT_SYMBOL_GPL(tdx_page_array_ctrl_release);
-
-static int tdx_alloc_pages_contig(unsigned int nr_pages, struct page **pages,
-				  void *data)
-{
-	struct page *page;
-	int i;
-
-	page = alloc_contig_pages(nr_pages, GFP_KERNEL, numa_mem_id(),
-				  &node_online_map);
-	if (!page)
-		return -ENOMEM;
-
-	for (i = 0; i < nr_pages; i++)
-		pages[i] = page + i;
-
-	return 0;
-}
-
-/*
- * For holding large number of contiguous pages, usually larger than
- * TDX_PAGE_ARRAY_MAX_NENTS (512).
- *
- * Similar to tdx_page_array_alloc(), after allocating with this
- * function, call tdx_page_array_populate() to populate the tdx_page_array.
- */
-static struct tdx_page_array *
-tdx_page_array_alloc_contig(unsigned int nr_pages)
-{
-	return tdx_page_array_alloc(nr_pages, tdx_alloc_pages_contig, NULL);
-}
-
-static int tdx_alloc_pages_iommu_mt(unsigned int nr_pages, struct page **pages,
-				    void *data)
-{
-	unsigned int iq_order = (unsigned int)(long)data;
-	struct folio *t_iq, *t_ctxiq;
-	int ret;
-
-	/* TODO: folio_alloc_node() is preferred, but need numa info */
-	t_iq = folio_alloc(GFP_KERNEL | __GFP_ZERO, iq_order);
-	if (!t_iq)
-		return -ENOMEM;
-
-	t_ctxiq = folio_alloc(GFP_KERNEL | __GFP_ZERO, iq_order);
-	if (!t_ctxiq) {
-		ret = -ENOMEM;
-		goto out_t_iq;
-	}
-
-	ret = tdx_alloc_pages_bulk(nr_pages - 2, pages + 2, NULL);
-	if (ret)
-		goto out_t_ctxiq;
-
-	pages[0] = folio_page(t_iq, 0);
-	pages[1] = folio_page(t_ctxiq, 0);
-
-	return 0;
-
-out_t_ctxiq:
-	folio_put(t_ctxiq);
-out_t_iq:
-	folio_put(t_iq);
-
-	return ret;
-}
-
-/**
- * tdx_page_array_create_iommu_mt() - Create a page array for IOMMU Memory Tables
- * @iq_order: The allocation order for the IOMMU Invalidation Queue.
- * @nr_mt_pages: Number of additional order-0 pages for the MT.
- *
- * Allocate and populate a specialized tdx_page_array for IOMMU_MT structures.
- * The resulting array consists of two multi-order folios (at index 0 and 1)
- * followed by the requested number of order-0 pages.
+ * If more pages are required, use tdx_page_array_alloc() and
+ * tdx_page_array_populate() to build tdx_page_array chunk by chunk.
  *
- * Return: Fully populated tdx_page_array or NULL on failure.
+ * Return: Populated tdx_page_array or NULL on failure.
  */
 struct tdx_page_array *
-tdx_page_array_create_iommu_mt(unsigned int iq_order, unsigned int nr_mt_pages)
+tdx_page_array_create(struct page **pages, unsigned int nr_pages)
 {
-	unsigned int nr_pages = nr_mt_pages + 2;
 	struct tdx_page_array *array;
-	int populated;
-
-	if (nr_pages > TDX_PAGE_ARRAY_MAX_NENTS)
-		return NULL;
+	int ret;
 
-	array = tdx_page_array_alloc(nr_pages, tdx_alloc_pages_iommu_mt,
-				     (void *)(long)iq_order);
+	array = tdx_page_array_alloc();
 	if (!array)
 		return NULL;
 
-	populated = tdx_page_array_populate(array, 0);
-	if (populated != nr_pages)
-		goto out_free;
+	ret = tdx_page_array_populate(array, pages, nr_pages);
+	if (ret) {
+		tdx_page_array_free(array);
+		return NULL;
+	}
 
 	return array;
-
-out_free:
-	tdx_page_array_free(array);
-	return NULL;
 }
-EXPORT_SYMBOL_GPL(tdx_page_array_create_iommu_mt);
+EXPORT_SYMBOL_GPL(tdx_page_array_create);
 
 #define HPA_LIST_INFO_FIRST_ENTRY	GENMASK_U64(11, 3)
 #define HPA_LIST_INFO_PFN		GENMASK_U64(51, 12)
@@ -648,7 +391,7 @@ static u64 hpa_list_info_assign_raw(struct tdx_page_array *array)
 	return FIELD_PREP(HPA_LIST_INFO_FIRST_ENTRY, 0) |
 	       FIELD_PREP(HPA_LIST_INFO_PFN,
 			  PFN_DOWN(virt_to_phys(array->root))) |
-	       FIELD_PREP(HPA_LIST_INFO_LAST_ENTRY, array->nents - 1);
+	       FIELD_PREP(HPA_LIST_INFO_LAST_ENTRY, array->nr_pages - 1);
 }
 
 #define HPA_ARRAY_T_PFN		GENMASK_U64(51, 12)
@@ -658,18 +401,18 @@ static u64 hpa_array_t_assign_raw(struct tdx_page_array *array)
 {
 	unsigned long pfn;
 
-	if (array->nents == 1)
-		pfn = page_to_pfn(array->pages[array->offset]);
+	if (array->nr_pages == 1)
+		pfn = page_to_pfn(array->pages[0]);
 	else
 		pfn = PFN_DOWN(virt_to_phys(array->root));
 
 	return FIELD_PREP(HPA_ARRAY_T_PFN, pfn) |
-	       FIELD_PREP(HPA_ARRAY_T_SIZE, array->nents - 1);
+	       FIELD_PREP(HPA_ARRAY_T_SIZE, array->nr_pages - 1);
 }
 
 static u64 hpa_array_t_release_raw(struct tdx_page_array *array)
 {
-	if (array->nents == 1)
+	if (array->nr_pages == 1)
 		return 0;
 
 	return virt_to_phys(array->root);
@@ -1515,8 +1258,8 @@ static void tdx_clflush_page(struct page *page)
 
 static void tdx_clflush_page_array(struct tdx_page_array *array)
 {
-	for (int i = 0; i < array->nents; i++)
-		tdx_clflush_page(array->pages[array->offset + i]);
+	for (int i = 0; i < array->nr_pages; i++)
+		tdx_clflush_page(array->pages[i]);
 }
 
 /* Initialize the TDX Module Extensions then Extension-SEAMCALLs can be used */
@@ -1536,14 +1279,14 @@ static int tdx_ext_init(void)
 	return 0;
 }
 
-static int tdx_ext_mem_add(struct tdx_page_array *ext_mem)
+static int tdx_ext_mem_add(struct tdx_page_array *mem)
 {
 	struct tdx_module_args args = {
-		.rcx = hpa_list_info_assign_raw(ext_mem),
+		.rcx = hpa_list_info_assign_raw(mem),
 	};
 	u64 r;
 
-	tdx_clflush_page_array(ext_mem);
+	tdx_clflush_page_array(mem);
 
 	do {
 		r = seamcall_ret(TDH_EXT_MEM_ADD, &args);
@@ -1556,33 +1299,86 @@ static int tdx_ext_mem_add(struct tdx_page_array *ext_mem)
 	return 0;
 }
 
-static int tdx_ext_mem_setup(struct tdx_page_array *ext_mem)
+struct tdx_ext_mem {
+	struct page **pages;
+	unsigned int nr_pages;
+	struct tdx_page_array *chunk;
+};
+
+static void tdx_ext_mem_remove(struct tdx_ext_mem *ext_mem)
 {
-	unsigned int populated, offset = 0;
-	int ret;
+	int i;
 
-	/*
-	 * tdx_page_array's root page can hold 512 HPAs at most. We have ~50MB
-	 * memory to add, re-populate the array and add pages bulk by bulk.
-	 */
-	while (1) {
-		populated = tdx_page_array_populate(ext_mem, offset);
-		if (!populated)
-			break;
+	tdx_page_array_free(ext_mem->chunk);
+
+	for (i = 0; i < ext_mem->nr_pages; i++)
+		__free_page(ext_mem->pages[i]);
+
+	kfree(ext_mem->pages);
+}
+
+static int tdx_ext_mem_setup(unsigned int nr_pages,
+			     struct tdx_ext_mem *ext_mem)
+{
+	struct tdx_page_array *chunk;
+	struct page **pages;
+	struct page *page;
+	int i, ret;
 
-		ret = tdx_ext_mem_add(ext_mem);
+	pages = kmalloc_objs(*pages, nr_pages);
+	if (!pages)
+		return -ENOMEM;
+
+	page = alloc_contig_pages(nr_pages, GFP_KERNEL, numa_mem_id(),
+				  &node_online_map);
+	if (!page) {
+		ret = -ENOMEM;
+		goto out_free_pages;
+	}
+
+	for (i = 0; i < nr_pages; i++)
+		pages[i] = page + i;
+
+	chunk = tdx_page_array_alloc();
+	if (!chunk) {
+		ret = -ENOMEM;
+		goto out_free_contig;
+	}
+
+	for (i = 0; i < nr_pages;) {
+		int nents = min(nr_pages - i, TDX_PAGE_ARRAY_MAX_NENTS);
+
+		ret = tdx_page_array_populate(chunk, pages + i, nents);
 		if (ret)
-			return ret;
+			goto out_free_chunk;
+
+		ret = tdx_ext_mem_add(chunk);
+		if (ret)
+			goto out_free_chunk;
 
-		offset += populated;
+		i += nents;
 	}
 
+	ext_mem->nr_pages = nr_pages;
+	ext_mem->pages = pages;
+	ext_mem->chunk = chunk;
+
 	return 0;
+
+out_free_chunk:
+	tdx_page_array_free(chunk);
+out_free_contig:
+	for (i = 0; i < nr_pages; i++)
+		__free_page(pages[i]);
+out_free_pages:
+	kfree(pages);
+
+	return ret;
 }
 
 static int init_tdx_ext(void)
 {
-	struct tdx_page_array *ext_mem = NULL;
+	struct tdx_ext_mem ext_mem;
 	unsigned int nr_pages;
 	int ret;
 
@@ -1600,48 +1396,48 @@ static int init_tdx_ext(void)
 	if (boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
 		return -ENXIO;
 
+	/* No feature requires TDX Module Extensions. */
+	if (!tdx_sysinfo.ext.ext_required)
+		return 0;
+
 	nr_pages = tdx_sysinfo.ext.memory_pool_required_pages;
 	/*
 	 * memory_pool_required_pages == 0 means no need to add more pages,
 	 * skip the memory setup.
 	 */
 	if (nr_pages) {
-		ext_mem = tdx_page_array_alloc_contig(nr_pages);
-		if (!ext_mem)
-			return -ENOMEM;
-
-		ret = tdx_ext_mem_setup(ext_mem);
+		ret = tdx_ext_mem_setup(nr_pages, &ext_mem);
 		if (ret)
-			goto out_ext_mem;
+			return ret;
 	}
 
+	ret = tdx_ext_init();
+	if (ret)
+		goto out_remove_ext_mem;
+
 	/*
-	 * ext_required == 0 means no need to call TDH.EXT.INIT, the Extensions
-	 * are already working.
+	 * Extensions memory is never reclaimed once assigned, stop tracking it
+	 * and free the tracking structures.
 	 */
-	if (tdx_sysinfo.ext.ext_required) {
-		ret = tdx_ext_init();
-		/*
-		 * Some pages may have been touched by the TDX module.
-		 * Flush cache before returning these pages to kernel.
-		 */
-		if (ret)
-			goto out_flush;
-	}
-
-	/* Extension memory is never reclaimed once assigned */
-	tdx_page_array_ctrl_leak(ext_mem);
+	tdx_page_array_free(ext_mem.chunk);
+	kfree(ext_mem.pages);
 
 	pr_info("%lu KB allocated for TDX Module Extensions\n",
 		nr_pages * PAGE_SIZE / 1024);
 
 	return 0;
 
-out_flush:
-	if (ext_mem)
+out_remove_ext_mem:
+	if (nr_pages) {
+		/*
+		 * TDH.EXT.MEM.ADD only collects required memory. TDX.EXT.INIT
+		 * does the actual initialization so if it fails some pages may
+		 * have been touched by the TDX module, flush cache before
+		 * returning these pages to kernel.
+		 */
 		wbinvd_on_all_cpus();
-out_ext_mem:
-	tdx_page_array_free(ext_mem);
+		tdx_ext_mem_remove(&ext_mem);
+	}
 
 	return ret;
 }
@@ -2497,6 +2293,7 @@ u64 tdh_iommu_setup(u64 vtbar, struct tdx_page_array *iommu_mt, u64 *iommu_id)
 	u64 r;
 
 	tdx_clflush_page_array(iommu_mt);
+	iommu_mt->flush_on_free = true;
 
 	r = seamcall_ret_ir_resched(TDH_IOMMU_SETUP, &args);
 
@@ -2525,6 +2322,7 @@ u64 tdh_spdm_create(u64 func_id, struct tdx_page_array *spdm_mt, u64 *spdm_id)
 	u64 r;
 
 	tdx_clflush_page_array(spdm_mt);
+	spdm_mt->flush_on_free = true;
 
 	r = seamcall_ret(TDH_SPDM_CREATE, &args);
 
@@ -2534,23 +2332,14 @@ u64 tdh_spdm_create(u64 func_id, struct tdx_page_array *spdm_mt, u64 *spdm_id)
 }
 EXPORT_SYMBOL_FOR_MODULES(tdh_spdm_create, "tdx-host");
 
-u64 tdh_spdm_delete(u64 spdm_id, struct tdx_page_array *spdm_mt,
-		    unsigned int *nr_released, u64 *released_hpa)
+u64 tdh_spdm_delete(u64 spdm_id, struct tdx_page_array *spdm_mt)
 {
 	struct tdx_module_args args = {
 		.rcx = spdm_id,
 		.rdx = hpa_array_t_release_raw(spdm_mt),
 	};
-	u64 r;
-
-	r = seamcall_ret(TDH_SPDM_DELETE, &args);
-	if (r != TDX_SUCCESS)
-		return r;
 
-	*nr_released = FIELD_GET(HPA_ARRAY_T_SIZE, args.rcx) + 1;
-	*released_hpa = FIELD_GET(HPA_ARRAY_T_PFN, args.rcx) << PAGE_SHIFT;
-
-	return r;
+	return seamcall_ret(TDH_SPDM_DELETE, &args);
 }
 EXPORT_SYMBOL_FOR_MODULES(tdh_spdm_delete, "tdx-host");
 
@@ -2639,6 +2428,7 @@ u64 tdh_ide_stream_create(u64 stream_info, u64 spdm_id,
 	u64 r;
 
 	tdx_clflush_page_array(stream_mt);
+	stream_mt->flush_on_free = true;
 
 	r = seamcall_saved_ret(TDH_IDE_STREAM_CREATE, &args);
 
@@ -2661,24 +2451,15 @@ u64 tdh_ide_stream_block(u64 spdm_id, u64 stream_id)
 EXPORT_SYMBOL_FOR_MODULES(tdh_ide_stream_block, "tdx-host");
 
 u64 tdh_ide_stream_delete(u64 spdm_id, u64 stream_id,
-			  struct tdx_page_array *stream_mt,
-			  unsigned int *nr_released, u64 *released_hpa)
+			  struct tdx_page_array *stream_mt)
 {
 	struct tdx_module_args args = {
 		.rcx = spdm_id,
 		.rdx = stream_id,
 		.r8 = hpa_array_t_release_raw(stream_mt),
 	};
-	u64 r;
 
-	r = seamcall_ret(TDH_IDE_STREAM_DELETE, &args);
-	if (r != TDX_SUCCESS)
-		return r;
-
-	*nr_released = FIELD_GET(HPA_ARRAY_T_SIZE, args.rcx) + 1;
-	*released_hpa = FIELD_GET(HPA_ARRAY_T_PFN, args.rcx) << PAGE_SHIFT;
-
-	return r;
+	return seamcall_ret(TDH_IDE_STREAM_DELETE, &args);
 }
 EXPORT_SYMBOL_FOR_MODULES(tdh_ide_stream_delete, "tdx-host");
 
diff --git a/drivers/virt/coco/tdx-host/tdx-host.c b/drivers/virt/coco/tdx-host/tdx-host.c
index 7800afb0893d..3a37e78dbc89 100644
--- a/drivers/virt/coco/tdx-host/tdx-host.c
+++ b/drivers/virt/coco/tdx-host/tdx-host.c
@@ -83,6 +83,119 @@ static struct tdx_tsm_link *to_tdx_tsm_link(struct pci_tsm *tsm)
 	return container_of(tsm, struct tdx_tsm_link, pci.base_tsm);
 }
 
+static void tdx_free_pages_bulk(unsigned int nr_pages, struct page **pages)
+{
+	int i;
+
+	for (i = 0; i < nr_pages; i++)
+		put_page(pages[i]);
+}
+
+static int tdx_alloc_pages_bulk(unsigned int nr_pages, struct page **pages)
+{
+	unsigned int filled, done = 0;
+
+	do {
+		filled = alloc_pages_bulk(GFP_KERNEL, nr_pages - done,
+					  pages + done);
+		if (!filled) {
+			tdx_free_pages_bulk(done, pages);
+			return -ENOMEM;
+		}
+
+		done += filled;
+	} while (done != nr_pages);
+
+	return 0;
+}
+
+static void tdx_page_array_mt_free(struct tdx_page_array *array_mt)
+{
+	struct page **pages = array_mt->pages;
+	unsigned int nr_pages = array_mt->nr_pages;
+
+	tdx_page_array_free(array_mt);
+	tdx_free_pages_bulk(nr_pages, pages);
+	kfree(pages);
+}
+
+DEFINE_FREE(tdx_page_array_mt_free, struct tdx_page_array *, if (_T) tdx_page_array_mt_free(_T))
+
+static struct tdx_page_array *tdx_page_array_mt_create(unsigned int nr_pages)
+{
+	struct tdx_page_array *array;
+	struct page **pages;
+	int ret;
+
+	pages = kzalloc_objs(*pages, nr_pages);
+	if (!pages)
+		return NULL;
+
+	ret = tdx_alloc_pages_bulk(nr_pages, pages);
+	if (ret)
+		goto out_free_pages;
+
+	array = tdx_page_array_create(pages, nr_pages);
+	if (!array)
+		goto out_free_bulk;
+
+	return array;
+
+out_free_bulk:
+	tdx_free_pages_bulk(nr_pages, pages);
+out_free_pages:
+	kfree(pages);
+
+	return NULL;
+}
+
+static struct tdx_page_array *
+tdx_page_array_iommu_mt_create(unsigned int iq_order, unsigned int nr_mt_pages)
+{
+	unsigned int nr_pages = nr_mt_pages + 2;
+	struct tdx_page_array *array;
+	struct folio *t_iq, *t_ctxiq;
+	struct page **pages;
+	int ret;
+
+	pages = kzalloc_objs(*pages, nr_pages);
+	if (!pages)
+		return NULL;
+
+	/* TODO: folio_alloc_node() is preferred, but need numa info */
+	t_iq = folio_alloc(GFP_KERNEL | __GFP_ZERO, iq_order);
+	if (!t_iq)
+		goto out_free_pages;
+
+	t_ctxiq = folio_alloc(GFP_KERNEL | __GFP_ZERO, iq_order);
+	if (!t_ctxiq)
+		goto out_free_t_iq;
+
+	pages[0] = folio_page(t_iq, 0);
+	pages[1] = folio_page(t_ctxiq, 0);
+
+	ret = tdx_alloc_pages_bulk(nr_mt_pages, pages + 2);
+	if (ret)
+		goto out_free_t_ctxiq;
+
+	array = tdx_page_array_create(pages, nr_pages);
+	if (!array)
+		goto out_free_bulk;
+
+	return array;
+
+out_free_bulk:
+	tdx_free_pages_bulk(nr_mt_pages, pages + 2);
+out_free_t_ctxiq:
+	folio_put(t_ctxiq);
+out_free_t_iq:
+	folio_put(t_iq);
+out_free_pages:
+	kfree(pages);
+
+	return NULL;
+}
+
 #define PCI_DOE_DATA_OBJECT_HEADER_1_OFFSET	0
 #define PCI_DOE_DATA_OBJECT_HEADER_2_OFFSET	4
 #define PCI_DOE_DATA_OBJECT_HEADER_SIZE		8
@@ -275,8 +388,8 @@ static struct tdx_tsm_link *tdx_spdm_create(struct tdx_tsm_link *tlink)
 	unsigned int nr_pages = tdx_sysinfo->connect.spdm_mt_page_count;
 	u64 spdm_id, r;
 
-	struct tdx_page_array *spdm_mt __free(tdx_page_array_free) =
-		tdx_page_array_create(nr_pages);
+	struct tdx_page_array *spdm_mt __free(tdx_page_array_mt_free) =
+		tdx_page_array_mt_create(nr_pages);
 	if (!spdm_mt)
 		return ERR_PTR(-ENOMEM);
 
@@ -292,24 +405,18 @@ static struct tdx_tsm_link *tdx_spdm_create(struct tdx_tsm_link *tlink)
 static void tdx_spdm_delete(struct tdx_tsm_link *tlink)
 {
 	struct pci_dev *pdev = tlink->pci.base_tsm.pdev;
-	unsigned int nr_released;
-	u64 released_hpa, r;
+	u64 r;
 
-	r = tdh_spdm_delete(tlink->spdm_id, tlink->spdm_mt, &nr_released, &released_hpa);
+	r = tdh_spdm_delete(tlink->spdm_id, tlink->spdm_mt);
 	if (r) {
+		/* leak the metadata pages */
 		pci_err(pdev, "fail to delete spdm 0x%llx\n", r);
-		goto leak;
+		return;
 	}
 
-	if (tdx_page_array_ctrl_release(tlink->spdm_mt, nr_released, released_hpa)) {
-		pci_err(pdev, "fail to release spdm_mt pages\n");
-		goto leak;
-	}
+	tdx_page_array_mt_free(tlink->spdm_mt);
 
 	return;
-
-leak:
-	tdx_page_array_ctrl_leak(tlink->spdm_mt);
 }
 
 DEFINE_FREE(tdx_spdm_delete, struct tdx_tsm_link *, if (!IS_ERR_OR_NULL(_T)) tdx_spdm_delete(_T))
@@ -323,8 +430,8 @@ static struct tdx_tsm_link *tdx_spdm_session_setup(struct tdx_tsm_link *tlink)
 	if (IS_ERR(tlink_create))
 		return tlink_create;
 
-	struct tdx_page_array *dev_info __free(tdx_page_array_free) =
-		tdx_page_array_create(nr_pages);
+	struct tdx_page_array *dev_info __free(tdx_page_array_mt_free) =
+		tdx_page_array_mt_create(nr_pages);
 	if (!dev_info)
 		return ERR_PTR(-ENOMEM);
 
@@ -424,8 +531,8 @@ static struct tdx_tsm_link *tdx_ide_stream_create(struct tdx_tsm_link *tlink,
 	struct pci_ide_regs regs;
 	u64 r;
 
-	struct tdx_page_array *stream_mt __free(tdx_page_array_free) =
-		tdx_page_array_create(nr_pages);
+	struct tdx_page_array *stream_mt __free(tdx_page_array_mt_free) =
+		tdx_page_array_mt_create(nr_pages);
 	if (!stream_mt)
 		return ERR_PTR(-ENOMEM);
 
@@ -472,33 +579,23 @@ static struct tdx_tsm_link *tdx_ide_stream_create(struct tdx_tsm_link *tlink,
 static void tdx_ide_stream_delete(struct tdx_tsm_link *tlink)
 {
 	struct pci_dev *pdev = tlink->pci.base_tsm.pdev;
-	unsigned int nr_released;
-	u64 released_hpa, r;
+	u64 r;
 
 	r = tdh_ide_stream_block(tlink->spdm_id, tlink->stream_id);
 	if (r) {
+		/* leak the metadata pages */
 		pci_err(pdev, "ide stream block fail 0x%llx\n", r);
-		goto leak;
+		return;
 	}
 
 	r = tdh_ide_stream_delete(tlink->spdm_id, tlink->stream_id,
-				  tlink->stream_mt, &nr_released,
-				  &released_hpa);
+				  tlink->stream_mt);
 	if (r) {
 		pci_err(pdev, "ide stream delete fail 0x%llx\n", r);
-		goto leak;
-	}
-
-	if (tdx_page_array_ctrl_release(tlink->stream_mt, nr_released,
-					released_hpa)) {
-		pci_err(pdev, "fail to release IDE stream_mt pages\n");
-		goto leak;
+		return;
 	}
 
-	return;
-
-leak:
-	tdx_page_array_ctrl_leak(tlink->stream_mt);
+	tdx_page_array_mt_free(tlink->stream_mt);
 }
 
 DEFINE_FREE(tdx_ide_stream_delete, struct tdx_tsm_link *,
@@ -815,20 +912,14 @@ static void tdx_iommu_clear(u64 iommu_id, struct tdx_page_array *iommu_mt)
 
 	r = tdh_iommu_clear(iommu_id, iommu_mt);
 	if (r) {
+		/* leak the metadata pages */
 		pr_err("fail to clear tdx iommu 0x%llx\n", r);
-		goto leak;
+		return;
 	}
 
-	if (tdx_page_array_ctrl_release(iommu_mt, iommu_mt->nr_pages,
-					virt_to_phys(iommu_mt->root))) {
-		pr_err("fail to release iommu_mt pages\n");
-		goto leak;
-	}
+	tdx_page_array_mt_free(iommu_mt);
 
 	return;
-
-leak:
-	tdx_page_array_ctrl_leak(iommu_mt);
 }
 
 static int tdx_iommu_enable_one(struct dmar_drhd_unit *drhd)
@@ -837,8 +928,8 @@ static int tdx_iommu_enable_one(struct dmar_drhd_unit *drhd)
 	u64 r, iommu_id;
 	int ret;
 
-	struct tdx_page_array *iommu_mt __free(tdx_page_array_free) =
-		tdx_page_array_create_iommu_mt(1, nr_pages);
+	struct tdx_page_array *iommu_mt __free(tdx_page_array_mt_free) =
+		tdx_page_array_iommu_mt_create(1, nr_pages);
 	if (!iommu_mt)
 		return -ENOMEM;


^ permalink raw reply related

* Re: [PATCH v13 00/48] arm64: Support for Arm CCA in KVM
From: Alper Gun @ 2026-04-15 23:27 UTC (permalink / raw)
  To: Steven Price
  Cc: kvm, kvmarm, Catalin Marinas, Marc Zyngier, Will Deacon,
	James Morse, Oliver Upton, Suzuki K Poulose, Zenghui Yu,
	linux-arm-kernel, linux-kernel, Joey Gouly, Alexandru Elisei,
	Christoffer Dall, Fuad Tabba, linux-coco, Ganapatrao Kulkarni,
	Gavin Shan, Shanker Donthineni, Aneesh Kumar K . V, Emi Kisanuki,
	Vishal Annapurve
In-Reply-To: <d661a24c-6619-4734-96bd-075d3416e809@arm.com>

On Wed, Apr 15, 2026 at 4:01 AM Steven Price <steven.price@arm.com> wrote:
>
> On 14/04/2026 22:40, Alper Gun wrote:
> > On Wed, Mar 18, 2026 at 8:54 AM Steven Price <steven.price@arm.com> wrote:
> >>
> >> This series adds support for running protected VMs using KVM under the
> >> Arm Confidential Compute Architecture (CCA).
> >>
> >> New major version number! This now targets RMM v2.0-bet0[1]. And unlike
> >> for Linux this represents a significant change.
> >>
> >> RMM v2.0 brings with it the ability to configure the RMM to have the
> >> same page size as the host (so no more RMM_PAGE_SIZE and dealing with
> >> granules being different from host pages). It also introduces range
> >> based APIs for many operations which should be more efficient and
> >> simplifies the code in places.
> >>
> >> The handling of the GIC has changed, so the system registers are used to
> >> pass the GIC state rather than memory. This means fewer changes to the
> >> KVM code as it looks much like a normal VM in this respect.
> >>
> >> And of course the new uAPI introduced in the previous v12 posting is
> >> retained so that also remains simplified compared to earlier postings.
> >>
> >> The RMM support for v2.0 is still early and so this series includes a
> >> few hacks to ease the integration. Of note are that there are some RMM
> >> v1.0 SMCs added to paper over areas where the RMM implementation isn't
> >> quite ready for v2.0, and "SROs" (see below) are deferred to the final
> >> patch in the series.
> >>
> >> The PMU in RMM v2.0 requires more handling on the RMM-side (and
> >> therefore simplifies the implementation on Linux), but this isn't quite
> >> ready yet. The Linux side is implemented (but untested).
> >>
> >> PSCI still requires the VMM to provide the "target" REC for operations
> >> that affect another vCPU. This is likely to change in a future version
> >> of the specification. There's also a desire to force PSCI to be handled
> >> in the VMM for realm guests - this isn't implemented yet as I'm waiting
> >> for the dust to settle on the RMM interface first.
> >>
> >> Stateful RMI Operations
> >> -----------------------
> >>
> >> The RMM v2.0 spec brings a new concept of Stateful RMI Operations (SROs)
> >> which allow the RMM to complete an operation over several SMC calls and
> >> requesting/returning memory to the host. This has the benefit of
> >> allowing interrupts to be handled in the middle of an operation (by
> >> returning to the host to handle the interrupt without completing the
> >> operation) and enables the RMM to dynamically allocate memory for
> >> internal tracking purposes. One example of this is RMI_REC_CREATE no
> >> longer needs "auxiliary granules" provided upfront but can request the
> >> memory needed during the RMI_REC_CREATE operation.
> >>
> >> There are a fairly large number of operations that are defined as SROs
> >> in the specification, but current both Linux and RMM only have support
> >> for RMI_REC_CREATE and RMI_REC_DESTROY. There a number of TODOs/FIXMEs
> >> in the code where support is missing.
> >>
> >> Given the early stage support for this, the SRO handling is all confined
> >> to the final patch. This patch can be dropped to return to a pre-SRO
> >> state (albeit a mixture of RMM v1.0 and v2.0 APIs) for testing purposes.
> >>
> >> A future posting will reorder the series to move the generic SRO support
> >> to an early patch and will implement the proper support for this in all
> >> RMI SMCs.
> >>
> >> One aspect of SROs which is not yet well captured is that in some
> >> circumstances the Linux kernel will need to call an SRO call in a
> >> context where memory allocation is restricted (e.g. because a spinlock
> >> is held). In this case the intention is that the SRO will be cancelled,
> >> the spinlock dropped so the memory allocation can be completed, and then
> >> the SRO restarted (obviously after rechecking the state that the
> >> spinlock was protecting). For this reason the code stores the memory
> >> allocations within a struct rmi_sro_state object - see the final patch
> >> for more details.
> >>
> >> This series is based on v7.0-rc1. It is also available as a git
> >> repository:
> >>
> >> https://gitlab.arm.com/linux-arm/linux-cca cca-host/v13
> >>
> >>
> >
> > Hi Steven,
> >
> > I have a question regarding host kexec and kdump scenarios, and
> > whether there is any plan to make them work in this initial series.
> >
> > Intel TDX and AMD SEV-SNP both have a firmware shutdown command that
> > is invoked during the kexec or panic code paths to safely bypass
> > hardware memory protections and boot into the new kernel. As far as
> > I know, there is no similar global teardown command available for
> > the RMM.
>
> Correct, the RMM specification as it stands doesn't provide a mechanism
> for the host to do this. The host would have to identify all the realm
> guests in the system: specifically the address of the RDs (Realm
> Descriptors) and RECs (Realm Execution Contexts). It needs this to tear
> down the guests and be able to undelegate the memory.
>
> It's an interesting point and I'll raise the idea of a "firmware
> shutdown command" to make this more possible.
>
> > What is the roadmap for supporting both general kexec and
> > more specifically kdump (panic) scenarios with CCA?
>
> I don't have a roadmap I'm afraid for these. kexec in theory would be
> possible with KVM gracefully terminating all realms. For kdump/panic
> that sort of graceful shutdown isn't really appropriate (or likely to
> succeed).
>

Thanks Steven for the clarification.

For us, kdump is highly critical as it is our primary diagnostic tool
for host crashes. Without it, monitoring and debugging at fleet scale
would become unmanageable.

To confirm my understanding of the current architecture: if a host
panics while no Realms are actively running (and therefore no pages
are currently in the delegated state), the standard kdump extraction
should work perfectly fine without any modifications, correct?

Regarding the KVM tracking structures (RDs, RECs, RTTs, etc.) when VMs
are running, perhaps we could use `vmcoreinfo` to export the physical
addresses of these delegated pages. This would allow tools like
`makedumpfile` to explicitly filter them out. I assume these pages must
remain hardware-locked while the VMs are active.

Long-term, having an architectural shutdown command - similar to the
TDH.SYS.DISABLE command in Intel TDX - would be incredibly useful. It
would allow the kdump kernel to safely bypass these hardware security
checks, especially when extracting host-side KVM state.

As for the protected realm memory, I assume that is an easier problem.
We naturally want to exclude guest pages from a host dump regardless
of whether they are Realm pages or not. However, accidental touches
are still fatal.

> There is also some RMM configuration which cannot be repeated (see
> RMI_RMM_CONFIG_SET) - which implies that the kexec kernel must be
> similar to the first kernel (i.e. same page size).
>
> Thanks,
> Steve

^ permalink raw reply

* Re: [RFC PATCH v5 00/45] TDX: Dynamic PAMT + S-EPT Hugepage
From: Edgecombe, Rick P @ 2026-04-15 21:48 UTC (permalink / raw)
  To: seanjc@google.com, x86@kernel.org, dave.hansen@linux.intel.com,
	kas@kernel.org, bp@alien8.de, mingo@redhat.com,
	pbonzini@redhat.com, tglx@kernel.org
  Cc: Huang, Kai, Tng, Ackerley, Shahar, Sagi, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, Li, Xiaoyao,
	kvm@vger.kernel.org, linux-coco@lists.linux.dev, Yamahata, Isaku,
	binbin.wu@linux.intel.com
In-Reply-To: <20260129011517.3545883-1-seanjc@google.com>

On Wed, 2026-01-28 at 17:14 -0800, Sean Christopherson wrote:
> As for landing these series, I think the fastest overall approach would be
> to land patches 1-4 asap (tangentially related cleanups and fixes), agree
> on a design (hopefully), and then hand control back to Rick and Yan to polish
> their respective series for merge.

Sean, were you still planning to do this? I see you're, I think, asking for Dave
to pull patch 1 here:
https://lore.kernel.org/kvm/ac_bMKD7YGhMwUCf@google.com/

But actually he acked the version in this series, so another option is you could
just take patches 1-4?

Please let me know if you are ok with that plan. We are growing stacks ahead of
DPAMT and the other splintered pre-req work and looking to reduce it. It could
be nice if we could take all the pre-req work through the same tree for that
reason, and most of it is on the KVM side. Is it reasonable?

^ permalink raw reply

* Re: SVSM Development Call April 15, 2026
From: Jörg Rödel @ 2026-04-15 20:39 UTC (permalink / raw)
  To: coconut-svsm, linux-coco
In-Reply-To: <ek6soyyk2yocghq6eb23unh4cfyzmhf5zf4ez5a3bo6fr2obi3@fwxk4wu6w4vn>

Meeting notes are ready:

	https://github.com/coconut-svsm/governance/pull/104

-Joerg

^ permalink raw reply

* bi-weekly guest_memfd upstream call on 2026-04-16
From: David Hildenbrand (Arm) @ 2026-04-15 19:18 UTC (permalink / raw)
  To: KVM, linux-mm@kvack.org, linux-coco@lists.linux.dev

Hi all,

Our next guest_memfd upstream call is scheduled for tomorrow, Thursday,
2026-04-016 at 8:00 - 9:00am (GMT-07:00) Pacific Time - Vancouver.

We'll be using the following Google meet:
http://meet.google.com/wxp-wtju-jzw

The meeting notes can be found at [1], where we also link recordings and
collect current guest_memfd upstream proposals. If you want an google
calendar invitation that also covers all future meetings, just write me
or Ackerley a mail.

In this meeting, Mike wants to talk about IOMMU considerations for
in-place conversion/hugetlb with pass-through devices. And I'm sure
other things will pop up :)

To put something to discuss onto the agenda, reply to this mail or add
them to the meeting notes.

[1]
https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?usp=sharing

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v7 08/22] x86/virt/seamldr: Allocate and populate a module update request
From: Edgecombe, Rick P @ 2026-04-15 17:19 UTC (permalink / raw)
  To: Gao, Chao
  Cc: kvm@vger.kernel.org, linux-coco@lists.linux.dev, Li, Xiaoyao,
	Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	yilun.xu@linux.intel.com, bp@alien8.de,
	linux-kernel@vger.kernel.org, seanjc@google.com, Weiny, Ira,
	pbonzini@redhat.com, Chatre, Reinette, Verma, Vishal L,
	kas@kernel.org, nik.borisov@suse.com, hpa@zytor.com,
	mingo@redhat.com, Annapurve, Vishal, sagis@google.com,
	tony.lindgren@linux.intel.com, binbin.wu@linux.intel.com,
	tglx@kernel.org, paulmck@kernel.org, x86@kernel.org,
	dan.j.williams@intel.com
In-Reply-To: <ad9wuJPdluqJrcBM@intel.com>

On Wed, 2026-04-15 at 19:04 +0800, Chao Gao wrote:
> > I wasn't suggesting to merge them. I was suggesting to have them each do
> > a dedicated thing.
> 
> Ok. See the code snippet below. Most checks are in init_seamldr_params(),
> except the limit checks on module_size and sig_size. Moving them there
> would require duplicating the module_size/sig_size calculations. So, I
> just keep the checks next to the calculations.

Like this. All checks are moved to get_and_check_blob(). alloc_seamldr_params()
loses all checks because it's about passing things to the TDX module. Also
alloc_seamldr_params() switches to handling page counts only and becomes
smaller.

Below is not even compile tested. Please verify that everything needed
is still checked if you take from it:

/*
 * Blob fields are processed by the kernel and the payloads
 * are passed to the TDX module. Do normal user input type
 * check for any fields that don't get passed to the TDX module.
 */
static void *get_and_check_blob(const u8 *data, u32 size)
{
	const struct tdx_blob *blob = (const void *)data;

	/*
	 * Ensure the size is valid otherwise reading any field from the
	 * blob may overflow.
	 */
	if (size <= sizeof(struct tdx_blob) || size <= blob->offset_of_module)
		return ERR_PTR(-EINVAL);

	if (blob->version != TDX_BLOB_VERSION_1)
		return ERR_PTR(-EINVAL);

	if (blob->reserved0 || memchr_inv(blob->reserved1, 0, sizeof(blob->reserved1)))
		return ERR_PTR(-EINVAL);

	/*
	 * Don't care about user passing the wrong file, but protect
	 * kernel ABI by preventing accepting garbage.
	 */
	if (memcmp(blob->signature, "TDX-BLOB", 8))
		return ERR_PTR(-EINVAL);

	return blob;
}

static struct seamldr_params *alloc_seamldr_params(const struct tdx_blob *blob, unsigned int blob_size)
{
	struct seamldr_params *params;
	int module_pg_cnt, sig_pg_cnt;
	const void *sig, *module;
	const u8 *ptr;
	int i;

	params = (struct seamldr_params *)get_zeroed_page(GFP_KERNEL);
	if (!params)
		return ERR_PTR(-ENOMEM);

	/*
	 * Split the blob into a sigstruct and a module. Assume all
	 * size/offsets are within bounds of blob_size due to prior checks.
	 */
	sig		= blob->data;
	sig_pg_cnt	= (blob->offset_of_module - sizeof(struct tdx_blob)) >> PAGE_SHIFT;
	module		= blob->data + blob->offset_of_module;
	module_pg_cnt	= (blob_size - blob->offset_of_module) >> PAGE_SHIFT;

	/*
	 * Only use version 1 when required (sigstruct > 4KB) for backward
	 * compatibility with P-SEAMLDR that lacks version 1 support.
	 */
	params->version = sig_pg_cnt > 1;
	params->scenario = SEAMLDR_SCENARIO_UPDATE;

	ptr = sig;
	for (i = 0; i < MIN(sig_pg_cnt, SEAMLDR_MAX_NR_SIG_4KB_PAGES); i++) {
		/*
		 * @sig is 4KB-aligned, but that does not imply PAGE_SIZE
		 * alignment when PAGE_SIZE != SZ_4K. Always include the
		 * in-page offset.
		 */
		params->sigstruct_pa[i] = vmalloc_to_pfn(ptr) << PAGE_SHIFT;
		ptr += SZ_4K;
	}

	params->num_module_pages = MIN(module_pg_cnt, SEAMLDR_MAX_NR_MODULE_4KB_PAGES);
	ptr = module;
	for (i = 0; i < params->num_module_pages; i++) {
		params->mod_pages_pa_list[i] = vmalloc_to_pfn(ptr) << PAGE_SHIFT;
		ptr += SZ_4K;
	}

	return params;
}

static struct seamldr_params *init_seamldr_params(const u8 *data, u32 size)
{
	const struct tdx_blob *blob;

	blob = get_and_check_blob(data, size);
	if (IS_ERR(blob))
		return ERR_CAST(blob);

	return alloc_seamldr_params(blob, size);
}

^ permalink raw reply

* Re: [PATCH v2 1/6] KVM: x86: Add dedicated storage for guest RIP
From: Sean Christopherson @ 2026-04-15 16:06 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Chang S. Bae, Paolo Bonzini, Kiryl Shutsemau, kvm, x86,
	linux-coco, linux-kernel
In-Reply-To: <b6bc6b99-c8d5-4c2d-9ca8-42d36d60d197@intel.com>

On Wed, Apr 15, 2026, Xiaoyao Li wrote:
> On 4/14/2026 11:37 PM, Sean Christopherson wrote:
> > On Tue, Apr 14, 2026, Chang S. Bae wrote:
> > > On 4/14/2026 5:31 AM, Xiaoyao Li wrote:
> > > > Even leave RIP in regs[], what is the problem by just allocating the
> > > > index 16-31 to R16-R31 and making RIP the index 32?
> > > 
> > > But why?
> > > 
> > > Even though the array isn't explicitly labeled as GPRs, that's effectively
> > > how it's being used, and RIP isn't part of that set.
> > > 
> > > I don't think there is any benefit of leaving it in regs[].
> > 
> > +1.  Chang's earlier argument that RIP isn't a proper GPR swayed me over, e.g. RIP
> > doesn't have an architectural index.
> > 
> > Keeping RIP in regs[] saves one line of code in arch/x86/include/asm/kvm_host.h,
> > at the cost of making the code less readable (IMO) and incorrectly suggesting that
> > RIP can be accessed like other regs[].
> 
> I'm not trying to object this patch. Instead, I'm trying to understand the
> justification of the change.
> 
> So I would expected an updated changelog with above justifications
> incorporated.

Noted, I'll expand the changelog for the next version.

^ permalink raw reply

* Re: [PATCH v2 5/6] KVM: x86: Track available/dirty register masks as "unsigned long" values
From: Sean Christopherson @ 2026-04-15 16:05 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Kai Huang, Chang Seok Bae, kvm@vger.kernel.org,
	pbonzini@redhat.com, kas@kernel.org, linux-kernel@vger.kernel.org,
	linux-coco@lists.linux.dev, x86@kernel.org
In-Reply-To: <e11d37d0-a143-44e7-b76d-bf2d5b386d16@intel.com>

On Wed, Apr 15, 2026, Xiaoyao Li wrote:
> On 4/15/2026 7:29 PM, Xiaoyao Li wrote:
> > > Note that updating only the "available" masks is wrong, as TDX needs
> > > to marshall written registers back to their correct location.
> > 
> > In what case it needs to marshall written registers back? Can you
> > elaborate?
> 
> Sorry that I asked a silly question. I mistakenly thought TDVMCALL for
> instructions (like, CPUID, RDMSR, WRMSR) use the same output register as the
> x86 instructions. After checking the TDX GHCI, obviously I'm wrong.
> 
> But I don't understand what's is wrong regarding "updating only the
> "available" masks". Or what is missing for "marshall written registers back
> to their correct location"?

KVM would need to also update the "dirty" mask, so that TDX knows it (a) needs
to propagate state back to the GHCI, and (b) that all registers it expects to be
dirty are indeed marked dirty.

^ permalink raw reply

* Re: [PATCH v7 10/22] x86/virt/seamldr: Abort updates if errors occurred midway
From: Edgecombe, Rick P @ 2026-04-15 14:56 UTC (permalink / raw)
  To: Gao, Chao
  Cc: kvm@vger.kernel.org, linux-coco@lists.linux.dev, Li, Xiaoyao,
	Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	yilun.xu@linux.intel.com, bp@alien8.de,
	linux-kernel@vger.kernel.org, seanjc@google.com, Weiny, Ira,
	pbonzini@redhat.com, Chatre, Reinette, Verma, Vishal L,
	kas@kernel.org, nik.borisov@suse.com, hpa@zytor.com,
	mingo@redhat.com, Annapurve, Vishal, sagis@google.com,
	tony.lindgren@linux.intel.com, binbin.wu@linux.intel.com,
	tglx@kernel.org, paulmck@kernel.org, x86@kernel.org,
	dan.j.williams@intel.com
In-Reply-To: <ad7/ICVYG+Z63b4i@intel.com>

On Wed, 2026-04-15 at 10:59 +0800, Chao Gao wrote:
> > If it's non-required for "turning the lights on" it seems aligned
> > with Dave's
> > suggestion you highlighted to drop it from the series.
> 
> It is required for the shutdown case. Without early abort, a shutdown
> failure (recoverable) proceeds to seamldr.install, escalating it to a
> destructive failure. The race handling between TD build and updates
> depends on shutdown failures being non-destructive so users can
> retry.

Then what is the level of support of this series intends to achieve?
There is a lot of discussion about how the admin needs to make sure to
not screw things up. What is the level of screw up that is handled by
the kernel? We should spell this out in the cover letter or something. 

^ permalink raw reply

* Re: [PATCH v2 5/6] KVM: x86: Track available/dirty register masks as "unsigned long" values
From: Xiaoyao Li @ 2026-04-15 12:20 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kai Huang, Chang Seok Bae, kvm@vger.kernel.org,
	pbonzini@redhat.com, kas@kernel.org, linux-kernel@vger.kernel.org,
	linux-coco@lists.linux.dev, x86@kernel.org
In-Reply-To: <c40b5057-4df5-44ba-93c9-5c46696de765@intel.com>

On 4/15/2026 7:29 PM, Xiaoyao Li wrote:
>> Note that updating only the "available" masks is wrong, as TDX needs 
>> to marshall
>> written registers back to their correct location.
> 
> In what case it needs to marshall written registers back? Can you 
> elaborate?

Sorry that I asked a silly question. I mistakenly thought TDVMCALL for 
instructions (like, CPUID, RDMSR, WRMSR) use the same output register as 
the x86 instructions. After checking the TDX GHCI, obviously I'm wrong.

But I don't understand what's is wrong regarding "updating only the 
"available" masks". Or what is missing for "marshall written registers 
back to their correct location"?

^ permalink raw reply

* Re: [PATCH v2 5/6] KVM: x86: Track available/dirty register masks as "unsigned long" values
From: Xiaoyao Li @ 2026-04-15 11:29 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kai Huang, Chang Seok Bae, kvm@vger.kernel.org,
	pbonzini@redhat.com, kas@kernel.org, linux-kernel@vger.kernel.org,
	linux-coco@lists.linux.dev, x86@kernel.org
In-Reply-To: <ad5JeT7UuiQI9Tqo@google.com>

On 4/14/2026 10:04 PM, Sean Christopherson wrote:
> On Tue, Apr 14, 2026, Xiaoyao Li wrote:
>> On 4/14/2026 7:03 AM, Huang, Kai wrote:
>>>> Because VMX and SVM make all GRPs available immediately, except
>>>> for RSP, KVM ignores avail/dirty for GPRs.  I.e. "fixing" TDX will just shift the
>>>> "bugs" elsewhere.
>>> Just want to understand:
>>>
>>> I thought the fix could be we simply remove the wrong GPRs from the list.
>>> Not sure how fixing TDX will shift bugs elsewhere?
>>
>> I'm curious too.
> 
> What I'm saying is that, _if_ there are bugs where KVM uses a register that isn't
> available, then modifying TDX's list won't actually fix anything (without more
> changes), it will just change which code is technically buggy (hence all the quotes
> above).
> 
>>>> More importantly, because the TDX-Module*requires* RCX (the GPR that holds the
>>>> mask of registers to expose to the VMM) to be hidden on TDVMCALL, KVM*can't*
>>>> do any kind of meaningful "available" tracking.
>>>>
>>> Hmm I think RCX conveys the shared GPRs and VMM can read.  Per "Table 5.323:
>>> TDH.VP.ENTER Output Operands Format #5 Definition: On TDCALL(TDG.VP.VMCALL)
>>> Following a TD Entry":
>>>
>>>     RCX   ...
>>> 	Bit(s) Name         Description
>>>
>>> 	31:0   PARAMS_MASK  Value as passed into TDCALL(TDG.VP.VMCALL) by
>>> 			    the guest TD: indicates which part of the guest
>>> 			    TD GPR and XMM state is passed as-is to the
>>> VMM
>>> 			    and back. For details, see the description of
>>> 			    TDG.VP.VMCALL in 5.5.26.
>>>
>>> I think the problem is, as said previously, currently KVM TDX code uses
>>> KVM's existing infrastructure to emulate MSR, KVM hypercall etc,  but
>>> TDVMCALL has a different ABI, thus there's a mismatch here.
>>
>> I once had patch for it internally.
>>
>> It adds back the available check for GPRs when accessing instead of assuming
>> they are always available. For normal VMX and SVM, all the GPRs are still
>> always available. But for TDX, only EXIT_INFO_1 and EXIT_INFO_2 are always
>> marked available, while others need to be explicitly set case by case.
>>
>> The good thing is it makes TDX safer that KVM won't consume invalid data
>> silently for TDX. But it adds additional overhead of checking the
>> unnecessary register availability for VMX and SVM case.
>>
>> -----------------------------&<-------------------------------------
>> From: Xiaoyao Li <xiaoyao.li@intel.com>
>> Date: Tue, 11 Mar 2025 07:13:29 -0400
>> Subject: [PATCH] KVM: x86: Add available check for GPRs
>>
>> Since commit de3cd117ed2f ("KVM: x86: Omit caching logic for
>> always-available GPRs"), KVM doesn't check the availability of GPRs
>> except RSP and RIP when accessing them, because they are always
>> available.
>>
>> However, it's not true when it comes to TDX. The GPRs are not available
>> after TD vcpu exits actually.
> 
>> And it relies on KVM manually sets the
>> GPRs value when needed, e.g.
>>
>>   - setting rax, rbx, rcx, rdx, rsi, for hypercall emulation in
>>     tdx_emulate_tdvmall();
>>
>>   - setting rax, rcx and rdx before MSR write emulation;
>>
>> Add the available check of GPRs read, and WARN_ON_ONCE() when unavailable.
>> It can help capture the cases of undesired GPRs consumption by TDX.
> 
> Sorry, but NAK.  I am strongly against adding any code to the GPR accessors/mutators
> just for TDX.  It's a _lot_ of code.  From commit de3cd117ed2f ("KVM: x86: Omit
> caching logic for always-available GPRs"):
> 
>      E.g. on x86_64, kvm_emulate_cpuid() is reduced from 342 to 182 bytes and
>      kvm_emulate_hypercall() from 1362 to 1143, with the total size of KVM
>      dropping by ~1000 bytes.  With CONFIG_RETPOLINE=y, the numbers are even
>      more pronounced, e.g.: 353->182, 1418->1172 and well over 2000 bytes.

yeah. I had the same feeling that bring up the reduced overhead is not 
acceptable.

> Note that updating only the "available" masks is wrong, as TDX needs to marshall
> written registers back to their correct location.

In what case it needs to marshall written registers back? Can you elaborate?

> In the end, the available/dirty tracking isn't about hardening against bugs, it's
> about deferring expensive VMREAD and VMWRITE (and guest memory) operations until
> action is required.
> 
> We could bury sanity checks behind a Kconfig of some kind, but I genuinely don't
> see much value in doing so.  These emulation flows are very static (all register
> usage is hardcoded), and so it's very much a "get it right once" sort of thing,
> i.e. the odds of a runtime check finding a bug after initial development are
> basically zero.

The initial purpose of writing the code was to find if any case/path in 
KVM that consumes the GPRs for TDX unexpectedly. Not only for the 
hypercall/MSR emulation paths. There are lots of paths in KVM consuming 
the GPRs. It's difficult to aduit every path to ensure either a) it 
won't be reachable by TDX or b) KVM syncs the valid data to the GPRs 
before accessed by TDX.

> An alternative for TDX would be to avoid bouncing through GPRs in the first place,
> e.g. by reworking __kvm_emulate_rdmsr() to not access any registers.  But I'm
> probably opposed to even that, because I doubt the end result would be an overall
> net positive for KVM.  We'd end up with duplicate code, harder to read common
> code (because of the new abstractions), and likely without meaningfully moving
> the needle in terms of finding/preventing bugs.  KVM still needs to get operands
> to/from the right parameters, though only difference is that for TDX, the parameters
> would be very "direct".


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox