Linux Confidential Computing Development
* Re: [PATCH v6 01/11] x86/virt/tdx: Simplify tdmr_get_pamt_sz()
From: Kiryl Shutsemau @ 2026-06-04 16:05 UTC
  To: Rick Edgecombe
  Cc: bp, dave.hansen, hpa, kvm, linux-coco, linux-doc, linux-kernel,
	mingo, nik.borisov, pbonzini, seanjc, tglx, vannapurve, x86,
	chao.gao, yan.y.zhao, kai.huang, Binbin Wu
In-Reply-To: <20260526023515.288829-2-rick.p.edgecombe@intel.com>

On Mon, May 25, 2026 at 07:35:05PM -0700, Rick Edgecombe wrote:
> For each memory region that the TDX module might use (called TDMR), three
> separate traditional PAMT allocations are needed. One for each supported
> page size (1GB, 2MB, 4KB). These store information on each page in the
> TDMR. In Linux, they are allocated out of one physically contiguous block,
> in order to more efficiently use some internal TDX module book keeping
> resources. So some simple math is needed to break the single large
> allocation into three smaller allocations for each page size.
> 
> There are some commonalities in the math needed to calculate the base and
> size for each smaller allocation, and so an effort was made to share logic
> across the three. Unfortunately doing this turned out unnaturally tortured,
> with a loop iterating over the three page sizes, only to call into a
> function with cases statement for each page size. In the future Dynamic
> PAMT will add more logic that is special to the 4KB page size, making the
> benefit of the math sharing even more questionable.
> 
> Three is not a very high number, so get rid of the loop and just duplicate
> the small calculation three times. In doing so, setup for future Dynamic
> PAMT changes.
> 
> Since the loop that iterates over it is gone, further simplify the code by
> dropping the array of intermediate size and base storage. Just store the
> values to their final locations. Accept the small complication of having
> to clear tdmr->pamt_4k_base in the error path, so that tdmr_do_pamt_func()
> will not try to operate on the TDMR struct when attempting to free it.
> 
> Assisted-by: GitHub Copilot:claude-opus-4-6 Claude:claude-opus-4-7
> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>

Couple of nits below.

> ---
> v6:
>  - Drop {} by moving a comment (Binbin)
>  - Log tweaks
> 
> v4:
>  - Just refer to global var instead of passing pamt_entry_size around
>    (Xiaoyao)
>  - Remove setting pamt_4k_base to zero, because it already is zero.
>    Adjust the comment appropriately (Kai)
> 
> v3:
>  - New patch
> ---
>  arch/x86/virt/vmx/tdx/tdx.c | 93 ++++++++++++-------------------------
>  1 file changed, 29 insertions(+), 64 deletions(-)
> 
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 967482ae3c801..487f389f52f4b 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -516,31 +516,21 @@ static __init int fill_out_tdmrs(struct list_head *tmb_list,
>   * Calculate PAMT size given a TDMR and a page size.  The returned
>   * PAMT size is always aligned up to 4K page boundary.
>   */
> -static __init unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz,
> -					     u16 pamt_entry_size)
> +static __init unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz)
>  {
>  	unsigned long pamt_sz, nr_pamt_entries;
> +	const int tdx_pg_size_shift[] = { PAGE_SHIFT, PMD_SHIFT, PUD_SHIFT };
> +	const u16 pamt_entry_size[TDX_PS_NR] = {
> +		tdx_sysinfo.tdmr.pamt_4k_entry_size,
> +		tdx_sysinfo.tdmr.pamt_2m_entry_size,
> +		tdx_sysinfo.tdmr.pamt_1g_entry_size,
> +	};
>  
> -	switch (pgsz) {
> -	case TDX_PS_4K:
> -		nr_pamt_entries = tdmr->size >> PAGE_SHIFT;
> -		break;
> -	case TDX_PS_2M:
> -		nr_pamt_entries = tdmr->size >> PMD_SHIFT;
> -		break;
> -	case TDX_PS_1G:
> -		nr_pamt_entries = tdmr->size >> PUD_SHIFT;
> -		break;
> -	default:
> -		WARN_ON_ONCE(1);
> -		return 0;
> -	}
> +	nr_pamt_entries = tdmr->size >> tdx_pg_size_shift[pgsz];
> +	pamt_sz = nr_pamt_entries * pamt_entry_size[pgsz];
>  
> -	pamt_sz = nr_pamt_entries * pamt_entry_size;
>  	/* TDX requires PAMT size must be 4K aligned */
> -	pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
> -
> -	return pamt_sz;
> +	return PAGE_ALIGN(pamt_sz);
>  }
>  
>  /*
> @@ -578,28 +568,21 @@ static __init int tdmr_get_nid(struct tdmr_info *tdmr, struct list_head *tmb_lis
>   * within @tdmr, and set up PAMTs for @tdmr.
>   */
>  static __init int tdmr_set_up_pamt(struct tdmr_info *tdmr,
> -				   struct list_head *tmb_list,
> -				   u16 pamt_entry_size[])
> +				   struct list_head *tmb_list)
>  {
> -	unsigned long pamt_base[TDX_PS_NR];
> -	unsigned long pamt_size[TDX_PS_NR];
> -	unsigned long tdmr_pamt_base;
>  	unsigned long tdmr_pamt_size;
>  	struct page *pamt;
> -	int pgsz, nid;
> -
> +	int nid;

Add a newline here?

>  	nid = tdmr_get_nid(tdmr, tmb_list);
>  
>  	/*
>  	 * Calculate the PAMT size for each TDX supported page size
>  	 * and the total PAMT size.
>  	 */
> -	tdmr_pamt_size = 0;
> -	for (pgsz = TDX_PS_4K; pgsz < TDX_PS_NR; pgsz++) {
> -		pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz,
> -					pamt_entry_size[pgsz]);
> -		tdmr_pamt_size += pamt_size[pgsz];
> -	}
> +	tdmr->pamt_4k_size = tdmr_get_pamt_sz(tdmr, TDX_PS_4K);
> +	tdmr->pamt_2m_size = tdmr_get_pamt_sz(tdmr, TDX_PS_2M);
> +	tdmr->pamt_1g_size = tdmr_get_pamt_sz(tdmr, TDX_PS_1G);
> +	tdmr_pamt_size = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
>  
>  	/*
>  	 * Allocate one chunk of physically contiguous memory for all
> @@ -607,26 +590,18 @@ static __init int tdmr_set_up_pamt(struct tdmr_info *tdmr,
>  	 * in overlapped TDMRs.
>  	 */
>  	pamt = alloc_contig_pages(tdmr_pamt_size >> PAGE_SHIFT, GFP_KERNEL,
> -			nid, &node_online_map);
> +				  nid, &node_online_map);
> +

Looks like unrelated whitespace change. Is it intentional?

> +	/*
> +	 * tdmr->pamt_4k_base is still zero so the error
> +	 * path of the caller will skip freeing the pamt.
> +	 */
>  	if (!pamt)
>  		return -ENOMEM;
>  
> -	/*
> -	 * Break the contiguous allocation back up into the
> -	 * individual PAMTs for each page size.
> -	 */
> -	tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT;
> -	for (pgsz = TDX_PS_4K; pgsz < TDX_PS_NR; pgsz++) {
> -		pamt_base[pgsz] = tdmr_pamt_base;
> -		tdmr_pamt_base += pamt_size[pgsz];
> -	}
> -
> -	tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
> -	tdmr->pamt_4k_size = pamt_size[TDX_PS_4K];
> -	tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
> -	tdmr->pamt_2m_size = pamt_size[TDX_PS_2M];
> -	tdmr->pamt_1g_base = pamt_base[TDX_PS_1G];
> -	tdmr->pamt_1g_size = pamt_size[TDX_PS_1G];
> +	tdmr->pamt_4k_base = page_to_phys(pamt);
> +	tdmr->pamt_2m_base = tdmr->pamt_4k_base + tdmr->pamt_4k_size;
> +	tdmr->pamt_1g_base = tdmr->pamt_2m_base + tdmr->pamt_2m_size;
>  
>  	return 0;
>  }
> @@ -657,10 +632,7 @@ static __init void tdmr_do_pamt_func(struct tdmr_info *tdmr,
>  	tdmr_get_pamt(tdmr, &pamt_base, &pamt_size);
>  
>  	/* Do nothing if PAMT hasn't been allocated for this TDMR */
> -	if (!pamt_size)
> -		return;
> -
> -	if (WARN_ON_ONCE(!pamt_base))
> +	if (!pamt_base)
>  		return;
>  
>  	pamt_func(pamt_base, pamt_size);
> @@ -686,14 +658,12 @@ static __init void tdmrs_free_pamt_all(struct tdmr_info_list *tdmr_list)
>  
>  /* Allocate and set up PAMTs for all TDMRs */
>  static __init int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list,
> -					struct list_head *tmb_list,
> -					u16 pamt_entry_size[])
> +				 struct list_head *tmb_list)
>  {
>  	int i, ret = 0;
>  
>  	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
> -		ret = tdmr_set_up_pamt(tdmr_entry(tdmr_list, i), tmb_list,
> -				pamt_entry_size);
> +		ret = tdmr_set_up_pamt(tdmr_entry(tdmr_list, i), tmb_list);
>  		if (ret)
>  			goto err;
>  	}
> @@ -970,18 +940,13 @@ static __init int construct_tdmrs(struct list_head *tmb_list,
>  				  struct tdmr_info_list *tdmr_list,
>  				  struct tdx_sys_info_tdmr *sysinfo_tdmr)
>  {
> -	u16 pamt_entry_size[TDX_PS_NR] = {
> -		sysinfo_tdmr->pamt_4k_entry_size,
> -		sysinfo_tdmr->pamt_2m_entry_size,
> -		sysinfo_tdmr->pamt_1g_entry_size,
> -	};
>  	int ret;
>  
>  	ret = fill_out_tdmrs(tmb_list, tdmr_list);
>  	if (ret)
>  		return ret;
>  
> -	ret = tdmrs_set_up_pamt_all(tdmr_list, tmb_list, pamt_entry_size);
> +	ret = tdmrs_set_up_pamt_all(tdmr_list, tmb_list);
>  	if (ret)
>  		return ret;
>  
> -- 
> 2.54.0
> 
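
For readers following along, the size/base arithmetic described in the
commit message can be modeled in plain userspace C. This is only a sketch
under stated assumptions: every sketch_* name is hypothetical, the entry
sizes are passed in as a parameter rather than read from tdx_sysinfo, and
the contiguous block base is supplied by the caller instead of coming from
alloc_contig_pages().

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE	4096ULL
#define SKETCH_PAGE_ALIGN(x)	(((x) + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1))

/* Hypothetical stand-ins for TDX_PS_4K/2M/1G and TDX_PS_NR. */
enum sketch_pgsz { SKETCH_PS_4K, SKETCH_PS_2M, SKETCH_PS_1G, SKETCH_PS_NR };

/* Hypothetical, reduced version of struct tdmr_info. */
struct sketch_tdmr {
	uint64_t size;
	uint64_t pamt_4k_base, pamt_4k_size;
	uint64_t pamt_2m_base, pamt_2m_size;
	uint64_t pamt_1g_base, pamt_1g_size;
};

/* One PAMT entry per page of the given size, rounded up to a 4K boundary. */
static uint64_t sketch_pamt_sz(const struct sketch_tdmr *tdmr,
			       enum sketch_pgsz pgsz,
			       const uint16_t entry_size[SKETCH_PS_NR])
{
	static const int shift[SKETCH_PS_NR] = { 12, 21, 30 };

	assert(pgsz < SKETCH_PS_NR);
	return SKETCH_PAGE_ALIGN((tdmr->size >> shift[pgsz]) * entry_size[pgsz]);
}

/* Fill in the three sizes and return the total to allocate in one block. */
static uint64_t sketch_pamt_sizes(struct sketch_tdmr *tdmr,
				  const uint16_t entry_size[SKETCH_PS_NR])
{
	tdmr->pamt_4k_size = sketch_pamt_sz(tdmr, SKETCH_PS_4K, entry_size);
	tdmr->pamt_2m_size = sketch_pamt_sz(tdmr, SKETCH_PS_2M, entry_size);
	tdmr->pamt_1g_size = sketch_pamt_sz(tdmr, SKETCH_PS_1G, entry_size);

	return tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
}

/* Carve the contiguous block into 4K, 2M and 1G PAMTs, in that order. */
static void sketch_pamt_bases(struct sketch_tdmr *tdmr, uint64_t block_base)
{
	tdmr->pamt_4k_base = block_base;
	tdmr->pamt_2m_base = tdmr->pamt_4k_base + tdmr->pamt_4k_size;
	tdmr->pamt_1g_base = tdmr->pamt_2m_base + tdmr->pamt_2m_size;
}
```

With a 1GB TDMR and an assumed 16-byte entry for every page size, the 4K
PAMT takes 4MB, the 2M PAMT 8KB, and the 1G PAMT is rounded up from 16
bytes to a single 4K page.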

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCH v6 02/11] x86/virt/tdx: Allocate page bitmap for Dynamic PAMT
From: Kiryl Shutsemau @ 2026-06-04 16:14 UTC
  To: Rick Edgecombe
  Cc: bp, dave.hansen, hpa, kvm, linux-coco, linux-doc, linux-kernel,
	mingo, nik.borisov, pbonzini, seanjc, tglx, vannapurve, x86,
	chao.gao, yan.y.zhao, kai.huang, Kirill A. Shutemov, Binbin Wu
In-Reply-To: <20260526023515.288829-3-rick.p.edgecombe@intel.com>

On Mon, May 25, 2026 at 07:35:06PM -0700, Rick Edgecombe wrote:
> @@ -579,7 +591,12 @@ static __init int tdmr_set_up_pamt(struct tdmr_info *tdmr,
>  	 * Calculate the PAMT size for each TDX supported page size
>  	 * and the total PAMT size.
>  	 */
> -	tdmr->pamt_4k_size = tdmr_get_pamt_sz(tdmr, TDX_PS_4K);
> +	if (tdx_supports_dynamic_pamt(&tdx_sysinfo)) {
> +		/* With Dynamic PAMT, PAMT_4K is replaced with a bitmap */
> +		tdmr->pamt_4k_size = tdmr_get_pamt_bitmap_sz(tdmr);
> +	} else {
> +		tdmr->pamt_4k_size = tdmr_get_pamt_sz(tdmr, TDX_PS_4K);
> +	}
>  	tdmr->pamt_2m_size = tdmr_get_pamt_sz(tdmr, TDX_PS_2M);
>  	tdmr->pamt_1g_size = tdmr_get_pamt_sz(tdmr, TDX_PS_1G);
>  	tdmr_pamt_size = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;

Maybe it would be more readable if we reverse the size order:

	/*
	 * Calculate the PAMT size for each TDX supported page size
	 * and the total PAMT size.
	 */
	tdmr->pamt_1g_size = tdmr_get_pamt_sz(tdmr, TDX_PS_1G);
	tdmr->pamt_2m_size = tdmr_get_pamt_sz(tdmr, TDX_PS_2M);

	if (tdx_supports_dynamic_pamt(&tdx_sysinfo)) {
		/* With Dynamic PAMT, PAMT_4K is replaced with a bitmap */
		tdmr->pamt_4k_size = tdmr_get_pamt_bitmap_sz(tdmr);
	} else {
		tdmr->pamt_4k_size = tdmr_get_pamt_sz(tdmr, TDX_PS_4K);
	}

	tdmr_pamt_size = tdmr->pamt_1g_size + tdmr->pamt_2m_size + tdmr->pamt_4k_size;

It allows splitting it into logical blocks while keeping the comment attached.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* RE: [PATCH v5 05/20] dma-pool: track decrypted atomic pools and select them via attrs
From: Michael Kelley @ 2026-06-04 16:18 UTC
  To: Aneesh Kumar K.V, Michael Kelley, Jason Gunthorpe, Michael Kelley
  Cc: iommu@lists.linux.dev, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
	Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
	Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
	Mostafa Saleh, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
	Xu Yilun, linuxppc-dev@lists.ozlabs.org,
	linux-s390@vger.kernel.org, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86@kernel.org, Jiri Pirko
In-Reply-To: <yq5apl26qrof.fsf@kernel.org>

From: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Sent: Thursday, June 4, 2026 7:58 AM
> 
> Michael Kelley <mhklinux@outlook.com> writes:
> 
> > From: Jason Gunthorpe <jgg@ziepe.ca> Sent: Tuesday, June 2, 2026 5:55 PM
> >>
> >> On Tue, Jun 02, 2026 at 02:24:40PM +0000, Michael Kelley wrote:
> >>
> >> > Except that in a normal VM, the "unencrypted" pool attribute does *not*
> >> > describe the state of the memory itself.  In a normal VM, the memory is
> >> > unencrypted, but the "unencrypted" pool attribute is false. That
> >> > contradiction is the essence of my concern.
> >>
> >> I would argue no..
> >>
> >> When CC is enabled the default state of memory in a Linux environment
> >> is "encrypted". You have to take a special action to "decrypt" it.
> >>
> >> Thus the default state of memory in a non-CC environment is also
> >> paradoxically "encrypted" too.
> >
> > The need to have such an unnatural premise is usually an indication
> > of a conceptual problem with the overall model, or perhaps just a
> > terminology problem.
> >
> > Here's a proposal. The new DMA attribute is DMA_ATTR_CC_SHARED.
> > Name the pool attribute "cc_shared" instead of "unencrypted". Having
> > "cc_shared" set to false in a normal VM doesn't lead to the non-sensical
> > situation of claiming that a normal VM is encrypted. The boolean
> > "unencrypted" parameter that has been added to various calls also
> > becomes "cc_shared".  If "CC_SHARED" is a suitable name for the DMA
> > attribute, it ought to be suitable as the pool attribute. And everything
> > matches as well.
> >
> 
> That is better. It would also simplify:
> 
> 	if (mem->unencrypted != !!(attrs & DMA_ATTR_CC_SHARED))
> 		return NULL;
> 
> to
> 	if (mem->cc_shared != !!(attrs & DMA_ATTR_CC_SHARED))
> 		return NULL;
> 
> 
> I already sent a v6 in the hope of getting this merged for the next
> merge window. Should I send a v7, or would you prefer that I do the
> rename on top of v6?
> 

I would advocate for a v7 with the rename, vs. a separate follow-on
patch to do the rename, just to reduce churn. But I don't know what
the tradeoffs are in trying to hit the next merge window. If a follow-on
patch is more practical from a timing standpoint, I won't object.
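
For what it's worth, the renamed check reads naturally in isolation. Below
is a small userspace sketch of it, not the dma-pool code: sketch_pool,
sketch_pool_select() and the bit chosen for SKETCH_DMA_ATTR_CC_SHARED are
all made up for illustration. The "!!" matters because the masked
attribute is a multi-bit value that has to be collapsed to 0/1 before it
is compared against a bool.

```c
#include <stdbool.h>
#include <stddef.h>

/* Placeholder bit; the real DMA_ATTR_CC_SHARED value is not assumed here. */
#define SKETCH_DMA_ATTR_CC_SHARED	(1UL << 5)

/* Hypothetical, reduced pool descriptor. */
struct sketch_pool {
	bool cc_shared;
};

/*
 * Return the pool only if its cc_shared state matches what the caller
 * asked for via attrs; otherwise return NULL.
 */
static struct sketch_pool *sketch_pool_select(struct sketch_pool *pool,
					      unsigned long attrs)
{
	if (pool->cc_shared != !!(attrs & SKETCH_DMA_ATTR_CC_SHARED))
		return NULL;

	return pool;
}
```

In a normal VM the pool simply has cc_shared == false and callers never
pass the attribute, so the check passes without claiming anything about
encryption.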

Michael



* Re: [PATCH v6 06/11] x86/virt/tdx: Optimize tdx_pamt_get/put()
From: Kiryl Shutsemau @ 2026-06-04 16:59 UTC
  To: Edgecombe, Rick P
  Cc: Gao, Chao, kvm@vger.kernel.org, linux-coco@lists.linux.dev,
	Huang, Kai, Hansen, Dave, Zhao, Yan Y, seanjc@google.com,
	mingo@redhat.com, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, nik.borisov@suse.com,
	linux-doc@vger.kernel.org, hpa@zytor.com, tglx@kernel.org,
	Annapurve, Vishal, bp@alien8.de, kirill.shutemov@linux.intel.com,
	x86@kernel.org
In-Reply-To: <fe08f03a22acfe758cd97f7c2880deeafbc5fe58.camel@intel.com>

On Tue, May 26, 2026 at 04:42:24PM +0000, Edgecombe, Rick P wrote:
> On Tue, 2026-05-26 at 16:57 +0800, Chao Gao wrote:
> > > -	scoped_guard(spinlock, &pamt_lock) {
> > 
> > This converts the scoped_guard() added by the previous patch to
> > explicit lock/unlock and goto. It would reduce code churn if the
> > previous patch used that form directly.
> 
> Yea, it's a good point. I actually debated doing it, but decided not to because
> the scoped version is cleaner for the non-optimized version. But for
> reviewability, never doing the scoped version is probably better.

I don't see a reason why we can't keep the scoped_guard() on the get side.

On the put side, we cannot get atomic_get_and_lock() semantics without
dropping the scoped_guard().

Maybe we should keep it for get?
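
To illustrate the put-side constraint, here is a userspace sketch of the
"drop a reference, and take the lock only when it is the last one" pattern,
in the spirit of the kernel's atomic_dec_and_lock(). It assumes that is the
semantics meant above. It uses C11 atomics and a toy spinlock; all sketch_*
names are hypothetical and this is not the tdx_pamt_put() code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy spinlock built on atomic_flag, for illustration only. */
struct sketch_lock {
	atomic_flag held;
};
#define SKETCH_LOCK_INIT { ATOMIC_FLAG_INIT }

static void sketch_lock(struct sketch_lock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
		;
}

static void sketch_unlock(struct sketch_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Fast path: decrement without the lock unless the count would hit zero. */
static bool sketch_dec_unless_last(atomic_int *cnt)
{
	int old = atomic_load(cnt);

	while (old > 1) {
		/* On failure, old is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(cnt, &old, old - 1))
			return true;
	}
	return false;
}

/*
 * Returns true with the lock HELD if the count dropped to zero; the
 * caller tears down and then unlocks. Returns false with the lock not
 * held otherwise.
 */
static bool sketch_dec_and_lock(atomic_int *cnt, struct sketch_lock *l)
{
	int prev;

	if (sketch_dec_unless_last(cnt))
		return false;

	sketch_lock(l);
	prev = atomic_fetch_sub(cnt, 1);
	assert(prev > 0);	/* dropping a reference that was never taken */
	if (prev == 1)
		return true;

	sketch_unlock(l);
	return false;
}
```

The lock is acquired inside the helper and, on the last put, is still held
when it returns, so its lifetime does not match a lexical scope. That is
the part a scope-based guard cannot express, while the get side has no such
conditional hand-off.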

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCH v6 08/11] x86/tdx: Add APIs to support Dynamic PAMT ops from KVM's fault path
From: Kiryl Shutsemau @ 2026-06-04 17:11 UTC
  To: Rick Edgecombe
  Cc: bp, dave.hansen, hpa, kvm, linux-coco, linux-doc, linux-kernel,
	mingo, nik.borisov, pbonzini, seanjc, tglx, vannapurve, x86,
	chao.gao, yan.y.zhao, kai.huang
In-Reply-To: <20260526023515.288829-9-rick.p.edgecombe@intel.com>

On Mon, May 25, 2026 at 07:35:12PM -0700, Rick Edgecombe wrote:
> When handling an EPT violation, KVM holds a spinlock while manipulating
> the EPT. Before entering the spinlock it doesn't know how many EPT page
> tables will need to be installed or whether a huge page will be used. For
> this reason it allocates a worst case number of page tables that it might
> need as part of servicing the EPT violation.
> 
> Under Dynamic PAMT these pre-allocated pages will potentially need to have
> Dynamic PAMT backing pages installed for them. KVM already has helpers to
> manage topping up page caches before taking the MMU lock, but they cannot be
> passed from KVM to arch/x86 code.
> 
> The problem of how and when to install the DPAMT backing pages for the
> pages given to the TDX module during the fault path has had a lot of
> design attempts.
>  - Extracting KVM's MMU caches requires too much inlined code added to
>    headers.
>  - A few varieties of installing Dynamic PAMT backing when allocating the
>    S-EPT page tables. [0][1]
>  - Using mempool_t to transfer the pages between KVM and arch/x86 doesn't
>    work because it is the component is designed more around maintaining a
>    pool of pages, rather than topping up a continually drained cache.
> 
> So don't do these as they all had various problems. Instead just create a
> small simple data structure to use for handing a pre-allocated list of
> pages between KVM and arch/x86 code. Model this on KVM's existing MMU
> memory caches.
> 
> Add a tdx_pamt_cache arg to tdx_pamt_get() so it can draw pages from a
> cache when needed. Not all DPAMT page installations will happen under
> spinlock, for example control pages. So have tdx_pamt_get() maintain the
> existing behavior of allocating from the page allocator when NULL is
> passed for the struct tdx_pamt_cache arg. This prevents excess allocations
> for cases where it can be avoided.
> 
> Export the new helpers for KVM.
> 
> Assisted-by: GitHub Copilot:claude-opus-4-6 Claude:claude-opus-4-7
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Link: https://lore.kernel.org/kvm/de05853257e9cc66998101943f78a4b7e6e3d741.camel@intel.com/ [0]
> Link: https://lore.kernel.org/kvm/aYprxnSHKHUtk7pt@google.com/ [1]

Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
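
As a rough illustration of the design described in the commit message, a
pre-allocated page list with a NULL fallback could look like the userspace
sketch below. It is an assumption-laden model, not the patch: the sketch_*
names, the fixed capacity and the use of malloc() in place of the page
allocator are all invented for the example.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_CACHE_CAPACITY	8
#define SKETCH_PAGE_SIZE	4096

/* Hypothetical stand-in for struct tdx_pamt_cache. */
struct sketch_pamt_cache {
	void *pages[SKETCH_CACHE_CAPACITY];
	int nr;
};

/*
 * Fill the cache up to @min pages. This may allocate, so it models the
 * step done before the spinlock is taken.
 */
static int sketch_cache_topup(struct sketch_pamt_cache *cache, int min)
{
	assert(min <= SKETCH_CACHE_CAPACITY);

	while (cache->nr < min) {
		void *page = malloc(SKETCH_PAGE_SIZE);

		if (!page)
			return -1;
		cache->pages[cache->nr++] = page;
	}
	return 0;
}

/*
 * Get one backing page. With a cache, only draw from it (safe under a
 * lock, never allocates). With NULL, fall back to the allocator.
 */
static void *sketch_alloc_pamt_page(struct sketch_pamt_cache *cache)
{
	if (!cache)
		return malloc(SKETCH_PAGE_SIZE);

	if (cache->nr == 0)
		return NULL;

	return cache->pages[--cache->nr];
}

/* Release whatever the fault path did not consume. */
static void sketch_cache_free(struct sketch_pamt_cache *cache)
{
	while (cache->nr)
		free(cache->pages[--cache->nr]);
}
```

The NULL case mirrors the behavior described for control pages: callers
that are allowed to allocate skip the cache entirely, so nothing is
pre-allocated for them.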

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCH v6 10/11] x86/virt/tdx: Enable Dynamic PAMT
From: Kiryl Shutsemau @ 2026-06-04 17:14 UTC
  To: Rick Edgecombe
  Cc: bp, dave.hansen, hpa, kvm, linux-coco, linux-doc, linux-kernel,
	mingo, nik.borisov, pbonzini, seanjc, tglx, vannapurve, x86,
	chao.gao, yan.y.zhao, kai.huang, Kirill A. Shutemov
In-Reply-To: <20260526023515.288829-11-rick.p.edgecombe@intel.com>

On Mon, May 25, 2026 at 07:35:14PM -0700, Rick Edgecombe wrote:
> @@ -152,7 +156,12 @@ const struct tdx_sys_info *tdx_get_sysinfo(void);
>  
>  static inline bool tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo)
>  {
> -	return false; /* To be enabled when kernel is ready */
> +	/*
> +	 * The TDX Module's internal Dynamic PAMT tree structure can't
> +	 * handle physical addresses with more than 48 bits.
> +	 */
> +	return sysinfo->features.tdx_features0 & TDX_FEATURES0_DYNAMIC_PAMT &&
> +	       boot_cpu_data.x86_phys_bits <= 48;

Should we warn for >48?
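
Something along these lines is what I have in mind, sketched in userspace
C purely to show the shape. The sketch_* names and the feature bit value
are placeholders, and fprintf() with a static flag stands in for a
warn-once style helper such as pr_warn_once().

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Placeholder bit; the real TDX_FEATURES0_DYNAMIC_PAMT is not assumed. */
#define SKETCH_FEATURES0_DYNAMIC_PAMT	(1ULL << 36)
#define SKETCH_DPAMT_MAX_PHYS_BITS	48

/*
 * Report Dynamic PAMT as usable only when the module advertises it and
 * the physical address width fits. Warn once when the feature is present
 * but has to be ignored because of the address width.
 */
static bool sketch_supports_dynamic_pamt(uint64_t features0,
					 unsigned int phys_bits)
{
	static bool warned;

	if (!(features0 & SKETCH_FEATURES0_DYNAMIC_PAMT))
		return false;

	if (phys_bits > SKETCH_DPAMT_MAX_PHYS_BITS) {
		if (!warned) {
			warned = true;
			fprintf(stderr,
				"Dynamic PAMT disabled: %u physical address bits > %d\n",
				phys_bits, SKETCH_DPAMT_MAX_PHYS_BITS);
		}
		return false;
	}

	return true;
}
```

The warning only fires when the feature bit is set, so systems without
Dynamic PAMT stay quiet.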

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCH v5 05/20] dma-pool: track decrypted atomic pools and select them via attrs
From: Jason Gunthorpe @ 2026-06-04 18:24 UTC
  To: Aneesh Kumar K.V
  Cc: Michael Kelley, iommu@lists.linux.dev,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
	Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
	Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
	Mostafa Saleh, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
	Xu Yilun, linuxppc-dev@lists.ozlabs.org,
	linux-s390@vger.kernel.org, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86@kernel.org, Jiri Pirko
In-Reply-To: <yq5apl26qrof.fsf@kernel.org>

On Thu, Jun 04, 2026 at 08:27:36PM +0530, Aneesh Kumar K.V wrote:
> I already sent a v6 in the hope of getting this merged for the next
> merge window. Should I send a v7, or would you prefer that I do the
> rename on top of v6?

I think it is too late for such a major change, but this should be
imagined to be for rc2ish next cycle. You also have to spell out how
the pkvm patch will get sequenced; it would be best to push for it to
get picked up right away.

Jason


* [CfP] Confidential Computing Microconference (LPC 2026)
From: Jörg Rödel @ 2026-06-04 18:29 UTC
  To: linux-coco, linux-kernel, kvm, virtualization, coconut-svsm,
	linux-sgx
  Cc: Dhaval Giani

Hi everyone,

We are pleased to officially open the Call for Presentations for the 2026
Confidential Computing Microconference at LPC this year. LPC will take place
from October 5th to 7th in Prague.

Our goal is to bring open-source developers and industry experts together for
productive discussions that lead to concrete solutions. We are looking for
interactive discussions, ongoing development topics, and problem-solving
sessions rather than static status updates.

We are looking for proposals covering, but not limited to, the following
developments and challenges:

	* Enhancements to CVM memory backing via `guest_memfd`
	* KVM Support for ARM CCA
	* Privilege separation features in KVM
	* CVM live migration
	* Secure VM Service Module (SVSM) architecture and Linux support
	* Trusted I/O software architecture
	* Solutions for the full CVM (remote) attestation problem
	* Linux as a CVM operating system across hypervisors
	* CVM Performance optimization and benchmarking

If you are working on any of these areas or have another critical Confidential
Computing topic that requires community alignment, please submit your proposal!

LPC microconferences are built around discussion and collaboration. Proposals
should focus on open problems, architectural roadblocks, or design choices that
would benefit from in-person feedback from the community.

* Submit here:		https://lpc.events/event/20/abstracts/
* Submission Deadline:	August 7, 2026

Make sure to select "Confidential Computing MC" as the track! This year the LPC
organization committee will grant pre-registration vouchers to anyone who has
submitted a topic. These are at the usual price ($600) which must be used
before registration opens. If your topic is not accepted you should be eligible
for a refund if your employer doesn’t approve your travel. For more details see

	https://lpc.events/blog/current/index.php/2026/04/06/changes-to-registration-availability-for-2026/

Looking forward to seeing you there,


- Dhaval and Joerg


* Re: [PATCH v7 20/42] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Ackerley Tng @ 2026-06-04 19:05 UTC
  To: Suzuki K Poulose, aik, andrew.jones, binbin.wu, brauner,
	chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
	oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
	shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
	forkloop, pratyush, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <9d15479e-e36b-4865-804c-7d93eb339e4e@arm.com>

Suzuki K Poulose <suzuki.poulose@arm.com> writes:

>
> [...snip...]
>
>> +In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO, ``uaddr`` will be
>> +ignored completely. Otherwise, ``uaddr`` is required if
>> +kvm.vm_memory_attributes=1 and optional if kvm.vm_memory_attributes=0, since
>> +in the latter case guest memory can be initialized directly from userspace
>> +prior to converting it to private and passing the GPA range on to this
>> +interface.
>
> Just to confirm, so the sev_gmem_prepare doesn't destroy the contents in
> the process of making it "private" ? i.e., the contents of a SNP shared
> page are preserved while transitioning to "SNP Private" (via RMP
> update).
>
> Suzuki
>

The following is the guest_memfd perspective; I didn't look at the SNP
spec:

Do you mean specifically for KVM_SEV_SNP_PAGE_TYPE_ZERO, or for any
type?

guest_memfd has no plans to do any special zeroing based on type.

guest_memfd decoupled zeroing from preparation a while ago (Michael had
some patches), so zeroing is supposed to happen once during folio
ownership by guest_memfd, tracked by the uptodate flag, and preparation
is tracked outside of guest_memfd. So far only SNP does preparation.

>
>
>>
>> [...snip...]
>>


* Re: [PATCH v7 20/42] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Michael Roth @ 2026-06-04 20:11 UTC
  To: Suzuki K Poulose
  Cc: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <9d15479e-e36b-4865-804c-7d93eb339e4e@arm.com>

On Thu, Jun 04, 2026 at 04:29:19PM +0100, Suzuki K Poulose wrote:
> On 23/05/2026 01:18, Ackerley Tng via B4 Relay wrote:
> > From: Michael Roth <michael.roth@amd.com>
> > 
> > For vm_memory_attributes=1, in-place conversion/population is not
> > supported, so the initial contents necessarily must need to come
> > from a separate src address, which is enforced by the current
> > implementation. However, for vm_memory_attributes=0, it is possible for
> > guest memory to be initialized directly from userspace by mmap()'ing the
> > guest_memfd and writing to it while the corresponding GPA ranges are in
> > a 'shared' state before converting them to the 'private' state expected
> > by KVM_SEV_SNP_LAUNCH_UPDATE.
> > 
> > Update the handling/documentation for KVM_SEV_SNP_LAUNCH_UPDATE to allow
> > for 'uaddr' to be set to NULL when vm_memory_attributes=0, which
> > SNP_LAUNCH_UPDATE will then use to determine when it should/shouldn't
> > copy in data from a separate memory location. Continue to enforce
> > non-NULL for the original vm_memory_attributes=1 case.
> > 
> > Signed-off-by: Michael Roth <michael.roth@amd.com>
> > [Added src_page check in error handling path when the firmware command fails]
> > [Dropped ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES]
> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> 
> 
> 
> 
> > ---
> >   Documentation/virt/kvm/x86/amd-memory-encryption.rst | 15 +++++++++++----
> >   arch/x86/kvm/svm/sev.c                               | 18 +++++++++++++-----
> >   virt/kvm/kvm_main.c                                  |  1 +
> >   3 files changed, 25 insertions(+), 9 deletions(-)
> > 
> > diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> > index b2395dd4769de..43085f65b2d85 100644
> > --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> > +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> > @@ -503,7 +503,8 @@ secrets.
> >   It is required that the GPA ranges initialized by this command have had the
> >   KVM_MEMORY_ATTRIBUTE_PRIVATE attribute set in advance. See the documentation
> > -for KVM_SET_MEMORY_ATTRIBUTES for more details on this aspect.
> > +for KVM_SET_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES2 for more details on
> > +this aspect.
> >   Upon success, this command is not guaranteed to have processed the entire
> >   range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
> > @@ -511,9 +512,15 @@ range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
> >   remaining range that has yet to be processed. The caller should continue
> >   calling this command until those fields indicate the entire range has been
> >   processed, e.g. ``len`` is 0, ``gfn_start`` is equal to the last GFN in the
> > -range plus 1, and ``uaddr`` is the last byte of the userspace-provided source
> > -buffer address plus 1. In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO,
> > -``uaddr`` will be ignored completely.
> > +range plus 1, and ``uaddr`` (if specified) is the last byte of the
> > +userspace-provided source buffer address plus 1.
> > +
> > +In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO, ``uaddr`` will be
> > +ignored completely. Otherwise, ``uaddr`` is required if
> > +kvm.vm_memory_attributes=1 and optional if kvm.vm_memory_attributes=0, since
> > +in the latter case guest memory can be initialized directly from userspace
> > +prior to converting it to private and passing the GPA range on to this
> > +interface.
> 
> Just to confirm, so the sev_gmem_prepare doesn't destroy the contents in the
> process of making it "private" ? i.e., the contents of a SNP shared
> page are preserved while transitioning to "SNP Private" (via RMP
> update).

sev_gmem_prepare() does sort of destroy contents since it finalizes the
shared->private conversion which puts the page in an unusable state
until the guest 'accepts' it as private memory and re-initializes the
contents.

But that's run-time, when the guest is doing conversions. The
documentation here relates to initialization time, when we are
setting up the initial pre-encrypted/pre-measured guest memory image
via SNP_LAUNCH_UPDATE. That path calls into kvm_gmem_populate(), and it
is then the sev_gmem_post_populate() callback that actually finalizes
the shared->private conversion. The sev_gmem_prepare() hook doesn't get
used in this flow (kvm_gmem_populate() calls __kvm_gmem_get_pfn() which
skips preparation).
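
For reference, the 'uaddr' rule from the quoted documentation hunk can be
condensed into the small userspace sketch below. It only models the
decision logic. The sketch_* names and enum values are placeholders rather
than the UAPI constants, and a zero uaddr stands for "not specified".

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder page types; values are not the KVM_SEV_SNP_PAGE_TYPE_* ones. */
enum sketch_page_type {
	SKETCH_PAGE_TYPE_NORMAL,
	SKETCH_PAGE_TYPE_ZERO,
};

/*
 * Is the request acceptable with respect to uaddr?
 *  - ZERO pages: uaddr is ignored, anything goes.
 *  - vm_memory_attributes=1: no in-place population, a source is required.
 *  - vm_memory_attributes=0: a source is optional, since the memory may
 *    already have been written through the guest_memfd mapping.
 */
static bool sketch_uaddr_ok(bool vm_memory_attributes,
			    enum sketch_page_type type, uint64_t uaddr)
{
	if (type == SKETCH_PAGE_TYPE_ZERO)
		return true;

	if (vm_memory_attributes)
		return uaddr != 0;

	return true;
}

/* Should data be copied in from (and uaddr advanced past) a source buffer? */
static bool sketch_copies_from_uaddr(enum sketch_page_type type,
				     uint64_t uaddr)
{
	return type != SKETCH_PAGE_TYPE_ZERO && uaddr != 0;
}
```

The second helper corresponds to the "src && type != ZERO" condition in
the snp_launch_update() hunk: with no source buffer there is nothing to
copy and no uaddr to advance.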

-Mike

> 
> Suzuki
> 
> 
> 
> >   Parameters (in): struct  kvm_sev_snp_launch_update
> > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> > index 1a361f08c7a3d..e1dbc827c2807 100644
> > --- a/arch/x86/kvm/svm/sev.c
> > +++ b/arch/x86/kvm/svm/sev.c
> > @@ -2343,7 +2343,15 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> >   	int level;
> >   	int ret;
> > -	if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page))
> > +	/*
> > +	 * For vm_memory_attributes=1, in-place conversion/population is not
> > +	 * supported, so the initial contents necessarily need to come from a
> > +	 * separate src address. For vm_memory_attributes=0, this isn't
> > +	 * necessarily the case, since the pages may have been populated
> > +	 * directly from userspace before calling KVM_SEV_SNP_LAUNCH_UPDATE.
> > +	 */
> > +	if (vm_memory_attributes &&
> > +	    sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page)
> >   		return -EINVAL;
> >   	ret = snp_lookup_rmpentry((u64)pfn, &assigned, &level);
> > @@ -2390,7 +2398,7 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> >   	 */
> >   	if (ret && !snp_page_reclaim(kvm, pfn) &&
> >   	    sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID &&
> > -	    sev_populate_args->fw_error == SEV_RET_INVALID_PARAM) {
> > +	    sev_populate_args->fw_error == SEV_RET_INVALID_PARAM && src_page) {
> >   		void *src_vaddr = kmap_local_page(src_page);
> >   		void *dst_vaddr = kmap_local_pfn(pfn);
> > @@ -2423,8 +2431,8 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> >   	if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
> >   		return -EFAULT;
> > -	pr_debug("%s: GFN start 0x%llx length 0x%llx type %d flags %d\n", __func__,
> > -		 params.gfn_start, params.len, params.type, params.flags);
> > +	pr_debug("%s: GFN start 0x%llx length 0x%llx type %d flags %d src %llx\n", __func__,
> > +		 params.gfn_start, params.len, params.type, params.flags, params.uaddr);
> >   	if (!params.len || !PAGE_ALIGNED(params.len) || params.flags ||
> >   	    (params.type != KVM_SEV_SNP_PAGE_TYPE_NORMAL &&
> > @@ -2481,7 +2489,7 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> >   	params.gfn_start += count;
> >   	params.len -= count * PAGE_SIZE;
> > -	if (params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO)
> > +	if (src && params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO)
> >   		params.uaddr += count * PAGE_SIZE;
> >   	if (copy_to_user(u64_to_user_ptr(argp->data), &params, sizeof(params)))
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index ba195bb239aaa..3bf212fd99193 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -105,6 +105,7 @@ module_param(allow_unsafe_mappings, bool, 0444);
> >   #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> >   bool vm_memory_attributes = true;
> >   module_param(vm_memory_attributes, bool, 0444);
> > +EXPORT_SYMBOL_FOR_KVM_INTERNAL(vm_memory_attributes);
> >   #endif
> >   DEFINE_STATIC_CALL_RET0(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
> >   EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_KEY(__kvm_get_memory_attributes));
> > 
> 

* Re: [PATCH v7 00/42] guest_memfd: In-place conversion support
From: Sean Christopherson @ 2026-06-04 20:20 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
	chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
	oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
	shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
	forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <CAEvNRgGpaggjd3=ooyzv7iEbmA-x1mWJHgjLSjPi8=5CPrk-yQ@mail.gmail.com>

On Wed, Jun 03, 2026, Ackerley Tng wrote:
> Ackerley Tng via B4 Relay <devnull+ackerleytng.google.com@kernel.org>
> writes:
> 
> > This is v7 of guest_memfd in-place conversion support.
> >
> 
> Here's the outstanding items after going over everyone's comments
> including Sashiko's:
> 
> + KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
>     + Need to move page clearing into __kvm_gmem_get_pfn to resolve
>       leak where populate can put initialized kernel memory into TDX
>       guest
>     + See suggested fix at [1]

That fix works for me.  The initial guest image will typically be a tiny subset
of guest memory, so unnecessarily zeroing a few pages isn't a performance concern.

> + KVM: guest_memfd: Only prepare folios for private pages,
>     + s/non-CoCo/CoCo in commit message "INIT_SHARED is about to be
>       supported for non-CoCo VMs in a later patch in this series
>     + Use Suggested-by: Michael Roth <michael.roth@amd.com>
> + KVM: selftests: Test that shared/private status is consistent across
>   processes
>     + Improve test reliability using pthread_mutex
>     + I have a fixup patch offline.
> 	
> I would like feedback on these:
> 	
> + KVM: selftests: Test conversion with elevated page refcount
>     + Askar pointed out that soon vmsplice may not pin pages. Should I
>       pin pages through CONFIG_GUP_TEST like in [2]? I prefer not to
>       take a dependency on CONFIG_GUP_TEST.

I'm not exactly excited about taking a dependency on CONFIG_GUP_TEST either, but
it probably is the least awful choice.  E.g. KVM also pins pages in certain flows,
but we're _also_ actively working to remove the need to pin.

Hmm, maybe IORING_REGISTER_PBUF_RING?  AFAICT, it's almost literally a "pin user
memory" syscall.

> + KVM: selftests: Add script to exercise private_mem_conversions_test
>     + Would like to know what people think of a wrapper script before
>       I address Sashiko's comments.

NAK to a wrapper script.  This sounds like a perfect fit for Vipin's selftest
runner (which I'm like 4 months overdue for reviewing, testing, and merging).
If the runner _can't_ do what you want, then I'd rather improve the runner.

[*] https://lore.kernel.org/all/20260331194202.1722082-1-vipinsh@google.com

> 
> [1] https://lore.kernel.org/all/CAEvNRgEVC=fFuKVgZYvWyZD7t_zvUZihFG8hrACjvtkD5cwugw@mail.gmail.com/
> [2] https://lore.kernel.org/all/baa8838f623102931e755cf34c86314b305af49c.1747264138.git.ackerleytng@google.com/
> 
> >
> > [...snip...]
> >

* Re: [PATCH v7 00/42] guest_memfd: In-place conversion support
From: Ackerley Tng @ 2026-06-04 21:14 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
	chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
	oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
	shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
	forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <aiHeDZEPkAcWcSkn@google.com>

Sean Christopherson <seanjc@google.com> writes:

> On Wed, Jun 03, 2026, Ackerley Tng wrote:
>> Ackerley Tng via B4 Relay <devnull+ackerleytng.google.com@kernel.org>
>> writes:
>>
>> > This is v7 of guest_memfd in-place conversion support.
>> >
>>
>> Here's the outstanding items after going over everyone's comments
>> including Sashiko's:
>>
>> + KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
>>     + Need to move page clearing into __kvm_gmem_get_pfn to resolve
>>       leak where populate can put initialized kernel memory into TDX
>>       guest
>>     + See suggested fix at [1]
>
> That fix works for me.  The initial guest image will typically be a tiny subset
> of guest memory, so unnecessarily zeroing a few pages isn't a performance concern.
>

In regular usage, moving the zeroing as in [1] doesn't change anything:
the same zeroing already happens when the host first faults in the pages
to write the initial image. At populate time the pages are not zeroed
again, since they have already been zeroed.

[1] covers the case where the host doesn't write anything to the pages
and directly tries to populate the pages to the guest.
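
A user-space toy model of that behaviour, as a sketch only: the toy_*
names and the "zeroed" marker are hypothetical and stand in for whatever
state the real fix in [1] uses to clear a page exactly once in the
common lookup path.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Toy model, not kernel code: all toy_* names are hypothetical. */
struct toy_folio {
	unsigned char data[TOY_PAGE_SIZE];
	bool zeroed; /* hypothetical "already cleared" marker */
};

/* Common lookup: clear the page on first use, never again afterwards. */
unsigned char *toy_get_page(struct toy_folio *folio)
{
	if (!folio->zeroed) {
		memset(folio->data, 0, sizeof(folio->data));
		folio->zeroed = true;
	}
	return folio->data;
}

/* Host fault path: the host writes the initial image through the lookup. */
void toy_host_write(struct toy_folio *folio, const void *src, size_t len)
{
	assert(len <= TOY_PAGE_SIZE);
	memcpy(toy_get_page(folio), src, len);
}

/* Populate path: hands the page contents to the guest via the same lookup. */
const unsigned char *toy_populate_page(struct toy_folio *folio)
{
	return toy_get_page(folio);
}
```

With the clearing in the common lookup, a page that the host never
touched is still zeroed before populate sees it, while a page the host
already wrote keeps its contents.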

>> + KVM: guest_memfd: Only prepare folios for private pages,
>>     + s/non-CoCo/CoCo in commit message "INIT_SHARED is about to be
>>       supported for non-CoCo VMs in a later patch in this series
>>     + Use Suggested-by: Michael Roth <michael.roth@amd.com>
>> + KVM: selftests: Test that shared/private status is consistent across
>>   processes
>>     + Improve test reliability using pthread_mutex
>>     + I have a fixup patch offline.
>> 	
>> I would like feedback on these:
>> 	
>> + KVM: selftests: Test conversion with elevated page refcount
>>     + Askar pointed out that soon vmsplice may not pin pages. Should I
>>       pin pages through CONFIG_GUP_TEST like in [2]? I prefer not to
>>       take a dependency on CONFIG_GUP_TEST.
>
> I'm not exactly excited about taking a dependency on CONFIG_GUP_TEST either, but
> it probably is the least awful choice.  E.g. KVM also pins pages in certain flows,
> but we're _also_ actively working to remove the need to pin.
>
> Hmm, maybe IORING_REGISTER_PBUF_RING?  AFAICT, it's almost literally a "pin user
> memory" syscall.
>

Hmm, that takes a dependency on io_uring, which isn't always compiled
in. Between CONFIG_IO_URING and CONFIG_GUP_TEST, I'd rather depend on
CONFIG_GUP_TEST.

>> + KVM: selftests: Add script to exercise private_mem_conversions_test
>>     + Would like to know what people think of a wrapper script before
>>       I address Sashiko's comments.
>
> NAK to a wrapper script.  This sounds like a perfect fit for Vipin's selftest
> runner (which I'm like 4 months overdue for reviewing, testing, and merging).
> If the runner _can't_ do what you want, then I'd rather improve the runner.
>
> [*] https://lore.kernel.org/all/20260331194202.1722082-1-vipinsh@google.com
>

Good to know we have this!

Thanks, I'll work on a v8 to clean up the above.

>>
>> [1] https://lore.kernel.org/all/CAEvNRgEVC=fFuKVgZYvWyZD7t_zvUZihFG8hrACjvtkD5cwugw@mail.gmail.com/
>> [2] https://lore.kernel.org/all/baa8838f623102931e755cf34c86314b305af49c.1747264138.git.ackerleytng@google.com/
>>
>> >
>> > [...snip...]
>> >

* Re: [PATCH v2 03/15] KVM: x86/xen: Don't truncate RAX when handling hypercall from protected guest
From: David Woodhouse @ 2026-06-04 21:48 UTC (permalink / raw)
  To: Binbin Wu
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov,
	Kiryl Shutsemau, Paul Durrant, Dave Hansen, Rick Edgecombe, kvm,
	x86, linux-coco, linux-kernel, Yosry Ahmed, Kai Huang
In-Reply-To: <dc62e58e-b6ee-41e1-84a5-0716822fefc8@linux.intel.com>

On Mon, 2026-05-18 at 17:55 +0800, Binbin Wu wrote:
>  
> > I'm suggesting that you clean up longmode→is_64bit for the *hypercalls*
> > but leave 'long_mode' as is.
> > 
> 
> Yes, will only do it for is_64_bit_hypercall().

If you did this, I'm not sure I saw it? 

In response to https://lore.kernel.org/all/aiHPPUk5DY7rH-zL@v4bel/#r I
now find myself with both 'longmode' (current vCPU mode, should be
called is_64bit), and 'long_mode' (latched VM-wide mode) in the *same*
function.

I cannot live with that; I'm going to do the longmode→is_64bit change
locally.

* Re: [PATCH v6 10/11] x86/virt/tdx: Enable Dynamic PAMT
From: Chao Gao @ 2026-06-05  5:25 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Rick Edgecombe, bp, dave.hansen, hpa, kvm, linux-coco, linux-doc,
	linux-kernel, mingo, nik.borisov, pbonzini, seanjc, tglx,
	vannapurve, x86, yan.y.zhao, kai.huang, Kirill A. Shutemov
In-Reply-To: <aiGyIQvudD5ZF3lf@thinkstation>

On Thu, Jun 04, 2026 at 06:14:17PM +0100, Kiryl Shutsemau wrote:
>On Mon, May 25, 2026 at 07:35:14PM -0700, Rick Edgecombe wrote:
>> @@ -152,7 +156,12 @@ const struct tdx_sys_info *tdx_get_sysinfo(void);
>>  
>>  static inline bool tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo)
>>  {
>> -	return false; /* To be enabled when kernel is ready */
>> +	/*
>> +	 * The TDX Module's internal Dynamic PAMT tree structure can't
>> +	 * handle physical addresses with more than 48 bits.
>> +	 */
>> +	return sysinfo->features.tdx_features0 & TDX_FEATURES0_DYNAMIC_PAMT &&
>> +	       boot_cpu_data.x86_phys_bits <= 48;
>
>Should we warn for >48?

Maybe we should drop this check. If the TDX module cannot handle that case,
advertising TDX_FEATURES0_DYNAMIC_PAMT is a bug and should be fixed by the
module.

* Re: [PATCH v6 06/11] x86/virt/tdx: Optimize tdx_pamt_get/put()
From: Chao Gao @ 2026-06-05  5:40 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Edgecombe, Rick P, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, Huang, Kai, Hansen, Dave, Zhao, Yan Y,
	seanjc@google.com, mingo@redhat.com, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, nik.borisov@suse.com,
	linux-doc@vger.kernel.org, hpa@zytor.com, tglx@kernel.org,
	Annapurve, Vishal, bp@alien8.de, kirill.shutemov@linux.intel.com,
	x86@kernel.org
In-Reply-To: <aiGq7XjmMrsqdBY5@thinkstation>

On Thu, Jun 04, 2026 at 05:59:02PM +0100, Kiryl Shutsemau wrote:
>On Tue, May 26, 2026 at 04:42:24PM +0000, Edgecombe, Rick P wrote:
>> On Tue, 2026-05-26 at 16:57 +0800, Chao Gao wrote:
>> > > -	scoped_guard(spinlock, &pamt_lock) {
>> > 
>> > This converts the scoped_guard() added by the previous patch to
>> > explicit lock/unlock and goto. It would reduce code churn if the
>> > previous patch used that form directly.
>> 
>> Yea, it's a good point. I actually debated doing it, but decided not to because
>> the scoped version is cleaner for the non-optimized version. But for
>> reviewability, never doing the scoped version is probably better.
>
>I don't see a reason why we can't keep the scoped_guard() on get side.

One additional reason to drop scoped_guard() is that it mixes cleanup helpers
with goto, which is discouraged. See [*]

 :Lastly, given that the benefit of cleanup helpers is removal of “goto”, and
 :that the “goto” statement can jump between scopes, the expectation is that
 :usage of “goto” and cleanup helpers is never mixed in the same function.

Removing scoped_guard() here also reduces indentation.

*: https://www.kernel.org/doc/html/v7.1-rc6/core-api/cleanup.html
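
A minimal single-threaded sketch of the explicit lock/unlock plus goto
shape, in user-space C. Everything here is hypothetical (the toy lock,
the toy_pamt fields, the fail_install test knob); it is not the code from
the patch, only an illustration of a function that uses goto for its
unlock path and no cleanup helpers.

```c
#include <assert.h>
#include <stdbool.h>

/* Single-threaded toy lock so the sketch has no threading dependency. */
struct toy_lock {
	bool held;
};

void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = true;
}

void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Toy stand-in for the per-2MB PAMT state; all fields are hypothetical. */
struct toy_pamt {
	struct toy_lock lock;
	int refcount;      /* users of the range */
	bool backed;       /* backing pages installed */
	bool fail_install; /* test knob to force the error path */
};

int toy_pamt_get(struct toy_pamt *p)
{
	int ret = 0;

	toy_lock_acquire(&p->lock);

	/* Already backed by an earlier user: just take a reference. */
	if (p->refcount > 0) {
		p->refcount++;
		goto out_unlock;
	}

	/* First user: installing the backing pages may fail. */
	if (p->fail_install) {
		ret = -1;
		goto out_unlock;
	}

	p->backed = true;
	p->refcount = 1;

out_unlock:
	toy_lock_release(&p->lock);
	return ret;
}
```

Every exit goes through the single out_unlock label, so the function
body stays at one indentation level and does not mix goto with a scoped
guard.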

>
>On put side, we cannot get atomic_get_and_lock() semantics without
>dropping the scoped_guard().
>
>Maybe we should keep it for get?
>
>-- 
>  Kiryl Shutsemau / Kirill A. Shutemov

* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Gavin Shan @ 2026-06-05  6:23 UTC (permalink / raw)
  To: Steven Price, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <20260513131757.116630-30-steven.price@arm.com>

Hi Steve,

On 5/13/26 11:17 PM, Steven Price wrote:
> At runtime if the realm guest accesses memory which hasn't yet been
> mapped then KVM needs to either populate the region or fault the guest.
> 
> For memory in the lower (protected) region of IPA a fresh page is
> provided to the RMM which will zero the contents. For memory in the
> upper (shared) region of IPA, the memory from the memslot is mapped
> into the realm VM non secure.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
> Changes since v13:
>   * Numerous changes due to rebasing.
>   * Fix addr_range_desc() to encode the correct block size.
> Changes since v12:
>   * Switch to RMM v2.0 range based APIs.
> Changes since v11:
>   * Adapt to upstream changes.
> Changes since v10:
>   * RME->RMI renaming.
>   * Adapt to upstream gmem changes.
> Changes since v9:
>   * Fix call to kvm_stage2_unmap_range() in kvm_free_stage2_pgd() to set
>     may_block to avoid stall warnings.
>   * Minor coding style fixes.
> Changes since v8:
>   * Propagate the may_block flag.
>   * Minor comments and coding style changes.
> Changes since v7:
>   * Remove redundant WARN_ONs for realm_create_rtt_levels() - it will
>     internally WARN when necessary.
> Changes since v6:
>   * Handle PAGE_SIZE being larger than RMM granule size.
>   * Some minor renaming following review comments.
> Changes since v5:
>   * Reduce use of struct page in preparation for supporting the RMM
>     having a different page size to the host.
>   * Handle a race when delegating a page where another CPU has faulted on
>     a the same page (and already delegated the physical page) but not yet
>     mapped it. In this case simply return to the guest to either use the
>     mapping from the other CPU (or refault if the race is lost).
>   * The changes to populate_par_region() are moved into the previous
>     patch where they belong.
> Changes since v4:
>   * Code cleanup following review feedback.
>   * Drop the PTE_SHARED bit when creating unprotected page table entries.
>     This is now set by the RMM and the host has no control of it and the
>     spec requires the bit to be set to zero.
> Changes since v2:
>   * Avoid leaking memory if failing to map it in the realm.
>   * Correctly mask RTT based on LPA2 flag (see rtt_get_phys()).
>   * Adapt to changes in previous patches.
> ---
>   arch/arm64/include/asm/kvm_emulate.h |   8 ++
>   arch/arm64/include/asm/kvm_rmi.h     |  12 ++
>   arch/arm64/kvm/mmu.c                 | 128 ++++++++++++++++----
>   arch/arm64/kvm/rmi.c                 | 173 +++++++++++++++++++++++++++
>   4 files changed, 301 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
> index 2e69fe494716..8b6f9d26b5d8 100644
> --- a/arch/arm64/include/asm/kvm_emulate.h
> +++ b/arch/arm64/include/asm/kvm_emulate.h
> @@ -712,6 +712,14 @@ static inline bool kvm_realm_is_created(struct kvm *kvm)
>   	return kvm_is_realm(kvm) && kvm_realm_state(kvm) != REALM_STATE_NONE;
>   }
>   
> +static inline gpa_t kvm_gpa_from_fault(struct kvm *kvm, phys_addr_t ipa)
> +{
> +	if (!kvm_is_realm(kvm))
> +		return ipa;
> +
> +	return ipa & ~BIT(kvm->arch.realm.ia_bits - 1);
> +}
> +
>   static inline bool vcpu_is_rec(const struct kvm_vcpu *vcpu)
>   {
>   	return kvm_is_realm(vcpu->kvm);
> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/asm/kvm_rmi.h
> index a2b6bc412a22..b65cfec10dee 100644
> --- a/arch/arm64/include/asm/kvm_rmi.h
> +++ b/arch/arm64/include/asm/kvm_rmi.h
> @@ -6,6 +6,7 @@
>   #ifndef __ASM_KVM_RMI_H
>   #define __ASM_KVM_RMI_H
>   
> +#include <asm/kvm_pgtable.h>
>   #include <asm/rmi_smc.h>
>   
>   /**
> @@ -97,6 +98,17 @@ void kvm_realm_unmap_range(struct kvm *kvm,
>   			   unsigned long size,
>   			   bool unmap_private,
>   			   bool may_block);
> +int realm_map_protected(struct kvm *kvm,
> +			unsigned long base_ipa,
> +			kvm_pfn_t pfn,
> +			unsigned long size,
> +			struct kvm_mmu_memory_cache *memcache);
> +int realm_map_non_secure(struct realm *realm,
> +			 unsigned long ipa,
> +			 kvm_pfn_t pfn,
> +			 unsigned long size,
> +			 enum kvm_pgtable_prot prot,
> +			 struct kvm_mmu_memory_cache *memcache);
>   
>   static inline bool kvm_realm_is_private_address(struct realm *realm,
>   						unsigned long addr)
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index ac2a0f0106b0..776ffe56d17e 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -334,8 +334,15 @@ static void __unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64
>   
>   	lockdep_assert_held_write(&kvm->mmu_lock);
>   	WARN_ON(size & ~PAGE_MASK);
> -	WARN_ON(stage2_apply_range(mmu, start, end, KVM_PGT_FN(kvm_pgtable_stage2_unmap),
> -				   may_block));
> +
> +	if (kvm_is_realm(kvm)) {
> +		kvm_realm_unmap_range(kvm, start, size, !only_shared,
> +				      may_block);
> +	} else {
> +		WARN_ON(stage2_apply_range(mmu, start, end,
> +					   KVM_PGT_FN(kvm_pgtable_stage2_unmap),
> +					   may_block));
> +	}
>   }
>   
>   void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start,
> @@ -358,7 +365,10 @@ static void stage2_flush_memslot(struct kvm *kvm,
>   	phys_addr_t addr = memslot->base_gfn << PAGE_SHIFT;
>   	phys_addr_t end = addr + PAGE_SIZE * memslot->npages;
>   
> -	kvm_stage2_flush_range(&kvm->arch.mmu, addr, end);
> +	if (kvm_is_realm(kvm))
> +		kvm_realm_unmap_range(kvm, addr, end - addr, false, true);
> +	else
> +		kvm_stage2_flush_range(&kvm->arch.mmu, addr, end);
>   }
>   
>   /**
> @@ -1103,6 +1113,10 @@ void stage2_unmap_vm(struct kvm *kvm)
>   	struct kvm_memory_slot *memslot;
>   	int idx, bkt;
>   
> +	/* For realms this is handled by the RMM so nothing to do here */
> +	if (kvm_is_realm(kvm))
> +		return;
> +
>   	idx = srcu_read_lock(&kvm->srcu);
>   	mmap_read_lock(current->mm);
>   	write_lock(&kvm->mmu_lock);
> @@ -1528,6 +1542,29 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
>   	return vma->vm_flags & VM_MTE_ALLOWED;
>   }
>   
> +static int realm_map_ipa(struct kvm *kvm, phys_addr_t ipa,
> +			 kvm_pfn_t pfn, unsigned long map_size,
> +			 enum kvm_pgtable_prot prot,
> +			 struct kvm_mmu_memory_cache *memcache)
> +{
> +	struct realm *realm = &kvm->arch.realm;
> +
> +	/*
> +	 * Write permission is required for now even though it's possible to
> +	 * map unprotected pages (granules) as read-only. It's impossible to
> +	 * map protected pages (granules) as read-only.
> +	 */
> +	if (WARN_ON(!(prot & KVM_PGTABLE_PROT_W)))
> +		return -EFAULT;
> +

I'm a bit concerned about this. KVM_PGTABLE_PROT_W isn't set in @prot if the
stage-2 fault was raised by a memory read. With -EFAULT returned to the VMM
(e.g. QEMU), the vCPU stops executing and the system won't work any more.
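
A user-space toy model of the concern, as a sketch only. The toy_* names
and flag values are made up, and toy_fault_prot() simply encodes my
assumption above that a read fault arrives without the write permission
bit; it is not the kernel's actual prot derivation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy flag values; enum kvm_pgtable_prot uses its own encoding. */
enum toy_prot {
	TOY_PROT_R = 1 << 0,
	TOY_PROT_W = 1 << 1,
};

/* Assumption: only a write fault ends up with the write bit in prot. */
unsigned int toy_fault_prot(bool write_fault)
{
	unsigned int prot = TOY_PROT_R;

	if (write_fault)
		prot |= TOY_PROT_W;
	return prot;
}

/* Mirrors the check under discussion, with the WARN_ON() left out. */
int toy_realm_map_ipa(unsigned int prot)
{
	assert(prot & TOY_PROT_R); /* every toy fault is at least readable */

	if (!(prot & TOY_PROT_W))
		return -EFAULT;
	return 0;
}
```

Under that assumption, toy_realm_map_ipa(toy_fault_prot(false)) returns
-EFAULT, i.e. a plain read fault would be reported to the VMM as a fatal
error.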

> +	ipa = ALIGN_DOWN(ipa, PAGE_SIZE);
> +	if (!kvm_realm_is_private_address(realm, ipa))
> +		return realm_map_non_secure(realm, ipa, pfn, map_size, prot,
> +					    memcache);
> +
> +	return realm_map_protected(kvm, ipa, pfn, map_size, memcache);
> +}
> +
>   static bool kvm_vma_is_cacheable(struct vm_area_struct *vma)
>   {
>   	switch (FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(vma->vm_page_prot))) {
> @@ -1604,27 +1641,52 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
>   	bool write_fault, exec_fault;
>   	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
>   	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> -	struct kvm_pgtable *pgt = s2fd->vcpu->arch.hw_mmu->pgt;
> +	struct kvm_vcpu *vcpu = s2fd->vcpu;
> +	struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> +	gpa_t gpa = kvm_gpa_from_fault(vcpu->kvm, s2fd->fault_ipa);
>   	unsigned long mmu_seq;
>   	struct page *page;
> -	struct kvm *kvm = s2fd->vcpu->kvm;
> +	struct kvm *kvm = vcpu->kvm;
>   	void *memcache;
>   	kvm_pfn_t pfn;
>   	gfn_t gfn;
>   	int ret;
>   
> -	memcache = get_mmu_memcache(s2fd->vcpu);
> -	ret = topup_mmu_memcache(s2fd->vcpu, memcache);
> +	if (kvm_is_realm(vcpu->kvm)) {
> +		/* check for memory attribute mismatch */
> +		bool is_priv_gfn = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
> +		/*
> +		 * For Realms, the shared address is an alias of the private
> +		 * PA with the top bit set. Thus if the fault address matches
> +		 * the GPA then it is the private alias.
> +		 */
> +		bool is_priv_fault = (gpa == s2fd->fault_ipa);
> +
> +		if (is_priv_gfn != is_priv_fault) {
> +			kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
> +						      kvm_is_write_fault(vcpu),
> +						      false,
> +						      is_priv_fault);
> +			/*
> +			 * KVM_EXIT_MEMORY_FAULT requires an return code of
> +			 * -EFAULT, see the API documentation
> +			 */
> +			return -EFAULT;
> +		}
> +	}
> +
> +	memcache = get_mmu_memcache(vcpu);
> +	ret = topup_mmu_memcache(vcpu, memcache);
>   	if (ret)
>   		return ret;
>   
>   	if (s2fd->nested)
>   		gfn = kvm_s2_trans_output(s2fd->nested) >> PAGE_SHIFT;
>   	else
> -		gfn = s2fd->fault_ipa >> PAGE_SHIFT;
> +		gfn = gpa >> PAGE_SHIFT;
>   
> -	write_fault = kvm_is_write_fault(s2fd->vcpu);
> -	exec_fault = kvm_vcpu_trap_is_exec_fault(s2fd->vcpu);
> +	write_fault = kvm_is_write_fault(vcpu);
> +	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
>   
>   	VM_WARN_ON_ONCE(write_fault && exec_fault);
>   
> @@ -1634,7 +1696,7 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
>   
>   	ret = kvm_gmem_get_pfn(kvm, s2fd->memslot, gfn, &pfn, &page, NULL);
>   	if (ret) {
> -		kvm_prepare_memory_fault_exit(s2fd->vcpu, s2fd->fault_ipa, PAGE_SIZE,
> +		kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
>   					      write_fault, exec_fault, false);
>   		return ret;
>   	}
> @@ -1654,14 +1716,20 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
>   	kvm_fault_lock(kvm);
>   	if (mmu_invalidate_retry(kvm, mmu_seq)) {
>   		ret = -EAGAIN;
> -		goto out_unlock;
> +		goto out_release_page;
> +	}
> +
> +	if (kvm_is_realm(kvm)) {
> +		ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn,
> +				    PAGE_SIZE, KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W, memcache);
> +		goto out_release_page;
>   	}
>   
>   	ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa, PAGE_SIZE,
>   						 __pfn_to_phys(pfn), prot,
>   						 memcache, flags);
>   
> -out_unlock:
> +out_release_page:
>   	kvm_release_faultin_page(kvm, page, !!ret, prot & KVM_PGTABLE_PROT_W);
>   	kvm_fault_unlock(kvm);
>   
> @@ -1847,7 +1915,7 @@ static int kvm_s2_fault_get_vma_info(const struct kvm_s2_fault_desc *s2fd,
>   	 * mapping size to ensure we find the right PFN and lay down the
>   	 * mapping in the right place.
>   	 */
> -	s2vi->gfn = ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize) >> PAGE_SHIFT;
> +	s2vi->gfn = kvm_gpa_from_fault(kvm, ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize)) >> PAGE_SHIFT;
>   
>   	s2vi->mte_allowed = kvm_vma_mte_allowed(vma);
>   
> @@ -2056,6 +2124,9 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd,
>   		prot &= ~KVM_NV_GUEST_MAP_SZ;
>   		ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, gfn_to_gpa(gfn),
>   								 prot, flags);
> +	} else if (kvm_is_realm(kvm)) {
> +		ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn, mapping_size,
> +				    prot, memcache);
>   	} else {
>   		ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, gfn_to_gpa(gfn), mapping_size,
>   							 __pfn_to_phys(pfn), prot,

For the kvm_is_realm() case, do we need to adjust 's2fd->fault_ipa' for the
sake of huge pages? In kvm_s2_fault_map(), @gfn and @pfn may have been
adjusted by transparent_hugepage_adjust() to be aligned to the huge page
size. If that adjustment happened, s2fd->fault_ipa needs to be aligned down
to the huge page size as well.
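
A small worked example of the alignment issue, in user-space C. The
toy_* helpers and the sample address are made up for illustration; the
precondition modelled by toy_map_args_ok() is the
IS_ALIGNED(map_size, PAGE_SIZE) / IS_ALIGNED(ipa, map_size) check that
realm_map_protected() and realm_map_non_secure() perform in this patch,
assuming 4KB pages and a 2MB huge page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SIZE	UINT64_C(0x1000)	/* 4KB, assumed page size */
#define TOY_PMD_SIZE	UINT64_C(0x200000)	/* 2MB, assumed huge page size */

/* Round addr down to a multiple of size; size must be a power of two. */
uint64_t toy_align_down(uint64_t addr, uint64_t size)
{
	assert(size && !(size & (size - 1)));
	return addr & ~(size - 1);
}

bool toy_is_aligned(uint64_t addr, uint64_t size)
{
	return toy_align_down(addr, size) == addr;
}

/* Models the alignment precondition of the realm mapping helpers. */
bool toy_map_args_ok(uint64_t ipa, uint64_t map_size)
{
	return toy_is_aligned(map_size, TOY_PAGE_SIZE) &&
	       toy_is_aligned(ipa, map_size);
}
```

For a fault at IPA 0x40234000 with a 2MB mapping size, aligning down to
PAGE_SIZE only (as realm_map_ipa() does) leaves 0x40234000, which fails
the precondition; aligning down to the mapping size gives 0x40200000,
which passes.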


> @@ -2214,6 +2285,13 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
>   	return 0;
>   }
>   
> +static bool shared_ipa_fault(struct kvm *kvm, phys_addr_t fault_ipa)
> +{
> +	gpa_t gpa = kvm_gpa_from_fault(kvm, fault_ipa);
> +
> +	return (gpa != fault_ipa);
> +}
> +
>   /**
>    * kvm_handle_guest_abort - handles all 2nd stage aborts
>    * @vcpu:	the VCPU pointer
> @@ -2324,8 +2402,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>   		nested = &nested_trans;
>   	}
>   
> -	gfn = ipa >> PAGE_SHIFT;
> +	gfn = kvm_gpa_from_fault(vcpu->kvm, ipa) >> PAGE_SHIFT;
>   	memslot = gfn_to_memslot(vcpu->kvm, gfn);
> +
>   	hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
>   	write_fault = kvm_is_write_fault(vcpu);
>   	if (kvm_is_error_hva(hva) || (write_fault && !writable)) {
> @@ -2368,7 +2447,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>   		 * of the page size.
>   		 */
>   		ipa |= FAR_TO_FIPA_OFFSET(kvm_vcpu_get_hfar(vcpu));
> -		ret = io_mem_abort(vcpu, ipa);
> +		ret = io_mem_abort(vcpu, kvm_gpa_from_fault(vcpu->kvm, ipa));
>   		goto out_unlock;
>   	}
>   
> @@ -2396,7 +2475,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>   				!write_fault &&
>   				!kvm_vcpu_trap_is_exec_fault(vcpu));
>   
> -		if (kvm_slot_has_gmem(memslot))
> +		if (kvm_slot_has_gmem(memslot) && !shared_ipa_fault(vcpu->kvm, fault_ipa))
>   			ret = gmem_abort(&s2fd);
>   		else
>   			ret = user_mem_abort(&s2fd);
> @@ -2433,6 +2512,10 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>   	if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
>   		return false;
>   
> +	/* We don't support aging for Realms */
> +	if (kvm_is_realm(kvm))
> +		return true;
> +
>   	return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
>   						   range->start << PAGE_SHIFT,
>   						   size, true);
> @@ -2449,6 +2532,10 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>   	if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
>   		return false;
>   
> +	/* We don't support aging for Realms */
> +	if (kvm_is_realm(kvm))
> +		return true;
> +
>   	return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
>   						   range->start << PAGE_SHIFT,
>   						   size, false);
> @@ -2628,10 +2715,11 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>   		return -EFAULT;
>   
>   	/*
> -	 * Only support guest_memfd backed memslots with mappable memory, since
> -	 * there aren't any CoCo VMs that support only private memory on arm64.
> +	 * Only support guest_memfd backed memslots with mappable memory,
> +	 * unless the guest is a CCA realm guest.
>   	 */
> -	if (kvm_slot_has_gmem(new) && !kvm_memslot_is_gmem_only(new))
> +	if (kvm_slot_has_gmem(new) && !kvm_memslot_is_gmem_only(new) &&
> +	    !kvm_is_realm(kvm))
>   		return -EINVAL;
>   
>   	hva = new->userspace_addr;
> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
> index cae29fd3353c..761b38a4071c 100644
> --- a/arch/arm64/kvm/rmi.c
> +++ b/arch/arm64/kvm/rmi.c
> @@ -597,6 +597,179 @@ static int realm_data_map_init(struct kvm *kvm, unsigned long ipa,
>   	return ret;
>   }
>   
> +static unsigned long addr_range_desc(unsigned long phys, unsigned long size)
> +{
> +	unsigned long out = 0;
> +
> +	switch (size) {
> +	case P4D_SIZE:
> +		out = 3 | (1 << 2);
> +		break;
> +	case PUD_SIZE:
> +		out = 2 | (1 << 2);
> +		break;
> +	case PMD_SIZE:
> +		out = 1 | (1 << 2);
> +		break;
> +	case PAGE_SIZE:
> +		out = 0 | (1 << 2);
> +		break;
> +	default:
> +		/*
> +		 * Only support mapping at the page level granulatity when
> +		 * it's an unusual length. This should get us back onto a larger
> +		 * block size for the subsequent mappings.
> +		 */
> +		out = 0 | ((MIN(size >> PAGE_SHIFT, PTRS_PER_PTE - 1)) << 2);
> +		break;
> +	}
> +
> +	WARN_ON(phys & ~PAGE_MASK);
> +
> +	out |= phys & PAGE_MASK;
> +
> +	return out;
> +}
> +
> +int realm_map_protected(struct kvm *kvm,
> +			unsigned long ipa,
> +			kvm_pfn_t pfn,
> +			unsigned long map_size,
> +			struct kvm_mmu_memory_cache *memcache)
> +{
> +	struct realm *realm = &kvm->arch.realm;
> +	phys_addr_t phys = __pfn_to_phys(pfn);
> +	phys_addr_t base_phys = phys;
> +	phys_addr_t rd = virt_to_phys(realm->rd);
> +	unsigned long base_ipa = ipa;
> +	unsigned long ipa_top = ipa + map_size;
> +	int ret = 0;
> +
> +	if (WARN_ON(!IS_ALIGNED(map_size, PAGE_SIZE) ||
> +		    !IS_ALIGNED(ipa, map_size)))
> +		return -EINVAL;
> +
> +	if (rmi_delegate_range(phys, map_size)) {
> +		/*
> +		 * It's likely we raced with another VCPU on the same
> +		 * fault. Assume the other VCPU has handled the fault
> +		 * and return to the guest.
> +		 */
> +		return 0;
> +	}
> +
> +	while (ipa < ipa_top) {
> +		unsigned long flags = RMI_ADDR_TYPE_SINGLE;
> +		unsigned long range_desc = addr_range_desc(phys, ipa_top - ipa);
> +		unsigned long out_top;
> +
> +		ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags, range_desc,
> +				       &out_top);
> +
> +		if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
> +			/* Create missing RTTs and retry */
> +			int level = RMI_RETURN_INDEX(ret);
> +
> +			WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
> +			ret = realm_create_rtt_levels(realm, ipa, level,
> +						      KVM_PGTABLE_LAST_LEVEL,
> +						      memcache);
> +			if (ret)
> +				goto err_undelegate;
> +
> +			ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags,
> +					       range_desc, &out_top);
> +		}
> +
> +		if (WARN_ON(ret))
> +			goto err_undelegate;
> +
> +		phys += out_top - ipa;
> +		ipa = out_top;
> +	}
> +
> +	return 0;
> +
> +err_undelegate:
> +	realm_unmap_private_range(kvm, base_ipa, ipa, true);
> +	if (WARN_ON(rmi_undelegate_range(base_phys, map_size))) {
> +		/* Page can't be returned to NS world so is lost */
> +		get_page(phys_to_page(base_phys));
> +	}
> +	return -ENXIO;
> +}
> +
> +int realm_map_non_secure(struct realm *realm,
> +			 unsigned long ipa,
> +			 kvm_pfn_t pfn,
> +			 unsigned long size,
> +			 enum kvm_pgtable_prot prot,
> +			 struct kvm_mmu_memory_cache *memcache)
> +{
> +	unsigned long attr, flags = 0;
> +	phys_addr_t rd = virt_to_phys(realm->rd);
> +	phys_addr_t phys = __pfn_to_phys(pfn);
> +	unsigned long ipa_top = ipa + size;
> +	int ret;
> +
> +	if (WARN_ON(!IS_ALIGNED(size, PAGE_SIZE) ||
> +		    !IS_ALIGNED(ipa, size)))
> +		return -EINVAL;
> +
> +	switch (prot & (KVM_PGTABLE_PROT_DEVICE | KVM_PGTABLE_PROT_NORMAL_NC)) {
> +	case KVM_PGTABLE_PROT_DEVICE | KVM_PGTABLE_PROT_NORMAL_NC:
> +		return -EINVAL;
> +	case KVM_PGTABLE_PROT_DEVICE:
> +		attr = MT_S2_FWB_DEVICE_nGnRE;
> +		break;
> +	case KVM_PGTABLE_PROT_NORMAL_NC:
> +		attr = MT_S2_FWB_NORMAL_NC;
> +		break;
> +	default:
> +		attr = MT_S2_FWB_NORMAL;
> +	}
> +
> +	flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_MEMATTR, attr);
> +
> +	if (prot & KVM_PGTABLE_PROT_R)
> +		flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_S2AP, RMI_S2AP_DIRECT_READ);
> +	if (prot & KVM_PGTABLE_PROT_W)
> +		flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_S2AP, RMI_S2AP_DIRECT_WRITE);
> +
> +	flags |= RMI_ADDR_TYPE_SINGLE;
> +
> +	while (ipa < ipa_top) {
> +		unsigned long range_desc = addr_range_desc(phys, ipa_top - ipa);
> +		unsigned long out_top;
> +
> +		ret = rmi_rtt_unprot_map(rd, ipa, ipa_top, flags, range_desc,
> +					 &out_top);
> +
> +		if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
> +			/* Create missing RTTs and retry */
> +			int level = RMI_RETURN_INDEX(ret);
> +
> +			WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
> +			ret = realm_create_rtt_levels(realm, ipa, level,
> +						      KVM_PGTABLE_LAST_LEVEL,
> +						      memcache);
> +			if (ret)
> +				return ret;
> +
> +			ret = rmi_rtt_unprot_map(rd, ipa, ipa_top, flags,
> +						 range_desc, &out_top);
> +		}
> +
> +		if (WARN_ON(ret))
> +			return ret;
> +
> +		phys += out_top - ipa;
> +		ipa = out_top;
> +	}
> +
> +	return 0;
> +}
> +
>   static int populate_region_cb(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>   			      struct page *src_page, void *opaque)
>   {

Thanks,
Gavin



* Re: [PATCH v4 1/3] x86/tdx: Fix off-by-one in port I/O handling
From: Binbin Wu @ 2026-06-05  7:08 UTC (permalink / raw)
  To: Kiryl Shutsemau (Meta)
  Cc: tglx, mingo, bp, dave.hansen, seanjc, pbonzini,
	sathyanarayanan.kuppuswamy, kai.huang, xiaoyao.li,
	rick.p.edgecombe, david.laight.linux, ak, djbw, tsyrulnikov.borys,
	x86, kvm, linux-coco, linux-kernel, stable
In-Reply-To: <e5a75bb68a6a778c95cac2ef77acd55cfd24d389.1780584300.git.kas@kernel.org>



On 6/4/2026 10:46 PM, Kiryl Shutsemau (Meta) wrote:
> handle_in() and handle_out() in arch/x86/coco/tdx/tdx.c use:
> 
>     u64 mask = GENMASK(BITS_PER_BYTE * size, 0);
> 
> GENMASK(h, l) includes bit h. For size=1 (INB), this produces
> GENMASK(8, 0) = 0x1FF (9 bits) instead of GENMASK(7, 0) = 0xFF (8
> bits). The mask is one bit too wide for all I/O sizes.
> 
> Fix the mask calculation.
> 
> Fixes: 03149948832a ("x86/tdx: Port I/O: Add runtime hypercalls")
> Reported-by: Borys Tsyrulnikov <tsyrulnikov.borys@gmail.com>
> Link: https://lore.kernel.org/all/CAKw_Dz96rfSQc6Rn+9QBcUFHhmkK+9zu+P=bxowfZwxrATCBRg@mail.gmail.com/
> Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Cc: stable@vger.kernel.org

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
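
For reference, the off-by-one described above can be reproduced outside
the kernel. The following is only a user-space sketch: genmask() is a
local stand-in for the kernel's GENMASK(h, l), and old_mask() and
new_mask() are hypothetical helper names that mirror the expressions
before and after the fix.

```c
#include <stdint.h>

#define BITS_PER_BYTE 8

/*
 * User-space stand-in for the kernel's GENMASK(h, l): sets bits l..h,
 * inclusive of bit h. Valid for 0 <= l <= h <= 63.
 */
static uint64_t genmask(unsigned int h, unsigned int l)
{
	return (~UINT64_C(0) >> (63 - h)) & (~UINT64_C(0) << l);
}

/* Expression used before the fix: one bit too wide. */
static uint64_t old_mask(int size)
{
	return genmask(BITS_PER_BYTE * size, 0);
}

/* Expression used after the fix: exactly size * 8 bits. */
static uint64_t new_mask(int size)
{
	return genmask(BITS_PER_BYTE * size - 1, 0);
}
```

For size = 1, 2 and 4, old_mask() covers 9, 17 and 33 bits, while
new_mask() covers 8, 16 and 32 bits.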

> ---
>  arch/x86/coco/tdx/tdx.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 186915a17c50..65119362f9a2 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -693,7 +693,7 @@ static bool handle_in(struct pt_regs *regs, int size, int port)
>  		.r13 = PORT_READ,
>  		.r14 = port,
>  	};
> -	u64 mask = GENMASK(BITS_PER_BYTE * size, 0);
> +	u64 mask = GENMASK(BITS_PER_BYTE * size - 1, 0);
>  	bool success;
>  
>  	/*
> @@ -713,7 +713,7 @@ static bool handle_in(struct pt_regs *regs, int size, int port)
>  
>  static bool handle_out(struct pt_regs *regs, int size, int port)
>  {
> -	u64 mask = GENMASK(BITS_PER_BYTE * size, 0);
> +	u64 mask = GENMASK(BITS_PER_BYTE * size - 1, 0);
>  
>  	/*
>  	 * Emulate the I/O write via hypercall. More info about ABI can be found



* Re: [PATCH v4 3/3] x86/tdx: Fix zero-extension for 32-bit port I/O
From: Binbin Wu @ 2026-06-05  7:10 UTC (permalink / raw)
  To: Kiryl Shutsemau (Meta)
  Cc: tglx, mingo, bp, dave.hansen, seanjc, pbonzini,
	sathyanarayanan.kuppuswamy, kai.huang, xiaoyao.li,
	rick.p.edgecombe, david.laight.linux, ak, djbw, tsyrulnikov.borys,
	x86, kvm, linux-coco, linux-kernel, stable
In-Reply-To: <ca503ae3de72d90956fcaf5dbc0760ec20f5a5e0.1780584300.git.kas@kernel.org>



On 6/4/2026 10:47 PM, Kiryl Shutsemau (Meta) wrote:
> According to x86 architecture rules, 32-bit operations zero-extend the
> result to 64 bits. The current implementation of handle_in() only masks
> the lower 32 bits, which preserves the upper 32 bits of RAX when a
> 32-bit port IN instruction is emulated.
> 
> Use insn_assign_reg() to write the result back into RAX with proper
> partial-register-write semantics: 1- and 2-byte forms leave the upper
> bits untouched, the 4-byte form zero-extends to the full register.
> 
> Fixes: 03149948832a ("x86/tdx: Port I/O: Add runtime hypercalls")
> Reported-by: Borys Tsyrulnikov <tsyrulnikov.borys@gmail.com>
> Link: https://lore.kernel.org/all/CAKw_Dz96rfSQc6Rn+9QBcUFHhmkK+9zu+P=bxowfZwxrATCBRg@mail.gmail.com/
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> Cc: stable@vger.kernel.org

I think the concern sashiko raised on patch 2 is valid.

But for this patch itself,
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
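
The write-back rule from the commit message can be modelled in user
space as below. This is only a sketch under stated assumptions:
model_assign_reg() is a hypothetical name, it returns the new register
value instead of writing through a pointer, and it is not the kernel's
insn_assign_reg(), whose implementation is not shown in this patch.

```c
#include <stdint.h>

/*
 * Hypothetical model of x86 partial-register-write semantics:
 *   1- and 2-byte writes leave the upper bits of the register untouched,
 *   a 4-byte write zero-extends to the full 64-bit register,
 *   an 8-byte write replaces the whole register.
 */
static uint64_t model_assign_reg(uint64_t reg, uint64_t val, int size)
{
	switch (size) {
	case 1:
		return (reg & ~UINT64_C(0xff)) | (val & UINT64_C(0xff));
	case 2:
		return (reg & ~UINT64_C(0xffff)) | (val & UINT64_C(0xffff));
	case 4:
		/* Upper 32 bits are cleared, not preserved. */
		return val & UINT64_C(0xffffffff);
	default:
		return val;
	}
}
```

With val = 0 on hypercall failure, the model clears only the low byte
or word for the 1- and 2-byte forms and the whole register for the
4-byte form.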

> ---
>  arch/x86/coco/tdx/tdx.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 65119362f9a2..41cc23cc63dd 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -693,8 +693,8 @@ static bool handle_in(struct pt_regs *regs, int size, int port)
>  		.r13 = PORT_READ,
>  		.r14 = port,
>  	};
> -	u64 mask = GENMASK(BITS_PER_BYTE * size - 1, 0);
>  	bool success;
> +	u64 val;
>  
>  	/*
>  	 * Emulate the I/O read via hypercall. More info about ABI can be found
> @@ -702,11 +702,9 @@ static bool handle_in(struct pt_regs *regs, int size, int port)
>  	 * "TDG.VP.VMCALL<Instruction.IO>".
>  	 */
>  	success = !__tdx_hypercall(&args);
> +	val = success ? args.r11 : 0;
>  
> -	/* Update part of the register affected by the emulated instruction */
> -	regs->ax &= ~mask;
> -	if (success)
> -		regs->ax |= args.r11 & mask;
> +	insn_assign_reg(&regs->ax, val, size);
>  
>  	return success;
>  }



* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Lorenzo Pieralisi @ 2026-06-05  7:28 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Steven Price, kvm, kvmarm, Catalin Marinas, Marc Zyngier,
	Will Deacon, James Morse, Oliver Upton, Suzuki K Poulose,
	Zenghui Yu, linux-arm-kernel, linux-kernel, Joey Gouly,
	Alexandru Elisei, Christoffer Dall, Fuad Tabba, linux-coco,
	Ganapatrao Kulkarni, Shanker Donthineni, Alper Gun,
	Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve, WeiLin.Chang,
	Lorenzo.Pieralisi2
In-Reply-To: <3359f788-07fa-41a1-9ac7-45c58577c1fa@redhat.com>

On Fri, Jun 05, 2026 at 04:23:15PM +1000, Gavin Shan wrote:

[...]

> > +static int realm_map_ipa(struct kvm *kvm, phys_addr_t ipa,
> > +			 kvm_pfn_t pfn, unsigned long map_size,
> > +			 enum kvm_pgtable_prot prot,
> > +			 struct kvm_mmu_memory_cache *memcache)
> > +{
> > +	struct realm *realm = &kvm->arch.realm;
> > +
> > +	/*
> > +	 * Write permission is required for now even though it's possible to
> > +	 * map unprotected pages (granules) as read-only. It's impossible to
> > +	 * map protected pages (granules) as read-only.
> > +	 */
> > +	if (WARN_ON(!(prot & KVM_PGTABLE_PROT_W)))
> > +		return -EFAULT;
> > +
> 
> I'm a bit concerned with this. We don't have KVM_PGTABLE_PROT_W set in @prot
> if the stage2 fault is raised due to memory read. With -EFAULT returned to VMM
> (e.g. QEMU), the vCPU continuous execution is stopped and system won't be
> working any more.
> 
> > +	ipa = ALIGN_DOWN(ipa, PAGE_SIZE);
> > +	if (!kvm_realm_is_private_address(realm, ipa))
> > +		return realm_map_non_secure(realm, ipa, pfn, map_size, prot,
> > +					    memcache);
> > +
> > +	return realm_map_protected(kvm, ipa, pfn, map_size, memcache);
> > +}
> > +
> >   static bool kvm_vma_is_cacheable(struct vm_area_struct *vma)
> >   {
> >   	switch (FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(vma->vm_page_prot))) {
> > @@ -1604,27 +1641,52 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
> >   	bool write_fault, exec_fault;
> >   	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
> >   	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> > -	struct kvm_pgtable *pgt = s2fd->vcpu->arch.hw_mmu->pgt;
> > +	struct kvm_vcpu *vcpu = s2fd->vcpu;
> > +	struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> > +	gpa_t gpa = kvm_gpa_from_fault(vcpu->kvm, s2fd->fault_ipa);
> >   	unsigned long mmu_seq;
> >   	struct page *page;
> > -	struct kvm *kvm = s2fd->vcpu->kvm;
> > +	struct kvm *kvm = vcpu->kvm;
> >   	void *memcache;
> >   	kvm_pfn_t pfn;
> >   	gfn_t gfn;
> >   	int ret;
> > -	memcache = get_mmu_memcache(s2fd->vcpu);
> > -	ret = topup_mmu_memcache(s2fd->vcpu, memcache);
> > +	if (kvm_is_realm(vcpu->kvm)) {
> > +		/* check for memory attribute mismatch */
> > +		bool is_priv_gfn = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
> > +		/*
> > +		 * For Realms, the shared address is an alias of the private
> > +		 * PA with the top bit set. Thus if the fault address matches
> > +		 * the GPA then it is the private alias.
> > +		 */
> > +		bool is_priv_fault = (gpa == s2fd->fault_ipa);
> > +
> > +		if (is_priv_gfn != is_priv_fault) {
> > +			kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
> > +						      kvm_is_write_fault(vcpu),
> > +						      false,
> > +						      is_priv_fault);
> > +			/*
> > +			 * KVM_EXIT_MEMORY_FAULT requires an return code of
> > +			 * -EFAULT, see the API documentation
> > +			 */
> > +			return -EFAULT;
> > +		}
> > +	}
> > +
> > +	memcache = get_mmu_memcache(vcpu);
> > +	ret = topup_mmu_memcache(vcpu, memcache);
> >   	if (ret)
> >   		return ret;
> >   	if (s2fd->nested)
> >   		gfn = kvm_s2_trans_output(s2fd->nested) >> PAGE_SHIFT;
> >   	else
> > -		gfn = s2fd->fault_ipa >> PAGE_SHIFT;
> > +		gfn = gpa >> PAGE_SHIFT;
> > -	write_fault = kvm_is_write_fault(s2fd->vcpu);
> > -	exec_fault = kvm_vcpu_trap_is_exec_fault(s2fd->vcpu);
> > +	write_fault = kvm_is_write_fault(vcpu);
> > +	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> >   	VM_WARN_ON_ONCE(write_fault && exec_fault);
> > @@ -1634,7 +1696,7 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
> >   	ret = kvm_gmem_get_pfn(kvm, s2fd->memslot, gfn, &pfn, &page, NULL);
> >   	if (ret) {
> > -		kvm_prepare_memory_fault_exit(s2fd->vcpu, s2fd->fault_ipa, PAGE_SIZE,
> > +		kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
> >   					      write_fault, exec_fault, false);
> >   		return ret;
> >   	}
> > @@ -1654,14 +1716,20 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
> >   	kvm_fault_lock(kvm);
> >   	if (mmu_invalidate_retry(kvm, mmu_seq)) {
> >   		ret = -EAGAIN;
> > -		goto out_unlock;
> > +		goto out_release_page;
> > +	}
> > +
> > +	if (kvm_is_realm(kvm)) {
> > +		ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn,
> > +				    PAGE_SIZE, KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W, memcache);
> > +		goto out_release_page;
> >   	}
> >   	ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa, PAGE_SIZE,
> >   						 __pfn_to_phys(pfn), prot,
> >   						 memcache, flags);
> > -out_unlock:
> > +out_release_page:
> >   	kvm_release_faultin_page(kvm, page, !!ret, prot & KVM_PGTABLE_PROT_W);
> >   	kvm_fault_unlock(kvm);
> > @@ -1847,7 +1915,7 @@ static int kvm_s2_fault_get_vma_info(const struct kvm_s2_fault_desc *s2fd,
> >   	 * mapping size to ensure we find the right PFN and lay down the
> >   	 * mapping in the right place.
> >   	 */
> > -	s2vi->gfn = ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize) >> PAGE_SHIFT;
> > +	s2vi->gfn = kvm_gpa_from_fault(kvm, ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize)) >> PAGE_SHIFT;
> >   	s2vi->mte_allowed = kvm_vma_mte_allowed(vma);
> > @@ -2056,6 +2124,9 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd,
> >   		prot &= ~KVM_NV_GUEST_MAP_SZ;
> >   		ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, gfn_to_gpa(gfn),
> >   								 prot, flags);
> > +	} else if (kvm_is_realm(kvm)) {
> > +		ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn, mapping_size,
> > +				    prot, memcache);
> >   	} else {
> >   		ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, gfn_to_gpa(gfn), mapping_size,
> >   							 __pfn_to_phys(pfn), prot,
> 
> For the case kvm_is_realm(), need we adjust 's2fd->fault_ipa' for the sake of
> huge pages. In kvm_s2_fault_map(), @gfn and @pfn may have been adjusted by
> transparent_hugepage_adjust() to be aligned with huge page size. If the
> adjustment happened in transparent_hugepage_adjust(), we need to align
> s2fd->fault_ipa down to the huge page size either.

All of the above plus some RMM changes are needed to get the QEMU VMM
going with anonymous-page guest memory backing. I am currently testing
various configurations in the background.
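
To illustrate the huge-page alignment point quoted above: realm_map_ipa()
only aligns the IPA down to PAGE_SIZE, while realm_map_protected() and
realm_map_non_secure() require IS_ALIGNED(ipa, map_size). A user-space
sketch of the arithmetic, with hypothetical helper names (align_down()
and is_aligned() stand in for the kernel macros, and fault_base() is not
part of the series):

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for ALIGN_DOWN()/IS_ALIGNED(); 'a' must be a power of two. */
static uint64_t align_down(uint64_t x, uint64_t a)
{
	return x & ~(a - 1);
}

static bool is_aligned(uint64_t x, uint64_t a)
{
	return (x & (a - 1)) == 0;
}

/*
 * Hypothetical helper: the base IPA to hand to the mapping code when the
 * mapping size may have been raised to a huge-page size.
 */
static uint64_t fault_base(uint64_t fault_ipa, uint64_t mapping_size)
{
	return align_down(fault_ipa, mapping_size);
}
```

With a 2MB mapping size, a fault IPA of 0x40201000 is page aligned but
not 2MB aligned, so the IS_ALIGNED() check would trip unless the IPA is
first aligned down to 0x40200000.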

Thanks,
Lorenzo

> > @@ -2214,6 +2285,13 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> >   	return 0;
> >   }
> > +static bool shared_ipa_fault(struct kvm *kvm, phys_addr_t fault_ipa)
> > +{
> > +	gpa_t gpa = kvm_gpa_from_fault(kvm, fault_ipa);
> > +
> > +	return (gpa != fault_ipa);
> > +}
> > +
> >   /**
> >    * kvm_handle_guest_abort - handles all 2nd stage aborts
> >    * @vcpu:	the VCPU pointer
> > @@ -2324,8 +2402,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> >   		nested = &nested_trans;
> >   	}
> > -	gfn = ipa >> PAGE_SHIFT;
> > +	gfn = kvm_gpa_from_fault(vcpu->kvm, ipa) >> PAGE_SHIFT;
> >   	memslot = gfn_to_memslot(vcpu->kvm, gfn);
> > +
> >   	hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
> >   	write_fault = kvm_is_write_fault(vcpu);
> >   	if (kvm_is_error_hva(hva) || (write_fault && !writable)) {
> > @@ -2368,7 +2447,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> >   		 * of the page size.
> >   		 */
> >   		ipa |= FAR_TO_FIPA_OFFSET(kvm_vcpu_get_hfar(vcpu));
> > -		ret = io_mem_abort(vcpu, ipa);
> > +		ret = io_mem_abort(vcpu, kvm_gpa_from_fault(vcpu->kvm, ipa));
> >   		goto out_unlock;
> >   	}
> > @@ -2396,7 +2475,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> >   				!write_fault &&
> >   				!kvm_vcpu_trap_is_exec_fault(vcpu));
> > -		if (kvm_slot_has_gmem(memslot))
> > +		if (kvm_slot_has_gmem(memslot) && !shared_ipa_fault(vcpu->kvm, fault_ipa))
> >   			ret = gmem_abort(&s2fd);
> >   		else
> >   			ret = user_mem_abort(&s2fd);
> > @@ -2433,6 +2512,10 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> >   	if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
> >   		return false;
> > +	/* We don't support aging for Realms */
> > +	if (kvm_is_realm(kvm))
> > +		return true;
> > +
> >   	return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
> >   						   range->start << PAGE_SHIFT,
> >   						   size, true);
> > @@ -2449,6 +2532,10 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> >   	if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
> >   		return false;
> > +	/* We don't support aging for Realms */
> > +	if (kvm_is_realm(kvm))
> > +		return true;
> > +
> >   	return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
> >   						   range->start << PAGE_SHIFT,
> >   						   size, false);
> > @@ -2628,10 +2715,11 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
> >   		return -EFAULT;
> >   	/*
> > -	 * Only support guest_memfd backed memslots with mappable memory, since
> > -	 * there aren't any CoCo VMs that support only private memory on arm64.
> > +	 * Only support guest_memfd backed memslots with mappable memory,
> > +	 * unless the guest is a CCA realm guest.
> >   	 */
> > -	if (kvm_slot_has_gmem(new) && !kvm_memslot_is_gmem_only(new))
> > +	if (kvm_slot_has_gmem(new) && !kvm_memslot_is_gmem_only(new) &&
> > +	    !kvm_is_realm(kvm))
> >   		return -EINVAL;
> >   	hva = new->userspace_addr;
> > diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
> > index cae29fd3353c..761b38a4071c 100644
> > --- a/arch/arm64/kvm/rmi.c
> > +++ b/arch/arm64/kvm/rmi.c
> > @@ -597,6 +597,179 @@ static int realm_data_map_init(struct kvm *kvm, unsigned long ipa,
> >   	return ret;
> >   }
> > +static unsigned long addr_range_desc(unsigned long phys, unsigned long size)
> > +{
> > +	unsigned long out = 0;
> > +
> > +	switch (size) {
> > +	case P4D_SIZE:
> > +		out = 3 | (1 << 2);
> > +		break;
> > +	case PUD_SIZE:
> > +		out = 2 | (1 << 2);
> > +		break;
> > +	case PMD_SIZE:
> > +		out = 1 | (1 << 2);
> > +		break;
> > +	case PAGE_SIZE:
> > +		out = 0 | (1 << 2);
> > +		break;
> > +	default:
> > +		/*
> > +		 * Only support mapping at the page level granulatity when
> > +		 * it's an unusual length. This should get us back onto a larger
> > +		 * block size for the subsequent mappings.
> > +		 */
> > +		out = 0 | ((MIN(size >> PAGE_SHIFT, PTRS_PER_PTE - 1)) << 2);
> > +		break;
> > +	}
> > +
> > +	WARN_ON(phys & ~PAGE_MASK);
> > +
> > +	out |= phys & PAGE_MASK;
> > +
> > +	return out;
> > +}
> > +
> > +int realm_map_protected(struct kvm *kvm,
> > +			unsigned long ipa,
> > +			kvm_pfn_t pfn,
> > +			unsigned long map_size,
> > +			struct kvm_mmu_memory_cache *memcache)
> > +{
> > +	struct realm *realm = &kvm->arch.realm;
> > +	phys_addr_t phys = __pfn_to_phys(pfn);
> > +	phys_addr_t base_phys = phys;
> > +	phys_addr_t rd = virt_to_phys(realm->rd);
> > +	unsigned long base_ipa = ipa;
> > +	unsigned long ipa_top = ipa + map_size;
> > +	int ret = 0;
> > +
> > +	if (WARN_ON(!IS_ALIGNED(map_size, PAGE_SIZE) ||
> > +		    !IS_ALIGNED(ipa, map_size)))
> > +		return -EINVAL;
> > +
> > +	if (rmi_delegate_range(phys, map_size)) {
> > +		/*
> > +		 * It's likely we raced with another VCPU on the same
> > +		 * fault. Assume the other VCPU has handled the fault
> > +		 * and return to the guest.
> > +		 */
> > +		return 0;
> > +	}
> > +
> > +	while (ipa < ipa_top) {
> > +		unsigned long flags = RMI_ADDR_TYPE_SINGLE;
> > +		unsigned long range_desc = addr_range_desc(phys, ipa_top - ipa);
> > +		unsigned long out_top;
> > +
> > +		ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags, range_desc,
> > +				       &out_top);
> > +
> > +		if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
> > +			/* Create missing RTTs and retry */
> > +			int level = RMI_RETURN_INDEX(ret);
> > +
> > +			WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
> > +			ret = realm_create_rtt_levels(realm, ipa, level,
> > +						      KVM_PGTABLE_LAST_LEVEL,
> > +						      memcache);
> > +			if (ret)
> > +				goto err_undelegate;
> > +
> > +			ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags,
> > +					       range_desc, &out_top);
> > +		}
> > +
> > +		if (WARN_ON(ret))
> > +			goto err_undelegate;
> > +
> > +		phys += out_top - ipa;
> > +		ipa = out_top;
> > +	}
> > +
> > +	return 0;
> > +
> > +err_undelegate:
> > +	realm_unmap_private_range(kvm, base_ipa, ipa, true);
> > +	if (WARN_ON(rmi_undelegate_range(base_phys, map_size))) {
> > +		/* Page can't be returned to NS world so is lost */
> > +		get_page(phys_to_page(base_phys));
> > +	}
> > +	return -ENXIO;
> > +}
> > +
> > +int realm_map_non_secure(struct realm *realm,
> > +			 unsigned long ipa,
> > +			 kvm_pfn_t pfn,
> > +			 unsigned long size,
> > +			 enum kvm_pgtable_prot prot,
> > +			 struct kvm_mmu_memory_cache *memcache)
> > +{
> > +	unsigned long attr, flags = 0;
> > +	phys_addr_t rd = virt_to_phys(realm->rd);
> > +	phys_addr_t phys = __pfn_to_phys(pfn);
> > +	unsigned long ipa_top = ipa + size;
> > +	int ret;
> > +
> > +	if (WARN_ON(!IS_ALIGNED(size, PAGE_SIZE) ||
> > +		    !IS_ALIGNED(ipa, size)))
> > +		return -EINVAL;
> > +
> > +	switch (prot & (KVM_PGTABLE_PROT_DEVICE | KVM_PGTABLE_PROT_NORMAL_NC)) {
> > +	case KVM_PGTABLE_PROT_DEVICE | KVM_PGTABLE_PROT_NORMAL_NC:
> > +		return -EINVAL;
> > +	case KVM_PGTABLE_PROT_DEVICE:
> > +		attr = MT_S2_FWB_DEVICE_nGnRE;
> > +		break;
> > +	case KVM_PGTABLE_PROT_NORMAL_NC:
> > +		attr = MT_S2_FWB_NORMAL_NC;
> > +		break;
> > +	default:
> > +		attr = MT_S2_FWB_NORMAL;
> > +	}
> > +
> > +	flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_MEMATTR, attr);
> > +
> > +	if (prot & KVM_PGTABLE_PROT_R)
> > +		flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_S2AP, RMI_S2AP_DIRECT_READ);
> > +	if (prot & KVM_PGTABLE_PROT_W)
> > +		flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_S2AP, RMI_S2AP_DIRECT_WRITE);
> > +
> > +	flags |= RMI_ADDR_TYPE_SINGLE;
> > +
> > +	while (ipa < ipa_top) {
> > +		unsigned long range_desc = addr_range_desc(phys, ipa_top - ipa);
> > +		unsigned long out_top;
> > +
> > +		ret = rmi_rtt_unprot_map(rd, ipa, ipa_top, flags, range_desc,
> > +					 &out_top);
> > +
> > +		if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
> > +			/* Create missing RTTs and retry */
> > +			int level = RMI_RETURN_INDEX(ret);
> > +
> > +			WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
> > +			ret = realm_create_rtt_levels(realm, ipa, level,
> > +						      KVM_PGTABLE_LAST_LEVEL,
> > +						      memcache);
> > +			if (ret)
> > +				return ret;
> > +
> > +			ret = rmi_rtt_unprot_map(rd, ipa, ipa_top, flags,
> > +						 range_desc, &out_top);
> > +		}
> > +
> > +		if (WARN_ON(ret))
> > +			return ret;
> > +
> > +		phys += out_top - ipa;
> > +		ipa = out_top;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> >   static int populate_region_cb(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> >   			      struct page *src_page, void *opaque)
> >   {
> 
> Thanks,
> Gavin
> 


* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Gavin Shan @ 2026-06-05  8:11 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Steven Price, kvm, kvmarm, Catalin Marinas, Marc Zyngier,
	Will Deacon, James Morse, Oliver Upton, Suzuki K Poulose,
	Zenghui Yu, linux-arm-kernel, linux-kernel, Joey Gouly,
	Alexandru Elisei, Christoffer Dall, Fuad Tabba, linux-coco,
	Ganapatrao Kulkarni, Shanker Donthineni, Alper Gun,
	Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve, WeiLin.Chang,
	Lorenzo.Pieralisi2
In-Reply-To: <aiJ6u83O0nVUtPyv@lpieralisi>

On 6/5/26 5:28 PM, Lorenzo Pieralisi wrote:
> On Fri, Jun 05, 2026 at 04:23:15PM +1000, Gavin Shan wrote:
> 
> [...]
> 
>>> +static int realm_map_ipa(struct kvm *kvm, phys_addr_t ipa,
>>> +			 kvm_pfn_t pfn, unsigned long map_size,
>>> +			 enum kvm_pgtable_prot prot,
>>> +			 struct kvm_mmu_memory_cache *memcache)
>>> +{
>>> +	struct realm *realm = &kvm->arch.realm;
>>> +
>>> +	/*
>>> +	 * Write permission is required for now even though it's possible to
>>> +	 * map unprotected pages (granules) as read-only. It's impossible to
>>> +	 * map protected pages (granules) as read-only.
>>> +	 */
>>> +	if (WARN_ON(!(prot & KVM_PGTABLE_PROT_W)))
>>> +		return -EFAULT;
>>> +
>>
>> I'm a bit concerned with this. We don't have KVM_PGTABLE_PROT_W set in @prot
>> if the stage2 fault is raised due to memory read. With -EFAULT returned to VMM
>> (e.g. QEMU), the vCPU continuous execution is stopped and system won't be
>> working any more.
>>
>>> +	ipa = ALIGN_DOWN(ipa, PAGE_SIZE);
>>> +	if (!kvm_realm_is_private_address(realm, ipa))
>>> +		return realm_map_non_secure(realm, ipa, pfn, map_size, prot,
>>> +					    memcache);
>>> +
>>> +	return realm_map_protected(kvm, ipa, pfn, map_size, memcache);
>>> +}
>>> +
>>>    static bool kvm_vma_is_cacheable(struct vm_area_struct *vma)
>>>    {
>>>    	switch (FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(vma->vm_page_prot))) {
>>> @@ -1604,27 +1641,52 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
>>>    	bool write_fault, exec_fault;
>>>    	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
>>>    	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>>> -	struct kvm_pgtable *pgt = s2fd->vcpu->arch.hw_mmu->pgt;
>>> +	struct kvm_vcpu *vcpu = s2fd->vcpu;
>>> +	struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
>>> +	gpa_t gpa = kvm_gpa_from_fault(vcpu->kvm, s2fd->fault_ipa);
>>>    	unsigned long mmu_seq;
>>>    	struct page *page;
>>> -	struct kvm *kvm = s2fd->vcpu->kvm;
>>> +	struct kvm *kvm = vcpu->kvm;
>>>    	void *memcache;
>>>    	kvm_pfn_t pfn;
>>>    	gfn_t gfn;
>>>    	int ret;
>>> -	memcache = get_mmu_memcache(s2fd->vcpu);
>>> -	ret = topup_mmu_memcache(s2fd->vcpu, memcache);
>>> +	if (kvm_is_realm(vcpu->kvm)) {
>>> +		/* check for memory attribute mismatch */
>>> +		bool is_priv_gfn = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
>>> +		/*
>>> +		 * For Realms, the shared address is an alias of the private
>>> +		 * PA with the top bit set. Thus if the fault address matches
>>> +		 * the GPA then it is the private alias.
>>> +		 */
>>> +		bool is_priv_fault = (gpa == s2fd->fault_ipa);
>>> +
>>> +		if (is_priv_gfn != is_priv_fault) {
>>> +			kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
>>> +						      kvm_is_write_fault(vcpu),
>>> +						      false,
>>> +						      is_priv_fault);
>>> +			/*
>>> +			 * KVM_EXIT_MEMORY_FAULT requires an return code of
>>> +			 * -EFAULT, see the API documentation
>>> +			 */
>>> +			return -EFAULT;
>>> +		}
>>> +	}
>>> +
>>> +	memcache = get_mmu_memcache(vcpu);
>>> +	ret = topup_mmu_memcache(vcpu, memcache);
>>>    	if (ret)
>>>    		return ret;
>>>    	if (s2fd->nested)
>>>    		gfn = kvm_s2_trans_output(s2fd->nested) >> PAGE_SHIFT;
>>>    	else
>>> -		gfn = s2fd->fault_ipa >> PAGE_SHIFT;
>>> +		gfn = gpa >> PAGE_SHIFT;
>>> -	write_fault = kvm_is_write_fault(s2fd->vcpu);
>>> -	exec_fault = kvm_vcpu_trap_is_exec_fault(s2fd->vcpu);
>>> +	write_fault = kvm_is_write_fault(vcpu);
>>> +	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
>>>    	VM_WARN_ON_ONCE(write_fault && exec_fault);
>>> @@ -1634,7 +1696,7 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
>>>    	ret = kvm_gmem_get_pfn(kvm, s2fd->memslot, gfn, &pfn, &page, NULL);
>>>    	if (ret) {
>>> -		kvm_prepare_memory_fault_exit(s2fd->vcpu, s2fd->fault_ipa, PAGE_SIZE,
>>> +		kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
>>>    					      write_fault, exec_fault, false);
>>>    		return ret;
>>>    	}
>>> @@ -1654,14 +1716,20 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
>>>    	kvm_fault_lock(kvm);
>>>    	if (mmu_invalidate_retry(kvm, mmu_seq)) {
>>>    		ret = -EAGAIN;
>>> -		goto out_unlock;
>>> +		goto out_release_page;
>>> +	}
>>> +
>>> +	if (kvm_is_realm(kvm)) {
>>> +		ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn,
>>> +				    PAGE_SIZE, KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W, memcache);
>>> +		goto out_release_page;
>>>    	}
>>>    	ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa, PAGE_SIZE,
>>>    						 __pfn_to_phys(pfn), prot,
>>>    						 memcache, flags);
>>> -out_unlock:
>>> +out_release_page:
>>>    	kvm_release_faultin_page(kvm, page, !!ret, prot & KVM_PGTABLE_PROT_W);
>>>    	kvm_fault_unlock(kvm);
>>> @@ -1847,7 +1915,7 @@ static int kvm_s2_fault_get_vma_info(const struct kvm_s2_fault_desc *s2fd,
>>>    	 * mapping size to ensure we find the right PFN and lay down the
>>>    	 * mapping in the right place.
>>>    	 */
>>> -	s2vi->gfn = ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize) >> PAGE_SHIFT;
>>> +	s2vi->gfn = kvm_gpa_from_fault(kvm, ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize)) >> PAGE_SHIFT;
>>>    	s2vi->mte_allowed = kvm_vma_mte_allowed(vma);
>>> @@ -2056,6 +2124,9 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd,
>>>    		prot &= ~KVM_NV_GUEST_MAP_SZ;
>>>    		ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, gfn_to_gpa(gfn),
>>>    								 prot, flags);
>>> +	} else if (kvm_is_realm(kvm)) {
>>> +		ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn, mapping_size,
>>> +				    prot, memcache);
>>>    	} else {
>>>    		ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, gfn_to_gpa(gfn), mapping_size,
>>>    							 __pfn_to_phys(pfn), prot,
>>
>> For the case kvm_is_realm(), need we adjust 's2fd->fault_ipa' for the sake of
>> huge pages. In kvm_s2_fault_map(), @gfn and @pfn may have been adjusted by
>> transparent_hugepage_adjust() to be aligned with huge page size. If the
>> adjustment happened in transparent_hugepage_adjust(), we need to align
>> s2fd->fault_ipa down to the huge page size either.
> 
> All of the above + some RMM changes are needed to get QEmu VMM going
> with anon pages guest memory backing - currently testing various
> configurations in the background.
> 

I tried to rebase Jean's latest QEMU series [1] onto upstream QEMU and found
that memory slots backed by THP are broken. With THP disabled on the host and
the other fixes (mentioned in my previous replies) applied on top of this (v14)
series, I'm able to boot a realm guest with the rebased QEMU series [2], plus
more fixes on top.

[1] https://git.codelinaro.org/linaro/dcap/qemu.git  (branch: cca/latest)
[2] https://git.qemu.org/git/qemu.git                (branch: cca/gavin)

Lorenzo, are you saying that someone is working on QEMU support for Arm CCA?
If so, is there a QEMU repository that I could try?

Thanks,
Gavin

> Thanks,
> Lorenzo
> 
>>> @@ -2214,6 +2285,13 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
>>>    	return 0;
>>>    }
>>> +static bool shared_ipa_fault(struct kvm *kvm, phys_addr_t fault_ipa)
>>> +{
>>> +	gpa_t gpa = kvm_gpa_from_fault(kvm, fault_ipa);
>>> +
>>> +	return (gpa != fault_ipa);
>>> +}
>>> +
>>>    /**
>>>     * kvm_handle_guest_abort - handles all 2nd stage aborts
>>>     * @vcpu:	the VCPU pointer
>>> @@ -2324,8 +2402,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>>>    		nested = &nested_trans;
>>>    	}
>>> -	gfn = ipa >> PAGE_SHIFT;
>>> +	gfn = kvm_gpa_from_fault(vcpu->kvm, ipa) >> PAGE_SHIFT;
>>>    	memslot = gfn_to_memslot(vcpu->kvm, gfn);
>>> +
>>>    	hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
>>>    	write_fault = kvm_is_write_fault(vcpu);
>>>    	if (kvm_is_error_hva(hva) || (write_fault && !writable)) {
>>> @@ -2368,7 +2447,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>>>    		 * of the page size.
>>>    		 */
>>>    		ipa |= FAR_TO_FIPA_OFFSET(kvm_vcpu_get_hfar(vcpu));
>>> -		ret = io_mem_abort(vcpu, ipa);
>>> +		ret = io_mem_abort(vcpu, kvm_gpa_from_fault(vcpu->kvm, ipa));
>>>    		goto out_unlock;
>>>    	}
>>> @@ -2396,7 +2475,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>>>    				!write_fault &&
>>>    				!kvm_vcpu_trap_is_exec_fault(vcpu));
>>> -		if (kvm_slot_has_gmem(memslot))
>>> +		if (kvm_slot_has_gmem(memslot) && !shared_ipa_fault(vcpu->kvm, fault_ipa))
>>>    			ret = gmem_abort(&s2fd);
>>>    		else
>>>    			ret = user_mem_abort(&s2fd);
>>> @@ -2433,6 +2512,10 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>>>    	if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
>>>    		return false;
>>> +	/* We don't support aging for Realms */
>>> +	if (kvm_is_realm(kvm))
>>> +		return true;
>>> +
>>>    	return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
>>>    						   range->start << PAGE_SHIFT,
>>>    						   size, true);
>>> @@ -2449,6 +2532,10 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>>>    	if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
>>>    		return false;
>>> +	/* We don't support aging for Realms */
>>> +	if (kvm_is_realm(kvm))
>>> +		return true;
>>> +
>>>    	return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
>>>    						   range->start << PAGE_SHIFT,
>>>    						   size, false);
>>> @@ -2628,10 +2715,11 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>>>    		return -EFAULT;
>>>    	/*
>>> -	 * Only support guest_memfd backed memslots with mappable memory, since
>>> -	 * there aren't any CoCo VMs that support only private memory on arm64.
>>> +	 * Only support guest_memfd backed memslots with mappable memory,
>>> +	 * unless the guest is a CCA realm guest.
>>>    	 */
>>> -	if (kvm_slot_has_gmem(new) && !kvm_memslot_is_gmem_only(new))
>>> +	if (kvm_slot_has_gmem(new) && !kvm_memslot_is_gmem_only(new) &&
>>> +	    !kvm_is_realm(kvm))
>>>    		return -EINVAL;
>>>    	hva = new->userspace_addr;
>>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>>> index cae29fd3353c..761b38a4071c 100644
>>> --- a/arch/arm64/kvm/rmi.c
>>> +++ b/arch/arm64/kvm/rmi.c
>>> @@ -597,6 +597,179 @@ static int realm_data_map_init(struct kvm *kvm, unsigned long ipa,
>>>    	return ret;
>>>    }
>>> +static unsigned long addr_range_desc(unsigned long phys, unsigned long size)
>>> +{
>>> +	unsigned long out = 0;
>>> +
>>> +	switch (size) {
>>> +	case P4D_SIZE:
>>> +		out = 3 | (1 << 2);
>>> +		break;
>>> +	case PUD_SIZE:
>>> +		out = 2 | (1 << 2);
>>> +		break;
>>> +	case PMD_SIZE:
>>> +		out = 1 | (1 << 2);
>>> +		break;
>>> +	case PAGE_SIZE:
>>> +		out = 0 | (1 << 2);
>>> +		break;
>>> +	default:
>>> +		/*
>>> +		 * Only support mapping at the page level granulatity when
>>> +		 * it's an unusual length. This should get us back onto a larger
>>> +		 * block size for the subsequent mappings.
>>> +		 */
>>> +		out = 0 | ((MIN(size >> PAGE_SHIFT, PTRS_PER_PTE - 1)) << 2);
>>> +		break;
>>> +	}
>>> +
>>> +	WARN_ON(phys & ~PAGE_MASK);
>>> +
>>> +	out |= phys & PAGE_MASK;
>>> +
>>> +	return out;
>>> +}
>>> +
>>> +int realm_map_protected(struct kvm *kvm,
>>> +			unsigned long ipa,
>>> +			kvm_pfn_t pfn,
>>> +			unsigned long map_size,
>>> +			struct kvm_mmu_memory_cache *memcache)
>>> +{
>>> +	struct realm *realm = &kvm->arch.realm;
>>> +	phys_addr_t phys = __pfn_to_phys(pfn);
>>> +	phys_addr_t base_phys = phys;
>>> +	phys_addr_t rd = virt_to_phys(realm->rd);
>>> +	unsigned long base_ipa = ipa;
>>> +	unsigned long ipa_top = ipa + map_size;
>>> +	int ret = 0;
>>> +
>>> +	if (WARN_ON(!IS_ALIGNED(map_size, PAGE_SIZE) ||
>>> +		    !IS_ALIGNED(ipa, map_size)))
>>> +		return -EINVAL;
>>> +
>>> +	if (rmi_delegate_range(phys, map_size)) {
>>> +		/*
>>> +		 * It's likely we raced with another VCPU on the same
>>> +		 * fault. Assume the other VCPU has handled the fault
>>> +		 * and return to the guest.
>>> +		 */
>>> +		return 0;
>>> +	}
>>> +
>>> +	while (ipa < ipa_top) {
>>> +		unsigned long flags = RMI_ADDR_TYPE_SINGLE;
>>> +		unsigned long range_desc = addr_range_desc(phys, ipa_top - ipa);
>>> +		unsigned long out_top;
>>> +
>>> +		ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags, range_desc,
>>> +				       &out_top);
>>> +
>>> +		if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>>> +			/* Create missing RTTs and retry */
>>> +			int level = RMI_RETURN_INDEX(ret);
>>> +
>>> +			WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
>>> +			ret = realm_create_rtt_levels(realm, ipa, level,
>>> +						      KVM_PGTABLE_LAST_LEVEL,
>>> +						      memcache);
>>> +			if (ret)
>>> +				goto err_undelegate;
>>> +
>>> +			ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags,
>>> +					       range_desc, &out_top);
>>> +		}
>>> +
>>> +		if (WARN_ON(ret))
>>> +			goto err_undelegate;
>>> +
>>> +		phys += out_top - ipa;
>>> +		ipa = out_top;
>>> +	}
>>> +
>>> +	return 0;
>>> +
>>> +err_undelegate:
>>> +	realm_unmap_private_range(kvm, base_ipa, ipa, true);
>>> +	if (WARN_ON(rmi_undelegate_range(base_phys, map_size))) {
>>> +		/* Page can't be returned to NS world so is lost */
>>> +		get_page(phys_to_page(base_phys));
>>> +	}
>>> +	return -ENXIO;
>>> +}
>>> +
>>> +int realm_map_non_secure(struct realm *realm,
>>> +			 unsigned long ipa,
>>> +			 kvm_pfn_t pfn,
>>> +			 unsigned long size,
>>> +			 enum kvm_pgtable_prot prot,
>>> +			 struct kvm_mmu_memory_cache *memcache)
>>> +{
>>> +	unsigned long attr, flags = 0;
>>> +	phys_addr_t rd = virt_to_phys(realm->rd);
>>> +	phys_addr_t phys = __pfn_to_phys(pfn);
>>> +	unsigned long ipa_top = ipa + size;
>>> +	int ret;
>>> +
>>> +	if (WARN_ON(!IS_ALIGNED(size, PAGE_SIZE) ||
>>> +		    !IS_ALIGNED(ipa, size)))
>>> +		return -EINVAL;
>>> +
>>> +	switch (prot & (KVM_PGTABLE_PROT_DEVICE | KVM_PGTABLE_PROT_NORMAL_NC)) {
>>> +	case KVM_PGTABLE_PROT_DEVICE | KVM_PGTABLE_PROT_NORMAL_NC:
>>> +		return -EINVAL;
>>> +	case KVM_PGTABLE_PROT_DEVICE:
>>> +		attr = MT_S2_FWB_DEVICE_nGnRE;
>>> +		break;
>>> +	case KVM_PGTABLE_PROT_NORMAL_NC:
>>> +		attr = MT_S2_FWB_NORMAL_NC;
>>> +		break;
>>> +	default:
>>> +		attr = MT_S2_FWB_NORMAL;
>>> +	}
>>> +
>>> +	flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_MEMATTR, attr);
>>> +
>>> +	if (prot & KVM_PGTABLE_PROT_R)
>>> +		flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_S2AP, RMI_S2AP_DIRECT_READ);
>>> +	if (prot & KVM_PGTABLE_PROT_W)
>>> +		flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_S2AP, RMI_S2AP_DIRECT_WRITE);
>>> +
>>> +	flags |= RMI_ADDR_TYPE_SINGLE;
>>> +
>>> +	while (ipa < ipa_top) {
>>> +		unsigned long range_desc = addr_range_desc(phys, ipa_top - ipa);
>>> +		unsigned long out_top;
>>> +
>>> +		ret = rmi_rtt_unprot_map(rd, ipa, ipa_top, flags, range_desc,
>>> +					 &out_top);
>>> +
>>> +		if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>>> +			/* Create missing RTTs and retry */
>>> +			int level = RMI_RETURN_INDEX(ret);
>>> +
>>> +			WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
>>> +			ret = realm_create_rtt_levels(realm, ipa, level,
>>> +						      KVM_PGTABLE_LAST_LEVEL,
>>> +						      memcache);
>>> +			if (ret)
>>> +				return ret;
>>> +
>>> +			ret = rmi_rtt_unprot_map(rd, ipa, ipa_top, flags,
>>> +						 range_desc, &out_top);
>>> +		}
>>> +
>>> +		if (WARN_ON(ret))
>>> +			return ret;
>>> +
>>> +		phys += out_top - ipa;
>>> +		ipa = out_top;
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>> +
>>>    static int populate_region_cb(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>>>    			      struct page *src_page, void *opaque)
>>>    {
>>
>> Thanks,
>> Gavin
>>
> 



* Re: [PATCH 03/15] x86/virt/tdx: Make TDX Module initialize Extensions
From: Tony Lindgren @ 2026-06-05  8:46 UTC (permalink / raw)
  To: Xu Yilun
  Cc: kas, djbw, rick.p.edgecombe, x86, peter.fang, linux-coco,
	linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
	zhenzhong.duan, xiaoyao.li
In-Reply-To: <20260522034128.3144354-4-yilun.xu@linux.intel.com>

On Fri, May 22, 2026 at 11:41:16AM +0800, Xu Yilun wrote:
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1200,6 +1200,22 @@ static u64 to_hpa_list_info(struct page *root, unsigned int nr_pages)
>  	       FIELD_PREP(HPA_LIST_INFO_LAST_ENTRY, nr_pages - 1);
>  }
>  
> +/* Initialize the TDX Module Extensions then Extension-SEAMCALLs can be used */
> +static int tdx_ext_init(void)
> +{
> +	struct tdx_module_args args = {};
> +	u64 r;
> +
> +	do {
> +		r = seamcall(TDH_EXT_INIT, &args);
> +	} while (r == TDX_INTERRUPTED_RESUMABLE);
> +
> +	if (r != TDX_SUCCESS)
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
>  static int tdx_ext_mem_add(struct page *root, unsigned int nr_pages)
>  {
>  	struct tdx_module_args args = {

How about "Initialize the TDX Module Extensions for Extension-SEAMCALLs"
above for the comment?

Other than that:

Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>


* Re: [PATCH v7 20/42] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Suzuki K Poulose @ 2026-06-05  8:54 UTC (permalink / raw)
  To: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <CAEvNRgF43RBv77RgM67kXRRHDnQw4L5uwQTuvkJHzkHJWB1mag@mail.gmail.com>

On 04/06/2026 20:05, Ackerley Tng wrote:
> Suzuki K Poulose <suzuki.poulose@arm.com> writes:
> 
>>
>> [...snip...]
>>
>>> +In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO, ``uaddr`` will be
>>> +ignored completely. Otherwise, ``uaddr`` is required if
>>> +kvm.vm_memory_attributes=1 and optional if kvm.vm_memory_attributes=0, since
>>> +in the latter case guest memory can be initialized directly from userspace
>>> +prior to converting it to private and passing the GPA range on to this
>>> +interface.
>>
>> Just to confirm, so the sev_gmem_prepare doesn't destroy the contents in
>> the process of making it "private" ? i.e., the contents of a SNP shared
>> page are preserved while transitioning to "SNP Private" (via RMP
>> update).
>>
>> Suzuki
>>
> 
> The following is the guest_memfd perspective, I didn't look at the SNP
> spec:
> 
> Do you mean specifically for KVM_SEV_SNP_PAGE_TYPE_ZERO, or for any
> type?
> 
> guest_memfd has no plans to do any special zeroing based on type.
> 
> guest_memfd decoupled zeroing from preparation a while ago (Michael had
> some patches), so zeroing is supposed to be once during folio ownership
> by guest_memfd, tracked by the uptodate flag, and preparation is tracked
> outside of guest_memfd. So far only SNP does preparation.

I am talking about the SEV SNP conversions (specifically quoted in my
response). I will follow up on Michael's response.

Suzuki



* Re: [PATCH v7 20/42] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Suzuki K Poulose @ 2026-06-05  9:06 UTC (permalink / raw)
  To: Michael Roth
  Cc: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <4muegrza5iyyhqx6wevdlssnb6wvlc4m4wmuz5hmd3xikkftc4@3e2lpuq6tjgr>

On 04/06/2026 21:11, Michael Roth wrote:
> On Thu, Jun 04, 2026 at 04:29:19PM +0100, Suzuki K Poulose wrote:
>> On 23/05/2026 01:18, Ackerley Tng via B4 Relay wrote:
>>> From: Michael Roth <michael.roth@amd.com>
>>>
>>> For vm_memory_attributes=1, in-place conversion/population is not
>>> supported, so the initial contents necessarily must need to come
>>> from a separate src address, which is enforced by the current
>>> implementation. However, for vm_memory_attributes=0, it is possible for
>>> guest memory to be initialized directly from userspace by mmap()'ing the
>>> guest_memfd and writing to it while the corresponding GPA ranges are in
>>> a 'shared' state before converting them to the 'private' state expected
>>> by KVM_SEV_SNP_LAUNCH_UPDATE.
>>>
>>> Update the handling/documentation for KVM_SEV_SNP_LAUNCH_UPDATE to allow
>>> for 'uaddr' to be set to NULL when vm_memory_attributes=0, which
>>> SNP_LAUNCH_UPDATE will then use to determine when it should/shouldn't
>>> copy in data from a separate memory location. Continue to enforce
>>> non-NULL for the original vm_memory_attributes=1 case.
>>>
>>> Signed-off-by: Michael Roth <michael.roth@amd.com>
>>> [Added src_page check in error handling path when the firmware command fails]
>>> [Dropped ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES]
>>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>>
>>
>>
>>
>>> ---
>>>    Documentation/virt/kvm/x86/amd-memory-encryption.rst | 15 +++++++++++----
>>>    arch/x86/kvm/svm/sev.c                               | 18 +++++++++++++-----
>>>    virt/kvm/kvm_main.c                                  |  1 +
>>>    3 files changed, 25 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
>>> index b2395dd4769de..43085f65b2d85 100644
>>> --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
>>> +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
>>> @@ -503,7 +503,8 @@ secrets.
>>>    It is required that the GPA ranges initialized by this command have had the
>>>    KVM_MEMORY_ATTRIBUTE_PRIVATE attribute set in advance. See the documentation
>>> -for KVM_SET_MEMORY_ATTRIBUTES for more details on this aspect.
>>> +for KVM_SET_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES2 for more details on
>>> +this aspect.
>>>    Upon success, this command is not guaranteed to have processed the entire
>>>    range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
>>> @@ -511,9 +512,15 @@ range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
>>>    remaining range that has yet to be processed. The caller should continue
>>>    calling this command until those fields indicate the entire range has been
>>>    processed, e.g. ``len`` is 0, ``gfn_start`` is equal to the last GFN in the
>>> -range plus 1, and ``uaddr`` is the last byte of the userspace-provided source
>>> -buffer address plus 1. In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO,
>>> -``uaddr`` will be ignored completely.
>>> +range plus 1, and ``uaddr`` (if specified) is the last byte of the
>>> +userspace-provided source buffer address plus 1.
>>> +
>>> +In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO, ``uaddr`` will be
>>> +ignored completely. Otherwise, ``uaddr`` is required if
>>> +kvm.vm_memory_attributes=1 and optional if kvm.vm_memory_attributes=0, since
>>> +in the latter case guest memory can be initialized directly from userspace
>>> +prior to converting it to private and passing the GPA range on to this
>>> +interface.
>>
>> Just to confirm, so the sev_gmem_prepare doesn't destroy the contents in the
>> process of making it "private" ? i.e., the contents of a SNP shared
>> page are preserved while transitioning to "SNP Private" (via RMP
>> update).
> 
> sev_gmem_prepare() does sort of destroy contents since it finalizes the
> shared->private conversion which puts the page in an unusable state
> until the guest 'accepts' it as private memory and re-initializes the
> contents.
> 
> But that's run-time, when the guest is doing conversions. The
> documentation here is relating to initialization time when we are
> setting up the initial pre-encrypted/pre-measured guest memory image,
> via SNP_LAUNCH_UPDATE. That path calls into kvm_gmem_populate(), and it
> is then sev_gmem_post_populate() callback that actually finalizes the
> shared->private conversion. The sev_gmem_prepare() hook doesn't get used
> in this flow (kvm_gmem_populate() calls __kvm_gmem_get_pfn() which skips
> preparation).

Thanks, that's the bit I was missing: the populate path skips the prepare
step by going through __kvm_gmem_get_pfn(). I was under the assumption that
kvm_arch_gmem_prepare() was called for all PFNs allocated from gmem, and was
wondering how SNP was handling this populate case.


Thanks
Suzuki


> 
> -Mike
> 
>>
>> Suzuki
>>
>>


* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Gavin Shan @ 2026-06-05 11:20 UTC (permalink / raw)
  To: Steven Price, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <20260513131757.116630-30-steven.price@arm.com>

Hi Steve,

On 5/13/26 11:17 PM, Steven Price wrote:
> At runtime if the realm guest accesses memory which hasn't yet been
> mapped then KVM needs to either populate the region or fault the guest.
> 
> For memory in the lower (protected) region of IPA a fresh page is
> provided to the RMM which will zero the contents. For memory in the
> upper (shared) region of IPA, the memory from the memslot is mapped
> into the realm VM non secure.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
> Changes since v13:
>   * Numerous changes due to rebasing.
>   * Fix addr_range_desc() to encode the correct block size.
> Changes since v12:
>   * Switch to RMM v2.0 range based APIs.
> Changes since v11:
>   * Adapt to upstream changes.
> Changes since v10:
>   * RME->RMI renaming.
>   * Adapt to upstream gmem changes.
> Changes since v9:
>   * Fix call to kvm_stage2_unmap_range() in kvm_free_stage2_pgd() to set
>     may_block to avoid stall warnings.
>   * Minor coding style fixes.
> Changes since v8:
>   * Propagate the may_block flag.
>   * Minor comments and coding style changes.
> Changes since v7:
>   * Remove redundant WARN_ONs for realm_create_rtt_levels() - it will
>     internally WARN when necessary.
> Changes since v6:
>   * Handle PAGE_SIZE being larger than RMM granule size.
>   * Some minor renaming following review comments.
> Changes since v5:
>   * Reduce use of struct page in preparation for supporting the RMM
>     having a different page size to the host.
>   * Handle a race when delegating a page where another CPU has faulted on
>     a the same page (and already delegated the physical page) but not yet
>     mapped it. In this case simply return to the guest to either use the
>     mapping from the other CPU (or refault if the race is lost).
>   * The changes to populate_par_region() are moved into the previous
>     patch where they belong.
> Changes since v4:
>   * Code cleanup following review feedback.
>   * Drop the PTE_SHARED bit when creating unprotected page table entries.
>     This is now set by the RMM and the host has no control of it and the
>     spec requires the bit to be set to zero.
> Changes since v2:
>   * Avoid leaking memory if failing to map it in the realm.
>   * Correctly mask RTT based on LPA2 flag (see rtt_get_phys()).
>   * Adapt to changes in previous patches.
> ---
>   arch/arm64/include/asm/kvm_emulate.h |   8 ++
>   arch/arm64/include/asm/kvm_rmi.h     |  12 ++
>   arch/arm64/kvm/mmu.c                 | 128 ++++++++++++++++----
>   arch/arm64/kvm/rmi.c                 | 173 +++++++++++++++++++++++++++
>   4 files changed, 301 insertions(+), 20 deletions(-)
> 

[...]

> @@ -1604,27 +1641,52 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
>   	bool write_fault, exec_fault;
>   	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
>   	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> -	struct kvm_pgtable *pgt = s2fd->vcpu->arch.hw_mmu->pgt;
> +	struct kvm_vcpu *vcpu = s2fd->vcpu;
> +	struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> +	gpa_t gpa = kvm_gpa_from_fault(vcpu->kvm, s2fd->fault_ipa);
>   	unsigned long mmu_seq;
>   	struct page *page;
> -	struct kvm *kvm = s2fd->vcpu->kvm;
> +	struct kvm *kvm = vcpu->kvm;
>   	void *memcache;
>   	kvm_pfn_t pfn;
>   	gfn_t gfn;
>   	int ret;
>   
> -	memcache = get_mmu_memcache(s2fd->vcpu);
> -	ret = topup_mmu_memcache(s2fd->vcpu, memcache);
> +	if (kvm_is_realm(vcpu->kvm)) {
> +		/* check for memory attribute mismatch */
> +		bool is_priv_gfn = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
> +		/*
> +		 * For Realms, the shared address is an alias of the private
> +		 * PA with the top bit set. Thus if the fault address matches
> +		 * the GPA then it is the private alias.
> +		 */
> +		bool is_priv_fault = (gpa == s2fd->fault_ipa);
> +
> +		if (is_priv_gfn != is_priv_fault) {
> +			kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
> +						      kvm_is_write_fault(vcpu),
> +						      false,
> +						      is_priv_fault);
> +			/*
> +			 * KVM_EXIT_MEMORY_FAULT requires an return code of
> +			 * -EFAULT, see the API documentation
> +			 */
> +			return -EFAULT;
> +		}
> +	}
> +

For a Realm, gmem_abort() is called by kvm_handle_guest_abort() only when
we're faulting in the private (protected) space.

     if (kvm_slot_has_gmem(memslot) && !shared_ipa_fault(vcpu->kvm, fault_ipa))
         ret = gmem_abort(&s2fd);
     else
         ret = user_mem_abort(&s2fd);

Given that condition, this block of code can be simplified to handle only
the shared -> private conversion instead of both directions.

     /* Convert the shared address to the private address for Realm */
     if (kvm_is_realm(vcpu->kvm) &&
         !kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT)) {
         /*
          * KVM_EXIT_MEMORY_FAULT requires a return code of
          * -EFAULT, see the API documentation
          */
         kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
                                       kvm_is_write_fault(vcpu),
                                       false, true);
         return -EFAULT;
     }


[...]

> @@ -2396,7 +2475,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>   				!write_fault &&
>   				!kvm_vcpu_trap_is_exec_fault(vcpu));
>   
> -		if (kvm_slot_has_gmem(memslot))
> +		if (kvm_slot_has_gmem(memslot) && !shared_ipa_fault(vcpu->kvm, fault_ipa))
>   			ret = gmem_abort(&s2fd);
>   		else
>   			ret = user_mem_abort(&s2fd);
gmem_abort() is only called for faults in the protected (private) space.
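
To illustrate why the two-way check collapses to a one-way check, here is a
minimal userspace sketch. It assumes, based on the comment in the quoted
patch, that the shared alias is the private address with the top IPA bit
set. The struct realm_model type and all model_*() helpers are hypothetical
stand-ins written for this example; they are not the kernel's
kvm_gpa_from_fault() or shared_ipa_fault() implementations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a Realm IPA space; not a kernel structure. */
struct realm_model {
	unsigned int ipa_bits;	/* assumed IPA width, e.g. 40 */
};

/* Assumption: the top IPA bit selects the shared (unprotected) alias. */
static uint64_t model_shared_bit(const struct realm_model *r)
{
	assert(r->ipa_bits >= 2 && r->ipa_bits <= 64);
	return UINT64_C(1) << (r->ipa_bits - 1);
}

/* True when the faulting IPA lies in the shared half of the IPA space. */
static bool model_shared_ipa_fault(const struct realm_model *r,
				   uint64_t fault_ipa)
{
	return (fault_ipa & model_shared_bit(r)) != 0;
}

/* Strip the alias bit to recover the GPA used for memslot lookups. */
static uint64_t model_gpa_from_fault(const struct realm_model *r,
				     uint64_t fault_ipa)
{
	return fault_ipa & ~model_shared_bit(r);
}

/* Two-way mismatch check, shaped like the block in the quoted patch. */
static bool model_mismatch_full(const struct realm_model *r,
				uint64_t fault_ipa, bool is_priv_gfn)
{
	uint64_t gpa = model_gpa_from_fault(r, fault_ipa);
	bool is_priv_fault = (gpa == fault_ipa);

	return is_priv_gfn != is_priv_fault;
}

/* One-way check: only valid when the fault is known to be private. */
static bool model_mismatch_private_only(bool is_priv_gfn)
{
	return !is_priv_gfn;
}
```

For any fault address without the top bit set, model_mismatch_full() and
model_mismatch_private_only() agree, which is the case the caller guarantees
when it filters on !shared_ipa_fault() before calling gmem_abort().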

Thanks,
Gavin



* Re: [PATCH v6 06/11] x86/virt/tdx: Optimize tdx_pamt_get/put()
From: Kiryl Shutsemau @ 2026-06-05 11:42 UTC (permalink / raw)
  To: Chao Gao
  Cc: Edgecombe, Rick P, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, Huang, Kai, Hansen, Dave, Zhao, Yan Y,
	seanjc@google.com, mingo@redhat.com, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, nik.borisov@suse.com,
	linux-doc@vger.kernel.org, hpa@zytor.com, tglx@kernel.org,
	Annapurve, Vishal, bp@alien8.de, kirill.shutemov@linux.intel.com,
	x86@kernel.org
In-Reply-To: <aiJhScChLZkH44eB@intel.com>

On Fri, Jun 05, 2026 at 01:40:25PM +0800, Chao Gao wrote:
> On Thu, Jun 04, 2026 at 05:59:02PM +0100, Kiryl Shutsemau wrote:
> >On Tue, May 26, 2026 at 04:42:24PM +0000, Edgecombe, Rick P wrote:
> >> On Tue, 2026-05-26 at 16:57 +0800, Chao Gao wrote:
> >> > > -	scoped_guard(spinlock, &pamt_lock) {
> >> > 
> >> > This converts the scoped_guard() added by the previous patch to
> >> > explicit lock/unlock and goto. It would reduce code churn if the
> >> > previous patch used that form directly.
> >> 
> >> Yea, it's a good point. I actually debated doing it, but decided not to because
> >> the scoped version is cleaner for the non-optimized version. But for
> >> reviewability, never doing the scoped version is probably better.
> >
> >I don't see a reason why we can't keep the scoped_guard() on get side.
> 
> One additional reason to drop scoped_guard() is that it mixes cleanup helpers
> with goto, which is discouraged. See [*]
> 
>  :Lastly, given that the benefit of cleanup helpers is removal of “goto”, and
>  :that the “goto” statement can jump between scopes, the expectation is that
>  :usage of “goto” and cleanup helpers is never mixed in the same function.

Fair enough.

But it can also be addressed if we free the PAMT page array with the guard
too :P
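
A rough userspace sketch of that idea, using the compiler's cleanup
attribute (GCC/Clang) that the kernel's cleanup helpers are built on. All
names here (model_pamt_get(), model_lock, installed, and so on) are
hypothetical and only model the shape of the function: both the lock and
the preallocated array are released by scope exit, so no goto is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

static int live_allocs;		/* outstanding allocations, for checking */
static void *installed;		/* stands in for the installed PAMT pages */

struct model_lock { bool held; };	/* toy lock, not a real spinlock */

static void *model_alloc(void)
{
	void *p = malloc(16);

	if (p)
		live_allocs++;
	return p;
}

static void model_free(void *p)
{
	if (p) {
		live_allocs--;
		free(p);
	}
}

/* Cleanup callbacks receive a pointer to the annotated variable. */
static void cleanup_free(void **p) { model_free(*p); }
static void cleanup_unlock(struct model_lock **l) { (*l)->held = false; }

static int model_pamt_get(struct model_lock *lock, int *refcount)
{
	/* Freed automatically on every return unless ownership moves. */
	void *pages __attribute__((cleanup(cleanup_free))) = model_alloc();

	if (!pages)
		return -1;

	{
		/* Unlocked automatically when this inner scope is left. */
		struct model_lock *guard
			__attribute__((cleanup(cleanup_unlock))) = lock;

		guard->held = true;
		if ((*refcount)++ > 0)
			return 0;	/* lost the race; pages freed by cleanup */

		/* First user: hand the pages over and disarm the cleanup. */
		installed = pages;
		pages = NULL;
	}
	return 0;
}
```

In kernel code the equivalent would presumably use __free() and guard()
or scoped_guard() from linux/cleanup.h rather than open-coded attributes;
whether that reads better than explicit lock/unlock for tdx_pamt_get() is
the judgment call being discussed above.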

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


