public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Tom Lendacky <thomas.lendacky@amd.com>
To: Michael Kelley <mikelley@microsoft.com>,
	kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com,
	luto@kernel.org, peterz@infradead.org,
	sathyanarayanan.kuppuswamy@linux.intel.com,
	kirill.shutemov@linux.intel.com, seanjc@google.com,
	rick.p.edgecombe@intel.com, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, x86@kernel.org
Subject: Re: [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared
Date: Wed, 2 Aug 2023 16:57:59 -0500	[thread overview]
Message-ID: <08bf0835-a4fa-38b5-4cbb-b2058a6a2da0@amd.com> (raw)
In-Reply-To: <1688661719-60329-1-git-send-email-mikelley@microsoft.com>

On 7/6/23 11:41, Michael Kelley wrote:
> In a CoCo VM when a page transitions from private to shared, or vice
> versa, attributes in the PTE must be updated *and* the hypervisor must
> be notified of the change. Because there are two separate steps, there's
> a window where the settings are inconsistent.  Normally the code that
> initiates the transition (via set_memory_decrypted() or
> set_memory_encrypted()) ensures that the memory is not being accessed
> during a transition, so the window of inconsistency is not a problem.
> However, the load_unaligned_zeropad() function can read arbitrary memory
> pages at arbitrary times, which could access a transitioning page during
> the window.  In such a case, CoCo VM specific exceptions are taken
> (depending on the CoCo architecture in use).  Current code in those
> exception handlers recovers and does "fixup" on the result returned by
> load_unaligned_zeropad().  Unfortunately, this exception handling and
> fixup code is tricky and somewhat fragile.  At the moment, it is
> broken for both TDX and SEV-SNP.
> 
> There's also a problem with the current code in paravisor scenarios:
> TDX Partitioning and SEV-SNP in vTOM mode. The exceptions need
> to be forwarded from the paravisor to the Linux guest, but there
> are no architectural specs for how to do that.
> 
> To avoid these complexities of the CoCo exception handlers, change
> the core transition code in __set_memory_enc_pgtable() to do the
> following:
> 
> 1.  Remove aliasing mappings
> 2.  Remove the PRESENT bit from the PTEs of all transitioning pages
> 3.  Flush the TLB globally
> 4.  Flush the data cache if needed
> 5.  Set/clear the encryption attribute as appropriate
> 6.  Notify the hypervisor of the page status change
> 7.  Add back the PRESENT bit
> 
> With this approach, load_unaligned_zeropad() just takes its normal
> page-fault-based fixup path if it touches a page that is transitioning.
> As a result, load_unaligned_zeropad() and CoCo VM page transitioning
> are completely decoupled.  CoCo VM page transitions can proceed
> without needing to handle architecture-specific exceptions and fix
> things up. This decoupling reduces the complexity due to separate
> TDX and SEV-SNP fixup paths, and gives more freedom to revise and
> introduce new capabilities in future versions of the TDX and SEV-SNP
> architectures. Paravisor scenarios work properly without needing
> to forward exceptions.
> 
> This approach may make __set_memory_enc_pgtable() slightly slower
> because of touching the PTEs three times instead of just once. But
> the run time of this function is already dominated by the hypercall
> and the need to flush the TLB at least once and maybe twice. In any
> case, this function is only used for CoCo VM page transitions, and
> is already unsuitable for hot paths.
> 
> The architecture specific callback function for notifying the
> hypervisor typically must translate guest kernel virtual addresses
> into guest physical addresses to pass to the hypervisor.  Because
> the PTEs are invalid at the time of callback, the code for doing the
> translation needs updating.  virt_to_phys() or equivalent continues
> to work for direct map addresses.  But vmalloc addresses cannot use
> vmalloc_to_page() because that function requires the leaf PTE to be
> valid. Instead, slow_virt_to_phys() must be used. Both functions
> manually walk the page table hierarchy, so performance is the same.

This all seems reasonable if it allows the paravisor approach to run 
without issues.

> 
> Signed-off-by: Michael Kelley <mikelley@microsoft.com>
> ---
> 
> I'm assuming the TDX handling of the data cache flushing is the
> same with this new approach, and that it doesn't need to be paired
> with a TLB flush as in the current code.  If that's not a correct
> assumption, let me know.
> 
> I've left the two hypervisor callbacks, before and after Step 5
> above. If the PTEs are invalid, it seems like the order of Step 5
> and Step 6 shouldn't matter, so perhaps one of the callback could
> be dropped.  Let me know if there are reasons to do otherwise.
> 
> It may well be possible to optimize the new implementation of
> __set_memory_enc_pgtable(). The existing set_memory_np() and
> set_memory_p() functions do all the right things in a very clear
> fashion, but perhaps not as optimally as having all three PTE
> manipulations directly in the same function. It doesn't appear
> that optimizing the performance really matters here, but I'm open
> to suggestions.
> 
> I've tested this on TDX VMs and SEV-SNP + vTOM VMs.  I can also
> test on SEV-SNP VMs without vTOM. But I'll probably need help
> covering SEV and SEV-ES VMs.

I wouldn't think that SEV and SEV-ES VMs would have any issues. However, 
these types of VMs don't make hypercalls at the moment, but I don't know 
that any slow downs would be noticed.

> 
> This RFC patch does *not* remove code that would no longer be
> needed. If there's agreement to take this new approach, I'll
> add patches to remove the unneeded code.
> 
> This patch is built against linux-next20230704.


> 
>   arch/x86/hyperv/ivm.c        |  3 ++-
>   arch/x86/kernel/sev.c        |  2 +-
>   arch/x86/mm/pat/set_memory.c | 32 ++++++++++++--------------------
>   3 files changed, 15 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> index 28be6df..2859ec3 100644
> --- a/arch/x86/hyperv/ivm.c
> +++ b/arch/x86/hyperv/ivm.c
> @@ -308,7 +308,8 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
>   		return false;
>   
>   	for (i = 0, pfn = 0; i < pagecount; i++) {
> -		pfn_array[pfn] = virt_to_hvpfn((void *)kbuffer + i * HV_HYP_PAGE_SIZE);
> +		pfn_array[pfn] = slow_virt_to_phys((void *)kbuffer +
> +					i * HV_HYP_PAGE_SIZE) >> HV_HYP_PAGE_SHIFT;

Definitely needs a comment here (and below) that slow_virt_to_phys() is 
being used because of making the page not present.

>   		pfn++;
>   
>   		if (pfn == HV_MAX_MODIFY_GPA_REP_COUNT || i == pagecount - 1) {
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index 1ee7bed..59db55e 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -784,7 +784,7 @@ static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long
>   		hdr->end_entry = i;
>   
>   		if (is_vmalloc_addr((void *)vaddr)) {
> -			pfn = vmalloc_to_pfn((void *)vaddr);
> +			pfn = slow_virt_to_phys((void *)vaddr) >> PAGE_SHIFT;
>   			use_large_entry = false;
>   		} else {
>   			pfn = __pa(vaddr) >> PAGE_SHIFT;
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index bda9f12..8a194c7 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -2136,6 +2136,11 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
>   	if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
>   		addr &= PAGE_MASK;
>   
> +	/* set_memory_np() removes aliasing mappings and flushes the TLB */

Is there any case where the TLB wouldn't be flushed when it should? Since, 
for SEV at least, the TLB flush being removed below was always performed.

Thanks,
Tom

> +	ret = set_memory_np(addr, numpages);
> +	if (ret)
> +		return ret;
> +
>   	memset(&cpa, 0, sizeof(cpa));
>   	cpa.vaddr = &addr;
>   	cpa.numpages = numpages;
> @@ -2143,36 +2148,23 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
>   	cpa.mask_clr = enc ? pgprot_decrypted(empty) : pgprot_encrypted(empty);
>   	cpa.pgd = init_mm.pgd;
>   
> -	/* Must avoid aliasing mappings in the highmem code */
> -	kmap_flush_unused();
> -	vm_unmap_aliases();
> -
>   	/* Flush the caches as needed before changing the encryption attribute. */
> -	if (x86_platform.guest.enc_tlb_flush_required(enc))
> -		cpa_flush(&cpa, x86_platform.guest.enc_cache_flush_required());
> +	if (x86_platform.guest.enc_cache_flush_required())
> +		cpa_flush(&cpa, 1);
>   
>   	/* Notify hypervisor that we are about to set/clr encryption attribute. */
>   	if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
>   		return -EIO;
>   
>   	ret = __change_page_attr_set_clr(&cpa, 1);
> -
> -	/*
> -	 * After changing the encryption attribute, we need to flush TLBs again
> -	 * in case any speculative TLB caching occurred (but no need to flush
> -	 * caches again).  We could just use cpa_flush_all(), but in case TLB
> -	 * flushing gets optimized in the cpa_flush() path use the same logic
> -	 * as above.
> -	 */
> -	cpa_flush(&cpa, 0);
> +	if (ret)
> +		return ret;
>   
>   	/* Notify hypervisor that we have successfully set/clr encryption attribute. */
> -	if (!ret) {
> -		if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> -			ret = -EIO;
> -	}
> +	if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> +		return -EIO;
>   
> -	return ret;
> +	return set_memory_p(&addr, numpages);
>   }
>   
>   static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)

  reply	other threads:[~2023-08-02 21:58 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-06 16:41 [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared Michael Kelley
2023-08-02 21:57 ` Tom Lendacky [this message]
2023-08-05 14:38   ` Michael Kelley (LINUX)
2023-08-06 22:19 ` kirill.shutemov
2023-08-16  2:54   ` Michael Kelley (LINUX)
2023-08-28 14:22     ` Michael Kelley (LINUX)
2023-08-28 16:13       ` kirill.shutemov
2023-08-28 21:00         ` Michael Kelley (LINUX)
2023-08-28 22:13           ` kirill.shutemov
2023-08-28 23:23             ` Michael Kelley (LINUX)
2023-08-28 23:57               ` kirill.shutemov
2023-08-30  0:02 ` Edgecombe, Rick P
2023-08-30  3:33   ` Michael Kelley (LINUX)
2023-08-30 23:40 ` Edgecombe, Rick P
2023-08-31 17:29   ` Edgecombe, Rick P
2023-09-01 14:44     ` Michael Kelley (LINUX)
2023-09-01 16:33       ` Edgecombe, Rick P

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=08bf0835-a4fa-38b5-4cbb-b2058a6a2da0@amd.com \
    --to=thomas.lendacky@amd.com \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=decui@microsoft.com \
    --cc=haiyangz@microsoft.com \
    --cc=hpa@zytor.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kys@microsoft.com \
    --cc=linux-hyperv@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=mikelley@microsoft.com \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rick.p.edgecombe@intel.com \
    --cc=sathyanarayanan.kuppuswamy@linux.intel.com \
    --cc=seanjc@google.com \
    --cc=tglx@linutronix.de \
    --cc=wei.liu@kernel.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox