From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
To: Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
x86@kernel.org
Cc: "Rafael J. Wysocki" <rafael@kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
Adrian Hunter <adrian.hunter@intel.com>,
Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com>,
Elena Reshetova <elena.reshetova@intel.com>,
Jun Nakajima <jun.nakajima@intel.com>,
Rick Edgecombe <rick.p.edgecombe@intel.com>,
Tom Lendacky <thomas.lendacky@amd.com>,
"Kalra, Ashish" <ashish.kalra@amd.com>,
Sean Christopherson <seanjc@google.com>,
"Huang, Kai" <kai.huang@intel.com>,
Ard Biesheuvel <ardb@kernel.org>, Baoquan He <bhe@redhat.com>,
"H. Peter Anvin" <hpa@zytor.com>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
"K. Y. Srinivasan" <kys@microsoft.com>,
Haiyang Zhang <haiyangz@microsoft.com>,
kexec@lists.infradead.org, linux-hyperv@vger.kernel.org,
linux-acpi@vger.kernel.org, linux-coco@lists.linux.dev,
linux-kernel@vger.kernel.org, Tao Liu <ltao@redhat.com>
Subject: [PATCHv11 11/19] x86/tdx: Convert shared memory back to private on kexec
Date: Tue, 28 May 2024 12:55:14 +0300
Message-ID: <20240528095522.509667-12-kirill.shutemov@linux.intel.com>
In-Reply-To: <20240528095522.509667-1-kirill.shutemov@linux.intel.com>
TDX guests allocate shared buffers to perform I/O. This is done by
allocating pages normally from the buddy allocator and converting them
to shared with set_memory_decrypted().
The second, kexec-ed kernel has no idea which memory was converted this
way. It only sees E820_TYPE_RAM.
Accessing shared memory via a private mapping is fatal: it leads to an
unrecoverable TD exit.
On kexec, walk the direct mapping and convert all shared memory back to
private. This makes all RAM private again, and the second kernel may use
it normally.
The conversion occurs in two steps: stopping new conversions and
unsharing all memory. For normal kexec, conversions are stopped while
scheduling is still functional, which allows waiting until any in-flight
conversions finish. The second step is carried out when all CPUs except
one are inactive and interrupts are disabled, which prevents conflicts
with code that may access shared memory.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
---
arch/x86/coco/tdx/tdx.c | 69 +++++++++++++++++++++++++++++++
arch/x86/include/asm/pgtable.h | 5 +++
arch/x86/include/asm/set_memory.h | 3 ++
arch/x86/mm/pat/set_memory.c | 41 ++++++++++++++++--
4 files changed, 115 insertions(+), 3 deletions(-)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 979891e97d83..c0a651fa8963 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -7,6 +7,7 @@
#include <linux/cpufeature.h>
#include <linux/export.h>
#include <linux/io.h>
+#include <linux/kexec.h>
#include <asm/coco.h>
#include <asm/tdx.h>
#include <asm/vmx.h>
@@ -14,6 +15,7 @@
#include <asm/insn.h>
#include <asm/insn-eval.h>
#include <asm/pgtable.h>
+#include <asm/set_memory.h>
/* MMIO direction */
#define EPT_READ 0
@@ -831,6 +833,70 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
return 0;
}
+/* Stop new private<->shared conversions */
+static void tdx_kexec_begin(bool crash)
+{
+ /*
+ * Crash kernel reaches here with interrupts disabled: can't wait for
+ * conversions to finish.
+ *
+ * If race happened, just report and proceed.
+ */
+ if (!set_memory_enc_stop_conversion(!crash))
+ pr_warn("Failed to stop shared<->private conversions\n");
+}
+
+/* Walk direct mapping and convert all shared memory back to private */
+static void tdx_kexec_finish(void)
+{
+ unsigned long addr, end;
+ long found = 0, shared;
+
+ lockdep_assert_irqs_disabled();
+
+ addr = PAGE_OFFSET;
+ end = PAGE_OFFSET + get_max_mapped();
+
+ while (addr < end) {
+ unsigned long size;
+ unsigned int level;
+ pte_t *pte;
+
+ pte = lookup_address(addr, &level);
+ size = page_level_size(level);
+
+ if (pte && pte_decrypted(*pte)) {
+ int pages = size / PAGE_SIZE;
+
+ /*
+ * Touching memory with shared bit set triggers implicit
+ * conversion to shared.
+ *
+ * Make sure nobody touches the shared range from
+ * now on.
+ */
+ set_pte(pte, __pte(0));
+
+ if (!tdx_enc_status_changed(addr, pages, true)) {
+ pr_err("Failed to unshare range %#lx-%#lx\n",
+ addr, addr + size);
+ }
+
+ found += pages;
+ }
+
+ addr += size;
+ }
+
+ __flush_tlb_all();
+
+ shared = atomic_long_read(&nr_shared);
+ if (shared != found) {
+ pr_err("shared page accounting is off\n");
+ pr_err("nr_shared = %ld, nr_found = %ld\n", shared, found);
+ }
+}
+
void __init tdx_early_init(void)
{
struct tdx_module_args args = {
@@ -890,6 +956,9 @@ void __init tdx_early_init(void)
x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required;
x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;
+ x86_platform.guest.enc_kexec_begin = tdx_kexec_begin;
+ x86_platform.guest.enc_kexec_finish = tdx_kexec_finish;
+
/*
* TDX intercepts the RDMSR to read the X2APIC ID in the parallel
* bringup low level code. That raises #VE which cannot be handled
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 65b8e5bb902c..e39311a89bf4 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -140,6 +140,11 @@ static inline int pte_young(pte_t pte)
return pte_flags(pte) & _PAGE_ACCESSED;
}
+static inline bool pte_decrypted(pte_t pte)
+{
+ return cc_mkdec(pte_val(pte)) == pte_val(pte);
+}
+
#define pmd_dirty pmd_dirty
static inline bool pmd_dirty(pmd_t pmd)
{
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 9aee31862b4a..d490db38db9e 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -49,8 +49,11 @@ int set_memory_wb(unsigned long addr, int numpages);
int set_memory_np(unsigned long addr, int numpages);
int set_memory_p(unsigned long addr, int numpages);
int set_memory_4k(unsigned long addr, int numpages);
+
+bool set_memory_enc_stop_conversion(bool wait);
int set_memory_encrypted(unsigned long addr, int numpages);
int set_memory_decrypted(unsigned long addr, int numpages);
+
int set_memory_np_noalias(unsigned long addr, int numpages);
int set_memory_nonglobal(unsigned long addr, int numpages);
int set_memory_global(unsigned long addr, int numpages);
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index a7a7a6c6a3fb..2a548b65ef5f 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2227,12 +2227,47 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
return ret;
}
+/*
+ * The lock serializes conversions between private and shared memory.
+ *
+ * It is taken for read on conversion. A write lock guarantees that no
+ * concurrent conversions are in progress.
+ */
+static DECLARE_RWSEM(mem_enc_lock);
+
+/*
+ * Stop new private<->shared conversions.
+ *
+ * Taking the exclusive mem_enc_lock waits for in-flight conversions to complete.
+ * The lock is not released to prevent new conversions from being started.
+ *
+ * If sleep is not allowed, as in a crash scenario, try to take the lock.
+ * Failure indicates that there is a race with the conversion.
+ */
+bool set_memory_enc_stop_conversion(bool wait)
+{
+ if (!wait)
+ return down_write_trylock(&mem_enc_lock);
+
+ down_write(&mem_enc_lock);
+
+ return true;
+}
+
static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
{
- if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
- return __set_memory_enc_pgtable(addr, numpages, enc);
+ int ret = 0;
- return 0;
+ if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
+ if (!down_read_trylock(&mem_enc_lock))
+ return -EBUSY;
+
+ ret = __set_memory_enc_pgtable(addr, numpages, enc);
+
+ up_read(&mem_enc_lock);
+ }
+
+ return ret;
}
int set_memory_encrypted(unsigned long addr, int numpages)
--
2.43.0
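For reference, the walk in tdx_kexec_finish() advances by the size of
whatever mapping level backs each address, so a single 2M mapping is
handled in one step. A user-space sketch of that stride logic, with a
toy mapping table standing in for lookup_address()/page_level_size()
(all names here are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE	4096UL
#define PMD_SIZE	(512 * PAGE_SIZE)	/* one 2M mapping */

/* Toy descriptor for one mapping in the direct map. */
struct mapping {
	unsigned long start;
	unsigned long size;	/* PAGE_SIZE or PMD_SIZE */
	bool shared;
};

/* Count shared pages, advancing by the mapping size at each step,
 * the same stride the patch takes with page_level_size(). */
static long count_shared(const struct mapping *map, size_t n,
			 unsigned long start, unsigned long end)
{
	unsigned long addr = start;
	long found = 0;
	size_t i = 0;

	while (addr < end && i < n) {
		const struct mapping *m = &map[i++];

		if (m->shared)
			found += m->size / PAGE_SIZE;
		addr += m->size;	/* skip the whole mapping */
	}
	return found;
}
```

The running total plays the role of `found` in tdx_kexec_finish(),
which is cross-checked against the nr_shared accounting at the end.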