From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 17 Jan 2023 12:42:03 +0200
From: Zhi Wang
To: Michael Roth
Cc: Brijesh Singh, Jarkko Sakkinen
Subject: Re: [PATCH RFC v7 20/64] x86/fault: Add support to handle the RMP fault for user address
Message-ID: <20230117124203.00001961@gmail.com>
In-Reply-To: <20221214194056.161492-21-michael.roth@amd.com>
References: <20221214194056.161492-1-michael.roth@amd.com>
 <20221214194056.161492-21-michael.roth@amd.com>
X-Mailer: Claws Mail 4.1.0 (GTK 3.24.33; x86_64-w64-mingw32)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII

On Wed, 14 Dec 2022 13:40:12 -0600
Michael Roth wrote:

> From: Brijesh Singh
> 
> When SEV-SNP is enabled globally, a write from the host goes through the
> RMP check. When the host writes to pages, hardware checks the following
> conditions at the end of page walk:
> 
> 1. Assigned bit in the RMP table is zero (i.e page is shared).
> 2. If the page table entry that gives the sPA indicates that the target
>    page size is a large page, then all RMP entries for the 4KB
>    constituting pages of the target must have the assigned bit 0.
> 3. Immutable bit in the RMP table is not zero.
> 

Just being curious.
AMD APM table 15-37 "RMP Page Assignment Settings" shows that the Immutable
bit is "don't care" when a page is owned by the hypervisor, and table 15-39
"RMP Memory Access Checks" shows that the hardware performs a
"Hypervisor-owned" check for host data writes and page-table accesses. I
suppose the "Hypervisor-owned" check means the hardware verifies that the
RMP entry is configured according to table 15-37 (Assigned = 0, ASID = 0,
Immutable = X).

Neither table mentions that the Immutable bit in the related RMP entry
should be 1 for a hypervisor-owned page. I can understand conditions 1)
and 2); can you explain more about 3)?

> The hardware will raise page fault if one of the above conditions is not
> met. Try resolving the fault instead of taking fault again and again. If
> the host attempts to write to the guest private memory then send the
> SIGBUS signal to kill the process. If the page level between the host and
> RMP entry does not match, then split the address to keep the RMP and host
> page levels in sync.
> 
> Co-developed-by: Jarkko Sakkinen
> Signed-off-by: Jarkko Sakkinen
> Co-developed-by: Ashish Kalra
> Signed-off-by: Ashish Kalra
> Signed-off-by: Brijesh Singh
> Signed-off-by: Michael Roth
> ---
>  arch/x86/mm/fault.c      | 97 ++++++++++++++++++++++++++++++++++++++++
>  include/linux/mm.h       |  3 +-
>  include/linux/mm_types.h |  3 ++
>  mm/memory.c              | 10 +++++
>  4 files changed, 112 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index f8193b99e9c8..d611051dcf1e 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -33,6 +33,7 @@
>  #include			/* kvm_handle_async_pf */
>  #include			/* fixup_vdso_exception() */
>  #include
> +#include			/* snp_lookup_rmpentry() */
>  
>  #define CREATE_TRACE_POINTS
>  #include
> @@ -414,6 +415,7 @@ static void dump_pagetable(unsigned long address)
>  	pr_cont("PTE %lx", pte_val(*pte));
>  out:
>  	pr_cont("\n");
> +
>  	return;
>  bad:
>  	pr_info("BAD\n");
> @@ -1240,6 +1242,90 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
>  }
>  NOKPROBE_SYMBOL(do_kern_addr_fault);
>  
> +enum rmp_pf_ret {
> +	RMP_PF_SPLIT = 0,
> +	RMP_PF_RETRY = 1,
> +	RMP_PF_UNMAP = 2,
> +};
> +
> +/*
> + * The goal of RMP faulting routine is really to check whether the
> + * page that faulted should be accessible. That can be determined
> + * simply by looking at the RMP entry for the 4k address being accessed.
> + * If that entry has Assigned=1 then it's a bad address. It could be
> + * because the 2MB region was assigned as a large page, or it could be
> + * because the region is all 4k pages and that 4k was assigned.
> + * In either case, it's a bad access.
> + * There are basically two main possibilities:
> + * 1. The 2M entry has Assigned=1 and Page_Size=1. Then all 511 middle
> + *    entries also have Assigned=1. This entire 2M region is a guest page.
> + * 2. The 2M entry has Assigned=0 and Page_Size=0. Then the 511 middle
> + *    entries can be anything, this region consists of individual 4k assignments.
> + */
> +static int handle_user_rmp_page_fault(struct pt_regs *regs, unsigned long error_code,
> +				      unsigned long address)
> +{
> +	int rmp_level, level;
> +	pgd_t *pgd;
> +	pte_t *pte;
> +	u64 pfn;
> +
> +	pgd = __va(read_cr3_pa());
> +	pgd += pgd_index(address);
> +
> +	pte = lookup_address_in_pgd(pgd, address, &level);
> +
> +	/*
> +	 * It can happen if there was a race between an unmap event and
> +	 * the RMP fault delivery.
> +	 */
> +	if (!pte || !pte_present(*pte))
> +		return RMP_PF_UNMAP;
> +
> +	/*
> +	 * RMP page fault handler follows this algorithm:
> +	 * 1. Compute the pfn for the 4kb page being accessed
> +	 * 2. Read that RMP entry -- If it is assigned then kill the process
> +	 * 3. Otherwise, check the level from the host page table
> +	 *    If level=PG_LEVEL_4K then the page is already smashed
> +	 *    so just retry the instruction
> +	 * 4. If level=PG_LEVEL_2M/1G, then the host page needs to be split
> +	 */
> +
> +	pfn = pte_pfn(*pte);
> +
> +	/* If its large page then calculte the fault pfn */
> +	if (level > PG_LEVEL_4K)
> +		pfn = pfn | PFN_DOWN(address & (page_level_size(level) - 1));
> +
> +	/*
> +	 * If its a guest private page, then the fault cannot be resolved.
> +	 * Send a SIGBUS to terminate the process.
> +	 *
> +	 * As documented in APM vol3 pseudo-code for RMPUPDATE, when the 2M range
> +	 * is covered by a valid (Assigned=1) 2M entry, the middle 511 4k entries
> +	 * also have Assigned=1. This means that if there is an access to a page
> +	 * which happens to lie within an Assigned 2M entry, the 4k RMP entry
> +	 * will also have Assigned=1. Therefore, the kernel should see that
> +	 * the page is not a valid page and the fault cannot be resolved.
> +	 */
> +	if (snp_lookup_rmpentry(pfn, &rmp_level)) {
> +		pr_info("Fatal RMP page fault, terminating process, entry assigned for pfn 0x%llx\n",
> +			pfn);
> +		do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
> +		return RMP_PF_RETRY;
> +	}
> +
> +	/*
> +	 * The backing page level is higher than the RMP page level, request
> +	 * to split the page.
> +	 */
> +	if (level > rmp_level)
> +		return RMP_PF_SPLIT;
> +
> +	return RMP_PF_RETRY;
> +}
> +
>  /*
>   * Handle faults in the user portion of the address space. Nothing in here
>   * should check X86_PF_USER without a specific justification: for almost
> @@ -1337,6 +1423,17 @@ void do_user_addr_fault(struct pt_regs *regs,
>  	if (error_code & X86_PF_INSTR)
>  		flags |= FAULT_FLAG_INSTRUCTION;
>  
> +	/*
> +	 * If its an RMP violation, try resolving it.
> +	 */
> +	if (error_code & X86_PF_RMP) {
> +		if (handle_user_rmp_page_fault(regs, error_code, address))
> +			return;
> +
> +		/* Ask to split the page */
> +		flags |= FAULT_FLAG_PAGE_SPLIT;
> +	}
> +
>  #ifdef CONFIG_X86_64
>  	/*
>  	 * Faults in the vsyscall page might need emulation. The
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 3c84f4e48cd7..2fd8e16d149c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -466,7 +466,8 @@ static inline bool fault_flag_allow_retry_first(enum fault_flag flags)
>  	{ FAULT_FLAG_USER,		"USER" }, \
>  	{ FAULT_FLAG_REMOTE,		"REMOTE" }, \
>  	{ FAULT_FLAG_INSTRUCTION,	"INSTRUCTION" }, \
> -	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }
> +	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }, \
> +	{ FAULT_FLAG_PAGE_SPLIT,	"PAGESPLIT" }
>  
>  /*
>   * vm_fault is filled by the pagefault handler and passed to the vma's
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 500e536796ca..06ba34d51638 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -962,6 +962,8 @@ typedef struct {
>   *                      mapped R/O.
>   * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
>   *                        We should only access orig_pte if this flag set.
> + * @FAULT_FLAG_PAGE_SPLIT: The fault was due page size mismatch, split the
> + *                         region to smaller page size and retry.
>   *
>   * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
>   * whether we would allow page faults to retry by specifying these two
> @@ -999,6 +1001,7 @@ enum fault_flag {
>  	FAULT_FLAG_INTERRUPTIBLE =	1 << 9,
>  	FAULT_FLAG_UNSHARE =		1 << 10,
>  	FAULT_FLAG_ORIG_PTE_VALID =	1 << 11,
> +	FAULT_FLAG_PAGE_SPLIT =		1 << 12,
>  };
>  
>  typedef unsigned int __bitwise zap_flags_t;
> diff --git a/mm/memory.c b/mm/memory.c
> index f88c351aecd4..e68da7e403c6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4996,6 +4996,12 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>  		return 0;
>  }
>  
> +static int handle_split_page_fault(struct vm_fault *vmf)
> +{
> +	__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
> +	return 0;
> +}
> +
>  /*
>   * By the time we get here, we already hold the mm semaphore
>   *
> @@ -5078,6 +5084,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>  			pmd_migration_entry_wait(mm, vmf.pmd);
>  		return 0;
>  	}
> +
> +	if (flags & FAULT_FLAG_PAGE_SPLIT)
> +		return handle_split_page_fault(&vmf);
> +
>  	if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
>  		if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
>  			return do_huge_pmd_numa_page(&vmf);