From mboxrd@z Thu Jan 1 00:00:00 1970
From: Peter Xu <peterx@redhat.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Mike Rapoport, Mike Kravetz, peterx@redhat.com, Jerome Glisse,
	"Kirill A . Shutemov", Hugh Dickins, Axel Rasmussen, Matthew Wilcox,
	Andrew Morton, Andrea Arcangeli, Nadav Amit
Subject: [PATCH RFC 08/30] shmem/userfaultfd: Handle uffd-wp special pte in page fault handler
Date: Fri, 15 Jan 2021 12:08:45 -0500
Message-Id: <20210115170907.24498-9-peterx@redhat.com>
X-Mailer: git-send-email 2.26.2
In-Reply-To: <20210115170907.24498-1-peterx@redhat.com>
References: <20210115170907.24498-1-peterx@redhat.com>
MIME-Version: 1.0

File-backed memory is prone to being unmapped or swapped out, so its ptes
are always unstable.  This could lead to the userfaultfd-wp information
being lost when such memory, for example shmem, is unmapped or swapped
out.  To keep this information persistent, we will start to use the newly
introduced swap-like special ptes to replace the none pte when those ptes
are removed.

Prepare for this by handling such a special pte first before it is
applied.  Here a new fault flag FAULT_FLAG_UFFD_WP is introduced.  When
this flag is set, it means the current fault is to resolve a page access
(either read or write) to the uffd-wp special pte.

The handling of this special pte page fault is similar to a missing
fault, but it should happen after the pte-missing logic since the special
pte is designed to be a swap-like pte.  Meanwhile it should be handled
before do_swap_page() so that the swap core logic won't be confused by
such an illegal swap pte.

This is a slow path of uffd-wp handling, because unmaps of wr-protected
shmem ptes should be rare.  So far it should only trigger in two
conditions:

  (1) When punching holes in shmem_fallocate(), there is a pre-unmap
      optimization before evicting the page.  That will create unmapped
      shmem ptes whose pages were wr-protected.

  (2) Swapping out of shmem pages.
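As an illustration of condition (1), a userspace sequence like the one
below is what ends up leaving behind unmapped but previously wr-protected
shmem ptes.  This is a minimal sketch, assuming the shmem wr-protect
support that this series adds (kernels without it will refuse
UFFDIO_REGISTER_MODE_WP on shmem); error handling is omitted and it is
illustrative only, not part of the patch:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	size_t len = 16 * 4096;
	/* shmem-backed memory via memfd + shared mapping */
	int memfd = memfd_create("uffd-wp-test", 0);
	struct uffdio_api api = { .api = UFFD_API, .features = 0 };
	struct uffdio_register reg;
	struct uffdio_writeprotect wp;
	char *buf;
	int uffd;

	ftruncate(memfd, len);
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);

	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	ioctl(uffd, UFFDIO_API, &api);

	/* Requires the shmem uffd-wp support added by this series */
	reg.range.start = (unsigned long)buf;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_WP;
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	memset(buf, 1, len);			/* populate the ptes */

	wp.range.start = (unsigned long)buf;
	wp.range.len = len;
	wp.mode = UFFDIO_WRITEPROTECT_MODE_WP;	/* arm uffd-wp on all ptes */
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

	/*
	 * Condition (1): the hole punch pre-unmaps the ptes.  Without a
	 * persistent marker the wr-protect bit is lost together with the
	 * pte; the special pte keeps it so a later write still traps.
	 */
	fallocate(memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, 4096);

	return 0;
}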
Because this is a slow path anyway, the page fault handling is also
simplified by always assuming it is a read fault when calling do_fault().
When it is actually a write fault, it will fault again when the page
access is retried, and then do_wp_page() will handle the rest of the
message generation and delivery to the userfaultfd.

Disable fault-around for such a special page fault, because the newly
introduced flag (FAULT_FLAG_UFFD_WP) only applies to the current pte
rather than all the pages around it.  Doing fault-around with the new
flag set could confuse all the surrounding pages when installing ptes
from the page cache on a cache hit.
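For reference, this is roughly how that flow looks from the monitor side:
the retried write access shows up as a uffd-wp page fault message, and
the monitor resolves it by clearing the write protection over the
faulting page.  This is a minimal sketch of ordinary userfaultfd-wp
usage, assuming an already set-up uffd and a page_size value; it is
illustrative only and not part of the patch:

/* Monitor thread: resolve uffd-wp write faults (illustrative sketch). */
#include <linux/userfaultfd.h>
#include <poll.h>
#include <sys/ioctl.h>
#include <unistd.h>

static void handle_wp_faults(int uffd, unsigned long page_size)
{
	struct pollfd pfd = { .fd = uffd, .events = POLLIN };
	struct uffd_msg msg;
	struct uffdio_writeprotect wp;

	for (;;) {
		poll(&pfd, 1, -1);
		if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
			continue;
		if (msg.event != UFFD_EVENT_PAGEFAULT ||
		    !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
			continue;

		/*
		 * Clear the uffd-wp bit over the faulting page; since
		 * _MODE_DONTWAKE is not set this also wakes the blocked
		 * faulting thread so the write can complete.
		 */
		wp.range.start = msg.arg.pagefault.address & ~(page_size - 1);
		wp.range.len = page_size;
		wp.mode = 0;
		ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
	}
}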
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h |   2 +
 mm/memory.c        | 107 +++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 105 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3fb4e..85d928764b64 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -426,6 +426,7 @@ extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_UFFD_WP: When installing new page table entries, set the uffd-wp bit.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -456,6 +457,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_REMOTE		0x80
 #define FAULT_FLAG_INSTRUCTION		0x100
 #define FAULT_FLAG_INTERRUPTIBLE	0x200
+#define FAULT_FLAG_UFFD_WP		0x400
 
 /*
  * The default fault flags that should be used by most of the
diff --git a/mm/memory.c b/mm/memory.c
index 394c2602dce7..0b687f0be4d0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3797,6 +3797,7 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
 {
 	struct vm_area_struct *vma = vmf->vma;
+	bool pte_changed, uffd_wp = vmf->flags & FAULT_FLAG_UFFD_WP;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
 	pte_t entry;
 	vm_fault_t ret;
@@ -3807,14 +3808,27 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
 			return ret;
 	}
 
+	/*
+	 * Note: besides pte missing, FAULT_FLAG_UFFD_WP could also trigger
+	 * this path where vmf->pte got released before reaching here.  In
+	 * that case, even if vmf->pte==NULL, the pte actually still contains
+	 * the protection pte (by pte_swp_mkuffd_wp_special()).  For that
+	 * case, we'd also like to allocate a new pte like pte none, but check
+	 * differently for whether the pte changed.
+	 */
 	if (!vmf->pte) {
 		ret = pte_alloc_one_map(vmf);
 		if (ret)
 			return ret;
 	}
 
+	if (unlikely(uffd_wp))
+		pte_changed = !pte_swp_uffd_wp_special(*vmf->pte);
+	else
+		pte_changed = !pte_none(*vmf->pte);
+
 	/* Re-check under ptl */
-	if (unlikely(!pte_none(*vmf->pte))) {
+	if (unlikely(pte_changed)) {
 		update_mmu_tlb(vma, vmf->address, vmf->pte);
 		return VM_FAULT_NOPAGE;
 	}
@@ -3824,6 +3838,11 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
 	entry = pte_sw_mkyoung(entry);
 	if (write)
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	if (uffd_wp) {
+		/* This should only be triggered by a read fault */
+		WARN_ON_ONCE(write);
+		entry = pte_mkuffd_wp(pte_wrprotect(entry));
+	}
 	/* copy-on-write page */
 	if (write && !(vma->vm_flags & VM_SHARED)) {
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
@@ -3997,9 +4016,27 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
 	return ret;
 }
 
+/* Return true if we should do read fault-around, false otherwise */
+static inline bool should_fault_around(struct vm_fault *vmf)
+{
+	/* No ->map_pages?  No way to fault around... */
+	if (!vmf->vma->vm_ops->map_pages)
+		return false;
+
+	/*
+	 * Don't do fault around for FAULT_FLAG_UFFD_WP because it means we
+	 * want to recover a previously wr-protected pte.  This flag is
+	 * per-pte information, so it could confuse all the pages around the
+	 * current page when faulted in.  Give up on that quickly.
+	 */
+	if (vmf->flags & FAULT_FLAG_UFFD_WP)
+		return false;
+
+	return fault_around_bytes >> PAGE_SHIFT > 1;
+}
+
 static vm_fault_t do_read_fault(struct vm_fault *vmf)
 {
-	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret = 0;
 
 	/*
@@ -4007,7 +4044,7 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
 	 * if page by the offset is not ready to be mapped (cold cache or
 	 * something).
 	 */
-	if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
+	if (should_fault_around(vmf)) {
 		ret = do_fault_around(vmf);
 		if (ret)
 			return ret;
@@ -4322,6 +4359,68 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
 	return VM_FAULT_FALLBACK;
 }
 
+static vm_fault_t uffd_wp_clear_special(struct vm_fault *vmf)
+{
+	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+				       vmf->address, &vmf->ptl);
+	/*
+	 * Be careful so that we will only recover a special uffd-wp pte into
+	 * a none pte.  Otherwise it means the pte could have changed, so
+	 * retry.
+	 */
+	if (pte_swp_uffd_wp_special(*vmf->pte))
+		pte_clear(vmf->vma->vm_mm, vmf->address, vmf->pte);
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+	return 0;
+}
+
+/*
+ * This is actually a page-missing access, but with uffd-wp special pte
+ * installed.  It means this pte was wr-protected before being unmapped.
+ */
+vm_fault_t uffd_wp_handle_special(struct vm_fault *vmf)
+{
+	/* Careful!  vmf->pte unmapped after return */
+	if (!pte_unmap_same(vmf))
+		return 0;
+
+	/*
+	 * Just in case there're leftover special ptes even after the region
+	 * got unregistered - we can simply clear them.
+	 */
+	if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma)))
+		return uffd_wp_clear_special(vmf);
+
+	/*
+	 * Tell all the rest of the fault code: we're handling a special pte,
+	 * always remember to arm the uffd-wp bit when installing the new pte.
+	 */
+	vmf->flags |= FAULT_FLAG_UFFD_WP;
+
+	/*
+	 * Let's assume this is a read fault no matter what.  If it is a real
+	 * write access, it'll fault again into do_wp_page() where the message
+	 * will be generated before the thread yields itself.
+	 *
+	 * Ideally we can also handle the write immediately before return,
+	 * but this should be a slow path already (pte unmapped), so be
+	 * simple first.
+	 */
+	vmf->flags &= ~FAULT_FLAG_WRITE;
+
+	return do_fault(vmf);
+}
+
+static vm_fault_t do_swap_pte(struct vm_fault *vmf)
+{
+	/*
+	 * We need to handle special swap ptes before handling ptes that
+	 * contain swap entries, always.
+	 */
+	if (unlikely(pte_swp_uffd_wp_special(vmf->orig_pte)))
+		return uffd_wp_handle_special(vmf);
+
+	return do_swap_page(vmf);
+}
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -4385,7 +4484,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 	}
 
 	if (!pte_present(vmf->orig_pte))
-		return do_swap_page(vmf);
+		return do_swap_pte(vmf);
 
 	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
 		return do_numa_page(vmf);
-- 
2.26.2