From: Peter Xu <peterx@redhat.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Jason Gunthorpe, Mike Kravetz, David Hildenbrand, Alistair Popple,
    Matthew Wilcox, "Kirill A. Shutemov", Hugh Dickins, Tiberiu Georgescu,
    Andrea Arcangeli, Axel Rasmussen, Nadav Amit, Mike Rapoport,
    Jerome Glisse, Andrew Morton, Miaohe Lin, peterx@redhat.com
Subject: [PATCH v5 06/26] shmem/userfaultfd: Handle uffd-wp special pte in page fault handler
Date: Thu, 15 Jul 2021 16:14:02 -0400
Message-Id: <20210715201422.211004-7-peterx@redhat.com>
In-Reply-To: <20210715201422.211004-1-peterx@redhat.com>
References: <20210715201422.211004-1-peterx@redhat.com>

File-backed memories are prone to unmap/swap, so their ptes are always
unstable.  This could lead to userfaultfd-wp information getting lost when
such memory (for example, shmem) is unmapped or swapped out.  To keep that
information persistent, we will start to use the newly introduced swap-like
special ptes to replace a none pte when those ptes are removed.

Prepare for this by first handling such a special pte in the general page
fault handler, before the special pte is actually put into use.

The handling of this special pte page fault is similar to a missing fault,
but it should happen after the pte-missing logic, since the special pte is
designed to be a swap-like pte.  Meanwhile it should be handled before
do_swap_page(), so that the swap core logic won't be confused by seeing
such an illegal swap pte.

This is a slow path of uffd-wp handling, because unmapping of wr-protected
shmem ptes should be rare.  So far it should only trigger in two
conditions:

  (1) When trying to punch holes in shmem_fallocate(), there will be a
      pre-unmap optimization before evicting the page.  That will create
      unmapped shmem ptes with wr-protected pages covered.
  (2) Swapping out of shmem pages.

Because of this, the page fault handling is simplified too: instead of
sending the wr-protect message on the 1st page fault, the page will be
installed read-only, so the message will not be generated until the next
write, which will trigger the do_wp_page() path of general uffd-wp
handling.

Disable fault-around for all uffd-wp registered ranges for extra safety,
and clean the code up a bit after we introduced MINOR fault.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/userfaultfd_k.h | 17 +++++++
 mm/memory.c                   | 88 +++++++++++++++++++++++++++++++----
 2 files changed, 95 insertions(+), 10 deletions(-)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index bb5a72a2b07a..92606d95b005 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -94,6 +94,18 @@ static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
 	return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
 }
 
+/*
+ * Don't do fault around for either WP or MINOR registered uffd range.  For
+ * MINOR registered range, fault around will be a total disaster and ptes can
+ * be installed without notifications; for WP it should mostly be fine as long
+ * as the fault around checks for pte_none() before the installation, however
+ * to be super safe we just forbid it.
+ */
+static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
+}
+
 static inline bool userfaultfd_missing(struct vm_area_struct *vma)
 {
 	return vma->vm_flags & VM_UFFD_MISSING;
@@ -259,6 +271,11 @@ static inline bool pte_swp_uffd_wp_special(pte_t pte)
 	return false;
 }
 
+static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
+{
+	return false;
+}
+
 #endif /* CONFIG_USERFAULTFD */
 
 #endif /* _LINUX_USERFAULTFD_K_H */
diff --git a/mm/memory.c b/mm/memory.c
index 998a4f9a3744..ba8033ca6682 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3964,6 +3964,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
 {
 	struct vm_area_struct *vma = vmf->vma;
+	bool uffd_wp = pte_swp_uffd_wp_special(vmf->orig_pte);
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
 	bool prefault = vmf->address != addr;
 	pte_t entry;
@@ -3978,6 +3979,8 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
 
 	if (write)
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	if (unlikely(uffd_wp))
+		entry = pte_mkuffd_wp(pte_wrprotect(entry));
 	/* copy-on-write page */
 	if (write && !(vma->vm_flags & VM_SHARED)) {
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
@@ -4045,8 +4048,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
 				       vmf->address, &vmf->ptl);
 	ret = 0;
-	/* Re-check under ptl */
-	if (likely(pte_none(*vmf->pte)))
+
+	/*
+	 * Re-check under ptl.  Note: this will cover both none pte and
+	 * uffd-wp-special swap pte
+	 */
+	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
 		do_set_pte(vmf, page, vmf->address);
 	else
 		ret = VM_FAULT_NOPAGE;
@@ -4150,9 +4157,21 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
 	return vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
 }
 
+/* Return true if we should do read fault-around, false otherwise */
+static inline bool should_fault_around(struct vm_fault *vmf)
+{
+	/* No ->map_pages?  No way to fault around... */
+	if (!vmf->vma->vm_ops->map_pages)
+		return false;
+
+	if (uffd_disable_fault_around(vmf->vma))
+		return false;
+
+	return fault_around_bytes >> PAGE_SHIFT > 1;
+}
+
 static vm_fault_t do_read_fault(struct vm_fault *vmf)
 {
-	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret = 0;
 
 	/*
@@ -4160,12 +4179,10 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
 	 * if page by the offset is not ready to be mapped (cold cache or
 	 * something).
 	 */
-	if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
-		if (likely(!userfaultfd_minor(vmf->vma))) {
-			ret = do_fault_around(vmf);
-			if (ret)
-				return ret;
-		}
+	if (should_fault_around(vmf)) {
+		ret = do_fault_around(vmf);
+		if (ret)
+			return ret;
 	}
 
 	ret = __do_fault(vmf);
@@ -4484,6 +4501,57 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
 	return VM_FAULT_FALLBACK;
 }
 
+static vm_fault_t uffd_wp_clear_special(struct vm_fault *vmf)
+{
+	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+				       vmf->address, &vmf->ptl);
+	/*
+	 * Be careful so that we will only recover a special uffd-wp pte into a
+	 * none pte.  Otherwise it means the pte could have changed, so retry.
+	 */
+	if (pte_swp_uffd_wp_special(*vmf->pte))
+		pte_clear(vmf->vma->vm_mm, vmf->address, vmf->pte);
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+	return 0;
+}
+
+/*
+ * This is actually a page-missing access, but with uffd-wp special pte
+ * installed.  It means this pte was wr-protected before being unmapped.
+ */
+static vm_fault_t uffd_wp_handle_special(struct vm_fault *vmf)
+{
+	/* Careful!  vmf->pte unmapped after return */
+	if (!pte_unmap_same(vmf))
+		return 0;
+
+	/*
+	 * Just in case there're leftover special ptes even after the region
+	 * got unregistered - we can simply clear them.
+	 */
+	if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma)))
+		return uffd_wp_clear_special(vmf);
+
+	/*
+	 * Here we share most code with do_fault(), in which we can identify
+	 * whether this is "none pte fault" or "uffd-wp-special fault" by
+	 * checking the vmf->orig_pte.
+	 */
+	return do_fault(vmf);
+}
+
+static vm_fault_t do_swap_pte(struct vm_fault *vmf)
+{
+	/*
+	 * We need to handle special swap ptes before handling ptes that
+	 * contain swap entries, always.
+	 */
+	if (unlikely(pte_swp_uffd_wp_special(vmf->orig_pte)))
+		return uffd_wp_handle_special(vmf);
+
+	return do_swap_page(vmf);
+}
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -4558,7 +4626,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 	}
 
 	if (!pte_present(vmf->orig_pte))
-		return do_swap_page(vmf);
+		return do_swap_pte(vmf);
 
 	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
 		return do_numa_page(vmf);
-- 
2.31.1
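
For context, the scenario described in the commit message can be exercised
from userspace roughly as below.  This is a minimal, illustrative sketch
(not part of the patch) using the standard userfaultfd ABI: register a
shmem (memfd) mapping with UFFDIO_REGISTER_MODE_WP, wr-protect it with
UFFDIO_WRITEPROTECT, then punch a hole so the wr-protected pte gets
unmapped.  Registering WP mode on shmem only succeeds on kernels with
shmem uffd-wp support such as this series; error handling is mostly
omitted.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	long psize = sysconf(_SC_PAGESIZE);

	/* shmem backing: a memfd mapped shared */
	int memfd = memfd_create("uffd-wp-shmem", 0);
	ftruncate(memfd, psize);
	char *area = mmap(NULL, psize, PROT_READ | PROT_WRITE,
			  MAP_SHARED, memfd, 0);

	/* open a userfaultfd and register the range in write-protect mode */
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API };
	ioctl(uffd, UFFDIO_API, &api);

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)area, .len = psize },
		.mode  = UFFDIO_REGISTER_MODE_WP,
	};
	if (ioctl(uffd, UFFDIO_REGISTER, &reg)) {
		perror("UFFDIO_REGISTER");	/* needs shmem uffd-wp support */
		return 1;
	}

	area[0] = 1;	/* fault the page in so there is a pte to protect */

	/* wr-protect the pte; later writes would raise uffd-wp messages */
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)area, .len = psize },
		.mode  = UFFDIO_WRITEPROTECT_MODE_WP,
	};
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

	/*
	 * Punching the hole unmaps the wr-protected pte.  Without the
	 * special pte introduced by this series the protection would be
	 * lost here; with it, the next fault on the page goes through the
	 * uffd-wp special pte path added above.
	 */
	fallocate(memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, psize);
	return 0;
}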