From: Peter Xu <peterx@redhat.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Axel Rasmussen, Nadav Amit, Mike Rapoport, Hugh Dickins, Mike Kravetz,
    "Kirill A. Shutemov", Alistair Popple, Jerome Glisse, Matthew Wilcox,
    Andrew Morton, peterx@redhat.com, David Hildenbrand, Andrea Arcangeli
Subject: [PATCH v6 06/23] mm/shmem: Handle uffd-wp special pte in page fault handler
Date: Mon, 15 Nov 2021 15:55:05 +0800
Message-Id: <20211115075522.73795-7-peterx@redhat.com>
In-Reply-To: <20211115075522.73795-1-peterx@redhat.com>
References: <20211115075522.73795-1-peterx@redhat.com>

File-backed memory is prone to being unmapped or swapped out, so its ptes
are always unstable: the pages can easily be faulted back in later from the
page cache.  This can cause the uffd-wp protection to get lost when such
memory is unmapped or swapped out.  One example is shmem.  PTE markers are
needed to store that information.

This patch prepares for that by teaching the page fault handler to
recognize uffd-wp pte markers first, before the markers are installed
anywhere else.

Handling a uffd-wp pte marker is similar to handling a missing fault; the
difference is that we take this "missing fault" path when we see the pte
marker, and we need to make sure the marker information is preserved while
processing the fault.

This is a slow path of uffd-wp handling, because zapping of wr-protected
shmem ptes should be rare.  So far it should only trigger in two
conditions:

  (1) When punching holes in shmem_fallocate(), where there is an
      optimization to zap the pgtables before evicting the page.

  (2) When swapping out shmem pages.

Because of this, the page fault handling is simplified too: the wr-protect
message is not sent on the first page fault; instead the page is installed
read-only, so the uffd-wp message is generated on the next (write) fault,
which triggers the do_wp_page() path of general uffd-wp handling.

Disable fault-around for all uffd-wp registered ranges for extra safety,
just like uffd minor faults, and clean the code up.
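For context, the userspace flow this series targets looks roughly like the
sketch below: a monitor registers a shmem-backed range (MAP_SHARED |
MAP_ANONYMOUS) in wr-protect mode and then write-protects it, expecting the
protection to survive the pte being zapped or the page being swapped out.
This is only an illustrative sketch with error handling omitted; the ioctls
and constants are the existing userfaultfd ABI from linux/userfaultfd.h,
not something added by this patch:

  #include <fcntl.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <linux/userfaultfd.h>

  int main(void)
  {
  	size_t len = 4096;

  	/* Open a userfaultfd and enable the write-protect feature */
  	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
  	struct uffdio_api api = {
  		.api = UFFD_API,
  		.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP,
  	};
  	ioctl(uffd, UFFDIO_API, &api);

  	/* Shmem-backed memory: MAP_SHARED | MAP_ANONYMOUS */
  	char *area = mmap(NULL, len, PROT_READ | PROT_WRITE,
  			  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  	memset(area, 0, len);		/* fault the page in first */

  	/* Register the range for wr-protect tracking */
  	struct uffdio_register reg = {
  		.range = { .start = (unsigned long)area, .len = len },
  		.mode = UFFDIO_REGISTER_MODE_WP,
  	};
  	ioctl(uffd, UFFDIO_REGISTER, &reg);

  	/* Write-protect it; a later write should report UFFD_PAGEFAULT_FLAG_WP */
  	struct uffdio_writeprotect wp = {
  		.range = { .start = (unsigned long)area, .len = len },
  		.mode = UFFDIO_WRITEPROTECT_MODE_WP,
  	};
  	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

  	/*
  	 * Without pte markers, a hole punch on the backing file or a swap-out
  	 * could zap the wr-protected pte here and the protection would be
  	 * silently lost; with this series the marker preserves it.
  	 */
  	return 0;
  }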
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/userfaultfd_k.h | 17 +++++++++
 mm/memory.c                   | 71 ++++++++++++++++++++++++++++++-----
 2 files changed, 79 insertions(+), 9 deletions(-)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 7d7ffec53ddb..05cec02140cb 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -96,6 +96,18 @@ static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
 	return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
 }
 
+/*
+ * Don't do fault around for either WP or MINOR registered uffd range.  For
+ * MINOR registered range, fault around will be a total disaster and ptes can
+ * be installed without notifications; for WP it should mostly be fine as long
+ * as the fault around checks for pte_none() before the installation, however
+ * to be super safe we just forbid it.
+ */
+static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
+}
+
 static inline bool userfaultfd_missing(struct vm_area_struct *vma)
 {
 	return vma->vm_flags & VM_UFFD_MISSING;
@@ -236,6 +248,11 @@ static inline void userfaultfd_unmap_complete(struct mm_struct *mm,
 {
 }
 
+static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
+{
+	return false;
+}
+
 #endif /* CONFIG_USERFAULTFD */
 
 static inline bool is_pte_marker_uffd_wp(pte_t pte)
diff --git a/mm/memory.c b/mm/memory.c
index d5966d9e24c3..e8557d43a87d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3452,6 +3452,43 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	return 0;
 }
 
+static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
+{
+	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+				       vmf->address, &vmf->ptl);
+	/*
+	 * Be careful so that we will only recover a special uffd-wp pte into a
+	 * none pte.  Otherwise it means the pte could have changed, so retry.
+	 */
+	if (is_pte_marker(*vmf->pte))
+		pte_clear(vmf->vma->vm_mm, vmf->address, vmf->pte);
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+	return 0;
+}
+
+/*
+ * This is actually a page-missing access, but with uffd-wp special pte
+ * installed.  It means this pte was wr-protected before being unmapped.
+ */
+static vm_fault_t pte_marker_handle_uffd_wp(struct vm_fault *vmf)
+{
+	/* Careful!  vmf->pte unmapped after return */
+	if (!pte_unmap_same(vmf))
+		return 0;
+
+	/*
+	 * Just in case there're leftover special ptes even after the region
+	 * got unregistered - we can simply clear them.  We can also do that
+	 * proactively when e.g. when we do UFFDIO_UNREGISTER upon some uffd-wp
+	 * ranges, but it should be more efficient to be done lazily here.
+	 */
+	if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma)))
+		return pte_marker_clear(vmf);
+
+	/* do_fault() can handle pte markers too like none pte */
+	return do_fault(vmf);
+}
+
 static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 {
 	swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);
@@ -3465,8 +3502,11 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 	if (WARN_ON_ONCE(vma_is_anonymous(vmf->vma) || !marker))
 		return VM_FAULT_SIGBUS;
 
-	/* TODO: handle pte markers */
-	return 0;
+	if (marker & PTE_MARKER_UFFD_WP)
+		return pte_marker_handle_uffd_wp(vmf);
+
+	/* This is an unknown pte marker */
+	return VM_FAULT_SIGBUS;
 }
 
 /*
@@ -3968,6 +4008,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
 {
 	struct vm_area_struct *vma = vmf->vma;
+	bool uffd_wp = is_pte_marker_uffd_wp(vmf->orig_pte);
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
 	bool prefault = vmf->address != addr;
 	pte_t entry;
@@ -3982,6 +4023,8 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
 
 	if (write)
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	if (unlikely(uffd_wp))
+		entry = pte_mkuffd_wp(pte_wrprotect(entry));
 	/* copy-on-write page */
 	if (write && !(vma->vm_flags & VM_SHARED)) {
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
@@ -4155,9 +4198,21 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
 	return vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
 }
 
+/* Return true if we should do read fault-around, false otherwise */
+static inline bool should_fault_around(struct vm_fault *vmf)
+{
+	/* No ->map_pages?  No way to fault around... */
+	if (!vmf->vma->vm_ops->map_pages)
+		return false;
+
+	if (uffd_disable_fault_around(vmf->vma))
+		return false;
+
+	return fault_around_bytes >> PAGE_SHIFT > 1;
+}
+
 static vm_fault_t do_read_fault(struct vm_fault *vmf)
 {
-	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret = 0;
 
 	/*
@@ -4165,12 +4220,10 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
 	 * if page by the offset is not ready to be mapped (cold cache or
 	 * something).
 	 */
-	if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
-		if (likely(!userfaultfd_minor(vmf->vma))) {
-			ret = do_fault_around(vmf);
-			if (ret)
-				return ret;
-		}
+	if (should_fault_around(vmf)) {
+		ret = do_fault_around(vmf);
+		if (ret)
+			return ret;
 	}
 
 	ret = __do_fault(vmf);
-- 
2.32.0