From: Peter Xu <peterx@redhat.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Nadav Amit, peterx@redhat.com, Alistair Popple, Andrew Morton, Mike Kravetz, Mike Rapoport, Matthew Wilcox, Jerome Glisse, Axel Rasmussen, "Kirill A . Shutemov", David Hildenbrand, Andrea Arcangeli, Hugh Dickins
Subject: [PATCH v6 08/23] mm/shmem: Allow uffd wr-protect none pte for file-backed mem
Date: Mon, 15 Nov 2021 16:00:34 +0800
Message-Id: <20211115080034.74526-1-peterx@redhat.com>
X-Mailer: git-send-email 2.32.0
In-Reply-To: <20211115075522.73795-1-peterx@redhat.com>
References: <20211115075522.73795-1-peterx@redhat.com>

File-backed memory differs from anonymous memory in that even if the pte is
missing, the data could still reside either in the file or in the page/swap
cache.  So when wr-protecting a pte, we need to consider none ptes too.

We do that by installing the uffd-wp pte markers when necessary.  Then, when
there is a future write to the pte, the fault handler will go down the special
path to first fault in the page read-only, and then report to the userfaultfd
server with the wr-protect message.

On the other hand, when unprotecting a page, it is also possible that the pte
got unmapped but replaced by the special uffd-wp marker.  In that case we need
to be able to recover the uffd-wp pte marker into a none pte, so that the next
access to the page will fault in correctly as usual.

Special care needs to be taken throughout the change_protection_range()
process.  Since we now allow the user to wr-protect a none pte, we need to be
able to pre-populate the page table entries when we see an
(!anonymous && MM_CP_UFFD_WP) request; otherwise change_protection_range()
will always skip pgtable entries that do not exist.

For example, the pgtable can be missing for a whole 2M pmd chunk while the
page cache exists for the 2M range.  When we want to wr-protect one 4K page
within the 2M pmd range, we need to pre-populate the pgtable and install the
pte marker showing that we want to get a message and block the thread when
the page cache of that 4K page is written.
Without pre-populating the pmd, change_protection() will simply skip that
whole pmd.

Note that this patch only covers the small pages (pte level), not the
transparent huge pages yet.  That will be done later, and this patch is a
preparation for it too.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/mprotect.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 62 insertions(+), 1 deletion(-)
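
For context, here is a rough userspace sketch of the flow this patch (together
with the rest of the series) is meant to enable on shmem: wr-protecting a 4K
page whose data only lives in the page cache, then resolving the protection
again.  It is illustrative only: the memfd name, sizes and the lack of error
handling are assumptions of the sketch rather than part of the patch, and
whether UFFDIO_REGISTER_MODE_WP is accepted on a shmem vma depends on the
other patches in this series.

/* Illustrative sketch only; all names below are made up for the example. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2UL << 20;		/* one 2M pmd worth of shmem */
	char buf[4096];
	int fd, uffd;
	char *addr;

	/* File-backed memory: a memfd (shmem) mapped shared */
	fd = memfd_create("uffd-wp-demo", 0);
	ftruncate(fd, len);
	addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* Populate the page cache through the fd only: the mapping itself is
	 * never touched, so the process pgtable stays empty for the 2M range. */
	memset(buf, 1, sizeof(buf));
	pwrite(fd, buf, sizeof(buf), 0);

	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API };
	ioctl(uffd, UFFDIO_API, &api);

	/* Track write-protect faults on the whole shmem range */
	struct uffdio_register reg = {
		.range = { .start = (uintptr_t)addr, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_WP,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/* Wr-protect one 4K page inside the empty pmd.  With this patch the
	 * pgtable is pre-populated and a PTE_MARKER_UFFD_WP entry installed,
	 * instead of the none pte being silently skipped. */
	struct uffdio_writeprotect wp = {
		.range = { .start = (uintptr_t)addr, .len = 4096 },
		.mode  = UFFDIO_WRITEPROTECT_MODE_WP,
	};
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

	/* A write through addr would now block until the uffd reader resolves
	 * it, e.g. by undoing the protection: */
	wp.mode = 0;
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

	return 0;
}

The final UFFDIO_WRITEPROTECT with mode 0 corresponds to the unprotect path
described above: the uffd-wp pte marker is turned back into a none pte, so the
next access simply faults the page in as usual.
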
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 890bc1f9ca24..be837c4dbc64 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -29,6 +29,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -174,7 +175,16 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				if (pte_swp_uffd_wp(oldpte))
 					newpte = pte_swp_mkuffd_wp(newpte);
 			} else if (is_pte_marker_entry(entry)) {
-				/* Skip it, the same as none pte */
+				/*
+				 * If this is uffd-wp pte marker and we'd like
+				 * to unprotect it, drop it; the next page
+				 * fault will trigger without uffd trapping.
+				 */
+				if (uffd_wp_resolve &&
+				    (pte_marker_get(entry) & PTE_MARKER_UFFD_WP)) {
+					pte_clear(vma->vm_mm, addr, pte);
+					pages++;
+				}
 				continue;
 			} else {
 				newpte = oldpte;
@@ -189,6 +199,20 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				set_pte_at(vma->vm_mm, addr, pte, newpte);
 				pages++;
 			}
+		} else {
+			/* It must be an none page, or what else?.. */
+			WARN_ON_ONCE(!pte_none(oldpte));
+			if (unlikely(uffd_wp && !vma_is_anonymous(vma))) {
+				/*
+				 * For file-backed mem, we need to be able to
+				 * wr-protect a none pte, because even if the
+				 * pte is none, the page/swap cache could
+				 * exist.  Doing that by install a marker.
+				 */
+				set_pte_at(vma->vm_mm, addr, pte,
+					   make_pte_marker(PTE_MARKER_UFFD_WP));
+				pages++;
+			}
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
@@ -222,6 +246,39 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
 	return 0;
 }
 
+/* Return true if we're uffd wr-protecting file-backed memory, or false */
+static inline bool
+uffd_wp_protect_file(struct vm_area_struct *vma, unsigned long cp_flags)
+{
+	return (cp_flags & MM_CP_UFFD_WP) && !vma_is_anonymous(vma);
+}
+
+/*
+ * If wr-protecting the range for file-backed, populate pgtable for the case
+ * when pgtable is empty but page cache exists.  When {pte|pmd|...}_alloc()
+ * failed it means no memory, we don't have a better option but stop.
+ */
+#define change_pmd_prepare(vma, pmd, cp_flags)				\
+	do {								\
+		if (unlikely(uffd_wp_protect_file(vma, cp_flags))) {	\
+			if (WARN_ON_ONCE(pte_alloc(vma->vm_mm, pmd)))	\
+				break;					\
+		}							\
+	} while (0)
+/*
+ * This is the general pud/p4d/pgd version of change_pmd_prepare(). We need to
+ * have separate change_pmd_prepare() because pte_alloc() returns 0 on success,
+ * while {pmd|pud|p4d}_alloc() returns the valid pointer on success.
+ */
+#define change_prepare(vma, high, low, addr, cp_flags)			\
+	do {								\
+		if (unlikely(uffd_wp_protect_file(vma, cp_flags))) {	\
+			low##_t *p = low##_alloc(vma->vm_mm, high, addr); \
+			if (WARN_ON_ONCE(p == NULL))			\
+				break;					\
+		}							\
+	} while (0)
+
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
 		pgprot_t newprot, unsigned long cp_flags)
@@ -240,6 +297,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 
 		next = pmd_addr_end(addr, end);
 
+		change_pmd_prepare(vma, pmd, cp_flags);
 		/*
 		 * Automatic NUMA balancing walks the tables with mmap_lock
 		 * held for read. It's possible a parallel update to occur
@@ -305,6 +363,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
+		change_prepare(vma, pud, pmd, addr, cp_flags);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
@@ -325,6 +384,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 	p4d = p4d_offset(pgd, addr);
 	do {
 		next = p4d_addr_end(addr, end);
+		change_prepare(vma, p4d, pud, addr, cp_flags);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		pages += change_pud_range(vma, p4d, addr, next, newprot,
@@ -350,6 +410,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	inc_tlb_flush_pending(mm);
 	do {
 		next = pgd_addr_end(addr, end);
+		change_prepare(vma, pgd, p4d, addr, cp_flags);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_p4d_range(vma, pgd, addr, next, newprot,
-- 
2.32.0