From: Kiryl Shutsemau <kas@kernel.org>
To: akpm@linux-foundation.org, rppt@kernel.org, peterx@redhat.com,
	 david@kernel.org
Cc: ljs@kernel.org, surenb@google.com, vbabka@kernel.org,
	 Liam.Howlett@oracle.com, ziy@nvidia.com, corbet@lwn.net,
	skhan@linuxfoundation.org,  seanjc@google.com,
	pbonzini@redhat.com, jthoughton@google.com, aarcange@redhat.com,
	 sj@kernel.org, usama.arif@linux.dev, linux-mm@kvack.org,
	 linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kselftest@vger.kernel.org,  kvm@vger.kernel.org,
	kernel-team@meta.com
Subject: Re: [PATCH 07/14] mm: handle VM_UFFD_RWP in khugepaged, rmap, and GUP
Date: Thu, 30 Apr 2026 17:31:17 +0100
Message-ID: <afODmDGimJ4QUkkv@thinkstation>
In-Reply-To: <afOCAZ-dTQiLFN0y@thinkstation>

On Thu, Apr 30, 2026 at 05:28:17PM +0100, Kiryl Shutsemau wrote:
> sashiko.dev -- https://sashiko.dev/#/patchset/20260427114607.4068647-1-kas@kernel.org -- wrote:
> > > @@ -1084,9 +1092,29 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
> > >  		pte_t pte, unsigned long addr, int nr)
> > >  {
> > >  	struct mm_struct *src_mm = src_vma->vm_mm;
> > > +	bool writable;
> > > +
> > > +	/*
> > > +	 * Snapshot writability before the RWP-disarm rewrite below: when the
> > > +	 * child is not RWP-armed, pte_modify(pte, dst_vma->vm_page_prot) can
> > > +	 * silently drop _PAGE_RW from a resolved (no-marker) writable PTE,
> > > +	 * so a later pte_write(pte) check would skip the COW wrprotect and
> > > +	 * leave the parent writable over a folio shared with the child.
> > > +	 */
> > > +	writable = pte_write(pte);
> > > +
> > > +	/*
> > > +	 * Child is not RWP-armed: restore accessible protection so the
> > > +	 * inherited PAGE_NONE does not cost a fault on first read.
> > > +	 */
> > > +	if (!userfaultfd_protected(dst_vma)) {
> > > +		if (userfaultfd_rwp(src_vma))
> > > +			pte = pte_modify(pte, dst_vma->vm_page_prot);
> > > +		pte = pte_clear_uffd(pte);
> > > +	}
> > Does this unconditional pte_modify() create invalid clean and writable PTEs
> > for shared mappings?
> >
> > Without checking pte_uffd(pte) first, this blindly modifies every present PTE
> > if the source VMA had RWP enabled. For shared writable mappings, vm_page_prot
> > includes _PAGE_RW. If a PTE was clean and mapped read-only to intercept the
> > first write for filesystem dirty-tracking, pte_modify() forces the write bit
> > on while preserving the clean state.
> 
> Two reasons this is safe in practice:
> 
> 1. RWP cannot reach a backing that uses the clean-RO + page_mkwrite()
>    dirty-tracking pattern. vma_can_userfault() admits RWP only on anon,
>    shmem and hugetlbfs (the ones with vm_uffd_ops); regular file-backed
>    shared mappings are rejected at register time.
> 
> 2. For backings that *do* support shared writable + dirty tracking (i.e.
>    the ones RWP isn't enabled on), the fs sets vma->vm_page_prot to RO
>    precisely so the first write traps page_mkwrite(). pte_modify(pte,
>    dst_vma->vm_page_prot) would not add _PAGE_RW in that configuration
>    either.
> 
> > Hardware writes could then bypass page_mkwrite(), leaving the filesystem
> > unaware that the page is dirty, and potentially causing silent data loss
> > upon reclaim.
> > Could this also overwrite other important PTE states like NUMA hinting
> > or soft-dirty tracking?
> 
> pte_modify() on the supported architectures preserves _PAGE_CHG_MASK,
> which covers _PAGE_DIRTY, _PAGE_ACCESSED, _PAGE_SOFT_DIRTY, and the PFN.
> NUMA hinting is encoded in the protection bits, so a transient hint
> state on a resolved (no-marker) PTE would be lost across this rewrite,
> but that's just a re-prime on the next NUMA scan, not a correctness
> issue.
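
To make the masking argument concrete, here is a minimal userspace sketch of a pte_modify()-style rewrite. The bit positions and the model_* / M_* names are made up for illustration and do not match any real architecture's PTE layout; only the "keep the change mask, take everything else from the new protection" shape is meant to carry over.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit layout only; NOT a real architecture's PTE format. */
typedef uint64_t pte_model_t;

#define M_PRESENT    (1ULL << 0)
#define M_RW         (1ULL << 1)
#define M_ACCESSED   (1ULL << 5)
#define M_DIRTY      (1ULL << 6)
#define M_SOFT_DIRTY (1ULL << 11)
#define M_PFN_MASK   0x000ffffffffff000ULL

/* Bits that survive a protection change in this model. */
#define M_CHG_MASK (M_PFN_MASK | M_ACCESSED | M_DIRTY | M_SOFT_DIRTY)

/*
 * Keep the state bits covered by the change mask and take every
 * other bit (the protection bits) from newprot.
 */
pte_model_t model_pte_modify(pte_model_t pte, pte_model_t newprot)
{
	return (pte & M_CHG_MASK) | (newprot & ~M_CHG_MASK);
}
```

In this model, rewriting a dirty, soft-dirty, writable entry with a read-only protection keeps the PFN, dirty, accessed and soft-dirty bits and drops only the write bit. Anything encoded purely in the protection bits is replaced, which is the NUMA hinting case described above.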
> 
> > @@ -298,6 +298,16 @@ static __always_inline void change_present_ptes(struct mmu_gather *tlb,
> >      else if (uffd_wp_resolve || uffd_rwp_resolve)
> >          ptent = pte_clear_uffd(ptent);
> >
> > +    /*
> > +     * The uffd bit on a VM_UFFD_RWP VMA carries PROT_NONE
> > +     * semantics. If mprotect() or NUMA hinting changed the
> > +     * base protection, restore PAGE_NONE so the PTE still
> > +     * traps on any access. pte_modify() preserves
> > +     * _PAGE_UFFD.
> > +     */
> > +    if (userfaultfd_rwp(vma) && pte_uffd(ptent))
> > +        ptent = pte_modify(ptent, PAGE_NONE);
> >
> > For VMAs registered with both UFFD WP and RWP, could this silently upgrade
> > Write-Protected pages to Read-Write-Protected?
> 
> VM_UFFD_WP and VM_UFFD_RWP are mutually exclusive per VMA. The combination
> is rejected at registration time:
> 
>   fs/userfaultfd.c, userfaultfd_register():
>         if ((vm_flags & VM_UFFD_WP) && (vm_flags & VM_UFFD_RWP))
>                 goto out;       /* -EINVAL */
> 
> and re-registration cannot drop either bit (also -EBUSY in the same
> function), so a PTE in a WP-armed VMA can never satisfy
> userfaultfd_rwp(vma); the new branch never fires there.
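
A small userspace model of that argument, as a sketch only: the MODEL_* flag values and both function names are hypothetical, and the re-registration (-EBUSY) path is not modelled. It shows that once the combined registration is rejected, a WP-only VMA can never satisfy the condition guarding the new PAGE_NONE branch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flag values for the model; not the kernel's definitions. */
#define MODEL_VM_UFFD_WP  (1UL << 0)
#define MODEL_VM_UFFD_RWP (1UL << 1)

/* Registration-time check: WP and RWP may not be combined on one VMA. */
int model_register(unsigned long vm_flags)
{
	if ((vm_flags & MODEL_VM_UFFD_WP) && (vm_flags & MODEL_VM_UFFD_RWP))
		return -EINVAL;
	return 0;
}

/* Condition guarding the PAGE_NONE restore in the quoted hunk. */
bool model_rearm_applies(unsigned long vm_flags, bool pte_has_uffd_bit)
{
	return (vm_flags & MODEL_VM_UFFD_RWP) && pte_has_uffd_bit;
}
```

Any flag set that passes model_register() and contains MODEL_VM_UFFD_WP cannot also contain MODEL_VM_UFFD_RWP, so model_rearm_applies() returns false for it regardless of the PTE's uffd bit.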
> 
> [ ... ]
> > +        /* Re-arm RWP on the moved PTE if dst_vma is RWP-registered. */
> > +        if (userfaultfd_rwp(dst_vma)) {
> > +            orig_dst_pte = pte_modify(orig_dst_pte, PAGE_NONE);
> > +            orig_dst_pte = pte_mkuffd(orig_dst_pte);
> > +        }
> > +
> >          set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
> >
> > Could applying PAGE_NONE unconditionally to newly moved pages trap
> > applications in an infinite fault loop?
> 
> No -- the post-MOVE access is delivered as a normal RWP fault. In sync
> mode it goes to the registered handler, which resolves it with
> UFFDIO_RWPROTECT clearing MODE_RWP; in async mode the kernel resolves
> it in-kernel and the faulting thread continues. There is no loop.
> 
> The semantics here are intentional: a VM_UFFD_RWP VMA has the contract
> that every present PTE is either an active marker or a tracked-and-
> resolved PTE whose next access will re-trap. UFFDIO_MOVE into such a
> VMA must keep that contract, otherwise the moved-in page would be a
> silent hole in the working-set view. UFFDIO_MOVE has no mode flag for
> "skip protection", by design -- the same way it has no flag to skip
> WP arming if dst_vma were WP-armed (and the equivalent could be added
> there if we ever decide UFFDIO_MOVE should preserve markers in WP
> VMAs too).
> 

Oopsie. 

I put it in reply to the wrong patch. It was supposed to be for 06/14.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


Thread overview: 26+ messages
2026-04-27 11:45 [PATCH 00/14] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau (Meta)
2026-04-27 11:45 ` [PATCH 01/14] mm: decouple protnone helpers from CONFIG_NUMA_BALANCING Kiryl Shutsemau (Meta)
2026-04-30  4:47   ` SeongJae Park
2026-05-04  7:48   ` Mike Rapoport
2026-04-27 11:45 ` [PATCH 02/14] mm: rename uffd-wp PTE bit macros to uffd Kiryl Shutsemau (Meta)
2026-05-04  7:51   ` Mike Rapoport
2026-04-27 11:45 ` [PATCH 03/14] mm: rename uffd-wp PTE accessors " Kiryl Shutsemau (Meta)
2026-05-04  7:59   ` Mike Rapoport
2026-04-27 11:45 ` [PATCH 04/14] mm: add VM_UFFD_RWP VMA flag Kiryl Shutsemau (Meta)
2026-04-27 11:45 ` [PATCH 05/14] mm: add MM_CP_UFFD_RWP change_protection() flag Kiryl Shutsemau (Meta)
2026-04-27 11:45 ` [PATCH 06/14] mm: preserve RWP marker across PTE rewrites Kiryl Shutsemau (Meta)
2026-04-27 11:45 ` [PATCH 07/14] mm: handle VM_UFFD_RWP in khugepaged, rmap, and GUP Kiryl Shutsemau (Meta)
2026-04-30 16:28   ` Kiryl Shutsemau
2026-04-30 16:31     ` Kiryl Shutsemau [this message]
2026-04-27 11:45 ` [PATCH 08/14] userfaultfd: add UFFDIO_REGISTER_MODE_RWP and UFFDIO_RWPROTECT plumbing Kiryl Shutsemau (Meta)
2026-04-30 16:46   ` Kiryl Shutsemau
2026-04-27 11:45 ` [PATCH 09/14] mm/userfaultfd: add RWP fault delivery and expose UFFDIO_REGISTER_MODE_RWP Kiryl Shutsemau (Meta)
2026-04-30 16:51   ` Kiryl Shutsemau
2026-04-27 11:45 ` [PATCH 10/14] mm/pagemap: add PAGE_IS_ACCESSED for RWP tracking Kiryl Shutsemau (Meta)
2026-05-01 10:41   ` Kiryl Shutsemau
2026-04-27 11:45 ` [PATCH 11/14] userfaultfd: add UFFD_FEATURE_RWP_ASYNC for async fault resolution Kiryl Shutsemau (Meta)
2026-05-01 10:49   ` Kiryl Shutsemau
2026-04-27 11:46 ` [PATCH 12/14] userfaultfd: add UFFDIO_SET_MODE for runtime sync/async toggle Kiryl Shutsemau (Meta)
2026-05-01 13:12   ` Kiryl Shutsemau
2026-04-27 11:46 ` [PATCH 13/14] selftests/mm: add userfaultfd RWP tests Kiryl Shutsemau (Meta)
2026-04-27 11:46 ` [PATCH 14/14] Documentation/userfaultfd: document RWP working set tracking Kiryl Shutsemau (Meta)
