Re: [PATCH RFC] mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: David Hildenbrand <david@redhat.com>
To: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Ives van Hoorne <ives@codesandbox.io>,
	Andrew Morton <akpm@linux-foundation.org>,
	Alistair Popple <apopple@nvidia.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Nadav Amit <nadav.amit@gmail.com>,
	Andrea Arcangeli <aarcange@redhat.com>
Subject: Re: [PATCH RFC] mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA
Date: Wed, 7 Dec 2022 20:53:36 +0100	[thread overview]
Message-ID: <37a9442e-f6e5-35f5-0d51-669d60936b5f@redhat.com> (raw)
In-Reply-To: <Y5DQrHqc7RPYWpIA@x1n>

>>
>> On upstream during the next write fault, we'll end up in do_numa_page() and
>> simply remap the page writable due to vm_page_prot, not triggering a write
>> fault. I can see the "numa_hint_faults" counter in /proc/vmstat increasing
>> accordingly, so we're really in do_numa_page().
> 
> Seems true.  I think fundamentally it's because numa hint rely on PROT_NONE
> as the hint, and it explicitly checks against mprotect(PROT_NONE) using the
> accessible check:
> 
> 	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
> 		return do_numa_page(vmf);
> 
> I'm not sure whether we should also add a pte_uffd_wp(vmf->orig_pte) here
> to mask out the uffd-wp cases.

:/ more special UFFD-wp casing, I'm not sure sure about that.

Most importantly, once someone unlocks NUMA hinting for shmem (e.g., 
MPOL_MF_LAZY, MPOL_F_NUMA_BALANCING) this might be problematic. That at 
least makes it sound fragile to me.

> 
> So far it seems the outcome is not extremely bad - PROT_WRITE only mappings
> are rare in real life, and also with the protnone recovery code (and along
> with the vm_page_prot patch coming) we'll be able to still recover the pte
> into a uffd-wp-ed pte without PROTNONE bit set.  But I don't have a solid
> clue yet on what's the best.
> 

Yes, just another way to trigger surprise uffd-wp behavior (at least 
surprising for me ;) ). But this time, not involving mprotect(). I 
suspect there are more cases, but I might be wrong.

I was primarily trying to find out which other cases might be affected.

[..]

>>
>>
>> Independent of uffd-wp on shmem, we seem to be missing propagating the
>> uffd-wp bit when splitting the huge zeropage. So uffd-wp'ing the huge
>> zeropage and then splitting it loses the uffd-wp markers. :/
> 
> For this one, thanks for the reproducer.  I'm not extremely sure whether
> it's a bug.
> 
> Firstly, I think your reproducer should just work well with shmem, afaiu,
> because shmem is based on pte markers and it should only work on pte level
> (not pmd).  The huge zero pmd should got split right after wr-protected.
> So the reproducer shouldn't go wrong on shmem at all with/without any
> recent fix.  Let me know otherwise.

shmem doesn't use the huge shared zeropage, so it should be fine.

> 
> For anon, I'm not sure it's a bug, because there's a semantic difference on
> anon/shmem.  The thing is losing wr-protect on the zero page is the same as
> losing wr-protect on a page that is not mapped.  For anon currently we
> can't track a page that is not mapped and we skip those ranges (being zero
> when read).  So fundamentally I am not sure whether it'll be an issue for
> existing anon uffd-wp users because if it is then it's more than zero
> pages.

I think it's a bug, although most probably a low priority one.

Once user space successfully placed an uffd-wp marker, and e.g., 
verified using pagemap that it is indeed placed, the system should not 
silently drop it.

The behavior between an ordinary THP and a huge zeropage differs. For 
THP, we handle the split correctly and don't lose the marker. Assuming 
the huge zeropage woud be disabled, the behavior would be (IMHO) 
correct. The test case would pass.

For example, QEMU with uffd-wp based snapshotting will make sure that 
all virtual addresses are populated (e.g., mapping the shared, 
eventually the huge zeropage -- populate_read_range()), before 
protecting using uffd-wp. Losing a uffd-wp marker would be problematic.

The good news is that we barely will end up PTE-mapping the huge 
zeropage unless there is real user-space interaction (mprotect(), 
mremap(), mmap()), so this shouldn't trigger in the QEMU use-case.

Anyhow, I'll send a patch in a couple of days and we can discuss 
further. It's independent of the other discussion, just wanted to report 
my findings after staring at that code for way too long today.

-- 
Thanks,

David / dhildenb

next prev parent reply	other threads:[~2022-12-07 19:53 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-12-02 12:27 [PATCH RFC] mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA David Hildenbrand
2022-12-02 16:33 ` Peter Xu
2022-12-02 16:56   ` David Hildenbrand
2022-12-02 17:11     ` David Hildenbrand
2022-12-05 21:08       ` Peter Xu
2022-12-06  0:46         ` [PATCH] mm/uffd: Always wr-protect pte in pte_mkuffd_wp() kernel test robot
2022-12-06 16:21           ` Peter Xu
2022-12-06 11:43         ` kernel test robot
2022-12-06 16:28         ` [PATCH RFC] mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA David Hildenbrand
2022-12-06 19:09           ` Hugh Dickins
2022-12-06 21:18             ` Peter Xu
2022-12-07 15:32               ` David Hildenbrand
2022-12-07 17:43                 ` Peter Xu
2022-12-07 19:53                   ` David Hildenbrand [this message]
2022-12-07 20:14                     ` Peter Xu
2022-12-06 21:27           ` Peter Xu
2022-12-07 13:33             ` David Hildenbrand
2022-12-07 15:59               ` Peter Xu
2022-12-07 20:10                 ` David Hildenbrand
2022-12-08 15:17                   ` Peter Xu
2022-12-06 18:38         ` [PATCH] mm/uffd: Always wr-protect pte in pte_mkuffd_wp() kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=37a9442e-f6e5-35f5-0d51-669d60936b5f@redhat.com \
    --to=david@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=apopple@nvidia.com \
    --cc=hughd@google.com \
    --cc=ives@codesandbox.io \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nadav.amit@gmail.com \
    --cc=peterx@redhat.com \
    --cc=rppt@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).