From: Mike Rapoport <rppt@kernel.org>
To: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Xu <peterx@redhat.com>,
	David Hildenbrand <david@redhat.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Linux-MM <linux-mm@kvack.org>
Subject: Re: userfaultfd: usability issue due to lack of UFFD events ordering
Date: Thu, 17 Feb 2022 23:15:46 +0200
Message-ID: <Yg67AoSBMNM4JVvP@kernel.org>
In-Reply-To: <F195F8B6-05C4-45BC-BA10-632CA3699941@gmail.com>

On Tue, Feb 15, 2022 at 02:35:09PM -0800, Nadav Amit wrote:
> 
> 
> > On Feb 13, 2022, at 8:02 PM, Peter Xu <peterx@redhat.com> wrote:
> > 
> > Thanks for explaining.
> > 
> > I also dug out the discussion threads between you and Mike, and that's a good
> > one too, summarizing the problems:
> > 
> > https://lore.kernel.org/all/5921BA80-F263-4F8D-B7E6-316CEB602B51@gmail.com/
> > 
> > Scenario 4 is kind of special imho among all those, because that's the only one
> > that can be worked around by the user application by copying pages one by one.
> > I know you were even leveraging iouring in your local tree, so that's probably
> > not a solution at all for you. But I'm just trying to start thinking without
> > that scenario for now.
> > 
> > Per my understanding, a major issue regarding the rest of the scenarios is
> > ordering of uffd messages may not match with how things are happening.  This
> > actually contains two problems.
> > 
> > First of all, mmap_sem is mostly held for read for all page faults and most of
> > the mm changes except e.g. fork, so we can never serialize them.  Not to mention
> > that uffd events release mmap_sem within prep and completion.  Let's call it
> > problem 1.
> > 
> > The other problem 2 is we can never serialize faults against events.
> > 
> > For problem 1, I do sense something that mmap_sem is just not suitable for uffd
> > scenario. Say, we grant concurrent with most of the events like dontneed and
> > mremap, but when uffd ordering is a concern we may not want to grant that
> > concurrency.  I'm wondering whether it means uffd may need its own semaphore to
> > achieve this.  So for all events that uffd cares we take write lock on a new
> > uffd_sem after mmap_sem, meanwhile we don't release that uffd_sem after prep of
> > events, not until completion (the message is read).  It'll slow down uffd
> > tracked systems but guarantees ordering.
> 
> Peter,
> 
> Thanks for finding the time and looking into the issues that I encountered.
> 
> Your approach sounds possible, but it sounds to me unsafe to acquire uffd_sem
> after mmap_lock, since it might cause deadlocks (e.g., if a process uses events
> to manage its own memory).
> 
> > 
> > In the meantime, I'm wildly thinking whether we can tackle the other
> > problem by merging the page fault queue with the event queue, aka, event_wqh
> > and fault_pending_wqh.  Obviously we'll need to identify the messages on
> > read() and conditionally move them into fault_wqh only if they come from page
> > faults, but that seems doable?
> 
> This, I guess, together with your aforementioned proposal to have some
> protecting semaphore, can do the trick.
> 
> While I got your attention, let me share some other challenges I encountered
> using userfaultfd. They might be unrelated, but perhaps you can keep them in
> the back of your mind. Nobody should suffer as I did ;-)
> 
> 1. mmap_changing (i.e., -EAGAIN on ioctls) makes using userfaultfd harder than
> it should be, especially when using io-uring as I wish to do.
> 
> I think it is not too hard to address by changing the API. For instance, if
> uffd-ctx had a uffd-generation that would increase on each event, the user
> could provide an ioctl-generation as part of the copy/zero/etc ioctls, and
> the kernel would let the copy/zero/etc operation succeed only if the
> uffd-generation is lower than or equal to the one provided by the user.

Do you mean that if there were page faults with generations 1 and 3 and,
say, MADV_DONTNEED with generation 2, then even if the uffd copy that resolves
page fault 1 races with MADV_DONTNEED it will go through and the copy for
page fault 3 will fail?

But how would you order zapping the pages and copying into them internally?
Or was my understanding of your idea completely off?

As for the technicalities of adding a generation to uffd_msg and to
uffdio_{copy,zero,etc}, we can use the __u32 reserved in the first one and 32
bits from mode in the second, with a bit of care for wraparound.
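
A minimal sketch of what "a bit of care for wraparound" could look like
for a 32-bit generation counter. The helper name gen_before_eq() is
hypothetical and is not an existing kernel or userfaultfd interface; it
only illustrates comparing two counters modulo 2^32, in the same spirit
as the kernel's time_after() style comparisons. It assumes the two
generations are never more than 2^31 apart.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if generation 'a' is the same as or
 * older than generation 'b', treating both as counters that wrap at 2^32.
 * The unsigned subtraction wraps, and reinterpreting the difference as
 * signed gives the ordering as long as the distance is below 2^31.
 */
static bool gen_before_eq(uint32_t a, uint32_t b)
{
	return (int32_t)(uint32_t)(a - b) <= 0;
}
```

With this, a counter value of 0xffffffff still compares as older than 0,
so a generation check would keep working across the wrap.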
 
> 2. userfaultfd is separated from other tracing/instrumentation mechanisms in
> the kernel. I, for instance, also wanted to track mmap events (let’s put
> aside for a second why). Tracking these events can be done with ptrace or
> perf_event_open() but then it is hard to correlate these events with
> userfaultfd. It would have been easier for users, I think, if userfaultfd
> notifications were provided through ptrace/tracepoints mechanisms as well.

This sounds like opening Pandora's box ;-)

I think it's possible to trace userfaultfd events to some extent with probes
at the entry of userfaultfd_event_wait_completion() and handle_userfault().
The "interesting" information is passed to these functions as parameters
and I believe all the data can be extracted with tools like bpftrace.
 
> 3. Nesting/chaining. It is not easy to allow two monitors to use userfaultfd
> concurrently. This seems like a general problem that I believe ptrace suffers
> from too. I know it might seem far-fetched to have 2 monitors at the moment,
> but I think that any tracking/instrumentation mechanism (e.g., ptrace,
> software-dirty, not to mention hardware virtualization) should be designed
> from the beginning with such support as adding it in a later stage can be
> tricky.

It's not too far-fetched to have nested userfaultfd contexts even now. If
CRIU needs to post-copy restore a process that uses userfaultfd, it
will need to deal with nested uffds.
 
> Thanks again,
> Nadav

-- 
Sincerely yours,
Mike.

