From: Mike Rapoport <rppt@kernel.org>
To: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>,
akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
selinux@vger.kernel.org, surenb@google.com,
kernel-team@android.com, aarcange@redhat.com, peterx@redhat.com,
david@redhat.com, axelrasmussen@google.com, bgeffon@google.com,
willy@infradead.org, jannh@google.com, kaleshsingh@google.com,
ngeoffray@google.com, timmurray@google.com
Subject: Re: [PATCH v2 2/3] userfaultfd: protect mmap_changing with rw_sem in userfaulfd_ctx
Date: Wed, 7 Feb 2024 17:27:23 +0200 [thread overview]
Message-ID: <ZcOhW8NR9XWhVnKS@kernel.org> (raw)
In-Reply-To: <CA+EESO7RNn0aQhLxY+NDddNNNT6586qX08=rphU1-XmyoP5mZQ@mail.gmail.com>
On Mon, Feb 05, 2024 at 12:53:33PM -0800, Lokesh Gidra wrote:
> On Sun, Feb 4, 2024 at 2:27 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > > 3) Based on [1] I see how mmap_changing helps in eliminating duplicate
> > > work (background copy) by uffd monitor, but didn't get if there is a
> > > correctness aspect too that I'm missing? I concur with Amit's point in
> > > [1] that getting -EEXIST when setting up the pte will avoid memory
> > > corruption, no?
> >
> > In the fork case without mmap_changing the child process may be get data or
> > zeroes depending on the race for mmap_lock between the fork and
> > uffdio_copy and -EEXIST is not enough for monitor to detect what was the
> > ordering between fork and uffdio_copy.
>
> This is extremely helpful. IIUC, there is a window after mmap_lock
> (write-mode) is released and before the uffd monitor thread is
> notified of fork. In that window, the monitor doesn't know that fork
> has already happened. So, without mmap_changing it would have done
> background copy only in the parent, thereby causing data inconsistency
> between parent and child processes.
Yes.
> It seems to me that the correctness argument for mmap_changing is
> there in case of FORK event and REMAP when mremap is called with
> MREMAP_DONTUNMAP. In all other cases its only benefit is by avoiding
> unnecessary background copies, right?
Yes, I think you are right, but it's possible I've forgot some nasty race
that will need mmap_changing for other events.
> > > > > > > > > @@ -783,7 +788,9 @@ bool userfaultfd_remove(struct vm_area_struct *vma,
> > > > > > > > > return true;
> > > > > > > > >
> > > > > > > > > userfaultfd_ctx_get(ctx);
> > > > > > > > > + down_write(&ctx->map_changing_lock);
> > > > > > > > > atomic_inc(&ctx->mmap_changing);
> > > > > > > > > + up_write(&ctx->map_changing_lock);
> > > > > > > > > mmap_read_unlock(mm);
> > > > > > > > >
> > > > > > > > > msg_init(&ewq.msg);
> > > > > >
> > > > > > If this happens in read mode, then why are you waiting for the readers
> > > > > > to leave? Can't you just increment the atomic? It's fine happening in
> > > > > > read mode today, so it should be fine with this new rwsem.
> > > > >
> > > > > It's been a while and the details are blurred now, but if I remember
> > > > > correctly, having this in read mode forced non-cooperative uffd monitor to
> > > > > be single threaded. If a monitor runs, say uffdio_copy, and in parallel a
> > > > > thread in the monitored process does MADV_DONTNEED, the latter will wait
> > > > > for userfaultfd_remove notification to be processed in the monitor and drop
> > > > > the VMA contents only afterwards. If a non-cooperative monitor would
> > > > > process notification in parallel with uffdio ops, MADV_DONTNEED could
> > > > > continue and race with uffdio_copy, so read mode wouldn't be enough.
> > > > >
> > > >
> > > > Right now this function won't stop to wait for readers to exit the
> > > > critical section, but with this change there will be a pause (since the
> > > > down_write() will need to wait for the readers with the read lock). So
> > > > this is adding a delay in this call path that isn't necessary (?) nor
> > > > existed before. If you have non-cooperative uffd monitors, then you
> > > > will have to wait for them to finish to mark the uffd as being removed,
> > > > where as before it was a fire & forget, this is now a wait to tell.
> > > >
> > > I think a lot will be clearer once we get a response to my questions
> > > above. IMHO not only this write-lock is needed here, we need to fix
> > > userfaultfd_remove() by splitting it into userfaultfd_remove_prep()
> > > and userfaultfd_remove_complete() (like all other non-cooperative
> > > operations) as well. This patch enables us to do that as we remove
> > > mmap_changing's dependency on mmap_lock for synchronization.
> >
> > The write-lock is not a requirement here for correctness and I don't see
> > why we would need userfaultfd_remove_prep().
> >
> > As I've said earlier, having a write-lock here will let CRIU to run
> > background copy in parallel with processing of uffd events, but I don't
> > feel strongly about doing it.
> >
> Got it. Anyways, such a change needn't be part of this patch, so I'm
> going to keep it unchanged.
You mean with a read lock?
--
Sincerely yours,
Mike.
next prev parent reply other threads:[~2024-02-07 15:27 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-01-29 19:35 [PATCH v2 0/3] per-vma locks in userfaultfd Lokesh Gidra
2024-01-29 19:35 ` [PATCH v2 1/3] userfaultfd: move userfaultfd_ctx struct to header file Lokesh Gidra
2024-01-30 7:12 ` Mike Rapoport
2024-01-29 19:35 ` [PATCH v2 2/3] userfaultfd: protect mmap_changing with rw_sem in userfaulfd_ctx Lokesh Gidra
2024-01-29 21:00 ` Liam R. Howlett
2024-01-29 22:35 ` Lokesh Gidra
2024-01-30 3:46 ` Liam R. Howlett
2024-01-30 8:55 ` Mike Rapoport
2024-01-30 17:28 ` Liam R. Howlett
2024-01-31 2:24 ` Lokesh Gidra
2024-02-04 10:27 ` Mike Rapoport
2024-02-05 20:53 ` Lokesh Gidra
2024-02-07 15:27 ` Mike Rapoport [this message]
2024-02-07 20:24 ` Lokesh Gidra
2024-02-12 8:14 ` Mike Rapoport
2024-01-30 7:21 ` Mike Rapoport
2024-01-29 19:35 ` [PATCH v2 3/3] userfaultfd: use per-vma locks in userfaultfd operations Lokesh Gidra
2024-01-29 20:36 ` Liam R. Howlett
2024-01-29 20:52 ` Suren Baghdasaryan
2024-01-29 21:18 ` Liam R. Howlett
2024-01-30 0:28 ` Lokesh Gidra
2024-01-30 2:58 ` Liam R. Howlett
2024-01-31 2:49 ` Lokesh Gidra
2024-01-31 21:41 ` Liam R. Howlett
2024-02-05 21:46 ` Suren Baghdasaryan
2024-02-05 21:54 ` Lokesh Gidra
2024-02-05 22:00 ` Liam R. Howlett
2024-02-05 22:24 ` Lokesh Gidra
2024-02-06 14:35 ` Liam R. Howlett
2024-02-06 16:26 ` Lokesh Gidra
2024-02-06 17:07 ` Liam R. Howlett
2024-01-31 3:03 ` Suren Baghdasaryan
2024-01-31 21:43 ` Liam R. Howlett
2024-01-29 20:39 ` [PATCH v2 0/3] per-vma locks in userfaultfd Liam R. Howlett
2024-01-29 21:58 ` Lokesh Gidra
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZcOhW8NR9XWhVnKS@kernel.org \
--to=rppt@kernel.org \
--cc=Liam.Howlett@oracle.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=axelrasmussen@google.com \
--cc=bgeffon@google.com \
--cc=david@redhat.com \
--cc=jannh@google.com \
--cc=kaleshsingh@google.com \
--cc=kernel-team@android.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lokeshgidra@google.com \
--cc=ngeoffray@google.com \
--cc=peterx@redhat.com \
--cc=selinux@vger.kernel.org \
--cc=surenb@google.com \
--cc=timmurray@google.com \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.