Re: Splitting the mmap_sem

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jerome Glisse <jglisse@redhat.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: linux-mm@kvack.org, Laurent Dufour <ldufour@linux.ibm.com>,
	David Rientjes <rientjes@google.com>,
	Vlastimil Babka <vbabka@suse.cz>, Hugh Dickins <hughd@google.com>,
	Michel Lespinasse <walken@google.com>,
	Davidlohr Bueso <dbueso@suse.de>
Subject: Re: Splitting the mmap_sem
Date: Thu, 5 Dec 2019 12:21:50 -0500	[thread overview]
Message-ID: <20191205172150.GD5819@redhat.com> (raw)
In-Reply-To: <20191203222147.GV20752@bombadil.infradead.org>

Adding few interested people in cc

On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> 
> [My thanks to Vlastimil, Michel, Liam, David, Davidlohr and Hugh for
>  their feedback on an earlier version of this.  I think the solution
>  we discussed doesn't quite work, so here's one which I think does.
>  See the last two paragraphs in particular.]
> 
> My preferred solution to the mmap_sem scalability problem is to allow
> VMAs to be looked up under the RCU read lock then take a per-VMA lock.
> I've been focusing on the first half of this problem (looking up VMAs
> in an RCU-safe data structure) and ignoring the second half (taking a
> lock while holding the RCU lock).
> 
> We can't take a semaphore while holding the RCU lock in case we have to
> sleep -- the VMA might not exist any more when we woke up.  Making the
> per-VMA lock a spinlock would be a massive change -- fault handlers are
> currently called with the mmap_sem held and may sleep.  So I think we
> need a per-VMA refcount.  That lets us sleep while handling a fault.
> There are over 100 fault handlers in the kernel, and I don't want to
> change the locking in all of them.
> 
> That makes modifications to the tree a little tricky.  At the moment,
> we take the rwsem for write which waits for all readers to finish, then
> we modify the VMAs, then we allow readers back in.  With RCU, there is
> no way to block readers, so different threads may (at the same time)
> see both an old and a new VMA for the same virtual address.
> 
> So calling mmap() looks like this:
> 
>       1 allocate a new VMA
>       2 update pointer(s) in maple tree
>       3 sleep until old VMAs have a zero refcount
>       4 synchronize_rcu()
>       5 free old VMAs
>       6 flush caches for affected range
>       7 return to userspace
> 
> While one thread is calling mmap(MAP_FIXED), two other threads which are
> accessing the same address may see different data from each other and
> have different page translations in their respective CPU caches until
> the thread calling mmap() returns.  I believe this is OK, but would
> greatly appreciate hearing from people who know better.

I do not believe this is OK, i believe this is wrong (not even considering
possible hardware issues that can arise from such aliasing).

That bein said i believe this can be solve "easily" when the new vma is
added you mark it as a new born (VMA_BABY :)) and page fault will have
to wait on it ie until the previous vma is fully gone and flush. So after
step (6 flush caches) you remove the VMA_BABY flag before returning to
userspace and page fault can resume.

I would also mark old VMA with a ZOMBIE flag so that any reader might have
a chance to back-off and retry. To check for that we should add a new
check to vmf_insert_page() (and similar) to avoid inserting pfn in ZOMBIE
vma. Note that i am not sure what we want to do here, can an application
rely on rwsem serialization unknowingly ie could it have one thread doing
page fault on a range that is about to be unmap by another thread ? I am
not sure this can happen today without SEGFAULT thanks to serialization
through rwsem.

Anyway with BABY and ZOMBIE, it should behave mostly as it does today
(modulo concurrency).

> 
> Some people are concerned that a reference count on the VMA will lead to
> contention moving from the mmap_sem to the refcount on a very large VMA
> for workloads which have one giant VMA covering the entire working set.
> For those workloads, I propose we use the existing ->map_pages() callback
> (changed to return a vm_fault_t from the current void).
> 
> It will be called with the RCU lock held and no reference count on
> the vma.  If it needs to sleep, it should bump the refcount, drop the
> RCU lock, prepare enough so that the next call will not need to sleep,
> then drop the refcount and return VM_FAULT_RETRY so the VM knows the
> VMA is no longer good, and it needs to walk the VMA tree from the start.

Just to make sure i understand, you propose that ->map_pages() becomes
a new ->fault() handler that get calls before ->fault() without refcount
so that we can update fs/drivers slowly to perform better in the new scheme
(ie avoid the overead of refcounting if possible at all) ?

The ->fault() callback would then be the "slow" path which will require
a refcount on the vma (taken by core mm code before dropping rcu lock).

Cheers,
Jérôme

next prev parent reply	other threads:[~2019-12-05 17:21 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-12-03 22:21 Splitting the mmap_sem Matthew Wilcox
2019-12-05 17:21 ` Jerome Glisse [this message]
2019-12-06  5:13   ` Matthew Wilcox
2019-12-06 17:30     ` Jerome Glisse
2019-12-09  3:33       ` Matthew Wilcox
2019-12-09 14:17         ` Jerome Glisse
2019-12-10 15:26   ` Vlastimil Babka
2019-12-10 16:07     ` Jerome Glisse
2019-12-10 18:09       ` Vlastimil Babka
2019-12-12 14:24 ` Kirill A. Shutemov
2019-12-12 15:40   ` Matthew Wilcox
2019-12-12 15:46     ` Kirill A. Shutemov
2019-12-13 14:33       ` Matthew Wilcox
2019-12-13 18:06         ` Kirill A. Shutemov
2019-12-13 18:21           ` Matthew Wilcox
2020-01-06 22:09     ` Matthew Wilcox
2020-01-07 12:34       ` Kirill A. Shutemov
2020-01-07 13:54         ` Matthew Wilcox
2020-01-07 14:27           ` Kirill A. Shutemov
2020-01-09 13:56             ` Vlastimil Babka
2020-01-09 17:03               ` Michal Hocko
2020-01-09 17:07                 ` Michal Hocko
2020-01-09 17:32                   ` SeongJae Park
2020-01-09 20:13                     ` Matthew Wilcox
2020-02-06 13:59                       ` Peter Zijlstra
2020-02-06 20:15                         ` Matthew Wilcox
2020-02-06 20:55                           ` Peter Zijlstra
2020-02-06 21:20                             ` Matthew Wilcox
2020-02-07  8:52                               ` Peter Zijlstra
2020-02-10 22:00                                 ` Matthew Wilcox
2020-02-19 17:14                                 ` Laurent Dufour

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191205172150.GD5819@redhat.com \
    --to=jglisse@redhat.com \
    --cc=dbueso@suse.de \
    --cc=hughd@google.com \
    --cc=ldufour@linux.ibm.com \
    --cc=linux-mm@kvack.org \
    --cc=rientjes@google.com \
    --cc=vbabka@suse.cz \
    --cc=walken@google.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.