Re: Splitting the mmap_sem

From: Jerome Glisse <jglisse@redhat.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: linux-mm@kvack.org, Laurent Dufour <ldufour@linux.ibm.com>,
	David Rientjes <rientjes@google.com>,
	Vlastimil Babka <vbabka@suse.cz>, Hugh Dickins <hughd@google.com>,
	Michel Lespinasse <walken@google.com>,
	Davidlohr Bueso <dbueso@suse.de>
Subject: Re: Splitting the mmap_sem
Date: Fri, 6 Dec 2019 12:30:30 -0500
Message-ID: <20191206173030.GA3648@redhat.com>
In-Reply-To: <20191206051322.GA21007@bombadil.infradead.org>

On Thu, Dec 05, 2019 at 09:13:22PM -0800, Matthew Wilcox wrote:
> On Thu, Dec 05, 2019 at 12:21:50PM -0500, Jerome Glisse wrote:
> > Adding few interested people in cc
> 
> I figured they all read linux-mm already ;-)
> 
> > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > > While one thread is calling mmap(MAP_FIXED), two other threads which are
> > > accessing the same address may see different data from each other and
> > > have different page translations in their respective CPU caches until
> > > the thread calling mmap() returns.  I believe this is OK, but would
> > > greatly appreciate hearing from people who know better.
> > 
> > I do not believe this is OK, i believe this is wrong (not even considering
> > possible hardware issues that can arise from such aliasing).
> 
> Well, OK, but why do you believe it is wrong?  If thread A is executing
> a load instruction at the same time that thread B is calling mmap(),
> it really is indeterminate what value A loads.  It might be from before
> the call to mmap() and it might be from after.  And if thread C is also
> executing a load instruction at the same time, then it might already get
> a different result from thread A.  And can threads A and C really tell
> which of them executed the load instruction 'first'?  I think this is
> all so indeterminate already that the (lack of) guarantees I outlined
> above are acceptable.
> 
> But we should all agree on this, so _please_ continue to argue your case
> for why you believe it to be wrong.
> 

I agree that such an application might look like it is doing something
indeterminate, but there might be applications that catch SIGSEGV and use
it as synchronization. I did something similar for reverse engineering a
long time ago with a library called libsegfault ...

In any case, I agree that an application that is not catching SIGSEGV and
that uses the access pattern above is doing something indeterminate.
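
For readers who have not seen the pattern, here is a minimal userspace
sketch of an application that catches the segmentation fault signal
(SIGSEGV on Linux) and uses it as a synchronization point. It is only an
illustration and not code from this thread. The names on_segv() and
run_segv_sync_demo() are hypothetical. It assumes Linux behavior: an
access to a PROT_NONE page raises SIGSEGV, and calling mprotect() from the
handler works in practice even though POSIX does not list it as
async-signal-safe.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for MAP_ANONYMOUS under strict -std modes */
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *guarded_page;                  /* page that starts out PROT_NONE */
static size_t page_size;
static volatile sig_atomic_t fault_count;   /* faults seen on guarded_page */

/*
 * SIGSEGV handler (hypothetical name). If the fault hit the guarded page,
 * make the page accessible so the faulting instruction succeeds when the
 * kernel restarts it after the handler returns.
 */
static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)guarded_page;

    (void)sig;
    (void)ctx;

    if (addr < base || addr >= base + page_size)
        _exit(1);               /* a fault we do not know how to handle */

    fault_count++;
    /* Assumption: mprotect() is usable here on Linux. */
    if (mprotect(guarded_page, page_size, PROT_READ | PROT_WRITE) != 0)
        _exit(1);
}

/*
 * Returns the number of faults taken on the guarded page,
 * or -1 if setup failed or the store did not land.
 */
int run_segv_sync_demo(void)
{
    struct sigaction sa, old;
    volatile char *p;
    int ok;

    page_size = (size_t)sysconf(_SC_PAGESIZE);
    guarded_page = mmap(NULL, page_size, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (guarded_page == MAP_FAILED)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0) {
        munmap(guarded_page, page_size);
        return -1;
    }

    fault_count = 0;
    p = guarded_page;
    p[0] = 42;                  /* first access faults; handler unprotects */
    ok = (p[0] == 42);          /* second access proceeds without a fault */

    sigaction(SIGSEGV, &old, NULL);
    munmap(guarded_page, page_size);
    return ok ? (int)fault_count : -1;
}
```

An application built this way observes exactly when a mapping becomes
accessible, so it can tell the difference between "mapped" and "not mapped"
at a given address, which is why the ordering of mapping changes can be
visible to it.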

Nonetheless I believe it is important that, at any point in time, for all
the threads in a given process, on all CPUs, a given virtual address should
always point to the same physical memory (or to nothing), i.e. we should
never have one CPU that sees different physical memory from another CPU for
the same virtual address.

> [snip proposed solution -- if the problem needs solving, we can argue
> about how to solve it later]

Well, I feel like you are also not discussing munmap(); the above seemed to
be about MAP_FIXED (replacing an existing mapping). For munmap() too, I
believe we should agree on what the expected behavior should be, and from
my POV, again, we should not allow a new mapping to appear until a "running"
munmap() is fully done (i.e. all CPU caches and TLBs flushed). The reason is
the same as above: all CPUs should always see the same physical memory (or
nothing) for a given virtual address.
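
To make the ordering rule concrete, here is a toy, single-threaded model of
it. This is not kernel code and not the solution snipped above; every name
(toy_range, toy_munmap_begin(), toy_tlb_flush_ack(), toy_mmap_try()) is
hypothetical. It only shows the invariant: a range being unmapped stays
unavailable for new mappings until every CPU that might hold a stale
translation has flushed it.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_range_state {
    TOY_RANGE_FREE,         /* nothing mapped, no stale translations */
    TOY_RANGE_MAPPED,       /* a mapping is installed */
    TOY_RANGE_UNMAPPING     /* munmap started, flushes still pending */
};

struct toy_range {
    enum toy_range_state state;
    int stale_tlbs;         /* CPUs that may still cache the old translation */
};

/* Start tearing down the mapping; ncpus CPUs may hold stale entries. */
void toy_munmap_begin(struct toy_range *r, int ncpus)
{
    if (ncpus <= 0) {
        r->stale_tlbs = 0;
        r->state = TOY_RANGE_FREE;
        return;
    }
    r->stale_tlbs = ncpus;
    r->state = TOY_RANGE_UNMAPPING;
}

/* One CPU reports that it has flushed its cached translation. */
void toy_tlb_flush_ack(struct toy_range *r)
{
    if (r->state != TOY_RANGE_UNMAPPING)
        return;
    if (--r->stale_tlbs == 0)
        r->state = TOY_RANGE_FREE;  /* only now is the munmap "done" */
}

/* A new mapping may appear only once the range is completely free. */
bool toy_mmap_try(struct toy_range *r)
{
    if (r->state != TOY_RANGE_FREE)
        return false;
    r->state = TOY_RANGE_MAPPED;
    return true;
}
```

With two CPUs holding stale entries, toy_mmap_try() keeps failing after the
first toy_tlb_flush_ack() and succeeds only after the second, so no CPU can
see the new physical memory while another still sees the old one.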

This is what we have today with the big rwsem, and I think we need to keep
that behavior even with concurrency. I do not believe this will impact
performance, and it is easy enough to solve, so I feel safer doing so given
it does not cost anything.

So I would rather argue about why we should change the current behavior if
we can fix the concurrency without changing it (hence why discussing the
solution might also be relevant here).

> > > Some people are concerned that a reference count on the VMA will lead to
> > > contention moving from the mmap_sem to the refcount on a very large VMA
> > > for workloads which have one giant VMA covering the entire working set.
> > > For those workloads, I propose we use the existing ->map_pages() callback
> > > (changed to return a vm_fault_t from the current void).
> > > 
> > > It will be called with the RCU lock held and no reference count on
> > > the vma.  If it needs to sleep, it should bump the refcount, drop the
> > > RCU lock, prepare enough so that the next call will not need to sleep,
> > > then drop the refcount and return VM_FAULT_RETRY so the VM knows the
> > > VMA is no longer good, and it needs to walk the VMA tree from the start.
> > 
> > Just to make sure i understand, you propose that ->map_pages() becomes
> > a new ->fault() handler that gets called before ->fault() without refcount
> > so that we can update fs/drivers slowly to perform better in the new scheme
> > (ie avoid the overhead of refcounting if possible at all) ?
> > 
> > The ->fault() callback would then be the "slow" path which will require
> > a refcount on the vma (taken by core mm code before dropping rcu lock).
> 
> I would actually propose never updating most drivers.  There's just no
> need for them to handle such an unstable and tricky situation as this.
> Let's not make driver writers lives harder.
> 
> For the ones which need this kind of scalability (and let's be clear, they
> would already have *better* scalability than today due to the rwsem being
> split into a per-VMA refcount), then yes, implementing ->map_pages would
> be the way to go.  Indeed, they would probably benefit from implementing
> it today, since it will reduce the number of page faults.

Yes, they will get better scalability, but I can see that some of those
drivers might want the extra few percent :) In any case, I believe it might
be better to give it a new name, i.e. fix the current map_pages() users and
rename that callback to something more explicit (atomic_map_pages() or
something similar; I am not good at naming). But otherwise this looks like a
good plan to avoid excessive refcount overhead.
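
As a way to check that we mean the same control flow, here is a toy
userspace model of the fast path and retry described in the quoted
proposal. It is a sketch only: there is no real RCU or VMA tree here, it is
single-threaded, and every toy_* name is hypothetical (toy_atomic_map_pages()
simply borrows the name suggested above). The "RCU lock" is a counter, and
the "sleeping" step is a flag flip standing in for I/O.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum toy_fault { TOY_FAULT_DONE, TOY_FAULT_RETRY };

struct toy_vma {
    atomic_int refcount;    /* pins the VMA while the handler may sleep */
    bool page_ready;        /* data can be mapped without sleeping */
    bool mapped;            /* translation has been installed */
};

static int toy_rcu_depth;   /* stand-in for an RCU read-side section */

static void toy_rcu_read_lock(void)   { toy_rcu_depth++; }
static void toy_rcu_read_unlock(void) { toy_rcu_depth--; }

/*
 * Called with the toy RCU lock held and no reference on the VMA.
 * Fast path: map without sleeping. Slow path: take a reference, drop
 * the RCU lock, prepare so the next call will not sleep, drop the
 * reference, and ask the caller to walk the tree again.
 */
static enum toy_fault toy_atomic_map_pages(struct toy_vma *vma)
{
    if (vma->page_ready) {
        vma->mapped = true;
        return TOY_FAULT_DONE;
    }

    atomic_fetch_add(&vma->refcount, 1);
    toy_rcu_read_unlock();
    vma->page_ready = true;         /* stands in for sleeping I/O */
    atomic_fetch_sub(&vma->refcount, 1);
    return TOY_FAULT_RETRY;         /* the VMA pointer is no longer trusted */
}

/* Returns how many times the (one-entry) "tree" was walked for this fault. */
int toy_handle_fault(struct toy_vma *vma_in_tree)
{
    int walks = 0;

    for (;;) {
        struct toy_vma *vma;

        toy_rcu_read_lock();
        vma = vma_in_tree;          /* stands in for the VMA tree lookup */
        walks++;

        if (toy_atomic_map_pages(vma) == TOY_FAULT_DONE) {
            toy_rcu_read_unlock();
            assert(toy_rcu_depth == 0);
            return walks;
        }
        /* RETRY: the callee already dropped the RCU lock; look up again. */
    }
}
```

In this model the first fault on a cold VMA walks the tree twice (one retry)
and a later fault walks it once, and the refcount is only touched on the
slow path, which is the point of avoiding the refcount overhead.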

Cheers,
Jérôme
