[LSF/MM TOPIC] mmap locking topics

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [LSF/MM TOPIC] mmap locking topics
@ 2021-06-01  4:48 Michel Lespinasse
  2021-06-01 14:01 ` Matthew Wilcox
  2021-06-01 17:39 ` Liam Howlett
  0 siblings, 2 replies; 5+ messages in thread
From: Michel Lespinasse @ 2021-06-01  4:48 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-mm

Hi,

I have two MM topics to propose for LSF/MM/BPF 2021,
both in the area of mmap lock performance:

I - Speculative page faults

The idea there is to avoid taking the mmap lock during page faults,
at least for the easier cases. This requiers the fault handler to be
a careful to avoid races with mmap writers (and most particularly
munmap), and when the new page is ready to be inserted into the user
process, to verify, at the last moment (after taking the page table
lock), that there has been no race between the fault handler and any
mmap writers.  Such checks can be implemented locally, without hitting
any global locks, which results in very nice scalability improvements
when processing concurrent faults.

I think the idea is ready for prime time, and a patchset has been proposed,
but it is not getting much traction yet. I suspect we will need to discuss
the idea in person to figure out the next steps.

II - Fine grained MM locking

A major limitation of the current mmap lock design is that it covers a
process's entire address space. In threaded applications, it is common
for threads to issue concurrent requests for non-overlapping parts of
the process address space - for example, one thread might be mmaping
new memory while another releases a different range, and a third might
fault within his own address range too. The current mmap lock design
does not take the non-overlapping ranges into consideration, and
consequently serialises the 3 above requests rather than letting them
proceed in parallel.

There has been a lot of work spent mitigating the problem by reducing
the mmap lock hold times (for example, dropping the mmap lock during
page faults that hit disk, or lowering to a read lock during longer
mmap/munmap/populate operations). But this approach is hitting its
limits, and I think it would be better to fix the core of the problem
by making the mmap lock capable of allowing concurrent non-overlapping
operations.

I would like to propose an approach that:
- separates the mmap lock into two separate locks, one that is only
  held for short periods of time to protect mm-wide data structures
  (including the vma tree), and another that functions as a range lock
  and can be held for longer periods of time;
- allows for incremental conversion from the current code to being
  aware about locking ranges;

I have been maintaining a prototype for this, which has been shared
with a small set of people. The main holdup is with page fault
performance; in order to allow non-overlapping writers to proceed
while some page faults are in progress, the prototype needs to
maintain a shared structure holding addresses for each pending page
fault. Updating this shared structure gets very expenside in high
concurrency page fault benchmarks, though it seems quite unnoticeable
in macro benchmarks I hae looked at.

Sorry for the lenghty proposal - I swear I've tried to keep it short :)

Thanks,

--
Michel "walken" Lespinasse

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [LSF/MM TOPIC] mmap locking topics
  2021-06-01  4:48 [LSF/MM TOPIC] mmap locking topics Michel Lespinasse
@ 2021-06-01 14:01 ` Matthew Wilcox
  2021-06-02  3:34   ` Michel Lespinasse
  2021-06-01 17:39 ` Liam Howlett
  1 sibling, 1 reply; 5+ messages in thread
From: Matthew Wilcox @ 2021-06-01 14:01 UTC (permalink / raw)
  To: Michel Lespinasse; +Cc: lsf-pc, linux-mm

On Mon, May 31, 2021 at 09:48:45PM -0700, Michel Lespinasse wrote:
> I - Speculative page faults
> 
> The idea there is to avoid taking the mmap lock during page faults,
> at least for the easier cases. This requiers the fault handler to be
> a careful to avoid races with mmap writers (and most particularly
> munmap), and when the new page is ready to be inserted into the user
> process, to verify, at the last moment (after taking the page table
> lock), that there has been no race between the fault handler and any
> mmap writers.  Such checks can be implemented locally, without hitting
> any global locks, which results in very nice scalability improvements
> when processing concurrent faults.
> 
> I think the idea is ready for prime time, and a patchset has been proposed,
> but it is not getting much traction yet. I suspect we will need to discuss
> the idea in person to figure out the next steps.

There is a lot of interest in this.  I disagree with Michel's approach
in that he wants to use seqlocks to detect whether any modification has
been made to the process's address space, whereas I want to use the VMA
tree to detect whether any modification has been made to this VMA.

> II - Fine grained MM locking
> 
> A major limitation of the current mmap lock design is that it covers a
> process's entire address space. In threaded applications, it is common
> for threads to issue concurrent requests for non-overlapping parts of
> the process address space - for example, one thread might be mmaping
> new memory while another releases a different range, and a third might
> fault within his own address range too. The current mmap lock design
> does not take the non-overlapping ranges into consideration, and
> consequently serialises the 3 above requests rather than letting them
> proceed in parallel.
> 
> There has been a lot of work spent mitigating the problem by reducing
> the mmap lock hold times (for example, dropping the mmap lock during
> page faults that hit disk, or lowering to a read lock during longer
> mmap/munmap/populate operations). But this approach is hitting its
> limits, and I think it would be better to fix the core of the problem
> by making the mmap lock capable of allowing concurrent non-overlapping
> operations.
> 
> I would like to propose an approach that:
> - separates the mmap lock into two separate locks, one that is only
>   held for short periods of time to protect mm-wide data structures
>   (including the vma tree), and another that functions as a range lock
>   and can be held for longer periods of time;
> - allows for incremental conversion from the current code to being
>   aware about locking ranges;
> 
> I have been maintaining a prototype for this, which has been shared
> with a small set of people. The main holdup is with page fault
> performance; in order to allow non-overlapping writers to proceed
> while some page faults are in progress, the prototype needs to
> maintain a shared structure holding addresses for each pending page
> fault. Updating this shared structure gets very expenside in high
> concurrency page fault benchmarks, though it seems quite unnoticeable
> in macro benchmarks I hae looked at.

Here I have larger disagreements with Michel.  I do not believe the
range lock is a useful tool for this problem.

The two topics above seem large enough, but there are other important
users of the mmap_sem that also hit contention.  /proc/$pid/maps, smaps
and similar files hit priority inversion problems which have been reduced,
but not solved.

The lockless VMA tree walk patches have run into problems (eg that
holding the RCU lock is not enough to prevent page table freeing),
and I don't have time to work on them right now due to the folio work.


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [LSF/MM TOPIC] mmap locking topics
  2021-06-01 14:01 ` Matthew Wilcox
@ 2021-06-02  3:34   ` Michel Lespinasse
  2021-06-10  1:29     ` Suren Baghdasaryan
  0 siblings, 1 reply; 5+ messages in thread
From: Michel Lespinasse @ 2021-06-02  3:34 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Michel Lespinasse, lsf-pc, linux-mm

On Tue, Jun 01, 2021 at 03:01:00PM +0100, Matthew Wilcox wrote:
> On Mon, May 31, 2021 at 09:48:45PM -0700, Michel Lespinasse wrote:
> > I - Speculative page faults
> > 
> > The idea there is to avoid taking the mmap lock during page faults,
> > at least for the easier cases. This requiers the fault handler to be
> > a careful to avoid races with mmap writers (and most particularly
> > munmap), and when the new page is ready to be inserted into the user
> > process, to verify, at the last moment (after taking the page table
> > lock), that there has been no race between the fault handler and any
> > mmap writers.  Such checks can be implemented locally, without hitting
> > any global locks, which results in very nice scalability improvements
> > when processing concurrent faults.
> > 
> > I think the idea is ready for prime time, and a patchset has been proposed,
> > but it is not getting much traction yet. I suspect we will need to discuss
> > the idea in person to figure out the next steps.
> 
> There is a lot of interest in this.  I disagree with Michel's approach
> in that he wants to use seqlocks to detect whether any modification has
> been made to the process's address space, whereas I want to use the VMA
> tree to detect whether any modification has been made to this VMA.

I see the sequence count as being the easy & safe approach, but yes it
does have limitations that can lead to unnecessary fast path aborts.
It would be nice checking the VMAs to avoid *some* of these aborts,
but I do not think that is always applicable either - I wrote about
that in https://lwn.net/ml/linux-kernel/20210430224649.GA29203@lespinasse.org/
("Thoughts about concurrency checks at the end of the page fault")

> > II - Fine grained MM locking
> > 
> > A major limitation of the current mmap lock design is that it covers a
> > process's entire address space. In threaded applications, it is common
> > for threads to issue concurrent requests for non-overlapping parts of
> > the process address space - for example, one thread might be mmaping
> > new memory while another releases a different range, and a third might
> > fault within his own address range too. The current mmap lock design
> > does not take the non-overlapping ranges into consideration, and
> > consequently serialises the 3 above requests rather than letting them
> > proceed in parallel.
> > 
> > There has been a lot of work spent mitigating the problem by reducing
> > the mmap lock hold times (for example, dropping the mmap lock during
> > page faults that hit disk, or lowering to a read lock during longer
> > mmap/munmap/populate operations). But this approach is hitting its
> > limits, and I think it would be better to fix the core of the problem
> > by making the mmap lock capable of allowing concurrent non-overlapping
> > operations.
> > 
> > I would like to propose an approach that:
> > - separates the mmap lock into two separate locks, one that is only
> >   held for short periods of time to protect mm-wide data structures
> >   (including the vma tree), and another that functions as a range lock
> >   and can be held for longer periods of time;
> > - allows for incremental conversion from the current code to being
> >   aware about locking ranges;
> > 
> > I have been maintaining a prototype for this, which has been shared
> > with a small set of people. The main holdup is with page fault
> > performance; in order to allow non-overlapping writers to proceed
> > while some page faults are in progress, the prototype needs to
> > maintain a shared structure holding addresses for each pending page
> > fault. Updating this shared structure gets very expenside in high
> > concurrency page fault benchmarks, though it seems quite unnoticeable
> > in macro benchmarks I hae looked at.
> 
> Here I have larger disagreements with Michel.  I do not believe the
> range lock is a useful tool for this problem.

Regardless of any proposed solution, do you agree that most of the cases
where mmap lock blocking happens are between non-overlapping memory
operations that could conceivably be handled concurrently ?

In other words - do you believe that range locks would be too slow to
be a useful solution, or is it that you do not think they would
actually solve the issue ?

> The two topics above seem large enough, but there are other important
> users of the mmap_sem that also hit contention.  /proc/$pid/maps, smaps
> and similar files hit priority inversion problems which have been reduced,
> but not solved.

Yes - I do think this would be worth discussing too. Not sure if that is
a separate topic, or if this should be brought under a larger theme.

Generally - I think there are many issues people have with mmap
locking, and it's been really hard to make progress addressing these -
even when prototype solutions exist - due to a lack of concensus (many
of the people involved have different ideas as to which of the issues
are important to them). But I think that's what makes this an important
topic to be discussed ?

--
Michel "walken" Lespinasse


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [LSF/MM TOPIC] mmap locking topics
  2021-06-02  3:34   ` Michel Lespinasse
@ 2021-06-10  1:29     ` Suren Baghdasaryan
  0 siblings, 0 replies; 5+ messages in thread
From: Suren Baghdasaryan @ 2021-06-10  1:29 UTC (permalink / raw)
  To: Michel Lespinasse; +Cc: Matthew Wilcox, lsf-pc, linux-mm, Laurent Dufour

On Tue, Jun 1, 2021 at 8:34 PM Michel Lespinasse <michel@lespinasse.org> wrote:
>
> On Tue, Jun 01, 2021 at 03:01:00PM +0100, Matthew Wilcox wrote:
> > On Mon, May 31, 2021 at 09:48:45PM -0700, Michel Lespinasse wrote:
> > > I - Speculative page faults
> > >
> > > The idea there is to avoid taking the mmap lock during page faults,
> > > at least for the easier cases. This requiers the fault handler to be
> > > a careful to avoid races with mmap writers (and most particularly
> > > munmap), and when the new page is ready to be inserted into the user
> > > process, to verify, at the last moment (after taking the page table
> > > lock), that there has been no race between the fault handler and any
> > > mmap writers.  Such checks can be implemented locally, without hitting
> > > any global locks, which results in very nice scalability improvements
> > > when processing concurrent faults.
> > >
> > > I think the idea is ready for prime time, and a patchset has been proposed,
> > > but it is not getting much traction yet. I suspect we will need to discuss
> > > the idea in person to figure out the next steps.
> >
> > There is a lot of interest in this.  I disagree with Michel's approach
> > in that he wants to use seqlocks to detect whether any modification has
> > been made to the process's address space, whereas I want to use the VMA
> > tree to detect whether any modification has been made to this VMA.
>
> I see the sequence count as being the easy & safe approach, but yes it
> does have limitations that can lead to unnecessary fast path aborts.
> It would be nice checking the VMAs to avoid *some* of these aborts,
> but I do not think that is always applicable either - I wrote about
> that in https://lwn.net/ml/linux-kernel/20210430224649.GA29203@lespinasse.org/
> ("Thoughts about concurrency checks at the end of the page fault")

In Android we use the previous implementation of SPF posted by Laurent
Dufour (https://lore.kernel.org/patchwork/project/lkml/list/?series=390741&state=%2A&archive=both)
which unfortunately did not make it upstream. It improves start time
of some heavy multi-threaded applications and this is one of the most
important metrics for mobile devices. This patchset is also one of the
biggest out-of-tree patchsets we have to maintain in Android kernel
and it would be immensely valuable for us to have it upstreamed. I
would love to see this discussed at LSF/MM, will provide as much
supporting data as I can and will also push vendors and OEMs to
provide data as well.

>
> > > II - Fine grained MM locking
> > >
> > > A major limitation of the current mmap lock design is that it covers a
> > > process's entire address space. In threaded applications, it is common
> > > for threads to issue concurrent requests for non-overlapping parts of
> > > the process address space - for example, one thread might be mmaping
> > > new memory while another releases a different range, and a third might
> > > fault within his own address range too. The current mmap lock design
> > > does not take the non-overlapping ranges into consideration, and
> > > consequently serialises the 3 above requests rather than letting them
> > > proceed in parallel.
> > >
> > > There has been a lot of work spent mitigating the problem by reducing
> > > the mmap lock hold times (for example, dropping the mmap lock during
> > > page faults that hit disk, or lowering to a read lock during longer
> > > mmap/munmap/populate operations). But this approach is hitting its
> > > limits, and I think it would be better to fix the core of the problem
> > > by making the mmap lock capable of allowing concurrent non-overlapping
> > > operations.
> > >
> > > I would like to propose an approach that:
> > > - separates the mmap lock into two separate locks, one that is only
> > >   held for short periods of time to protect mm-wide data structures
> > >   (including the vma tree), and another that functions as a range lock
> > >   and can be held for longer periods of time;
> > > - allows for incremental conversion from the current code to being
> > >   aware about locking ranges;
> > >
> > > I have been maintaining a prototype for this, which has been shared
> > > with a small set of people. The main holdup is with page fault
> > > performance; in order to allow non-overlapping writers to proceed
> > > while some page faults are in progress, the prototype needs to
> > > maintain a shared structure holding addresses for each pending page
> > > fault. Updating this shared structure gets very expenside in high
> > > concurrency page fault benchmarks, though it seems quite unnoticeable
> > > in macro benchmarks I hae looked at.
> >
> > Here I have larger disagreements with Michel.  I do not believe the
> > range lock is a useful tool for this problem.
>
> Regardless of any proposed solution, do you agree that most of the cases
> where mmap lock blocking happens are between non-overlapping memory
> operations that could conceivably be handled concurrently ?
>
> In other words - do you believe that range locks would be too slow to
> be a useful solution, or is it that you do not think they would
> actually solve the issue ?
>
> > The two topics above seem large enough, but there are other important
> > users of the mmap_sem that also hit contention.  /proc/$pid/maps, smaps
> > and similar files hit priority inversion problems which have been reduced,
> > but not solved.
>
> Yes - I do think this would be worth discussing too. Not sure if that is
> a separate topic, or if this should be brought under a larger theme.
>
> Generally - I think there are many issues people have with mmap
> locking, and it's been really hard to make progress addressing these -
> even when prototype solutions exist - due to a lack of concensus (many
> of the people involved have different ideas as to which of the issues
> are important to them). But I think that's what makes this an important
> topic to be discussed ?
>
> --
> Michel "walken" Lespinasse
>


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [LSF/MM TOPIC] mmap locking topics
  2021-06-01  4:48 [LSF/MM TOPIC] mmap locking topics Michel Lespinasse
  2021-06-01 14:01 ` Matthew Wilcox
@ 2021-06-01 17:39 ` Liam Howlett
  1 sibling, 0 replies; 5+ messages in thread
From: Liam Howlett @ 2021-06-01 17:39 UTC (permalink / raw)
  To: Michel Lespinasse; +Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org


* Michel Lespinasse <michel@lespinasse.org> [210601 00:48]:
> Hi,
> 
> I have two MM topics to propose for LSF/MM/BPF 2021,
> both in the area of mmap lock performance:
> 
> 
> I - Speculative page faults
> 
> The idea there is to avoid taking the mmap lock during page faults,
> at least for the easier cases. This requiers the fault handler to be
> a careful to avoid races with mmap writers (and most particularly
> munmap), and when the new page is ready to be inserted into the user
> process, to verify, at the last moment (after taking the page table
> lock), that there has been no race between the fault handler and any
> mmap writers.  Such checks can be implemented locally, without hitting
> any global locks, which results in very nice scalability improvements
> when processing concurrent faults.
> 
> I think the idea is ready for prime time, and a patchset has been proposed,
> but it is not getting much traction yet. I suspect we will need to discuss
> the idea in person to figure out the next steps.

I agree that the locking should be avoided, especially in this critical
path.  I'd like to do this by simplifying the data structures tracking
the VMAs.  I feel like adding more tracking and special cases will
further complicate the existing code - which is already overly
complicated.

> II - Fine grained MM locking
> 
> A major limitation of the current mmap lock design is that it covers a
> process's entire address space. In threaded applications, it is common
> for threads to issue concurrent requests for non-overlapping parts of
> the process address space - for example, one thread might be mmaping
> new memory while another releases a different range, and a third might
> fault within his own address range too. The current mmap lock design
> does not take the non-overlapping ranges into consideration, and
> consequently serialises the 3 above requests rather than letting them
> proceed in parallel.
> 
> There has been a lot of work spent mitigating the problem by reducing
> the mmap lock hold times (for example, dropping the mmap lock during
> page faults that hit disk, or lowering to a read lock during longer
> mmap/munmap/populate operations). But this approach is hitting its
> limits, and I think it would be better to fix the core of the problem
> by making the mmap lock capable of allowing concurrent non-overlapping
> operations.
> 
> I would like to propose an approach that:
> - separates the mmap lock into two separate locks, one that is only
>   held for short periods of time to protect mm-wide data structures
>   (including the vma tree), and another that functions as a range lock
>   and can be held for longer periods of time;
> - allows for incremental conversion from the current code to being
>   aware about locking ranges;
> 
> I have been maintaining a prototype for this, which has been shared
> with a small set of people. The main holdup is with page fault
> performance; in order to allow non-overlapping writers to proceed
> while some page faults are in progress, the prototype needs to
> maintain a shared structure holding addresses for each pending page
> fault. Updating this shared structure gets very expenside in high
> concurrency page fault benchmarks, though it seems quite unnoticeable
> in macro benchmarks I hae looked at.
> 

Although locking the entire VMA has caused a bottleneck with the
increased thread count in modern hardware, I do not believe locking a
range of VMAs is the answer.  There is currently 3 data structures plus
the mmap_sem (and sometimes the page table lock), not to mention the
reverse mapping - all to keep track of VMAs.


There are currently three projects with at least five organizations
involved in tackling the mmap semaphore locking issue.  It would be
beneficial for all involved to hash out an overall view of where these
solutions should fit into the larger picture.

I am aware of the following projects in this area:
 - Replacing the rbtree with the Maple Tree
 - Speculative page faults (SPF), as discussed above.
 - Range Locking of VMAs, as discussed above.

If anyone has any other projects under development or ideas, please reply and
add them.

Thanks,
Liam

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-06-10  1:29 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-06-01  4:48 [LSF/MM TOPIC] mmap locking topics Michel Lespinasse
2021-06-01 14:01 ` Matthew Wilcox
2021-06-02  3:34   ` Michel Lespinasse
2021-06-10  1:29     ` Suren Baghdasaryan
2021-06-01 17:39 ` Liam Howlett

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).