Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option


From: Nicholas Piggin <npiggin@gmail.com>
To: Christian Borntraeger <borntraeger@de.ibm.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Vasily Gorbik <gor@linux.ibm.com>,
	Heiko Carstens <hca@linux.ibm.com>,
	Andy Lutomirski <luto@kernel.org>, Will Deacon <will@kernel.org>
Cc: Anton Blanchard <anton@ozlabs.org>, Arnd Bergmann <arnd@arndb.de>,
	linux-arch <linux-arch@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>,
	linuxppc-dev <linuxppc-dev@lists.ozlabs.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Peter Zijlstra <peterz@infradead.org>, X86 ML <x86@kernel.org>
Subject: Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option
Date: Wed, 02 Dec 2020 13:47:40 +1000
Message-ID: <1606879302.tdngvs3yq4.astroid@bobo.none>
In-Reply-To: <CALCETrXAR_9EGaOF8ymVkZycxgZkYk0dR+NjEpTfVzdcS3sOVw@mail.gmail.com>

Excerpts from Andy Lutomirski's message of December 1, 2020 4:31 am:
> other arch folk: there's some background here:
> 
> https://lkml.kernel.org/r/CALCETrVXUbe8LfNn-Qs+DzrOQaiw+sFUg1J047yByV31SaTOZw@mail.gmail.com
> 
> On Sun, Nov 29, 2020 at 12:16 PM Andy Lutomirski <luto@kernel.org> wrote:
>>
>> On Sat, Nov 28, 2020 at 7:54 PM Andy Lutomirski <luto@kernel.org> wrote:
>> >
>> > On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin <npiggin@gmail.com> wrote:
>> > >
>> > > On big systems, the mm refcount can become highly contended when doing
>> > > a lot of context switching with threaded applications (particularly
>> > > switching between the idle thread and an application thread).
>> > >
>> > > Abandoning lazy tlb slows switching down quite a bit in the important
>> > > user->idle->user cases, so instead implement a non-refcounted scheme
>> > > that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>> > > any remaining lazy ones.
>> > >
>> > > Shootdown IPIs are some concern, but they have not been observed to be
>> > > a big problem with this scheme (the powerpc implementation generated
>> > > 314 additional interrupts on a 144 CPU system during a kernel compile).
>> > > There are a number of strategies that could be employed to reduce IPIs
>> > > if they turn out to be a problem for some workload.
>> >
>> > I'm still wondering whether we can do even better.
>> >
>>
>> Hold on a sec.. __mmput() unmaps VMAs, frees pagetables, and flushes
>> the TLB.  On x86, this will shoot down all lazies as long as even a
>> single pagetable was freed.  (Or at least it will if we don't have a
>> serious bug, but the code seems okay.  We'll hit pmd_free_tlb, which
>> sets tlb->freed_tables, which will trigger the IPI.)  So, on
>> architectures like x86, the shootdown approach should be free.  The
>> only way it ought to have any excess IPIs is if we have CPUs in
>> mm_cpumask() that don't need IPI to free pagetables, which could
>> happen on paravirt.
> 
> Indeed, on x86, we do this:
> 
> [   11.558844]  flush_tlb_mm_range.cold+0x18/0x1d
> [   11.559905]  tlb_finish_mmu+0x10e/0x1a0
> [   11.561068]  exit_mmap+0xc8/0x1a0
> [   11.561932]  mmput+0x29/0xd0
> [   11.562688]  do_exit+0x316/0xa90
> [   11.563588]  do_group_exit+0x34/0xb0
> [   11.564476]  __x64_sys_exit_group+0xf/0x10
> [   11.565512]  do_syscall_64+0x34/0x50
> 
> and we have info->freed_tables set.
> 
> What are the architectures that have large systems like?
> 
> x86: we already zap lazies, so it should cost basically nothing to do

This is not zapping lazies, this is freeing the user page tables.

"lazy mm" is where a switch to a kernel thread takes on the
previous mm for its kernel mapping rather than switch to init_mm.
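
To make the distinction concrete, here is a minimal userspace sketch of
the idea. It is a toy model and not kernel code: the toy_* names, the
switch counter, and the "lazy" flag are all assumptions made for
illustration. A kernel thread (mm == NULL) either borrows whatever mm
the CPU already has loaded, or forces a switch to init_mm.

```c
#include <assert.h>
#include <stddef.h>

/* Toy userspace model, not kernel code: every name here is hypothetical. */
struct toy_mm { int id; };

struct toy_task {
	struct toy_mm *mm;		/* NULL for a kernel thread */
};

struct toy_cpu {
	struct toy_mm *active_mm;	/* address space currently loaded */
	int mm_switches;		/* number of address space switches */
};

static struct toy_mm toy_init_mm = { 0 };

/*
 * Switch the CPU to 'next'. With lazy != 0, a kernel thread keeps the
 * previously loaded mm instead of switching to toy_init_mm.
 */
static void toy_context_switch(struct toy_cpu *cpu,
			       const struct toy_task *next, int lazy)
{
	struct toy_mm *target;

	assert(cpu != NULL && next != NULL);

	if (next->mm)
		target = next->mm;		/* user task: use its own mm */
	else if (lazy)
		target = cpu->active_mm;	/* kernel thread: borrow previous mm */
	else
		target = &toy_init_mm;		/* kernel thread: eager switch */

	if (target != cpu->active_mm) {
		cpu->active_mm = target;
		cpu->mm_switches++;
	}
}
```

In this model a user->idle->user sequence costs one address space
switch with lazy set and three without it. The borrowed mm is exactly
the reference that has to be either refcounted or shot down later.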

> a little loop at the end of __mmput() to make sure that no lazies are
> left.  If we care about paravirt performance, we could implement one
> of the optimizations I mentioned above to fix up the refcounts instead
> of sending an IPI to any remaining lazies.

With x86's scheme it might be possible to scan mm_cpumask, carefully
synchronized or something, when the last user reference gets dropped,
and free the lazy references at that point. But I don't know what that
would buy you, because you still have to maintain the mm_cpumask on
switches. powerpc's characteristics are just different here, so it
makes sense there, whereas I don't know if it would on x86.

> 
> arm64: AFAICT arm64's flush uses magic arm64 hardware support for
> remote flushes, so any lazy mm references will still exist after
> exit_mmap().  (arm64 uses lazy TLB, right?)  So this is kind of like
> the x86 paravirt case.  Are there large enough arm64 systems that any
> of this matters?
> 
> s390x: The code has too many acronyms for me to understand it fully,
> but I think it's more or less the same situation as arm64.  How big do
> s390x systems come?
> 
> power: Ridiculously complicated, seems to vary by system and kernel config.
> 
> So, Nick, your unconditional IPI scheme is apparently a big
> improvement for power, and it should be an improvement and have low
> cost for x86.

As I said, the tradeoffs are different, so I'm not so sure. It was a big
improvement on a very big system with the powerpc mm_cpumask switching
model, on a microbenchmark designed to stress this, which is about all
I can say for it.

> On arm64 and s390x it will add more IPIs on process
> exit but reduce contention on context switching depending on how lazy
> TLB works.  I suppose we could try it for all architectures without
> any further optimizations.

It will remain opt-in, but by all means try it out and see. There are
some requirements, as documented in the config option text.

> Or we could try one of the perhaps
> excessively clever improvements I linked above.  arm64, s390x people,
> what do you think?
> 

I'm not against improvements to the scheme, e.g., from the patch:

+               /*
+                * IPI overheads have not found to be expensive, but they could
+                * be reduced in a number of possible ways, for example (in
+                * roughly increasing order of complexity):
+                * - A batch of mms requiring IPIs could be gathered and freed
+                *   at once.
+                * - CPUs could store their active mm somewhere that can be
+                *   remotely checked without a lock, to filter out
+                *   false-positives in the cpumask.
+                * - After mm_users or mm_count reaches zero, switching away
+                *   from the mm could clear mm_cpumask to reduce some IPIs
+                *   (some batching or delaying would help).
+                * - A delayed freeing and RCU-like quiescing sequence based on
+                *   mm switching to avoid IPIs completely.
+                */

But I would like to have numbers before being too clever.
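
For illustration only, the basic shootdown and the second idea in that
comment (filtering cpumask false-positives by checking each CPU's
active mm) can be sketched as a single-threaded userspace toy. This is
not the patch's code: the toy_* names, the bitmask cpumask, and the
"filter" flag are assumptions, and a real implementation would have to
deal with races against concurrent context switches, which this model
ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_CPUS 8

/* Toy model, not kernel code: every name here is hypothetical. */
struct toy_mm {
	uint32_t cpumask;	/* bit n set: CPU n may have used this mm */
};

static struct toy_mm toy_init_mm;
static struct toy_mm *toy_active_mm[TOY_NR_CPUS];

/*
 * Make sure no CPU is still lazily using 'mm' before it is freed.
 * Returns the number of simulated IPIs. With filter != 0, CPUs whose
 * active mm is already something else are skipped.
 */
static int toy_shoot_lazies(struct toy_mm *mm, int filter)
{
	int ipis = 0;

	assert(mm != NULL);

	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		if (!(mm->cpumask & (UINT32_C(1) << cpu)))
			continue;	/* CPU never ran this mm */
		if (filter && toy_active_mm[cpu] != mm)
			continue;	/* stale cpumask bit, no IPI needed */

		ipis++;			/* simulated IPI to 'cpu' */

		/* Simulated IPI handler: drop the lazy mm if still in use. */
		if (toy_active_mm[cpu] == mm)
			toy_active_mm[cpu] = &toy_init_mm;
	}
	return ipis;
}
```

With four CPUs set in the mask and only two of them still lazily using
the mm, the unfiltered walk sends four IPIs and the filtered one sends
two. Whether that saving matters in practice is the part that needs
numbers.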

Thanks,
Nick
