From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <1472154763.2751.52.camel@redhat.com>
Subject: Re: [PATCH RFC UGLY] x86,mm,sched: make lazy TLB mode even lazier
From: Rik van Riel
To: "H. Peter Anvin", serebrin@google.com
Cc: mingo@kernel.org, peterz@infradead.org, linux-kernel@vger.kernel.org,
	luto@kernel.org, bp@suse.de, mgorman@suse.de, tglx@linutronix.de
Date: Thu, 25 Aug 2016 15:52:43 -0400
In-Reply-To: <7E7CF02F-F0B1-493A-98B3-B078174811DA@zytor.com>
References: <20160825150515.02c2d8ea@riellap.home.surriel.com>
	 <7E7CF02F-F0B1-493A-98B3-B078174811DA@zytor.com>
Organization: Red Hat, Inc.
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 2016-08-25 at 12:42 -0700, H. Peter Anvin wrote:
> On August 25, 2016 12:04:59 PM PDT, Rik van Riel wrote:
> > Subject: x86,mm,sched: make lazy TLB mode even lazier
> > 
> > Lazy TLB mode can result in an idle CPU being woken up for a TLB
> > flush, when all it really needed to do was flush %cr3 before the
> > next context switch.
> > 
> > This is mostly fine on bare metal, though sub-optimal from a power
> > saving point of view, and deeper C states could make TLB flushes
> > take a little longer than desired.
> > 
> > On virtual machines, the pain can be much worse, especially if a
> > currently non-running VCPU is woken up for a TLB invalidation IPI
> > on a CPU that is busy running another task. It could take a while
> > before that IPI is handled, leading to performance issues.
> > 
> > This patch is still ugly, and the sched.h include needs to be
> > cleaned up a lot (how would the scheduler people like to see the
> > context switch blocking abstracted?)
> > 
> > This patch deals with the issue by introducing a third TLB state,
> > TLBSTATE_FLUSH, which causes %cr3 to be flushed at the next
> > context switch. A CPU is transitioned from TLBSTATE_LAZY to
> > TLBSTATE_FLUSH with the rq lock held, to prevent context switches.
> > 
> > Nothing is done for a CPU that is already in TLBSTATE_FLUSH mode.
> > 
> > This patch is totally untested, because I am at a conference right
> > now, and Benjamin has the test case :)
> > 
> > Signed-off-by: Rik van Riel
> > Reported-by: Benjamin Serebrin
> > ---
> >  arch/x86/include/asm/tlbflush.h |  1 +
> >  arch/x86/mm/tlb.c               | 38 +++++++++++++++++++++++++++++++++++---
> >  2 files changed, 36 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> > index 4e5be94e079a..5ae8e4b174f8 100644
> > --- a/arch/x86/include/asm/tlbflush.h
> > +++ b/arch/x86/include/asm/tlbflush.h
> > @@ -310,6 +310,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
> > 
> >  #define TLBSTATE_OK	1
> >  #define TLBSTATE_LAZY	2
> > +#define TLBSTATE_FLUSH	3
> > 
> >  static inline void reset_lazy_tlbstate(void)
> >  {
> > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> > index 5643fd0b1a7d..5b4cda49ac0c 100644
> > --- a/arch/x86/mm/tlb.c
> > +++ b/arch/x86/mm/tlb.c
> > @@ -6,6 +6,7 @@
> >  #include
> >  #include
> >  #include
> > +#include "../../../kernel/sched/sched.h"
> > 
> >  #include
> >  #include
> > @@ -140,10 +141,12 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> >  	}
> >  #ifdef CONFIG_SMP
> >  	else {
> > +		int oldstate = this_cpu_read(cpu_tlbstate.state);
> >  		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
> >  		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
> > 
> > -		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
> > +		if (oldstate == TLBSTATE_FLUSH ||
> > +		    !cpumask_test_cpu(cpu, mm_cpumask(next))) {
> >  			/*
> >  			 * On established mms, the mm_cpumask is only changed
> >  			 * from irq context, from ptep_clear_flush() while in
> > @@ -242,11 +245,29 @@ static void flush_tlb_func(void *info)
> > 
> >  }
> > 
> > +/*
> > + * This function moves a CPU from TLBSTATE_LAZY to TLBSTATE_FLUSH, which
> > + * will force it to flush %cr3 at the next context switch, effectively
> > + * doing a delayed TLB flush for a CPU in lazy TLB mode.
> > + * This takes the runqueue lock to protect against the race condition
> > + * of the target CPU rescheduling while we change its TLB state.
> > + * Do nothing if the TLB state is already set to TLBSTATE_FLUSH.
> > + */
> > +static void set_lazy_tlbstate_flush(int cpu) {
> > +	if (per_cpu(cpu_tlbstate.state, cpu) == TLBSTATE_LAZY) {
> > +		raw_spin_lock(&cpu_rq(cpu)->lock);
> > +		if (per_cpu(cpu_tlbstate.state, cpu) == TLBSTATE_LAZY)
> > +			per_cpu(cpu_tlbstate.state, cpu) = TLBSTATE_FLUSH;
> > +		raw_spin_unlock(&cpu_rq(cpu)->lock);
> > +	}
> > +}
> > +
> >  void native_flush_tlb_others(const struct cpumask *cpumask,
> >  			     struct mm_struct *mm, unsigned long start,
> >  			     unsigned long end)
> >  {
> >  	struct flush_tlb_info info;
> > +	unsigned int cpu;
> > 
> >  	if (end == 0)
> >  		end = start + PAGE_SIZE;
> > @@ -262,8 +283,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
> >  				(end - start) >> PAGE_SHIFT);
> > 
> >  	if (is_uv_system()) {
> > -		unsigned int cpu;
> > -
> >  		cpu = smp_processor_id();
> >  		cpumask = uv_flush_tlb_others(cpumask, mm, start, end, cpu);
> >  		if (cpumask)
> > @@ -271,6 +290,19 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
> >  						&info, 1);
> >  		return;
> >  	}
> > +
> > +	/*
> > +	 * Instead of sending IPIs to CPUs in lazy TLB mode, move that
> > +	 * CPU's TLB state to TLBSTATE_FLUSH, causing the TLB to be
> > +	 * flushed at the next context switch.
> > +	 */
> > +	for_each_cpu(cpu, cpumask) {
> > +		if (per_cpu(cpu_tlbstate.state, cpu) != TLBSTATE_OK) {
> > +			set_lazy_tlbstate_flush(cpu);
> > +			cpumask_clear_cpu(cpu, (struct cpumask *)cpumask);
> > +		}
> > +	}
> > +
> >  	smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
> >  }
> 
> Why grabbing a lock instead of cmpxchg?

Good point, cmpxchg would work for doing a LAZY -> FLUSH transition,
and I should use it.

Additionally, my RFC patch has a race condition, in that it checks for
!TLBSTATE_OK outside of the lock, and clears the CPU from the cpumask
outside of the lock.

We do not need to check against a FLUSH -> OK transition that happens
while we are in native_flush_tlb_others(), because the page tables
will have been modified before, and the TLB is flushed at context
switch time.

We do not need to worry about not catching an OK -> LAZY transition,
either. In that case, the CPU will simply stay in the bitmap, and get
a TLB flush IPI. We can also continue doing nothing if a CPU is
already in TLBSTATE_FLUSH mode.

However, a LAZY -> OK transition needs to be caught, because if that
happens during native_flush_tlb_others(), a CPU may do a context
switch without a TLB flush.

Your cmpxchg idea would catch that last transition, and allow the CPU
to stay in the bitmap, so it can have its TLB flushed via an IPI.

I will change set_lazy_tlbstate_flush() from a void to a bool, use
cmpxchg, and do the cpumask_clear_cpu() only if the cmpxchg from
LAZY -> FLUSH succeeds.

Thanks for the feedback, Peter! This also gets rid of the ugly bits
of internal scheduler knowledge.