From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <1472154763.2751.52.camel@redhat.com>
Subject: Re: [PATCH RFC UGLY] x86,mm,sched: make lazy TLB mode even lazier
From: Rik van Riel
To: "H. Peter Anvin", serebrin@google.com
Cc: mingo@kernel.org, peterz@infradead.org, linux-kernel@vger.kernel.org,
	luto@kernel.org, bp@suse.de, mgorman@suse.de, tglx@linutronix.de
Date: Thu, 25 Aug 2016 15:52:43 -0400
In-Reply-To: <7E7CF02F-F0B1-493A-98B3-B078174811DA@zytor.com>
References: <20160825150515.02c2d8ea@riellap.home.surriel.com>
	 <7E7CF02F-F0B1-493A-98B3-B078174811DA@zytor.com>
Organization: Red Hat, Inc.
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 2016-08-25 at 12:42 -0700, H. Peter Anvin wrote:
> On August 25, 2016 12:04:59 PM PDT, Rik van Riel wrote:
> > Subject: x86,mm,sched: make lazy TLB mode even lazier
> > 
> > Lazy TLB mode can result in an idle CPU being woken up for a TLB
> > flush, when all it really needed to do was flush %cr3 before the
> > next context switch.
> > 
> > This is mostly fine on bare metal, though sub-optimal from a power
> > saving point of view, and deeper C states could make TLB flushes
> > take a little longer than desired.
> > 
> > On virtual machines, the pain can be much worse, especially if a
> > currently non-running VCPU is woken up for a TLB invalidation IPI
> > on a CPU that is busy running another task. It could take a while
> > before that IPI is handled, leading to performance issues.
> > 
> > This patch is still ugly, and the sched.h include needs to be
> > cleaned up a lot (how would the scheduler people like to see the
> > context switch blocking abstracted?)
> > 
> > This patch deals with the issue by introducing a third TLB state,
> > TLBSTATE_FLUSH, which causes %cr3 to be flushed at the next
> > context switch. A CPU is transitioned from TLBSTATE_LAZY to
> > TLBSTATE_FLUSH with the rq lock held, to prevent context switches.
> > 
> > Nothing is done for a CPU that is already in TLBSTATE_FLUSH mode.
> > 
> > This patch is totally untested, because I am at a conference right
> > now, and Benjamin has the test case :)
> > 
> > Signed-off-by: Rik van Riel
> > Reported-by: Benjamin Serebrin
> > ---
> >  arch/x86/include/asm/tlbflush.h |  1 +
> >  arch/x86/mm/tlb.c               | 38 +++++++++++++++++++++++++++++++++++---
> >  2 files changed, 36 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> > index 4e5be94e079a..5ae8e4b174f8 100644
> > --- a/arch/x86/include/asm/tlbflush.h
> > +++ b/arch/x86/include/asm/tlbflush.h
> > @@ -310,6 +310,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
> > 
> >  #define TLBSTATE_OK	1
> >  #define TLBSTATE_LAZY	2
> > +#define TLBSTATE_FLUSH	3
> > 
> >  static inline void reset_lazy_tlbstate(void)
> >  {
> > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> > index 5643fd0b1a7d..5b4cda49ac0c 100644
> > --- a/arch/x86/mm/tlb.c
> > +++ b/arch/x86/mm/tlb.c
> > @@ -6,6 +6,7 @@
> >  #include
> >  #include
> >  #include
> > +#include "../../../kernel/sched/sched.h"
> > 
> >  #include
> >  #include
> > @@ -140,10 +141,12 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> >  	}
> >  #ifdef CONFIG_SMP
> >  	else {
> > +		int oldstate = this_cpu_read(cpu_tlbstate.state);
> >  		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
> >  		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
> > 
> > -		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
> > +		if (oldstate == TLBSTATE_FLUSH ||
> > +		    !cpumask_test_cpu(cpu, mm_cpumask(next))) {
> >  			/*
> >  			 * On established mms, the mm_cpumask is only changed
> >  			 * from irq context, from ptep_clear_flush() while in
> > @@ -242,11 +245,29 @@ static void flush_tlb_func(void *info)
> > 
> >  }
> > 
> > +/*
> > + * This function moves a CPU from TLBSTATE_LAZY to TLBSTATE_FLUSH, which
> > + * will force it to flush %cr3 at the next context switch, effectively
> > + * doing a delayed TLB flush for a CPU in lazy TLB mode.
> > + * This takes the runqueue lock to protect against the race condition
> > + * of the target CPU rescheduling while we change its TLB state.
> > + * Do nothing if the TLB state is already set to TLBSTATE_FLUSH.
> > + */
> > +static void set_lazy_tlbstate_flush(int cpu) {
> > +	if (per_cpu(cpu_tlbstate.state, cpu) == TLBSTATE_LAZY) {
> > +		raw_spin_lock(&cpu_rq(cpu)->lock);
> > +		if (per_cpu(cpu_tlbstate.state, cpu) == TLBSTATE_LAZY)
> > +			per_cpu(cpu_tlbstate.state, cpu) = TLBSTATE_FLUSH;
> > +		raw_spin_unlock(&cpu_rq(cpu)->lock);
> > +	}
> > +}
> > +
> >  void native_flush_tlb_others(const struct cpumask *cpumask,
> >  			     struct mm_struct *mm, unsigned long start,
> >  			     unsigned long end)
> >  {
> >  	struct flush_tlb_info info;
> > +	unsigned int cpu;
> > 
> >  	if (end == 0)
> >  		end = start + PAGE_SIZE;
> > @@ -262,8 +283,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
> >  				(end - start) >> PAGE_SHIFT);
> > 
> >  	if (is_uv_system()) {
> > -		unsigned int cpu;
> > -
> >  		cpu = smp_processor_id();
> >  		cpumask = uv_flush_tlb_others(cpumask, mm, start, end, cpu);
> >  		if (cpumask)
> > @@ -271,6 +290,19 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
> >  						&info, 1);
> >  		return;
> >  	}
> > +
> > +	/*
> > +	 * Instead of sending IPIs to CPUs in lazy TLB mode, move that
> > +	 * CPU's TLB state to TLBSTATE_FLUSH, causing the TLB to be
> > +	 * flushed at the next context switch.
> > +	 */
> > +	for_each_cpu(cpu, cpumask) {
> > +		if (per_cpu(cpu_tlbstate.state, cpu) != TLBSTATE_OK) {
> > +			set_lazy_tlbstate_flush(cpu);
> > +			cpumask_clear_cpu(cpu, (struct cpumask *)cpumask);
> > +		}
> > +	}
> > +
> >  	smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
> >  }
> 
> Why grabbing a lock instead of cmpxchg?

Good point, cmpxchg would work for doing a LAZY -> FLUSH transition,
and I should use it.

Additionally, my RFC patch has a race condition, in that it checks for
!TLBSTATE_OK outside of the lock, and clears the CPU from the cpumask
outside of the lock.

We do not need to check against a FLUSH -> OK transition that happens
while we are in native_flush_tlb_others(), because the page tables
will have been modified before, and the TLB is flushed at context
switch time.

We do not need to worry about not catching an OK -> LAZY transition,
either. In that case, the CPU will simply stay in the bitmap, and get
a TLB flush IPI. We can also continue doing nothing if a CPU is
already in TLBSTATE_FLUSH mode.

However, a LAZY -> OK transition needs to be caught, because if that
happens during native_flush_tlb_others(), a CPU may do a context
switch without a TLB flush.

Your cmpxchg idea would catch that last transition, and allow the CPU
to stay in the bitmap, so it can have its TLB flushed via an IPI.

I will change set_lazy_tlbstate_flush() from a void to a bool, use
cmpxchg, and do the cpumask_clear_cpu() only if the cmpxchg from
LAZY -> FLUSH succeeds.

Thanks for the feedback, Peter! This also gets rid of the ugly bits
of internal scheduler knowledge.