All of lore.kernel.org
 help / color / mirror / Atom feed
From: Rik van Riel <riel@redhat.com>
To: "H. Peter Anvin" <hpa@zytor.com>, serebrin@google.com
Cc: mingo@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, luto@kernel.org, bp@suse.de,
	mgorman@suse.de, tglx@linutronix.de
Subject: Re: [PATCH RFC UGLY] x86,mm,sched: make lazy TLB mode even lazier
Date: Thu, 25 Aug 2016 15:52:43 -0400	[thread overview]
Message-ID: <1472154763.2751.52.camel@redhat.com> (raw)
In-Reply-To: <7E7CF02F-F0B1-493A-98B3-B078174811DA@zytor.com>

On Thu, 2016-08-25 at 12:42 -0700, H. Peter Anvin wrote:
> On August 25, 2016 12:04:59 PM PDT, Rik van Riel <riel@redhat.com>
> wrote:
> > Subject: x86,mm,sched: make lazy TLB mode even lazier
> > 
> > Lazy TLB mode can result in an idle CPU being woken up for a TLB
> > flush, when all it really needed to do was flush %cr3 before the
> > next context switch.
> > 
> > This is mostly fine on bare metal, though sub-optimal from a power
> > saving point of view, and deeper C states could make TLB flushes
> > take a little longer than desired.
> > 
> > On virtual machines, the pain can be much worse, especially if a
> > currently non-running VCPU is woken up for a TLB invalidation
> > IPI, on a CPU that is busy running another task. It could take
> > a while before that IPI is handled, leading to performance issues.
> > 
> > This patch is still ugly, and the sched.h include needs to be
> > cleaned
> > up a lot (how would the scheduler people like to see the context
> > switch
> > blocking abstracted?)
> > 
> > This patch deals with the issue by introducing a third tlb state,
> > TLBSTATE_FLUSH, which causes %cr3 to be flushed at the next
> > context switch. A CPU is transitioned from TLBSTATE_LAZY to
> > TLBSTATE_FLUSH with the rq lock held, to prevent context switches.
> > 
> > Nothing is done for a CPU that is already in TLBSTATE_FLUH mode.
> > 
> > This patch is totally untested, because I am at a conference right
> > now, and Benjamin has the test case :)
> > 
> > Signed-off-by: Rik van Riel <riel@redhat.com>
> > Reported-by: Benjamin Serebrin <serebrin@google.com>
> > ---
> > arch/x86/include/asm/tlbflush.h |  1 +
> > arch/x86/mm/tlb.c               | 38
> > +++++++++++++++++++++++++++++++++++---
> > 2 files changed, 36 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/tlbflush.h
> > b/arch/x86/include/asm/tlbflush.h
> > index 4e5be94e079a..5ae8e4b174f8 100644
> > --- a/arch/x86/include/asm/tlbflush.h
> > +++ b/arch/x86/include/asm/tlbflush.h
> > @@ -310,6 +310,7 @@ void native_flush_tlb_others(const struct
> > cpumask
> > *cpumask,
> > 
> > #define TLBSTATE_OK	1
> > #define TLBSTATE_LAZY	2
> > +#define TLBSTATE_FLUSH	3
> > 
> > static inline void reset_lazy_tlbstate(void)
> > {
> > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> > index 5643fd0b1a7d..5b4cda49ac0c 100644
> > --- a/arch/x86/mm/tlb.c
> > +++ b/arch/x86/mm/tlb.c
> > @@ -6,6 +6,7 @@
> > #include <linux/interrupt.h>
> > #include <linux/module.h>
> > #include <linux/cpu.h>
> > +#include "../../../kernel/sched/sched.h"
> > 
> > #include <asm/tlbflush.h>
> > #include <asm/mmu_context.h>
> > @@ -140,10 +141,12 @@ void switch_mm_irqs_off(struct mm_struct
> > *prev,
> > struct mm_struct *next,
> > 	}
> > #ifdef CONFIG_SMP
> > 	  else {
> > +		int oldstate = this_cpu_read(cpu_tlbstate.state);
> > 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
> > 		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
> > 
> > -		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
> > +		if (oldstate == TLBSTATE_FLUSH ||
> > +				!cpumask_test_cpu(cpu,
> > mm_cpumask(next))) {
> > 			/*
> > 			 * On established mms, the mm_cpumask is only
> > changed
> > 			 * from irq context, from ptep_clear_flush()
> > while in
> > @@ -242,11 +245,29 @@ static void flush_tlb_func(void *info)
> > 
> > }
> > 
> > +/*
> > + * This function moves a CPU from TLBSTATE_LAZY to TLBSTATE_FLUSH,
> > which
> > + * will force it to flush %cr3 at the next context switch,
> > effectively
> > + * doing a delayed TLB flush for a CPU in lazy TLB mode.
> > + * This takes the runqueue lock to protect against the race
> > condition
> > + * of the target CPU rescheduling while we change its TLB state.
> > + * Do nothing if the TLB state is already set to TLBSTATE_FLUSH.
> > + */
> > +static void set_lazy_tlbstate_flush(int cpu) {
> > +	if (per_cpu(cpu_tlbstate.state, cpu) == TLBSTATE_LAZY) {
> > +		raw_spin_lock(&cpu_rq(cpu)->lock);
> > +		if (per_cpu(cpu_tlbstate.state, cpu) ==
> > TLBSTATE_LAZY)
> > +			per_cpu(cpu_tlbstate.state, cpu) =
> > TLBSTATE_FLUSH;
> > +		raw_spin_unlock(&cpu_rq(cpu)->lock);
> > +	}
> > +}
> > +
> > void native_flush_tlb_others(const struct cpumask *cpumask,
> > 				 struct mm_struct *mm, unsigned long
> > start,
> > 				 unsigned long end)
> > {
> > 	struct flush_tlb_info info;
> > +	unsigned int cpu;
> > 
> > 	if (end == 0)
> > 		end = start + PAGE_SIZE;
> > @@ -262,8 +283,6 @@ void native_flush_tlb_others(const struct
> > cpumask
> > *cpumask,
> > 				(end - start) >> PAGE_SHIFT);
> > 
> > 	if (is_uv_system()) {
> > -		unsigned int cpu;
> > -
> > 		cpu = smp_processor_id();
> > 		cpumask = uv_flush_tlb_others(cpumask, mm, start, end,
> > cpu);
> > 		if (cpumask)
> > @@ -271,6 +290,19 @@ void native_flush_tlb_others(const struct
> > cpumask
> > *cpumask,
> > 								&info,
> > 1);
> > 		return;
> > 	}
> > +
> > +	/*
> > +	 * Instead of sending IPIs to CPUs in lazy TLB mode, move
> > that
> > +	 * CPUs TLB state to TLBSTATE_FLUSH, causing the TLB to be
> > flushed
> > +	 * at the next context switch.
> > +	 */
> > +	for_each_cpu(cpu, cpumask) {
> > +		if (per_cpu(cpu_tlbstate.state, cpu) !=
> > TLBSTATE_OK) {
> > +			set_lazy_tlbstate_flush(cpu);
> > +			cpumask_clear_cpu(cpu, (struct cpumask
> > *)cpumask);
> > +		}
> > +	}
> > +
> > 	smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
> > }
> > 
> 
> Why grabbing a lock instead of cmpxchg?

Good point, cmpxchg would work for doing a
LAZY -> FLUSH transition.

Additionally, my RFC patch has a race condition,
in that it checks for !TLBSTATE_OK outside of the
lock, and clears the CPU from the cpumask outside
of the lock.

We can indeed do the LAZY -> FLUSH transition with
cmpxchg, and I should do that.

We do not need to check against a FLUSH -> OK
transition that happens while we are in
native_flush_tlb_others, because the page tables
will have been modified before, and the TLB is
flushed at context switch time.

We do not need to worry about not catching an
OK -> LAZY transition, either. In that case,
the CPU will simply stay in the bitmap, and
get a TLB flush IPI.

We can also continue doing nothing if a CPU
is already in FLUSH TLB mode.

However, a LAZY -> OK transition needs to be
caught, because if that happens during a
native_flush_tlb_others, a CPU may do a context
switch without a TLB flush.

Your cmpxchg idea would catch that last transition,
and allow the CPU to stay in the bitmap, so it can
have its TLB flushed via an IPI.

I will change set_lazy_tlbstate_flush from a void
to a bool, use cmpxchg, and do the cpumask_clear_cpu
only if the cmpxchg from LAZY -> FLUSH succeeds.

Thanks for the feedback, Peter!

This also gets rid of the ugly bits of internal
scheduler knowledge.

  reply	other threads:[~2016-08-25 19:53 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-08-25 19:04 [PATCH RFC UGLY] x86,mm,sched: make lazy TLB mode even lazier Rik van Riel
2016-08-25 19:42 ` H. Peter Anvin
2016-08-25 19:52   ` Rik van Riel [this message]
2016-08-25 20:15   ` [PATCH RFC v2] " Rik van Riel
2016-08-25 21:01   ` [PATCH RFC v3] " Rik van Riel
2016-08-27  8:03     ` Ingo Molnar
2016-08-27 23:02       ` Linus Torvalds
2016-08-30 19:53         ` [PATCH RFC v4] " Rik van Riel
2016-08-30 21:09           ` [PATCH RFC v5] " Rik van Riel
2016-08-31 16:27         ` [PATCH RFC v6] " Rik van Riel
2016-09-08  6:56           ` Ingo Molnar
2016-09-09  0:09             ` Benjamin Serebrin
2016-09-09  4:39               ` Andy Lutomirski
2016-09-09  7:44                 ` Peter Zijlstra
2017-04-25  3:30                   ` Andy Lutomirski
2016-08-29 15:24       ` [PATCH RFC v3] " Rik van Riel
2016-08-29 16:08   ` [PATCH RFC UGLY] " Rik van Riel
2016-08-28  8:11 ` Andy Lutomirski
2016-08-29 14:54   ` Rik van Riel
2016-08-29 23:55     ` Andy Lutomirski
2016-08-30  1:14       ` H. Peter Anvin
2016-08-30 18:23         ` Andy Lutomirski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1472154763.2751.52.camel@redhat.com \
    --to=riel@redhat.com \
    --cc=bp@suse.de \
    --cc=hpa@zytor.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=serebrin@google.com \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.