All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ingo Molnar <mingo@kernel.org>
To: Rik van Riel <riel@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>,
	serebrin@google.com, peterz@infradead.org,
	linux-kernel@vger.kernel.org, luto@kernel.org, bp@suse.de,
	mgorman@suse.de, tglx@linutronix.de,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH RFC v3] x86,mm,sched: make lazy TLB mode even lazier
Date: Sat, 27 Aug 2016 10:03:16 +0200	[thread overview]
Message-ID: <20160827080316.GA11325@gmail.com> (raw)
In-Reply-To: <20160825170133.0a783ae8@riellap.home.surriel.com>


* Rik van Riel <riel@redhat.com> wrote:

> On Thu, 25 Aug 2016 12:42:15 -0700
> "H. Peter Anvin" <hpa@zytor.com> wrote:
> 
> > Why grabbing a lock instead of cmpxchg?
> 
> ... and some more cleanups later, this might actually be
> good to merge, assuming it works for Benjamin :)
> 
> ---8<---

LGTM in principle (it's a pretty clever trick!), just some minor stylistic nits:

> 
> Subject: x86,mm,sched: make lazy TLB mode even lazier
> 
> Lazy TLB mode can result in an idle CPU being woken up for a TLB
> flush, when all it really needed to do was flush %cr3 before the
> next context switch.

s/%cr3/CR3

> This is mostly fine on bare metal, though sub-optimal from a power
> saving point of view, and deeper C states could make TLB flushes
> take a little longer than desired.

s/C state/C-state/

> On virtual machines, the pain can be much worse, especially if a
> currently non-running VCPU is woken up for a TLB invalidation
> IPI, on a CPU that is busy running another task. It could take
> a while before that IPI is handled, leading to performance issues.

So I was going to suggest:

 s/VCPU/vCPU

But then, out of curiosity, I ran:

   triton:~/tip> for N in $(git log v3.0.. | grep -i vcpu | sed 's/[^a-zA-Z\d\s]/ /g'); do echo $N; done | grep -iw vcpu  | sort | uniq -c | sort -n
      3 Vcpu
    125 vCPU
    527 VCPU
   1611 vcpu

So never mind, I guess we missed that boat by a few years already ;-)

> 
> This patch deals with the issue by introducing a third tlb state,
> TLBSTATE_FLUSH, which causes %cr3 to be flushed at the next
> context switch.

s/tlb state/TLB state

> 
> A CPU that transitions from TLBSTATE_LAZY to TLBSTATE_OK during
> the attempted transition to TLBSTATE_FLUSH will get a TLB flush
> IPI, just like a CPU that was in TLBSTATE_OK to begin with.
> 
> Nothing is done for a CPU that is already in TLBSTATE_FLUSH mode.
> 
> This patch is totally untested, because I am at a conference right
> now, and Benjamin has the test case :)
> 
> Benjamin, does this help your issue?
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Reported-by: Benjamin Serebrin <serebrin@google.com>
> ---
>  arch/x86/include/asm/tlbflush.h |  1 +
>  arch/x86/mm/tlb.c               | 47 ++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 45 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 4e5be94e079a..5ae8e4b174f8 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -310,6 +310,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>  
>  #define TLBSTATE_OK	1
>  #define TLBSTATE_LAZY	2
> +#define TLBSTATE_FLUSH	3
>  
>  static inline void reset_lazy_tlbstate(void)
>  {
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 5643fd0b1a7d..4352db65a129 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -140,10 +140,12 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  	}
>  #ifdef CONFIG_SMP
>  	  else {
> +		int oldstate = this_cpu_read(cpu_tlbstate.state);
>  		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);

Please separate local variable definition blocks from statements by an extra 
newline.

>  		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
>  
> -		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
> +		if (oldstate == TLBSTATE_FLUSH ||
> +				!cpumask_test_cpu(cpu, mm_cpumask(next))) {
>  			/*
>  			 * On established mms, the mm_cpumask is only changed
>  			 * from irq context, from ptep_clear_flush() while in
> @@ -242,11 +244,42 @@ static void flush_tlb_func(void *info)
>  
>  }
>  
> +/*
> + * Determine whether a CPU's TLB needs to be flushed now, or whether the
> + * flush can be delayed until the next context switch, by changing the
> + * tlbstate from TLBSTATE_LAZY to TLBSTATE_FLUSH.
> + */
> +static bool lazy_tlb_can_skip_flush(int cpu) {
> +	int *tlbstate = &per_cpu(cpu_tlbstate.state, cpu);
> +	int old;

Yeah, so this is not how the standard function definition looks like.

> +	/* A task on the CPU is actively using the mm. Flush the TLB. */
> +	if (*tlbstate == TLBSTATE_OK)
> +		return false;
> +
> +	/* The TLB will be flushed on the next context switch. */
> +	if (*tlbstate == TLBSTATE_FLUSH)
> +		return true;
> +
> +	/*
> +	 * The CPU is in TLBSTATE_LAZY, which could context switch back
> +	 * to TLBSTATE_OK, re-using the TLB state without a TLB flush.
> +	 * In that case, a TLB flush IPI needs to be sent.
> +	 *
> +	 * Otherwise, the TLB state is now TLBSTATE_FLUSH, and the
> +	 * TLB flush IPI can be skipped.
> +	 */
> +	old = cmpxchg(tlbstate, TLBSTATE_LAZY, TLBSTATE_FLUSH);
> +
> +	return old != TLBSTATE_OK;
> +}
> +
>  void native_flush_tlb_others(const struct cpumask *cpumask,
>  				 struct mm_struct *mm, unsigned long start,
>  				 unsigned long end)
>  {
>  	struct flush_tlb_info info;
> +	unsigned int cpu;
>  
>  	if (end == 0)
>  		end = start + PAGE_SIZE;
> @@ -262,8 +295,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>  				(end - start) >> PAGE_SHIFT);
>  
>  	if (is_uv_system()) {
> -		unsigned int cpu;
> -
>  		cpu = smp_processor_id();
>  		cpumask = uv_flush_tlb_others(cpumask, mm, start, end, cpu);
>  		if (cpumask)
> @@ -271,6 +302,16 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>  								&info, 1);
>  		return;
>  	}
> +
> +	/*
> +	 * Instead of sending IPIs to CPUs in lazy TLB mode, move that
> +	 * CPUs TLB state to TLBSTATE_FLUSH, causing the TLB to be flushed
> +	 * at the next context switch.

s/CPUs/CPU's

> +	 */
> +	for_each_cpu(cpu, cpumask)
> +		if (lazy_tlb_can_skip_flush(cpu))
> +			cpumask_clear_cpu(cpu, (struct cpumask *)cpumask);

Please remove the 'const' from the cpumask type definition instead of this ugly 
cast!

Plus this needs one more pair of curly braces and it's perfect! ;-)

I'd also like to wait for the Tested-by from Benjamin as well before we can 
proceeed.

Thanks,

	Ingo

  reply	other threads:[~2016-08-27  8:03 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-08-25 19:04 [PATCH RFC UGLY] x86,mm,sched: make lazy TLB mode even lazier Rik van Riel
2016-08-25 19:42 ` H. Peter Anvin
2016-08-25 19:52   ` Rik van Riel
2016-08-25 20:15   ` [PATCH RFC v2] " Rik van Riel
2016-08-25 21:01   ` [PATCH RFC v3] " Rik van Riel
2016-08-27  8:03     ` Ingo Molnar [this message]
2016-08-27 23:02       ` Linus Torvalds
2016-08-30 19:53         ` [PATCH RFC v4] " Rik van Riel
2016-08-30 21:09           ` [PATCH RFC v5] " Rik van Riel
2016-08-31 16:27         ` [PATCH RFC v6] " Rik van Riel
2016-09-08  6:56           ` Ingo Molnar
2016-09-09  0:09             ` Benjamin Serebrin
2016-09-09  4:39               ` Andy Lutomirski
2016-09-09  7:44                 ` Peter Zijlstra
2017-04-25  3:30                   ` Andy Lutomirski
2016-08-29 15:24       ` [PATCH RFC v3] " Rik van Riel
2016-08-29 16:08   ` [PATCH RFC UGLY] " Rik van Riel
2016-08-28  8:11 ` Andy Lutomirski
2016-08-29 14:54   ` Rik van Riel
2016-08-29 23:55     ` Andy Lutomirski
2016-08-30  1:14       ` H. Peter Anvin
2016-08-30 18:23         ` Andy Lutomirski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160827080316.GA11325@gmail.com \
    --to=mingo@kernel.org \
    --cc=bp@suse.de \
    --cc=hpa@zytor.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=mgorman@suse.de \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=serebrin@google.com \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.