Re: [PATCH 0/3] TLB flush multiple pages per IPI v5

From: Ingo Molnar <mingo@kernel.org>
To: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Rik van Riel <riel@redhat.com>, Hugh Dickins <hughd@google.com>,
	Minchan Kim <minchan@kernel.org>,
	Andi Kleen <andi@firstfloor.org>, H Peter Anvin <hpa@zytor.com>,
	Linux-MM <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
Date: Mon, 8 Jun 2015 22:03:08 +0200
Message-ID: <20150608200308.GA16978@gmail.com>
In-Reply-To: <20150608195237.GA15429@gmail.com>


* Ingo Molnar <mingo@kernel.org> wrote:

> So what I measured agrees generally with the comment you added in the commit:
> 
>  + * Each single flush is about 100 ns, so this caps the maximum overhead at
>  + * _about_ 3,000 ns.
> 
> Let that sink in: 3,000 nsecs = 3 usecs, that's like an eternity!
> 
> A CR3 driven TLB flush takes less time than a single INVLPG (!):
> 
>    [    0.389028] x86/fpu: Cost of: __flush_tlb()               fn            :    96 cycles
>    [    0.405885] x86/fpu: Cost of: __flush_tlb_one()           fn            :   260 cycles
>    [    0.414302] x86/fpu: Cost of: __flush_tlb_range()         fn            :   404 cycles
> 
> it's true that a full flush has hidden costs not measured above, because it has 
> knock-on effects (because it drops non-global TLB entries), but it's not _that_ 
> bad due to:
> 
>   - there almost always being an L1 or L2 cache miss when a TLB miss occurs,
>     whose latency can be overlapped
> 
>   - global bit being held for kernel entries
> 
>   - user-space with high memory pressure thrashing through TLBs typically

I also have cache-cold numbers from another (Intel) system:

[    0.176473] x86/bench:##########################################################################
[    0.185656] x86/bench: Running x86 benchmarks:                     cache-    hot /   cold cycles
[    1.234448] x86/bench: Cost of: null                                    :     35 /     73 cycles
[    ........]
[   27.930451] x86/bench:########  MM instructions:          ######################################
[   28.979251] x86/bench: Cost of: __flush_tlb()             fn            :    251 /    366 cycles
[   30.028795] x86/bench: Cost of: __flush_tlb_global()      fn            :    746 /   1795 cycles
[   31.077862] x86/bench: Cost of: __flush_tlb_one()         fn            :    237 /    883 cycles
[   32.127371] x86/bench: Cost of: __flush_tlb_range()       fn            :    312 /   1603 cycles
[   35.254202] x86/bench: Cost of: wbinvd()                  insn          : 2491761 / 2491922 cycles

Note how the numbers are even worse in the cache-cold case: the algorithmic 
complexity of __flush_tlb_range() versus __flush_tlb() makes it run slower 
(because we miss the I$), while the TLB cache-preservation argument is probably 
weaker, because when we are cache-cold, TLB refill latency probably matters 
less (as it can be overlapped).
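
To make the hot/cold comparison concrete, here is a small Python sketch that
recomputes ratios from the four TLB rows of the log above. Only the cycle
counts come from the log; the names BENCH, cold_penalty() and
relative_to_full_flush() are illustrative and not part of the benchmark.

```python
# Cache-hot / cache-cold cycle counts copied from the x86/bench log above.
# The dictionary and helper names are hypothetical, chosen for illustration.
BENCH = {
    "__flush_tlb()": (251, 366),
    "__flush_tlb_global()": (746, 1795),
    "__flush_tlb_one()": (237, 883),
    "__flush_tlb_range()": (312, 1603),
}


def cold_penalty(hot, cold):
    """Return how many times slower the cache-cold run is than the hot run."""
    return cold / hot


def relative_to_full_flush(name, column):
    """Cost of `name` divided by the cost of __flush_tlb() in the same column.

    column 0 is the cache-hot measurement, column 1 the cache-cold one.
    """
    return BENCH[name][column] / BENCH["__flush_tlb()"][column]


if __name__ == "__main__":
    for name, (hot, cold) in BENCH.items():
        print(f"{name:24s} cold/hot = {cold_penalty(hot, cold):.2f}x")
    for column, label in ((0, "hot"), (1, "cold")):
        ratio = relative_to_full_flush("__flush_tlb_range()", column)
        print(f"__flush_tlb_range() vs __flush_tlb() ({label}): {ratio:.2f}x")
```

On these numbers, __flush_tlb_range() costs roughly 1.24x a __flush_tlb() when
cache-hot and roughly 4.38x when cache-cold, which is the widening gap the
paragraph above describes.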

So __flush_tlb_range() is software trying to beat hardware, and that's almost 
always a bad idea on x86.
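
As a rough illustration of that point, the sketch below estimates how many
single-page flushes it takes before they cost more raw cycles than one full
flush, using the __flush_tlb() and __flush_tlb_one() rows of the log. It
rests on two assumptions that are not measured here: that per-page cost
scales linearly with the page count, and that the TLB refill cost after a
full flush (the hidden cost mentioned in the quoted text) is ignored. The
helper name break_even_pages() is hypothetical.

```python
import math

# Cycle counts copied from the x86/bench log in this message: (hot, cold).
FULL_FLUSH = (251, 366)    # __flush_tlb()
SINGLE_PAGE = (237, 883)   # __flush_tlb_one()


def break_even_pages(full_cycles, single_cycles):
    """Smallest page count at which per-page flushing costs strictly more
    cycles than one full flush.

    Assumptions (not from the measurements): per-page cost is linear in the
    number of pages, and TLB refill costs after a full flush are ignored.
    """
    return math.floor(full_cycles / single_cycles) + 1


if __name__ == "__main__":
    for column, label in ((0, "hot"), (1, "cold")):
        pages = break_even_pages(FULL_FLUSH[column], SINGLE_PAGE[column])
        print(f"cache-{label}: per-page flushing loses from {pages} page(s) on")
```

Under those assumptions the per-page path is already more expensive at two
pages when cache-hot and at a single page when cache-cold. Any real threshold
would also have to account for the refill cost that this sketch leaves out.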

Thanks,

	Ingo

