Re: [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages

From: Ingo Molnar <mingo@kernel.org>
To: Mel Gorman <mgorman@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Rik van Riel <riel@redhat.com>, Hugh Dickins <hughd@google.com>,
	Minchan Kim <minchan@kernel.org>,
	Dave Hansen <dave.hansen@intel.com>,
	Andi Kleen <andi@firstfloor.org>, H Peter Anvin <hpa@zytor.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Linux-MM <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
Date: Wed, 10 Jun 2015 10:26:40 +0200
Message-ID: <20150610082640.GA24483@gmail.com>
In-Reply-To: <1433871118-15207-3-git-send-email-mgorman@suse.de>

* Mel Gorman <mgorman@suse.de> wrote:

> On a 4-socket machine the results were
> 
>                                         4.1.0-rc6          4.1.0-rc6
>                                     batchdirty-v6      batchunmap-v6
> Ops lru-file-mmap-read-elapsed   121.27 (  0.00%)   118.79 (  2.05%)
> 
>            4.1.0-rc6      4.1.0-rc6
>         batchdirty-v6 batchunmap-v6
> User          620.84         608.48
> System       4245.35        4152.89
> Elapsed       122.65         120.15
> 
> In this case the workload completed faster and there was less CPU overhead
> but as it's a NUMA machine there are a lot of factors at play. It's easier
> to quantify on a single socket machine;
> 
>                                         4.1.0-rc6          4.1.0-rc6
>                                     batchdirty-v6      batchunmap-v6
> Ops lru-file-mmap-read-elapsed    20.35 (  0.00%)    21.52 ( -5.75%)
> 
>               4.1.0-rc6       4.1.0-rc6
>         batchdirty-v6r5 batchunmap-v6r5
> User              58.02           60.70
> System            77.57           81.92
> Elapsed           22.14           23.16
> 
> That shows the workload takes 5.75% longer to complete with a similar
> increase in the system CPU usage.

Btw., do you have any stddev noise numbers?

The batching speedup is brutal enough to not need any noise estimations, it's a 
clear winner.

But this PFN tracking patch is more difficult to judge as the numbers are pretty 
close to each other.
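
To make the noise question concrete: one rough way to judge a result like this is to repeat each configuration several times and compare the gap between the means against the run-to-run spread. The sketch below is only an illustration. The helper names are made up here, and the 2-sigma threshold is an arbitrary assumption, not something the benchmark harness is known to do.

```python
import statistics

def relative_gain(baseline, candidate):
    """Percentage improvement of candidate over baseline (negative = slower)."""
    return (baseline - candidate) / baseline * 100.0

def exceeds_noise(baseline_runs, candidate_runs, sigmas=2.0):
    """Crude check (assumed threshold): is the gap between the means larger
    than `sigmas` times the larger sample standard deviation?
    This is not a substitute for a proper significance test."""
    gap = abs(statistics.mean(baseline_runs) - statistics.mean(candidate_runs))
    noise = max(statistics.stdev(baseline_runs),
                statistics.stdev(candidate_runs))
    return gap > sigmas * noise

# Elapsed times from the single-socket table quoted above.
print(round(relative_gain(20.35, 21.52), 2))   # -5.75
```

With per-run timings for both kernels, exceeds_noise() would say whether a 5.75% delta stands clear of the spread or disappears in it.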

> It is expected that there is overhead to tracking the PFNs and flushing 
> individual pages. This can be quantified but we cannot quantify the indirect 
> savings due to active unrelated TLB entries being preserved. Whether this 
> matters depends on whether the workload was using those entries and if they 
> would be used before a context switch but targeting the TLB flushes is the 
> conservative and safer choice.

So this is how I picture a realistic TLB flushing 'worst case': a workload that 
uses about 80% of the TLB cache in a 'fast' function and trashes memory in a 
'slow' function, and does alternate calls to the two functions from the same task.

Typical dTLB sizes on x86 are a couple of hundred entries (you can see the precise 
count in x86info -c), up to 1024 entries on the latest uarchs.

A cached TLB miss will take about 10-20 cycles (progressively more if the lookup 
chain misses in the cache) - but that cost is partially hidden if the L1 data 
cache was missed (which is likely for most TLB-flush intense workloads), and will 
be almost completely hidden if it goes out to the L3 cache or goes to RAM. (It 
takes up cache/memory bandwidth though, but unless the access patterns are totally 
sparse, it should be a small fraction.)

A single INVLPG with its 200+ cycles cost is equivalent to about 10-20 TLB misses. 
That's a lot.
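
The equivalence is plain division. As a small sketch, taking 200 cycles as the lower bound of the INVLPG cost and 10-20 cycles per cached TLB miss (the function name is just illustrative):

```python
def invlpg_in_tlb_misses(invlpg_cycles, miss_cycles_low, miss_cycles_high):
    """Return the (min, max) number of cached TLB misses that cost as much
    as a single INVLPG, given a range for the per-miss cost."""
    return (invlpg_cycles / miss_cycles_high,
            invlpg_cycles / miss_cycles_low)

print(invlpg_in_tlb_misses(200, 10, 20))   # (10.0, 20.0)
```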

So this kind of workload should trigger the TLB flushing 'worst case': with say 
512 dTLB entries you could see up to 5k-10k cycles of hidden/indirect cost, but 
potentially parallelized with other misses going on with the same data accesses.

The current limit for INVLPG flushing is 33 entries: that's 10k-20k cycles max 
with an INVLPG cost of 250 cycles - this could explain the results you got.
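
The two estimates can be put side by side as back-of-the-envelope arithmetic. This is only a sketch with the figures from this mail plugged in, and the function names are illustrative. It ignores any overlap of TLB misses with data cache misses, so the refill figure is an upper bound on the hidden cost.

```python
def refill_cost(tlb_entries, miss_cycles):
    """Cycles to re-populate `tlb_entries` entries after a full flush,
    assuming every entry is needed again and nothing is overlapped."""
    return tlb_entries * miss_cycles

def invlpg_cost(pages, invlpg_cycles):
    """Cycles spent flushing `pages` pages one INVLPG at a time."""
    return pages * invlpg_cycles

print(refill_cost(512, 10), refill_cost(512, 20))   # 5120 10240
print(invlpg_cost(33, 250))                         # 8250
```

Taken as a straight product, 33 flushes at 250 cycles come to about 8k cycles, which is within the 5k-10k range of the full-flush refill estimate. Reaching 10k-20k at 33 entries would take roughly 300-600 cycles per INVLPG.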

But the problem is: AFAICS you can only decrease the INVLPG count by decreasing 
the batching size - the additional IPI costs will overwhelm any TLB preservation 
benefits. So depending on the cost relationship between INVLPG, TLB miss cost and 
IPI cost, it might not be possible to see a speedup even in the worst-case.
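
A toy model of that trade-off, per remote CPU, could look like the sketch below. It is a simplification under stated assumptions: batches at or below the 33-entry ceiling are flushed with one INVLPG per page, larger batches cause a full flush followed by a refill of the live entries, and the IPI cost is a free parameter because no figure for it is given here. The function name and the value of 1000 cycles in the usage lines are purely hypothetical.

```python
import math

def unmap_cost(pages, batch_size, ipi_cycles, invlpg_cycles,
               refill_cycles, ceiling=33):
    """Estimated cycles on one remote CPU for unmapping `pages` pages
    in batches of `batch_size`, one IPI per batch (toy model)."""
    ipis = math.ceil(pages / batch_size)
    if batch_size <= ceiling:
        # Targeted flush: every page costs one INVLPG, plus one IPI per batch.
        return ipis * ipi_cycles + pages * invlpg_cycles
    # Full flush: each IPI throws the TLB away and the live entries refill.
    return ipis * (ipi_cycles + refill_cycles)

# Hypothetical IPI cost of 1000 cycles; refill taken as 512 entries * 10 cycles.
print(unmap_cost(330, 33, 1000, 250, 5120))    # 92500
print(unmap_cost(330, 330, 1000, 250, 5120))   # 6120
```

In this model the targeted variant only wins when the refill term is large compared with both the INVLPG cost per batch and the extra IPIs, which is the cost relationship described above.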

Thanks,

	Ingo
