From: Dave Chinner <david@fromorbit.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt B <jackdachef@gmail.com>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
xfs@oss.sgi.com, linux-mm <linux-mm@kvack.org>,
Mel Gorman <mgorman@suse.de>,
Andrew Morton <akpm@linux-foundation.org>,
Ingo Molnar <mingo@kernel.org>
Subject: Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
Date: Tue, 3 Mar 2015 16:20:04 +1100 [thread overview]
Message-ID: <20150303052004.GM18360@dastard> (raw)
In-Reply-To: <CA+55aFw+fb=Fh4M2wA4dVskgqN7PhZRGZS6JTMx4Rb1Qn++oaA@mail.gmail.com>
On Mon, Mar 02, 2015 at 06:37:47PM -0800, Linus Torvalds wrote:
> On Mon, Mar 2, 2015 at 6:22 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > There might be some other case where the new "just change the
> > protection" doesn't do the "oh, but it the protection didn't change,
> > don't bother flushing". I don't see it.
>
> Hmm. I wonder.. In change_pte_range(), we just unconditionally change
> the protection bits.
>
> But the old numa code used to do
>
> if (!pte_numa(oldpte)) {
> ptep_set_numa(mm, addr, pte);
>
> so it would actually avoid the pte update if a numa-prot page was
> marked numa-prot again.
>
> But are those migrate-page calls really common enough to make these
> things happen often enough on the same pages for this all to matter?
It's looking like that's a possibility. I am running a fake-numa=4
config on this test VM so it's got 4 nodes of 4p/4GB RAM each.
both kernels are running through the same page fault path and that
is straight through migrate_pages().
3.19:
13.70% 0.01% [kernel] [k] native_flush_tlb_others
- native_flush_tlb_others
- 98.58% flush_tlb_page
ptep_clear_flush
try_to_unmap_one
rmap_walk
try_to_unmap
migrate_pages
migrate_misplaced_page
- handle_mm_fault
- 96.88% __do_page_fault
trace_do_page_fault
do_async_page_fault
+ async_page_fault
+ 3.12% __get_user_pages
+ 1.40% flush_tlb_mm_range
4.0-rc1:
- 67.12% 0.04% [kernel] [k] native_flush_tlb_others
- native_flush_tlb_others
- 99.80% flush_tlb_page
ptep_clear_flush
try_to_unmap_one
rmap_walk
try_to_unmap
migrate_pages
migrate_misplaced_page
- handle_mm_fault
- 99.50% __do_page_fault
trace_do_page_fault
do_async_page_fault
- async_page_fault
Same call chain, just a lot more CPU used further down the stack.
> Odd.
>
> So it would be good if your profiles just show "there's suddenly a
> *lot* more calls to flush_tlb_page() from XYZ" and the culprit is
> obvious that way..
Ok, I did a simple 'perf stat -e tlb:tlb_flush -a -r 6 sleep 10' to
count all the tlb flush events from the kernel. I then pulled the
full events for a 30s period to get a sampling of the reason
associated with each flush event.
4.0-rc1:
Performance counter stats for 'system wide' (6 runs):
2,190,503 tlb:tlb_flush ( +- 8.30% )
10.001970663 seconds time elapsed ( +- 0.00% )
The reason breakdown:
81% TLB_REMOTE_SHOOTDOWN
19% TLB_FLUSH_ON_TASK_SWITCH
3.19:
Performance counter stats for 'system wide' (6 runs):
467,151 tlb:tlb_flush ( +- 25.50% )
10.002021491 seconds time elapsed ( +- 0.00% )
The reason breakdown:
6% TLB_REMOTE_SHOOTDOWN
94% TLB_FLUSH_ON_TASK_SWITCH
The difference would appear to be the number of remote TLB
shootdowns that are occurring from otherwise identical page fault
paths.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
WARNING: multiple messages have this Message-ID (diff)
From: Dave Chinner <david@fromorbit.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>,
Andrew Morton <akpm@linux-foundation.org>,
Ingo Molnar <mingo@kernel.org>, Matt B <jackdachef@gmail.com>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
linux-mm <linux-mm@kvack.org>,
xfs@oss.sgi.com
Subject: Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
Date: Tue, 3 Mar 2015 16:20:04 +1100 [thread overview]
Message-ID: <20150303052004.GM18360@dastard> (raw)
In-Reply-To: <CA+55aFw+fb=Fh4M2wA4dVskgqN7PhZRGZS6JTMx4Rb1Qn++oaA@mail.gmail.com>
On Mon, Mar 02, 2015 at 06:37:47PM -0800, Linus Torvalds wrote:
> On Mon, Mar 2, 2015 at 6:22 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > There might be some other case where the new "just change the
> > protection" doesn't do the "oh, but it the protection didn't change,
> > don't bother flushing". I don't see it.
>
> Hmm. I wonder.. In change_pte_range(), we just unconditionally change
> the protection bits.
>
> But the old numa code used to do
>
> if (!pte_numa(oldpte)) {
> ptep_set_numa(mm, addr, pte);
>
> so it would actually avoid the pte update if a numa-prot page was
> marked numa-prot again.
>
> But are those migrate-page calls really common enough to make these
> things happen often enough on the same pages for this all to matter?
It's looking like that's a possibility. I am running a fake-numa=4
config on this test VM so it's got 4 nodes of 4p/4GB RAM each.
both kernels are running through the same page fault path and that
is straight through migrate_pages().
3.19:
13.70% 0.01% [kernel] [k] native_flush_tlb_others
- native_flush_tlb_others
- 98.58% flush_tlb_page
ptep_clear_flush
try_to_unmap_one
rmap_walk
try_to_unmap
migrate_pages
migrate_misplaced_page
- handle_mm_fault
- 96.88% __do_page_fault
trace_do_page_fault
do_async_page_fault
+ async_page_fault
+ 3.12% __get_user_pages
+ 1.40% flush_tlb_mm_range
4.0-rc1:
- 67.12% 0.04% [kernel] [k] native_flush_tlb_others
- native_flush_tlb_others
- 99.80% flush_tlb_page
ptep_clear_flush
try_to_unmap_one
rmap_walk
try_to_unmap
migrate_pages
migrate_misplaced_page
- handle_mm_fault
- 99.50% __do_page_fault
trace_do_page_fault
do_async_page_fault
- async_page_fault
Same call chain, just a lot more CPU used further down the stack.
> Odd.
>
> So it would be good if your profiles just show "there's suddenly a
> *lot* more calls to flush_tlb_page() from XYZ" and the culprit is
> obvious that way..
Ok, I did a simple 'perf stat -e tlb:tlb_flush -a -r 6 sleep 10' to
count all the tlb flush events from the kernel. I then pulled the
full events for a 30s period to get a sampling of the reason
associated with each flush event.
4.0-rc1:
Performance counter stats for 'system wide' (6 runs):
2,190,503 tlb:tlb_flush ( +- 8.30% )
10.001970663 seconds time elapsed ( +- 0.00% )
The reason breakdown:
81% TLB_REMOTE_SHOOTDOWN
19% TLB_FLUSH_ON_TASK_SWITCH
3.19:
Performance counter stats for 'system wide' (6 runs):
467,151 tlb:tlb_flush ( +- 25.50% )
10.002021491 seconds time elapsed ( +- 0.00% )
The reason breakdown:
6% TLB_REMOTE_SHOOTDOWN
94% TLB_FLUSH_ON_TASK_SWITCH
The difference would appear to be the number of remote TLB
shootdowns that are occurring from otherwise identical page fault
paths.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
WARNING: multiple messages have this Message-ID (diff)
From: Dave Chinner <david@fromorbit.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>,
Andrew Morton <akpm@linux-foundation.org>,
Ingo Molnar <mingo@kernel.org>, Matt B <jackdachef@gmail.com>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
linux-mm <linux-mm@kvack.org>,
xfs@oss.sgi.com
Subject: Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
Date: Tue, 3 Mar 2015 16:20:04 +1100 [thread overview]
Message-ID: <20150303052004.GM18360@dastard> (raw)
In-Reply-To: <CA+55aFw+fb=Fh4M2wA4dVskgqN7PhZRGZS6JTMx4Rb1Qn++oaA@mail.gmail.com>
On Mon, Mar 02, 2015 at 06:37:47PM -0800, Linus Torvalds wrote:
> On Mon, Mar 2, 2015 at 6:22 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > There might be some other case where the new "just change the
> > protection" doesn't do the "oh, but it the protection didn't change,
> > don't bother flushing". I don't see it.
>
> Hmm. I wonder.. In change_pte_range(), we just unconditionally change
> the protection bits.
>
> But the old numa code used to do
>
> if (!pte_numa(oldpte)) {
> ptep_set_numa(mm, addr, pte);
>
> so it would actually avoid the pte update if a numa-prot page was
> marked numa-prot again.
>
> But are those migrate-page calls really common enough to make these
> things happen often enough on the same pages for this all to matter?
It's looking like that's a possibility. I am running a fake-numa=4
config on this test VM so it's got 4 nodes of 4p/4GB RAM each.
both kernels are running through the same page fault path and that
is straight through migrate_pages().
3.19:
13.70% 0.01% [kernel] [k] native_flush_tlb_others
- native_flush_tlb_others
- 98.58% flush_tlb_page
ptep_clear_flush
try_to_unmap_one
rmap_walk
try_to_unmap
migrate_pages
migrate_misplaced_page
- handle_mm_fault
- 96.88% __do_page_fault
trace_do_page_fault
do_async_page_fault
+ async_page_fault
+ 3.12% __get_user_pages
+ 1.40% flush_tlb_mm_range
4.0-rc1:
- 67.12% 0.04% [kernel] [k] native_flush_tlb_others
- native_flush_tlb_others
- 99.80% flush_tlb_page
ptep_clear_flush
try_to_unmap_one
rmap_walk
try_to_unmap
migrate_pages
migrate_misplaced_page
- handle_mm_fault
- 99.50% __do_page_fault
trace_do_page_fault
do_async_page_fault
- async_page_fault
Same call chain, just a lot more CPU used further down the stack.
> Odd.
>
> So it would be good if your profiles just show "there's suddenly a
> *lot* more calls to flush_tlb_page() from XYZ" and the culprit is
> obvious that way..
Ok, I did a simple 'perf stat -e tlb:tlb_flush -a -r 6 sleep 10' to
count all the tlb flush events from the kernel. I then pulled the
full events for a 30s period to get a sampling of the reason
associated with each flush event.
4.0-rc1:
Performance counter stats for 'system wide' (6 runs):
2,190,503 tlb:tlb_flush ( +- 8.30% )
10.001970663 seconds time elapsed ( +- 0.00% )
The reason breakdown:
81% TLB_REMOTE_SHOOTDOWN
19% TLB_FLUSH_ON_TASK_SWITCH
3.19:
Performance counter stats for 'system wide' (6 runs):
467,151 tlb:tlb_flush ( +- 25.50% )
10.002021491 seconds time elapsed ( +- 0.00% )
The reason breakdown:
6% TLB_REMOTE_SHOOTDOWN
94% TLB_FLUSH_ON_TASK_SWITCH
The difference would appear to be the number of remote TLB
shootdowns that are occurring from otherwise identical page fault
paths.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
next prev parent reply other threads:[~2015-03-03 5:20 UTC|newest]
Thread overview: 45+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-03-02 1:04 [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation Dave Chinner
2015-03-02 1:04 ` Dave Chinner
2015-03-02 1:04 ` Dave Chinner
2015-03-02 19:47 ` Linus Torvalds
2015-03-02 19:47 ` Linus Torvalds
2015-03-02 19:47 ` Linus Torvalds
2015-03-03 1:47 ` Dave Chinner
2015-03-03 1:47 ` Dave Chinner
2015-03-03 1:47 ` Dave Chinner
2015-03-03 2:22 ` Linus Torvalds
2015-03-03 2:22 ` Linus Torvalds
2015-03-03 2:22 ` Linus Torvalds
2015-03-03 2:37 ` Linus Torvalds
2015-03-03 2:37 ` Linus Torvalds
2015-03-03 2:37 ` Linus Torvalds
2015-03-03 5:20 ` Dave Chinner [this message]
2015-03-03 5:20 ` Dave Chinner
2015-03-03 5:20 ` Dave Chinner
2015-03-03 6:56 ` Linus Torvalds
2015-03-03 6:56 ` Linus Torvalds
2015-03-03 6:56 ` Linus Torvalds
2015-03-03 11:34 ` Dave Chinner
2015-03-03 11:34 ` Dave Chinner
2015-03-03 11:34 ` Dave Chinner
2015-03-03 13:43 ` Mel Gorman
2015-03-03 13:43 ` Mel Gorman
2015-03-03 13:43 ` Mel Gorman
2015-03-03 21:33 ` Dave Chinner
2015-03-03 21:33 ` Dave Chinner
2015-03-03 21:33 ` Dave Chinner
2015-03-04 20:00 ` Mel Gorman
2015-03-04 20:00 ` Mel Gorman
2015-03-04 20:00 ` Mel Gorman
2015-03-04 23:00 ` Dave Chinner
2015-03-04 23:00 ` Dave Chinner
2015-03-04 23:00 ` Dave Chinner
2015-03-04 23:35 ` Ingo Molnar
2015-03-04 23:35 ` Ingo Molnar
2015-03-04 23:35 ` Ingo Molnar
2015-03-04 23:51 ` Dave Chinner
2015-03-04 23:51 ` Dave Chinner
2015-03-04 23:51 ` Dave Chinner
-- strict thread matches above, loose matches on Subject: below --
2015-03-02 19:17 Matt
2015-03-02 19:25 ` Dave Hansen
2015-03-02 19:45 ` Matt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20150303052004.GM18360@dastard \
--to=david@fromorbit.com \
--cc=akpm@linux-foundation.org \
--cc=jackdachef@gmail.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@suse.de \
--cc=mingo@kernel.org \
--cc=torvalds@linux-foundation.org \
--cc=xfs@oss.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.