[regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.


* [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
@ 2015-03-02 19:17 Matt
  2015-03-02 19:25 ` Dave Hansen
  0 siblings, 1 reply; 17+ messages in thread
From: Matt @ 2015-03-02 19:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Linux Kernel, Linus Torvalds, Will Deacon, Dave Hansen

Hi Dave,


is the following thread and patch related to your problem?

I just happened to stumble upon it a few days ago:

https://lkml.org/lkml/2014/12/17/280 ,
http://marc.info/?l=linux-kernel&m=141876582909898&w=2
Re: post-3.18 performance regression in TLB flushing code

Linus already posted a fix to the problem, however I can't seem to
find the matching commit in his tree (searching for "TLC regression"
or "TLB cache").


Kind Regards

Matt


* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
  2015-03-02 19:17 [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation Matt
@ 2015-03-02 19:25 ` Dave Hansen
  2015-03-02 19:45   ` Matt
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Hansen @ 2015-03-02 19:25 UTC (permalink / raw)
  To: Matt, Dave Chinner; +Cc: Linux Kernel, Linus Torvalds, Will Deacon

On 03/02/2015 11:17 AM, Matt wrote:
> Linus already posted a fix to the problem, however I can't seem to
> find the matching commit in his tree (searching for "TLC regression"
> or "TLB cache").

It's in 721c21c17ab958abf19a8fc611c3bd4743680e38 iirc.


* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
  2015-03-02 19:25 ` Dave Hansen
@ 2015-03-02 19:45   ` Matt
  0 siblings, 0 replies; 17+ messages in thread
From: Matt @ 2015-03-02 19:45 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Linux Kernel, Linus Torvalds, Will Deacon

On Mon, Mar 2, 2015 at 8:25 PM, Dave Hansen <dave@sr71.net> wrote:
> On 03/02/2015 11:17 AM, Matt wrote:
>> Linus already posted a fix to the problem, however I can't seem to
>> find the matching commit in his tree (searching for "TLC regression"
>> or "TLB cache").
>
> It's in 721c21c17ab958abf19a8fc611c3bd4743680e38 iirc.

Mea culpa, should have looked at the date of the thread - was just
grasping at straws in an attempt to help :/

I'll refrain from posting in this thread then to avoid clutter & load
to the list

(this is way over my head, I'm mostly doing minor patch porting and
custom kernels as a hobby)

Kind Regards

Matt


* [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
@ 2015-03-02  1:04 Dave Chinner
  2015-03-02 19:47 ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2015-03-02  1:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, xfs

Hi folks,

Running one of my usual benchmarks (fsmark to create 50 million zero
length files in a 500TB filesystem, then running xfs_repair on it)
has indicated a significant regression in xfs_repair performance.

config				  3.19		4.0-rc1
defaults			 8m08s		  9m34s
-o ag_stride=-1			 4m04s		  4m38s
-o bhash=101073			 6m04s		 17m43s
-o ag_stride=-1,bhash=101073	 4m54s		  9m58s

The default is to create a number of concurrent threads to process
AGs in parallel (https://lkml.org/lkml/2014/7/3/15), and this is
running on a 500AG filesystem so lots of parallelism. "-o
ag_stride=-1" turns this off, and just leaves a single prefetch
group working on AGs sequentially. As you can see, turning off the
concurrency halves the runtime.

The concurrency is really there for large spinning disk arrays,
where IO wait time dominates performance. I'm running on SSDs, so
there is almost no IO wait time.

The "-o bhash=X" controls the size of the buffer cache. The default
value is 4096, which means xfs_repair is operating with a memory
footprint of about 1GB and is small enough to suffer from readahead
thrashing on large filesystems. Setting it to 101073 increases that
to around 7-10GB and prevents readahead thrashing, so should run
much faster than the default concurrent config. It does run faster
for 3.19, but for 4.0-rc1 it runs almost twice as slow, and burns a
huge amount of system CPU time doing so.

Across the board the 4.0-rc1 numbers are much slower, and the
degradation is far worse when using the large memory footprint
configs. Perf points straight at the cause - this is from 4.0-rc1
on the "-o bhash=101073" config:

-   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
   - default_send_IPI_mask_sequence_phys
      - 99.99% physflat_send_IPI_mask
         - 99.37% native_send_call_func_ipi
              smp_call_function_many
            - native_flush_tlb_others
               - 99.85% flush_tlb_page
                    ptep_clear_flush
                    try_to_unmap_one
                    rmap_walk
                    try_to_unmap
                    migrate_pages
                    migrate_misplaced_page
                  - handle_mm_fault
                     - 99.73% __do_page_fault
                          trace_do_page_fault
                          do_async_page_fault
                        + async_page_fault
           0.63% native_send_call_func_single_ipi
              generic_exec_single
              smp_call_function_single

And the same profile output from 3.19 shows:

-    9.61%     9.61%  [kernel]            [k] default_send_IPI_mask_sequence_phys
   - default_send_IPI_mask_sequence_phys
      - 99.98% physflat_send_IPI_mask
         - 96.26% native_send_call_func_ipi
              smp_call_function_many
            - native_flush_tlb_others
               - 98.44% flush_tlb_page
                    ptep_clear_flush
                    try_to_unmap_one
                    rmap_walk
                    try_to_unmap
                    migrate_pages
                    migrate_misplaced_page
                    handle_mm_fault
               + 1.56% flush_tlb_mm_range
         + 3.74% native_send_call_func_single_ipi

So either there's been a massive increase in the number of IPIs
being sent, or the cost per IPI has greatly increased. Either way,
the result is a pretty significant performance degradation.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
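
As a rough cross-check of the timing table in the report above, the per-config slowdown can be computed directly from the quoted runtimes. This is only a sketch and not part of the original report: the helper names (`to_seconds`, `slowdown`) are illustrative and do not belong to xfs_repair or any tool mentioned in the thread; the runtimes themselves are copied from the table.

```python
import re

# Runtimes copied from the table in the report: (3.19, 4.0-rc1).
RUNTIMES = {
    "defaults": ("8m08s", "9m34s"),
    "-o ag_stride=-1": ("4m04s", "4m38s"),
    "-o bhash=101073": ("6m04s", "17m43s"),
    "-o ag_stride=-1,bhash=101073": ("4m54s", "9m58s"),
}


def to_seconds(text):
    """Convert a string such as '17m43s' into a number of seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognised runtime: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)


def slowdown(old, new):
    """Return how many times longer the new runtime is than the old one."""
    return to_seconds(new) / to_seconds(old)


if __name__ == "__main__":
    for config, (old, new) in RUNTIMES.items():
        print(f"{config:30s} {slowdown(old, new):.2f}x")
```

On these figures the large buffer cache configs are hit hardest: roughly 2.9x for "-o bhash=101073" and roughly 2.0x for "-o ag_stride=-1,bhash=101073", against roughly 1.2x for the defaults.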


* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
  2015-03-02  1:04 Dave Chinner
@ 2015-03-02 19:47 ` Linus Torvalds
  2015-03-03  1:47   ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2015-03-02 19:47 UTC (permalink / raw)
  To: Dave Chinner, Andrew Morton, Ingo Molnar, Matt B
  Cc: Linux Kernel Mailing List, linux-mm, xfs

On Sun, Mar 1, 2015 at 5:04 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Across the board the 4.0-rc1 numbers are much slower, and the
> degradation is far worse when using the large memory footprint
> configs. Perf points straight at the cause - this is from 4.0-rc1
> on the "-o bhash=101073" config:
>
> -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
>       - 99.99% physflat_send_IPI_mask
>          - 99.37% native_send_call_func_ipi
..
>
> And the same profile output from 3.19 shows:
>
> -    9.61%     9.61%  [kernel]            [k] default_send_IPI_mask_sequence_phys
>      - 99.98% physflat_send_IPI_mask
>          - 96.26% native_send_call_func_ipi
...
>
> So either there's been a massive increase in the number of IPIs
> being sent, or the cost per IPI has greatly increased. Either way,
> the result is a pretty significant performance degradation.

And on Mon, Mar 2, 2015 at 11:17 AM, Matt <jackdachef@gmail.com> wrote:
>
> Linus already posted a fix to the problem, however I can't seem to
> find the matching commit in his tree (searching for "TLC regression"
> or "TLB cache").

That was commit f045bbb9fa1b, which was then refined by commit
721c21c17ab9, because it turned out that ARM64 had a very subtle
relationship with tlb->end and fullmm.

But both of those hit 3.19, so none of this should affect 4.0-rc1.
There's something else going on.

I assume it's the mm queue from Andrew, so adding him to the cc. There
are changes to the page migration etc, which could explain it.

There are also a fair amount of APIC changes in 4.0-rc1, so I guess it
really could be just that the IPI sending itself has gotten much
slower. Adding Ingo for that, although I don't think
default_send_IPI_mask_sequence_phys() itself has actually changed,
only other things around the apic. So I'd be inclined to blame the mm
changes.

Obviously bisection would find it..

                          Linus


* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
  2015-03-02 19:47 ` Linus Torvalds
@ 2015-03-03  1:47   ` Dave Chinner
  2015-03-03  2:22     ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2015-03-03  1:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Ingo Molnar, Matt B, Linux Kernel Mailing List,
	linux-mm, xfs

On Mon, Mar 02, 2015 at 11:47:52AM -0800, Linus Torvalds wrote:
> On Sun, Mar 1, 2015 at 5:04 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Across the board the 4.0-rc1 numbers are much slower, and the
> > degradation is far worse when using the large memory footprint
> > configs. Perf points straight at the cause - this is from 4.0-rc1
> > on the "-o bhash=101073" config:
> >
> > -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
> >       - 99.99% physflat_send_IPI_mask
> >          - 99.37% native_send_call_func_ipi
> ..
> >
> > And the same profile output from 3.19 shows:
> >
> > -    9.61%     9.61%  [kernel]            [k] default_send_IPI_mask_sequence_phys
> >      - 99.98% physflat_send_IPI_mask
> >          - 96.26% native_send_call_func_ipi
> ...
> >
> > So either there's been a massive increase in the number of IPIs
> > being sent, or the cost per IPI has greatly increased. Either way,
> > the result is a pretty significant performance degradation.
....
> I assume it's the mm queue from Andrew, so adding him to the cc. There
> are changes to the page migration etc, which could explain it.
> 
> There are also a fair amount of APIC changes in 4.0-rc1, so I guess it
> really could be just that the IPI sending itself has gotten much
> slower. Adding Ingo for that, although I don't think
> default_send_IPI_mask_sequence_phys() itself has actually changed,
> only other things around the apic. So I'd be inclined to blame the mm
> changes.
> 
> Obviously bisection would find it..

Yes, though the time it takes to do a 13 step bisection means it's
something I don't do just for an initial bug report. ;)

Anyway, the difference between good and bad is pretty clear, so
I'm pretty confident the bisect is solid:

4d9424669946532be754a6e116618dcb58430cb4 is the first bad commit
commit 4d9424669946532be754a6e116618dcb58430cb4
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Feb 12 14:58:28 2015 -0800

    mm: convert p[te|md]_mknonnuma and remaining page table manipulations
    
    With PROT_NONE, the traditional page table manipulation functions are
    sufficient.
    
    [andre.przywara@arm.com: fix compiler warning in pmdp_invalidate()]
    [akpm@linux-foundation.org: fix build with STRICT_MM_TYPECHECKS]
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Acked-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
    Tested-by: Sasha Levin <sasha.levin@oracle.com>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Dave Jones <davej@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

:040000 040000 50985a3f84e80bb2bdd049d4f34739d99436f988 1bc79bfac2c138844373b603f9bc5914f0d010f3 M        arch
:040000 040000 ea69bcd1c59f832a4b012a57b4eb1d0c7516947d 0822692fa6c356952e723b56038585716fa51723 M        include
:040000 040000 c11960b9f1ee72edb08dc3fdc46f590fb1d545f7 f5d17ff5b639adcb7363a196a9efe70f2a7312b5 M        mm

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
  2015-03-03  1:47   ` Dave Chinner
@ 2015-03-03  2:22     ` Linus Torvalds
  2015-03-03  2:37       ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2015-03-03  2:22 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Ingo Molnar, Matt B, Linux Kernel Mailing List,
	linux-mm, xfs

On Mon, Mar 2, 2015 at 5:47 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Anyway, the difference between good and bad is pretty clear, so
> I'm pretty confident the bisect is solid:
>
> 4d9424669946532be754a6e116618dcb58430cb4 is the first bad commit

Well, it's the mm queue from Andrew, so I'm not surprised. That said,
I don't see why that particular one should matter.

Hmm. In your profiles, can you tell which caller of "flush_tlb_page()"
 changed the most? The change from "mknnuma" to "prot_none" *should*
be 100% equivalent (both just change the page to be not-present, just
set different bits elsewhere in the pte), but clearly something
wasn't.

Oh. Except for that "huge-zero-page" special case that got
dropped, but that got re-introduced in commit e944fd67b625.

There might be some other case where the new "just change the
protection" doesn't do the "oh, but if the protection didn't change,
don't bother flushing". I don't see it.

                          Linus


* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
  2015-03-03  2:22     ` Linus Torvalds
@ 2015-03-03  2:37       ` Linus Torvalds
  2015-03-03  5:20         ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2015-03-03  2:37 UTC (permalink / raw)
  To: Dave Chinner, Mel Gorman
  Cc: Andrew Morton, Ingo Molnar, Matt B, Linux Kernel Mailing List,
	linux-mm, xfs

On Mon, Mar 2, 2015 at 6:22 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> There might be some other case where the new "just change the
> protection" doesn't do the "oh, but if the protection didn't change,
> don't bother flushing". I don't see it.

Hmm. I wonder.. In change_pte_range(), we just unconditionally change
the protection bits.

But the old numa code used to do

    if (!pte_numa(oldpte)) {
        ptep_set_numa(mm, addr, pte);

so it would actually avoid the pte update if a numa-prot page was
marked numa-prot again.

But are those migrate-page calls really common enough to make these
things happen often enough on the same pages for this all to matter?

Odd.

So it would be good if your profiles just show "there's suddenly a
*lot* more calls to flush_tlb_page() from XYZ" and the culprit is
obvious that way..

                       Linus


* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
  2015-03-03  2:37       ` Linus Torvalds
@ 2015-03-03  5:20         ` Dave Chinner
  2015-03-03  6:56           ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2015-03-03  5:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Andrew Morton, Ingo Molnar, Matt B,
	Linux Kernel Mailing List, linux-mm, xfs

On Mon, Mar 02, 2015 at 06:37:47PM -0800, Linus Torvalds wrote:
> On Mon, Mar 2, 2015 at 6:22 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > There might be some other case where the new "just change the
> > protection" doesn't do the "oh, but if the protection didn't change,
> > don't bother flushing". I don't see it.
> 
> Hmm. I wonder.. In change_pte_range(), we just unconditionally change
> the protection bits.
> 
> But the old numa code used to do
> 
>     if (!pte_numa(oldpte)) {
>         ptep_set_numa(mm, addr, pte);
> 
> so it would actually avoid the pte update if a numa-prot page was
> marked numa-prot again.
> 
> But are those migrate-page calls really common enough to make these
> things happen often enough on the same pages for this all to matter?

It's looking like that's a possibility.  I am running a fake-numa=4
config on this test VM so it's got 4 nodes of 4p/4GB RAM each.
Both kernels are running through the same page fault path and that
is straight through migrate_pages().

3.19:

   13.70%     0.01%  [kernel]            [k] native_flush_tlb_others
   - native_flush_tlb_others
      - 98.58% flush_tlb_page
           ptep_clear_flush
           try_to_unmap_one
           rmap_walk
           try_to_unmap
           migrate_pages
           migrate_misplaced_page
         - handle_mm_fault
            - 96.88% __do_page_fault
                 trace_do_page_fault
                 do_async_page_fault
               + async_page_fault
            + 3.12% __get_user_pages
      + 1.40% flush_tlb_mm_range

4.0-rc1:

-   67.12%     0.04%  [kernel]            [k] native_flush_tlb_others
   - native_flush_tlb_others
      - 99.80% flush_tlb_page
           ptep_clear_flush
           try_to_unmap_one
           rmap_walk
           try_to_unmap
           migrate_pages
           migrate_misplaced_page
         - handle_mm_fault
            - 99.50% __do_page_fault
                 trace_do_page_fault
                 do_async_page_fault
               - async_page_fault

Same call chain, just a lot more CPU used further down the stack.

> Odd.
> 
> So it would be good if your profiles just show "there's suddenly a
> *lot* more calls to flush_tlb_page() from XYZ" and the culprit is
> obvious that way..

Ok, I did a simple 'perf stat -e tlb:tlb_flush -a -r 6 sleep 10' to
count all the tlb flush events from the kernel. I then pulled the
full events for a 30s period to get a sampling of the reason
associated with each flush event.

4.0-rc1:

 Performance counter stats for 'system wide' (6 runs):

         2,190,503      tlb:tlb_flush      ( +-  8.30% )

      10.001970663 seconds time elapsed    ( +-  0.00% )

The reason breakdown:

	81% TLB_REMOTE_SHOOTDOWN
	19% TLB_FLUSH_ON_TASK_SWITCH

3.19:

 Performance counter stats for 'system wide' (6 runs):

           467,151      tlb:tlb_flush      ( +- 25.50% )

      10.002021491 seconds time elapsed    ( +-  0.00% )

The reason breakdown:

	  6% TLB_REMOTE_SHOOTDOWN
	 94% TLB_FLUSH_ON_TASK_SWITCH

The difference would appear to be the number of remote TLB
shootdowns that are occurring from otherwise identical page fault
paths.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
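
The two perf measurements above can be combined into approximate per-second flush rates. This is a sketch rather than anything from the original message, and it rests on one loud assumption: that the reason breakdown taken from the separate 30 second sample also applies to the 10 second event counts. The names `SAMPLES` and `flush_rates` are illustrative.

```python
# Figures copied from the two perf runs quoted above.
# total:        tlb:tlb_flush events in a 10 second window (mean of 6 runs)
# remote_share: fraction attributed to TLB_REMOTE_SHOOTDOWN in the 30 s sample
#               (assumed here to hold for the 10 s windows as well)
SAMPLES = {
    "3.19": {"total": 467_151, "remote_share": 0.06},
    "4.0-rc1": {"total": 2_190_503, "remote_share": 0.81},
}
WINDOW_SECONDS = 10


def flush_rates(total, remote_share, window=WINDOW_SECONDS):
    """Split a flush count into (remote, task-switch) flushes per second."""
    remote = total * remote_share / window
    task_switch = total * (1.0 - remote_share) / window
    return remote, task_switch


if __name__ == "__main__":
    for kernel, sample in SAMPLES.items():
        remote, task_switch = flush_rates(sample["total"], sample["remote_share"])
        print(f"{kernel:8s} remote={remote:10.0f}/s task_switch={task_switch:10.0f}/s")
```

Under that assumption the task-switch flush rate is about the same on both kernels (roughly 42 to 44 thousand per second), while remote shootdowns rise from roughly 2.8 thousand to roughly 177 thousand per second, which matches the conclusion that the extra flushes are remote shootdowns.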


* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
  2015-03-03  5:20         ` Dave Chinner
@ 2015-03-03  6:56           ` Linus Torvalds
  2015-03-03 11:34             ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2015-03-03  6:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Mel Gorman, Andrew Morton, Ingo Molnar, Matt B,
	Linux Kernel Mailing List, linux-mm, xfs

On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner <david@fromorbit.com> wrote:
>>
>> But are those migrate-page calls really common enough to make these
>> things happen often enough on the same pages for this all to matter?
>
> It's looking like that's a possibility.

Hmm. Looking closer, commit 10c1045f28e8 already should have
re-introduced the "pte was already NUMA" case.

So that's not it either, afaik. Plus your numbers seem to say that
it's really "migrate_pages()" that is done more. So it feels like the
numa balancing isn't working right.

But I'm not seeing what would cause that in that commit. It really all
looks the same to me. The few special-cases it drops get re-introduced
later (although in a different form).

Mel, do you see what I'm missing?

                     Linus


* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
  2015-03-03  6:56           ` Linus Torvalds
@ 2015-03-03 11:34             ` Dave Chinner
  2015-03-03 13:43               ` Mel Gorman
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2015-03-03 11:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Andrew Morton, Ingo Molnar, Matt B,
	Linux Kernel Mailing List, linux-mm, xfs

On Mon, Mar 02, 2015 at 10:56:14PM -0800, Linus Torvalds wrote:
> On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner <david@fromorbit.com> wrote:
> >>
> >> But are those migrate-page calls really common enough to make these
> >> things happen often enough on the same pages for this all to matter?
> >
> > It's looking like that's a possibility.
> 
> Hmm. Looking closer, commit 10c1045f28e8 already should have
> re-introduced the "pte was already NUMA" case.
> 
> So that's not it either, afaik. Plus your numbers seem to say that
> it's really "migrate_pages()" that is done more. So it feels like the
> numa balancing isn't working right.

So that should show up in the vmstats, right? Oh, and there's a
tracepoint in migrate_pages, too. Same 6x10s samples in phase 3:

3.19:

	55,898      migrate:mm_migrate_pages

And a sample of the events shows 99.99% of these are:

mm_migrate_pages:     nr_succeeded=1 nr_failed=0 mode=MIGRATE_ASYNC reason=

4.0-rc1:

	364,442      migrate:mm_migrate_pages

They are also single page MIGRATE_ASYNC events like for 3.19.

And 'grep "numa\|migrate" /proc/vmstat' output for the entire
xfs_repair run:

3.19:

numa_hit 5163221
numa_miss 121274
numa_foreign 121274
numa_interleave 12116
numa_local 5153127
numa_other 131368
numa_pte_updates 36482466
numa_huge_pte_updates 0
numa_hint_faults 34816515
numa_hint_faults_local 9197961
numa_pages_migrated 1228114
pgmigrate_success 1228114
pgmigrate_fail 0

4.0-rc1:

numa_hit 36952043
numa_miss 92471
numa_foreign 92471
numa_interleave 10964
numa_local 36927384
numa_other 117130
numa_pte_updates 84010995
numa_huge_pte_updates 0
numa_hint_faults 81697505
numa_hint_faults_local 21765799
numa_pages_migrated 32916316
pgmigrate_success 32916316
pgmigrate_fail 0

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
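
To see which of the NUMA counters above moved the most, the two /proc/vmstat dumps can be compared counter by counter. The following is a minimal sketch and not part of the original message; only a subset of the counters is copied in, and the helper names (`parse_vmstat`, `migrations_per_fault`) are illustrative.

```python
# Subset of the counters quoted above, in /proc/vmstat "name value" format.
VMSTAT_3_19 = """\
numa_pte_updates 36482466
numa_hint_faults 34816515
numa_hint_faults_local 9197961
numa_pages_migrated 1228114
"""

VMSTAT_4_0_RC1 = """\
numa_pte_updates 84010995
numa_hint_faults 81697505
numa_hint_faults_local 21765799
numa_pages_migrated 32916316
"""


def parse_vmstat(text):
    """Parse 'name value' lines, as printed by /proc/vmstat, into a dict."""
    counters = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, value = line.split()
        counters[name] = int(value)
    return counters


def migrations_per_fault(counters):
    """Fraction of NUMA hinting faults that ended in a page migration."""
    return counters["numa_pages_migrated"] / counters["numa_hint_faults"]


if __name__ == "__main__":
    old = parse_vmstat(VMSTAT_3_19)
    new = parse_vmstat(VMSTAT_4_0_RC1)
    for name in old:
        print(f"{name:24s} {new[name] / old[name]:6.2f}x")
    print(f"migrations per hinting fault: 3.19 {migrations_per_fault(old):.1%}, "
          f"4.0-rc1 {migrations_per_fault(new):.1%}")
```

PTE updates and hinting faults grow by roughly 2.3x between the two kernels, but migrated pages grow by roughly 27x, so about 40% of hinting faults end in a migration on 4.0-rc1 compared with about 3.5% on 3.19.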


* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
  2015-03-03 11:34             ` Dave Chinner
@ 2015-03-03 13:43               ` Mel Gorman
  2015-03-03 21:33                 ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Mel Gorman @ 2015-03-03 13:43 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Matt B,
	Linux Kernel Mailing List, linux-mm, xfs

On Tue, Mar 03, 2015 at 10:34:37PM +1100, Dave Chinner wrote:
> On Mon, Mar 02, 2015 at 10:56:14PM -0800, Linus Torvalds wrote:
> > On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >>
> > >> But are those migrate-page calls really common enough to make these
> > >> things happen often enough on the same pages for this all to matter?
> > >
> > > It's looking like that's a possibility.
> > 
> > Hmm. Looking closer, commit 10c1045f28e8 already should have
> > re-introduced the "pte was already NUMA" case.
> > 
> > So that's not it either, afaik. Plus your numbers seem to say that
> > it's really "migrate_pages()" that is done more. So it feels like the
> > numa balancing isn't working right.
> 
> So that should show up in the vmstats, right? Oh, and there's a
> tracepoint in migrate_pages, too. Same 6x10s samples in phase 3:
> 

The stats indicate both more updates and more faults. Can you try this
please? It's against 4.0-rc1.

---8<---
mm: numa: Reduce amount of IPI traffic due to automatic NUMA balancing

Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

   Across the board the 4.0-rc1 numbers are much slower, and the
   degradation is far worse when using the large memory footprint
   configs. Perf points straight at the cause - this is from 4.0-rc1
   on the "-o bhash=101073" config:

   -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
      - default_send_IPI_mask_sequence_phys
         - 99.99% physflat_send_IPI_mask
            - 99.37% native_send_call_func_ipi
                 smp_call_function_many
               - native_flush_tlb_others
                  - 99.85% flush_tlb_page
                       ptep_clear_flush
                       try_to_unmap_one
                       rmap_walk
                       try_to_unmap
                       migrate_pages
                       migrate_misplaced_page
                     - handle_mm_fault
                        - 99.73% __do_page_fault
                             trace_do_page_fault
                             do_async_page_fault
                           + async_page_fault
              0.63% native_send_call_func_single_ipi
                 generic_exec_single
                 smp_call_function_single

This was bisected to commit 4d94246699 ("mm: convert p[te|md]_mknonnuma
and remaining page table manipulations") but I expect the full issue is
related to the series up to and including that patch.

There are two important changes that might be relevant here. The first is
marking huge PMDs to trap a hinting fault potentially sends an IPI to flush
TLBs. This did not show up in Dave's report and it almost certainly is not
a factor but it would affect IPI counts for other users. The second is that
the PTE protection update now clears the PTE leaving a window where parallel
faults can be trapped resulting in more overhead from faults. Higher faults,
even if correct, can result in higher scan rates indirectly and may explain
what Dave is saying.

This is not signed off or tested.
---
 mm/huge_memory.c | 11 +++++++++--
 mm/mprotect.c    | 17 +++++++++++++++--
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fc00c8cb5a82..7fc4732c77d7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1494,8 +1494,15 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		}
 
 		if (!prot_numa || !pmd_protnone(*pmd)) {
-			ret = 1;
-			entry = pmdp_get_and_clear_notify(mm, addr, pmd);
+			/*
+			 * NUMA hinting update can avoid a clear and flush as
+			 * it is not a functional correctness issue if access
+			 * occurs after the update
+			 */
+			if (prot_numa)
+				entry = *pmd;
+			else
+				entry = pmdp_get_and_clear_notify(mm, addr, pmd);
 			entry = pmd_modify(entry, newprot);
 			ret = HPAGE_PMD_NR;
 			set_pmd_at(mm, addr, pmd, entry);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 44727811bf4c..1efd03ffa0d8 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -77,19 +77,32 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			pte_t ptent;
 
 			/*
-			 * Avoid trapping faults against the zero or KSM
-			 * pages. See similar comment in change_huge_pmd.
+			 * prot_numa does not clear the pte during protection
+			 * update as asynchronous hardware updates are not
+			 * a concern but unnecessary faults while the PTE is
+			 * cleared is overhead.
 			 */
 			if (prot_numa) {
 				struct page *page;
 
 				page = vm_normal_page(vma, addr, oldpte);
+
+				/*
+				 * Avoid trapping faults against the zero or KSM
+				 * pages. See similar comment in change_huge_pmd.
+				 */
 				if (!page || PageKsm(page))
 					continue;
 
 				/* Avoid TLB flush if possible */
 				if (pte_protnone(oldpte))
 					continue;
+
+				ptent = *pte;
+				ptent = pte_modify(ptent, newprot);
+				set_pte_at(mm, addr, pte, ptent);
+				pages++;
+				continue;
 			}
 
 			ptent = ptep_modify_prot_start(mm, addr, pte);


* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
  2015-03-03 13:43               ` Mel Gorman
@ 2015-03-03 21:33                 ` Dave Chinner
  2015-03-04 20:00                   ` Mel Gorman
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2015-03-03 21:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Matt B,
	Linux Kernel Mailing List, linux-mm, xfs

On Tue, Mar 03, 2015 at 01:43:46PM +0000, Mel Gorman wrote:
> On Tue, Mar 03, 2015 at 10:34:37PM +1100, Dave Chinner wrote:
> > On Mon, Mar 02, 2015 at 10:56:14PM -0800, Linus Torvalds wrote:
> > > On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > >>
> > > >> But are those migrate-page calls really common enough to make these
> > > >> things happen often enough on the same pages for this all to matter?
> > > >
> > > > It's looking like that's a possibility.
> > > 
> > > Hmm. Looking closer, commit 10c1045f28e8 already should have
> > > re-introduced the "pte was already NUMA" case.
> > > 
> > > So that's not it either, afaik. Plus your numbers seem to say that
> > > it's really "migrate_pages()" that is done more. So it feels like the
> > > numa balancing isn't working right.
> > 
> > So that should show up in the vmstats, right? Oh, and there's a
> > tracepoint in migrate_pages, too. Same 6x10s samples in phase 3:
> > 
> 
> The stats indicate both more updates and more faults. Can you try this
> please? It's against 4.0-rc1.
> 
> ---8<---
> mm: numa: Reduce amount of IPI traffic due to automatic NUMA balancing

Makes no noticeable difference to behaviour or performance. Stats:

359,857      migrate:mm_migrate_pages ( +-  5.54% )

numa_hit 36026802
numa_miss 14287
numa_foreign 14287
numa_interleave 18408
numa_local 36006052
numa_other 35037
numa_pte_updates 81803359
numa_huge_pte_updates 0
numa_hint_faults 79810798
numa_hint_faults_local 21227730
numa_pages_migrated 32037516
pgmigrate_success 32037516
pgmigrate_fail 0

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
  2015-03-03 21:33                 ` Dave Chinner
@ 2015-03-04 20:00                   ` Mel Gorman
  2015-03-04 23:00                     ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Mel Gorman @ 2015-03-04 20:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Matt B,
	Linux Kernel Mailing List, linux-mm, xfs

On Wed, Mar 04, 2015 at 08:33:53AM +1100, Dave Chinner wrote:
> On Tue, Mar 03, 2015 at 01:43:46PM +0000, Mel Gorman wrote:
> > On Tue, Mar 03, 2015 at 10:34:37PM +1100, Dave Chinner wrote:
> > > On Mon, Mar 02, 2015 at 10:56:14PM -0800, Linus Torvalds wrote:
> > > > On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > > >>
> > > > >> But are those migrate-page calls really common enough to make these
> > > > >> things happen often enough on the same pages for this all to matter?
> > > > >
> > > > > It's looking like that's a possibility.
> > > > 
> > > > Hmm. Looking closer, commit 10c1045f28e8 already should have
> > > > re-introduced the "pte was already NUMA" case.
> > > > 
> > > > So that's not it either, afaik. Plus your numbers seem to say that
> > > > it's really "migrate_pages()" that is done more. So it feels like the
> > > > numa balancing isn't working right.
> > > 
> > > So that should show up in the vmstats, right? Oh, and there's a
> > > tracepoint in migrate_pages, too. Same 6x10s samples in phase 3:
> > > 
> > 
> > The stats indicate both more updates and more faults. Can you try this
> > please? It's against 4.0-rc1.
> > 
> > ---8<---
> > mm: numa: Reduce amount of IPI traffic due to automatic NUMA balancing
> 
> Makes no noticeable difference to behaviour or performance. Stats:
> 

After going through the series again, I did not spot why there is a
difference. It's functionally similar and I would hate the theory that
this is somehow hardware related due to the use of bits it takes action
on. There is nothing in the manual that indicates that it would. Try this
as I don't want to leave this hanging before LSF/MM because it'll mask other
reports. It alters the maximum rate at which automatic NUMA balancing scans ptes.

---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ce18f3c097a..40ae5d84d4ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -799,7 +799,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * calculated based on the tasks virtual memory size and
  * numa_balancing_scan_size.
  */
-unsigned int sysctl_numa_balancing_scan_period_min = 1000;
+unsigned int sysctl_numa_balancing_scan_period_min = 2000;
 unsigned int sysctl_numa_balancing_scan_period_max = 60000;
 
 /* Portion of address space to scan in MB */
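
The same knob can probably be exercised without rebuilding the kernel. A minimal Python sketch, assuming the variable is exported as `/proc/sys/kernel/numa_balancing_scan_period_min_ms` (which should be the case on kernels of this era built with CONFIG_NUMA_BALANCING, but is an assumption about the test machine); the function names are hypothetical, and writing the value requires root:

```python
from pathlib import Path

# Assumed sysctl path; present only with CONFIG_NUMA_BALANCING on kernels
# that export the scan period under /proc/sys/kernel.
SCAN_PERIOD_MIN = Path("/proc/sys/kernel/numa_balancing_scan_period_min_ms")


def read_scan_period_min(path=SCAN_PERIOD_MIN):
    """Return the current minimum scan period in milliseconds."""
    return int(Path(path).read_text().strip())


def write_scan_period_min(ms, path=SCAN_PERIOD_MIN):
    """Set the minimum scan period in milliseconds (needs root on a real system)."""
    if ms <= 0:
        raise ValueError("scan period must be positive")
    Path(path).write_text(f"{ms}\n")


if __name__ == "__main__":
    if SCAN_PERIOD_MIN.exists():
        print("scan_period_min_ms =", read_scan_period_min())
    else:
        print("sysctl not available on this kernel")
```

Doubling the minimum period from 1000 to 2000 halves the fastest rate at which the scanner can run, which is the effect the one-line change above is after.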


* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
  2015-03-04 20:00                   ` Mel Gorman
@ 2015-03-04 23:00                     ` Dave Chinner
  2015-03-04 23:35                       ` Ingo Molnar
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2015-03-04 23:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Matt B,
	Linux Kernel Mailing List, linux-mm, xfs

On Wed, Mar 04, 2015 at 08:00:46PM +0000, Mel Gorman wrote:
> On Wed, Mar 04, 2015 at 08:33:53AM +1100, Dave Chinner wrote:
> > On Tue, Mar 03, 2015 at 01:43:46PM +0000, Mel Gorman wrote:
> > > On Tue, Mar 03, 2015 at 10:34:37PM +1100, Dave Chinner wrote:
> > > > On Mon, Mar 02, 2015 at 10:56:14PM -0800, Linus Torvalds wrote:
> > > > > On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > > > >>
> > > > > >> But are those migrate-page calls really common enough to make these
> > > > > >> things happen often enough on the same pages for this all to matter?
> > > > > >
> > > > > > It's looking like that's a possibility.
> > > > > 
> > > > > Hmm. Looking closer, commit 10c1045f28e8 already should have
> > > > > re-introduced the "pte was already NUMA" case.
> > > > > 
> > > > > So that's not it either, afaik. Plus your numbers seem to say that
> > > > > it's really "migrate_pages()" that is done more. So it feels like the
> > > > > numa balancing isn't working right.
> > > > 
> > > > So that should show up in the vmstats, right? Oh, and there's a
> > > > tracepoint in migrate_pages, too. Same 6x10s samples in phase 3:
> > > > 
> > > 
> > > The stats indicate both more updates and more faults. Can you try this
> > > please? It's against 4.0-rc1.
> > > 
> > > ---8<---
> > > mm: numa: Reduce amount of IPI traffic due to automatic NUMA balancing
> > 
> > Makes no noticeable difference to behaviour or performance. Stats:
> > 
> 
> After going through the series again, I did not spot why there is a
> difference. It's functionally similar and I would hate the theory that
> this is somehow hardware related due to the use of bits it takes action
> on.

I doubt it's hardware related - I'm testing inside a VM, and the
host is a year-old Dell R820 server, so it's pretty common
hardware, I'd think.

Guest:

processor       : 15
vendor_id       : GenuineIntel
cpu family      : 6
model           : 6
model name      : QEMU Virtual CPU version 2.0.0
stepping        : 3
microcode       : 0x1
cpu MHz         : 2199.998
cache size      : 4096 KB
physical id     : 15
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 15
initial apicid  : 15
fpu             : yes
fpu_exception   : yes
cpuid level     : 4
wp              : yes
flags           : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm rep_good nopl pni cx16 x2apic popcnt hypervisor lahf_lm
bugs            :
bogomips        : 4399.99
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

Host:

processor       : 31
vendor_id       : GenuineIntel
cpu family      : 6
model           : 45
model name      : Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz
stepping        : 7
microcode       : 0x70d
cpu MHz         : 1190.750
cache size      : 16384 KB
physical id     : 1
siblings        : 16
core id         : 7
cpu cores       : 8
apicid          : 47
initial apicid  : 47
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips        : 4400.75
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

> There is nothing in the manual that indicates that it would. Try this
> as I don't want to leave this hanging before LSF/MM because it'll mask other
> reports. It alters the maximum rate at which automatic NUMA balancing scans ptes.
> 
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7ce18f3c097a..40ae5d84d4ba 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -799,7 +799,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
>   * calculated based on the tasks virtual memory size and
>   * numa_balancing_scan_size.
>   */
> -unsigned int sysctl_numa_balancing_scan_period_min = 1000;
> +unsigned int sysctl_numa_balancing_scan_period_min = 2000;
>  unsigned int sysctl_numa_balancing_scan_period_max = 60000;

Made absolutely no difference:

	357,635      migrate:mm_migrate_pages      ( +-  4.11% )

numa_hit 36724642
numa_miss 92477
numa_foreign 92477
numa_interleave 11835
numa_local 36709671
numa_other 107448
numa_pte_updates 83924860
numa_huge_pte_updates 0
numa_hint_faults 81856035
numa_hint_faults_local 22104529
numa_pages_migrated 32766735
pgmigrate_success 32766735
pgmigrate_fail 0

Runtime was actually a minute worse (18m35s vs 17m39s) than without
this patch.
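
To put numbers on "no difference", the counters from this run can be set against the ones posted in the previous message. A minimal Python sketch; the helper names are hypothetical, and because the earlier counters were collected with a different test patch applied, this is only a rough comparison:

```python
import re


def parse_runtime(text):
    """Convert a string such as '18m35s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognised runtime: {text!r}")
    return int(match.group(1)) * 60 + int(match.group(2))


def relative_change(before, after):
    """Fractional change going from before to after."""
    return (after - before) / before


# Counters quoted in the thread: previous message vs. scan_period_min = 2000.
PREVIOUS = {"numa_pte_updates": 81803359, "numa_hint_faults": 79810798,
            "numa_pages_migrated": 32037516}
PATCHED = {"numa_pte_updates": 83924860, "numa_hint_faults": 81856035,
           "numa_pages_migrated": 32766735}

if __name__ == "__main__":
    for name in PREVIOUS:
        change = relative_change(PREVIOUS[name], PATCHED[name])
        print(f"{name}: {change:+.1%}")
    extra = parse_runtime("18m35s") - parse_runtime("17m39s")
    print(f"runtime difference: {extra}s")
```

All three counters move by only a few percent between the two runs, and the runtime difference comes to 56 seconds.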

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
  2015-03-04 23:00                     ` Dave Chinner
@ 2015-03-04 23:35                       ` Ingo Molnar
  2015-03-04 23:51                         ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2015-03-04 23:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Matt B,
	Linux Kernel Mailing List, linux-mm, xfs


* Dave Chinner <david@fromorbit.com> wrote:

> > After going through the series again, I did not spot why there is 
> > a difference. It's functionally similar and I would hate the 
> > theory that this is somehow hardware related due to the use of 
> > bits it takes action on.
> 
> I doubt it's hardware related - I'm testing inside a VM, [...]

That might be significant, I doubt Mel considered KVM's interpretation 
of pte details?

Thanks,

	Ingo


* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
  2015-03-04 23:35                       ` Ingo Molnar
@ 2015-03-04 23:51                         ` Dave Chinner
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2015-03-04 23:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Matt B,
	Linux Kernel Mailing List, linux-mm, xfs

On Thu, Mar 05, 2015 at 12:35:45AM +0100, Ingo Molnar wrote:
> 
> * Dave Chinner <david@fromorbit.com> wrote:
> 
> > > After going through the series again, I did not spot why there is 
> > > a difference. It's functionally similar and I would hate the 
> > > theory that this is somehow hardware related due to the use of 
> > > bits it takes action on.
> > 
> > I doubt it's hardware related - I'm testing inside a VM, [...]
> 
> That might be significant, I doubt Mel considered KVM's interpretation 
> of pte details?

I did actually mention that before:

| I am running a fake-numa=4 config on this test VM so it's got 4
| nodes of 4p/4GB RAM each.

but I think it got snipped before Mel was cc'd.

Perhaps the size of the nodes is relevant, too, because the steady state
phase 3 memory usage is 5-6GB when this problem first shows up, and
then continues into phase 4 where memory usage grows again and peaks
at ~10GB....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


end of thread, other threads:[~2015-03-04 23:51 UTC | newest]

Thread overview: 17+ messages
2015-03-02 19:17 [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation Matt
2015-03-02 19:25 ` Dave Hansen
2015-03-02 19:45   ` Matt
  -- strict thread matches above, loose matches on Subject: below --
2015-03-02  1:04 Dave Chinner
2015-03-02 19:47 ` Linus Torvalds
2015-03-03  1:47   ` Dave Chinner
2015-03-03  2:22     ` Linus Torvalds
2015-03-03  2:37       ` Linus Torvalds
2015-03-03  5:20         ` Dave Chinner
2015-03-03  6:56           ` Linus Torvalds
2015-03-03 11:34             ` Dave Chinner
2015-03-03 13:43               ` Mel Gorman
2015-03-03 21:33                 ` Dave Chinner
2015-03-04 20:00                   ` Mel Gorman
2015-03-04 23:00                     ` Dave Chinner
2015-03-04 23:35                       ` Ingo Molnar
2015-03-04 23:51                         ` Dave Chinner
