public inbox for linux-kernel@vger.kernel.org
* [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
@ 2015-03-02 19:17 Matt
  2015-03-02 19:25 ` Dave Hansen
  0 siblings, 1 reply; 17+ messages in thread
From: Matt @ 2015-03-02 19:17 UTC
  To: Dave Chinner; +Cc: Linux Kernel, Linus Torvalds, Will Deacon, Dave Hansen

Hi Dave,


Is the following thread and patch related to your problem?

I happened to stumble upon it a few days ago:

https://lkml.org/lkml/2014/12/17/280 ,
http://marc.info/?l=linux-kernel&m=141876582909898&w=2
Re: post-3.18 performance regression in TLB flushing code

Linus already posted a fix for the problem; however, I can't seem to
find the matching commit in his tree (searching for "TLC regression"
or "TLB cache").


Kind Regards

Matt

* [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
@ 2015-03-02  1:04 Dave Chinner
  2015-03-02 19:47 ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2015-03-02  1:04 UTC
  To: linux-kernel; +Cc: linux-mm, xfs

Hi folks,

Running one of my usual benchmarks (fsmark to create 50 million
zero-length files in a 500TB filesystem, then running xfs_repair on it)
has indicated a significant regression in xfs_repair performance.

config                          3.19      4.0-rc1
defaults                        8m08s       9m34s
-o ag_stride=-1                 4m04s       4m38s
-o bhash=101073                 6m04s      17m43s
-o ag_stride=-1,bhash=101073    4m54s       9m58s

The default is to create a number of concurrent threads to progress
AGs in parallel (https://lkml.org/lkml/2014/7/3/15), and this is
running on a 500AG filesystem so lots of parallelism. "-o
ag_stride=-1" turns this off, and just leaves a single prefetch
group working on AGs sequentially. As you can see, turning off the
concurrency halves the runtime.

The concurrency is really there for large spinning disk arrays,
where IO wait time dominates performance. I'm running on SSDs, so
there is almost no IO wait time.

The "-o bhash=X" option controls the size of the buffer cache. The default
value is 4096, which means xfs_repair is operating with a memory
footprint of about 1GB and is small enough to suffer from readahead
thrashing on large filesystems. Setting it to 101073 increases that
to around 7-10GB and prevents readahead thrashing, so it should run
much faster than the default concurrent config. It does run faster
for 3.19, but for 4.0-rc1 it runs almost twice as slow, and burns a
huge amount of system CPU time doing so.
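
For anyone who wants the slowdown per config as a ratio, the arithmetic
on the table above can be sketched as follows. The runtimes are copied
from the table; the helper names (to_seconds, slowdown) are made up for
illustration and are not part of any tool mentioned here.

```python
# Runtimes copied from the table above, as (3.19, 4.0-rc1) pairs.
RUNTIMES = {
    "defaults": ("8m08s", "9m34s"),
    "-o ag_stride=-1": ("4m04s", "4m38s"),
    "-o bhash=101073": ("6m04s", "17m43s"),
    "-o ag_stride=-1,bhash=101073": ("4m54s", "9m58s"),
}


def to_seconds(text):
    """Convert a string such as '17m43s' to a number of seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)


def slowdown(config):
    """Return the 4.0-rc1 runtime divided by the 3.19 runtime."""
    old, new = (to_seconds(t) for t in RUNTIMES[config])
    return new / old


if __name__ == "__main__":
    # Print one ratio per config; values above 1.0 mean 4.0-rc1 is slower.
    for config in RUNTIMES:
        print(f"{config:30s} {slowdown(config):.2f}x")
```

The same to_seconds() helper can be used to compare configs within one
kernel, e.g. defaults against "-o bhash=101073" on 4.0-rc1.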

Across the board the 4.0-rc1 numbers are much slower, and the
degradation is far worse when using the large memory footprint
configs. Perf points straight at the cause - this is from 4.0-rc1
on the "-o bhash=101073" config:

-   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
   - default_send_IPI_mask_sequence_phys
      - 99.99% physflat_send_IPI_mask
         - 99.37% native_send_call_func_ipi
              smp_call_function_many
            - native_flush_tlb_others
               - 99.85% flush_tlb_page
                    ptep_clear_flush
                    try_to_unmap_one
                    rmap_walk
                    try_to_unmap
                    migrate_pages
                    migrate_misplaced_page
                  - handle_mm_fault
                     - 99.73% __do_page_fault
                          trace_do_page_fault
                          do_async_page_fault
                        + async_page_fault
           0.63% native_send_call_func_single_ipi
              generic_exec_single
              smp_call_function_single

And the same profile output from 3.19 shows:

-    9.61%     9.61%  [kernel]            [k] default_send_IPI_mask_sequence_phys
   - default_send_IPI_mask_sequence_phys
      - 99.98% physflat_send_IPI_mask
         - 96.26% native_send_call_func_ipi
              smp_call_function_many
            - native_flush_tlb_others
               - 98.44% flush_tlb_page
                    ptep_clear_flush
                    try_to_unmap_one
                    rmap_walk
                    try_to_unmap
                    migrate_pages
                    migrate_misplaced_page
                    handle_mm_fault
               + 1.56% flush_tlb_mm_range
         + 3.74% native_send_call_func_single_ipi

So either there's been a massive increase in the number of IPIs
being sent, or the cost per IPI has greatly increased. Either way,
the result is a pretty significant performance degradation.
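
One rough way to tell those two cases apart would be to count the IPIs
over a run on each kernel and compare. The sketch below assumes the x86
/proc/interrupts layout, with per-CPU counters on rows labelled "CAL"
(function call interrupts) and "TLB" (TLB shootdowns); the function
names are hypothetical and this is not something that was run for the
numbers above.

```python
def ipi_counts(interrupts_text, labels=("CAL", "TLB")):
    """Sum the per-CPU counters of the selected /proc/interrupts rows."""
    totals = {}
    for line in interrupts_text.splitlines():
        name, sep, rest = line.partition(":")
        name = name.strip()
        if not sep or name not in labels:
            continue
        # Per-CPU counters come first; stop at the textual description.
        total = 0
        for field in rest.split():
            if not field.isdigit():
                break
            total += int(field)
        totals[name] = total
    return totals


def delta(before, after):
    """Return how much each counter grew between two snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}


def snapshot(path="/proc/interrupts"):
    """Read the current counters from the running kernel (Linux only)."""
    with open(path) as f:
        return ipi_counts(f.read())
```

Taking snapshot() before and after an xfs_repair run and passing both
to delta() gives the number of IPIs sent during the run. If that number
is similar on 3.19 and 4.0-rc1, the cost per IPI is the more likely
culprit; if it is much larger on 4.0-rc1, the count is.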

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


Thread overview: 17+ messages
2015-03-02 19:17 [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation Matt
2015-03-02 19:25 ` Dave Hansen
2015-03-02 19:45   ` Matt
  -- strict thread matches above, loose matches on Subject: below --
2015-03-02  1:04 Dave Chinner
2015-03-02 19:47 ` Linus Torvalds
2015-03-03  1:47   ` Dave Chinner
2015-03-03  2:22     ` Linus Torvalds
2015-03-03  2:37       ` Linus Torvalds
2015-03-03  5:20         ` Dave Chinner
2015-03-03  6:56           ` Linus Torvalds
2015-03-03 11:34             ` Dave Chinner
2015-03-03 13:43               ` Mel Gorman
2015-03-03 21:33                 ` Dave Chinner
2015-03-04 20:00                   ` Mel Gorman
2015-03-04 23:00                     ` Dave Chinner
2015-03-04 23:35                       ` Ingo Molnar
2015-03-04 23:51                         ` Dave Chinner
