public inbox for linux-kernel@vger.kernel.org
  * Re: [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation
           [not found] <1404413420.8764.42.camel@j-VirtualBox>
           [not found] ` <1404416236.3179.18.camel@buesod1.americas.hpqcorp.net>
    @ 2014-07-04  1:01 ` Dave Chinner
      2014-07-04  1:46   ` Jason Low
      1 sibling, 1 reply; 22+ messages in thread
    From: Dave Chinner @ 2014-07-04  1:01 UTC (permalink / raw)
      To: Jason Low
      Cc: Davidlohr Bueso, Peter Zijlstra, Tim Chen, Ingo Molnar,
    	linux-kernel
    
    [re-added lkml]
    
    On Thu, Jul 03, 2014 at 11:50:20AM -0700, Jason Low wrote:
    > On Wed, Jul 2, 2014 at 7:32 PM, Dave Chinner <david@fromorbit.com> wrote:
    > > This is what the kernel profile looks like on the strided run:
    > >
    > > -  83.06%  [kernel]  [k] osq_lock
    > >    - osq_lock
    > >       - 100.00% rwsem_down_write_failed
    > >          - call_rwsem_down_write_failed
    > >             - 99.55% sys_mprotect
    > >                  tracesys
    > >                  __GI___mprotect
    > > -  12.02%  [kernel]  [k] rwsem_down_write_failed
    > 
    > Hi Dave,
    > 
    > So with no sign of rwsem_spin_on_owner(), yet with such heavy contention in
    > osq_lock, this makes me wonder if it's spending most of its time spinning
    > on !owner while a reader has the lock? (We don't set sem->owner for the readers.)
    > 
    > If that's an issue, maybe the below is worth a test, in which we'll just
    > avoid spinning if rwsem_can_spin_on_owner() finds that there is no owner.
    > If we just had to enter the slowpath yet there is no owner, we'll be conservative
    > and assume readers have the lock.
    
    That makes it quite a bit faster:
    
            XFS_REPAIR Summary    Fri Jul  4 10:39:32 2014
    
    Phase           Start           End             Duration
    Phase 1:        07/04 10:38:04  07/04 10:38:05  1 second
    Phase 2:        07/04 10:38:05  07/04 10:38:08  3 seconds
    Phase 3:        07/04 10:38:08  07/04 10:39:12  1 minute, 4 seconds
    Phase 4:        07/04 10:39:12  07/04 10:39:21  9 seconds
    Phase 5:        07/04 10:39:21  07/04 10:39:22  1 second
    Phase 6:        07/04 10:39:22  07/04 10:39:30  8 seconds
    Phase 7:        07/04 10:39:30  07/04 10:39:30  
    
    Total run time: 1 minute, 26 seconds
    done
    
    real    1m28.504s
    user    1m23.990s
    sys     3m20.132s
    
    So system time goes down massively, and speed comes up to within 30%
    of the single AG run. It's still 2-3000 IOPS down, but that's
    acceptable for the moment.
    
    FWIW, the kernel profile for the multi-AG run now looks like:
    
      29.64%  [kernel]  [k] _raw_spin_unlock_irq
       - _raw_spin_unlock_irq
          + 35.34% __schedule
          - 34.15% call_rwsem_down_write_failed
             + 99.27% sys_mprotect
          - 30.02% call_rwsem_down_read_failed
               95.59% __do_page_fault
    -  24.65%  [kernel]  [k] _raw_spin_unlock_irqrestore
       - _raw_spin_unlock_irqrestore
          - 69.38% rwsem_wake
             - call_rwsem_wake
                - 83.32% sys_mprotect
                + 15.54% __do_page_fault
          + 22.55% try_to_wake_up
    +   9.77%  [kernel]  [k] default_send_IPI_mask_sequence_phys
    -   3.21%  [kernel]  [k] smp_call_function_many
       - smp_call_function_many
          - 99.22% flush_tlb_page
    +   2.51%  [kernel]  [k] rwsem_down_write_failed
    
    It's much more like the 3.15 profile - it's only wasting half the
    CPU spinning on the internal spinlock, and it's now going fast enough
    to be blowing another 10-12% of the CPU time sending TLB flushes to
    other CPUs....
    
    One thing I did notice, even with the single-AG-at-a-time run, is
    that the system time is *significantly* reduced with this patch,
    even though it doesn't change performance.
    
    i.e.:
    
    	unpatched	patched
    runtime   0m58s		  1m2s
    systime   4m55s		  1m1s
    
    So not spinning when there are read holders is a major win even
    when there are few threads contending on read/write.
    
    FWIW, the rwsems in the struct xfs_inode are often heavily
    read/write contended, so there are lots of IO related workloads that
    are going to regress on XFS without this optimisation...
    
    Anyway, consider the patch:
    
    Tested-by: Dave Chinner <dchinner@redhat.com>
    
    Cheers,
    
    Dave.
    -- 
    Dave Chinner
    david@fromorbit.com
    
  * [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation
    @ 2014-07-03  2:32 Dave Chinner
      2014-07-03  3:31 ` Davidlohr Bueso
      0 siblings, 1 reply; 22+ messages in thread
    From: Dave Chinner @ 2014-07-03  2:32 UTC (permalink / raw)
      To: linux-kernel; +Cc: Davidlohr Bueso, Peter Zijlstra, Tim Chen, Ingo Molnar
    
    Hi folks,
    
    I've got a workload that hammers the mmap_sem via multi-threaded
    memory allocation and page faults: it's called xfs_repair. I've been
    debugging problems with the latest release, and in the process of
    tracking down some recent regressions I noticed that turning off all
    the cross-ag IO concurrency resulted in the repair running about
    4-5x faster.
    
    i.e.:
    
    # time xfs_repair -f -vv fs.img
    .....
    
            XFS_REPAIR Summary    Thu Jul  3 11:51:41 2014
    
    Phase           Start           End             Duration
    Phase 1:        07/03 11:47:15  07/03 11:47:15
    Phase 2:        07/03 11:47:15  07/03 11:47:35  20 seconds
    Phase 3:        07/03 11:47:35  07/03 11:51:26  3 minutes, 51 seconds
    Phase 4:        07/03 11:51:26  07/03 11:51:31  5 seconds
    Phase 5:        07/03 11:51:31  07/03 11:51:31
    Phase 6:        07/03 11:51:31  07/03 11:51:39  8 seconds
    Phase 7:        07/03 11:51:39  07/03 11:51:39
    
    Total run time: 4 minutes, 24 seconds
    done
    
    real    4m26.399s
    user    1m6.023s
    sys     27m26.707s
    $
    
    And turning off AG striding:
    
    # time xfs_repair -f -vv -o ag_stride=-1 fs.img
    .....
            XFS_REPAIR Summary    Thu Jul  3 11:41:28 2014
    
    Phase           Start           End             Duration
    Phase 1:        07/03 11:40:27  07/03 11:40:27  
    Phase 2:        07/03 11:40:27  07/03 11:40:36  9 seconds
    Phase 3:        07/03 11:40:36  07/03 11:41:12  36 seconds
    Phase 4:        07/03 11:41:12  07/03 11:41:17  5 seconds
    Phase 5:        07/03 11:41:17  07/03 11:41:18  1 second
    Phase 6:        07/03 11:41:18  07/03 11:41:25  7 seconds
    Phase 7:        07/03 11:41:25  07/03 11:41:25  
    
    Total run time: 58 seconds
    done
    
    real    0m59.893s
    user    0m41.935s
    sys     4m55.867s
    $
    
    The difference is in phase 2 and 3, which is where all the memory
    allocation and IO that populates the userspace buffer cache takes
    place. The remainder of the phases run from the cache. All IO is
    direct IO, so there is no kernel caching at all. The filesystem
    image has a lot of metadata in it - it has 336 AGs and the buffer
    cache grows to about 6GB in size during phase 3.
    
    The difference in performance is in the system CPU time, and it
    results directly in lower IO dispatch rates - about 2,000 IOPS
    instead of ~12,000.
    
    This is what the kernel profile looks like on the strided run:
    
    -  83.06%  [kernel]  [k] osq_lock
       - osq_lock
          - 100.00% rwsem_down_write_failed
             - call_rwsem_down_write_failed
                - 99.55% sys_mprotect
                     tracesys
                     __GI___mprotect
    -  12.02%  [kernel]  [k] rwsem_down_write_failed
         rwsem_down_write_failed
         call_rwsem_down_write_failed
    +   1.09%  [kernel]  [k] _raw_spin_unlock_irqrestore
    +   0.92%  [kernel]  [k] _raw_spin_unlock_irq
    +   0.68%  [kernel]  [k] __do_softirq
    +   0.33%  [kernel]  [k] default_send_IPI_mask_sequence_phys
    +   0.10%  [kernel]  [k] __do_page_fault
    
    Basically, all the kernel time is spent processing lock contention
    rather than doing real work.
    
    I haven't tested back on 3.15 yet, but historically the strided AG
    repair for such filesystems (which I test all the time on 100+500TB
    filesystem images) is about 20-25% faster on this storage subsystem.
    Yes, the old code also burnt a lot of CPU due to lock contention,
    but it didn't go 5x slower than having no threading at all.
    
    So this looks like a significant performance regression due to:
    
    4fc828e locking/rwsem: Support optimistic spinning
    
    which changed the rwsem behaviour in 3.16-rc1.
    
    Cheers,
    
    Dave.
    -- 
    Dave Chinner
    david@fromorbit.com
    

    Thread overview: 22+ messages
         [not found] <1404413420.8764.42.camel@j-VirtualBox>
         [not found] ` <1404416236.3179.18.camel@buesod1.americas.hpqcorp.net>
    2014-07-03 20:08   ` [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation Davidlohr Bueso
    2014-07-04  1:01 ` Dave Chinner
    2014-07-04  1:46   ` Jason Low
    2014-07-04  1:54     ` Jason Low
    2014-07-04  6:13       ` Dave Chinner
    2014-07-04  7:06         ` Jason Low
    2014-07-04  8:21           ` Dave Chinner
    2014-07-04  8:53           ` Davidlohr Bueso
    2014-07-05  3:14             ` Jason Low
    2014-07-04  7:52       ` Peter Zijlstra
    2014-07-04  8:40         ` Davidlohr Bueso
    2014-07-05  3:49           ` Jason Low
         [not found]             ` <CAAW_DMjgd5+EOvZX7_iZe-jHp=00Nf7MX3z6hBCRPgOfqnMtEA@mail.gmail.com>
    2014-07-14  9:55               ` Peter Zijlstra
    2014-07-14 17:10                 ` Jason Low
    2014-07-15  2:17                 ` Dave Chinner
    2014-07-16 19:20             ` [tip:locking/urgent] locking/rwsem: Allow conservative optimistic spinning when readers have lock tip-bot for Jason Low
    2014-07-03  2:32 [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation Dave Chinner
    2014-07-03  3:31 ` Davidlohr Bueso
    2014-07-03  4:59   ` Dave Chinner
    2014-07-03  5:39     ` Dave Chinner
    2014-07-03  7:38       ` Peter Zijlstra
    2014-07-03  7:56         ` Dave Chinner
    
