* [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation
@ 2014-07-03 2:32 Dave Chinner
2014-07-03 3:31 ` Davidlohr Bueso
0 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2014-07-03 2:32 UTC (permalink / raw)
To: linux-kernel; +Cc: Davidlohr Bueso, Peter Zijlstra, Tim Chen, Ingo Molnar
Hi folks,
I've got a workload that hammers the mmap_sem via multi-threaded
memory allocation and page faults: it's called xfs_repair. I've been
debugging problems with the latest release, and in the process of
tracking down some recent regressions I noticed that turning off all
the cross-ag IO concurrency resulted in the repair running about
4-5x faster.
i.e.:
# time xfs_repair -f -vv fs.img
.....
XFS_REPAIR Summary Thu Jul 3 11:51:41 2014
Phase          Start           End             Duration
Phase 1:       07/03 11:47:15  07/03 11:47:15
Phase 2:       07/03 11:47:15  07/03 11:47:35  20 seconds
Phase 3:       07/03 11:47:35  07/03 11:51:26  3 minutes, 51 seconds
Phase 4:       07/03 11:51:26  07/03 11:51:31  5 seconds
Phase 5:       07/03 11:51:31  07/03 11:51:31
Phase 6:       07/03 11:51:31  07/03 11:51:39  8 seconds
Phase 7:       07/03 11:51:39  07/03 11:51:39
Total run time: 4 minutes, 24 seconds
done
real 4m26.399s
user 1m6.023s
sys 27m26.707s
$
And turning off AG striding:
# time xfs_repair -f -vv -o ag_stride=-1 fs.img
.....
XFS_REPAIR Summary Thu Jul 3 11:41:28 2014
Phase          Start           End             Duration
Phase 1:       07/03 11:40:27  07/03 11:40:27
Phase 2:       07/03 11:40:27  07/03 11:40:36  9 seconds
Phase 3:       07/03 11:40:36  07/03 11:41:12  36 seconds
Phase 4:       07/03 11:41:12  07/03 11:41:17  5 seconds
Phase 5:       07/03 11:41:17  07/03 11:41:18  1 second
Phase 6:       07/03 11:41:18  07/03 11:41:25  7 seconds
Phase 7:       07/03 11:41:25  07/03 11:41:25
Total run time: 58 seconds
done
real 0m59.893s
user 0m41.935s
sys 4m55.867s
$
The difference is in phase 2 and 3, which is where all the memory
allocation and IO that populates the userspace buffer cache takes
place. The remainder of the phases run from the cache. All IO is
direct IO, so there is no kernel caching at all. The filesystem
image has a lot of metadata in it - it has 336 AGs and the buffer
cache grows to about 6GB in size during phase 3.
The difference in performance is in the system CPU time, and it
results directly in lower IO dispatch rates - about 2,000 IOPS
instead of ~12,000.
This is what the kernel profile looks like on the strided run:
-  83.06%  [kernel]  [k] osq_lock
   - osq_lock
      - 100.00% rwsem_down_write_failed
        - call_rwsem_down_write_failed
          - 99.55% sys_mprotect
               tracesys
               __GI___mprotect
-  12.02%  [kernel]  [k] rwsem_down_write_failed
     rwsem_down_write_failed
     call_rwsem_down_write_failed
+   1.09%  [kernel]  [k] _raw_spin_unlock_irqrestore
+   0.92%  [kernel]  [k] _raw_spin_unlock_irq
+   0.68%  [kernel]  [k] __do_softirq
+   0.33%  [kernel]  [k] default_send_IPI_mask_sequence_phys
+   0.10%  [kernel]  [k] __do_page_fault
Basically, all the kernel time is spent processing lock contention
rather than doing real work.
I haven't tested back on 3.15 yet, but historically the strided AG
repair for such filesystems (which I test all the time on 100+500TB
filesystem images) is about 20-25% faster on this storage subsystem.
Yes, the old code also burnt a lot of CPU due to lock contention,
but it didn't go 5x slower than having no threading at all.
So this looks like a significant performance regression due to:
4fc828e locking/rwsem: Support optimistic spinning
which changed the rwsem behaviour in 3.16-rc1.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation
  2014-07-03  2:32 [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation Dave Chinner
@ 2014-07-03  3:31 ` Davidlohr Bueso
  2014-07-03  4:59   ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Davidlohr Bueso @ 2014-07-03  3:31 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-kernel, Peter Zijlstra, Tim Chen, Ingo Molnar

On Thu, 2014-07-03 at 12:32 +1000, Dave Chinner wrote:
> Hi folks,
>
> I've got a workload that hammers the mmap_sem via multi-threaded
> memory allocation and page faults: it's called xfs_repair.

Another reason for concurrent address space operations :/

> I've been
> debugging problems with the latest release, and in the process of
> tracking down some recent regressions I noticed that turning off all
> the cross-ag IO concurrency resulted in the repair running about
> 4-5x faster.
>
> i.e.:
>
> # time xfs_repair -f -vv fs.img
> .....
>
>         XFS_REPAIR Summary    Thu Jul  3 11:51:41 2014
>
> Phase          Start           End             Duration
> Phase 1:       07/03 11:47:15  07/03 11:47:15
> Phase 2:       07/03 11:47:15  07/03 11:47:35  20 seconds
> Phase 3:       07/03 11:47:35  07/03 11:51:26  3 minutes, 51 seconds
> Phase 4:       07/03 11:51:26  07/03 11:51:31  5 seconds
> Phase 5:       07/03 11:51:31  07/03 11:51:31
> Phase 6:       07/03 11:51:31  07/03 11:51:39  8 seconds
> Phase 7:       07/03 11:51:39  07/03 11:51:39
>
> Total run time: 4 minutes, 24 seconds
> done
>
> real    4m26.399s
> user    1m6.023s
> sys     27m26.707s
> $
>
> And turning off AG striding:
>
> # time xfs_repair -f -vv -o ag_stride=-1 fs.img
> .....
>         XFS_REPAIR Summary    Thu Jul  3 11:41:28 2014
>
> Phase          Start           End             Duration
> Phase 1:       07/03 11:40:27  07/03 11:40:27
> Phase 2:       07/03 11:40:27  07/03 11:40:36  9 seconds
> Phase 3:       07/03 11:40:36  07/03 11:41:12  36 seconds

The *real* degradation is here then.

> Phase 4:       07/03 11:41:12  07/03 11:41:17  5 seconds
> Phase 5:       07/03 11:41:17  07/03 11:41:18  1 second
> Phase 6:       07/03 11:41:18  07/03 11:41:25  7 seconds
> Phase 7:       07/03 11:41:25  07/03 11:41:25
>
> Total run time: 58 seconds
> done
>
> real    0m59.893s
> user    0m41.935s
> sys     4m55.867s
> $
>
> The difference is in phase 2 and 3, which is where all the memory
> allocation and IO that populates the userspace buffer cache takes
> place. The remainder of the phases run from the cache. All IO is
> direct IO, so there is no kernel caching at all. The filesystem
> image has a lot of metadata in it - it has 336 AGs and the buffer
> cache grows to about 6GB in size during phase 3.
>
> The difference in performance is in the system CPU time, and it
> results directly in lower IO dispatch rates - about 2,000 IOPS
> instead of ~12,000.
>
> This is what the kernel profile looks like on the strided run:
>
> -  83.06%  [kernel]  [k] osq_lock
>    - osq_lock
>       - 100.00% rwsem_down_write_failed
>         - call_rwsem_down_write_failed
>           - 99.55% sys_mprotect
>                tracesys
>                __GI___mprotect
> -  12.02%  [kernel]  [k] rwsem_down_write_failed
>      rwsem_down_write_failed
>      call_rwsem_down_write_failed
> +   1.09%  [kernel]  [k] _raw_spin_unlock_irqrestore
> +   0.92%  [kernel]  [k] _raw_spin_unlock_irq
> +   0.68%  [kernel]  [k] __do_softirq
> +   0.33%  [kernel]  [k] default_send_IPI_mask_sequence_phys
> +   0.10%  [kernel]  [k] __do_page_fault
>
> Basically, all the kernel time is spent processing lock contention
> rather than doing real work.

While before it just blocked.

> I haven't tested back on 3.15 yet, but historically the strided AG
> repair for such filesystems (which I test all the time on 100+500TB
> filesystem images) is about 20-25% faster on this storage subsystem.
> Yes, the old code also burnt a lot of CPU due to lock contention,
> but it didn't go 5x slower than having no threading at all.
>
> So this looks like a significant performance regression due to:
>
> 4fc828e locking/rwsem: Support optimistic spinning
>
> which changed the rwsem behaviour in 3.16-rc1.

So the mmap_sem is held long enough in this workload that the cost of
blocking is actually significantly smaller than just spinning --
particularly MCS. How many threads are you running when you see the
issue?

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation
  2014-07-03  3:31 ` Davidlohr Bueso
@ 2014-07-03  4:59   ` Dave Chinner
  2014-07-03  5:39     ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2014-07-03  4:59 UTC (permalink / raw)
To: Davidlohr Bueso; +Cc: linux-kernel, Peter Zijlstra, Tim Chen, Ingo Molnar

On Wed, Jul 02, 2014 at 08:31:08PM -0700, Davidlohr Bueso wrote:
> On Thu, 2014-07-03 at 12:32 +1000, Dave Chinner wrote:
> > Hi folks,
> >
> > I've got a workload that hammers the mmap_sem via multi-threaded
> > memory allocation and page faults: it's called xfs_repair.
>
> Another reason for concurrent address space operations :/

*nod*

> >         XFS_REPAIR Summary    Thu Jul  3 11:41:28 2014
> >
> > Phase          Start           End             Duration
> > Phase 1:       07/03 11:40:27  07/03 11:40:27
> > Phase 2:       07/03 11:40:27  07/03 11:40:36  9 seconds
> > Phase 3:       07/03 11:40:36  07/03 11:41:12  36 seconds
>
> The *real* degradation is here then.

Yes, as I said, it's in phase 2 and 3 where all the IO and memory
allocation is done.

> > This is what the kernel profile looks like on the strided run:
> >
> > -  83.06%  [kernel]  [k] osq_lock
> >    - osq_lock
> >       - 100.00% rwsem_down_write_failed
> >         - call_rwsem_down_write_failed
> >           - 99.55% sys_mprotect
> >                tracesys
> >                __GI___mprotect
> > -  12.02%  [kernel]  [k] rwsem_down_write_failed
> >      rwsem_down_write_failed
> >      call_rwsem_down_write_failed
> > +   1.09%  [kernel]  [k] _raw_spin_unlock_irqrestore
> > +   0.92%  [kernel]  [k] _raw_spin_unlock_irq
> > +   0.68%  [kernel]  [k] __do_softirq
> > +   0.33%  [kernel]  [k] default_send_IPI_mask_sequence_phys
> > +   0.10%  [kernel]  [k] __do_page_fault
> >
> > Basically, all the kernel time is spent processing lock contention
> > rather than doing real work.
>
> While before it just blocked.

Yup, pretty much - there was contention on the rwsem internal
spinlock, but nothing else burnt CPU time.

> > I haven't tested back on 3.15 yet, but historically the strided AG
> > repair for such filesystems (which I test all the time on 100+500TB
> > filesystem images) is about 20-25% faster on this storage subsystem.
> > Yes, the old code also burnt a lot of CPU due to lock contention,
> > but it didn't go 5x slower than having no threading at all.
> >
> > So this looks like a significant performance regression due to:
> >
> > 4fc828e locking/rwsem: Support optimistic spinning
> >
> > which changed the rwsem behaviour in 3.16-rc1.
>
> So the mmap_sem is held long enough in this workload that the cost of
> blocking is actually significantly smaller than just spinning --

The issue is that the memory allocation pattern alternates between
read and write locks. i.e. write lock on mprotect at allocation,
read lock on page fault when processing the contents. Hence we have
a pretty consistent pattern of allocation (and hence mprotect)
in prefetch threads, while the page faults are in the
processing threads.

> particularly MCS. How many threads are you running when you see the
> issue?

Lots. In phase 3:

$ ps -eLF |grep [x]fs_repair | wc -l
140
$

We use 6 threads per AG being processed:

	- 4 metadata prefetch threads (do allocation and IO),
	- 1 prefetch control thread
	- 1 metadata processing thread (do page faults)

That works out about right - default is to create a new processing
group every 15 AGs, so with 336 AGs we should have roughly 22 AGs
being processed concurrently...

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation
  2014-07-03  4:59   ` Dave Chinner
@ 2014-07-03  5:39     ` Dave Chinner
  2014-07-03  7:38       ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2014-07-03  5:39 UTC (permalink / raw)
To: Davidlohr Bueso; +Cc: linux-kernel, Peter Zijlstra, Tim Chen, Ingo Molnar

On Thu, Jul 03, 2014 at 02:59:33PM +1000, Dave Chinner wrote:
> On Wed, Jul 02, 2014 at 08:31:08PM -0700, Davidlohr Bueso wrote:
> > On Thu, 2014-07-03 at 12:32 +1000, Dave Chinner wrote:
> > > Hi folks,
> > >
> > > I've got a workload that hammers the mmap_sem via multi-threaded
> > > memory allocation and page faults: it's called xfs_repair.
> >
> > Another reason for concurrent address space operations :/
>
> *nod*
>
> > >         XFS_REPAIR Summary    Thu Jul  3 11:41:28 2014
> > >
> > > Phase          Start           End             Duration
> > > Phase 1:       07/03 11:40:27  07/03 11:40:27
> > > Phase 2:       07/03 11:40:27  07/03 11:40:36  9 seconds
> > > Phase 3:       07/03 11:40:36  07/03 11:41:12  36 seconds
> >
> > The *real* degradation is here then.
>
> Yes, as I said, it's in phase 2 and 3 where all the IO and memory
> allocation is done.
>
> > > This is what the kernel profile looks like on the strided run:
> > >
> > > -  83.06%  [kernel]  [k] osq_lock
> > >    - osq_lock
> > >       - 100.00% rwsem_down_write_failed
> > >         - call_rwsem_down_write_failed
> > >           - 99.55% sys_mprotect
> > >                tracesys
> > >                __GI___mprotect
> > > -  12.02%  [kernel]  [k] rwsem_down_write_failed
> > >      rwsem_down_write_failed
> > >      call_rwsem_down_write_failed
> > > +   1.09%  [kernel]  [k] _raw_spin_unlock_irqrestore
> > > +   0.92%  [kernel]  [k] _raw_spin_unlock_irq
> > > +   0.68%  [kernel]  [k] __do_softirq
> > > +   0.33%  [kernel]  [k] default_send_IPI_mask_sequence_phys
> > > +   0.10%  [kernel]  [k] __do_page_fault
> > >
> > > Basically, all the kernel time is spent processing lock contention
> > > rather than doing real work.
> >
> > While before it just blocked.
>
> Yup, pretty much - there was contention on the rwsem internal
> spinlock, but nothing else burnt CPU time.
>
> > > I haven't tested back on 3.15 yet, but historically the strided AG
> > > repair for such filesystems (which I test all the time on 100+500TB
> > > filesystem images) is about 20-25% faster on this storage subsystem.
> > > Yes, the old code also burnt a lot of CPU due to lock contention,
> > > but it didn't go 5x slower than having no threading at all.
> > >
> > > So this looks like a significant performance regression due to:
> > >
> > > 4fc828e locking/rwsem: Support optimistic spinning
> > >
> > > which changed the rwsem behaviour in 3.16-rc1.
> >
> > So the mmap_sem is held long enough in this workload that the cost of
> > blocking is actually significantly smaller than just spinning --
>
> The issue is that the memory allocation pattern alternates between
> read and write locks. i.e. write lock on mprotect at allocation,
> read lock on page fault when processing the contents. Hence we have
> a pretty consistent pattern of allocation (and hence mprotect)
> in prefetch threads, while the page faults are in the
> processing threads.
>
> > particularly MCS. How many threads are you running when you see the
> > issue?
>
> Lots. In phase 3:
>
> $ ps -eLF |grep [x]fs_repair | wc -l
> 140
> $
>
> We use 6 threads per AG being processed:
>
> 	- 4 metadata prefetch threads (do allocation and IO),
> 	- 1 prefetch control thread
> 	- 1 metadata processing thread (do page faults)
>
> That works out about right - default is to create a new processing
> group every 15 AGs, so with 336 AGs we should have roughly 22 AGs
> being processed concurrently...

There's another regression with the optimistic spinning in rwsems
as well: it increases the size of the struct rw_semaphore by 16
bytes. That has increased the size of the struct xfs_inode by 32
bytes.

That's pretty damn significant - it's not uncommon to see machines
with tens of millions of cached XFS inodes, so increasing the size
of the inode by 4% is actually very significant. That's enough to go
from having a well balanced workload to not being able to fit the
working set of inodes in memory.

Filesystem developers will do almost anything to remove a few bytes
from the struct inode because inode cache footprint is extremely
important for performance. We also tend to get upset and
unreasonable when other people undo that hard work by making changes
that bloat the generic structures embedded in the inode
structures....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation
  2014-07-03  5:39     ` Dave Chinner
@ 2014-07-03  7:38       ` Peter Zijlstra
  2014-07-03  7:56         ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2014-07-03  7:38 UTC (permalink / raw)
To: Dave Chinner
Cc: Davidlohr Bueso, linux-kernel, Tim Chen, Ingo Molnar, jason.low2

[-- Attachment #1: Type: text/plain, Size: 1046 bytes --]

On Thu, Jul 03, 2014 at 03:39:11PM +1000, Dave Chinner wrote:
> There's another regression with the optimistic spinning in rwsems
> as well: it increases the size of the struct rw_semaphore by 16
> bytes. That has increased the size of the struct xfs_inode by 32
> bytes.
>
> That's pretty damn significant - it's not uncommon to see machines
> with tens of millions of cached XFS inodes, so increasing the size
> of the inode by 4% is actually very significant. That's enough to go
> from having a well balanced workload to not being able to fit the
> working set of inodes in memory.
>
> Filesystem developers will do almost anything to remove a few bytes
> from the struct inode because inode cache footprint is extremely
> important for performance. We also tend to get upset and
> unreasonable when other people undo that hard work by making changes
> that bloat the generic structures embedded in the inode
> structures....

Jason Low actually did a patch, yesterday, to shrink rwsem back to its
old size (on 64bit).

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation
  2014-07-03  7:38       ` Peter Zijlstra
@ 2014-07-03  7:56         ` Dave Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2014-07-03  7:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Davidlohr Bueso, linux-kernel, Tim Chen, Ingo Molnar, jason.low2

On Thu, Jul 03, 2014 at 09:38:52AM +0200, Peter Zijlstra wrote:
> On Thu, Jul 03, 2014 at 03:39:11PM +1000, Dave Chinner wrote:
> > There's another regression with the optimistic spinning in rwsems
> > as well: it increases the size of the struct rw_semaphore by 16
> > bytes. That has increased the size of the struct xfs_inode by 32
> > bytes.
> >
> > That's pretty damn significant - it's not uncommon to see machines
> > with tens of millions of cached XFS inodes, so increasing the size
> > of the inode by 4% is actually very significant. That's enough to go
> > from having a well balanced workload to not being able to fit the
> > working set of inodes in memory.
> >
> > Filesystem developers will do almost anything to remove a few bytes
> > from the struct inode because inode cache footprint is extremely
> > important for performance. We also tend to get upset and
> > unreasonable when other people undo that hard work by making changes
> > that bloat the generic structures embedded in the inode
> > structures....
>
> Jason Low actually did a patch, yesterday, to shrink rwsem back to its
> old size (on 64bit).

That's good to know. Thanks, Peter.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread
[parent not found: <1404413420.8764.42.camel@j-VirtualBox>]
[parent not found: <1404416236.3179.18.camel@buesod1.americas.hpqcorp.net>]
* Re: [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation
       [not found] ` <1404416236.3179.18.camel@buesod1.americas.hpqcorp.net>
@ 2014-07-03 20:08   ` Davidlohr Bueso
  0 siblings, 0 replies; 21+ messages in thread
From: Davidlohr Bueso @ 2014-07-03 20:08 UTC (permalink / raw)
To: Jason Low
Cc: Dave Chinner, Peter Zijlstra, Tim Chen, Ingo Molnar, linux-kernel

Adding lkml.

On Thu, 2014-07-03 at 12:37 -0700, Davidlohr Bueso wrote:
> On Thu, 2014-07-03 at 11:50 -0700, Jason Low wrote:
> > On Wed, Jul 2, 2014 at 7:32 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > This is what the kernel profile looks like on the strided run:
> > >
> > > -  83.06%  [kernel]  [k] osq_lock
> > >    - osq_lock
> > >       - 100.00% rwsem_down_write_failed
> > >         - call_rwsem_down_write_failed
> > >           - 99.55% sys_mprotect
> > >                tracesys
> > >                __GI___mprotect
> > > -  12.02%  [kernel]  [k] rwsem_down_write_failed
> >
> > Hi Dave,
> >
> > So with no sign of rwsem_spin_on_owner(), yet with such heavy contention in
> > osq_lock, this makes me wonder if it's spending most of its time spinning
> > on !owner while a reader has the lock? (We don't set sem->owner for the readers.)
>
> That would explain the long hold times with the memory allocation
> patterns between read and write locking described by Dave.
>
> > If that's an issue, maybe the below is worth a test, in which we'll just
> > avoid spinning if rwsem_can_spin_on_owner() finds that there is no owner.
> > If we just had to enter the slowpath yet there is no owner, we'll be conservative
> > and assume readers have the lock.
>
> I do worry a bit about the effects here when this is not an issue.
> Workloads that have smaller hold times could very well take a
> performance hit by blocking right away instead of wasting a few extra
> cycles just spinning.
>
> > (David, you've tested something like this in the original patch with AIM7 and still
> > got the big performance boosts right?)
>
> I have not, but will. I wouldn't mind sacrificing a bit of the great
> performance numbers we're getting on workloads that mostly take the lock
> for writing, if it means not being so devastating for when readers are
> in the picture. This is a major difference with mutexes wrt optimistic
> spinning.
>
> Thanks,
> Davidlohr

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation
       [not found] <1404413420.8764.42.camel@j-VirtualBox>
       [not found] ` <1404416236.3179.18.camel@buesod1.americas.hpqcorp.net>
@ 2014-07-04  1:01 ` Dave Chinner
  2014-07-04  1:46   ` Jason Low
  1 sibling, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2014-07-04  1:01 UTC (permalink / raw)
To: Jason Low
Cc: Davidlohr Bueso, Peter Zijlstra, Tim Chen, Ingo Molnar, linux-kernel

[re-added lkml]

On Thu, Jul 03, 2014 at 11:50:20AM -0700, Jason Low wrote:
> On Wed, Jul 2, 2014 at 7:32 PM, Dave Chinner <david@fromorbit.com> wrote:
> > This is what the kernel profile looks like on the strided run:
> >
> > -  83.06%  [kernel]  [k] osq_lock
> >    - osq_lock
> >       - 100.00% rwsem_down_write_failed
> >         - call_rwsem_down_write_failed
> >           - 99.55% sys_mprotect
> >                tracesys
> >                __GI___mprotect
> > -  12.02%  [kernel]  [k] rwsem_down_write_failed
>
> Hi Dave,
>
> So with no sign of rwsem_spin_on_owner(), yet with such heavy contention in
> osq_lock, this makes me wonder if it's spending most of its time spinning
> on !owner while a reader has the lock? (We don't set sem->owner for the readers.)
>
> If that's an issue, maybe the below is worth a test, in which we'll just
> avoid spinning if rwsem_can_spin_on_owner() finds that there is no owner.
> If we just had to enter the slowpath yet there is no owner, we'll be conservative
> and assume readers have the lock.

That makes it quite a bit faster:

        XFS_REPAIR Summary    Fri Jul  4 10:39:32 2014

Phase          Start           End             Duration
Phase 1:       07/04 10:38:04  07/04 10:38:05  1 second
Phase 2:       07/04 10:38:05  07/04 10:38:08  3 seconds
Phase 3:       07/04 10:38:08  07/04 10:39:12  1 minute, 4 seconds
Phase 4:       07/04 10:39:12  07/04 10:39:21  9 seconds
Phase 5:       07/04 10:39:21  07/04 10:39:22  1 second
Phase 6:       07/04 10:39:22  07/04 10:39:30  8 seconds
Phase 7:       07/04 10:39:30  07/04 10:39:30

Total run time: 1 minute, 26 seconds
done

real    1m28.504s
user    1m23.990s
sys     3m20.132s

So system time goes down massively, and speed comes up to within 30%
of the single AG run. But it's still 2-3000 IOPS down, but it's
acceptable for the moment.

FWIW, the kernel profile for the multi-AG run now looks like:

-  29.64%  [kernel]  [k] _raw_spin_unlock_irq
   - _raw_spin_unlock_irq
      + 35.34% __schedule
      - 34.15% call_rwsem_down_write_failed
         + 99.27% sys_mprotect
      - 30.02% call_rwsem_down_read_failed
           95.59% __do_page_fault
-  24.65%  [kernel]  [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      - 69.38% rwsem_wake
         - call_rwsem_wake
            - 83.32% sys_mprotect
            + 15.54% __do_page_fault
      + 22.55% try_to_wake_up
+   9.77%  [kernel]  [k] default_send_IPI_mask_sequence_phys
-   3.21%  [kernel]  [k] smp_call_function_many
   - smp_call_function_many
      - 99.22% flush_tlb_page
+   2.51%  [kernel]  [k] rwsem_down_write_failed

It's much more like the 3.15 profile - it's only wasting half the
CPU spinning on the internal spinlock and it's now going fast enough
to be blowing another 10-12% of the CPU time sending tlb flushes to
other CPUs....

One thing I did notice, even with the single-AG-at-a-time run, is
that the system time is *significantly* reduced with this patch,
even though it doesn't change performance.

ie, unpatched vs patched:

               unpatched       patched
runtime        0m58s           1m2s
systime        4m55s           1m1s

So not spinning when there are read holders is a major win even
when there are few threads contending on read/write.

FWIW, the rwsems in the struct xfs_inode are often heavily
read/write contended, so there are lots of IO related workloads that
are going to regress on XFS without this optimisation...

Anyway, consider the patch:

Tested-by: Dave Chinner <dchinner@redhat.com>

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation
  2014-07-04  1:01 ` Dave Chinner
@ 2014-07-04  1:46   ` Jason Low
  2014-07-04  1:54     ` Jason Low
  0 siblings, 1 reply; 21+ messages in thread
From: Jason Low @ 2014-07-04  1:46 UTC (permalink / raw)
To: Dave Chinner
Cc: Davidlohr Bueso, Peter Zijlstra, Tim Chen, Ingo Molnar,
	linux-kernel, Linus Torvalds

On Fri, 2014-07-04 at 11:01 +1000, Dave Chinner wrote:
> [re-added lkml]
>
> On Thu, Jul 03, 2014 at 11:50:20AM -0700, Jason Low wrote:
> > On Wed, Jul 2, 2014 at 7:32 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > This is what the kernel profile looks like on the strided run:
> > >
> > > -  83.06%  [kernel]  [k] osq_lock
> > >    - osq_lock
> > >       - 100.00% rwsem_down_write_failed
> > >         - call_rwsem_down_write_failed
> > >           - 99.55% sys_mprotect
> > >                tracesys
> > >                __GI___mprotect
> > > -  12.02%  [kernel]  [k] rwsem_down_write_failed
> >
> > Hi Dave,
> >
> > So with no sign of rwsem_spin_on_owner(), yet with such heavy contention in
> > osq_lock, this makes me wonder if it's spending most of its time spinning
> > on !owner while a reader has the lock? (We don't set sem->owner for the readers.)
> >
> > If that's an issue, maybe the below is worth a test, in which we'll just
> > avoid spinning if rwsem_can_spin_on_owner() finds that there is no owner.
> > If we just had to enter the slowpath yet there is no owner, we'll be conservative
> > and assume readers have the lock.
>
> That makes it quite a bit faster:
>
>         XFS_REPAIR Summary    Fri Jul  4 10:39:32 2014
>
> Phase          Start           End             Duration
> Phase 1:       07/04 10:38:04  07/04 10:38:05  1 second
> Phase 2:       07/04 10:38:05  07/04 10:38:08  3 seconds
> Phase 3:       07/04 10:38:08  07/04 10:39:12  1 minute, 4 seconds
> Phase 4:       07/04 10:39:12  07/04 10:39:21  9 seconds
> Phase 5:       07/04 10:39:21  07/04 10:39:22  1 second
> Phase 6:       07/04 10:39:22  07/04 10:39:30  8 seconds
> Phase 7:       07/04 10:39:30  07/04 10:39:30
>
> Total run time: 1 minute, 26 seconds
> done
>
> real    1m28.504s
> user    1m23.990s
> sys     3m20.132s
>
> So system time goes down massively, and speed comes up to within 30%
> of the single AG run. But it's still 2-3000 IOPS down, but it's
> acceptable for the moment.
>
> FWIW, the kernel profile for the multi-AG run now looks like:
>
> -  29.64%  [kernel]  [k] _raw_spin_unlock_irq
>    - _raw_spin_unlock_irq
>       + 35.34% __schedule
>       - 34.15% call_rwsem_down_write_failed
>          + 99.27% sys_mprotect
>       - 30.02% call_rwsem_down_read_failed
>            95.59% __do_page_fault
> -  24.65%  [kernel]  [k] _raw_spin_unlock_irqrestore
>    - _raw_spin_unlock_irqrestore
>       - 69.38% rwsem_wake
>          - call_rwsem_wake
>             - 83.32% sys_mprotect
>             + 15.54% __do_page_fault
>       + 22.55% try_to_wake_up
> +   9.77%  [kernel]  [k] default_send_IPI_mask_sequence_phys
> -   3.21%  [kernel]  [k] smp_call_function_many
>    - smp_call_function_many
>       - 99.22% flush_tlb_page
> +   2.51%  [kernel]  [k] rwsem_down_write_failed
>
> It's much more like the 3.15 profile - it's only wasting half the
> CPU spinning on the internal spinlock and it's now going fast enough
> to be blowing another 10-12% of the CPU time sending tlb flushes to
> other CPUs....
>
> One thing I did notice, even with the single-AG-at-a-time run, is
> that the system time is *significantly* reduced with this patch,
> even though it doesn't change performance.
>
> ie, unpatched vs patched:
>
>                unpatched       patched
> runtime        0m58s           1m2s
> systime        4m55s           1m1s
>
> So not spinning when there are read holders is a major win even
> when there are few threads contending on read/write.
>
> FWIW, the rwsems in the struct xfs_inode are often heavily
> read/write contended, so there are lots of IO related workloads that
> are going to regress on XFS without this optimisation...
>
> Anyway, consider the patch:
>
> Tested-by: Dave Chinner <dchinner@redhat.com>

Hi Dave,

Thanks for testing. I'll update the patch with an actual changelog.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation
  2014-07-04  1:46   ` Jason Low
@ 2014-07-04  1:54     ` Jason Low
  2014-07-04  6:13       ` Dave Chinner
  2014-07-04  7:52       ` Peter Zijlstra
  0 siblings, 2 replies; 21+ messages in thread
From: Jason Low @ 2014-07-04  1:54 UTC (permalink / raw)
To: Dave Chinner
Cc: Davidlohr Bueso, Peter Zijlstra, Tim Chen, Ingo Molnar,
	linux-kernel, Linus Torvalds

On Thu, 2014-07-03 at 18:46 -0700, Jason Low wrote:
> On Fri, 2014-07-04 at 11:01 +1000, Dave Chinner wrote:
> > FWIW, the rwsems in the struct xfs_inode are often heavily
> > read/write contended, so there are lots of IO related workloads that
> > are going to regress on XFS without this optimisation...
> >
> > Anyway, consider the patch:
> >
> > Tested-by: Dave Chinner <dchinner@redhat.com>
>
> Hi Dave,
>
> Thanks for testing. I'll update the patch with an actual changelog.

---
Subject: [PATCH] rwsem: In rwsem_can_spin_on_owner(), return false if no owner

It was found that the rwsem optimistic spinning feature can potentially degrade
performance when there are readers. Perf profiles indicate in some workloads
that significant time can be spent spinning on !owner. This is because we don't
set the lock owner when reader(s) obtain the rwsem.

In this patch, we'll modify rwsem_can_spin_on_owner() such that we'll return
false if there is no lock owner. The rationale is that if we just entered the
slowpath, yet there is no lock owner, then there is a possibility that a reader
has the lock. To be conservative, we'll avoid spinning in these situations.

Dave Chinner found performance benefits with this patch in the xfs_repair
workload, where the total run time went from approximately 4 minutes 24 seconds,
down to approximately 1 minute 26 seconds with the patch.

Tested-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
---
 kernel/locking/rwsem-xadd.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index dacc321..c40c7d2 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -285,10 +285,10 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
 static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 {
 	struct task_struct *owner;
-	bool on_cpu = true;
+	bool on_cpu = false;
 
 	if (need_resched())
-		return 0;
+		return false;
 
 	rcu_read_lock();
 	owner = ACCESS_ONCE(sem->owner);
@@ -297,9 +297,9 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 	rcu_read_unlock();
 
 	/*
-	 * If sem->owner is not set, the rwsem owner may have
-	 * just acquired it and not set the owner yet or the rwsem
-	 * has been released.
+	 * If sem->owner is not set, yet we have just recently entered the
+	 * slowpath, then there is a possibility reader(s) may have the lock.
+	 * To be safe, avoid spinning in these situations.
 	 */
 	return on_cpu;
 }
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 21+ messages in thread
* Re: [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation
  2014-07-04  1:54 ` Jason Low
@ 2014-07-04  6:13   ` Dave Chinner
  2014-07-04  7:06     ` Jason Low
  1 sibling, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2014-07-04  6:13 UTC (permalink / raw)
To: Jason Low
Cc: Davidlohr Bueso, Peter Zijlstra, Tim Chen, Ingo Molnar,
	linux-kernel, Linus Torvalds

On Thu, Jul 03, 2014 at 06:54:50PM -0700, Jason Low wrote:
> On Thu, 2014-07-03 at 18:46 -0700, Jason Low wrote:
> > On Fri, 2014-07-04 at 11:01 +1000, Dave Chinner wrote:
> >
> > > FWIW, the rwsems in the struct xfs_inode are often heavily
> > > read/write contended, so there are lots of IO related workloads that
> > > are going to regress on XFS without this optimisation...
> > >
> > > Anyway, consider the patch:
> > >
> > > Tested-by: Dave Chinner <dchinner@redhat.com>
> >
> > Hi Dave,
> >
> > Thanks for testing. I'll update the patch with an actual changelog.
>
> ---
> Subject: [PATCH] rwsem: In rwsem_can_spin_on_owner(), return false if no owner
>
> It was found that the rwsem optimistic spinning feature can potentially degrade
> performance when there are readers. Perf profiles indicate in some workloads
> that significant time can be spent spinning on !owner. This is because we don't
> set the lock owner when reader(s) obtain the rwsem.

I think you're being a little shifty with the truth here. There's no
"potentially degrade performance" here - I reported a massive real
world performance regression caused by optimistic spinning. That is:

"Commit 4fc828e ("locking/rwsem: Support optimistic spinning")
introduced a major performance regression for workloads such as
xfs_repair which mix read and write locking of the mmap_sem across
many threads. The result was xfs_repair ran 5x slower on 3.16-rc2
than on 3.15 and using 20x more system CPU time."

"Perf profiles indicate....
> In this patch, we'll modify rwsem_can_spin_on_owner() such that we'll return
> false if there is no lock owner. The rationale is that if we just entered the
> slowpath, yet there is no lock owner, then there is a possibility that a reader
> has the lock. To be conservative, we'll avoid spinning in these situations.
>
> Dave Chinner found performance benefits with this patch in the xfs_repair
> workload, where the total run time went from approximately 4 minutes 24 seconds,
> down to approximately 1 minute 26 seconds with the patch.

Which brought it back to close to the same performance as on 3.15.

This is not a new performance improvement patch - it's a regression
fix and the commit message needs to reflect that.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation
  2014-07-04  6:13 ` Dave Chinner
@ 2014-07-04  7:06   ` Jason Low
  2014-07-04  8:21     ` Dave Chinner
  2014-07-04  8:53     ` Davidlohr Bueso
  0 siblings, 2 replies; 21+ messages in thread
From: Jason Low @ 2014-07-04  7:06 UTC (permalink / raw)
To: Dave Chinner
Cc: Davidlohr Bueso, Peter Zijlstra, Tim Chen, Ingo Molnar,
	linux-kernel, Linus Torvalds

On Fri, 2014-07-04 at 16:13 +1000, Dave Chinner wrote:
> On Thu, Jul 03, 2014 at 06:54:50PM -0700, Jason Low wrote:
> > On Thu, 2014-07-03 at 18:46 -0700, Jason Low wrote:
> > > On Fri, 2014-07-04 at 11:01 +1000, Dave Chinner wrote:
> > >
> > > > FWIW, the rwsems in the struct xfs_inode are often heavily
> > > > read/write contended, so there are lots of IO related workloads that
> > > > are going to regress on XFS without this optimisation...
> > > >
> > > > Anyway, consider the patch:
> > > >
> > > > Tested-by: Dave Chinner <dchinner@redhat.com>
> > >
> > > Hi Dave,
> > >
> > > Thanks for testing. I'll update the patch with an actual changelog.
> >
> > ---
> > Subject: [PATCH] rwsem: In rwsem_can_spin_on_owner(), return false if no owner
> >
> > It was found that the rwsem optimistic spinning feature can potentially degrade
> > performance when there are readers. Perf profiles indicate in some workloads
> > that significant time can be spent spinning on !owner. This is because we don't
> > set the lock owner when reader(s) obtain the rwsem.
>
> I think you're being a little shifty with the truth here. There's no
> "potentially degrade performance" here - I reported a massive real
> world performance regression caused by optimistic spinning.

Sure, though I mainly used the word "potentially" since there can be
other workloads out there where spinning even when readers have the
lock is a positive thing.

And agreed that the changelog can be modified to try reflecting more
on it being a "regression fix" than a "new performance" addition.

So how about the following?

---
Subject: [PATCH] rwsem: In rwsem_can_spin_on_owner(), return false if no owner

Commit 4fc828e24cd9 ("locking/rwsem: Support optimistic spinning")
introduced a major performance regression for workloads such as
xfs_repair which mix read and write locking of the mmap_sem across
many threads. The result was xfs_repair ran 5x slower on 3.16-rc2
than on 3.15 and using 20x more system CPU time.

Perf profiles indicate in some workloads that significant time can
be spent spinning on !owner. This is because we don't set the lock
owner when reader(s) obtain the rwsem.

In this patch, we'll modify rwsem_can_spin_on_owner() such that we'll
return false if there is no lock owner. The rationale is that if we
just entered the slowpath, yet there is no lock owner, then there is
a possibility that a reader has the lock. To be conservative, we'll
avoid spinning in these situations.

This patch reduced the total run time of the xfs_repair workload from
about 4 minutes 24 seconds down to approximately 1 minute 26 seconds,
back to close to the same performance as on 3.15.

Tested-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
---
 kernel/locking/rwsem-xadd.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index dacc321..c40c7d2 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -285,10 +285,10 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
 static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 {
 	struct task_struct *owner;
-	bool on_cpu = true;
+	bool on_cpu = false;
 
 	if (need_resched())
-		return 0;
+		return false;
 
 	rcu_read_lock();
 	owner = ACCESS_ONCE(sem->owner);
@@ -297,9 +297,9 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 	rcu_read_unlock();
 
 	/*
-	 * If sem->owner is not set, the rwsem owner may have
-	 * just acquired it and not set the owner yet or the rwsem
-	 * has been released.
+	 * If sem->owner is not set, yet we have just recently entered the
+	 * slowpath, then there is a possibility reader(s) may have the lock.
+	 * To be safe, avoid spinning in these situations.
 	 */
 	return on_cpu;
 }
-- 
1.7.9.5
* Re: [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation
  2014-07-04  7:06 ` Jason Low
@ 2014-07-04  8:21   ` Dave Chinner
  2014-07-04  8:53   ` Davidlohr Bueso
  1 sibling, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2014-07-04  8:21 UTC (permalink / raw)
To: Jason Low
Cc: Davidlohr Bueso, Peter Zijlstra, Tim Chen, Ingo Molnar,
	linux-kernel, Linus Torvalds

On Fri, Jul 04, 2014 at 12:06:19AM -0700, Jason Low wrote:
> On Fri, 2014-07-04 at 16:13 +1000, Dave Chinner wrote:
> > On Thu, Jul 03, 2014 at 06:54:50PM -0700, Jason Low wrote:
> > > On Thu, 2014-07-03 at 18:46 -0700, Jason Low wrote:
> > > > On Fri, 2014-07-04 at 11:01 +1000, Dave Chinner wrote:
> > > >
> > > > > FWIW, the rwsems in the struct xfs_inode are often heavily
> > > > > read/write contended, so there are lots of IO related workloads that
> > > > > are going to regress on XFS without this optimisation...
> > > > >
> > > > > Anyway, consider the patch:
> > > > >
> > > > > Tested-by: Dave Chinner <dchinner@redhat.com>
> > > >
> > > > Hi Dave,
> > > >
> > > > Thanks for testing. I'll update the patch with an actual changelog.
> > >
> > > ---
> > > Subject: [PATCH] rwsem: In rwsem_can_spin_on_owner(), return false if no owner
> > >
> > > It was found that the rwsem optimistic spinning feature can potentially degrade
> > > performance when there are readers. Perf profiles indicate in some workloads
> > > that significant time can be spent spinning on !owner. This is because we don't
> > > set the lock owner when reader(s) obtain the rwsem.
> >
> > I think you're being a little shifty with the truth here.
> > There's no "potentially degrade performance" here - I reported a
> > massive real world performance regression caused by optimistic
> > spinning.
>
> Sure, though I mainly used the word "potentially" since there can be
> other workloads out there where spinning even when readers have the lock
> is a positive thing.
>
> And agreed that the changelog can be modified to try reflecting more on
> it being a "regression fix" than a "new performance" addition.
>
> So how about the following?

Looks good. Thanks!

-Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation
  2014-07-04  7:06 ` Jason Low
  2014-07-04  8:21   ` Dave Chinner
@ 2014-07-04  8:53   ` Davidlohr Bueso
  2014-07-05  3:14     ` Jason Low
  1 sibling, 1 reply; 21+ messages in thread
From: Davidlohr Bueso @ 2014-07-04  8:53 UTC (permalink / raw)
To: Jason Low
Cc: Dave Chinner, Peter Zijlstra, Tim Chen, Ingo Molnar,
	linux-kernel, Linus Torvalds

On Fri, 2014-07-04 at 00:06 -0700, Jason Low wrote:
> Subject: [PATCH] rwsem: In rwsem_can_spin_on_owner(), return false if no owner

Could we change the subject to something more descriptive, ie:

rwsem: Allow conservative optimistic spinning for reader/writer paths

Thanks,
Davidlohr

> Commit 4fc828e24cd9 ("locking/rwsem: Support optimistic spinning")
> introduced a major performance regression for workloads such as
> xfs_repair which mix read and write locking of the mmap_sem across
> many threads. The result was xfs_repair ran 5x slower on 3.16-rc2
> than on 3.15 and using 20x more system CPU time.
>
> Perf profiles indicate in some workloads that significant time can
> be spent spinning on !owner. This is because we don't set the lock
> owner when reader(s) obtain the rwsem.
>
> In this patch, we'll modify rwsem_can_spin_on_owner() such that we'll
> return false if there is no lock owner. The rationale is that if we
> just entered the slowpath, yet there is no lock owner, then there is
> a possibility that a reader has the lock. To be conservative, we'll
> avoid spinning in these situations.
>
> This patch reduced the total run time of the xfs_repair workload from
> about 4 minutes 24 seconds down to approximately 1 minute 26 seconds,
> back to close to the same performance as on 3.15.
>
> Tested-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Jason Low <jason.low2@hp.com>
> ---
>  kernel/locking/rwsem-xadd.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
> index dacc321..c40c7d2 100644
> --- a/kernel/locking/rwsem-xadd.c
> +++ b/kernel/locking/rwsem-xadd.c
> @@ -285,10 +285,10 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
>  static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
>  {
>  	struct task_struct *owner;
> -	bool on_cpu = true;
> +	bool on_cpu = false;
> 
>  	if (need_resched())
> -		return 0;
> +		return false;
> 
>  	rcu_read_lock();
>  	owner = ACCESS_ONCE(sem->owner);
> @@ -297,9 +297,9 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
>  	rcu_read_unlock();
> 
>  	/*
> -	 * If sem->owner is not set, the rwsem owner may have
> -	 * just acquired it and not set the owner yet or the rwsem
> -	 * has been released.
> +	 * If sem->owner is not set, yet we have just recently entered the
> +	 * slowpath, then there is a possibility reader(s) may have the lock.
> +	 * To be safe, avoid spinning in these situations.
>  	 */
>  	return on_cpu;
>  }
* Re: [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation
  2014-07-04  8:53 ` Davidlohr Bueso
@ 2014-07-05  3:14   ` Jason Low
  0 siblings, 0 replies; 21+ messages in thread
From: Jason Low @ 2014-07-05  3:14 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: Dave Chinner, Peter Zijlstra, Tim Chen, Ingo Molnar,
	linux-kernel, Linus Torvalds

On Fri, 2014-07-04 at 01:53 -0700, Davidlohr Bueso wrote:
> On Fri, 2014-07-04 at 00:06 -0700, Jason Low wrote:
> > Subject: [PATCH] rwsem: In rwsem_can_spin_on_owner(), return false if no owner
>
> Could we change the subject to something more descriptive, ie:
>
> rwsem: Allow conservative optimistic spinning for reader/writer paths

Sure, a Subject like this would better explain what is the effect of
the patch.
* Re: [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation
  2014-07-04  1:54 ` Jason Low
  2014-07-04  6:13   ` Dave Chinner
@ 2014-07-04  7:52   ` Peter Zijlstra
  2014-07-04  8:40     ` Davidlohr Bueso
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2014-07-04  7:52 UTC (permalink / raw)
To: Jason Low
Cc: Dave Chinner, Davidlohr Bueso, Tim Chen, Ingo Molnar,
	linux-kernel, Linus Torvalds

On Thu, Jul 03, 2014 at 06:54:50PM -0700, Jason Low wrote:
> Subject: [PATCH] rwsem: In rwsem_can_spin_on_owner(), return false if no owner
>
> It was found that the rwsem optimistic spinning feature can potentially degrade
> performance when there are readers. Perf profiles indicate in some workloads
> that significant time can be spent spinning on !owner. This is because we don't
> set the lock owner when reader(s) obtain the rwsem.
>
> In this patch, we'll modify rwsem_can_spin_on_owner() such that we'll return
> false if there is no lock owner. The rationale is that if we just entered the
> slowpath, yet there is no lock owner, then there is a possibility that a reader
> has the lock. To be conservative, we'll avoid spinning in these situations.
>
> Dave Chinner found performance benefits with this patch in the xfs_repair
> workload, where the total run time went from approximately 4 minutes 24 seconds,
> down to approximately 1 minute 26 seconds with the patch.
>
> Tested-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Jason Low <jason.low2@hp.com>

Davidlohr, you'll be running this through your AIM and other benchmarks,
I suppose?
* Re: [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation
  2014-07-04  7:52 ` Peter Zijlstra
@ 2014-07-04  8:40   ` Davidlohr Bueso
  2014-07-05  3:49     ` Jason Low
  0 siblings, 1 reply; 21+ messages in thread
From: Davidlohr Bueso @ 2014-07-04  8:40 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Jason Low, Dave Chinner, Tim Chen, Ingo Molnar,
	linux-kernel, Linus Torvalds

On Fri, 2014-07-04 at 09:52 +0200, Peter Zijlstra wrote:
> On Thu, Jul 03, 2014 at 06:54:50PM -0700, Jason Low wrote:
> > Subject: [PATCH] rwsem: In rwsem_can_spin_on_owner(), return false if no owner
> >
> > It was found that the rwsem optimistic spinning feature can potentially degrade
> > performance when there are readers. Perf profiles indicate in some workloads
> > that significant time can be spent spinning on !owner. This is because we don't
> > set the lock owner when reader(s) obtain the rwsem.
> >
> > In this patch, we'll modify rwsem_can_spin_on_owner() such that we'll return
> > false if there is no lock owner. The rationale is that if we just entered the
> > slowpath, yet there is no lock owner, then there is a possibility that a reader
> > has the lock. To be conservative, we'll avoid spinning in these situations.
> >
> > Dave Chinner found performance benefits with this patch in the xfs_repair
> > workload, where the total run time went from approximately 4 minutes 24 seconds,
> > down to approximately 1 minute 26 seconds with the patch.
> >
> > Tested-by: Dave Chinner <dchinner@redhat.com>
> > Signed-off-by: Jason Low <jason.low2@hp.com>
>
> Davidlohr, you'll be running this through your AIM and other benchmarks,
> I suppose?

I ran it through aim7, and as I suspected we take a performance hit
with 'custom': ~-14% throughput for >300 users (which is still overall
quite a bit higher than rwsems without opt spinning, at around ~+45%),
and we actually improve a bit (~+15%) in 'disk' with >1000 users --
which afaict resembles Dave's workload a bit.

So all in all I'm quite happy with this patch.

Acked-by: Davidlohr Bueso <davidlohr@hp.com>
* Re: [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation
  2014-07-04  8:40 ` Davidlohr Bueso
@ 2014-07-05  3:49   ` Jason Low
  [not found]     ` <CAAW_DMjgd5+EOvZX7_iZe-jHp=00Nf7MX3z6hBCRPgOfqnMtEA@mail.gmail.com>
  0 siblings, 1 reply; 21+ messages in thread
From: Jason Low @ 2014-07-05  3:49 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: Peter Zijlstra, Dave Chinner, Tim Chen, Ingo Molnar,
	linux-kernel, Linus Torvalds

On Fri, 2014-07-04 at 01:40 -0700, Davidlohr Bueso wrote:
> On Fri, 2014-07-04 at 09:52 +0200, Peter Zijlstra wrote:
> > Davidlohr, you'll be running this through your AIM and other benchmarks,
> > I suppose?
>
> I ran it through aim7, and as I suspected we take a performance hit with
> 'custom' ~-14% throughput for > 300 users (which is still overall quite
> higher than rwsems without opt spinning, at around ~+45%), and we
> actually improve a bit (~+15%) in 'disk' with >1000 users -- which
> afaict resembles Dave's workload a bit. So all in all I'm quite happy
> with this patch.

Here is the patch with the updates to the changelog.

---
Subject: [PATCH] rwsem: Allow conservative optimistic spinning when readers have lock

Commit 4fc828e24cd9 ("locking/rwsem: Support optimistic spinning")
introduced a major performance regression for workloads such as
xfs_repair which mix read and write locking of the mmap_sem across
many threads. The result was xfs_repair ran 5x slower on 3.16-rc2
than on 3.15 and using 20x more system CPU time.

Perf profiles indicate in some workloads that significant time can
be spent spinning on !owner. This is because we don't set the lock
owner when reader(s) obtain the rwsem.

In this patch, we'll modify rwsem_can_spin_on_owner() such that we'll
return false if there is no lock owner. The rationale is that if we
just entered the slowpath, yet there is no lock owner, then there is
a possibility that a reader has the lock. To be conservative, we'll
avoid spinning in these situations.

This patch reduced the total run time of the xfs_repair workload from
about 4 minutes 24 seconds down to approximately 1 minute 26 seconds,
back to close to the same performance as on 3.15.

Retesting with AIM7, which was among the workloads used to test the
original optimistic spinning code, confirmed that we still get big
performance gains with optimistic spinning, even with this additional
regression fix. Davidlohr found that while the 'custom' workload took
a performance hit of ~-14% to throughput for >300 users with this
additional patch, the overall gain with optimistic spinning is still
~+45%. The 'disk' workload even improved by ~+15% at >1000 users.

Tested-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
---
 kernel/locking/rwsem-xadd.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index dacc321..c40c7d2 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -285,10 +285,10 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
 static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 {
 	struct task_struct *owner;
-	bool on_cpu = true;
+	bool on_cpu = false;
 
 	if (need_resched())
-		return 0;
+		return false;
 
 	rcu_read_lock();
 	owner = ACCESS_ONCE(sem->owner);
@@ -297,9 +297,9 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 	rcu_read_unlock();
 
 	/*
-	 * If sem->owner is not set, the rwsem owner may have
-	 * just acquired it and not set the owner yet or the rwsem
-	 * has been released.
+	 * If sem->owner is not set, yet we have just recently entered the
+	 * slowpath, then there is a possibility reader(s) may have the lock.
+	 * To be safe, avoid spinning in these situations.
	 */
 	return on_cpu;
 }
-- 
1.7.9.5
[parent not found: <CAAW_DMjgd5+EOvZX7_iZe-jHp=00Nf7MX3z6hBCRPgOfqnMtEA@mail.gmail.com>]
* Re: [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation
  [not found] ` <CAAW_DMjgd5+EOvZX7_iZe-jHp=00Nf7MX3z6hBCRPgOfqnMtEA@mail.gmail.com>
@ 2014-07-14  9:55   ` Peter Zijlstra
  2014-07-14 17:10     ` Jason Low
  2014-07-15  2:17     ` Dave Chinner
  0 siblings, 2 replies; 21+ messages in thread
From: Peter Zijlstra @ 2014-07-14  9:55 UTC (permalink / raw)
To: Xie Miao
Cc: Jason Low, Davidlohr Bueso, Dave Chinner, Tim Chen, Ingo Molnar,
	linux-kernel, Linus Torvalds, miaox

On Mon, Jul 14, 2014 at 05:37:13PM +0800, Xie Miao wrote:
> Hi, Jason
>
> Could you resend this patch? It seems to have been ignored.

It's not ignored; it's not merged because of unrelated stability issues.
* Re: [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation
  2014-07-14  9:55 ` Peter Zijlstra
@ 2014-07-14 17:10   ` Jason Low
  0 siblings, 0 replies; 21+ messages in thread
From: Jason Low @ 2014-07-14 17:10 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Xie Miao, Davidlohr Bueso, Dave Chinner, Tim Chen, Ingo Molnar,
	linux-kernel, Linus Torvalds, miaox

On Mon, 2014-07-14 at 11:55 +0200, Peter Zijlstra wrote:
> On Mon, Jul 14, 2014 at 05:37:13PM +0800, Xie Miao wrote:
> > Hi, Jason
> >
> > Could you resend this patch? It seems to have been ignored.
>
> It's not ignored;

Right, I also noticed the patch in the git peterz/queue tree :)
* Re: [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation
  2014-07-14  9:55 ` Peter Zijlstra
  2014-07-14 17:10   ` Jason Low
@ 2014-07-15  2:17   ` Dave Chinner
  1 sibling, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2014-07-15  2:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Xie Miao, Jason Low, Davidlohr Bueso, Tim Chen, Ingo Molnar,
	linux-kernel, Linus Torvalds, miaox

On Mon, Jul 14, 2014 at 11:55:39AM +0200, Peter Zijlstra wrote:
> On Mon, Jul 14, 2014 at 05:37:13PM +0800, Xie Miao wrote:
> > Hi, Jason
> >
> > Could you resend this patch? It seems to have been ignored.
>
> It's not ignored; it's not merged because of unrelated stability issues.

Is it going to be merged? If not, can you please revert the
optimistic spinning patch that caused the regression?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
end of thread, other threads:[~2014-07-15 2:18 UTC | newest]
Thread overview: 21+ messages
2014-07-03 2:32 [regression, 3.16-rc] rwsem: optimistic spinning causing performance degradation Dave Chinner
2014-07-03 3:31 ` Davidlohr Bueso
2014-07-03 4:59 ` Dave Chinner
2014-07-03 5:39 ` Dave Chinner
2014-07-03 7:38 ` Peter Zijlstra
2014-07-03 7:56 ` Dave Chinner
[not found] <1404413420.8764.42.camel@j-VirtualBox>
[not found] ` <1404416236.3179.18.camel@buesod1.americas.hpqcorp.net>
2014-07-03 20:08 ` Davidlohr Bueso
2014-07-04 1:01 ` Dave Chinner
2014-07-04 1:46 ` Jason Low
2014-07-04 1:54 ` Jason Low
2014-07-04 6:13 ` Dave Chinner
2014-07-04 7:06 ` Jason Low
2014-07-04 8:21 ` Dave Chinner
2014-07-04 8:53 ` Davidlohr Bueso
2014-07-05 3:14 ` Jason Low
2014-07-04 7:52 ` Peter Zijlstra
2014-07-04 8:40 ` Davidlohr Bueso
2014-07-05 3:49 ` Jason Low
[not found] ` <CAAW_DMjgd5+EOvZX7_iZe-jHp=00Nf7MX3z6hBCRPgOfqnMtEA@mail.gmail.com>
2014-07-14 9:55 ` Peter Zijlstra
2014-07-14 17:10 ` Jason Low
2014-07-15 2:17 ` Dave Chinner