* [PATCH] rwsem: reduce spinlock contention in wakeup code path
@ 2013-09-27 19:00 Waiman Long
2013-09-27 19:28 ` Linus Torvalds
2013-09-27 19:32 ` Peter Hurley
0 siblings, 2 replies; 55+ messages in thread
From: Waiman Long @ 2013-09-27 19:00 UTC (permalink / raw)
To: Ingo Molnar, Andrew Morton
Cc: Waiman Long, linux-kernel, Rik van Riel, Peter Hurley,
Davidlohr Bueso, Alex Shi, Tim Chen, Linus Torvalds,
Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen,
Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin,
Norton, Scott J
With the 3.12-rc2 kernel, there is sizable spinlock contention on
the rwsem wakeup code path when running AIM7's high_systime workload
on an 8-socket 80-core DL980 (HT off) as reported by perf:
7.64% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
|--41.77%-- rwsem_wake
1.61% reaim [kernel.kallsyms] [k] _raw_spin_lock_irq
|--92.37%-- rwsem_down_write_failed
Together, these two code paths accounted for about 4.7% of the recorded
CPU cycles (7.64% * 41.77% + 1.61% * 92.37% ~= 4.7%).
On a large NUMA machine, it is entirely possible that a fairly large
number of threads are queuing up in the ticket spinlock queue to do
the wakeup operation when, in fact, only one of them needs to perform
the wakeup. This patch reduces spinlock contention by letting the
remaining threads skip it.
A new wakeup field is added to the rwsem structure. This field is
set on entry to rwsem_wake() and __rwsem_do_wake() to mark that a
thread is pending to do the wakeup call. It is cleared on exit from
those functions.
By checking the wakeup flag, a thread can exit rwsem_wake() immediately
if another thread is already pending to do the wakeup, instead of
waiting for the spinlock only to find out that nothing needs to be done.
The setting of the wakeup flag may not be visible on all processors in
some architectures. However, this won't affect program correctness. The
clearing of the wakeup flag before spin_unlock will ensure that it is
visible to all processors.
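In condensed form, the fast path this patch adds to rwsem_wake() looks
roughly like the sketch below (field and function names are the ones
used in the diff that follows; comments trimmed):

	struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
	{
		unsigned long flags;

		if (sem->wakeup)
			return sem;	/* another thread is already waking waiters */

		sem->wakeup = 1;	/* optimistic; losing a race here is harmless */

		raw_spin_lock_irqsave(&sem->wait_lock, flags);
		if (!list_empty(&sem->wait_list))
			sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);
		else
			sem->wakeup = 0;	/* nothing to wake up, clear the flag */
		raw_spin_unlock_irqrestore(&sem->wait_lock, flags);

		return sem;
	}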
With this patch, the performance improvement on jobs per minute (JPM)
of the high_systime workload (at 1500 users) was as follows:
HT JPM w/o patch JPM with patch % Change
-- ------------- -------------- --------
off 128466 150000 +16.8%
on 121606 146778 +20.7%
The new perf profile (HT off) was as follows:
2.96% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
|--0.94%-- rwsem_wake
1.00% reaim [kernel.kallsyms] [k] _raw_spin_lock_irq
|--88.70%-- rwsem_down_write_failed
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
include/linux/rwsem.h | 2 ++
lib/rwsem.c | 19 +++++++++++++++++++
2 files changed, 21 insertions(+), 0 deletions(-)
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 0616ffe..e25792e 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -25,6 +25,7 @@ struct rw_semaphore;
struct rw_semaphore {
long count;
raw_spinlock_t wait_lock;
+ int wakeup; /* Waking-up in progress flag */
struct list_head wait_list;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
@@ -58,6 +59,7 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
#define __RWSEM_INITIALIZER(name) \
{ RWSEM_UNLOCKED_VALUE, \
__RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \
+ 0, \
LIST_HEAD_INIT((name).wait_list) \
__RWSEM_DEP_MAP_INIT(name) }
diff --git a/lib/rwsem.c b/lib/rwsem.c
index 19c5fa9..39290a5 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -25,6 +25,7 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
lockdep_init_map(&sem->dep_map, name, key, 0);
#endif
sem->count = RWSEM_UNLOCKED_VALUE;
+ sem->wakeup = 0;
raw_spin_lock_init(&sem->wait_lock);
INIT_LIST_HEAD(&sem->wait_list);
}
@@ -66,6 +67,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
struct list_head *next;
long oldcount, woken, loop, adjustment;
+ sem->wakeup = 1; /* Waking up in progress */
waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
if (wake_type == RWSEM_WAKE_ANY)
@@ -137,6 +139,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
next->prev = &sem->wait_list;
out:
+ sem->wakeup = 0;
return sem;
}
@@ -256,11 +259,27 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
{
unsigned long flags;
+ if (sem->wakeup)
+ return sem; /* Waking up in progress already */
+ /*
+ * Optimistically set the wakeup flag to indicate that the current
+ * thread is going to wake up the sleeping waiters so that the
+ * following threads don't need to wait to do the wakeup.
+ * It is perfectly fine if another thread resets the flag. It just
+ * leads to another thread waiting to call __rwsem_do_wake().
+ */
+ sem->wakeup = 1;
raw_spin_lock_irqsave(&sem->wait_lock, flags);
/* do nothing if list empty */
if (!list_empty(&sem->wait_list))
sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);
+ else
+ sem->wakeup = 0; /* Make sure wakeup flag is reset */
+ /*
+ * The spin_unlock() call will force the cleared wakeup flag to
+ * be visible to all processors.
+ */
raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
--
1.7.1
^ permalink raw reply related [flat|nested] 55+ messages in thread* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-27 19:00 [PATCH] rwsem: reduce spinlock contention in wakeup code path Waiman Long @ 2013-09-27 19:28 ` Linus Torvalds 2013-09-27 19:39 ` Davidlohr Bueso 2013-09-28 7:41 ` Ingo Molnar 2013-09-27 19:32 ` Peter Hurley 1 sibling, 2 replies; 55+ messages in thread From: Linus Torvalds @ 2013-09-27 19:28 UTC (permalink / raw) To: Waiman Long Cc: Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Fri, Sep 27, 2013 at 12:00 PM, Waiman Long <Waiman.Long@hp.com> wrote: > > On a large NUMA machine, it is entirely possible that a fairly large > number of threads are queuing up in the ticket spinlock queue to do > the wakeup operation. In fact, only one will be needed. This patch > tries to reduce spinlock contention by doing just that. > > A new wakeup field is added to the rwsem structure. This field is > set on entry to rwsem_wake() and __rwsem_do_wake() to mark that a > thread is pending to do the wakeup call. It is cleared on exit from > those functions. Ok, this is *much* simpler than adding the new MCS spinlock, so I'm wondering what the performance difference between the two are. I'm obviously always in favor of just removing lock contention over trying to improve the lock scalability, so I really like Waiman's approach over Tim's new MCS lock. Not because I dislike MCS locks in general (or you, Tim ;), it's really more fundamental: I just fundamentally believe more in trying to avoid lock contention than in trying to improve lock behavior when that contention happens. As such, I love exactly these kinds of things that Wainman's patch does, and I'm heavily biased. But I know I'm heavily biased, so I'd really like to get comparative numbers for these things. Waiman, can you compare your patch with Tim's (and Alex's) 6-patch series to make the rwsem's use MCS locks for the spinlock? The numbers Tim quotes for the MCS patch series ("high_systime (+7%)") are lower than the ones you quote (16-20%), but that may be due to hardware platform differences and just methodology. Tim was also looking at exim performance. So Tim/Waiman, mind comparing the two approaches on the setups you have? Linus ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-27 19:28 ` Linus Torvalds @ 2013-09-27 19:39 ` Davidlohr Bueso 2013-09-27 21:49 ` Tim Chen 2013-09-28 7:41 ` Ingo Molnar 1 sibling, 1 reply; 55+ messages in thread From: Davidlohr Bueso @ 2013-09-27 19:39 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Fri, 2013-09-27 at 12:28 -0700, Linus Torvalds wrote: > On Fri, Sep 27, 2013 at 12:00 PM, Waiman Long <Waiman.Long@hp.com> wrote: > > > > On a large NUMA machine, it is entirely possible that a fairly large > > number of threads are queuing up in the ticket spinlock queue to do > > the wakeup operation. In fact, only one will be needed. This patch > > tries to reduce spinlock contention by doing just that. > > > > A new wakeup field is added to the rwsem structure. This field is > > set on entry to rwsem_wake() and __rwsem_do_wake() to mark that a > > thread is pending to do the wakeup call. It is cleared on exit from > > those functions. > > Ok, this is *much* simpler than adding the new MCS spinlock, so I'm > wondering what the performance difference between the two are. Both approaches should be complementary. The idea of optimistic spinning in rwsems is to avoid putting putting the writer on the wait queue - reducing contention and giving a greater chance for the rwsem to get acquired. Waiman's approach is once the blocking actually occurs, and at this point I'm not sure how this will affect writer stealing logic. Thanks, Davidlohr ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-27 19:39 ` Davidlohr Bueso @ 2013-09-27 21:49 ` Tim Chen 2013-09-28 6:45 ` Ingo Molnar 0 siblings, 1 reply; 55+ messages in thread From: Tim Chen @ 2013-09-27 21:49 UTC (permalink / raw) To: Davidlohr Bueso Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Fri, 2013-09-27 at 12:39 -0700, Davidlohr Bueso wrote: > On Fri, 2013-09-27 at 12:28 -0700, Linus Torvalds wrote: > > On Fri, Sep 27, 2013 at 12:00 PM, Waiman Long <Waiman.Long@hp.com> wrote: > > > > > > On a large NUMA machine, it is entirely possible that a fairly large > > > number of threads are queuing up in the ticket spinlock queue to do > > > the wakeup operation. In fact, only one will be needed. This patch > > > tries to reduce spinlock contention by doing just that. > > > > > > A new wakeup field is added to the rwsem structure. This field is > > > set on entry to rwsem_wake() and __rwsem_do_wake() to mark that a > > > thread is pending to do the wakeup call. It is cleared on exit from > > > those functions. > > > > Ok, this is *much* simpler than adding the new MCS spinlock, so I'm > > wondering what the performance difference between the two are. > > Both approaches should be complementary. The idea of optimistic spinning > in rwsems is to avoid putting putting the writer on the wait queue - > reducing contention and giving a greater chance for the rwsem > to get acquired. Waiman's approach is once the blocking actually occurs, > and at this point I'm not sure how this will affect writer stealing > logic. > I agree with the view that the two approaches are complementary to each other. They address different bottleneck areas in the rwsem. Here're the performance numbers for exim workload compared to a vanilla kernel. Waimain's patch: +2.0% Alex+Tim's patchset: +4.8% Waiman+Alex+Tim: +5.3% Tim ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-27 21:49 ` Tim Chen @ 2013-09-28 6:45 ` Ingo Molnar 0 siblings, 0 replies; 55+ messages in thread From: Ingo Molnar @ 2013-09-28 6:45 UTC (permalink / raw) To: Tim Chen Cc: Davidlohr Bueso, Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Tim Chen <tim.c.chen@linux.intel.com> wrote: > On Fri, 2013-09-27 at 12:39 -0700, Davidlohr Bueso wrote: > > On Fri, 2013-09-27 at 12:28 -0700, Linus Torvalds wrote: > > > On Fri, Sep 27, 2013 at 12:00 PM, Waiman Long <Waiman.Long@hp.com> wrote: > > > > > > > > On a large NUMA machine, it is entirely possible that a fairly large > > > > number of threads are queuing up in the ticket spinlock queue to do > > > > the wakeup operation. In fact, only one will be needed. This patch > > > > tries to reduce spinlock contention by doing just that. > > > > > > > > A new wakeup field is added to the rwsem structure. This field is > > > > set on entry to rwsem_wake() and __rwsem_do_wake() to mark that a > > > > thread is pending to do the wakeup call. It is cleared on exit from > > > > those functions. > > > > > > Ok, this is *much* simpler than adding the new MCS spinlock, so I'm > > > wondering what the performance difference between the two are. > > > > Both approaches should be complementary. The idea of optimistic spinning > > in rwsems is to avoid putting putting the writer on the wait queue - > > reducing contention and giving a greater chance for the rwsem > > to get acquired. Waiman's approach is once the blocking actually occurs, > > and at this point I'm not sure how this will affect writer stealing > > logic. > > > > I agree with the view that the two approaches are complementary to each > other. They address different bottleneck areas in the rwsem. Here're > the performance numbers for exim workload compared to a vanilla kernel. > > Waimain's patch: +2.0% > Alex+Tim's patchset: +4.8% > Waiman+Alex+Tim: +5.3% I think I'd like to see a combo series submitted :-) Thanks, Ingo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-27 19:28 ` Linus Torvalds 2013-09-27 19:39 ` Davidlohr Bueso @ 2013-09-28 7:41 ` Ingo Molnar 2013-09-28 18:55 ` Linus Torvalds 1 sibling, 1 reply; 55+ messages in thread From: Ingo Molnar @ 2013-09-28 7:41 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, Sep 27, 2013 at 12:00 PM, Waiman Long <Waiman.Long@hp.com> wrote: > > > > On a large NUMA machine, it is entirely possible that a fairly large > > number of threads are queuing up in the ticket spinlock queue to do > > the wakeup operation. In fact, only one will be needed. This patch > > tries to reduce spinlock contention by doing just that. > > > > A new wakeup field is added to the rwsem structure. This field is set > > on entry to rwsem_wake() and __rwsem_do_wake() to mark that a thread > > is pending to do the wakeup call. It is cleared on exit from those > > functions. > > Ok, this is *much* simpler than adding the new MCS spinlock, so I'm > wondering what the performance difference between the two are. > > I'm obviously always in favor of just removing lock contention over > trying to improve the lock scalability, so I really like Waiman's > approach over Tim's new MCS lock. Not because I dislike MCS locks in > general (or you, Tim ;), it's really more fundamental: I just > fundamentally believe more in trying to avoid lock contention than in > trying to improve lock behavior when that contention happens. > > As such, I love exactly these kinds of things that Wainman's patch does, > and I'm heavily biased. Yeah, I fully agree. The reason I'm still very sympathetic to Tim's efforts is that they address a regression caused by a mechanic mutex->rwsem conversion: 5a505085f043 mm/rmap: Convert the struct anon_vma::mutex to an rwsem ... and Tim's patches turn that regression into an actual speedup. The main regression component happens to be more prominent on larger systems which inevitably have higher contention. That was a result of mutexes having better contention behavior than rwsems - which was not a property Tim picked arbitrarily. The other advantage would be that by factoring out the MCS code it gets reviewed and seen more prominently - this resulted in a micro-optimization already. So people working on mutex or rwsem scalability will have a chance to help each other. A third advantage would be that arguably our rwsems degrade on really big systems catastrophically. We had that with spinlocks and mutexes and it got fixd. Since rwsems are a long-term concern for us I thought that something like MCS queueing could be considered a fundamental quality-of-implementation requirement as well. In any case I fully agree that we never want to overdo it, our priority of optimization and our ranking will always be: - lockless algorithms - reduction of critical paths ... - improved locking fast path - improved locking slow path and people will see much better speedups if they concentrate on entries higher on this list. 
It's a pretty rigid order as well: no slowpath improvement is allowed to hurt any of the higher priority optimization concepts and people are encouraged to always work on the higher order concepts before considering the symptoms of contention. But as long as contention behavior improvements are cleanly done, don't cause regressions, do not hinder primary scalability efforts and the numbers are as impressive as Tim's, it looks like good stuff to me. I was thinking about putting them into the locking tree once the review and testing process converges and wanted to send them to you in the v3.13 merge window. The only potential downside (modulo bugs) I can see from Tim's patches is XFS's heavy dependence on fine behavioral details of rwsem. But if done right then none of these scalability efforts should impact those details. Thanks, Ingo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-28 7:41 ` Ingo Molnar @ 2013-09-28 18:55 ` Linus Torvalds 2013-09-28 19:13 ` Andi Kleen ` (4 more replies) 0 siblings, 5 replies; 55+ messages in thread From: Linus Torvalds @ 2013-09-28 18:55 UTC (permalink / raw) To: Ingo Molnar Cc: Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sat, Sep 28, 2013 at 12:41 AM, Ingo Molnar <mingo@kernel.org> wrote: > > > Yeah, I fully agree. The reason I'm still very sympathetic to Tim's > efforts is that they address a regression caused by a mechanic > mutex->rwsem conversion: > > 5a505085f043 mm/rmap: Convert the struct anon_vma::mutex to an rwsem > > ... and Tim's patches turn that regression into an actual speedup. Btw, I really hate that thing. I think we should turn it back into a spinlock. None of what it protects needs a mutex or an rwsem. Because you guys talk about the regression of turning it into a rwsem, but nobody talks about the *original* regression. And it *used* to be a spinlock, and it was changed into a mutex back in 2011 by commit 2b575eb64f7a. That commit doesn't even have a reason listed for it, although my dim memory of it is that the reason was preemption latency. And that caused big regressions too. Of course, since then, we may well have screwed things up and now we sleep under it, but I still really think it was a mistake to do it in the first place. So if the primary reason for this is really just that f*cking anon_vma lock, then I would seriously suggest: - turn it back into a spinlock (or rwlock_t, since we subsequently separated the read and write paths) - fix up any breakage (ie new scheduling points) that exposes - look at possible other approaches wrt latency on that thing. Hmm? Linus ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-28 18:55 ` Linus Torvalds @ 2013-09-28 19:13 ` Andi Kleen 2013-09-28 19:22 ` Linus Torvalds 2013-09-28 19:21 ` Ingo Molnar ` (3 subsequent siblings) 4 siblings, 1 reply; 55+ messages in thread From: Andi Kleen @ 2013-09-28 19:13 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J > Of course, since then, we may well have screwed things up and now we > sleep under it, but I still really think it was a mistake to do it in > the first place. > > So if the primary reason for this is really just that f*cking anon_vma > lock, then I would seriously suggest: > > - turn it back into a spinlock (or rwlock_t, since we subsequently > separated the read and write paths) Yes please. spinlocks/rwlocks have so much nicer performance behavior than rwsems/mutexes (which noone seems to fully understand) We had also significant performance regressions from every such spinning->sleeping change in the VM (this was just the latest) And afaik anon_vma is usually hold short. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-28 19:13 ` Andi Kleen @ 2013-09-28 19:22 ` Linus Torvalds 0 siblings, 0 replies; 55+ messages in thread From: Linus Torvalds @ 2013-09-28 19:22 UTC (permalink / raw) To: Andi Kleen Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Chandramouleeswaran, Aswin, Norton, Scott J On Sat, Sep 28, 2013 at 12:13 PM, Andi Kleen <andi@firstfloor.org> wrote: > > And afaik anon_vma is usually hold short. Yes. But the problem with anon_vma is that the "usually" may be the 99.9% case, but then there are some insane loads that do tons of forking without execve, and they really make some of the rmap code work very very hard. And then they all not only share that one root vma, but the mm/rmap.c code ends up having to walk all their VM's because there could be a page in there somewhere. These loads aren't necessarily very realistic and very much not common, but I think AIM7 actually has one of those cases, iirc. Our anon_vma locking really is some of the more complex parts of the kernel. Not because of the lock itself, but because of the subtle rules about the whole anon_vma chain and how we have to lock the root of the chain etc etc. And under all _normal_ behavior it's not a problem at all. But I personally dread looking at some of that code, because if we get anything wrong there (and it's happened), it's too painful for words. Linus ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-28 18:55 ` Linus Torvalds 2013-09-28 19:13 ` Andi Kleen @ 2013-09-28 19:21 ` Ingo Molnar 2013-09-28 19:33 ` Linus Torvalds 2013-09-28 19:37 ` [PATCH] anon_vmas: Convert the rwsem to an rwlock_t Ingo Molnar 2013-09-29 23:06 ` [PATCH] rwsem: reduce spinlock contention in wakeup code path Davidlohr Bueso ` (2 subsequent siblings) 4 siblings, 2 replies; 55+ messages in thread From: Ingo Molnar @ 2013-09-28 19:21 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sat, Sep 28, 2013 at 12:41 AM, Ingo Molnar <mingo@kernel.org> wrote: > > > > > > Yeah, I fully agree. The reason I'm still very sympathetic to Tim's > > efforts is that they address a regression caused by a mechanic > > mutex->rwsem conversion: > > > > 5a505085f043 mm/rmap: Convert the struct anon_vma::mutex to an rwsem > > > > ... and Tim's patches turn that regression into an actual speedup. > > Btw, I really hate that thing. I think we should turn it back into a > spinlock. None of what it protects needs a mutex or an rwsem. > > Because you guys talk about the regression of turning it into a rwsem, > but nobody talks about the *original* regression. > > And it *used* to be a spinlock, and it was changed into a mutex back in > 2011 by commit 2b575eb64f7a. That commit doesn't even have a reason > listed for it, although my dim memory of it is that the reason was > preemption latency. Yeah, I think it was latency. > And that caused big regressions too. > > Of course, since then, we may well have screwed things up and now we > sleep under it, but I still really think it was a mistake to do it in > the first place. > > So if the primary reason for this is really just that f*cking anon_vma > lock, then I would seriously suggest: > > - turn it back into a spinlock (or rwlock_t, since we subsequently > separated the read and write paths) > > - fix up any breakage (ie new scheduling points) that exposes > > - look at possible other approaches wrt latency on that thing. > > Hmm? If we do that then I suspect the next step will be queued rwlocks :-/ The current rwlock_t implementation is rather primitive by modern standards. (We'd probably have killed rwlock_t long ago if not for the tasklist_lock.) But yeah, it would work and conceptually a hard spinlock fits something as lowlevel as the anon-vma lock. I did a quick review pass and it appears nothing obvious is scheduling with the anon-vma lock held. If it did in a non-obvious way it's likely a bug anyway. The hugepage code grew a lot of logic running under the anon-vma lock, but it all seems atomic. So a conversion to rwlock_t could be attempted. (It should be relatively easy patch as well, because the locking operation is now nicely abstracted out.) Thanks, Ingo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-28 19:21 ` Ingo Molnar @ 2013-09-28 19:33 ` Linus Torvalds 2013-09-28 19:39 ` Ingo Molnar ` (2 more replies) 2013-09-28 19:37 ` [PATCH] anon_vmas: Convert the rwsem to an rwlock_t Ingo Molnar 1 sibling, 3 replies; 55+ messages in thread From: Linus Torvalds @ 2013-09-28 19:33 UTC (permalink / raw) To: Ingo Molnar Cc: Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sat, Sep 28, 2013 at 12:21 PM, Ingo Molnar <mingo@kernel.org> wrote: > > If we do that then I suspect the next step will be queued rwlocks :-/ The > current rwlock_t implementation is rather primitive by modern standards. > (We'd probably have killed rwlock_t long ago if not for the > tasklist_lock.) Yeah, I'm not happy about or rwlocks. That's one lock that currently is so broken that I think we could easily argue for making that one queued. Waiman had a qrwlock series that looked reasonable, and I think his later versions were drop-in replacements (ie they automatically just did the RightThing(tm) wrt interrupts taking a recursive read lock - I objected to the first versions that required that to be stated explicitly). I think Waiman's patches (even the later ones) made the queued rwlocks be a side-by-side implementation with the old rwlocks, and I think that was just being unnecessarily careful. It might be useful for testing to have a config option to switch between the two, but we might as well go all the way. The old rwlock's really have been a disappointment - they are slower than spinlocks, and seldom/never end up scaling any better. Their main advantage was literally the irq behavior - allowing readers to happen without the expense of worrying about irq's. Linus ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-28 19:33 ` Linus Torvalds @ 2013-09-28 19:39 ` Ingo Molnar 2013-09-30 10:44 ` Peter Zijlstra 2013-09-30 15:58 ` Waiman Long 2 siblings, 0 replies; 55+ messages in thread From: Ingo Molnar @ 2013-09-28 19:39 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sat, Sep 28, 2013 at 12:21 PM, Ingo Molnar <mingo@kernel.org> wrote: > > > > If we do that then I suspect the next step will be queued rwlocks :-/ The > > current rwlock_t implementation is rather primitive by modern standards. > > (We'd probably have killed rwlock_t long ago if not for the > > tasklist_lock.) > > Yeah, I'm not happy about or rwlocks. That's one lock that currently > is so broken that I think we could easily argue for making that one > queued. > > Waiman had a qrwlock series that looked reasonable, and I think his > later versions were drop-in replacements (ie they automatically just > did the RightThing(tm) wrt interrupts taking a recursive read lock - I > objected to the first versions that required that to be stated > explicitly). > > I think Waiman's patches (even the later ones) made the queued rwlocks > be a side-by-side implementation with the old rwlocks, and I think > that was just being unnecessarily careful. It might be useful for > testing to have a config option to switch between the two, but we > might as well go all the way. > > The old rwlock's really have been a disappointment - they are slower > than spinlocks, and seldom/never end up scaling any better. Their main > advantage was literally the irq behavior - allowing readers to happen > without the expense of worrying about irq's. Yeah. But at least here the read side will not play, as the AIM7 workloads where the testing goes on excercises the write path exclusively I think. Still, the lack of queueing ought to hurt - the question is by how much. Thanks, Ingo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-28 19:33 ` Linus Torvalds 2013-09-28 19:39 ` Ingo Molnar @ 2013-09-30 10:44 ` Peter Zijlstra 2013-09-30 16:13 ` Linus Torvalds 2013-09-30 15:58 ` Waiman Long 2 siblings, 1 reply; 55+ messages in thread From: Peter Zijlstra @ 2013-09-30 10:44 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sat, Sep 28, 2013 at 12:33:36PM -0700, Linus Torvalds wrote: > The old rwlock's really have been a disappointment - they are slower > than spinlocks, and seldom/never end up scaling any better. Their > main advantage was literally the irq behavior - allowing readers to > happen without the expense of worrying about irq's. So in part that is fundamental to the whole rw-spinlock concept. Typically lock hold times should be short for spinlock type locks and if your hold times are short, the lock acquisition times are significant. And a read acquisition is still a RMW operation on the lock, thus read locks are still entirely bound by the cacheline transfer of the lock itself. ^ permalink raw reply [flat|nested] 55+ messages in thread
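The point about a read acquisition still being a cache-line-bouncing RMW
can be illustrated with a toy read-lock fast path (a sketch only, not
the kernel's actual rwlock implementation):

	/*
	 * Toy rwlock reader fast path.  Even though readers don't exclude
	 * each other, every acquisition is an atomic read-modify-write of
	 * the shared lock word, so the lock's cache line is pulled
	 * exclusive to each acquiring CPU in turn and bounces between
	 * CPUs even when no writer ever shows up.
	 */
	struct toy_rwlock {
		atomic_t cnts;	/* >= 0: reader count, large negative: writer */
	};

	static inline void toy_read_lock(struct toy_rwlock *lock)
	{
		for (;;) {
			/* atomic RMW: transfers the cache line to this CPU */
			if (atomic_add_return(1, &lock->cnts) > 0)
				return;			/* no writer, read side held */
			atomic_sub(1, &lock->cnts);	/* writer active: undo, retry */
			cpu_relax();
		}
	}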
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-30 10:44 ` Peter Zijlstra @ 2013-09-30 16:13 ` Linus Torvalds 2013-09-30 16:41 ` Peter Zijlstra 0 siblings, 1 reply; 55+ messages in thread From: Linus Torvalds @ 2013-09-30 16:13 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Mon, Sep 30, 2013 at 3:44 AM, Peter Zijlstra <peterz@infradead.org> wrote: > On Sat, Sep 28, 2013 at 12:33:36PM -0700, Linus Torvalds wrote: >> The old rwlock's really have been a disappointment - they are slower >> than spinlocks, and seldom/never end up scaling any better. Their >> main advantage was literally the irq behavior - allowing readers to >> happen without the expense of worrying about irq's. > > So in part that is fundamental to the whole rw-spinlock concept. No, I agree. But it's really just that our rwlock implementation isn't very good. It's neither really high-performing nor fair, and the premise of a rw-lock is that it should scale better than a spinlock. And yes, under heavy reader activity *does* work out for that (ok, you get cacheline bouncing, but at least you don't get the spinning when you have tons of readers), the extreme unfairness towards writers under heavy reader activity makes it often simply unacceptable. So unlike a lot of other "let's try to make our locking fancy" that I dislike because it tends to hide the fundamental problem of contention, the rwlock patches make me go "those actually _fix_ a fundamental problem". Linus ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-30 16:13 ` Linus Torvalds @ 2013-09-30 16:41 ` Peter Zijlstra 2013-10-01 7:28 ` Ingo Molnar 0 siblings, 1 reply; 55+ messages in thread From: Peter Zijlstra @ 2013-09-30 16:41 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Mon, Sep 30, 2013 at 09:13:52AM -0700, Linus Torvalds wrote: > So unlike a lot of other "let's try to make our locking fancy" that I > dislike because it tends to hide the fundamental problem of > contention, the rwlock patches make me go "those actually _fix_ a > fundamental problem". So here I'm slightly disagreeing; fixing a fundamental problem would be coming up a better anon_vma management that doesn't create such immense chains. Its still the same lock, spinlock or not. And regardless of if we keep anon_vma lock a rwsem or not; I think we should merge those rwsem patches as they do improve the lock implementation and the hard work has already been done. However the biggest ugly by far here is that mm_take_all_locks() thing; couldn't we implement that by basically freezing all tasks referencing that mm? ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-30 16:41 ` Peter Zijlstra @ 2013-10-01 7:28 ` Ingo Molnar 2013-10-01 8:09 ` Peter Zijlstra 0 siblings, 1 reply; 55+ messages in thread From: Ingo Molnar @ 2013-10-01 7:28 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Peter Zijlstra <peterz@infradead.org> wrote: > On Mon, Sep 30, 2013 at 09:13:52AM -0700, Linus Torvalds wrote: > > > So unlike a lot of other "let's try to make our locking fancy" that I > > dislike because it tends to hide the fundamental problem of > > contention, the rwlock patches make me go "those actually _fix_ a > > fundamental problem". > > So here I'm slightly disagreeing; fixing a fundamental problem would be > coming up a better anon_vma management that doesn't create such immense > chains. So, I think the fundamental problem seems to be that when rwsems are applied to this usecase, they still don't perform as well as a primitive rwlock. That means that when rwsems cause a context switch it is a loss, while an rwlock_t burning CPU time by looping around is a win. I'm not sure it's even about 'immense chains' - if those were true then context-switching should actually improve performance by allowing other work to continue while the heavy chains are processed. Alas that's not what happens! Or is AIM7 essentially triggering a single large lock? I doubt that's the case though. > Its still the same lock, spinlock or not. > > And regardless of if we keep anon_vma lock a rwsem or not; I think we > should merge those rwsem patches as they do improve the lock > implementation and the hard work has already been done. That I mostly agree with, except that without a serious usecase do we have a guarantee that bugs in fancies queueing in rwsems gets ironed out? Thanks, Ingo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-10-01 7:28 ` Ingo Molnar @ 2013-10-01 8:09 ` Peter Zijlstra 2013-10-01 8:25 ` Ingo Molnar 0 siblings, 1 reply; 55+ messages in thread From: Peter Zijlstra @ 2013-10-01 8:09 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Tue, Oct 01, 2013 at 09:28:15AM +0200, Ingo Molnar wrote: > That I mostly agree with, except that without a serious usecase do we have > a guarantee that bugs in fancies queueing in rwsems gets ironed out? Methinks mmap_sem is still a big enough lock to work out a locking primitive :-) In fact, try something like this from userspace: n-threads: pthread_mutex_lock(&mutex); foo = mmap(); pthread_mutex_lock(&mutex); /* work */ pthread_mutex_unlock(&mutex); munma(foo); pthread_mutex_unlock(&mutex); vs n-threads: foo = mmap(); /* work */ munmap(foo); I've had reports that the former was significantly faster than the latter. ^ permalink raw reply [flat|nested] 55+ messages in thread
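A minimal compilable version of that experiment might look like the
sketch below, reading the quoted pseudo-code as serializing only the
mmap()/munmap() calls; the thread count, iteration count and mapping
size are arbitrary illustration values, not numbers from this thread:

	/* gcc -O2 -pthread mmap-demo.c -o mmap-demo
	 *
	 * Run with no argument for the plain variant, or with any argument
	 * for the variant that wraps mmap()/munmap() in a user-space mutex.
	 */
	#include <pthread.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>

	#define NTHREADS	16
	#define ITERS		100000
	#define MAPLEN		(64 * 1024)

	static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
	static int serialize;

	static void *worker(void *arg)
	{
		for (int i = 0; i < ITERS; i++) {
			void *p;

			if (serialize)
				pthread_mutex_lock(&mutex);
			p = mmap(NULL, MAPLEN, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (serialize)
				pthread_mutex_unlock(&mutex);
			if (p == MAP_FAILED)
				abort();

			memset(p, 1, MAPLEN);		/* the "work" */

			if (serialize)
				pthread_mutex_lock(&mutex);
			munmap(p, MAPLEN);
			if (serialize)
				pthread_mutex_unlock(&mutex);
		}
		return NULL;
	}

	int main(int argc, char **argv)
	{
		pthread_t tid[NTHREADS];

		serialize = (argc > 1);

		for (int i = 0; i < NTHREADS; i++)
			pthread_create(&tid[i], NULL, worker, NULL);
		for (int i = 0; i < NTHREADS; i++)
			pthread_join(tid[i], NULL);

		return 0;
	}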
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-10-01 8:09 ` Peter Zijlstra @ 2013-10-01 8:25 ` Ingo Molnar 0 siblings, 0 replies; 55+ messages in thread From: Ingo Molnar @ 2013-10-01 8:25 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, Oct 01, 2013 at 09:28:15AM +0200, Ingo Molnar wrote: > > > That I mostly agree with, except that without a serious usecase do we > > have a guarantee that bugs in fancies queueing in rwsems gets ironed > > out? > > Methinks mmap_sem is still a big enough lock to work out a locking > primitive :-) I mean the AIM7 usecase probably falls away - we need to find another one that shows the inefficiencies. > In fact, try something like this from userspace: > > n-threads: > > pthread_mutex_lock(&mutex); > foo = mmap(); > pthread_mutex_lock(&mutex); > > /* work */ > > pthread_mutex_unlock(&mutex); > munma(foo); > pthread_mutex_unlock(&mutex); > > vs > > n-threads: > > foo = mmap(); > /* work */ > munmap(foo); > > > I've had reports that the former was significantly faster than the > latter. That looks like a legitimate pattern that ought to trigger in many apps. Would be nice to turn this into a: perf bench mm thread-create testcase or so. Thanks, Ingo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-28 19:33 ` Linus Torvalds 2013-09-28 19:39 ` Ingo Molnar 2013-09-30 10:44 ` Peter Zijlstra @ 2013-09-30 15:58 ` Waiman Long 2013-10-01 7:33 ` Ingo Molnar 2 siblings, 1 reply; 55+ messages in thread From: Waiman Long @ 2013-09-30 15:58 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On 09/28/2013 03:33 PM, Linus Torvalds wrote: > On Sat, Sep 28, 2013 at 12:21 PM, Ingo Molnar<mingo@kernel.org> wrote: >> If we do that then I suspect the next step will be queued rwlocks :-/ The >> current rwlock_t implementation is rather primitive by modern standards. >> (We'd probably have killed rwlock_t long ago if not for the >> tasklist_lock.) > Yeah, I'm not happy about or rwlocks. That's one lock that currently > is so broken that I think we could easily argue for making that one > queued. > > Waiman had a qrwlock series that looked reasonable, and I think his > later versions were drop-in replacements (ie they automatically just > did the RightThing(tm) wrt interrupts taking a recursive read lock - I > objected to the first versions that required that to be stated > explicitly). The latest version (v3) will allow recursive read lock in interrupt handlers. > I think Waiman's patches (even the later ones) made the queued rwlocks > be a side-by-side implementation with the old rwlocks, and I think > that was just being unnecessarily careful. It might be useful for > testing to have a config option to switch between the two, but we > might as well go all the way. It is not actually a side-by-side implementation. A user can choose either regular rwlock or the queue one, but never both by setting a configuration parameter. However, I now think that maybe we should do it the lockref way by pre-determining it on a per-architecture level without user visible configuration option. -Longman ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-30 15:58 ` Waiman Long @ 2013-10-01 7:33 ` Ingo Molnar 2013-10-01 20:03 ` Waiman Long 0 siblings, 1 reply; 55+ messages in thread From: Ingo Molnar @ 2013-10-01 7:33 UTC (permalink / raw) To: Waiman Long Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Waiman Long <waiman.long@hp.com> wrote: > > I think Waiman's patches (even the later ones) made the queued rwlocks > > be a side-by-side implementation with the old rwlocks, and I think > > that was just being unnecessarily careful. It might be useful for > > testing to have a config option to switch between the two, but we > > might as well go all the way. > > It is not actually a side-by-side implementation. A user can choose > either regular rwlock or the queue one, but never both by setting a > configuration parameter. However, I now think that maybe we should do it > the lockref way by pre-determining it on a per-architecture level > without user visible configuration option. Well, as I pointed it out to you during review, such a Kconfig driven locking API choice is a no-go! What I suggested instead: there's absolutely no problem with providing a better rwlock_t implementation, backed with numbers and all that. Thanks, Ingo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-10-01 7:33 ` Ingo Molnar @ 2013-10-01 20:03 ` Waiman Long 0 siblings, 0 replies; 55+ messages in thread From: Waiman Long @ 2013-10-01 20:03 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On 10/01/2013 03:33 AM, Ingo Molnar wrote: > * Waiman Long<waiman.long@hp.com> wrote: > >>> I think Waiman's patches (even the later ones) made the queued rwlocks >>> be a side-by-side implementation with the old rwlocks, and I think >>> that was just being unnecessarily careful. It might be useful for >>> testing to have a config option to switch between the two, but we >>> might as well go all the way. >> It is not actually a side-by-side implementation. A user can choose >> either regular rwlock or the queue one, but never both by setting a >> configuration parameter. However, I now think that maybe we should do it >> the lockref way by pre-determining it on a per-architecture level >> without user visible configuration option. > Well, as I pointed it out to you during review, such a Kconfig driven > locking API choice is a no-go! > > What I suggested instead: there's absolutely no problem with providing a > better rwlock_t implementation, backed with numbers and all that. > > Thanks, > > Ingo Yes, this is what I am planning to do. The next version of my qrwlock patch will force the switch to queue rwlock for x86 architecture. The other architectures have to be done separately. -Longman ^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-28 19:21 ` Ingo Molnar 2013-09-28 19:33 ` Linus Torvalds @ 2013-09-28 19:37 ` Ingo Molnar 2013-09-28 19:43 ` Linus Torvalds 2013-09-30 8:52 ` [PATCH] " Andrea Arcangeli 1 sibling, 2 replies; 55+ messages in thread From: Ingo Molnar @ 2013-09-28 19:37 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Ingo Molnar <mingo@kernel.org> wrote: > If we do that then I suspect the next step will be queued rwlocks :-/ > The current rwlock_t implementation is rather primitive by modern > standards. (We'd probably have killed rwlock_t long ago if not for the > tasklist_lock.) > > But yeah, it would work and conceptually a hard spinlock fits something > as lowlevel as the anon-vma lock. > > I did a quick review pass and it appears nothing obvious is scheduling > with the anon-vma lock held. If it did in a non-obvious way it's likely > a bug anyway. The hugepage code grew a lot of logic running under the > anon-vma lock, but it all seems atomic. > > So a conversion to rwlock_t could be attempted. (It should be relatively > easy patch as well, because the locking operation is now nicely > abstracted out.) Here's a totally untested patch to convert the anon vma lock to an rwlock_t. I think its lack of modern queueing will hurt on big systems big time - it might even regress. But ... it's hard to tell such things in advance. [ That might as well be for the better as it will eventually be fixed, which in turn will improve tasklist_lock workloads ;-) ] Thanks, Ingo -------------> Subject: anon_vmas: Convert the rwsem to an rwlock_t From: Ingo Molnar <mingo@kernel.org> -- include/linux/mmu_notifier.h | 2 +- include/linux/rmap.h | 19 +++++++++---------- mm/huge_memory.c | 4 ++-- mm/mmap.c | 10 +++++----- mm/rmap.c | 24 ++++++++++++------------ 5 files changed, 29 insertions(+), 30 deletions(-) diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index deca874..628e807 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -151,7 +151,7 @@ struct mmu_notifier_ops { * Therefore notifier chains can only be traversed when either * * 1. mmap_sem is held. - * 2. One of the reverse map locks is held (i_mmap_mutex or anon_vma->rwsem). + * 2. One of the reverse map locks is held (i_mmap_mutex or anon_vma->rwlock). * 3. 
No other concurrent thread can access the list (release) */ struct mmu_notifier { diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 6dacb93..f4ab929 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -7,7 +7,7 @@ #include <linux/list.h> #include <linux/slab.h> #include <linux/mm.h> -#include <linux/rwsem.h> +#include <linux/rwlock.h> #include <linux/memcontrol.h> /* @@ -26,7 +26,7 @@ */ struct anon_vma { struct anon_vma *root; /* Root of this anon_vma tree */ - struct rw_semaphore rwsem; /* W: modification, R: walking the list */ + rwlock_t rwlock; /* W: modification, R: walking the list */ /* * The refcount is taken on an anon_vma when there is no * guarantee that the vma of page tables will exist for @@ -64,7 +64,7 @@ struct anon_vma_chain { struct vm_area_struct *vma; struct anon_vma *anon_vma; struct list_head same_vma; /* locked by mmap_sem & page_table_lock */ - struct rb_node rb; /* locked by anon_vma->rwsem */ + struct rb_node rb; /* locked by anon_vma->rwlock */ unsigned long rb_subtree_last; #ifdef CONFIG_DEBUG_VM_RB unsigned long cached_vma_start, cached_vma_last; @@ -108,37 +108,36 @@ static inline void vma_lock_anon_vma(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) - down_write(&anon_vma->root->rwsem); + write_lock(&anon_vma->root->rwlock); } static inline void vma_unlock_anon_vma(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) - up_write(&anon_vma->root->rwsem); + write_unlock(&anon_vma->root->rwlock); } static inline void anon_vma_lock_write(struct anon_vma *anon_vma) { - down_write(&anon_vma->root->rwsem); + write_lock(&anon_vma->root->rwlock); } static inline void anon_vma_unlock_write(struct anon_vma *anon_vma) { - up_write(&anon_vma->root->rwsem); + write_unlock(&anon_vma->root->rwlock); } static inline void anon_vma_lock_read(struct anon_vma *anon_vma) { - down_read(&anon_vma->root->rwsem); + read_unlock(&anon_vma->root->rwlock); } static inline void anon_vma_unlock_read(struct anon_vma *anon_vma) { - up_read(&anon_vma->root->rwsem); + read_unlock(&anon_vma->root->rwlock); } - /* * anon_vma helper functions. */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 7489884..78f6c08 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1542,7 +1542,7 @@ static int __split_huge_page_splitting(struct page *page, * We can't temporarily set the pmd to null in order * to split it, the pmd must remain marked huge at all * times or the VM won't take the pmd_trans_huge paths - * and it won't wait on the anon_vma->root->rwsem to + * and it won't wait on the anon_vma->root->rwlock to * serialize against split_huge_page*. */ pmdp_splitting_flush(vma, address, pmd); @@ -1747,7 +1747,7 @@ static int __split_huge_page_map(struct page *page, return ret; } -/* must be called with anon_vma->root->rwsem held */ +/* must be called with anon_vma->root->rwlock held */ static void __split_huge_page(struct page *page, struct anon_vma *anon_vma, struct list_head *list) diff --git a/mm/mmap.c b/mm/mmap.c index 9d54851..25ce233 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2955,15 +2955,15 @@ static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma) * The LSB of head.next can't change from under us * because we hold the mm_all_locks_mutex. */ - down_write_nest_lock(&anon_vma->root->rwsem, &mm->mmap_sem); + down_write_nest_lock(&anon_vma->root->rwlock, &mm->mmap_sem); /* * We can safely modify head.next after taking the - * anon_vma->root->rwsem. 
If some other vma in this mm shares + * anon_vma->root->rwlock. If some other vma in this mm shares * the same anon_vma we won't take it again. * * No need of atomic instructions here, head.next * can't change from under us thanks to the - * anon_vma->root->rwsem. + * anon_vma->root->rwlock. */ if (__test_and_set_bit(0, (unsigned long *) &anon_vma->root->rb_root.rb_node)) @@ -3012,7 +3012,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping) * vma in this mm is backed by the same anon_vma or address_space. * * We can take all the locks in random order because the VM code - * taking i_mmap_mutex or anon_vma->rwsem outside the mmap_sem never + * taking i_mmap_mutex or anon_vma->rwslockoutside the mmap_sem never * takes more than one of them in a row. Secondly we're protected * against a concurrent mm_take_all_locks() by the mm_all_locks_mutex. * @@ -3065,7 +3065,7 @@ static void vm_unlock_anon_vma(struct anon_vma *anon_vma) * * No need of atomic instructions here, head.next * can't change from under us until we release the - * anon_vma->root->rwsem. + * anon_vma->root->rwlock. */ if (!__test_and_clear_bit(0, (unsigned long *) &anon_vma->root->rb_root.rb_node)) diff --git a/mm/rmap.c b/mm/rmap.c index fd3ee7a..d101133 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -24,7 +24,7 @@ * mm->mmap_sem * page->flags PG_locked (lock_page) * mapping->i_mmap_mutex - * anon_vma->rwsem + * anon_vma->rwlock * mm->page_table_lock or pte_lock * zone->lru_lock (in mark_page_accessed, isolate_lru_page) * swap_lock (in swap_duplicate, swap_info_get) @@ -37,7 +37,7 @@ * in arch-dependent flush_dcache_mmap_lock, * within bdi.wb->list_lock in __sync_single_inode) * - * anon_vma->rwsem,mapping->i_mutex (memory_failure, collect_procs_anon) + * anon_vma->rwlock,mapping->i_mutex (memory_failure, collect_procs_anon) * ->tasklist_lock * pte map lock */ @@ -98,12 +98,12 @@ static inline void anon_vma_free(struct anon_vma *anon_vma) * page_lock_anon_vma_read() VS put_anon_vma() * down_read_trylock() atomic_dec_and_test() * LOCK MB - * atomic_read() rwsem_is_locked() + * atomic_read() rwlock_is_locked() * * LOCK should suffice since the actual taking of the lock must * happen _before_ what follows. */ - if (rwsem_is_locked(&anon_vma->root->rwsem)) { + if (write_can_lock(&anon_vma->root->rwlock)) { anon_vma_lock_write(anon_vma); anon_vma_unlock_write(anon_vma); } @@ -219,9 +219,9 @@ static inline struct anon_vma *lock_anon_vma_root(struct anon_vma *root, struct struct anon_vma *new_root = anon_vma->root; if (new_root != root) { if (WARN_ON_ONCE(root)) - up_write(&root->rwsem); + write_unlock(&root->rwlock); root = new_root; - down_write(&root->rwsem); + write_lock(&root->rwlock); } return root; } @@ -229,7 +229,7 @@ static inline struct anon_vma *lock_anon_vma_root(struct anon_vma *root, struct static inline void unlock_anon_vma_root(struct anon_vma *root) { if (root) - up_write(&root->rwsem); + write_unlock(&root->rwlock); } /* @@ -349,7 +349,7 @@ void unlink_anon_vmas(struct vm_area_struct *vma) /* * Iterate the list once more, it now only contains empty and unlinked * anon_vmas, destroy them. Could not do before due to __put_anon_vma() - * needing to write-acquire the anon_vma->root->rwsem. + * needing to write-acquire the anon_vma->root->rwlock. 
*/ list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) { struct anon_vma *anon_vma = avc->anon_vma; @@ -365,7 +365,7 @@ static void anon_vma_ctor(void *data) { struct anon_vma *anon_vma = data; - init_rwsem(&anon_vma->rwsem); + rwlock_init(&anon_vma->rwlock); atomic_set(&anon_vma->refcount, 0); anon_vma->rb_root = RB_ROOT; } @@ -457,14 +457,14 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page) anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); root_anon_vma = ACCESS_ONCE(anon_vma->root); - if (down_read_trylock(&root_anon_vma->rwsem)) { + if (read_trylock(&root_anon_vma->rwlock)) { /* * If the page is still mapped, then this anon_vma is still * its anon_vma, and holding the mutex ensures that it will * not go away, see anon_vma_free(). */ if (!page_mapped(page)) { - up_read(&root_anon_vma->rwsem); + read_unlock(&root_anon_vma->rwlock); anon_vma = NULL; } goto out; @@ -1293,7 +1293,7 @@ out_mlock: /* * We need mmap_sem locking, Otherwise VM_LOCKED check makes * unstable result and race. Plus, We can't wait here because - * we now hold anon_vma->rwsem or mapping->i_mmap_mutex. + * we now hold anon_vma->rwlock or mapping->i_mmap_mutex. * if trylock failed, the page remain in evictable lru and later * vmscan could retry to move the page to unevictable lru if the * page is actually mlocked. ^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [PATCH] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-28 19:37 ` [PATCH] anon_vmas: Convert the rwsem to an rwlock_t Ingo Molnar @ 2013-09-28 19:43 ` Linus Torvalds 2013-09-28 19:46 ` Ingo Molnar 2013-09-28 19:52 ` [PATCH, v2] " Ingo Molnar 2013-09-30 8:52 ` [PATCH] " Andrea Arcangeli 1 sibling, 2 replies; 55+ messages in thread From: Linus Torvalds @ 2013-09-28 19:43 UTC (permalink / raw) To: Ingo Molnar Cc: Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sat, Sep 28, 2013 at 12:37 PM, Ingo Molnar <mingo@kernel.org> wrote: > > - down_write_nest_lock(&anon_vma->root->rwsem, &mm->mmap_sem); > + down_write_nest_lock(&anon_vma->root->rwlock, &mm->mmap_sem); That's just completely bogus, and cannot work. Maybe just a "write_lock(&anon_vma->root->rwlock)" (which is just anon_vma_unlock_write(anon_vma)). But I think we might have a lockdep issue. I'm not quite sure what's up with the nesting there. > - if (rwsem_is_locked(&anon_vma->root->rwsem)) { > + if (write_can_lock(&anon_vma->root->rwlock)) { > anon_vma_lock_write(anon_vma); > anon_vma_unlock_write(anon_vma); > } That's the wrong way around. It should be if (!write_can_lock(&anon_vma->root->rwlock)) { so some more testing definitely needed. Linus ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-28 19:43 ` Linus Torvalds @ 2013-09-28 19:46 ` Ingo Molnar 2013-09-28 19:52 ` [PATCH, v2] " Ingo Molnar 1 sibling, 0 replies; 55+ messages in thread From: Ingo Molnar @ 2013-09-28 19:46 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sat, Sep 28, 2013 at 12:37 PM, Ingo Molnar <mingo@kernel.org> wrote: > > > > - down_write_nest_lock(&anon_vma->root->rwsem, &mm->mmap_sem); > > + down_write_nest_lock(&anon_vma->root->rwlock, &mm->mmap_sem); > > That's just completely bogus, and cannot work. Told you it's totally untested :-) Found that build failure a few minutes ago (the place escaped my search pattern), I'm trying the fix below. Ingo diff --git a/mm/mmap.c b/mm/mmap.c index 25ce233..7ee85bf 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2955,7 +2955,7 @@ static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma) * The LSB of head.next can't change from under us * because we hold the mm_all_locks_mutex. */ - down_write_nest_lock(&anon_vma->root->rwlock, &mm->mmap_sem); + write_lock(&anon_vma->root->rwlock); /* * We can safely modify head.next after taking the * anon_vma->root->rwlock. If some other vma in this mm shares ^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-28 19:43 ` Linus Torvalds 2013-09-28 19:46 ` Ingo Molnar @ 2013-09-28 19:52 ` Ingo Molnar 2013-09-30 11:00 ` Peter Zijlstra ` (3 more replies) 1 sibling, 4 replies; 55+ messages in thread From: Ingo Molnar @ 2013-09-28 19:52 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sat, Sep 28, 2013 at 12:37 PM, Ingo Molnar <mingo@kernel.org> wrote: > > > > - down_write_nest_lock(&anon_vma->root->rwsem, &mm->mmap_sem); > > + down_write_nest_lock(&anon_vma->root->rwlock, &mm->mmap_sem); > > That's just completely bogus, and cannot work. > > Maybe just a "write_lock(&anon_vma->root->rwlock)" (which is just > anon_vma_unlock_write(anon_vma)). But I think we might have a lockdep > issue. I'm not quite sure what's up with the nesting there. > > > - if (rwsem_is_locked(&anon_vma->root->rwsem)) { > > + if (write_can_lock(&anon_vma->root->rwlock)) { > > anon_vma_lock_write(anon_vma); > > anon_vma_unlock_write(anon_vma); > > } > > That's the wrong way around. It should be > > if (!write_can_lock(&anon_vma->root->rwlock)) { > > so some more testing definitely needed. Yeah, that silly API asymmetry has bitten me before as well :-/ The attached version booted up fine under 16-way KVM: sh-4.2# uptime 19:50:08 up 0 min, 0 users, load average: 0.00, 0.00, 0.00 That's all the testing it will get this evening though. Patch should be good enough for Tim to try? Thanks, Ingo ====================> Subject: anon_vmas: Convert the rwsem to an rwlock_t From: Ingo Molnar <mingo@kernel.org> Date: Sat, 28 Sep 2013 21:37:39 +0200 Here's a an almost totally untested patch to convert the anon vma lock to an rwlock_t. I think its lack of modern queueing will hurt on big systems big time - it might even regress. But ... it's hard to tell such things in advance. [ That might as well be for the better as it will eventually be fixed, which in turn will improve tasklist_lock workloads ;-) ] -- include/linux/mmu_notifier.h | 2 +- include/linux/rmap.h | 19 +++++++++---------- mm/huge_memory.c | 4 ++-- mm/mmap.c | 10 +++++----- mm/rmap.c | 24 ++++++++++++------------ 5 files changed, 29 insertions(+), 30 deletions(-) Signed-off-by: Ingo Molnar <mingo@kernel.org> Index: tip/include/linux/mmu_notifier.h =================================================================== --- tip.orig/include/linux/mmu_notifier.h +++ tip/include/linux/mmu_notifier.h @@ -151,7 +151,7 @@ struct mmu_notifier_ops { * Therefore notifier chains can only be traversed when either * * 1. mmap_sem is held. - * 2. One of the reverse map locks is held (i_mmap_mutex or anon_vma->rwsem). + * 2. One of the reverse map locks is held (i_mmap_mutex or anon_vma->rwlock). * 3. 
No other concurrent thread can access the list (release) */ struct mmu_notifier { Index: tip/include/linux/rmap.h =================================================================== --- tip.orig/include/linux/rmap.h +++ tip/include/linux/rmap.h @@ -7,7 +7,7 @@ #include <linux/list.h> #include <linux/slab.h> #include <linux/mm.h> -#include <linux/rwsem.h> +#include <linux/rwlock.h> #include <linux/memcontrol.h> /* @@ -26,7 +26,7 @@ */ struct anon_vma { struct anon_vma *root; /* Root of this anon_vma tree */ - struct rw_semaphore rwsem; /* W: modification, R: walking the list */ + rwlock_t rwlock; /* W: modification, R: walking the list */ /* * The refcount is taken on an anon_vma when there is no * guarantee that the vma of page tables will exist for @@ -64,7 +64,7 @@ struct anon_vma_chain { struct vm_area_struct *vma; struct anon_vma *anon_vma; struct list_head same_vma; /* locked by mmap_sem & page_table_lock */ - struct rb_node rb; /* locked by anon_vma->rwsem */ + struct rb_node rb; /* locked by anon_vma->rwlock */ unsigned long rb_subtree_last; #ifdef CONFIG_DEBUG_VM_RB unsigned long cached_vma_start, cached_vma_last; @@ -108,37 +108,36 @@ static inline void vma_lock_anon_vma(str { struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) - down_write(&anon_vma->root->rwsem); + write_lock(&anon_vma->root->rwlock); } static inline void vma_unlock_anon_vma(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) - up_write(&anon_vma->root->rwsem); + write_unlock(&anon_vma->root->rwlock); } static inline void anon_vma_lock_write(struct anon_vma *anon_vma) { - down_write(&anon_vma->root->rwsem); + write_lock(&anon_vma->root->rwlock); } static inline void anon_vma_unlock_write(struct anon_vma *anon_vma) { - up_write(&anon_vma->root->rwsem); + write_unlock(&anon_vma->root->rwlock); } static inline void anon_vma_lock_read(struct anon_vma *anon_vma) { - down_read(&anon_vma->root->rwsem); + read_unlock(&anon_vma->root->rwlock); } static inline void anon_vma_unlock_read(struct anon_vma *anon_vma) { - up_read(&anon_vma->root->rwsem); + read_unlock(&anon_vma->root->rwlock); } - /* * anon_vma helper functions. */ Index: tip/mm/huge_memory.c =================================================================== --- tip.orig/mm/huge_memory.c +++ tip/mm/huge_memory.c @@ -1542,7 +1542,7 @@ static int __split_huge_page_splitting(s * We can't temporarily set the pmd to null in order * to split it, the pmd must remain marked huge at all * times or the VM won't take the pmd_trans_huge paths - * and it won't wait on the anon_vma->root->rwsem to + * and it won't wait on the anon_vma->root->rwlock to * serialize against split_huge_page*. */ pmdp_splitting_flush(vma, address, pmd); @@ -1747,7 +1747,7 @@ static int __split_huge_page_map(struct return ret; } -/* must be called with anon_vma->root->rwsem held */ +/* must be called with anon_vma->root->rwlock held */ static void __split_huge_page(struct page *page, struct anon_vma *anon_vma, struct list_head *list) Index: tip/mm/mmap.c =================================================================== --- tip.orig/mm/mmap.c +++ tip/mm/mmap.c @@ -2955,15 +2955,15 @@ static void vm_lock_anon_vma(struct mm_s * The LSB of head.next can't change from under us * because we hold the mm_all_locks_mutex. */ - down_write_nest_lock(&anon_vma->root->rwsem, &mm->mmap_sem); + write_lock(&anon_vma->root->rwlock); /* * We can safely modify head.next after taking the - * anon_vma->root->rwsem. 
If some other vma in this mm shares + * anon_vma->root->rwlock. If some other vma in this mm shares * the same anon_vma we won't take it again. * * No need of atomic instructions here, head.next * can't change from under us thanks to the - * anon_vma->root->rwsem. + * anon_vma->root->rwlock. */ if (__test_and_set_bit(0, (unsigned long *) &anon_vma->root->rb_root.rb_node)) @@ -3012,7 +3012,7 @@ static void vm_lock_mapping(struct mm_st * vma in this mm is backed by the same anon_vma or address_space. * * We can take all the locks in random order because the VM code - * taking i_mmap_mutex or anon_vma->rwsem outside the mmap_sem never + * taking i_mmap_mutex or anon_vma->rwslockoutside the mmap_sem never * takes more than one of them in a row. Secondly we're protected * against a concurrent mm_take_all_locks() by the mm_all_locks_mutex. * @@ -3065,7 +3065,7 @@ static void vm_unlock_anon_vma(struct an * * No need of atomic instructions here, head.next * can't change from under us until we release the - * anon_vma->root->rwsem. + * anon_vma->root->rwlock. */ if (!__test_and_clear_bit(0, (unsigned long *) &anon_vma->root->rb_root.rb_node)) Index: tip/mm/rmap.c =================================================================== --- tip.orig/mm/rmap.c +++ tip/mm/rmap.c @@ -24,7 +24,7 @@ * mm->mmap_sem * page->flags PG_locked (lock_page) * mapping->i_mmap_mutex - * anon_vma->rwsem + * anon_vma->rwlock * mm->page_table_lock or pte_lock * zone->lru_lock (in mark_page_accessed, isolate_lru_page) * swap_lock (in swap_duplicate, swap_info_get) @@ -37,7 +37,7 @@ * in arch-dependent flush_dcache_mmap_lock, * within bdi.wb->list_lock in __sync_single_inode) * - * anon_vma->rwsem,mapping->i_mutex (memory_failure, collect_procs_anon) + * anon_vma->rwlock,mapping->i_mutex (memory_failure, collect_procs_anon) * ->tasklist_lock * pte map lock */ @@ -98,12 +98,12 @@ static inline void anon_vma_free(struct * page_lock_anon_vma_read() VS put_anon_vma() * down_read_trylock() atomic_dec_and_test() * LOCK MB - * atomic_read() rwsem_is_locked() + * atomic_read() rwlock_is_locked() * * LOCK should suffice since the actual taking of the lock must * happen _before_ what follows. */ - if (rwsem_is_locked(&anon_vma->root->rwsem)) { + if (!write_can_lock(&anon_vma->root->rwlock)) { anon_vma_lock_write(anon_vma); anon_vma_unlock_write(anon_vma); } @@ -219,9 +219,9 @@ static inline struct anon_vma *lock_anon struct anon_vma *new_root = anon_vma->root; if (new_root != root) { if (WARN_ON_ONCE(root)) - up_write(&root->rwsem); + write_unlock(&root->rwlock); root = new_root; - down_write(&root->rwsem); + write_lock(&root->rwlock); } return root; } @@ -229,7 +229,7 @@ static inline struct anon_vma *lock_anon static inline void unlock_anon_vma_root(struct anon_vma *root) { if (root) - up_write(&root->rwsem); + write_unlock(&root->rwlock); } /* @@ -349,7 +349,7 @@ void unlink_anon_vmas(struct vm_area_str /* * Iterate the list once more, it now only contains empty and unlinked * anon_vmas, destroy them. Could not do before due to __put_anon_vma() - * needing to write-acquire the anon_vma->root->rwsem. + * needing to write-acquire the anon_vma->root->rwlock. 
*/ list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) { struct anon_vma *anon_vma = avc->anon_vma; @@ -365,7 +365,7 @@ static void anon_vma_ctor(void *data) { struct anon_vma *anon_vma = data; - init_rwsem(&anon_vma->rwsem); + rwlock_init(&anon_vma->rwlock); atomic_set(&anon_vma->refcount, 0); anon_vma->rb_root = RB_ROOT; } @@ -457,14 +457,14 @@ struct anon_vma *page_lock_anon_vma_read anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); root_anon_vma = ACCESS_ONCE(anon_vma->root); - if (down_read_trylock(&root_anon_vma->rwsem)) { + if (read_trylock(&root_anon_vma->rwlock)) { /* * If the page is still mapped, then this anon_vma is still * its anon_vma, and holding the mutex ensures that it will * not go away, see anon_vma_free(). */ if (!page_mapped(page)) { - up_read(&root_anon_vma->rwsem); + read_unlock(&root_anon_vma->rwlock); anon_vma = NULL; } goto out; @@ -1293,7 +1293,7 @@ out_mlock: /* * We need mmap_sem locking, Otherwise VM_LOCKED check makes * unstable result and race. Plus, We can't wait here because - * we now hold anon_vma->rwsem or mapping->i_mmap_mutex. + * we now hold anon_vma->rwlock or mapping->i_mmap_mutex. * if trylock failed, the page remain in evictable lru and later * vmscan could retry to move the page to unevictable lru if the * page is actually mlocked. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-28 19:52 ` [PATCH, v2] " Ingo Molnar @ 2013-09-30 11:00 ` Peter Zijlstra 2013-09-30 11:44 ` Peter Zijlstra ` (2 subsequent siblings) 3 siblings, 0 replies; 55+ messages in thread From: Peter Zijlstra @ 2013-09-30 11:00 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sat, Sep 28, 2013 at 09:52:07PM +0200, Ingo Molnar wrote: > Index: tip/mm/rmap.c > =================================================================== > --- tip.orig/mm/rmap.c > +++ tip/mm/rmap.c > @@ -98,12 +98,12 @@ static inline void anon_vma_free(struct > * page_lock_anon_vma_read() VS put_anon_vma() > * down_read_trylock() atomic_dec_and_test() > * LOCK MB > - * atomic_read() rwsem_is_locked() > + * atomic_read() rwlock_is_locked() > * > * LOCK should suffice since the actual taking of the lock must > * happen _before_ what follows. > */ > - if (rwsem_is_locked(&anon_vma->root->rwsem)) { > + if (!write_can_lock(&anon_vma->root->rwlock)) { > anon_vma_lock_write(anon_vma); > anon_vma_unlock_write(anon_vma); > } > @@ -457,14 +457,14 @@ struct anon_vma *page_lock_anon_vma_read > > anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); > root_anon_vma = ACCESS_ONCE(anon_vma->root); > - if (down_read_trylock(&root_anon_vma->rwsem)) { > + if (read_trylock(&root_anon_vma->rwlock)) { > /* > * If the page is still mapped, then this anon_vma is still > * its anon_vma, and holding the mutex ensures that it will > * not go away, see anon_vma_free(). > */ > if (!page_mapped(page)) { > - up_read(&root_anon_vma->rwsem); > + read_unlock(&root_anon_vma->rwlock); > anon_vma = NULL; > } > goto out; > @@ -1293,7 +1293,7 @@ out_mlock: > /* > * We need mmap_sem locking, Otherwise VM_LOCKED check makes > * unstable result and race. Plus, We can't wait here because > - * we now hold anon_vma->rwsem or mapping->i_mmap_mutex. > + * we now hold anon_vma->rwlock or mapping->i_mmap_mutex. > * if trylock failed, the page remain in evictable lru and later > * vmscan could retry to move the page to unevictable lru if the > * page is actually mlocked. You can remove all that -- all that trickery was only needed because the lock could sleep; --- --- a/mm/rmap.c +++ b/mm/rmap.c @@ -85,29 +85,6 @@ static inline struct anon_vma *anon_vma_ static inline void anon_vma_free(struct anon_vma *anon_vma) { VM_BUG_ON(atomic_read(&anon_vma->refcount)); - - /* - * Synchronize against page_lock_anon_vma_read() such that - * we can safely hold the lock without the anon_vma getting - * freed. - * - * Relies on the full mb implied by the atomic_dec_and_test() from - * put_anon_vma() against the acquire barrier implied by - * down_read_trylock() from page_lock_anon_vma_read(). This orders: - * - * page_lock_anon_vma_read() VS put_anon_vma() - * down_read_trylock() atomic_dec_and_test() - * LOCK MB - * atomic_read() rwlock_is_locked() - * - * LOCK should suffice since the actual taking of the lock must - * happen _before_ what follows. 
- */ - if (!write_can_lock(&anon_vma->root->rwlock)) { - anon_vma_lock_write(anon_vma); - anon_vma_unlock_write(anon_vma); - } - kmem_cache_free(anon_vma_cachep, anon_vma); } @@ -437,15 +414,10 @@ struct anon_vma *page_get_anon_vma(struc /* * Similar to page_get_anon_vma() except it locks the anon_vma. - * - * Its a little more complex as it tries to keep the fast path to a single - * atomic op -- the trylock. If we fail the trylock, we fall back to getting a - * reference like with page_get_anon_vma() and then block on the mutex. */ struct anon_vma *page_lock_anon_vma_read(struct page *page) { struct anon_vma *anon_vma = NULL; - struct anon_vma *root_anon_vma; unsigned long anon_mapping; rcu_read_lock(); @@ -456,51 +428,22 @@ struct anon_vma *page_lock_anon_vma_read goto out; anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); - root_anon_vma = ACCESS_ONCE(anon_vma->root); - if (read_trylock(&root_anon_vma->rwlock)) { - /* - * If the page is still mapped, then this anon_vma is still - * its anon_vma, and holding the mutex ensures that it will - * not go away, see anon_vma_free(). - */ - if (!page_mapped(page)) { - read_unlock(&root_anon_vma->rwlock); - anon_vma = NULL; - } - goto out; - } - - /* trylock failed, we got to sleep */ - if (!atomic_inc_not_zero(&anon_vma->refcount)) { - anon_vma = NULL; - goto out; - } - - if (!page_mapped(page)) { - put_anon_vma(anon_vma); - anon_vma = NULL; - goto out; - } - - /* we pinned the anon_vma, its safe to sleep */ - rcu_read_unlock(); anon_vma_lock_read(anon_vma); - if (atomic_dec_and_test(&anon_vma->refcount)) { - /* - * Oops, we held the last refcount, release the lock - * and bail -- can't simply use put_anon_vma() because - * we'll deadlock on the anon_vma_lock_write() recursion. - */ + /* + * If this page is still mapped, then its anon_vma cannot have been + * freed. But if it has been unmapped, we have no security against the + * anon_vma structure being freed and reused (for another anon_vma: + * SLAB_DESTROY_BY_RCU guarantees that - so the atomic_inc_not_zero() + * above cannot corrupt). + */ + if (!page_mapped(page)) { anon_vma_unlock_read(anon_vma); - __put_anon_vma(anon_vma); anon_vma = NULL; } - - return anon_vma; - out: rcu_read_unlock(); + return anon_vma; } ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-28 19:52 ` [PATCH, v2] " Ingo Molnar 2013-09-30 11:00 ` Peter Zijlstra @ 2013-09-30 11:44 ` Peter Zijlstra 2013-09-30 17:03 ` Andrew Morton 2013-09-30 17:10 ` Tim Chen 3 siblings, 0 replies; 55+ messages in thread From: Peter Zijlstra @ 2013-09-30 11:44 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sat, Sep 28, 2013 at 09:52:07PM +0200, Ingo Molnar wrote: > Index: tip/mm/mmap.c > =================================================================== > --- tip.orig/mm/mmap.c > +++ tip/mm/mmap.c > @@ -2955,15 +2955,15 @@ static void vm_lock_anon_vma(struct mm_s > * The LSB of head.next can't change from under us > * because we hold the mm_all_locks_mutex. > */ > - down_write_nest_lock(&anon_vma->root->rwsem, &mm->mmap_sem); > + write_lock(&anon_vma->root->rwlock); Tssk, you should know better... But yes; this is the most horrid site; however I think it was mostly file based VMAs that caused the immense amount of locks taken since this is only done 'very' early on in the life of KVM. I can't remember how early; I hope almost instantly and certainly before spawning all the various vcpu threads as that would create all the thread stacks which are anon and would indeed blow up the non-preempt time here. So if we keep i_mmap_mutex -- and I hope we do; that unmap_mutex horror wasn't pretty at all -- see commit 97a894136 . The below shouldn't be too bad I think/hope. --- --- a/include/linux/rwlock.h +++ b/include/linux/rwlock.h @@ -64,6 +64,19 @@ do { \ #define write_lock(lock) _raw_write_lock(lock) #define read_lock(lock) _raw_read_lock(lock) +#ifdef CONFIG_DEBUG_LOCK_ALLOC +# define write_lock_nest_lock(lock, nest_lock) \ +do { \ + typecheck(struct lockdep_map *, &(nest_lock)->dep_map); \ + preempt_disable(); \ + lock_acquire_exclusive(&lock->dep_map, 0, 0, \ + &(nest_lock)->dep_map, _RET_IP_); \ + LOCK_CONTENDED(lock, do_raw_write_trylock, do_raw_write_lock); \ +} while (0) +#else +# define write_lock_nest_lock(lock, nest_lock) write_lock(lock) +#endif + #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) #define read_lock_irqsave(lock, flags) \ --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2955,7 +2955,7 @@ static void vm_lock_anon_vma(struct mm_s * The LSB of head.next can't change from under us * because we hold the mm_all_locks_mutex. */ - write_lock(&anon_vma->root->rwlock); + write_lock_nest_lock(&anon_vma->root->rwlock); /* * We can safely modify head.next after taking the * anon_vma->root->rwlock. If some other vma in this mm shares ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-28 19:52 ` [PATCH, v2] " Ingo Molnar 2013-09-30 11:00 ` Peter Zijlstra 2013-09-30 11:44 ` Peter Zijlstra @ 2013-09-30 17:03 ` Andrew Morton 2013-09-30 17:25 ` Linus Torvalds 2013-09-30 17:10 ` Tim Chen 3 siblings, 1 reply; 55+ messages in thread From: Andrew Morton @ 2013-09-30 17:03 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sat, 28 Sep 2013 21:52:07 +0200 Ingo Molnar <mingo@kernel.org> wrote: > Here's a an almost totally untested patch to convert the anon vma lock to > an rwlock_t. Both the anon vma lock and i_mmap_lock used to be spinlocks. Peter turned them both into sleeping locks in this series: https://lkml.org/lkml/2011/4/1/141 Later, Ingo turned the anon vma lock from an rwsem into a mutex (https://lkml.org/lkml/2012/12/1/141) to permit a scalability fix (https://lkml.org/lkml/2012/12/1/142). Let's convince ourselves that we won't be undoing things which will return to bite us? ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-30 17:03 ` Andrew Morton @ 2013-09-30 17:25 ` Linus Torvalds 0 siblings, 0 replies; 55+ messages in thread From: Linus Torvalds @ 2013-09-30 17:25 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Mon, Sep 30, 2013 at 10:03 AM, Andrew Morton <akpm@linux-foundation.org> wrote: > > Let's convince ourselves that we won't be undoing things which will > return to bite us? Umm. We call them regressions, and we fix them. As already mentioned, the original switch to a mutex didn't even have any good explanation for it, much less actual data. When we make mistakes, we'd better realize it and fix them, rather than assume that they were fixes just because they were made. The performance numbers are pretty damn compelling for this having been a major mistake. (Of course, the current performance numbers also contain the conversion to a rwlock_t rather than back to a spinlock, which may actually help wrt the original situation too) Linus ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-28 19:52 ` [PATCH, v2] " Ingo Molnar ` (2 preceding siblings ...) 2013-09-30 17:03 ` Andrew Morton @ 2013-09-30 17:10 ` Tim Chen 2013-09-30 18:14 ` Peter Zijlstra 3 siblings, 1 reply; 55+ messages in thread From: Tim Chen @ 2013-09-30 17:10 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sat, 2013-09-28 at 21:52 +0200, Ingo Molnar wrote: > * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Sat, Sep 28, 2013 at 12:37 PM, Ingo Molnar <mingo@kernel.org> wrote: > > > > > > - down_write_nest_lock(&anon_vma->root->rwsem, &mm->mmap_sem); > > > + down_write_nest_lock(&anon_vma->root->rwlock, &mm->mmap_sem); > > > > That's just completely bogus, and cannot work. > > > > Maybe just a "write_lock(&anon_vma->root->rwlock)" (which is just > > anon_vma_unlock_write(anon_vma)). But I think we might have a lockdep > > issue. I'm not quite sure what's up with the nesting there. > > > > > - if (rwsem_is_locked(&anon_vma->root->rwsem)) { > > > + if (write_can_lock(&anon_vma->root->rwlock)) { > > > anon_vma_lock_write(anon_vma); > > > anon_vma_unlock_write(anon_vma); > > > } > > > > That's the wrong way around. It should be > > > > if (!write_can_lock(&anon_vma->root->rwlock)) { > > > > so some more testing definitely needed. > > Yeah, that silly API asymmetry has bitten me before as well :-/ > > The attached version booted up fine under 16-way KVM: > > sh-4.2# uptime > 19:50:08 up 0 min, 0 users, load average: 0.00, 0.00, 0.00 > > That's all the testing it will get this evening though. Patch should be > good enough for Tim to try? Here's the exim workload data: rwsem improvment: Waimain's patch: +2.0% Alex+Tim's patchset: +4.8% Waiman+Alex+Tim: +5.3% convert rwsem to rwlock_t for root anon_vma lock Ingo's patch +11.7% Yes, changing the anon-vma root lock to rwlock_t gives the most boost. However, I think that Waiman's patches, Alex's patches and my patches can still be combined together to improve scalability of rwsem, even if we should decide to convert this particular rwsem to rwlock_t. Thanks. Tim > > Thanks, > > Ingo > ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-30 17:10 ` Tim Chen @ 2013-09-30 18:14 ` Peter Zijlstra 2013-09-30 19:23 ` Tim Chen 0 siblings, 1 reply; 55+ messages in thread From: Peter Zijlstra @ 2013-09-30 18:14 UTC (permalink / raw) To: Tim Chen Cc: Ingo Molnar, Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Mon, Sep 30, 2013 at 10:10:27AM -0700, Tim Chen wrote: > Here's the exim workload data: > > rwsem improvment: > Waimain's patch: +2.0% > Alex+Tim's patchset: +4.8% > Waiman+Alex+Tim: +5.3% > > convert rwsem to rwlock_t for root anon_vma lock > Ingo's patch +11.7% > What happens if you stuff Waiman's qrwlock patches on top of that? admittedly and oft mentioned in this thread, our current rwlock_t is somewhat suboptimal under a number of conditions. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-30 18:14 ` Peter Zijlstra @ 2013-09-30 19:23 ` Tim Chen 2013-09-30 19:35 ` Waiman Long 0 siblings, 1 reply; 55+ messages in thread From: Tim Chen @ 2013-09-30 19:23 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Mon, 2013-09-30 at 20:14 +0200, Peter Zijlstra wrote: > On Mon, Sep 30, 2013 at 10:10:27AM -0700, Tim Chen wrote: > > Here's the exim workload data: > > > > rwsem improvment: > > Waimain's patch: +2.0% > > Alex+Tim's patchset: +4.8% > > Waiman+Alex+Tim: +5.3% > > > > convert rwsem to rwlock_t for root anon_vma lock > > Ingo's patch +11.7% > > > > What happens if you stuff Waiman's qrwlock patches on top of that? > admittedly and oft mentioned in this thread, our current rwlock_t is > somewhat suboptimal under a number of conditions. I've tested with Waiman's qrwlock patches on top of Ingo's patches. It does not affect the throughput for exim and I still get about +11.7% throughput change (same as with Ingo's patch only). Tim ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-30 19:23 ` Tim Chen @ 2013-09-30 19:35 ` Waiman Long 2013-09-30 19:47 ` Tim Chen 0 siblings, 1 reply; 55+ messages in thread From: Waiman Long @ 2013-09-30 19:35 UTC (permalink / raw) To: Tim Chen Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On 09/30/2013 03:23 PM, Tim Chen wrote: > On Mon, 2013-09-30 at 20:14 +0200, Peter Zijlstra wrote: >> On Mon, Sep 30, 2013 at 10:10:27AM -0700, Tim Chen wrote: >>> Here's the exim workload data: >>> >>> rwsem improvment: >>> Waimain's patch: +2.0% >>> Alex+Tim's patchset: +4.8% >>> Waiman+Alex+Tim: +5.3% >>> >>> convert rwsem to rwlock_t for root anon_vma lock >>> Ingo's patch +11.7% >>> >> What happens if you stuff Waiman's qrwlock patches on top of that? >> admittedly and oft mentioned in this thread, our current rwlock_t is >> somewhat suboptimal under a number of conditions. > I've tested with Waiman's qrwlock patches on top of Ingo's patches. > It does not affect the throughput for exim and I still get > about +11.7% throughput change (same as with Ingo's patch only). > > Tim > My qrwlock doesn't enable qrwlock by default. You have to use menuconfig to explicitly enable it. Have you done that when you build the test kernel? I am thinking of explicitly enabling it for x86 if the anon-vma lock is converted back to a rwlock. -Longman ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-30 19:35 ` Waiman Long @ 2013-09-30 19:47 ` Tim Chen 2013-09-30 22:03 ` Tim Chen 2013-10-01 2:41 ` Waiman Long 0 siblings, 2 replies; 55+ messages in thread From: Tim Chen @ 2013-09-30 19:47 UTC (permalink / raw) To: Waiman Long Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Mon, 2013-09-30 at 15:35 -0400, Waiman Long wrote: > On 09/30/2013 03:23 PM, Tim Chen wrote: > > On Mon, 2013-09-30 at 20:14 +0200, Peter Zijlstra wrote: > >> On Mon, Sep 30, 2013 at 10:10:27AM -0700, Tim Chen wrote: > >>> Here's the exim workload data: > >>> > >>> rwsem improvment: > >>> Waimain's patch: +2.0% > >>> Alex+Tim's patchset: +4.8% > >>> Waiman+Alex+Tim: +5.3% > >>> > >>> convert rwsem to rwlock_t for root anon_vma lock > >>> Ingo's patch +11.7% > >>> > >> What happens if you stuff Waiman's qrwlock patches on top of that? > >> admittedly and oft mentioned in this thread, our current rwlock_t is > >> somewhat suboptimal under a number of conditions. > > I've tested with Waiman's qrwlock patches on top of Ingo's patches. > > It does not affect the throughput for exim and I still get > > about +11.7% throughput change (same as with Ingo's patch only). > > > > Tim > > > > My qrwlock doesn't enable qrwlock by default. You have to use menuconfig > to explicitly enable it. Have you done that when you build the test > kernel? I am thinking of explicitly enabling it for x86 if the anon-vma > lock is converted back to a rwlock. > Yes, I have explicitly enabled it during my testing. Thanks. Tim ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-30 19:47 ` Tim Chen @ 2013-09-30 22:03 ` Tim Chen 2013-10-01 2:41 ` Waiman Long 1 sibling, 0 replies; 55+ messages in thread From: Tim Chen @ 2013-09-30 22:03 UTC (permalink / raw) To: Waiman Long Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Mon, 2013-09-30 at 12:47 -0700, Tim Chen wrote: > On Mon, 2013-09-30 at 15:35 -0400, Waiman Long wrote: > > On 09/30/2013 03:23 PM, Tim Chen wrote: > > > On Mon, 2013-09-30 at 20:14 +0200, Peter Zijlstra wrote: > > >> On Mon, Sep 30, 2013 at 10:10:27AM -0700, Tim Chen wrote: > > >>> Here's the exim workload data: > > >>> > > >>> rwsem improvment: > > >>> Waimain's patch: +2.0% > > >>> Alex+Tim's patchset: +4.8% > > >>> Waiman+Alex+Tim: +5.3% > > >>> > > >>> convert rwsem to rwlock_t for root anon_vma lock > > >>> Ingo's patch +11.7% > > >>> > > >> What happens if you stuff Waiman's qrwlock patches on top of that? > > >> admittedly and oft mentioned in this thread, our current rwlock_t is > > >> somewhat suboptimal under a number of conditions. > > > I've tested with Waiman's qrwlock patches on top of Ingo's patches. > > > It does not affect the throughput for exim and I still get > > > about +11.7% throughput change (same as with Ingo's patch only). > > > > > > Tim > > > > > > > My qrwlock doesn't enable qrwlock by default. You have to use menuconfig > > to explicitly enable it. Have you done that when you build the test > > kernel? I am thinking of explicitly enabling it for x86 if the anon-vma > > lock is converted back to a rwlock. > > > > Yes, I have explicitly enabled it during my testing. > The workload I have is dominated by writer locks, with only very occasional readers. So it may not benefit from the various tweaks you have put in qrwlock. Tim ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-30 19:47 ` Tim Chen 2013-09-30 22:03 ` Tim Chen @ 2013-10-01 2:41 ` Waiman Long 1 sibling, 0 replies; 55+ messages in thread From: Waiman Long @ 2013-10-01 2:41 UTC (permalink / raw) To: Tim Chen Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On 09/30/2013 03:47 PM, Tim Chen wrote: >> My qrwlock doesn't enable qrwlock by default. You have to use menuconfig >> to explicitly enable it. Have you done that when you build the test >> kernel? I am thinking of explicitly enabling it for x86 if the anon-vma >> lock is converted back to a rwlock. >> > Yes, I have explicitly enabled it during my testing. > > Thanks. > Tim > Thank for the info. I had tested Ingo's 2nd patch myself with the qrwlock patch on a 8-node machine with a 3.12.0-rc2 kernel. The results of AIM7's high_systime (at 1500 users) were: Anon-vmas lock JPM %Change -------------- --- ------- rwsem 148265 - rwlock 238715 +61% qrwlock 242048 +63% So the queue rwlock was only slightly faster in this case. Below were the perf profile with rwlock: 18.20% reaim [kernel.kallsyms] [k] __write_lock_failed 9.36% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave 2.91% reaim [kernel.kallsyms] [k] mspin_lock 2.73% reaim [kernel.kallsyms] [k] anon_vma_interval_tree_insert 2.23% ls [kernel.kallsyms] [k] _raw_spin_lock_irqsave 1.29% reaim [kernel.kallsyms] [k] __read_lock_failed 1.21% true [kernel.kallsyms] [k] _raw_spin_lock_irqsave 1.14% reaim [kernel.kallsyms] [k] zap_pte_range 1.13% reaim [kernel.kallsyms] [k] _raw_spin_lock 1.04% reaim [kernel.kallsyms] [k] mutex_spin_on_owner The perf profile with qrwlock: 10.57% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave 7.98% reaim [kernel.kallsyms] [k] queue_write_lock_slowpath 5.83% reaim [kernel.kallsyms] [k] mspin_lock 2.86% ls [kernel.kallsyms] [k] _raw_spin_lock_irqsave 2.71% reaim [kernel.kallsyms] [k] anon_vma_interval_tree_insert 1.52% true [kernel.kallsyms] [k] _raw_spin_lock_irqsave 1.51% reaim [kernel.kallsyms] [k] queue_read_lock_slowpath 1.35% reaim [kernel.kallsyms] [k] mutex_spin_on_owner 1.12% reaim [kernel.kallsyms] [k] zap_pte_range 1.06% reaim [kernel.kallsyms] [k] perf_event_aux_ctx 1.01% reaim [kernel.kallsyms] [k] perf_event_aux In the qrwlock kernel, less time were spent in the rwlock slowpath path (about half). However, more time were now spent in the spinlock and mutex spinning. Another observation is that no noticeable idle time was reported whereas the system could be half idle with rwsem. There was also a lot less idle balancing activities. The qrwlock is fair wrt the writers. So its performance may not be as good as the fully unfair rwlock. However, queuing reduces a lot of cache contention traffic, thus improving performance. It is the interplay of these 2 factors that determine how much performance benefit we can see. Another factor to consider is that when we have less contention in anon-vmas, other areas of contentions will show up. With qrwlock, the spinlock contention was: 10.57% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave |--58.70%-- release_pages |--38.42%-- pagevec_lru_move_fn |--0.64%-- get_page_from_freelist |--0.64%-- __page_cache_release --1.60%-- [...] 
2.86% ls [kernel.kallsyms] [k] _raw_spin_lock_irqsave |--52.73%-- pagevec_lru_move_fn |--46.25%-- release_pages --1.02%-- [...] 1.52% true [kernel.kallsyms] [k] _raw_spin_lock_irqsave |--53.76%-- pagevec_lru_move_fn |--43.95%-- release_pages |--1.02%-- __page_cache_release -Longman ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-28 19:37 ` [PATCH] anon_vmas: Convert the rwsem to an rwlock_t Ingo Molnar 2013-09-28 19:43 ` Linus Torvalds @ 2013-09-30 8:52 ` Andrea Arcangeli 2013-09-30 14:40 ` Jerome Glisse ` (2 more replies) 1 sibling, 3 replies; 55+ messages in thread From: Andrea Arcangeli @ 2013-09-30 8:52 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J, Haggai Eran, Sagi Grimberg, Or Gerlitz, Jerome Glisse Hi everyone, On Sat, Sep 28, 2013 at 09:37:39PM +0200, Ingo Molnar wrote: > > * Ingo Molnar <mingo@kernel.org> wrote: > > > If we do that then I suspect the next step will be queued rwlocks :-/ > > The current rwlock_t implementation is rather primitive by modern > > standards. (We'd probably have killed rwlock_t long ago if not for the > > tasklist_lock.) > > > > But yeah, it would work and conceptually a hard spinlock fits something > > as lowlevel as the anon-vma lock. > > > > I did a quick review pass and it appears nothing obvious is scheduling > > with the anon-vma lock held. If it did in a non-obvious way it's likely > > a bug anyway. The hugepage code grew a lot of logic running under the > > anon-vma lock, but it all seems atomic. > > > > So a conversion to rwlock_t could be attempted. (It should be relatively > > easy patch as well, because the locking operation is now nicely > > abstracted out.) > > Here's a totally untested patch to convert the anon vma lock to an > rwlock_t. Sorry having to break the party but the sleepable locks for anon_vma and i_mmap_mutex are now requirement for the "pageable RDMA" effort recently achieved upstream by mellanox with the MMU notifier. And as far as I can tell that's the only single good reason for why those locks shouldn't be spinlocks (otherwise I would have also complianed at the time of that conversion, the original regression was known, ask Andi). After the lock conversion it took a while to fix all other minor bits to make mmu notifier methods fully sleepable. The problem with the spinlocks is that in the rmap code (like try_to_unmap) we need to call mmu_notifier_invalidate_page with an "mm" as parameter, and the callee assumes the "mm" won't go away under it. The other second requirement is that the page cannot be freed until we call the mmu_notifier_invalidate_page (secondary MMU is ok to still access the page after the linux pte has been dropped and the TLB flushed). In the rmap code the only things that keep things afloat is either the anon_vma lock or the i_mmap_mutex so it is quite tricky to drop that lock while keeping "mm" and "page" both afloat for the invalidate post-anon-vma-unlock. Maybe there are ways to makes that safe? (reference counting, trylocking the mmap_sem). But there isn't a very strightforward way to do that. It isn't just mellanox drivers: originally SGI XPMEM driver also needed to schedule in those methods (then they figured how to get away in only scheduling in the mmu_notifier range calls but I suppose they prefer to be able to schedule in all invalidate methods including mmu_notifier_invalidate_page). 
Nvidia is now also going to use mmu notifier to allow the GPU (acting as a secondary MMU with pagetables) to access main memory without requiring RAM pinning and without disabling all VM features on the graphics memory (and without GART physical). Probably they don't need to schedule but I CC'ed Jerome just in case they need (as that would add one more relevant user of the feature). As far as KVM is concerned, there is no benefit in scheduling in the methods. The KVM mmu notifier invalidate consists in zeroing out one or more sptes and sending an IPI to flush the guest TLBs if others vcpus are running in other cpus. It's a mechanism pretty much identical to the one used by the primary MMU and it only requires irqs enabled to avoid deadlocks in case of cross-IPIs. ^ permalink raw reply [flat|nested] 55+ messages in thread
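To make that constraint concrete, here is a rough, flattened sketch of the anon side of the rmap unmap path (call sites from try_to_unmap_anon() and try_to_unmap_one() are compressed together and declarations are omitted, so treat it as an illustration rather than actual kernel code): the notifier runs while the rmap lock is still held, because that lock is what keeps both the mm and the page stable at that point, and once it becomes a spinning lock the notifier callee may no longer sleep.

anon_vma_lock_read(anon_vma);			/* pins the vma list, and with it the mm */
pte = page_check_address(page, mm, address, &ptl, 0);
pteval = ptep_clear_flush(vma, address, pte);	/* zap the linux pte and flush the TLB */
pte_unmap_unlock(pte, ptl);
mmu_notifier_invalidate_page(mm, address);	/* sleeping RDMA notifiers hook in here */
anon_vma_unlock_read(anon_vma);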
* Re: [PATCH] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-30 8:52 ` [PATCH] " Andrea Arcangeli @ 2013-09-30 14:40 ` Jerome Glisse 2013-09-30 16:26 ` Linus Torvalds 2013-09-30 18:21 ` Andi Kleen 2 siblings, 0 replies; 55+ messages in thread From: Jerome Glisse @ 2013-09-30 14:40 UTC (permalink / raw) To: Andrea Arcangeli Cc: Ingo Molnar, Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J, Haggai Eran, Sagi Grimberg, Or Gerlitz On Mon, Sep 30, 2013 at 10:52:43AM +0200, Andrea Arcangeli wrote: > Hi everyone, > > On Sat, Sep 28, 2013 at 09:37:39PM +0200, Ingo Molnar wrote: > > > > * Ingo Molnar <mingo@kernel.org> wrote: > > > > > If we do that then I suspect the next step will be queued rwlocks :-/ > > > The current rwlock_t implementation is rather primitive by modern > > > standards. (We'd probably have killed rwlock_t long ago if not for the > > > tasklist_lock.) > > > > > > But yeah, it would work and conceptually a hard spinlock fits something > > > as lowlevel as the anon-vma lock. > > > > > > I did a quick review pass and it appears nothing obvious is scheduling > > > with the anon-vma lock held. If it did in a non-obvious way it's likely > > > a bug anyway. The hugepage code grew a lot of logic running under the > > > anon-vma lock, but it all seems atomic. > > > > > > So a conversion to rwlock_t could be attempted. (It should be relatively > > > easy patch as well, because the locking operation is now nicely > > > abstracted out.) > > > > Here's a totally untested patch to convert the anon vma lock to an > > rwlock_t. > > Sorry having to break the party but the sleepable locks for anon_vma > and i_mmap_mutex are now requirement for the "pageable RDMA" effort > recently achieved upstream by mellanox with the MMU notifier. > > And as far as I can tell that's the only single good reason for why > those locks shouldn't be spinlocks (otherwise I would have also > complianed at the time of that conversion, the original regression was > known, ask Andi). After the lock conversion it took a while to fix all > other minor bits to make mmu notifier methods fully sleepable. > > The problem with the spinlocks is that in the rmap code (like > try_to_unmap) we need to call mmu_notifier_invalidate_page with an > "mm" as parameter, and the callee assumes the "mm" won't go away under > it. The other second requirement is that the page cannot be freed > until we call the mmu_notifier_invalidate_page (secondary MMU is ok to > still access the page after the linux pte has been dropped and the TLB > flushed). > > In the rmap code the only things that keep things afloat is either the > anon_vma lock or the i_mmap_mutex so it is quite tricky to drop that > lock while keeping "mm" and "page" both afloat for the invalidate > post-anon-vma-unlock. > > Maybe there are ways to makes that safe? (reference counting, > trylocking the mmap_sem). But there isn't a very strightforward way to > do that. > > It isn't just mellanox drivers: originally SGI XPMEM driver also > needed to schedule in those methods (then they figured how to get away > in only scheduling in the mmu_notifier range calls but I suppose they > prefer to be able to schedule in all invalidate methods including > mmu_notifier_invalidate_page). 
> > Nvidia is now also going to use mmu notifier to allow the GPU (acting > as a secondary MMU with pagetables) to access main memory without > requiring RAM pinning and without disabling all VM features on the > graphics memory (and without GART physical). Probably they don't need > to schedule but I CC'ed Jerome just in case they need (as that would > add one more relevant user of the feature). Thanks for the cc if it was on mm mailing list i would have seen it but i am way behind on my lkml folder. Yes right now we have to sleep in mmu_notifier_invalidate_page but it's somewhat solvable. The mkclean path is solvable more or less easily but the try_to_unmap do have to sleep and that's bad. Especialy the shrinker path. My plan is to provide informations down the mmu notifier API and fails quickly on shrinker path. For shrinking i am not sure if i should just rely on shrinker API to force device or not. It's not something i want to tackle in first patchset anyway. I am hoping to be able to send patchset in next few weeks. But the sleeping inside invalidate page is definitly on my WHATTHEHECKLIST. Cheers, Jerome > > As far as KVM is concerned, there is no benefit in scheduling in the > methods. The KVM mmu notifier invalidate consists in zeroing out one > or more sptes and sending an IPI to flush the guest TLBs if others > vcpus are running in other cpus. It's a mechanism pretty much > identical to the one used by the primary MMU and it only requires irqs > enabled to avoid deadlocks in case of cross-IPIs. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-30 8:52 ` [PATCH] " Andrea Arcangeli 2013-09-30 14:40 ` Jerome Glisse @ 2013-09-30 16:26 ` Linus Torvalds 2013-09-30 19:16 ` Andrea Arcangeli 2013-09-30 18:21 ` Andi Kleen 2 siblings, 1 reply; 55+ messages in thread From: Linus Torvalds @ 2013-09-30 16:26 UTC (permalink / raw) To: Andrea Arcangeli Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J, Haggai Eran, Sagi Grimberg, Or Gerlitz, Jerome Glisse On Mon, Sep 30, 2013 at 1:52 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > Sorry having to break the party but the sleepable locks for anon_vma > and i_mmap_mutex are now requirement for the "pageable RDMA" effort > recently achieved upstream by mellanox with the MMU notifier. I'll happily break that. No way will some random crazy RDMA be a reason why we'd drop this kind of improvement on actual loads that matter way more. Linus ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-30 16:26 ` Linus Torvalds @ 2013-09-30 19:16 ` Andrea Arcangeli 0 siblings, 0 replies; 55+ messages in thread From: Andrea Arcangeli @ 2013-09-30 19:16 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J, Haggai Eran, Sagi Grimberg, Or Gerlitz, Jerome Glisse On Mon, Sep 30, 2013 at 09:26:21AM -0700, Linus Torvalds wrote: > On Mon, Sep 30, 2013 at 1:52 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > > > Sorry having to break the party but the sleepable locks for anon_vma > > and i_mmap_mutex are now requirement for the "pageable RDMA" effort > > recently achieved upstream by mellanox with the MMU notifier. > > I'll happily break that. Unless a solution is found that could allow to invalidate secondary MMUs with a spinlock/rwlock for anon_vma->lock/i_mmap_lock, would it be acceptable to switch between spinlock/rwlock mutex/rwsem through a config option? option CONFIG_SLEEPABLE_RMAP, implicitly selected by CONFIG_SLEEPABLE_MMU_NOTIFIER, in turn selected by the RDMA and nvidia drivers if they're built (ideally nvidia drivers will figure out how to avoid scheduling). I mean it only requires a wrapping header file, aside from the header file it wouldn't be a much bigger patch than the one posted already. It would be much easier to switch between spinning and sleeping locks to keep benchmarking different scenarios too. ^ permalink raw reply [flat|nested] 55+ messages in thread
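A minimal sketch of the kind of wrapping header being suggested, assuming the CONFIG_SLEEPABLE_RMAP name proposed above (the whole snippet is hypothetical: the struct anon_vma field would change type under the same #ifdef, and the read-side and trylock helpers would follow the same pattern):

#ifdef CONFIG_SLEEPABLE_RMAP
static inline void anon_vma_lock_write(struct anon_vma *anon_vma)
{
	down_write(&anon_vma->root->rwsem);
}
static inline void anon_vma_unlock_write(struct anon_vma *anon_vma)
{
	up_write(&anon_vma->root->rwsem);
}
#else
static inline void anon_vma_lock_write(struct anon_vma *anon_vma)
{
	write_lock(&anon_vma->root->rwlock);
}
static inline void anon_vma_unlock_write(struct anon_vma *anon_vma)
{
	write_unlock(&anon_vma->root->rwlock);
}
#endif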
* Re: [PATCH] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-30 8:52 ` [PATCH] " Andrea Arcangeli 2013-09-30 14:40 ` Jerome Glisse 2013-09-30 16:26 ` Linus Torvalds @ 2013-09-30 18:21 ` Andi Kleen 2013-09-30 19:04 ` Jerome Glisse 2 siblings, 1 reply; 55+ messages in thread From: Andi Kleen @ 2013-09-30 18:21 UTC (permalink / raw) To: Andrea Arcangeli Cc: Ingo Molnar, Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J, Haggai Eran, Sagi Grimberg, Or Gerlitz, Jerome Glisse > Nvidia is now also going to use mmu notifier to allow the GPU (acting > as a secondary MMU with pagetables) to access main memory without That's a good point. iirc AMD GPUs are also going into the same direction, and likely more. So we probably need to support that usage. Too bad :-/ -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-30 18:21 ` Andi Kleen @ 2013-09-30 19:04 ` Jerome Glisse 0 siblings, 0 replies; 55+ messages in thread From: Jerome Glisse @ 2013-09-30 19:04 UTC (permalink / raw) To: Andi Kleen Cc: Andrea Arcangeli, Ingo Molnar, Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Chandramouleeswaran, Aswin, Norton, Scott J, Haggai Eran, Sagi Grimberg, Or Gerlitz On Mon, Sep 30, 2013 at 08:21:12PM +0200, Andi Kleen wrote: > > Nvidia is now also going to use mmu notifier to allow the GPU (acting > > as a secondary MMU with pagetables) to access main memory without > > That's a good point. iirc AMD GPUs are also going into > the same direction, and likely more. So we probably need > to support that usage. > > Too bad :-/ > > -Andi In all fairness, current AMD solution is hardware only, you can see the code in drivers/iommu/amd_iommu_v2.c but they might be interested in the software solution for added features like abilities to move process memory to video memory and migrating it back when CPU access it. Others too might be interested (Intel, infiniband, or any driver with an mmu that can do pagefault and has couple others features). Cheers, Jerome ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-28 18:55 ` Linus Torvalds 2013-09-28 19:13 ` Andi Kleen 2013-09-28 19:21 ` Ingo Molnar @ 2013-09-29 23:06 ` Davidlohr Bueso 2013-09-29 23:26 ` Linus Torvalds 2013-09-30 1:11 ` Michel Lespinasse 2013-09-30 10:46 ` Peter Zijlstra 4 siblings, 1 reply; 55+ messages in thread From: Davidlohr Bueso @ 2013-09-29 23:06 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sat, 2013-09-28 at 11:55 -0700, Linus Torvalds wrote: > On Sat, Sep 28, 2013 at 12:41 AM, Ingo Molnar <mingo@kernel.org> wrote: > > > > > > Yeah, I fully agree. The reason I'm still very sympathetic to Tim's > > efforts is that they address a regression caused by a mechanic > > mutex->rwsem conversion: > > > > 5a505085f043 mm/rmap: Convert the struct anon_vma::mutex to an rwsem > > > > ... and Tim's patches turn that regression into an actual speedup. > > Btw, I really hate that thing. I think we should turn it back into a > spinlock. None of what it protects needs a mutex or an rwsem. The same should apply to i_mmap_mutex, having a similar responsibility to the anon-vma lock with file backed pages. A few months ago I had suggested changing that lock to rwsem, giving some pretty reasonable performance improvement numbers. http://lwn.net/Articles/556342/ > > Because you guys talk about the regression of turning it into a rwsem, > but nobody talks about the *original* regression. > > And it *used* to be a spinlock, and it was changed into a mutex back > in 2011 by commit 2b575eb64f7a. That commit doesn't even have a reason > listed for it, although my dim memory of it is that the reason was > preemption latency. > > And that caused big regressions too. After testing Ingo's anon-vma rwlock_t conversion (v2) on a 8 socket, 80 core system with aim7, I am quite surprised about the numbers - considering the lack of queuing in rwlocks. A lot of the tests didn't show hardly any difference, but those that really contend this lock (with high amounts of users) benefited quite nicely: Alltests: +28% throughput after 1000 users and runtime was reduced from 7.2 to 6.6 secs. Custom: +61% throughput after 100 users and runtime was reduced from 7 to 4.9 secs. High_systime: +40% throughput after 1000 users and runtime was reduced from 19 to 15.5 secs. Shared: +30.5% throughput after 100 users and runtime was reduced from 6.5 to 5.1 secs. Short: Lots of variance in the numbers, but avg of +29% throughput - no particular performance degradation either. The only workload that took a hit was fserver with a -6% throughput for small amounts of users (10-100). Theoretically at least, adding queuing to the rwlocks should only help the situation furthermore. I'll also run some test for a similar conversion for the i_mmap mutex next week. Going back to the rwsems, I believe adding optimistic spinning is still valid, independent of the anon-vma lock as it benefits all users. Thanks, Davidlohr ^ permalink raw reply [flat|nested] 55+ messages in thread
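Purely as an illustration of the parallel being drawn (the helper and field names below are invented, not an existing API in this tree), the file-backed side could be wrapped the same way the anon_vma helpers are:

/* hypothetical helpers mirroring anon_vma_lock_write()/anon_vma_unlock_write() */
static inline void i_mmap_lock_write(struct address_space *mapping)
{
	write_lock(&mapping->i_mmap_rwlock);	/* assumed field replacing i_mmap_mutex */
}

static inline void i_mmap_unlock_write(struct address_space *mapping)
{
	write_unlock(&mapping->i_mmap_rwlock);
}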
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-29 23:06 ` [PATCH] rwsem: reduce spinlock contention in wakeup code path Davidlohr Bueso @ 2013-09-29 23:26 ` Linus Torvalds 2013-09-30 0:40 ` Davidlohr Bueso 2013-09-30 6:57 ` Ingo Molnar 0 siblings, 2 replies; 55+ messages in thread From: Linus Torvalds @ 2013-09-29 23:26 UTC (permalink / raw) To: Davidlohr Bueso Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sun, Sep 29, 2013 at 4:06 PM, Davidlohr Bueso <davidlohr@hp.com> wrote: >> >> Btw, I really hate that thing. I think we should turn it back into a >> spinlock. None of what it protects needs a mutex or an rwsem. > > The same should apply to i_mmap_mutex, having a similar responsibility > to the anon-vma lock with file backed pages. A few months ago I had > suggested changing that lock to rwsem, giving some pretty reasonable > performance improvement numbers. > > http://lwn.net/Articles/556342/ Ok, that's pretty convincing too. Side note: are you sure that the i_mmap_mutex needs to be a sleeping lock at all? It's documented to nest outside the anon_vma->rwsem, so as long as that is a sleeping lock, the i_mmap_mutex needs to be one too, but looking at the actual users, most of them seem to be *very* similar to the anon_vma->rwsem users. It is a very close cousin to the anon_vma->rwsem, after all (just for file-backed pages rather than anonymous ones). No? I dunno. Maybe the ranges are too big and it really has latency issues, the few I looked at looked like fairly trivial interval-tree operations, though. And your numbers for Ingo's patch: > After testing Ingo's anon-vma rwlock_t conversion (v2) on a 8 socket, 80 > core system with aim7, I am quite surprised about the numbers - > considering the lack of queuing in rwlocks. A lot of the tests didn't > show hardly any difference, but those that really contend this lock > (with high amounts of users) benefited quite nicely: > > Alltests: +28% throughput after 1000 users and runtime was reduced from > 7.2 to 6.6 secs. > > Custom: +61% throughput after 100 users and runtime was reduced from 7 > to 4.9 secs. > > High_systime: +40% throughput after 1000 users and runtime was reduced > from 19 to 15.5 secs. > > Shared: +30.5% throughput after 100 users and runtime was reduced from > 6.5 to 5.1 secs. > > Short: Lots of variance in the numbers, but avg of +29% throughput - no > particular performance degradation either. Are just overwhelming, in my opinion. The conversion *from* a spinlock never had this kind of support behind it. Btw, did anybody run Ingo's patch with lockdep and the spinlock sleep debugging code to verify that we haven't introduced any problems wrt sleeping since the lock was converted into a rw-semaphore? Because quite frankly, considering these kinds of numbers, I really don't see how we could possibly make excuses for keeping that rw-semaphore unless there is some absolutely _horrible_ latency issue? Linus ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-29 23:26 ` Linus Torvalds @ 2013-09-30 0:40 ` Davidlohr Bueso 2013-09-30 0:51 ` Linus Torvalds 2013-09-30 6:57 ` Ingo Molnar 1 sibling, 1 reply; 55+ messages in thread From: Davidlohr Bueso @ 2013-09-30 0:40 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sun, 2013-09-29 at 16:26 -0700, Linus Torvalds wrote: > On Sun, Sep 29, 2013 at 4:06 PM, Davidlohr Bueso <davidlohr@hp.com> wrote: > >> > >> Btw, I really hate that thing. I think we should turn it back into a > >> spinlock. None of what it protects needs a mutex or an rwsem. > > > > The same should apply to i_mmap_mutex, having a similar responsibility > > to the anon-vma lock with file backed pages. A few months ago I had > > suggested changing that lock to rwsem, giving some pretty reasonable > > performance improvement numbers. > > > > http://lwn.net/Articles/556342/ > > Ok, that's pretty convincing too. > > Side note: are you sure that the i_mmap_mutex needs to be a sleeping > lock at all? It's documented to nest outside the anon_vma->rwsem, so > as long as that is a sleeping lock, the i_mmap_mutex needs to be one > too, but looking at the actual users, most of them seem to be *very* > similar to the anon_vma->rwsem users. It is a very close cousin to the > anon_vma->rwsem, after all (just for file-backed pages rather than > anonymous ones). No? Right, I brought that up exactly because it is so similar to the anon-vma lock. Both should really be the same type, whether it be rwsem or rwlock. Hopefully the rwlock conversion tests will also benefit just like Ingo's patch did, I should have numbers early next week. > > I dunno. Maybe the ranges are too big and it really has latency > issues, the few I looked at looked like fairly trivial interval-tree > operations, though. > > And your numbers for Ingo's patch: > > > After testing Ingo's anon-vma rwlock_t conversion (v2) on a 8 socket, 80 > > core system with aim7, I am quite surprised about the numbers - > > considering the lack of queuing in rwlocks. A lot of the tests didn't > > show hardly any difference, but those that really contend this lock > > (with high amounts of users) benefited quite nicely: > > > > Alltests: +28% throughput after 1000 users and runtime was reduced from > > 7.2 to 6.6 secs. > > > > Custom: +61% throughput after 100 users and runtime was reduced from 7 > > to 4.9 secs. > > > > High_systime: +40% throughput after 1000 users and runtime was reduced > > from 19 to 15.5 secs. > > > > Shared: +30.5% throughput after 100 users and runtime was reduced from > > 6.5 to 5.1 secs. > > > > Short: Lots of variance in the numbers, but avg of +29% throughput - no > > particular performance degradation either. > > Are just overwhelming, in my opinion. The conversion *from* a spinlock > never had this kind of support behind it. > > Btw, did anybody run Ingo's patch with lockdep and the spinlock sleep > debugging code to verify that we haven't introduced any problems wrt > sleeping since the lock was converted into a rw-semaphore? 
Hmm, I'm getting the following at bootup:

=============================================
[ INFO: possible recursive locking detected ]
3.12.0-rc2+ #7 Not tainted
---------------------------------------------
qemu-kvm/64239 is trying to acquire lock:
 (&anon_vma->rwlock){+.+...}, at: [<ffffffff8115fba6>] mm_take_all_locks+0x146/0x190

but task is already holding lock:
 (&anon_vma->rwlock){+.+...}, at: [<ffffffff8115fba6>] mm_take_all_locks+0x146/0x190

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&anon_vma->rwlock);
  lock(&anon_vma->rwlock);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

4 locks held by qemu-kvm/64239:
 #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8117c25c>] do_mmu_notifier_register+0x13c/0x180
 #1:  (mm_all_locks_mutex){+.+...}, at: [<ffffffff8115fa9b>] mm_take_all_locks+0x3b/0x190
 #2:  (&mapping->i_mmap_mutex){+.+...}, at: [<ffffffff8115fb26>] mm_take_all_locks+0xc6/0x190
 #3:  (&anon_vma->rwlock){+.+...}, at: [<ffffffff8115fba6>] mm_take_all_locks+0x146/0x190

stack backtrace:
CPU: 51 PID: 64239 Comm: qemu-kvm Not tainted 3.12.0-rc2+ #7
Hardware name: HP ProLiant DL980 G7, BIOS P66 06/24/2011
 ffff8b9f79294a80 ffff8a9f7c375c08 ffffffff815802dc 0000000000000003
 ffff8b9f79294100 ffff8a9f7c375c38 ffffffff810bda34 ffff8b9f79294100
 ffff8b9f79294a80 ffff8b9f79294100 0000000000000000 ffff8a9f7c375c98
Call Trace:
 [<ffffffff815802dc>] dump_stack+0x49/0x5d
 [<ffffffff810bda34>] print_deadlock_bug+0xf4/0x100
 [<ffffffff810bf864>] validate_chain+0x504/0x7a0
 [<ffffffff810bfe0d>] __lock_acquire+0x30d/0x590
 [<ffffffff8115fba6>] ? mm_take_all_locks+0x146/0x190
 [<ffffffff810c0130>] lock_acquire+0xa0/0x110
 [<ffffffff8115fba6>] ? mm_take_all_locks+0x146/0x190
 [<ffffffff810bebcd>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff81585861>] _raw_write_lock+0x31/0x40
 [<ffffffff8115fba6>] ? mm_take_all_locks+0x146/0x190
 [<ffffffff8115fba6>] mm_take_all_locks+0x146/0x190
 [<ffffffff8117c196>] do_mmu_notifier_register+0x76/0x180
 [<ffffffff8117c2d3>] mmu_notifier_register+0x13/0x20
 [<ffffffffa0365f5c>] kvm_create_vm+0x22c/0x300 [kvm]
 [<ffffffffa03660e8>] kvm_dev_ioctl+0xb8/0x140 [kvm]
 [<ffffffff811a4569>] do_vfs_ioctl+0x89/0x350
 [<ffffffff8158e9b7>] ? sysret_check+0x1b/0x56
 [<ffffffff811a48d1>] SyS_ioctl+0xa1/0xb0

This is probably due to the change in vm_lock_anon_vma():

-	down_write_nest_lock(&anon_vma->root->rwsem, &mm->mmap_sem);
+	write_lock(&anon_vma->root->rwlock);

^ permalink raw reply	[flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-30 0:40 ` Davidlohr Bueso @ 2013-09-30 0:51 ` Linus Torvalds 0 siblings, 0 replies; 55+ messages in thread From: Linus Torvalds @ 2013-09-30 0:51 UTC (permalink / raw) To: Davidlohr Bueso Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sun, Sep 29, 2013 at 5:40 PM, Davidlohr Bueso <davidlohr@hp.com> wrote: > > Hmm, I'm getting the following at bootup: > > May be due to missing lock nesting notation Yes it is. And that reminds me of a problem I think we had with this code: we had a possible case of the preemption counter nesting too deeply. I forget the details, but it was something people worried about. That mm_take_all_locks() thing is really special, and I suspect that if we go down this way we should just do a single preempt-disable and then use the arch_write_lock() to avoid both the lockdep splat _and_ the preemption counter overflow. Linus ^ permalink raw reply [flat|nested] 55+ messages in thread
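For concreteness, a rough sketch of what that suggestion could look like - illustrative only, not a tested patch. It assumes the converted anon_vma lock is an rwlock_t field named rwlock (as in the lockdep report above), and that the caller, mm_take_all_locks(), does a single preempt_disable() for the whole batch and still skips roots it has already locked:

static inline void vm_lock_anon_vma_raw(struct anon_vma *anon_vma)
{
	/*
	 * Raw arch primitive: no lockdep class is acquired, so the nested
	 * anon_vma locks cannot trigger the "possible recursive locking"
	 * splat, and there is no per-lock preempt_disable() to overflow
	 * the preempt counter when many locks are held at once.
	 */
	arch_write_lock(&anon_vma->root->rwlock.raw_lock);
}

static inline void vm_unlock_anon_vma_raw(struct anon_vma *anon_vma)
{
	arch_write_unlock(&anon_vma->root->rwlock.raw_lock);
}

The trade-off is that the whole mm_take_all_locks()/mm_drop_all_locks() window then runs with preemption off, which is the latency question raised elsewhere in the thread.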
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-29 23:26 ` Linus Torvalds 2013-09-30 0:40 ` Davidlohr Bueso @ 2013-09-30 6:57 ` Ingo Molnar 1 sibling, 0 replies; 55+ messages in thread From: Ingo Molnar @ 2013-09-30 6:57 UTC (permalink / raw) To: Linus Torvalds Cc: Davidlohr Bueso, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Linus Torvalds <torvalds@linux-foundation.org> wrote: > [...] > > And your numbers for Ingo's patch: > > > After testing Ingo's anon-vma rwlock_t conversion (v2) on a 8 socket, > > 80 core system with aim7, I am quite surprised about the numbers - > > considering the lack of queuing in rwlocks. A lot of the tests didn't > > show hardly any difference, but those that really contend this lock > > (with high amounts of users) benefited quite nicely: > > > > Alltests: +28% throughput after 1000 users and runtime was reduced from > > 7.2 to 6.6 secs. > > > > Custom: +61% throughput after 100 users and runtime was reduced from 7 > > to 4.9 secs. > > > > High_systime: +40% throughput after 1000 users and runtime was reduced > > from 19 to 15.5 secs. > > > > Shared: +30.5% throughput after 100 users and runtime was reduced from > > 6.5 to 5.1 secs. > > > > Short: Lots of variance in the numbers, but avg of +29% throughput - > > no particular performance degradation either. > > Are just overwhelming, in my opinion. The conversion *from* a spinlock > never had this kind of support behind it. Agreed. Especially given how primitive rwlock_t is especially on 80 cores, this is really a no-brainer conversion. I have to say I am surprised by the numbers - after so many years it's still amazing how powerful the "get work done and don't interrupt it" batching concept is in computing... > Btw, did anybody run Ingo's patch with lockdep and the spinlock sleep > debugging code to verify that we haven't introduced any problems wrt > sleeping since the lock was converted into a rw-semaphore? > > Because quite frankly, considering these kinds of numbers, I really > don't see how we could possibly make excuses for keeping that > rw-semaphore unless there is some absolutely _horrible_ latency issue? Given that there's only about a dozen critical sections that this lock covers I simply cannot imagine any latency problem that couldn't be fixed in some other fashion. (shrinking the critical section, breaking up a bad loop, etc.) [ Btw., if PREEMPT_RT goes upstream we might not even need to break latencies all that much: people whose usecase values scheduling latency above throughput would run such a critical section preemptible anyway. ] Thanks, Ingo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-28 18:55 ` Linus Torvalds ` (2 preceding siblings ...) 2013-09-29 23:06 ` [PATCH] rwsem: reduce spinlock contention in wakeup code path Davidlohr Bueso @ 2013-09-30 1:11 ` Michel Lespinasse 2013-09-30 7:05 ` Ingo Molnar 2013-09-30 10:46 ` Peter Zijlstra 4 siblings, 1 reply; 55+ messages in thread From: Michel Lespinasse @ 2013-09-30 1:11 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sat, Sep 28, 2013 at 11:55 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > Btw, I really hate that thing. I think we should turn it back into a > spinlock. None of what it protects needs a mutex or an rwsem. > > Because you guys talk about the regression of turning it into a rwsem, > but nobody talks about the *original* regression. > > And it *used* to be a spinlock, and it was changed into a mutex back > in 2011 by commit 2b575eb64f7a. That commit doesn't even have a reason > listed for it, although my dim memory of it is that the reason was > preemption latency. I was wondering about that too. Regarding latencies, we used to have unbounded latencies for anon_vma operations as the AVC chains could get long under some workloads; now that we index the VMAs matching a given anon_vma with an interval tree this particular source of latencies should be gone. So yes, it could be worth trying to go back to a non-sleeping lock. That said, I am very scared of using rwlock_t here, and I would much prefer we choose a fair lock (either spinlock or a new rwlock implementation which guarantees not to starve any locker thread) -- Michel Lespinasse A program is never fully debugged until the last user dies. ^ permalink raw reply [flat|nested] 55+ messages in thread
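To make that starvation worry concrete, an illustrative interleaving (not taken from any real trace) of how the current reader-preference rwlock_t behaves under a steady stream of readers:

/*
 * rwlock_t grants new readers even while a writer is spinning, so
 * overlapping readers can keep the reader count above zero forever:
 *
 *   readers (several CPUs)            writer (one CPU)
 *   ----------------------            ----------------
 *   read_lock()    <- granted
 *                                     write_lock()   <- spins
 *   read_lock()    <- granted
 *   read_unlock()
 *   read_lock()    <- granted
 *   read_unlock()
 *   ...                               ... still spinning ...
 *
 * A fair (queued) rwlock would instead make the later readers wait
 * behind the pending writer.
 */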
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-30 1:11 ` Michel Lespinasse @ 2013-09-30 7:05 ` Ingo Molnar 2013-09-30 16:03 ` Waiman Long 0 siblings, 1 reply; 55+ messages in thread From: Ingo Molnar @ 2013-09-30 7:05 UTC (permalink / raw) To: Michel Lespinasse Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Michel Lespinasse <walken@google.com> wrote: > That said, I am very scared of using rwlock_t here, and I would much > prefer we choose a fair lock (either spinlock or a new rwlock > implementation which guarantees not to starve any locker thread) Given how few users rwlock_t has today we could attempt to make it reader-writer fair, as long as the usecase of a hardirq or softirq context always getting nested read access on the same CPU is preserved. (but that can be done - if nothing else then with an explicit context check.) Thanks, Ingo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-30 7:05 ` Ingo Molnar @ 2013-09-30 16:03 ` Waiman Long 0 siblings, 0 replies; 55+ messages in thread From: Waiman Long @ 2013-09-30 16:03 UTC (permalink / raw) To: Ingo Molnar Cc: Michel Lespinasse, Linus Torvalds, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On 09/30/2013 03:05 AM, Ingo Molnar wrote: > * Michel Lespinasse<walken@google.com> wrote: > >> That said, I am very scared of using rwlock_t here, and I would much >> prefer we choose a fair lock (either spinlock or a new rwlock >> implementation which guarantees not to starve any locker thread) > Given how few users rwlock_t has today we could attempt to make it > reader-writer fair, as long as the usecase of a hardirq or softirq > context always getting nested read access on the same CPU is preserved. > > (but that can be done - if nothing else then with an explicit context > check.) > > Thanks, > > Ingo My queue rwlock patch is able to handle the use case of a hardirq and softirq context and can be made almost as fair as the ticket spinlock. -Longman ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-28 18:55 ` Linus Torvalds ` (3 preceding siblings ...) 2013-09-30 1:11 ` Michel Lespinasse @ 2013-09-30 10:46 ` Peter Zijlstra 2013-10-01 7:48 ` Ingo Molnar 4 siblings, 1 reply; 55+ messages in thread From: Peter Zijlstra @ 2013-09-30 10:46 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sat, Sep 28, 2013 at 11:55:26AM -0700, Linus Torvalds wrote: > So if the primary reason for this is really just that f*cking anon_vma > lock, then I would seriously suggest: I would still like to see the rwsem patches merged; even if we end up going back to a spin style anon_vma lock. There's been various reports in the past about how programs are significantly faster if they wrap their mmap() calls in a pthread_mutex. And this was purely down to the fact that rwsem writer-writer contention blows chunks. ^ permalink raw reply [flat|nested] 55+ messages in thread
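The userspace workaround Peter Zijlstra refers to is simple enough to sketch (illustrative only; the helper name and mapping flags are made up for the example):

#include <pthread.h>
#include <sys/mman.h>
#include <stddef.h>

/*
 * Serialize all mmap() calls in the process so its threads never pile
 * up as writers on the kernel's mmap_sem.
 */
static pthread_mutex_t mmap_serialize = PTHREAD_MUTEX_INITIALIZER;

static void *locked_mmap(size_t len)
{
	void *p;

	pthread_mutex_lock(&mmap_serialize);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	pthread_mutex_unlock(&mmap_serialize);
	return p;
}

That such a wrapper can help at all is the point being made: the contended rwsem writer-writer path costs more than an uncontended pthread_mutex plus an uncontended mmap_sem.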
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-30 10:46 ` Peter Zijlstra @ 2013-10-01 7:48 ` Ingo Molnar 2013-10-01 8:10 ` Peter Zijlstra 0 siblings, 1 reply; 55+ messages in thread From: Ingo Molnar @ 2013-10-01 7:48 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Peter Zijlstra <peterz@infradead.org> wrote: > On Sat, Sep 28, 2013 at 11:55:26AM -0700, Linus Torvalds wrote: > > So if the primary reason for this is really just that f*cking anon_vma > > lock, then I would seriously suggest: > > I would still like to see the rwsem patches merged; even if we end up > going back to a spin style anon_vma lock. > > There's been various reports in the past about how programs are > significantly faster if they wrap their mmap() calls in a pthread_mutex. > And this was purely down to the fact that rwsem writer-writer contention > blows chunks. That's about the mm->mmap_sem rwsem, right? That impact would have to be measured carefully, and not just for workloads where we know that better contention logic helps, but other MM workloads that are hitting hard on mmap_sem. Thanks, Ingo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-10-01 7:48 ` Ingo Molnar @ 2013-10-01 8:10 ` Peter Zijlstra 0 siblings, 0 replies; 55+ messages in thread From: Peter Zijlstra @ 2013-10-01 8:10 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Tue, Oct 01, 2013 at 09:48:02AM +0200, Ingo Molnar wrote: > > * Peter Zijlstra <peterz@infradead.org> wrote: > > > On Sat, Sep 28, 2013 at 11:55:26AM -0700, Linus Torvalds wrote: > > > So if the primary reason for this is really just that f*cking anon_vma > > > lock, then I would seriously suggest: > > > > I would still like to see the rwsem patches merged; even if we end up > > going back to a spin style anon_vma lock. > > > > There's been various reports in the past about how programs are > > significantly faster if they wrap their mmap() calls in a pthread_mutex. > > And this was purely down to the fact that rwsem writer-writer contention > > blows chunks. > > That's about the mm->mmap_sem rwsem, right? Yep. > That impact would have to be measured carefully, and not just for > workloads where we know that better contention logic helps, but other MM > workloads that are hitting hard on mmap_sem. Sure. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-27 19:00 [PATCH] rwsem: reduce spinlock contention in wakeup code path Waiman Long 2013-09-27 19:28 ` Linus Torvalds @ 2013-09-27 19:32 ` Peter Hurley 2013-09-28 0:46 ` Waiman Long 1 sibling, 1 reply; 55+ messages in thread From: Peter Hurley @ 2013-09-27 19:32 UTC (permalink / raw) To: Waiman Long, Ingo Molnar, Andrew Morton Cc: linux-kernel, Rik van Riel, Davidlohr Bueso, Alex Shi, Tim Chen, Linus Torvalds, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On 09/27/2013 03:00 PM, Waiman Long wrote: > With the 3.12-rc2 kernel, there is sizable spinlock contention on > the rwsem wakeup code path when running AIM7's high_systime workload > on a 8-socket 80-core DL980 (HT off) as reported by perf: > > 7.64% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave > |--41.77%-- rwsem_wake > 1.61% reaim [kernel.kallsyms] [k] _raw_spin_lock_irq > |--92.37%-- rwsem_down_write_failed > > That was 4.7% of recorded CPU cycles. > > On a large NUMA machine, it is entirely possible that a fairly large > number of threads are queuing up in the ticket spinlock queue to do > the wakeup operation. In fact, only one will be needed. This patch > tries to reduce spinlock contention by doing just that. > > A new wakeup field is added to the rwsem structure. This field is > set on entry to rwsem_wake() and __rwsem_do_wake() to mark that a > thread is pending to do the wakeup call. It is cleared on exit from > those functions. > > By checking if the wakeup flag is set, a thread can exit rwsem_wake() > immediately if another thread is pending to do the wakeup instead of > waiting to get the spinlock and find out that nothing need to be done. This will leave readers stranded if a former writer is in __rwsem_do_wake to wake up the readers and another writer steals the lock, but before the former writer exits without having woken up the readers, the locking stealing writer drops the lock and sees the wakeup flag is set, so doesn't bother to wake the readers. Regards, Peter Hurley > The setting of the wakeup flag may not be visible on all processors in > some architectures. However, this won't affect program correctness. The > clearing of the wakeup flag before spin_unlock will ensure that it is > visible to all processors. 
> > With this patch, the performance improvement on jobs per minute (JPM) > of the high_systime workload (at 1500 users) was as follows: > > HT JPM w/o patch JPM with patch % Change > -- ------------- -------------- -------- > off 128466 150000 +16.8% > on 121606 146778 +20.7% > > The new perf profile (HT off) was as follows: > > 2.96% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave > |--0.94%-- rwsem_wake > 1.00% reaim [kernel.kallsyms] [k] _raw_spin_lock_irq > |--88.70%-- rwsem_down_write_failed > > Signed-off-by: Waiman Long <Waiman.Long@hp.com> > --- > include/linux/rwsem.h | 2 ++ > lib/rwsem.c | 19 +++++++++++++++++++ > 2 files changed, 21 insertions(+), 0 deletions(-) > > diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h > index 0616ffe..e25792e 100644 > --- a/include/linux/rwsem.h > +++ b/include/linux/rwsem.h > @@ -25,6 +25,7 @@ struct rw_semaphore; > struct rw_semaphore { > long count; > raw_spinlock_t wait_lock; > + int wakeup; /* Waking-up in progress flag */ > struct list_head wait_list; > #ifdef CONFIG_DEBUG_LOCK_ALLOC > struct lockdep_map dep_map; > @@ -58,6 +59,7 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem) > #define __RWSEM_INITIALIZER(name) \ > { RWSEM_UNLOCKED_VALUE, \ > __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \ > + 0, \ > LIST_HEAD_INIT((name).wait_list) \ > __RWSEM_DEP_MAP_INIT(name) } > > diff --git a/lib/rwsem.c b/lib/rwsem.c > index 19c5fa9..39290a5 100644 > --- a/lib/rwsem.c > +++ b/lib/rwsem.c > @@ -25,6 +25,7 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name, > lockdep_init_map(&sem->dep_map, name, key, 0); > #endif > sem->count = RWSEM_UNLOCKED_VALUE; > + sem->wakeup = 0; > raw_spin_lock_init(&sem->wait_lock); > INIT_LIST_HEAD(&sem->wait_list); > } > @@ -66,6 +67,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) > struct list_head *next; > long oldcount, woken, loop, adjustment; > > + sem->wakeup = 1; /* Waking up in progress */ > waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list); > if (waiter->type == RWSEM_WAITING_FOR_WRITE) { > if (wake_type == RWSEM_WAKE_ANY) > @@ -137,6 +139,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) > next->prev = &sem->wait_list; > > out: > + sem->wakeup = 0; > return sem; > } > > @@ -256,11 +259,27 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem) > { > unsigned long flags; > > + if (sem->wakeup) > + return sem; /* Waking up in progress already */ > + /* > + * Optimistically set the wakeup flag to indicate that the current > + * flag is going to wakeup the sleeping waiters so that the > + * following threads don't need to wait for doing the wakeup. > + * It is perfectly fine if another thread resets the flag. It just > + * leads to another thread waiting to call __rwsem_do_wake(). > + */ > + sem->wakeup = 1; > raw_spin_lock_irqsave(&sem->wait_lock, flags); > > /* do nothing if list empty */ > if (!list_empty(&sem->wait_list)) > sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY); > + else > + sem->wakeup = 0; /* Make sure wakeup flag is reset */ > + /* > + * The spin_unlock() call will force the nulled wakeup flag to > + * be visible to all the processors. > + */ > > raw_spin_unlock_irqrestore(&sem->wait_lock, flags); > > ^ permalink raw reply [flat|nested] 55+ messages in thread
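For reference, the lost-wakeup interleaving Peter Hurley describes, reconstructed from his description rather than from a trace:

/*
 *   CPU0: former writer                CPU1: lock-stealing writer
 *   -------------------                --------------------------
 *   up_write() -> rwsem_wake()
 *     sem->wakeup = 1
 *     __rwsem_do_wake()
 *       sees an active writer (CPU1
 *       stole the lock), so returns
 *       without waking the readers
 *                                      up_write() -> rwsem_wake()
 *                                        sees sem->wakeup == 1,
 *                                        returns immediately
 *     sem->wakeup = 0
 *
 * The queued readers are never woken: CPU0 could not wake them while
 * CPU1 held the lock, and CPU1 skipped the wakeup because of the flag.
 */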
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path
  2013-09-27 19:32 ` Peter Hurley
@ 2013-09-28  0:46 ` Waiman Long
  0 siblings, 0 replies; 55+ messages in thread
From: Waiman Long @ 2013-09-28 0:46 UTC (permalink / raw)
To: Peter Hurley
Cc: Ingo Molnar, Andrew Morton, linux-kernel, Rik van Riel,
Davidlohr Bueso, Alex Shi, Tim Chen, Linus Torvalds,
Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen,
Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin,
Norton, Scott J

On 09/27/2013 03:32 PM, Peter Hurley wrote:
> On 09/27/2013 03:00 PM, Waiman Long wrote:
>> With the 3.12-rc2 kernel, there is sizable spinlock contention on
>> the rwsem wakeup code path when running AIM7's high_systime workload
>> on a 8-socket 80-core DL980 (HT off) as reported by perf:
>>
>> 7.64% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
>> |--41.77%-- rwsem_wake
>> 1.61% reaim [kernel.kallsyms] [k] _raw_spin_lock_irq
>> |--92.37%-- rwsem_down_write_failed
>>
>> That was 4.7% of recorded CPU cycles.
>>
>> On a large NUMA machine, it is entirely possible that a fairly large
>> number of threads are queuing up in the ticket spinlock queue to do
>> the wakeup operation. In fact, only one will be needed. This patch
>> tries to reduce spinlock contention by doing just that.
>>
>> A new wakeup field is added to the rwsem structure. This field is
>> set on entry to rwsem_wake() and __rwsem_do_wake() to mark that a
>> thread is pending to do the wakeup call. It is cleared on exit from
>> those functions.
>>
>> By checking if the wakeup flag is set, a thread can exit rwsem_wake()
>> immediately if another thread is pending to do the wakeup instead of
>> waiting to get the spinlock and find out that nothing need to be done.
>
> This will leave readers stranded if a former writer is in __rwsem_do_wake
> to wake up the readers and another writer steals the lock, but before
> the former writer exits without having woken up the readers, the locking
> stealing writer drops the lock and sees the wakeup flag is set, so
> doesn't bother to wake the readers.
>
> Regards,
> Peter Hurley

Yes, you are right. That can be a problem. Thanks for pointing this out.

The workloads that I used don't seem to exercise the readers. I will
modify the patch to add code to handle this failure case by resetting
the wakeup flag, pushing it out and then retrying one more time to get
the read lock. I think that should address the problem.

Regards,
Longman

^ permalink raw reply	[flat|nested] 55+ messages in thread