* [PATCH] rwsem: reduce spinlock contention in wakeup code path
@ 2013-09-27 19:00 Waiman Long
2013-09-27 19:28 ` Linus Torvalds
2013-09-27 19:32 ` Peter Hurley
0 siblings, 2 replies; 55+ messages in thread
From: Waiman Long @ 2013-09-27 19:00 UTC (permalink / raw)
To: Ingo Molnar, Andrew Morton
Cc: Waiman Long, linux-kernel, Rik van Riel, Peter Hurley,
Davidlohr Bueso, Alex Shi, Tim Chen, Linus Torvalds,
Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen,
Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin,
Norton, Scott J
With the 3.12-rc2 kernel, there is sizable spinlock contention on
the rwsem wakeup code path when running AIM7's high_systime workload
on an 8-socket 80-core DL980 (HT off) as reported by perf:
7.64% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
|--41.77%-- rwsem_wake
1.61% reaim [kernel.kallsyms] [k] _raw_spin_lock_irq
|--92.37%-- rwsem_down_write_failed
Together, these two code paths accounted for about 4.7% of the recorded
CPU cycles (7.64% * 41.77% + 1.61% * 92.37% ~= 4.7%).
On a large NUMA machine, it is entirely possible that a fairly large
number of threads are queuing up in the ticket spinlock queue to do
the wakeup operation when, in fact, only one of them needs to perform
the wakeup. This patch reduces spinlock contention by letting the
remaining threads skip it.
A new wakeup field is added to the rwsem structure. This field is
set on entry to rwsem_wake() and __rwsem_do_wake() to mark that a
thread is pending to do the wakeup call. It is cleared on exit from
those functions.
By checking the wakeup flag, a thread can exit rwsem_wake() immediately
if another thread is already pending to do the wakeup, instead of
waiting for the spinlock only to find out that nothing needs to be done.
The setting of the wakeup flag may not be visible on all processors in
some architectures. However, this won't affect program correctness. The
clearing of the wakeup flag before spin_unlock will ensure that it is
visible to all processors.
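In condensed form, the fast path this patch adds to rwsem_wake() looks
roughly like the sketch below (field and function names are the ones
used in the diff that follows; comments trimmed):

	struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
	{
		unsigned long flags;

		if (sem->wakeup)
			return sem;	/* another thread is already waking waiters */

		sem->wakeup = 1;	/* optimistic; losing a race here is harmless */

		raw_spin_lock_irqsave(&sem->wait_lock, flags);
		if (!list_empty(&sem->wait_list))
			sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);
		else
			sem->wakeup = 0;	/* nothing to wake up, clear the flag */
		raw_spin_unlock_irqrestore(&sem->wait_lock, flags);

		return sem;
	}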
With this patch, the performance improvement on jobs per minute (JPM)
of the high_systime workload (at 1500 users) was as follows:
HT JPM w/o patch JPM with patch % Change
-- ------------- -------------- --------
off 128466 150000 +16.8%
on 121606 146778 +20.7%
The new perf profile (HT off) was as follows:
2.96% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
|--0.94%-- rwsem_wake
1.00% reaim [kernel.kallsyms] [k] _raw_spin_lock_irq
|--88.70%-- rwsem_down_write_failed
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
include/linux/rwsem.h | 2 ++
lib/rwsem.c | 19 +++++++++++++++++++
2 files changed, 21 insertions(+), 0 deletions(-)
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 0616ffe..e25792e 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -25,6 +25,7 @@ struct rw_semaphore;
struct rw_semaphore {
long count;
raw_spinlock_t wait_lock;
+ int wakeup; /* Waking-up in progress flag */
struct list_head wait_list;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
@@ -58,6 +59,7 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
#define __RWSEM_INITIALIZER(name) \
{ RWSEM_UNLOCKED_VALUE, \
__RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \
+ 0, \
LIST_HEAD_INIT((name).wait_list) \
__RWSEM_DEP_MAP_INIT(name) }
diff --git a/lib/rwsem.c b/lib/rwsem.c
index 19c5fa9..39290a5 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -25,6 +25,7 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
lockdep_init_map(&sem->dep_map, name, key, 0);
#endif
sem->count = RWSEM_UNLOCKED_VALUE;
+ sem->wakeup = 0;
raw_spin_lock_init(&sem->wait_lock);
INIT_LIST_HEAD(&sem->wait_list);
}
@@ -66,6 +67,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
struct list_head *next;
long oldcount, woken, loop, adjustment;
+ sem->wakeup = 1; /* Waking up in progress */
waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
if (wake_type == RWSEM_WAKE_ANY)
@@ -137,6 +139,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
next->prev = &sem->wait_list;
out:
+ sem->wakeup = 0;
return sem;
}
@@ -256,11 +259,27 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
{
unsigned long flags;
+ if (sem->wakeup)
+ return sem; /* Waking up in progress already */
+ /*
+ * Optimistically set the wakeup flag to indicate that the current
+ * thread is going to wake up the sleeping waiters so that the
+ * following threads don't need to wait to do the wakeup.
+ * It is perfectly fine if another thread resets the flag. It just
+ * leads to another thread waiting to call __rwsem_do_wake().
+ */
+ sem->wakeup = 1;
raw_spin_lock_irqsave(&sem->wait_lock, flags);
/* do nothing if list empty */
if (!list_empty(&sem->wait_list))
sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);
+ else
+ sem->wakeup = 0; /* Make sure wakeup flag is reset */
+ /*
+ * The spin_unlock() call will force the cleared wakeup flag to
+ * be visible to all processors.
+ */
raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
--
1.7.1
^ permalink raw reply related [flat|nested] 55+ messages in thread* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-27 19:00 [PATCH] rwsem: reduce spinlock contention in wakeup code path Waiman Long @ 2013-09-27 19:28 ` Linus Torvalds 2013-09-27 19:39 ` Davidlohr Bueso 2013-09-28 7:41 ` Ingo Molnar 2013-09-27 19:32 ` Peter Hurley 1 sibling, 2 replies; 55+ messages in thread From: Linus Torvalds @ 2013-09-27 19:28 UTC (permalink / raw) To: Waiman Long Cc: Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Fri, Sep 27, 2013 at 12:00 PM, Waiman Long <Waiman.Long@hp.com> wrote: > > On a large NUMA machine, it is entirely possible that a fairly large > number of threads are queuing up in the ticket spinlock queue to do > the wakeup operation. In fact, only one will be needed. This patch > tries to reduce spinlock contention by doing just that. > > A new wakeup field is added to the rwsem structure. This field is > set on entry to rwsem_wake() and __rwsem_do_wake() to mark that a > thread is pending to do the wakeup call. It is cleared on exit from > those functions. Ok, this is *much* simpler than adding the new MCS spinlock, so I'm wondering what the performance difference between the two are. I'm obviously always in favor of just removing lock contention over trying to improve the lock scalability, so I really like Waiman's approach over Tim's new MCS lock. Not because I dislike MCS locks in general (or you, Tim ;), it's really more fundamental: I just fundamentally believe more in trying to avoid lock contention than in trying to improve lock behavior when that contention happens. As such, I love exactly these kinds of things that Wainman's patch does, and I'm heavily biased. But I know I'm heavily biased, so I'd really like to get comparative numbers for these things. Waiman, can you compare your patch with Tim's (and Alex's) 6-patch series to make the rwsem's use MCS locks for the spinlock? The numbers Tim quotes for the MCS patch series ("high_systime (+7%)") are lower than the ones you quote (16-20%), but that may be due to hardware platform differences and just methodology. Tim was also looking at exim performance. So Tim/Waiman, mind comparing the two approaches on the setups you have? Linus ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-27 19:28 ` Linus Torvalds @ 2013-09-27 19:39 ` Davidlohr Bueso 2013-09-27 21:49 ` Tim Chen 2013-09-28 7:41 ` Ingo Molnar 1 sibling, 1 reply; 55+ messages in thread From: Davidlohr Bueso @ 2013-09-27 19:39 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Fri, 2013-09-27 at 12:28 -0700, Linus Torvalds wrote: > On Fri, Sep 27, 2013 at 12:00 PM, Waiman Long <Waiman.Long@hp.com> wrote: > > > > On a large NUMA machine, it is entirely possible that a fairly large > > number of threads are queuing up in the ticket spinlock queue to do > > the wakeup operation. In fact, only one will be needed. This patch > > tries to reduce spinlock contention by doing just that. > > > > A new wakeup field is added to the rwsem structure. This field is > > set on entry to rwsem_wake() and __rwsem_do_wake() to mark that a > > thread is pending to do the wakeup call. It is cleared on exit from > > those functions. > > Ok, this is *much* simpler than adding the new MCS spinlock, so I'm > wondering what the performance difference between the two are. Both approaches should be complementary. The idea of optimistic spinning in rwsems is to avoid putting putting the writer on the wait queue - reducing contention and giving a greater chance for the rwsem to get acquired. Waiman's approach is once the blocking actually occurs, and at this point I'm not sure how this will affect writer stealing logic. Thanks, Davidlohr ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-27 19:39 ` Davidlohr Bueso @ 2013-09-27 21:49 ` Tim Chen 2013-09-28 6:45 ` Ingo Molnar 0 siblings, 1 reply; 55+ messages in thread From: Tim Chen @ 2013-09-27 21:49 UTC (permalink / raw) To: Davidlohr Bueso Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Fri, 2013-09-27 at 12:39 -0700, Davidlohr Bueso wrote: > On Fri, 2013-09-27 at 12:28 -0700, Linus Torvalds wrote: > > On Fri, Sep 27, 2013 at 12:00 PM, Waiman Long <Waiman.Long@hp.com> wrote: > > > > > > On a large NUMA machine, it is entirely possible that a fairly large > > > number of threads are queuing up in the ticket spinlock queue to do > > > the wakeup operation. In fact, only one will be needed. This patch > > > tries to reduce spinlock contention by doing just that. > > > > > > A new wakeup field is added to the rwsem structure. This field is > > > set on entry to rwsem_wake() and __rwsem_do_wake() to mark that a > > > thread is pending to do the wakeup call. It is cleared on exit from > > > those functions. > > > > Ok, this is *much* simpler than adding the new MCS spinlock, so I'm > > wondering what the performance difference between the two are. > > Both approaches should be complementary. The idea of optimistic spinning > in rwsems is to avoid putting putting the writer on the wait queue - > reducing contention and giving a greater chance for the rwsem > to get acquired. Waiman's approach is once the blocking actually occurs, > and at this point I'm not sure how this will affect writer stealing > logic. > I agree with the view that the two approaches are complementary to each other. They address different bottleneck areas in the rwsem. Here're the performance numbers for exim workload compared to a vanilla kernel. Waimain's patch: +2.0% Alex+Tim's patchset: +4.8% Waiman+Alex+Tim: +5.3% Tim ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-27 21:49 ` Tim Chen @ 2013-09-28 6:45 ` Ingo Molnar 0 siblings, 0 replies; 55+ messages in thread From: Ingo Molnar @ 2013-09-28 6:45 UTC (permalink / raw) To: Tim Chen Cc: Davidlohr Bueso, Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Tim Chen <tim.c.chen@linux.intel.com> wrote: > On Fri, 2013-09-27 at 12:39 -0700, Davidlohr Bueso wrote: > > On Fri, 2013-09-27 at 12:28 -0700, Linus Torvalds wrote: > > > On Fri, Sep 27, 2013 at 12:00 PM, Waiman Long <Waiman.Long@hp.com> wrote: > > > > > > > > On a large NUMA machine, it is entirely possible that a fairly large > > > > number of threads are queuing up in the ticket spinlock queue to do > > > > the wakeup operation. In fact, only one will be needed. This patch > > > > tries to reduce spinlock contention by doing just that. > > > > > > > > A new wakeup field is added to the rwsem structure. This field is > > > > set on entry to rwsem_wake() and __rwsem_do_wake() to mark that a > > > > thread is pending to do the wakeup call. It is cleared on exit from > > > > those functions. > > > > > > Ok, this is *much* simpler than adding the new MCS spinlock, so I'm > > > wondering what the performance difference between the two are. > > > > Both approaches should be complementary. The idea of optimistic spinning > > in rwsems is to avoid putting putting the writer on the wait queue - > > reducing contention and giving a greater chance for the rwsem > > to get acquired. Waiman's approach is once the blocking actually occurs, > > and at this point I'm not sure how this will affect writer stealing > > logic. > > > > I agree with the view that the two approaches are complementary to each > other. They address different bottleneck areas in the rwsem. Here're > the performance numbers for exim workload compared to a vanilla kernel. > > Waimain's patch: +2.0% > Alex+Tim's patchset: +4.8% > Waiman+Alex+Tim: +5.3% I think I'd like to see a combo series submitted :-) Thanks, Ingo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-27 19:28 ` Linus Torvalds 2013-09-27 19:39 ` Davidlohr Bueso @ 2013-09-28 7:41 ` Ingo Molnar 2013-09-28 18:55 ` Linus Torvalds 1 sibling, 1 reply; 55+ messages in thread From: Ingo Molnar @ 2013-09-28 7:41 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, Sep 27, 2013 at 12:00 PM, Waiman Long <Waiman.Long@hp.com> wrote: > > > > On a large NUMA machine, it is entirely possible that a fairly large > > number of threads are queuing up in the ticket spinlock queue to do > > the wakeup operation. In fact, only one will be needed. This patch > > tries to reduce spinlock contention by doing just that. > > > > A new wakeup field is added to the rwsem structure. This field is set > > on entry to rwsem_wake() and __rwsem_do_wake() to mark that a thread > > is pending to do the wakeup call. It is cleared on exit from those > > functions. > > Ok, this is *much* simpler than adding the new MCS spinlock, so I'm > wondering what the performance difference between the two are. > > I'm obviously always in favor of just removing lock contention over > trying to improve the lock scalability, so I really like Waiman's > approach over Tim's new MCS lock. Not because I dislike MCS locks in > general (or you, Tim ;), it's really more fundamental: I just > fundamentally believe more in trying to avoid lock contention than in > trying to improve lock behavior when that contention happens. > > As such, I love exactly these kinds of things that Wainman's patch does, > and I'm heavily biased. Yeah, I fully agree. The reason I'm still very sympathetic to Tim's efforts is that they address a regression caused by a mechanic mutex->rwsem conversion: 5a505085f043 mm/rmap: Convert the struct anon_vma::mutex to an rwsem ... and Tim's patches turn that regression into an actual speedup. The main regression component happens to be more prominent on larger systems which inevitably have higher contention. That was a result of mutexes having better contention behavior than rwsems - which was not a property Tim picked arbitrarily. The other advantage would be that by factoring out the MCS code it gets reviewed and seen more prominently - this resulted in a micro-optimization already. So people working on mutex or rwsem scalability will have a chance to help each other. A third advantage would be that arguably our rwsems degrade on really big systems catastrophically. We had that with spinlocks and mutexes and it got fixd. Since rwsems are a long-term concern for us I thought that something like MCS queueing could be considered a fundamental quality-of-implementation requirement as well. In any case I fully agree that we never want to overdo it, our priority of optimization and our ranking will always be: - lockless algorithms - reduction of critical paths ... - improved locking fast path - improved locking slow path and people will see much better speedups if they concentrate on entries higher on this list. 
It's a pretty rigid order as well: no slowpath improvement is allowed to hurt any of the higher priority optimization concepts and people are encouraged to always work on the higher order concepts before considering the symptoms of contention. But as long as contention behavior improvements are cleanly done, don't cause regressions, do not hinder primary scalability efforts and the numbers are as impressive as Tim's, it looks like good stuff to me. I was thinking about putting them into the locking tree once the review and testing process converges and wanted to send them to you in the v3.13 merge window. The only potential downside (modulo bugs) I can see from Tim's patches is XFS's heavy dependence on fine behavioral details of rwsem. But if done right then none of these scalability efforts should impact those details. Thanks, Ingo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-28 7:41 ` Ingo Molnar @ 2013-09-28 18:55 ` Linus Torvalds 2013-09-28 19:13 ` Andi Kleen ` (4 more replies) 0 siblings, 5 replies; 55+ messages in thread From: Linus Torvalds @ 2013-09-28 18:55 UTC (permalink / raw) To: Ingo Molnar Cc: Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sat, Sep 28, 2013 at 12:41 AM, Ingo Molnar <mingo@kernel.org> wrote: > > > Yeah, I fully agree. The reason I'm still very sympathetic to Tim's > efforts is that they address a regression caused by a mechanic > mutex->rwsem conversion: > > 5a505085f043 mm/rmap: Convert the struct anon_vma::mutex to an rwsem > > ... and Tim's patches turn that regression into an actual speedup. Btw, I really hate that thing. I think we should turn it back into a spinlock. None of what it protects needs a mutex or an rwsem. Because you guys talk about the regression of turning it into a rwsem, but nobody talks about the *original* regression. And it *used* to be a spinlock, and it was changed into a mutex back in 2011 by commit 2b575eb64f7a. That commit doesn't even have a reason listed for it, although my dim memory of it is that the reason was preemption latency. And that caused big regressions too. Of course, since then, we may well have screwed things up and now we sleep under it, but I still really think it was a mistake to do it in the first place. So if the primary reason for this is really just that f*cking anon_vma lock, then I would seriously suggest: - turn it back into a spinlock (or rwlock_t, since we subsequently separated the read and write paths) - fix up any breakage (ie new scheduling points) that exposes - look at possible other approaches wrt latency on that thing. Hmm? Linus ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-28 18:55 ` Linus Torvalds @ 2013-09-28 19:13 ` Andi Kleen 2013-09-28 19:22 ` Linus Torvalds 2013-09-28 19:21 ` Ingo Molnar ` (3 subsequent siblings) 4 siblings, 1 reply; 55+ messages in thread From: Andi Kleen @ 2013-09-28 19:13 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J > Of course, since then, we may well have screwed things up and now we > sleep under it, but I still really think it was a mistake to do it in > the first place. > > So if the primary reason for this is really just that f*cking anon_vma > lock, then I would seriously suggest: > > - turn it back into a spinlock (or rwlock_t, since we subsequently > separated the read and write paths) Yes please. spinlocks/rwlocks have so much nicer performance behavior than rwsems/mutexes (which noone seems to fully understand) We had also significant performance regressions from every such spinning->sleeping change in the VM (this was just the latest) And afaik anon_vma is usually hold short. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-28 19:13 ` Andi Kleen @ 2013-09-28 19:22 ` Linus Torvalds 0 siblings, 0 replies; 55+ messages in thread From: Linus Torvalds @ 2013-09-28 19:22 UTC (permalink / raw) To: Andi Kleen Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Chandramouleeswaran, Aswin, Norton, Scott J On Sat, Sep 28, 2013 at 12:13 PM, Andi Kleen <andi@firstfloor.org> wrote: > > And afaik anon_vma is usually hold short. Yes. But the problem with anon_vma is that the "usually" may be the 99.9% case, but then there are some insane loads that do tons of forking without execve, and they really make some of the rmap code work very very hard. And then they all not only share that one root vma, but the mm/rmap.c code ends up having to walk all their VM's because there could be a page in there somewhere. These loads aren't necessarily very realistic and very much not common, but I think AIM7 actually has one of those cases, iirc. Our anon_vma locking really is some of the more complex parts of the kernel. Not because of the lock itself, but because of the subtle rules about the whole anon_vma chain and how we have to lock the root of the chain etc etc. And under all _normal_ behavior it's not a problem at all. But I personally dread looking at some of that code, because if we get anything wrong there (and it's happened), it's too painful for words. Linus ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-28 18:55 ` Linus Torvalds 2013-09-28 19:13 ` Andi Kleen @ 2013-09-28 19:21 ` Ingo Molnar 2013-09-28 19:33 ` Linus Torvalds 2013-09-28 19:37 ` [PATCH] anon_vmas: Convert the rwsem to an rwlock_t Ingo Molnar 2013-09-29 23:06 ` [PATCH] rwsem: reduce spinlock contention in wakeup code path Davidlohr Bueso ` (2 subsequent siblings) 4 siblings, 2 replies; 55+ messages in thread From: Ingo Molnar @ 2013-09-28 19:21 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sat, Sep 28, 2013 at 12:41 AM, Ingo Molnar <mingo@kernel.org> wrote: > > > > > > Yeah, I fully agree. The reason I'm still very sympathetic to Tim's > > efforts is that they address a regression caused by a mechanic > > mutex->rwsem conversion: > > > > 5a505085f043 mm/rmap: Convert the struct anon_vma::mutex to an rwsem > > > > ... and Tim's patches turn that regression into an actual speedup. > > Btw, I really hate that thing. I think we should turn it back into a > spinlock. None of what it protects needs a mutex or an rwsem. > > Because you guys talk about the regression of turning it into a rwsem, > but nobody talks about the *original* regression. > > And it *used* to be a spinlock, and it was changed into a mutex back in > 2011 by commit 2b575eb64f7a. That commit doesn't even have a reason > listed for it, although my dim memory of it is that the reason was > preemption latency. Yeah, I think it was latency. > And that caused big regressions too. > > Of course, since then, we may well have screwed things up and now we > sleep under it, but I still really think it was a mistake to do it in > the first place. > > So if the primary reason for this is really just that f*cking anon_vma > lock, then I would seriously suggest: > > - turn it back into a spinlock (or rwlock_t, since we subsequently > separated the read and write paths) > > - fix up any breakage (ie new scheduling points) that exposes > > - look at possible other approaches wrt latency on that thing. > > Hmm? If we do that then I suspect the next step will be queued rwlocks :-/ The current rwlock_t implementation is rather primitive by modern standards. (We'd probably have killed rwlock_t long ago if not for the tasklist_lock.) But yeah, it would work and conceptually a hard spinlock fits something as lowlevel as the anon-vma lock. I did a quick review pass and it appears nothing obvious is scheduling with the anon-vma lock held. If it did in a non-obvious way it's likely a bug anyway. The hugepage code grew a lot of logic running under the anon-vma lock, but it all seems atomic. So a conversion to rwlock_t could be attempted. (It should be relatively easy patch as well, because the locking operation is now nicely abstracted out.) Thanks, Ingo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-28 19:21 ` Ingo Molnar @ 2013-09-28 19:33 ` Linus Torvalds 2013-09-28 19:39 ` Ingo Molnar ` (2 more replies) 2013-09-28 19:37 ` [PATCH] anon_vmas: Convert the rwsem to an rwlock_t Ingo Molnar 1 sibling, 3 replies; 55+ messages in thread From: Linus Torvalds @ 2013-09-28 19:33 UTC (permalink / raw) To: Ingo Molnar Cc: Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sat, Sep 28, 2013 at 12:21 PM, Ingo Molnar <mingo@kernel.org> wrote: > > If we do that then I suspect the next step will be queued rwlocks :-/ The > current rwlock_t implementation is rather primitive by modern standards. > (We'd probably have killed rwlock_t long ago if not for the > tasklist_lock.) Yeah, I'm not happy about or rwlocks. That's one lock that currently is so broken that I think we could easily argue for making that one queued. Waiman had a qrwlock series that looked reasonable, and I think his later versions were drop-in replacements (ie they automatically just did the RightThing(tm) wrt interrupts taking a recursive read lock - I objected to the first versions that required that to be stated explicitly). I think Waiman's patches (even the later ones) made the queued rwlocks be a side-by-side implementation with the old rwlocks, and I think that was just being unnecessarily careful. It might be useful for testing to have a config option to switch between the two, but we might as well go all the way. The old rwlock's really have been a disappointment - they are slower than spinlocks, and seldom/never end up scaling any better. Their main advantage was literally the irq behavior - allowing readers to happen without the expense of worrying about irq's. Linus ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-28 19:33 ` Linus Torvalds @ 2013-09-28 19:39 ` Ingo Molnar 2013-09-30 10:44 ` Peter Zijlstra 2013-09-30 15:58 ` Waiman Long 2 siblings, 0 replies; 55+ messages in thread From: Ingo Molnar @ 2013-09-28 19:39 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sat, Sep 28, 2013 at 12:21 PM, Ingo Molnar <mingo@kernel.org> wrote: > > > > If we do that then I suspect the next step will be queued rwlocks :-/ The > > current rwlock_t implementation is rather primitive by modern standards. > > (We'd probably have killed rwlock_t long ago if not for the > > tasklist_lock.) > > Yeah, I'm not happy about or rwlocks. That's one lock that currently > is so broken that I think we could easily argue for making that one > queued. > > Waiman had a qrwlock series that looked reasonable, and I think his > later versions were drop-in replacements (ie they automatically just > did the RightThing(tm) wrt interrupts taking a recursive read lock - I > objected to the first versions that required that to be stated > explicitly). > > I think Waiman's patches (even the later ones) made the queued rwlocks > be a side-by-side implementation with the old rwlocks, and I think > that was just being unnecessarily careful. It might be useful for > testing to have a config option to switch between the two, but we > might as well go all the way. > > The old rwlock's really have been a disappointment - they are slower > than spinlocks, and seldom/never end up scaling any better. Their main > advantage was literally the irq behavior - allowing readers to happen > without the expense of worrying about irq's. Yeah. But at least here the read side will not play, as the AIM7 workloads where the testing goes on excercises the write path exclusively I think. Still, the lack of queueing ought to hurt - the question is by how much. Thanks, Ingo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-28 19:33 ` Linus Torvalds 2013-09-28 19:39 ` Ingo Molnar @ 2013-09-30 10:44 ` Peter Zijlstra 2013-09-30 16:13 ` Linus Torvalds 2013-09-30 15:58 ` Waiman Long 2 siblings, 1 reply; 55+ messages in thread From: Peter Zijlstra @ 2013-09-30 10:44 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sat, Sep 28, 2013 at 12:33:36PM -0700, Linus Torvalds wrote: > The old rwlock's really have been a disappointment - they are slower > than spinlocks, and seldom/never end up scaling any better. Their > main advantage was literally the irq behavior - allowing readers to > happen without the expense of worrying about irq's. So in part that is fundamental to the whole rw-spinlock concept. Typically lock hold times should be short for spinlock type locks and if your hold times are short, the lock acquisition times are significant. And a read acquisition is still a RMW operation on the lock, thus read locks are still entirely bound by the cacheline transfer of the lock itself. ^ permalink raw reply [flat|nested] 55+ messages in thread
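The point about a read acquisition still being a cache-line-bouncing RMW
can be illustrated with a toy read-lock fast path (a sketch only, not
the kernel's actual rwlock implementation):

	/*
	 * Toy rwlock reader fast path.  Even though readers don't exclude
	 * each other, every acquisition is an atomic read-modify-write of
	 * the shared lock word, so the lock's cache line is pulled
	 * exclusive to each acquiring CPU in turn and bounces between
	 * CPUs even when no writer ever shows up.
	 */
	struct toy_rwlock {
		atomic_t cnts;	/* >= 0: reader count, large negative: writer */
	};

	static inline void toy_read_lock(struct toy_rwlock *lock)
	{
		for (;;) {
			/* atomic RMW: transfers the cache line to this CPU */
			if (atomic_add_return(1, &lock->cnts) > 0)
				return;			/* no writer, read side held */
			atomic_sub(1, &lock->cnts);	/* writer active: undo, retry */
			cpu_relax();
		}
	}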
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-30 10:44 ` Peter Zijlstra @ 2013-09-30 16:13 ` Linus Torvalds 2013-09-30 16:41 ` Peter Zijlstra 0 siblings, 1 reply; 55+ messages in thread From: Linus Torvalds @ 2013-09-30 16:13 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Mon, Sep 30, 2013 at 3:44 AM, Peter Zijlstra <peterz@infradead.org> wrote: > On Sat, Sep 28, 2013 at 12:33:36PM -0700, Linus Torvalds wrote: >> The old rwlock's really have been a disappointment - they are slower >> than spinlocks, and seldom/never end up scaling any better. Their >> main advantage was literally the irq behavior - allowing readers to >> happen without the expense of worrying about irq's. > > So in part that is fundamental to the whole rw-spinlock concept. No, I agree. But it's really just that our rwlock implementation isn't very good. It's neither really high-performing nor fair, and the premise of a rw-lock is that it should scale better than a spinlock. And yes, under heavy reader activity *does* work out for that (ok, you get cacheline bouncing, but at least you don't get the spinning when you have tons of readers), the extreme unfairness towards writers under heavy reader activity makes it often simply unacceptable. So unlike a lot of other "let's try to make our locking fancy" that I dislike because it tends to hide the fundamental problem of contention, the rwlock patches make me go "those actually _fix_ a fundamental problem". Linus ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-30 16:13 ` Linus Torvalds @ 2013-09-30 16:41 ` Peter Zijlstra 2013-10-01 7:28 ` Ingo Molnar 0 siblings, 1 reply; 55+ messages in thread From: Peter Zijlstra @ 2013-09-30 16:41 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Mon, Sep 30, 2013 at 09:13:52AM -0700, Linus Torvalds wrote: > So unlike a lot of other "let's try to make our locking fancy" that I > dislike because it tends to hide the fundamental problem of > contention, the rwlock patches make me go "those actually _fix_ a > fundamental problem". So here I'm slightly disagreeing; fixing a fundamental problem would be coming up a better anon_vma management that doesn't create such immense chains. Its still the same lock, spinlock or not. And regardless of if we keep anon_vma lock a rwsem or not; I think we should merge those rwsem patches as they do improve the lock implementation and the hard work has already been done. However the biggest ugly by far here is that mm_take_all_locks() thing; couldn't we implement that by basically freezing all tasks referencing that mm? ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-30 16:41 ` Peter Zijlstra @ 2013-10-01 7:28 ` Ingo Molnar 2013-10-01 8:09 ` Peter Zijlstra 0 siblings, 1 reply; 55+ messages in thread From: Ingo Molnar @ 2013-10-01 7:28 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Peter Zijlstra <peterz@infradead.org> wrote: > On Mon, Sep 30, 2013 at 09:13:52AM -0700, Linus Torvalds wrote: > > > So unlike a lot of other "let's try to make our locking fancy" that I > > dislike because it tends to hide the fundamental problem of > > contention, the rwlock patches make me go "those actually _fix_ a > > fundamental problem". > > So here I'm slightly disagreeing; fixing a fundamental problem would be > coming up a better anon_vma management that doesn't create such immense > chains. So, I think the fundamental problem seems to be that when rwsems are applied to this usecase, they still don't perform as well as a primitive rwlock. That means that when rwsems cause a context switch it is a loss, while an rwlock_t burning CPU time by looping around is a win. I'm not sure it's even about 'immense chains' - if those were true then context-switching should actually improve performance by allowing other work to continue while the heavy chains are processed. Alas that's not what happens! Or is AIM7 essentially triggering a single large lock? I doubt that's the case though. > Its still the same lock, spinlock or not. > > And regardless of if we keep anon_vma lock a rwsem or not; I think we > should merge those rwsem patches as they do improve the lock > implementation and the hard work has already been done. That I mostly agree with, except that without a serious usecase do we have a guarantee that bugs in fancies queueing in rwsems gets ironed out? Thanks, Ingo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-10-01 7:28 ` Ingo Molnar @ 2013-10-01 8:09 ` Peter Zijlstra 2013-10-01 8:25 ` Ingo Molnar 0 siblings, 1 reply; 55+ messages in thread From: Peter Zijlstra @ 2013-10-01 8:09 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Tue, Oct 01, 2013 at 09:28:15AM +0200, Ingo Molnar wrote: > That I mostly agree with, except that without a serious usecase do we have > a guarantee that bugs in fancies queueing in rwsems gets ironed out? Methinks mmap_sem is still a big enough lock to work out a locking primitive :-) In fact, try something like this from userspace: n-threads: pthread_mutex_lock(&mutex); foo = mmap(); pthread_mutex_lock(&mutex); /* work */ pthread_mutex_unlock(&mutex); munma(foo); pthread_mutex_unlock(&mutex); vs n-threads: foo = mmap(); /* work */ munmap(foo); I've had reports that the former was significantly faster than the latter. ^ permalink raw reply [flat|nested] 55+ messages in thread
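A minimal compilable version of that experiment might look like the
sketch below, reading the quoted pseudo-code as serializing only the
mmap()/munmap() calls; the thread count, iteration count and mapping
size are arbitrary illustration values, not numbers from this thread:

	/* gcc -O2 -pthread mmap-demo.c -o mmap-demo
	 *
	 * Run with no argument for the plain variant, or with any argument
	 * for the variant that wraps mmap()/munmap() in a user-space mutex.
	 */
	#include <pthread.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>

	#define NTHREADS	16
	#define ITERS		100000
	#define MAPLEN		(64 * 1024)

	static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
	static int serialize;

	static void *worker(void *arg)
	{
		for (int i = 0; i < ITERS; i++) {
			void *p;

			if (serialize)
				pthread_mutex_lock(&mutex);
			p = mmap(NULL, MAPLEN, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (serialize)
				pthread_mutex_unlock(&mutex);
			if (p == MAP_FAILED)
				abort();

			memset(p, 1, MAPLEN);		/* the "work" */

			if (serialize)
				pthread_mutex_lock(&mutex);
			munmap(p, MAPLEN);
			if (serialize)
				pthread_mutex_unlock(&mutex);
		}
		return NULL;
	}

	int main(int argc, char **argv)
	{
		pthread_t tid[NTHREADS];

		serialize = (argc > 1);

		for (int i = 0; i < NTHREADS; i++)
			pthread_create(&tid[i], NULL, worker, NULL);
		for (int i = 0; i < NTHREADS; i++)
			pthread_join(tid[i], NULL);

		return 0;
	}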
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-10-01 8:09 ` Peter Zijlstra @ 2013-10-01 8:25 ` Ingo Molnar 0 siblings, 0 replies; 55+ messages in thread From: Ingo Molnar @ 2013-10-01 8:25 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, Oct 01, 2013 at 09:28:15AM +0200, Ingo Molnar wrote: > > > That I mostly agree with, except that without a serious usecase do we > > have a guarantee that bugs in fancies queueing in rwsems gets ironed > > out? > > Methinks mmap_sem is still a big enough lock to work out a locking > primitive :-) I mean the AIM7 usecase probably falls away - we need to find another one that shows the inefficiencies. > In fact, try something like this from userspace: > > n-threads: > > pthread_mutex_lock(&mutex); > foo = mmap(); > pthread_mutex_lock(&mutex); > > /* work */ > > pthread_mutex_unlock(&mutex); > munma(foo); > pthread_mutex_unlock(&mutex); > > vs > > n-threads: > > foo = mmap(); > /* work */ > munmap(foo); > > > I've had reports that the former was significantly faster than the > latter. That looks like a legitimate pattern that ought to trigger in many apps. Would be nice to turn this into a: perf bench mm thread-create testcase or so. Thanks, Ingo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-28 19:33 ` Linus Torvalds 2013-09-28 19:39 ` Ingo Molnar 2013-09-30 10:44 ` Peter Zijlstra @ 2013-09-30 15:58 ` Waiman Long 2013-10-01 7:33 ` Ingo Molnar 2 siblings, 1 reply; 55+ messages in thread From: Waiman Long @ 2013-09-30 15:58 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On 09/28/2013 03:33 PM, Linus Torvalds wrote: > On Sat, Sep 28, 2013 at 12:21 PM, Ingo Molnar<mingo@kernel.org> wrote: >> If we do that then I suspect the next step will be queued rwlocks :-/ The >> current rwlock_t implementation is rather primitive by modern standards. >> (We'd probably have killed rwlock_t long ago if not for the >> tasklist_lock.) > Yeah, I'm not happy about or rwlocks. That's one lock that currently > is so broken that I think we could easily argue for making that one > queued. > > Waiman had a qrwlock series that looked reasonable, and I think his > later versions were drop-in replacements (ie they automatically just > did the RightThing(tm) wrt interrupts taking a recursive read lock - I > objected to the first versions that required that to be stated > explicitly). The latest version (v3) will allow recursive read lock in interrupt handlers. > I think Waiman's patches (even the later ones) made the queued rwlocks > be a side-by-side implementation with the old rwlocks, and I think > that was just being unnecessarily careful. It might be useful for > testing to have a config option to switch between the two, but we > might as well go all the way. It is not actually a side-by-side implementation. A user can choose either regular rwlock or the queue one, but never both by setting a configuration parameter. However, I now think that maybe we should do it the lockref way by pre-determining it on a per-architecture level without user visible configuration option. -Longman ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-30 15:58 ` Waiman Long @ 2013-10-01 7:33 ` Ingo Molnar 2013-10-01 20:03 ` Waiman Long 0 siblings, 1 reply; 55+ messages in thread From: Ingo Molnar @ 2013-10-01 7:33 UTC (permalink / raw) To: Waiman Long Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Waiman Long <waiman.long@hp.com> wrote: > > I think Waiman's patches (even the later ones) made the queued rwlocks > > be a side-by-side implementation with the old rwlocks, and I think > > that was just being unnecessarily careful. It might be useful for > > testing to have a config option to switch between the two, but we > > might as well go all the way. > > It is not actually a side-by-side implementation. A user can choose > either regular rwlock or the queue one, but never both by setting a > configuration parameter. However, I now think that maybe we should do it > the lockref way by pre-determining it on a per-architecture level > without user visible configuration option. Well, as I pointed it out to you during review, such a Kconfig driven locking API choice is a no-go! What I suggested instead: there's absolutely no problem with providing a better rwlock_t implementation, backed with numbers and all that. Thanks, Ingo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-10-01 7:33 ` Ingo Molnar @ 2013-10-01 20:03 ` Waiman Long 0 siblings, 0 replies; 55+ messages in thread From: Waiman Long @ 2013-10-01 20:03 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On 10/01/2013 03:33 AM, Ingo Molnar wrote: > * Waiman Long<waiman.long@hp.com> wrote: > >>> I think Waiman's patches (even the later ones) made the queued rwlocks >>> be a side-by-side implementation with the old rwlocks, and I think >>> that was just being unnecessarily careful. It might be useful for >>> testing to have a config option to switch between the two, but we >>> might as well go all the way. >> It is not actually a side-by-side implementation. A user can choose >> either regular rwlock or the queue one, but never both by setting a >> configuration parameter. However, I now think that maybe we should do it >> the lockref way by pre-determining it on a per-architecture level >> without user visible configuration option. > Well, as I pointed it out to you during review, such a Kconfig driven > locking API choice is a no-go! > > What I suggested instead: there's absolutely no problem with providing a > better rwlock_t implementation, backed with numbers and all that. > > Thanks, > > Ingo Yes, this is what I am planning to do. The next version of my qrwlock patch will force the switch to queue rwlock for x86 architecture. The other architectures have to be done separately. -Longman ^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-28 19:21 ` Ingo Molnar 2013-09-28 19:33 ` Linus Torvalds @ 2013-09-28 19:37 ` Ingo Molnar 2013-09-28 19:43 ` Linus Torvalds 2013-09-30 8:52 ` [PATCH] " Andrea Arcangeli 1 sibling, 2 replies; 55+ messages in thread From: Ingo Molnar @ 2013-09-28 19:37 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Ingo Molnar <mingo@kernel.org> wrote: > If we do that then I suspect the next step will be queued rwlocks :-/ > The current rwlock_t implementation is rather primitive by modern > standards. (We'd probably have killed rwlock_t long ago if not for the > tasklist_lock.) > > But yeah, it would work and conceptually a hard spinlock fits something > as lowlevel as the anon-vma lock. > > I did a quick review pass and it appears nothing obvious is scheduling > with the anon-vma lock held. If it did in a non-obvious way it's likely > a bug anyway. The hugepage code grew a lot of logic running under the > anon-vma lock, but it all seems atomic. > > So a conversion to rwlock_t could be attempted. (It should be relatively > easy patch as well, because the locking operation is now nicely > abstracted out.) Here's a totally untested patch to convert the anon vma lock to an rwlock_t. I think its lack of modern queueing will hurt on big systems big time - it might even regress. But ... it's hard to tell such things in advance. [ That might as well be for the better as it will eventually be fixed, which in turn will improve tasklist_lock workloads ;-) ] Thanks, Ingo -------------> Subject: anon_vmas: Convert the rwsem to an rwlock_t From: Ingo Molnar <mingo@kernel.org> -- include/linux/mmu_notifier.h | 2 +- include/linux/rmap.h | 19 +++++++++---------- mm/huge_memory.c | 4 ++-- mm/mmap.c | 10 +++++----- mm/rmap.c | 24 ++++++++++++------------ 5 files changed, 29 insertions(+), 30 deletions(-) diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index deca874..628e807 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -151,7 +151,7 @@ struct mmu_notifier_ops { * Therefore notifier chains can only be traversed when either * * 1. mmap_sem is held. - * 2. One of the reverse map locks is held (i_mmap_mutex or anon_vma->rwsem). + * 2. One of the reverse map locks is held (i_mmap_mutex or anon_vma->rwlock). * 3. 
No other concurrent thread can access the list (release) */ struct mmu_notifier { diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 6dacb93..f4ab929 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -7,7 +7,7 @@ #include <linux/list.h> #include <linux/slab.h> #include <linux/mm.h> -#include <linux/rwsem.h> +#include <linux/rwlock.h> #include <linux/memcontrol.h> /* @@ -26,7 +26,7 @@ */ struct anon_vma { struct anon_vma *root; /* Root of this anon_vma tree */ - struct rw_semaphore rwsem; /* W: modification, R: walking the list */ + rwlock_t rwlock; /* W: modification, R: walking the list */ /* * The refcount is taken on an anon_vma when there is no * guarantee that the vma of page tables will exist for @@ -64,7 +64,7 @@ struct anon_vma_chain { struct vm_area_struct *vma; struct anon_vma *anon_vma; struct list_head same_vma; /* locked by mmap_sem & page_table_lock */ - struct rb_node rb; /* locked by anon_vma->rwsem */ + struct rb_node rb; /* locked by anon_vma->rwlock */ unsigned long rb_subtree_last; #ifdef CONFIG_DEBUG_VM_RB unsigned long cached_vma_start, cached_vma_last; @@ -108,37 +108,36 @@ static inline void vma_lock_anon_vma(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) - down_write(&anon_vma->root->rwsem); + write_lock(&anon_vma->root->rwlock); } static inline void vma_unlock_anon_vma(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) - up_write(&anon_vma->root->rwsem); + write_unlock(&anon_vma->root->rwlock); } static inline void anon_vma_lock_write(struct anon_vma *anon_vma) { - down_write(&anon_vma->root->rwsem); + write_lock(&anon_vma->root->rwlock); } static inline void anon_vma_unlock_write(struct anon_vma *anon_vma) { - up_write(&anon_vma->root->rwsem); + write_unlock(&anon_vma->root->rwlock); } static inline void anon_vma_lock_read(struct anon_vma *anon_vma) { - down_read(&anon_vma->root->rwsem); + read_unlock(&anon_vma->root->rwlock); } static inline void anon_vma_unlock_read(struct anon_vma *anon_vma) { - up_read(&anon_vma->root->rwsem); + read_unlock(&anon_vma->root->rwlock); } - /* * anon_vma helper functions. */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 7489884..78f6c08 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1542,7 +1542,7 @@ static int __split_huge_page_splitting(struct page *page, * We can't temporarily set the pmd to null in order * to split it, the pmd must remain marked huge at all * times or the VM won't take the pmd_trans_huge paths - * and it won't wait on the anon_vma->root->rwsem to + * and it won't wait on the anon_vma->root->rwlock to * serialize against split_huge_page*. */ pmdp_splitting_flush(vma, address, pmd); @@ -1747,7 +1747,7 @@ static int __split_huge_page_map(struct page *page, return ret; } -/* must be called with anon_vma->root->rwsem held */ +/* must be called with anon_vma->root->rwlock held */ static void __split_huge_page(struct page *page, struct anon_vma *anon_vma, struct list_head *list) diff --git a/mm/mmap.c b/mm/mmap.c index 9d54851..25ce233 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2955,15 +2955,15 @@ static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma) * The LSB of head.next can't change from under us * because we hold the mm_all_locks_mutex. */ - down_write_nest_lock(&anon_vma->root->rwsem, &mm->mmap_sem); + down_write_nest_lock(&anon_vma->root->rwlock, &mm->mmap_sem); /* * We can safely modify head.next after taking the - * anon_vma->root->rwsem. 
If some other vma in this mm shares + * anon_vma->root->rwlock. If some other vma in this mm shares * the same anon_vma we won't take it again. * * No need of atomic instructions here, head.next * can't change from under us thanks to the - * anon_vma->root->rwsem. + * anon_vma->root->rwlock. */ if (__test_and_set_bit(0, (unsigned long *) &anon_vma->root->rb_root.rb_node)) @@ -3012,7 +3012,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping) * vma in this mm is backed by the same anon_vma or address_space. * * We can take all the locks in random order because the VM code - * taking i_mmap_mutex or anon_vma->rwsem outside the mmap_sem never + * taking i_mmap_mutex or anon_vma->rwslockoutside the mmap_sem never * takes more than one of them in a row. Secondly we're protected * against a concurrent mm_take_all_locks() by the mm_all_locks_mutex. * @@ -3065,7 +3065,7 @@ static void vm_unlock_anon_vma(struct anon_vma *anon_vma) * * No need of atomic instructions here, head.next * can't change from under us until we release the - * anon_vma->root->rwsem. + * anon_vma->root->rwlock. */ if (!__test_and_clear_bit(0, (unsigned long *) &anon_vma->root->rb_root.rb_node)) diff --git a/mm/rmap.c b/mm/rmap.c index fd3ee7a..d101133 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -24,7 +24,7 @@ * mm->mmap_sem * page->flags PG_locked (lock_page) * mapping->i_mmap_mutex - * anon_vma->rwsem + * anon_vma->rwlock * mm->page_table_lock or pte_lock * zone->lru_lock (in mark_page_accessed, isolate_lru_page) * swap_lock (in swap_duplicate, swap_info_get) @@ -37,7 +37,7 @@ * in arch-dependent flush_dcache_mmap_lock, * within bdi.wb->list_lock in __sync_single_inode) * - * anon_vma->rwsem,mapping->i_mutex (memory_failure, collect_procs_anon) + * anon_vma->rwlock,mapping->i_mutex (memory_failure, collect_procs_anon) * ->tasklist_lock * pte map lock */ @@ -98,12 +98,12 @@ static inline void anon_vma_free(struct anon_vma *anon_vma) * page_lock_anon_vma_read() VS put_anon_vma() * down_read_trylock() atomic_dec_and_test() * LOCK MB - * atomic_read() rwsem_is_locked() + * atomic_read() rwlock_is_locked() * * LOCK should suffice since the actual taking of the lock must * happen _before_ what follows. */ - if (rwsem_is_locked(&anon_vma->root->rwsem)) { + if (write_can_lock(&anon_vma->root->rwlock)) { anon_vma_lock_write(anon_vma); anon_vma_unlock_write(anon_vma); } @@ -219,9 +219,9 @@ static inline struct anon_vma *lock_anon_vma_root(struct anon_vma *root, struct struct anon_vma *new_root = anon_vma->root; if (new_root != root) { if (WARN_ON_ONCE(root)) - up_write(&root->rwsem); + write_unlock(&root->rwlock); root = new_root; - down_write(&root->rwsem); + write_lock(&root->rwlock); } return root; } @@ -229,7 +229,7 @@ static inline struct anon_vma *lock_anon_vma_root(struct anon_vma *root, struct static inline void unlock_anon_vma_root(struct anon_vma *root) { if (root) - up_write(&root->rwsem); + write_unlock(&root->rwlock); } /* @@ -349,7 +349,7 @@ void unlink_anon_vmas(struct vm_area_struct *vma) /* * Iterate the list once more, it now only contains empty and unlinked * anon_vmas, destroy them. Could not do before due to __put_anon_vma() - * needing to write-acquire the anon_vma->root->rwsem. + * needing to write-acquire the anon_vma->root->rwlock. 
*/ list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) { struct anon_vma *anon_vma = avc->anon_vma; @@ -365,7 +365,7 @@ static void anon_vma_ctor(void *data) { struct anon_vma *anon_vma = data; - init_rwsem(&anon_vma->rwsem); + rwlock_init(&anon_vma->rwlock); atomic_set(&anon_vma->refcount, 0); anon_vma->rb_root = RB_ROOT; } @@ -457,14 +457,14 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page) anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); root_anon_vma = ACCESS_ONCE(anon_vma->root); - if (down_read_trylock(&root_anon_vma->rwsem)) { + if (read_trylock(&root_anon_vma->rwlock)) { /* * If the page is still mapped, then this anon_vma is still * its anon_vma, and holding the mutex ensures that it will * not go away, see anon_vma_free(). */ if (!page_mapped(page)) { - up_read(&root_anon_vma->rwsem); + read_unlock(&root_anon_vma->rwlock); anon_vma = NULL; } goto out; @@ -1293,7 +1293,7 @@ out_mlock: /* * We need mmap_sem locking, Otherwise VM_LOCKED check makes * unstable result and race. Plus, We can't wait here because - * we now hold anon_vma->rwsem or mapping->i_mmap_mutex. + * we now hold anon_vma->rwlock or mapping->i_mmap_mutex. * if trylock failed, the page remain in evictable lru and later * vmscan could retry to move the page to unevictable lru if the * page is actually mlocked. ^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [PATCH] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-28 19:37 ` [PATCH] anon_vmas: Convert the rwsem to an rwlock_t Ingo Molnar @ 2013-09-28 19:43 ` Linus Torvalds 2013-09-28 19:46 ` Ingo Molnar 2013-09-28 19:52 ` [PATCH, v2] " Ingo Molnar 2013-09-30 8:52 ` [PATCH] " Andrea Arcangeli 1 sibling, 2 replies; 55+ messages in thread From: Linus Torvalds @ 2013-09-28 19:43 UTC (permalink / raw) To: Ingo Molnar Cc: Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sat, Sep 28, 2013 at 12:37 PM, Ingo Molnar <mingo@kernel.org> wrote: > > - down_write_nest_lock(&anon_vma->root->rwsem, &mm->mmap_sem); > + down_write_nest_lock(&anon_vma->root->rwlock, &mm->mmap_sem); That's just completely bogus, and cannot work. Maybe just a "write_lock(&anon_vma->root->rwlock)" (which is just anon_vma_unlock_write(anon_vma)). But I think we might have a lockdep issue. I'm not quite sure what's up with the nesting there. > - if (rwsem_is_locked(&anon_vma->root->rwsem)) { > + if (write_can_lock(&anon_vma->root->rwlock)) { > anon_vma_lock_write(anon_vma); > anon_vma_unlock_write(anon_vma); > } That's the wrong way around. It should be if (!write_can_lock(&anon_vma->root->rwlock)) { so some more testing definitely needed. Linus ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-28 19:43 ` Linus Torvalds @ 2013-09-28 19:46 ` Ingo Molnar 2013-09-28 19:52 ` [PATCH, v2] " Ingo Molnar 1 sibling, 0 replies; 55+ messages in thread From: Ingo Molnar @ 2013-09-28 19:46 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sat, Sep 28, 2013 at 12:37 PM, Ingo Molnar <mingo@kernel.org> wrote: > > > > - down_write_nest_lock(&anon_vma->root->rwsem, &mm->mmap_sem); > > + down_write_nest_lock(&anon_vma->root->rwlock, &mm->mmap_sem); > > That's just completely bogus, and cannot work. Told you it's totally untested :-) Found that build failure a few minutes ago (the place escaped my search pattern), I'm trying the fix below. Ingo diff --git a/mm/mmap.c b/mm/mmap.c index 25ce233..7ee85bf 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2955,7 +2955,7 @@ static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma) * The LSB of head.next can't change from under us * because we hold the mm_all_locks_mutex. */ - down_write_nest_lock(&anon_vma->root->rwlock, &mm->mmap_sem); + write_lock(&anon_vma->root->rwlock); /* * We can safely modify head.next after taking the * anon_vma->root->rwlock. If some other vma in this mm shares ^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-28 19:43 ` Linus Torvalds 2013-09-28 19:46 ` Ingo Molnar @ 2013-09-28 19:52 ` Ingo Molnar 2013-09-30 11:00 ` Peter Zijlstra ` (3 more replies) 1 sibling, 4 replies; 55+ messages in thread From: Ingo Molnar @ 2013-09-28 19:52 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sat, Sep 28, 2013 at 12:37 PM, Ingo Molnar <mingo@kernel.org> wrote: > > > > - down_write_nest_lock(&anon_vma->root->rwsem, &mm->mmap_sem); > > + down_write_nest_lock(&anon_vma->root->rwlock, &mm->mmap_sem); > > That's just completely bogus, and cannot work. > > Maybe just a "write_lock(&anon_vma->root->rwlock)" (which is just > anon_vma_unlock_write(anon_vma)). But I think we might have a lockdep > issue. I'm not quite sure what's up with the nesting there. > > > - if (rwsem_is_locked(&anon_vma->root->rwsem)) { > > + if (write_can_lock(&anon_vma->root->rwlock)) { > > anon_vma_lock_write(anon_vma); > > anon_vma_unlock_write(anon_vma); > > } > > That's the wrong way around. It should be > > if (!write_can_lock(&anon_vma->root->rwlock)) { > > so some more testing definitely needed. Yeah, that silly API asymmetry has bitten me before as well :-/ The attached version booted up fine under 16-way KVM: sh-4.2# uptime 19:50:08 up 0 min, 0 users, load average: 0.00, 0.00, 0.00 That's all the testing it will get this evening though. Patch should be good enough for Tim to try? Thanks, Ingo ====================> Subject: anon_vmas: Convert the rwsem to an rwlock_t From: Ingo Molnar <mingo@kernel.org> Date: Sat, 28 Sep 2013 21:37:39 +0200 Here's a an almost totally untested patch to convert the anon vma lock to an rwlock_t. I think its lack of modern queueing will hurt on big systems big time - it might even regress. But ... it's hard to tell such things in advance. [ That might as well be for the better as it will eventually be fixed, which in turn will improve tasklist_lock workloads ;-) ] -- include/linux/mmu_notifier.h | 2 +- include/linux/rmap.h | 19 +++++++++---------- mm/huge_memory.c | 4 ++-- mm/mmap.c | 10 +++++----- mm/rmap.c | 24 ++++++++++++------------ 5 files changed, 29 insertions(+), 30 deletions(-) Signed-off-by: Ingo Molnar <mingo@kernel.org> Index: tip/include/linux/mmu_notifier.h =================================================================== --- tip.orig/include/linux/mmu_notifier.h +++ tip/include/linux/mmu_notifier.h @@ -151,7 +151,7 @@ struct mmu_notifier_ops { * Therefore notifier chains can only be traversed when either * * 1. mmap_sem is held. - * 2. One of the reverse map locks is held (i_mmap_mutex or anon_vma->rwsem). + * 2. One of the reverse map locks is held (i_mmap_mutex or anon_vma->rwlock). * 3. 
No other concurrent thread can access the list (release) */ struct mmu_notifier { Index: tip/include/linux/rmap.h =================================================================== --- tip.orig/include/linux/rmap.h +++ tip/include/linux/rmap.h @@ -7,7 +7,7 @@ #include <linux/list.h> #include <linux/slab.h> #include <linux/mm.h> -#include <linux/rwsem.h> +#include <linux/rwlock.h> #include <linux/memcontrol.h> /* @@ -26,7 +26,7 @@ */ struct anon_vma { struct anon_vma *root; /* Root of this anon_vma tree */ - struct rw_semaphore rwsem; /* W: modification, R: walking the list */ + rwlock_t rwlock; /* W: modification, R: walking the list */ /* * The refcount is taken on an anon_vma when there is no * guarantee that the vma of page tables will exist for @@ -64,7 +64,7 @@ struct anon_vma_chain { struct vm_area_struct *vma; struct anon_vma *anon_vma; struct list_head same_vma; /* locked by mmap_sem & page_table_lock */ - struct rb_node rb; /* locked by anon_vma->rwsem */ + struct rb_node rb; /* locked by anon_vma->rwlock */ unsigned long rb_subtree_last; #ifdef CONFIG_DEBUG_VM_RB unsigned long cached_vma_start, cached_vma_last; @@ -108,37 +108,36 @@ static inline void vma_lock_anon_vma(str { struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) - down_write(&anon_vma->root->rwsem); + write_lock(&anon_vma->root->rwlock); } static inline void vma_unlock_anon_vma(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) - up_write(&anon_vma->root->rwsem); + write_unlock(&anon_vma->root->rwlock); } static inline void anon_vma_lock_write(struct anon_vma *anon_vma) { - down_write(&anon_vma->root->rwsem); + write_lock(&anon_vma->root->rwlock); } static inline void anon_vma_unlock_write(struct anon_vma *anon_vma) { - up_write(&anon_vma->root->rwsem); + write_unlock(&anon_vma->root->rwlock); } static inline void anon_vma_lock_read(struct anon_vma *anon_vma) { - down_read(&anon_vma->root->rwsem); + read_unlock(&anon_vma->root->rwlock); } static inline void anon_vma_unlock_read(struct anon_vma *anon_vma) { - up_read(&anon_vma->root->rwsem); + read_unlock(&anon_vma->root->rwlock); } - /* * anon_vma helper functions. */ Index: tip/mm/huge_memory.c =================================================================== --- tip.orig/mm/huge_memory.c +++ tip/mm/huge_memory.c @@ -1542,7 +1542,7 @@ static int __split_huge_page_splitting(s * We can't temporarily set the pmd to null in order * to split it, the pmd must remain marked huge at all * times or the VM won't take the pmd_trans_huge paths - * and it won't wait on the anon_vma->root->rwsem to + * and it won't wait on the anon_vma->root->rwlock to * serialize against split_huge_page*. */ pmdp_splitting_flush(vma, address, pmd); @@ -1747,7 +1747,7 @@ static int __split_huge_page_map(struct return ret; } -/* must be called with anon_vma->root->rwsem held */ +/* must be called with anon_vma->root->rwlock held */ static void __split_huge_page(struct page *page, struct anon_vma *anon_vma, struct list_head *list) Index: tip/mm/mmap.c =================================================================== --- tip.orig/mm/mmap.c +++ tip/mm/mmap.c @@ -2955,15 +2955,15 @@ static void vm_lock_anon_vma(struct mm_s * The LSB of head.next can't change from under us * because we hold the mm_all_locks_mutex. */ - down_write_nest_lock(&anon_vma->root->rwsem, &mm->mmap_sem); + write_lock(&anon_vma->root->rwlock); /* * We can safely modify head.next after taking the - * anon_vma->root->rwsem. 
If some other vma in this mm shares + * anon_vma->root->rwlock. If some other vma in this mm shares * the same anon_vma we won't take it again. * * No need of atomic instructions here, head.next * can't change from under us thanks to the - * anon_vma->root->rwsem. + * anon_vma->root->rwlock. */ if (__test_and_set_bit(0, (unsigned long *) &anon_vma->root->rb_root.rb_node)) @@ -3012,7 +3012,7 @@ static void vm_lock_mapping(struct mm_st * vma in this mm is backed by the same anon_vma or address_space. * * We can take all the locks in random order because the VM code - * taking i_mmap_mutex or anon_vma->rwsem outside the mmap_sem never + * taking i_mmap_mutex or anon_vma->rwslockoutside the mmap_sem never * takes more than one of them in a row. Secondly we're protected * against a concurrent mm_take_all_locks() by the mm_all_locks_mutex. * @@ -3065,7 +3065,7 @@ static void vm_unlock_anon_vma(struct an * * No need of atomic instructions here, head.next * can't change from under us until we release the - * anon_vma->root->rwsem. + * anon_vma->root->rwlock. */ if (!__test_and_clear_bit(0, (unsigned long *) &anon_vma->root->rb_root.rb_node)) Index: tip/mm/rmap.c =================================================================== --- tip.orig/mm/rmap.c +++ tip/mm/rmap.c @@ -24,7 +24,7 @@ * mm->mmap_sem * page->flags PG_locked (lock_page) * mapping->i_mmap_mutex - * anon_vma->rwsem + * anon_vma->rwlock * mm->page_table_lock or pte_lock * zone->lru_lock (in mark_page_accessed, isolate_lru_page) * swap_lock (in swap_duplicate, swap_info_get) @@ -37,7 +37,7 @@ * in arch-dependent flush_dcache_mmap_lock, * within bdi.wb->list_lock in __sync_single_inode) * - * anon_vma->rwsem,mapping->i_mutex (memory_failure, collect_procs_anon) + * anon_vma->rwlock,mapping->i_mutex (memory_failure, collect_procs_anon) * ->tasklist_lock * pte map lock */ @@ -98,12 +98,12 @@ static inline void anon_vma_free(struct * page_lock_anon_vma_read() VS put_anon_vma() * down_read_trylock() atomic_dec_and_test() * LOCK MB - * atomic_read() rwsem_is_locked() + * atomic_read() rwlock_is_locked() * * LOCK should suffice since the actual taking of the lock must * happen _before_ what follows. */ - if (rwsem_is_locked(&anon_vma->root->rwsem)) { + if (!write_can_lock(&anon_vma->root->rwlock)) { anon_vma_lock_write(anon_vma); anon_vma_unlock_write(anon_vma); } @@ -219,9 +219,9 @@ static inline struct anon_vma *lock_anon struct anon_vma *new_root = anon_vma->root; if (new_root != root) { if (WARN_ON_ONCE(root)) - up_write(&root->rwsem); + write_unlock(&root->rwlock); root = new_root; - down_write(&root->rwsem); + write_lock(&root->rwlock); } return root; } @@ -229,7 +229,7 @@ static inline struct anon_vma *lock_anon static inline void unlock_anon_vma_root(struct anon_vma *root) { if (root) - up_write(&root->rwsem); + write_unlock(&root->rwlock); } /* @@ -349,7 +349,7 @@ void unlink_anon_vmas(struct vm_area_str /* * Iterate the list once more, it now only contains empty and unlinked * anon_vmas, destroy them. Could not do before due to __put_anon_vma() - * needing to write-acquire the anon_vma->root->rwsem. + * needing to write-acquire the anon_vma->root->rwlock. 
*/ list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) { struct anon_vma *anon_vma = avc->anon_vma; @@ -365,7 +365,7 @@ static void anon_vma_ctor(void *data) { struct anon_vma *anon_vma = data; - init_rwsem(&anon_vma->rwsem); + rwlock_init(&anon_vma->rwlock); atomic_set(&anon_vma->refcount, 0); anon_vma->rb_root = RB_ROOT; } @@ -457,14 +457,14 @@ struct anon_vma *page_lock_anon_vma_read anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); root_anon_vma = ACCESS_ONCE(anon_vma->root); - if (down_read_trylock(&root_anon_vma->rwsem)) { + if (read_trylock(&root_anon_vma->rwlock)) { /* * If the page is still mapped, then this anon_vma is still * its anon_vma, and holding the mutex ensures that it will * not go away, see anon_vma_free(). */ if (!page_mapped(page)) { - up_read(&root_anon_vma->rwsem); + read_unlock(&root_anon_vma->rwlock); anon_vma = NULL; } goto out; @@ -1293,7 +1293,7 @@ out_mlock: /* * We need mmap_sem locking, Otherwise VM_LOCKED check makes * unstable result and race. Plus, We can't wait here because - * we now hold anon_vma->rwsem or mapping->i_mmap_mutex. + * we now hold anon_vma->rwlock or mapping->i_mmap_mutex. * if trylock failed, the page remain in evictable lru and later * vmscan could retry to move the page to unevictable lru if the * page is actually mlocked. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-28 19:52 ` [PATCH, v2] " Ingo Molnar @ 2013-09-30 11:00 ` Peter Zijlstra 2013-09-30 11:44 ` Peter Zijlstra ` (2 subsequent siblings) 3 siblings, 0 replies; 55+ messages in thread From: Peter Zijlstra @ 2013-09-30 11:00 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sat, Sep 28, 2013 at 09:52:07PM +0200, Ingo Molnar wrote: > Index: tip/mm/rmap.c > =================================================================== > --- tip.orig/mm/rmap.c > +++ tip/mm/rmap.c > @@ -98,12 +98,12 @@ static inline void anon_vma_free(struct > * page_lock_anon_vma_read() VS put_anon_vma() > * down_read_trylock() atomic_dec_and_test() > * LOCK MB > - * atomic_read() rwsem_is_locked() > + * atomic_read() rwlock_is_locked() > * > * LOCK should suffice since the actual taking of the lock must > * happen _before_ what follows. > */ > - if (rwsem_is_locked(&anon_vma->root->rwsem)) { > + if (!write_can_lock(&anon_vma->root->rwlock)) { > anon_vma_lock_write(anon_vma); > anon_vma_unlock_write(anon_vma); > } > @@ -457,14 +457,14 @@ struct anon_vma *page_lock_anon_vma_read > > anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); > root_anon_vma = ACCESS_ONCE(anon_vma->root); > - if (down_read_trylock(&root_anon_vma->rwsem)) { > + if (read_trylock(&root_anon_vma->rwlock)) { > /* > * If the page is still mapped, then this anon_vma is still > * its anon_vma, and holding the mutex ensures that it will > * not go away, see anon_vma_free(). > */ > if (!page_mapped(page)) { > - up_read(&root_anon_vma->rwsem); > + read_unlock(&root_anon_vma->rwlock); > anon_vma = NULL; > } > goto out; > @@ -1293,7 +1293,7 @@ out_mlock: > /* > * We need mmap_sem locking, Otherwise VM_LOCKED check makes > * unstable result and race. Plus, We can't wait here because > - * we now hold anon_vma->rwsem or mapping->i_mmap_mutex. > + * we now hold anon_vma->rwlock or mapping->i_mmap_mutex. > * if trylock failed, the page remain in evictable lru and later > * vmscan could retry to move the page to unevictable lru if the > * page is actually mlocked. You can remove all that -- all that trickery was only needed because the lock could sleep; --- --- a/mm/rmap.c +++ b/mm/rmap.c @@ -85,29 +85,6 @@ static inline struct anon_vma *anon_vma_ static inline void anon_vma_free(struct anon_vma *anon_vma) { VM_BUG_ON(atomic_read(&anon_vma->refcount)); - - /* - * Synchronize against page_lock_anon_vma_read() such that - * we can safely hold the lock without the anon_vma getting - * freed. - * - * Relies on the full mb implied by the atomic_dec_and_test() from - * put_anon_vma() against the acquire barrier implied by - * down_read_trylock() from page_lock_anon_vma_read(). This orders: - * - * page_lock_anon_vma_read() VS put_anon_vma() - * down_read_trylock() atomic_dec_and_test() - * LOCK MB - * atomic_read() rwlock_is_locked() - * - * LOCK should suffice since the actual taking of the lock must - * happen _before_ what follows. 
- */ - if (!write_can_lock(&anon_vma->root->rwlock)) { - anon_vma_lock_write(anon_vma); - anon_vma_unlock_write(anon_vma); - } - kmem_cache_free(anon_vma_cachep, anon_vma); } @@ -437,15 +414,10 @@ struct anon_vma *page_get_anon_vma(struc /* * Similar to page_get_anon_vma() except it locks the anon_vma. - * - * Its a little more complex as it tries to keep the fast path to a single - * atomic op -- the trylock. If we fail the trylock, we fall back to getting a - * reference like with page_get_anon_vma() and then block on the mutex. */ struct anon_vma *page_lock_anon_vma_read(struct page *page) { struct anon_vma *anon_vma = NULL; - struct anon_vma *root_anon_vma; unsigned long anon_mapping; rcu_read_lock(); @@ -456,51 +428,22 @@ struct anon_vma *page_lock_anon_vma_read goto out; anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); - root_anon_vma = ACCESS_ONCE(anon_vma->root); - if (read_trylock(&root_anon_vma->rwlock)) { - /* - * If the page is still mapped, then this anon_vma is still - * its anon_vma, and holding the mutex ensures that it will - * not go away, see anon_vma_free(). - */ - if (!page_mapped(page)) { - read_unlock(&root_anon_vma->rwlock); - anon_vma = NULL; - } - goto out; - } - - /* trylock failed, we got to sleep */ - if (!atomic_inc_not_zero(&anon_vma->refcount)) { - anon_vma = NULL; - goto out; - } - - if (!page_mapped(page)) { - put_anon_vma(anon_vma); - anon_vma = NULL; - goto out; - } - - /* we pinned the anon_vma, its safe to sleep */ - rcu_read_unlock(); anon_vma_lock_read(anon_vma); - if (atomic_dec_and_test(&anon_vma->refcount)) { - /* - * Oops, we held the last refcount, release the lock - * and bail -- can't simply use put_anon_vma() because - * we'll deadlock on the anon_vma_lock_write() recursion. - */ + /* + * If this page is still mapped, then its anon_vma cannot have been + * freed. But if it has been unmapped, we have no security against the + * anon_vma structure being freed and reused (for another anon_vma: + * SLAB_DESTROY_BY_RCU guarantees that - so the atomic_inc_not_zero() + * above cannot corrupt). + */ + if (!page_mapped(page)) { anon_vma_unlock_read(anon_vma); - __put_anon_vma(anon_vma); anon_vma = NULL; } - - return anon_vma; - out: rcu_read_unlock(); + return anon_vma; } ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-28 19:52 ` [PATCH, v2] " Ingo Molnar 2013-09-30 11:00 ` Peter Zijlstra @ 2013-09-30 11:44 ` Peter Zijlstra 2013-09-30 17:03 ` Andrew Morton 2013-09-30 17:10 ` Tim Chen 3 siblings, 0 replies; 55+ messages in thread From: Peter Zijlstra @ 2013-09-30 11:44 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sat, Sep 28, 2013 at 09:52:07PM +0200, Ingo Molnar wrote: > Index: tip/mm/mmap.c > =================================================================== > --- tip.orig/mm/mmap.c > +++ tip/mm/mmap.c > @@ -2955,15 +2955,15 @@ static void vm_lock_anon_vma(struct mm_s > * The LSB of head.next can't change from under us > * because we hold the mm_all_locks_mutex. > */ > - down_write_nest_lock(&anon_vma->root->rwsem, &mm->mmap_sem); > + write_lock(&anon_vma->root->rwlock); Tssk, you should know better... But yes; this is the most horrid site; however I think it was mostly file based VMAs that caused the immense amount of locks taken since this is only done 'very' early on in the life of KVM. I can't remember how early; I hope almost instantly and certainly before spawning all the various vcpu threads as that would create all the thread stacks which are anon and would indeed blow up the non-preempt time here. So if we keep i_mmap_mutex -- and I hope we do; that unmap_mutex horror wasn't pretty at all -- see commit 97a894136 . The below shouldn't be too bad I think/hope. --- --- a/include/linux/rwlock.h +++ b/include/linux/rwlock.h @@ -64,6 +64,19 @@ do { \ #define write_lock(lock) _raw_write_lock(lock) #define read_lock(lock) _raw_read_lock(lock) +#ifdef CONFIG_DEBUG_LOCK_ALLOC +# define write_lock_nest_lock(lock, nest_lock) \ +do { \ + typecheck(struct lockdep_map *, &(nest_lock)->dep_map); \ + preempt_disable(); \ + lock_acquire_exclusive(&lock->dep_map, 0, 0, \ + &(nest_lock)->dep_map, _RET_IP_); \ + LOCK_CONTENDED(lock, do_raw_write_trylock, do_raw_write_lock); \ +} while (0) +#else +# define write_lock_nest_lock(lock, nest_lock) write_lock(lock) +#endif + #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) #define read_lock_irqsave(lock, flags) \ --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2955,7 +2955,7 @@ static void vm_lock_anon_vma(struct mm_s * The LSB of head.next can't change from under us * because we hold the mm_all_locks_mutex. */ - write_lock(&anon_vma->root->rwlock); + write_lock_nest_lock(&anon_vma->root->rwlock); /* * We can safely modify head.next after taking the * anon_vma->root->rwlock. If some other vma in this mm shares ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-28 19:52 ` [PATCH, v2] " Ingo Molnar 2013-09-30 11:00 ` Peter Zijlstra 2013-09-30 11:44 ` Peter Zijlstra @ 2013-09-30 17:03 ` Andrew Morton 2013-09-30 17:25 ` Linus Torvalds 2013-09-30 17:10 ` Tim Chen 3 siblings, 1 reply; 55+ messages in thread From: Andrew Morton @ 2013-09-30 17:03 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sat, 28 Sep 2013 21:52:07 +0200 Ingo Molnar <mingo@kernel.org> wrote: > Here's a an almost totally untested patch to convert the anon vma lock to > an rwlock_t. Both the anon vma lock and i_mmap_lock used to be spinlocks. Peter turned them both into sleeping locks in this series: https://lkml.org/lkml/2011/4/1/141 Later, Ingo turned the anon vma lock from an rwsem into a mutex (https://lkml.org/lkml/2012/12/1/141) to permit a scalability fix (https://lkml.org/lkml/2012/12/1/142). Let's convince ourselves that we won't be undoing things which will return to bite us? ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-30 17:03 ` Andrew Morton @ 2013-09-30 17:25 ` Linus Torvalds 0 siblings, 0 replies; 55+ messages in thread From: Linus Torvalds @ 2013-09-30 17:25 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Mon, Sep 30, 2013 at 10:03 AM, Andrew Morton <akpm@linux-foundation.org> wrote: > > Let's convince ourselves that we won't be undoing things which will > return to bite us? Umm. We call them regressions, and we fix them. As already mentioned, the original switch to a mutex didn't even have any good explanation for it, much less actual data. When we make mistakes, we'd better realize it and fix them, rather than assume that they were fixes just because they were made. The performance numbers are pretty damn compelling for this having been a major mistake. (Of course, the current performance numbers also contain the conversion to a rwlock_t rather than back to a spinlock, which may actually help wrt the original situation too) Linus ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-28 19:52 ` [PATCH, v2] " Ingo Molnar ` (2 preceding siblings ...) 2013-09-30 17:03 ` Andrew Morton @ 2013-09-30 17:10 ` Tim Chen 2013-09-30 18:14 ` Peter Zijlstra 3 siblings, 1 reply; 55+ messages in thread From: Tim Chen @ 2013-09-30 17:10 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sat, 2013-09-28 at 21:52 +0200, Ingo Molnar wrote: > * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Sat, Sep 28, 2013 at 12:37 PM, Ingo Molnar <mingo@kernel.org> wrote: > > > > > > - down_write_nest_lock(&anon_vma->root->rwsem, &mm->mmap_sem); > > > + down_write_nest_lock(&anon_vma->root->rwlock, &mm->mmap_sem); > > > > That's just completely bogus, and cannot work. > > > > Maybe just a "write_lock(&anon_vma->root->rwlock)" (which is just > > anon_vma_unlock_write(anon_vma)). But I think we might have a lockdep > > issue. I'm not quite sure what's up with the nesting there. > > > > > - if (rwsem_is_locked(&anon_vma->root->rwsem)) { > > > + if (write_can_lock(&anon_vma->root->rwlock)) { > > > anon_vma_lock_write(anon_vma); > > > anon_vma_unlock_write(anon_vma); > > > } > > > > That's the wrong way around. It should be > > > > if (!write_can_lock(&anon_vma->root->rwlock)) { > > > > so some more testing definitely needed. > > Yeah, that silly API asymmetry has bitten me before as well :-/ > > The attached version booted up fine under 16-way KVM: > > sh-4.2# uptime > 19:50:08 up 0 min, 0 users, load average: 0.00, 0.00, 0.00 > > That's all the testing it will get this evening though. Patch should be > good enough for Tim to try? Here's the exim workload data: rwsem improvment: Waimain's patch: +2.0% Alex+Tim's patchset: +4.8% Waiman+Alex+Tim: +5.3% convert rwsem to rwlock_t for root anon_vma lock Ingo's patch +11.7% Yes, changing the anon-vma root lock to rwlock_t gives the most boost. However, I think that Waiman's patches, Alex's patches and my patches can still be combined together to improve scalability of rwsem, even if we should decide to convert this particular rwsem to rwlock_t. Thanks. Tim > > Thanks, > > Ingo > ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-30 17:10 ` Tim Chen @ 2013-09-30 18:14 ` Peter Zijlstra 2013-09-30 19:23 ` Tim Chen 0 siblings, 1 reply; 55+ messages in thread From: Peter Zijlstra @ 2013-09-30 18:14 UTC (permalink / raw) To: Tim Chen Cc: Ingo Molnar, Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Mon, Sep 30, 2013 at 10:10:27AM -0700, Tim Chen wrote: > Here's the exim workload data: > > rwsem improvment: > Waimain's patch: +2.0% > Alex+Tim's patchset: +4.8% > Waiman+Alex+Tim: +5.3% > > convert rwsem to rwlock_t for root anon_vma lock > Ingo's patch +11.7% > What happens if you stuff Waiman's qrwlock patches on top of that? admittedly and oft mentioned in this thread, our current rwlock_t is somewhat suboptimal under a number of conditions. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-30 18:14 ` Peter Zijlstra @ 2013-09-30 19:23 ` Tim Chen 2013-09-30 19:35 ` Waiman Long 0 siblings, 1 reply; 55+ messages in thread From: Tim Chen @ 2013-09-30 19:23 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Mon, 2013-09-30 at 20:14 +0200, Peter Zijlstra wrote: > On Mon, Sep 30, 2013 at 10:10:27AM -0700, Tim Chen wrote: > > Here's the exim workload data: > > > > rwsem improvment: > > Waimain's patch: +2.0% > > Alex+Tim's patchset: +4.8% > > Waiman+Alex+Tim: +5.3% > > > > convert rwsem to rwlock_t for root anon_vma lock > > Ingo's patch +11.7% > > > > What happens if you stuff Waiman's qrwlock patches on top of that? > admittedly and oft mentioned in this thread, our current rwlock_t is > somewhat suboptimal under a number of conditions. I've tested with Waiman's qrwlock patches on top of Ingo's patches. It does not affect the throughput for exim and I still get about +11.7% throughput change (same as with Ingo's patch only). Tim ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-30 19:23 ` Tim Chen @ 2013-09-30 19:35 ` Waiman Long 2013-09-30 19:47 ` Tim Chen 0 siblings, 1 reply; 55+ messages in thread From: Waiman Long @ 2013-09-30 19:35 UTC (permalink / raw) To: Tim Chen Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On 09/30/2013 03:23 PM, Tim Chen wrote: > On Mon, 2013-09-30 at 20:14 +0200, Peter Zijlstra wrote: >> On Mon, Sep 30, 2013 at 10:10:27AM -0700, Tim Chen wrote: >>> Here's the exim workload data: >>> >>> rwsem improvment: >>> Waimain's patch: +2.0% >>> Alex+Tim's patchset: +4.8% >>> Waiman+Alex+Tim: +5.3% >>> >>> convert rwsem to rwlock_t for root anon_vma lock >>> Ingo's patch +11.7% >>> >> What happens if you stuff Waiman's qrwlock patches on top of that? >> admittedly and oft mentioned in this thread, our current rwlock_t is >> somewhat suboptimal under a number of conditions. > I've tested with Waiman's qrwlock patches on top of Ingo's patches. > It does not affect the throughput for exim and I still get > about +11.7% throughput change (same as with Ingo's patch only). > > Tim > My qrwlock doesn't enable qrwlock by default. You have to use menuconfig to explicitly enable it. Have you done that when you build the test kernel? I am thinking of explicitly enabling it for x86 if the anon-vma lock is converted back to a rwlock. -Longman ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-30 19:35 ` Waiman Long @ 2013-09-30 19:47 ` Tim Chen 2013-09-30 22:03 ` Tim Chen 2013-10-01 2:41 ` Waiman Long 0 siblings, 2 replies; 55+ messages in thread From: Tim Chen @ 2013-09-30 19:47 UTC (permalink / raw) To: Waiman Long Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Mon, 2013-09-30 at 15:35 -0400, Waiman Long wrote: > On 09/30/2013 03:23 PM, Tim Chen wrote: > > On Mon, 2013-09-30 at 20:14 +0200, Peter Zijlstra wrote: > >> On Mon, Sep 30, 2013 at 10:10:27AM -0700, Tim Chen wrote: > >>> Here's the exim workload data: > >>> > >>> rwsem improvment: > >>> Waimain's patch: +2.0% > >>> Alex+Tim's patchset: +4.8% > >>> Waiman+Alex+Tim: +5.3% > >>> > >>> convert rwsem to rwlock_t for root anon_vma lock > >>> Ingo's patch +11.7% > >>> > >> What happens if you stuff Waiman's qrwlock patches on top of that? > >> admittedly and oft mentioned in this thread, our current rwlock_t is > >> somewhat suboptimal under a number of conditions. > > I've tested with Waiman's qrwlock patches on top of Ingo's patches. > > It does not affect the throughput for exim and I still get > > about +11.7% throughput change (same as with Ingo's patch only). > > > > Tim > > > > My qrwlock doesn't enable qrwlock by default. You have to use menuconfig > to explicitly enable it. Have you done that when you build the test > kernel? I am thinking of explicitly enabling it for x86 if the anon-vma > lock is converted back to a rwlock. > Yes, I have explicitly enabled it during my testing. Thanks. Tim ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-30 19:47 ` Tim Chen @ 2013-09-30 22:03 ` Tim Chen 2013-10-01 2:41 ` Waiman Long 1 sibling, 0 replies; 55+ messages in thread From: Tim Chen @ 2013-09-30 22:03 UTC (permalink / raw) To: Waiman Long Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Mon, 2013-09-30 at 12:47 -0700, Tim Chen wrote: > On Mon, 2013-09-30 at 15:35 -0400, Waiman Long wrote: > > On 09/30/2013 03:23 PM, Tim Chen wrote: > > > On Mon, 2013-09-30 at 20:14 +0200, Peter Zijlstra wrote: > > >> On Mon, Sep 30, 2013 at 10:10:27AM -0700, Tim Chen wrote: > > >>> Here's the exim workload data: > > >>> > > >>> rwsem improvment: > > >>> Waimain's patch: +2.0% > > >>> Alex+Tim's patchset: +4.8% > > >>> Waiman+Alex+Tim: +5.3% > > >>> > > >>> convert rwsem to rwlock_t for root anon_vma lock > > >>> Ingo's patch +11.7% > > >>> > > >> What happens if you stuff Waiman's qrwlock patches on top of that? > > >> admittedly and oft mentioned in this thread, our current rwlock_t is > > >> somewhat suboptimal under a number of conditions. > > > I've tested with Waiman's qrwlock patches on top of Ingo's patches. > > > It does not affect the throughput for exim and I still get > > > about +11.7% throughput change (same as with Ingo's patch only). > > > > > > Tim > > > > > > > My qrwlock doesn't enable qrwlock by default. You have to use menuconfig > > to explicitly enable it. Have you done that when you build the test > > kernel? I am thinking of explicitly enabling it for x86 if the anon-vma > > lock is converted back to a rwlock. > > > > Yes, I have explicitly enabled it during my testing. > The workload I have is dominated by writer locks, with only very occasional readers. So it may not benefit from the various tweaks you have put in qrwlock. Tim ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-30 19:47 ` Tim Chen 2013-09-30 22:03 ` Tim Chen @ 2013-10-01 2:41 ` Waiman Long 1 sibling, 0 replies; 55+ messages in thread From: Waiman Long @ 2013-10-01 2:41 UTC (permalink / raw) To: Tim Chen Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On 09/30/2013 03:47 PM, Tim Chen wrote: >> My qrwlock doesn't enable qrwlock by default. You have to use menuconfig >> to explicitly enable it. Have you done that when you build the test >> kernel? I am thinking of explicitly enabling it for x86 if the anon-vma >> lock is converted back to a rwlock. >> > Yes, I have explicitly enabled it during my testing. > > Thanks. > Tim > Thank for the info. I had tested Ingo's 2nd patch myself with the qrwlock patch on a 8-node machine with a 3.12.0-rc2 kernel. The results of AIM7's high_systime (at 1500 users) were: Anon-vmas lock JPM %Change -------------- --- ------- rwsem 148265 - rwlock 238715 +61% qrwlock 242048 +63% So the queue rwlock was only slightly faster in this case. Below were the perf profile with rwlock: 18.20% reaim [kernel.kallsyms] [k] __write_lock_failed 9.36% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave 2.91% reaim [kernel.kallsyms] [k] mspin_lock 2.73% reaim [kernel.kallsyms] [k] anon_vma_interval_tree_insert 2.23% ls [kernel.kallsyms] [k] _raw_spin_lock_irqsave 1.29% reaim [kernel.kallsyms] [k] __read_lock_failed 1.21% true [kernel.kallsyms] [k] _raw_spin_lock_irqsave 1.14% reaim [kernel.kallsyms] [k] zap_pte_range 1.13% reaim [kernel.kallsyms] [k] _raw_spin_lock 1.04% reaim [kernel.kallsyms] [k] mutex_spin_on_owner The perf profile with qrwlock: 10.57% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave 7.98% reaim [kernel.kallsyms] [k] queue_write_lock_slowpath 5.83% reaim [kernel.kallsyms] [k] mspin_lock 2.86% ls [kernel.kallsyms] [k] _raw_spin_lock_irqsave 2.71% reaim [kernel.kallsyms] [k] anon_vma_interval_tree_insert 1.52% true [kernel.kallsyms] [k] _raw_spin_lock_irqsave 1.51% reaim [kernel.kallsyms] [k] queue_read_lock_slowpath 1.35% reaim [kernel.kallsyms] [k] mutex_spin_on_owner 1.12% reaim [kernel.kallsyms] [k] zap_pte_range 1.06% reaim [kernel.kallsyms] [k] perf_event_aux_ctx 1.01% reaim [kernel.kallsyms] [k] perf_event_aux In the qrwlock kernel, less time were spent in the rwlock slowpath path (about half). However, more time were now spent in the spinlock and mutex spinning. Another observation is that no noticeable idle time was reported whereas the system could be half idle with rwsem. There was also a lot less idle balancing activities. The qrwlock is fair wrt the writers. So its performance may not be as good as the fully unfair rwlock. However, queuing reduces a lot of cache contention traffic, thus improving performance. It is the interplay of these 2 factors that determine how much performance benefit we can see. Another factor to consider is that when we have less contention in anon-vmas, other areas of contentions will show up. With qrwlock, the spinlock contention was: 10.57% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave |--58.70%-- release_pages |--38.42%-- pagevec_lru_move_fn |--0.64%-- get_page_from_freelist |--0.64%-- __page_cache_release --1.60%-- [...] 
2.86% ls [kernel.kallsyms] [k] _raw_spin_lock_irqsave |--52.73%-- pagevec_lru_move_fn |--46.25%-- release_pages --1.02%-- [...] 1.52% true [kernel.kallsyms] [k] _raw_spin_lock_irqsave |--53.76%-- pagevec_lru_move_fn |--43.95%-- release_pages |--1.02%-- __page_cache_release -Longman ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-28 19:37 ` [PATCH] anon_vmas: Convert the rwsem to an rwlock_t Ingo Molnar 2013-09-28 19:43 ` Linus Torvalds @ 2013-09-30 8:52 ` Andrea Arcangeli 2013-09-30 14:40 ` Jerome Glisse ` (2 more replies) 1 sibling, 3 replies; 55+ messages in thread From: Andrea Arcangeli @ 2013-09-30 8:52 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J, Haggai Eran, Sagi Grimberg, Or Gerlitz, Jerome Glisse Hi everyone, On Sat, Sep 28, 2013 at 09:37:39PM +0200, Ingo Molnar wrote: > > * Ingo Molnar <mingo@kernel.org> wrote: > > > If we do that then I suspect the next step will be queued rwlocks :-/ > > The current rwlock_t implementation is rather primitive by modern > > standards. (We'd probably have killed rwlock_t long ago if not for the > > tasklist_lock.) > > > > But yeah, it would work and conceptually a hard spinlock fits something > > as lowlevel as the anon-vma lock. > > > > I did a quick review pass and it appears nothing obvious is scheduling > > with the anon-vma lock held. If it did in a non-obvious way it's likely > > a bug anyway. The hugepage code grew a lot of logic running under the > > anon-vma lock, but it all seems atomic. > > > > So a conversion to rwlock_t could be attempted. (It should be relatively > > easy patch as well, because the locking operation is now nicely > > abstracted out.) > > Here's a totally untested patch to convert the anon vma lock to an > rwlock_t. Sorry having to break the party but the sleepable locks for anon_vma and i_mmap_mutex are now requirement for the "pageable RDMA" effort recently achieved upstream by mellanox with the MMU notifier. And as far as I can tell that's the only single good reason for why those locks shouldn't be spinlocks (otherwise I would have also complianed at the time of that conversion, the original regression was known, ask Andi). After the lock conversion it took a while to fix all other minor bits to make mmu notifier methods fully sleepable. The problem with the spinlocks is that in the rmap code (like try_to_unmap) we need to call mmu_notifier_invalidate_page with an "mm" as parameter, and the callee assumes the "mm" won't go away under it. The other second requirement is that the page cannot be freed until we call the mmu_notifier_invalidate_page (secondary MMU is ok to still access the page after the linux pte has been dropped and the TLB flushed). In the rmap code the only things that keep things afloat is either the anon_vma lock or the i_mmap_mutex so it is quite tricky to drop that lock while keeping "mm" and "page" both afloat for the invalidate post-anon-vma-unlock. Maybe there are ways to makes that safe? (reference counting, trylocking the mmap_sem). But there isn't a very strightforward way to do that. It isn't just mellanox drivers: originally SGI XPMEM driver also needed to schedule in those methods (then they figured how to get away in only scheduling in the mmu_notifier range calls but I suppose they prefer to be able to schedule in all invalidate methods including mmu_notifier_invalidate_page). 
Nvidia is now also going to use mmu notifier to allow the GPU (acting as a secondary MMU with pagetables) to access main memory without requiring RAM pinning and without disabling all VM features on the graphics memory (and without GART physical). Probably they don't need to schedule but I CC'ed Jerome just in case they need (as that would add one more relevant user of the feature). As far as KVM is concerned, there is no benefit in scheduling in the methods. The KVM mmu notifier invalidate consists in zeroing out one or more sptes and sending an IPI to flush the guest TLBs if others vcpus are running in other cpus. It's a mechanism pretty much identical to the one used by the primary MMU and it only requires irqs enabled to avoid deadlocks in case of cross-IPIs. ^ permalink raw reply [flat|nested] 55+ messages in thread
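To make that constraint concrete, here is a rough, flattened sketch of the anon side of the rmap unmap path (call sites from try_to_unmap_anon() and try_to_unmap_one() are compressed together and declarations are omitted, so treat it as an illustration rather than actual kernel code): the notifier runs while the rmap lock is still held, because that lock is what keeps both the mm and the page stable at that point, and once it becomes a spinning lock the notifier callee may no longer sleep.

anon_vma_lock_read(anon_vma);			/* pins the vma list, and with it the mm */
pte = page_check_address(page, mm, address, &ptl, 0);
pteval = ptep_clear_flush(vma, address, pte);	/* zap the linux pte and flush the TLB */
pte_unmap_unlock(pte, ptl);
mmu_notifier_invalidate_page(mm, address);	/* sleeping RDMA notifiers hook in here */
anon_vma_unlock_read(anon_vma);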
* Re: [PATCH] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-30 8:52 ` [PATCH] " Andrea Arcangeli @ 2013-09-30 14:40 ` Jerome Glisse 2013-09-30 16:26 ` Linus Torvalds 2013-09-30 18:21 ` Andi Kleen 2 siblings, 0 replies; 55+ messages in thread From: Jerome Glisse @ 2013-09-30 14:40 UTC (permalink / raw) To: Andrea Arcangeli Cc: Ingo Molnar, Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J, Haggai Eran, Sagi Grimberg, Or Gerlitz On Mon, Sep 30, 2013 at 10:52:43AM +0200, Andrea Arcangeli wrote: > Hi everyone, > > On Sat, Sep 28, 2013 at 09:37:39PM +0200, Ingo Molnar wrote: > > > > * Ingo Molnar <mingo@kernel.org> wrote: > > > > > If we do that then I suspect the next step will be queued rwlocks :-/ > > > The current rwlock_t implementation is rather primitive by modern > > > standards. (We'd probably have killed rwlock_t long ago if not for the > > > tasklist_lock.) > > > > > > But yeah, it would work and conceptually a hard spinlock fits something > > > as lowlevel as the anon-vma lock. > > > > > > I did a quick review pass and it appears nothing obvious is scheduling > > > with the anon-vma lock held. If it did in a non-obvious way it's likely > > > a bug anyway. The hugepage code grew a lot of logic running under the > > > anon-vma lock, but it all seems atomic. > > > > > > So a conversion to rwlock_t could be attempted. (It should be relatively > > > easy patch as well, because the locking operation is now nicely > > > abstracted out.) > > > > Here's a totally untested patch to convert the anon vma lock to an > > rwlock_t. > > Sorry having to break the party but the sleepable locks for anon_vma > and i_mmap_mutex are now requirement for the "pageable RDMA" effort > recently achieved upstream by mellanox with the MMU notifier. > > And as far as I can tell that's the only single good reason for why > those locks shouldn't be spinlocks (otherwise I would have also > complianed at the time of that conversion, the original regression was > known, ask Andi). After the lock conversion it took a while to fix all > other minor bits to make mmu notifier methods fully sleepable. > > The problem with the spinlocks is that in the rmap code (like > try_to_unmap) we need to call mmu_notifier_invalidate_page with an > "mm" as parameter, and the callee assumes the "mm" won't go away under > it. The other second requirement is that the page cannot be freed > until we call the mmu_notifier_invalidate_page (secondary MMU is ok to > still access the page after the linux pte has been dropped and the TLB > flushed). > > In the rmap code the only things that keep things afloat is either the > anon_vma lock or the i_mmap_mutex so it is quite tricky to drop that > lock while keeping "mm" and "page" both afloat for the invalidate > post-anon-vma-unlock. > > Maybe there are ways to makes that safe? (reference counting, > trylocking the mmap_sem). But there isn't a very strightforward way to > do that. > > It isn't just mellanox drivers: originally SGI XPMEM driver also > needed to schedule in those methods (then they figured how to get away > in only scheduling in the mmu_notifier range calls but I suppose they > prefer to be able to schedule in all invalidate methods including > mmu_notifier_invalidate_page). 
> > Nvidia is now also going to use mmu notifier to allow the GPU (acting > as a secondary MMU with pagetables) to access main memory without > requiring RAM pinning and without disabling all VM features on the > graphics memory (and without GART physical). Probably they don't need > to schedule but I CC'ed Jerome just in case they need (as that would > add one more relevant user of the feature). Thanks for the cc if it was on mm mailing list i would have seen it but i am way behind on my lkml folder. Yes right now we have to sleep in mmu_notifier_invalidate_page but it's somewhat solvable. The mkclean path is solvable more or less easily but the try_to_unmap do have to sleep and that's bad. Especialy the shrinker path. My plan is to provide informations down the mmu notifier API and fails quickly on shrinker path. For shrinking i am not sure if i should just rely on shrinker API to force device or not. It's not something i want to tackle in first patchset anyway. I am hoping to be able to send patchset in next few weeks. But the sleeping inside invalidate page is definitly on my WHATTHEHECKLIST. Cheers, Jerome > > As far as KVM is concerned, there is no benefit in scheduling in the > methods. The KVM mmu notifier invalidate consists in zeroing out one > or more sptes and sending an IPI to flush the guest TLBs if others > vcpus are running in other cpus. It's a mechanism pretty much > identical to the one used by the primary MMU and it only requires irqs > enabled to avoid deadlocks in case of cross-IPIs. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-30 8:52 ` [PATCH] " Andrea Arcangeli 2013-09-30 14:40 ` Jerome Glisse @ 2013-09-30 16:26 ` Linus Torvalds 2013-09-30 19:16 ` Andrea Arcangeli 2013-09-30 18:21 ` Andi Kleen 2 siblings, 1 reply; 55+ messages in thread From: Linus Torvalds @ 2013-09-30 16:26 UTC (permalink / raw) To: Andrea Arcangeli Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J, Haggai Eran, Sagi Grimberg, Or Gerlitz, Jerome Glisse On Mon, Sep 30, 2013 at 1:52 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > Sorry having to break the party but the sleepable locks for anon_vma > and i_mmap_mutex are now requirement for the "pageable RDMA" effort > recently achieved upstream by mellanox with the MMU notifier. I'll happily break that. No way will some random crazy RDMA be a reason why we'd drop this kind of improvement on actual loads that matter way more. Linus ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-30 16:26 ` Linus Torvalds @ 2013-09-30 19:16 ` Andrea Arcangeli 0 siblings, 0 replies; 55+ messages in thread From: Andrea Arcangeli @ 2013-09-30 19:16 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J, Haggai Eran, Sagi Grimberg, Or Gerlitz, Jerome Glisse On Mon, Sep 30, 2013 at 09:26:21AM -0700, Linus Torvalds wrote: > On Mon, Sep 30, 2013 at 1:52 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > > > Sorry having to break the party but the sleepable locks for anon_vma > > and i_mmap_mutex are now requirement for the "pageable RDMA" effort > > recently achieved upstream by mellanox with the MMU notifier. > > I'll happily break that. Unless a solution is found that could allow to invalidate secondary MMUs with a spinlock/rwlock for anon_vma->lock/i_mmap_lock, would it be acceptable to switch between spinlock/rwlock mutex/rwsem through a config option? option CONFIG_SLEEPABLE_RMAP, implicitly selected by CONFIG_SLEEPABLE_MMU_NOTIFIER, in turn selected by the RDMA and nvidia drivers if they're built (ideally nvidia drivers will figure out how to avoid scheduling). I mean it only requires a wrapping header file, aside from the header file it wouldn't be a much bigger patch than the one posted already. It would be much easier to switch between spinning and sleeping locks to keep benchmarking different scenarios too. ^ permalink raw reply [flat|nested] 55+ messages in thread
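A minimal sketch of the kind of wrapping header being suggested, assuming the CONFIG_SLEEPABLE_RMAP name proposed above (the whole snippet is hypothetical: the struct anon_vma field would change type under the same #ifdef, and the read-side and trylock helpers would follow the same pattern):

#ifdef CONFIG_SLEEPABLE_RMAP
static inline void anon_vma_lock_write(struct anon_vma *anon_vma)
{
	down_write(&anon_vma->root->rwsem);
}
static inline void anon_vma_unlock_write(struct anon_vma *anon_vma)
{
	up_write(&anon_vma->root->rwsem);
}
#else
static inline void anon_vma_lock_write(struct anon_vma *anon_vma)
{
	write_lock(&anon_vma->root->rwlock);
}
static inline void anon_vma_unlock_write(struct anon_vma *anon_vma)
{
	write_unlock(&anon_vma->root->rwlock);
}
#endif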
* Re: [PATCH] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-30 8:52 ` [PATCH] " Andrea Arcangeli 2013-09-30 14:40 ` Jerome Glisse 2013-09-30 16:26 ` Linus Torvalds @ 2013-09-30 18:21 ` Andi Kleen 2013-09-30 19:04 ` Jerome Glisse 2 siblings, 1 reply; 55+ messages in thread From: Andi Kleen @ 2013-09-30 18:21 UTC (permalink / raw) To: Andrea Arcangeli Cc: Ingo Molnar, Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J, Haggai Eran, Sagi Grimberg, Or Gerlitz, Jerome Glisse > Nvidia is now also going to use mmu notifier to allow the GPU (acting > as a secondary MMU with pagetables) to access main memory without That's a good point. iirc AMD GPUs are also going into the same direction, and likely more. So we probably need to support that usage. Too bad :-/ -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] anon_vmas: Convert the rwsem to an rwlock_t 2013-09-30 18:21 ` Andi Kleen @ 2013-09-30 19:04 ` Jerome Glisse 0 siblings, 0 replies; 55+ messages in thread From: Jerome Glisse @ 2013-09-30 19:04 UTC (permalink / raw) To: Andi Kleen Cc: Andrea Arcangeli, Ingo Molnar, Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Chandramouleeswaran, Aswin, Norton, Scott J, Haggai Eran, Sagi Grimberg, Or Gerlitz On Mon, Sep 30, 2013 at 08:21:12PM +0200, Andi Kleen wrote: > > Nvidia is now also going to use mmu notifier to allow the GPU (acting > > as a secondary MMU with pagetables) to access main memory without > > That's a good point. iirc AMD GPUs are also going into > the same direction, and likely more. So we probably need > to support that usage. > > Too bad :-/ > > -Andi In all fairness, current AMD solution is hardware only, you can see the code in drivers/iommu/amd_iommu_v2.c but they might be interested in the software solution for added features like abilities to move process memory to video memory and migrating it back when CPU access it. Others too might be interested (Intel, infiniband, or any driver with an mmu that can do pagefault and has couple others features). Cheers, Jerome ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-28 18:55 ` Linus Torvalds 2013-09-28 19:13 ` Andi Kleen 2013-09-28 19:21 ` Ingo Molnar @ 2013-09-29 23:06 ` Davidlohr Bueso 2013-09-29 23:26 ` Linus Torvalds 2013-09-30 1:11 ` Michel Lespinasse 2013-09-30 10:46 ` Peter Zijlstra 4 siblings, 1 reply; 55+ messages in thread From: Davidlohr Bueso @ 2013-09-29 23:06 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sat, 2013-09-28 at 11:55 -0700, Linus Torvalds wrote: > On Sat, Sep 28, 2013 at 12:41 AM, Ingo Molnar <mingo@kernel.org> wrote: > > > > > > Yeah, I fully agree. The reason I'm still very sympathetic to Tim's > > efforts is that they address a regression caused by a mechanic > > mutex->rwsem conversion: > > > > 5a505085f043 mm/rmap: Convert the struct anon_vma::mutex to an rwsem > > > > ... and Tim's patches turn that regression into an actual speedup. > > Btw, I really hate that thing. I think we should turn it back into a > spinlock. None of what it protects needs a mutex or an rwsem. The same should apply to i_mmap_mutex, having a similar responsibility to the anon-vma lock with file backed pages. A few months ago I had suggested changing that lock to rwsem, giving some pretty reasonable performance improvement numbers. http://lwn.net/Articles/556342/ > > Because you guys talk about the regression of turning it into a rwsem, > but nobody talks about the *original* regression. > > And it *used* to be a spinlock, and it was changed into a mutex back > in 2011 by commit 2b575eb64f7a. That commit doesn't even have a reason > listed for it, although my dim memory of it is that the reason was > preemption latency. > > And that caused big regressions too. After testing Ingo's anon-vma rwlock_t conversion (v2) on a 8 socket, 80 core system with aim7, I am quite surprised about the numbers - considering the lack of queuing in rwlocks. A lot of the tests didn't show hardly any difference, but those that really contend this lock (with high amounts of users) benefited quite nicely: Alltests: +28% throughput after 1000 users and runtime was reduced from 7.2 to 6.6 secs. Custom: +61% throughput after 100 users and runtime was reduced from 7 to 4.9 secs. High_systime: +40% throughput after 1000 users and runtime was reduced from 19 to 15.5 secs. Shared: +30.5% throughput after 100 users and runtime was reduced from 6.5 to 5.1 secs. Short: Lots of variance in the numbers, but avg of +29% throughput - no particular performance degradation either. The only workload that took a hit was fserver with a -6% throughput for small amounts of users (10-100). Theoretically at least, adding queuing to the rwlocks should only help the situation furthermore. I'll also run some test for a similar conversion for the i_mmap mutex next week. Going back to the rwsems, I believe adding optimistic spinning is still valid, independent of the anon-vma lock as it benefits all users. Thanks, Davidlohr ^ permalink raw reply [flat|nested] 55+ messages in thread
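Purely as an illustration of the parallel being drawn (the helper and field names below are invented, not an existing API in this tree), the file-backed side could be wrapped the same way the anon_vma helpers are:

/* hypothetical helpers mirroring anon_vma_lock_write()/anon_vma_unlock_write() */
static inline void i_mmap_lock_write(struct address_space *mapping)
{
	write_lock(&mapping->i_mmap_rwlock);	/* assumed field replacing i_mmap_mutex */
}

static inline void i_mmap_unlock_write(struct address_space *mapping)
{
	write_unlock(&mapping->i_mmap_rwlock);
}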
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-29 23:06 ` [PATCH] rwsem: reduce spinlock contention in wakeup code path Davidlohr Bueso @ 2013-09-29 23:26 ` Linus Torvalds 2013-09-30 0:40 ` Davidlohr Bueso 2013-09-30 6:57 ` Ingo Molnar 0 siblings, 2 replies; 55+ messages in thread From: Linus Torvalds @ 2013-09-29 23:26 UTC (permalink / raw) To: Davidlohr Bueso Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sun, Sep 29, 2013 at 4:06 PM, Davidlohr Bueso <davidlohr@hp.com> wrote: >> >> Btw, I really hate that thing. I think we should turn it back into a >> spinlock. None of what it protects needs a mutex or an rwsem. > > The same should apply to i_mmap_mutex, having a similar responsibility > to the anon-vma lock with file backed pages. A few months ago I had > suggested changing that lock to rwsem, giving some pretty reasonable > performance improvement numbers. > > http://lwn.net/Articles/556342/ Ok, that's pretty convincing too. Side note: are you sure that the i_mmap_mutex needs to be a sleeping lock at all? It's documented to nest outside the anon_vma->rwsem, so as long as that is a sleeping lock, the i_mmap_mutex needs to be one too, but looking at the actual users, most of them seem to be *very* similar to the anon_vma->rwsem users. It is a very close cousin to the anon_vma->rwsem, after all (just for file-backed pages rather than anonymous ones). No? I dunno. Maybe the ranges are too big and it really has latency issues, the few I looked at looked like fairly trivial interval-tree operations, though. And your numbers for Ingo's patch: > After testing Ingo's anon-vma rwlock_t conversion (v2) on a 8 socket, 80 > core system with aim7, I am quite surprised about the numbers - > considering the lack of queuing in rwlocks. A lot of the tests didn't > show hardly any difference, but those that really contend this lock > (with high amounts of users) benefited quite nicely: > > Alltests: +28% throughput after 1000 users and runtime was reduced from > 7.2 to 6.6 secs. > > Custom: +61% throughput after 100 users and runtime was reduced from 7 > to 4.9 secs. > > High_systime: +40% throughput after 1000 users and runtime was reduced > from 19 to 15.5 secs. > > Shared: +30.5% throughput after 100 users and runtime was reduced from > 6.5 to 5.1 secs. > > Short: Lots of variance in the numbers, but avg of +29% throughput - no > particular performance degradation either. Are just overwhelming, in my opinion. The conversion *from* a spinlock never had this kind of support behind it. Btw, did anybody run Ingo's patch with lockdep and the spinlock sleep debugging code to verify that we haven't introduced any problems wrt sleeping since the lock was converted into a rw-semaphore? Because quite frankly, considering these kinds of numbers, I really don't see how we could possibly make excuses for keeping that rw-semaphore unless there is some absolutely _horrible_ latency issue? Linus ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-29 23:26 ` Linus Torvalds @ 2013-09-30 0:40 ` Davidlohr Bueso 2013-09-30 0:51 ` Linus Torvalds 2013-09-30 6:57 ` Ingo Molnar 1 sibling, 1 reply; 55+ messages in thread From: Davidlohr Bueso @ 2013-09-30 0:40 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sun, 2013-09-29 at 16:26 -0700, Linus Torvalds wrote: > On Sun, Sep 29, 2013 at 4:06 PM, Davidlohr Bueso <davidlohr@hp.com> wrote: > >> > >> Btw, I really hate that thing. I think we should turn it back into a > >> spinlock. None of what it protects needs a mutex or an rwsem. > > > > The same should apply to i_mmap_mutex, having a similar responsibility > > to the anon-vma lock with file backed pages. A few months ago I had > > suggested changing that lock to rwsem, giving some pretty reasonable > > performance improvement numbers. > > > > http://lwn.net/Articles/556342/ > > Ok, that's pretty convincing too. > > Side note: are you sure that the i_mmap_mutex needs to be a sleeping > lock at all? It's documented to nest outside the anon_vma->rwsem, so > as long as that is a sleeping lock, the i_mmap_mutex needs to be one > too, but looking at the actual users, most of them seem to be *very* > similar to the anon_vma->rwsem users. It is a very close cousin to the > anon_vma->rwsem, after all (just for file-backed pages rather than > anonymous ones). No? Right, I brought that up exactly because it is so similar to the anon-vma lock. Both should really be the same type, whether it be rwsem or rwlock. Hopefully the rwlock conversion tests will also benefit just like Ingo's patch did, I should have numbers early next week. > > I dunno. Maybe the ranges are too big and it really has latency > issues, the few I looked at looked like fairly trivial interval-tree > operations, though. > > And your numbers for Ingo's patch: > > > After testing Ingo's anon-vma rwlock_t conversion (v2) on a 8 socket, 80 > > core system with aim7, I am quite surprised about the numbers - > > considering the lack of queuing in rwlocks. A lot of the tests didn't > > show hardly any difference, but those that really contend this lock > > (with high amounts of users) benefited quite nicely: > > > > Alltests: +28% throughput after 1000 users and runtime was reduced from > > 7.2 to 6.6 secs. > > > > Custom: +61% throughput after 100 users and runtime was reduced from 7 > > to 4.9 secs. > > > > High_systime: +40% throughput after 1000 users and runtime was reduced > > from 19 to 15.5 secs. > > > > Shared: +30.5% throughput after 100 users and runtime was reduced from > > 6.5 to 5.1 secs. > > > > Short: Lots of variance in the numbers, but avg of +29% throughput - no > > particular performance degradation either. > > Are just overwhelming, in my opinion. The conversion *from* a spinlock > never had this kind of support behind it. > > Btw, did anybody run Ingo's patch with lockdep and the spinlock sleep > debugging code to verify that we haven't introduced any problems wrt > sleeping since the lock was converted into a rw-semaphore? 
Hmm, I'm getting the following at bootup:

=============================================
[ INFO: possible recursive locking detected ]
3.12.0-rc2+ #7 Not tainted
---------------------------------------------
qemu-kvm/64239 is trying to acquire lock:
 (&anon_vma->rwlock){+.+...}, at: [<ffffffff8115fba6>] mm_take_all_locks+0x146/0x190

but task is already holding lock:
 (&anon_vma->rwlock){+.+...}, at: [<ffffffff8115fba6>] mm_take_all_locks+0x146/0x190

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&anon_vma->rwlock);
  lock(&anon_vma->rwlock);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

4 locks held by qemu-kvm/64239:
 #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8117c25c>] do_mmu_notifier_register+0x13c/0x180
 #1:  (mm_all_locks_mutex){+.+...}, at: [<ffffffff8115fa9b>] mm_take_all_locks+0x3b/0x190
 #2:  (&mapping->i_mmap_mutex){+.+...}, at: [<ffffffff8115fb26>] mm_take_all_locks+0xc6/0x190
 #3:  (&anon_vma->rwlock){+.+...}, at: [<ffffffff8115fba6>] mm_take_all_locks+0x146/0x190

stack backtrace:
CPU: 51 PID: 64239 Comm: qemu-kvm Not tainted 3.12.0-rc2+ #7
Hardware name: HP ProLiant DL980 G7, BIOS P66 06/24/2011
 ffff8b9f79294a80 ffff8a9f7c375c08 ffffffff815802dc 0000000000000003
 ffff8b9f79294100 ffff8a9f7c375c38 ffffffff810bda34 ffff8b9f79294100
 ffff8b9f79294a80 ffff8b9f79294100 0000000000000000 ffff8a9f7c375c98
Call Trace:
 [<ffffffff815802dc>] dump_stack+0x49/0x5d
 [<ffffffff810bda34>] print_deadlock_bug+0xf4/0x100
 [<ffffffff810bf864>] validate_chain+0x504/0x7a0
 [<ffffffff810bfe0d>] __lock_acquire+0x30d/0x590
 [<ffffffff8115fba6>] ? mm_take_all_locks+0x146/0x190
 [<ffffffff810c0130>] lock_acquire+0xa0/0x110
 [<ffffffff8115fba6>] ? mm_take_all_locks+0x146/0x190
 [<ffffffff810bebcd>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff81585861>] _raw_write_lock+0x31/0x40
 [<ffffffff8115fba6>] ? mm_take_all_locks+0x146/0x190
 [<ffffffff8115fba6>] mm_take_all_locks+0x146/0x190
 [<ffffffff8117c196>] do_mmu_notifier_register+0x76/0x180
 [<ffffffff8117c2d3>] mmu_notifier_register+0x13/0x20
 [<ffffffffa0365f5c>] kvm_create_vm+0x22c/0x300 [kvm]
 [<ffffffffa03660e8>] kvm_dev_ioctl+0xb8/0x140 [kvm]
 [<ffffffff811a4569>] do_vfs_ioctl+0x89/0x350
 [<ffffffff8158e9b7>] ? sysret_check+0x1b/0x56
 [<ffffffff811a48d1>] SyS_ioctl+0xa1/0xb0

This is probably due to the change in vm_lock_anon_vma():

-	down_write_nest_lock(&anon_vma->root->rwsem, &mm->mmap_sem);
+	write_lock(&anon_vma->root->rwlock);

^ permalink raw reply	[flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-30 0:40 ` Davidlohr Bueso @ 2013-09-30 0:51 ` Linus Torvalds 0 siblings, 0 replies; 55+ messages in thread From: Linus Torvalds @ 2013-09-30 0:51 UTC (permalink / raw) To: Davidlohr Bueso Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sun, Sep 29, 2013 at 5:40 PM, Davidlohr Bueso <davidlohr@hp.com> wrote: > > Hmm, I'm getting the following at bootup: > > May be due to missing lock nesting notation Yes it is. And that reminds me of a problem I think we had with this code: we had a possible case of the preemption counter nesting too deeply. I forget the details, but it was something people worried about. That mm_take_all_locks() thing is really special, and I suspect that if we go down this way we should just do a single preempt-disable and then use the arch_write_lock() to avoid both the lockdep splat _and_ the preemption counter overflow. Linus ^ permalink raw reply [flat|nested] 55+ messages in thread
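For concreteness, a rough sketch of what that suggestion could look like - illustrative only, not a tested patch. It assumes the converted anon_vma lock is an rwlock_t field named rwlock (as in the lockdep report above), and that the caller, mm_take_all_locks(), does a single preempt_disable() for the whole batch and still skips roots it has already locked:

static inline void vm_lock_anon_vma_raw(struct anon_vma *anon_vma)
{
	/*
	 * Raw arch primitive: no lockdep class is acquired, so the nested
	 * anon_vma locks cannot trigger the "possible recursive locking"
	 * splat, and there is no per-lock preempt_disable() to overflow
	 * the preempt counter when many locks are held at once.
	 */
	arch_write_lock(&anon_vma->root->rwlock.raw_lock);
}

static inline void vm_unlock_anon_vma_raw(struct anon_vma *anon_vma)
{
	arch_write_unlock(&anon_vma->root->rwlock.raw_lock);
}

The trade-off is that the whole mm_take_all_locks()/mm_drop_all_locks() window then runs with preemption off, which is the latency question raised elsewhere in the thread.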
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-29 23:26 ` Linus Torvalds 2013-09-30 0:40 ` Davidlohr Bueso @ 2013-09-30 6:57 ` Ingo Molnar 1 sibling, 0 replies; 55+ messages in thread From: Ingo Molnar @ 2013-09-30 6:57 UTC (permalink / raw) To: Linus Torvalds Cc: Davidlohr Bueso, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Linus Torvalds <torvalds@linux-foundation.org> wrote: > [...] > > And your numbers for Ingo's patch: > > > After testing Ingo's anon-vma rwlock_t conversion (v2) on a 8 socket, > > 80 core system with aim7, I am quite surprised about the numbers - > > considering the lack of queuing in rwlocks. A lot of the tests didn't > > show hardly any difference, but those that really contend this lock > > (with high amounts of users) benefited quite nicely: > > > > Alltests: +28% throughput after 1000 users and runtime was reduced from > > 7.2 to 6.6 secs. > > > > Custom: +61% throughput after 100 users and runtime was reduced from 7 > > to 4.9 secs. > > > > High_systime: +40% throughput after 1000 users and runtime was reduced > > from 19 to 15.5 secs. > > > > Shared: +30.5% throughput after 100 users and runtime was reduced from > > 6.5 to 5.1 secs. > > > > Short: Lots of variance in the numbers, but avg of +29% throughput - > > no particular performance degradation either. > > Are just overwhelming, in my opinion. The conversion *from* a spinlock > never had this kind of support behind it. Agreed. Especially given how primitive rwlock_t is especially on 80 cores, this is really a no-brainer conversion. I have to say I am surprised by the numbers - after so many years it's still amazing how powerful the "get work done and don't interrupt it" batching concept is in computing... > Btw, did anybody run Ingo's patch with lockdep and the spinlock sleep > debugging code to verify that we haven't introduced any problems wrt > sleeping since the lock was converted into a rw-semaphore? > > Because quite frankly, considering these kinds of numbers, I really > don't see how we could possibly make excuses for keeping that > rw-semaphore unless there is some absolutely _horrible_ latency issue? Given that there's only about a dozen critical sections that this lock covers I simply cannot imagine any latency problem that couldn't be fixed in some other fashion. (shrinking the critical section, breaking up a bad loop, etc.) [ Btw., if PREEMPT_RT goes upstream we might not even need to break latencies all that much: people whose usecase values scheduling latency above throughput would run such a critical section preemptible anyway. ] Thanks, Ingo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-28 18:55 ` Linus Torvalds ` (2 preceding siblings ...) 2013-09-29 23:06 ` [PATCH] rwsem: reduce spinlock contention in wakeup code path Davidlohr Bueso @ 2013-09-30 1:11 ` Michel Lespinasse 2013-09-30 7:05 ` Ingo Molnar 2013-09-30 10:46 ` Peter Zijlstra 4 siblings, 1 reply; 55+ messages in thread From: Michel Lespinasse @ 2013-09-30 1:11 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sat, Sep 28, 2013 at 11:55 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > Btw, I really hate that thing. I think we should turn it back into a > spinlock. None of what it protects needs a mutex or an rwsem. > > Because you guys talk about the regression of turning it into a rwsem, > but nobody talks about the *original* regression. > > And it *used* to be a spinlock, and it was changed into a mutex back > in 2011 by commit 2b575eb64f7a. That commit doesn't even have a reason > listed for it, although my dim memory of it is that the reason was > preemption latency. I was wondering about that too. Regarding latencies, we used to have unbounded latencies for anon_vma operations as the AVC chains could get long under some workloads; now that we index the VMAs matching a given anon_vma with an interval tree this particular source of latencies should be gone. So yes, it could be worth trying to go back to a non-sleeping lock. That said, I am very scared of using rwlock_t here, and I would much prefer we choose a fair lock (either spinlock or a new rwlock implementation which guarantees not to starve any locker thread) -- Michel Lespinasse A program is never fully debugged until the last user dies. ^ permalink raw reply [flat|nested] 55+ messages in thread
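To make that starvation worry concrete, an illustrative interleaving (not taken from any real trace) of how the current reader-preference rwlock_t behaves under a steady stream of readers:

/*
 * rwlock_t grants new readers even while a writer is spinning, so
 * overlapping readers can keep the reader count above zero forever:
 *
 *   readers (several CPUs)            writer (one CPU)
 *   ----------------------            ----------------
 *   read_lock()    <- granted
 *                                     write_lock()   <- spins
 *   read_lock()    <- granted
 *   read_unlock()
 *   read_lock()    <- granted
 *   read_unlock()
 *   ...                               ... still spinning ...
 *
 * A fair (queued) rwlock would instead make the later readers wait
 * behind the pending writer.
 */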
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-30 1:11 ` Michel Lespinasse @ 2013-09-30 7:05 ` Ingo Molnar 2013-09-30 16:03 ` Waiman Long 0 siblings, 1 reply; 55+ messages in thread From: Ingo Molnar @ 2013-09-30 7:05 UTC (permalink / raw) To: Michel Lespinasse Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Michel Lespinasse <walken@google.com> wrote: > That said, I am very scared of using rwlock_t here, and I would much > prefer we choose a fair lock (either spinlock or a new rwlock > implementation which guarantees not to starve any locker thread) Given how few users rwlock_t has today we could attempt to make it reader-writer fair, as long as the usecase of a hardirq or softirq context always getting nested read access on the same CPU is preserved. (but that can be done - if nothing else then with an explicit context check.) Thanks, Ingo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-30 7:05 ` Ingo Molnar @ 2013-09-30 16:03 ` Waiman Long 0 siblings, 0 replies; 55+ messages in thread From: Waiman Long @ 2013-09-30 16:03 UTC (permalink / raw) To: Ingo Molnar Cc: Michel Lespinasse, Linus Torvalds, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On 09/30/2013 03:05 AM, Ingo Molnar wrote: > * Michel Lespinasse<walken@google.com> wrote: > >> That said, I am very scared of using rwlock_t here, and I would much >> prefer we choose a fair lock (either spinlock or a new rwlock >> implementation which guarantees not to starve any locker thread) > Given how few users rwlock_t has today we could attempt to make it > reader-writer fair, as long as the usecase of a hardirq or softirq > context always getting nested read access on the same CPU is preserved. > > (but that can be done - if nothing else then with an explicit context > check.) > > Thanks, > > Ingo My queue rwlock patch is able to handle the use case of a hardirq and softirq context and can be made almost as fair as the ticket spinlock. -Longman ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-28 18:55 ` Linus Torvalds ` (3 preceding siblings ...) 2013-09-30 1:11 ` Michel Lespinasse @ 2013-09-30 10:46 ` Peter Zijlstra 2013-10-01 7:48 ` Ingo Molnar 4 siblings, 1 reply; 55+ messages in thread From: Peter Zijlstra @ 2013-09-30 10:46 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Sat, Sep 28, 2013 at 11:55:26AM -0700, Linus Torvalds wrote: > So if the primary reason for this is really just that f*cking anon_vma > lock, then I would seriously suggest: I would still like to see the rwsem patches merged; even if we end up going back to a spin style anon_vma lock. There's been various reports in the past about how programs are significantly faster if they wrap their mmap() calls in a pthread_mutex. And this was purely down to the fact that rwsem writer-writer contention blows chunks. ^ permalink raw reply [flat|nested] 55+ messages in thread
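The userspace workaround Peter Zijlstra refers to is simple enough to sketch (illustrative only; the helper name and mapping flags are made up for the example):

#include <pthread.h>
#include <sys/mman.h>
#include <stddef.h>

/*
 * Serialize all mmap() calls in the process so its threads never pile
 * up as writers on the kernel's mmap_sem.
 */
static pthread_mutex_t mmap_serialize = PTHREAD_MUTEX_INITIALIZER;

static void *locked_mmap(size_t len)
{
	void *p;

	pthread_mutex_lock(&mmap_serialize);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	pthread_mutex_unlock(&mmap_serialize);
	return p;
}

That such a wrapper can help at all is the point being made: the contended rwsem writer-writer path costs more than an uncontended pthread_mutex plus an uncontended mmap_sem.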
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-30 10:46 ` Peter Zijlstra @ 2013-10-01 7:48 ` Ingo Molnar 2013-10-01 8:10 ` Peter Zijlstra 0 siblings, 1 reply; 55+ messages in thread From: Ingo Molnar @ 2013-10-01 7:48 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J * Peter Zijlstra <peterz@infradead.org> wrote: > On Sat, Sep 28, 2013 at 11:55:26AM -0700, Linus Torvalds wrote: > > So if the primary reason for this is really just that f*cking anon_vma > > lock, then I would seriously suggest: > > I would still like to see the rwsem patches merged; even if we end up > going back to a spin style anon_vma lock. > > There's been various reports in the past about how programs are > significantly faster if they wrap their mmap() calls in a pthread_mutex. > And this was purely down to the fact that rwsem writer-writer contention > blows chunks. That's about the mm->mmap_sem rwsem, right? That impact would have to be measured carefully, and not just for workloads where we know that better contention logic helps, but other MM workloads that are hitting hard on mmap_sem. Thanks, Ingo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-10-01 7:48 ` Ingo Molnar @ 2013-10-01 8:10 ` Peter Zijlstra 0 siblings, 0 replies; 55+ messages in thread From: Peter Zijlstra @ 2013-10-01 8:10 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List, Rik van Riel, Peter Hurley, Davidlohr Bueso, Alex Shi, Tim Chen, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On Tue, Oct 01, 2013 at 09:48:02AM +0200, Ingo Molnar wrote: > > * Peter Zijlstra <peterz@infradead.org> wrote: > > > On Sat, Sep 28, 2013 at 11:55:26AM -0700, Linus Torvalds wrote: > > > So if the primary reason for this is really just that f*cking anon_vma > > > lock, then I would seriously suggest: > > > > I would still like to see the rwsem patches merged; even if we end up > > going back to a spin style anon_vma lock. > > > > There's been various reports in the past about how programs are > > significantly faster if they wrap their mmap() calls in a pthread_mutex. > > And this was purely down to the fact that rwsem writer-writer contention > > blows chunks. > > That's about the mm->mmap_sem rwsem, right? Yep. > That impact would have to be measured carefully, and not just for > workloads where we know that better contention logic helps, but other MM > workloads that are hitting hard on mmap_sem. Sure. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path 2013-09-27 19:00 [PATCH] rwsem: reduce spinlock contention in wakeup code path Waiman Long 2013-09-27 19:28 ` Linus Torvalds @ 2013-09-27 19:32 ` Peter Hurley 2013-09-28 0:46 ` Waiman Long 1 sibling, 1 reply; 55+ messages in thread From: Peter Hurley @ 2013-09-27 19:32 UTC (permalink / raw) To: Waiman Long, Ingo Molnar, Andrew Morton Cc: linux-kernel, Rik van Riel, Davidlohr Bueso, Alex Shi, Tim Chen, Linus Torvalds, Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J On 09/27/2013 03:00 PM, Waiman Long wrote: > With the 3.12-rc2 kernel, there is sizable spinlock contention on > the rwsem wakeup code path when running AIM7's high_systime workload > on a 8-socket 80-core DL980 (HT off) as reported by perf: > > 7.64% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave > |--41.77%-- rwsem_wake > 1.61% reaim [kernel.kallsyms] [k] _raw_spin_lock_irq > |--92.37%-- rwsem_down_write_failed > > That was 4.7% of recorded CPU cycles. > > On a large NUMA machine, it is entirely possible that a fairly large > number of threads are queuing up in the ticket spinlock queue to do > the wakeup operation. In fact, only one will be needed. This patch > tries to reduce spinlock contention by doing just that. > > A new wakeup field is added to the rwsem structure. This field is > set on entry to rwsem_wake() and __rwsem_do_wake() to mark that a > thread is pending to do the wakeup call. It is cleared on exit from > those functions. > > By checking if the wakeup flag is set, a thread can exit rwsem_wake() > immediately if another thread is pending to do the wakeup instead of > waiting to get the spinlock and find out that nothing need to be done. This will leave readers stranded if a former writer is in __rwsem_do_wake to wake up the readers and another writer steals the lock, but before the former writer exits without having woken up the readers, the locking stealing writer drops the lock and sees the wakeup flag is set, so doesn't bother to wake the readers. Regards, Peter Hurley > The setting of the wakeup flag may not be visible on all processors in > some architectures. However, this won't affect program correctness. The > clearing of the wakeup flag before spin_unlock will ensure that it is > visible to all processors. 
> > With this patch, the performance improvement on jobs per minute (JPM) > of the high_systime workload (at 1500 users) was as follows: > > HT JPM w/o patch JPM with patch % Change > -- ------------- -------------- -------- > off 128466 150000 +16.8% > on 121606 146778 +20.7% > > The new perf profile (HT off) was as follows: > > 2.96% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave > |--0.94%-- rwsem_wake > 1.00% reaim [kernel.kallsyms] [k] _raw_spin_lock_irq > |--88.70%-- rwsem_down_write_failed > > Signed-off-by: Waiman Long <Waiman.Long@hp.com> > --- > include/linux/rwsem.h | 2 ++ > lib/rwsem.c | 19 +++++++++++++++++++ > 2 files changed, 21 insertions(+), 0 deletions(-) > > diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h > index 0616ffe..e25792e 100644 > --- a/include/linux/rwsem.h > +++ b/include/linux/rwsem.h > @@ -25,6 +25,7 @@ struct rw_semaphore; > struct rw_semaphore { > long count; > raw_spinlock_t wait_lock; > + int wakeup; /* Waking-up in progress flag */ > struct list_head wait_list; > #ifdef CONFIG_DEBUG_LOCK_ALLOC > struct lockdep_map dep_map; > @@ -58,6 +59,7 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem) > #define __RWSEM_INITIALIZER(name) \ > { RWSEM_UNLOCKED_VALUE, \ > __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \ > + 0, \ > LIST_HEAD_INIT((name).wait_list) \ > __RWSEM_DEP_MAP_INIT(name) } > > diff --git a/lib/rwsem.c b/lib/rwsem.c > index 19c5fa9..39290a5 100644 > --- a/lib/rwsem.c > +++ b/lib/rwsem.c > @@ -25,6 +25,7 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name, > lockdep_init_map(&sem->dep_map, name, key, 0); > #endif > sem->count = RWSEM_UNLOCKED_VALUE; > + sem->wakeup = 0; > raw_spin_lock_init(&sem->wait_lock); > INIT_LIST_HEAD(&sem->wait_list); > } > @@ -66,6 +67,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) > struct list_head *next; > long oldcount, woken, loop, adjustment; > > + sem->wakeup = 1; /* Waking up in progress */ > waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list); > if (waiter->type == RWSEM_WAITING_FOR_WRITE) { > if (wake_type == RWSEM_WAKE_ANY) > @@ -137,6 +139,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) > next->prev = &sem->wait_list; > > out: > + sem->wakeup = 0; > return sem; > } > > @@ -256,11 +259,27 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem) > { > unsigned long flags; > > + if (sem->wakeup) > + return sem; /* Waking up in progress already */ > + /* > + * Optimistically set the wakeup flag to indicate that the current > + * flag is going to wakeup the sleeping waiters so that the > + * following threads don't need to wait for doing the wakeup. > + * It is perfectly fine if another thread resets the flag. It just > + * leads to another thread waiting to call __rwsem_do_wake(). > + */ > + sem->wakeup = 1; > raw_spin_lock_irqsave(&sem->wait_lock, flags); > > /* do nothing if list empty */ > if (!list_empty(&sem->wait_list)) > sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY); > + else > + sem->wakeup = 0; /* Make sure wakeup flag is reset */ > + /* > + * The spin_unlock() call will force the nulled wakeup flag to > + * be visible to all the processors. > + */ > > raw_spin_unlock_irqrestore(&sem->wait_lock, flags); > > ^ permalink raw reply [flat|nested] 55+ messages in thread
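For reference, the lost-wakeup interleaving Peter Hurley describes, reconstructed from his description rather than from a trace:

/*
 *   CPU0: former writer                CPU1: lock-stealing writer
 *   -------------------                --------------------------
 *   up_write() -> rwsem_wake()
 *     sem->wakeup = 1
 *     __rwsem_do_wake()
 *       sees an active writer (CPU1
 *       stole the lock), so returns
 *       without waking the readers
 *                                      up_write() -> rwsem_wake()
 *                                        sees sem->wakeup == 1,
 *                                        returns immediately
 *     sem->wakeup = 0
 *
 * The queued readers are never woken: CPU0 could not wake them while
 * CPU1 held the lock, and CPU1 skipped the wakeup because of the flag.
 */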
* Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path
  2013-09-27 19:32 ` Peter Hurley
@ 2013-09-28  0:46 ` Waiman Long
  0 siblings, 0 replies; 55+ messages in thread
From: Waiman Long @ 2013-09-28 0:46 UTC (permalink / raw)
To: Peter Hurley
Cc: Ingo Molnar, Andrew Morton, linux-kernel, Rik van Riel,
Davidlohr Bueso, Alex Shi, Tim Chen, Linus Torvalds,
Peter Zijlstra, Andrea Arcangeli, Matthew R Wilcox, Dave Hansen,
Michel Lespinasse, Andi Kleen, Chandramouleeswaran, Aswin,
Norton, Scott J

On 09/27/2013 03:32 PM, Peter Hurley wrote:
> On 09/27/2013 03:00 PM, Waiman Long wrote:
>> With the 3.12-rc2 kernel, there is sizable spinlock contention on
>> the rwsem wakeup code path when running AIM7's high_systime workload
>> on a 8-socket 80-core DL980 (HT off) as reported by perf:
>>
>> 7.64% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
>> |--41.77%-- rwsem_wake
>> 1.61% reaim [kernel.kallsyms] [k] _raw_spin_lock_irq
>> |--92.37%-- rwsem_down_write_failed
>>
>> That was 4.7% of recorded CPU cycles.
>>
>> On a large NUMA machine, it is entirely possible that a fairly large
>> number of threads are queuing up in the ticket spinlock queue to do
>> the wakeup operation. In fact, only one will be needed. This patch
>> tries to reduce spinlock contention by doing just that.
>>
>> A new wakeup field is added to the rwsem structure. This field is
>> set on entry to rwsem_wake() and __rwsem_do_wake() to mark that a
>> thread is pending to do the wakeup call. It is cleared on exit from
>> those functions.
>>
>> By checking if the wakeup flag is set, a thread can exit rwsem_wake()
>> immediately if another thread is pending to do the wakeup instead of
>> waiting to get the spinlock and find out that nothing need to be done.
>
> This will leave readers stranded if a former writer is in __rwsem_do_wake
> to wake up the readers and another writer steals the lock, but before
> the former writer exits without having woken up the readers, the locking
> stealing writer drops the lock and sees the wakeup flag is set, so
> doesn't bother to wake the readers.
>
> Regards,
> Peter Hurley

Yes, you are right. That can be a problem. Thanks for pointing this out.

The workloads that I used don't seem to exercise the readers. I will
modify the patch to add code to handle this failure case by resetting
the wakeup flag, pushing it out and then retrying one more time to get
the read lock. I think that should address the problem.

Regards,
Longman

^ permalink raw reply	[flat|nested] 55+ messages in thread