* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
@ 2014-02-06 14:52 Daniel J Blueman
  0 siblings, 0 replies; 26+ messages in thread
From: Daniel J Blueman @ 2014-02-06 14:52 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, Waiman Long

On Thursday, 6 February 2014 22:10:01 UTC+8, Peter Zijlstra wrote:
> On Wed, Feb 05, 2014 at 04:44:34PM -0500, Waiman Long wrote:
> > I have an alternative way of breaking out of the MCS lock waiting queue when
> > need_resched() is set. I overload the locked flag to indicate a skipped node
> > if negative.
>
> I'm not quite seeing how it works (then again, I've not really read the
> patch carefully).
>
> Suppose you break out; at that point you get queued and go to sleep.
> Suppose you got woken up while you MCS entry is still 'pending' and
> magically win the race and acquire the lock.
>
> At that point your MCS entry can be re-used while its still part of the
> list.
>
> Its a fantastically small race window, but I don't see anything that
> makes it impossible.
>
> > I run the patch through the AIM7 high-systime workload on a
> > 4-socket server and it seemed to run fine.
>
> How do people run this AIM7 piece of shit? I let it run for over an hour
> and it generated exactly 0 numbers, it just sits there eating cpu-time
> and creating a racket from my pantry.

Without any better advice, I was building the OSDL AIM7 [1], tweaking
DISKDIR in data/reaim.config and running (eg on a 384-core setup):

$ src/reaim -c data/reaim.config -f data/workfile.compute -i 16 -e 384

Thanks,
  Daniel

[1] http://sourceforge.net/projects/re-aim-7/
--
Daniel J Blueman
Principal Software Engineer, Numascale

^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH v2 0/5] mutex: Mutex scalability patches
@ 2014-01-28 19:13 Jason Low
2014-01-28 19:13 ` [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued Jason Low
0 siblings, 1 reply; 26+ messages in thread
From: Jason Low @ 2014-01-28 19:13 UTC (permalink / raw)
To: mingo, peterz, paulmck, Waiman.Long, torvalds, tglx, jason.low2
Cc: linux-kernel, riel, akpm, davidlohr, hpa, andi, aswin,
scott.norton, chegu_vinod
v1->v2:
- Replace the previous patch that limits the # of times a thread can spin with
!lock->owner with a patch that releases the mutex before holding the wait_lock
in the __mutex_unlock_common_slowpath() function.
- Add a patch which allows a thread to attempt 1 mutex_spin_on_owner() without
checking need_resched() if need_resched() triggered while in the MCS queue.
- Add a patch which disables preemption between modifying lock->owner and
acquiring/releasing the mutex.
This patchset addresses a few scalability issues with mutexes.
Patch 1 has the mutex_can_spin_on_owner() function check for need_resched()
before the thread is added to the MCS queue.
Patches 2 and 3 fix issues with threads spinning when the mutex has no
owner while it is under high contention.
Patches 4 and 5 are RFC patches. Patch 4 disables preemption between modifying
lock->owner and locking/unlocking the mutex. Patch 5 addresses the situation
where spinners can wait a long time in the MCS queue for a chance to spin on
the mutex owner (without checking for need_resched()), yet end up not getting
to spin at all.
These changes benefit the AIM7 fserver and high_systime workloads (run on disk)
on an 8 socket, 80 core box. The table below shows the performance
improvements with 3.13 + patches 1, 2, 3 when compared to the 3.13 baseline,
and the performance improvements with 3.13 + all 5 patches compared to
the 3.13 baseline.
Note: I split the % improvement into these 2 categories because
patch 3 and patch 5 are the most interesting/important patches in
this patchset in terms of performance improvements.
---------------------------------------------------------------
high_systime
---------------------------------------------------------------
# users | avg % improvement with | avg % improvement with
| 3.13 + patch 1, 2, 3 | 3.13 + patch 1, 2, 3, 4, 5
---------------------------------------------------------------
1000-2000 | +27.05% | +53.35%
---------------------------------------------------------------
100-900 | +36.11% | +52.56%
---------------------------------------------------------------
10-90 | +2.47% | +4.05%
---------------------------------------------------------------
---------------------------------------------------------------
fserver
---------------------------------------------------------------
# users | avg % improvement with | avg % improvement with
| 3.13 + patch 1, 2, 3 | 3.13 + patch 1, 2, 3, 4, 5
---------------------------------------------------------------
1000-2000 | +18.31% | +37.65%
---------------------------------------------------------------
100-900 | +5.99% | +17.50%
---------------------------------------------------------------
10-90 | +2.47% | +6.10%
^ permalink raw reply [flat|nested] 26+ messages in thread* [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued 2014-01-28 19:13 [PATCH v2 0/5] mutex: Mutex scalability patches Jason Low @ 2014-01-28 19:13 ` Jason Low 2014-01-28 21:07 ` Peter Zijlstra 0 siblings, 1 reply; 26+ messages in thread From: Jason Low @ 2014-01-28 19:13 UTC (permalink / raw) To: mingo, peterz, paulmck, Waiman.Long, torvalds, tglx, jason.low2 Cc: linux-kernel, riel, akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod Before a thread attempts mutex spin on owner, it is first added to a queue using an MCS lock so that only 1 thread spins on owner at a time. However, when the spinner is queued, it is unable to check if it needs to reschedule and will remain on the queue. This could cause the spinner to spin longer than its allocated time. However, once it is the spinner's turn to spin on owner, it would immediately go to slowpath if it need_resched() and gets no spin time. In these situations, not only does the spinner take up more time for a chance to spin for the mutex, it also ends up not getting to spin once it gets its turn. One solution would be to exit the MCS queue and go to mutex slowpath if need_resched(). However, that may require a lot of overhead. For instance, if a thread at the end of the queue need_resched(), in order to remove it from the queue, we would have to unqueue and requeue all other spinners in front of it. This RFC patch tries to address the issue in another context by avoiding situations where spinners immediately get sent to the slowpath on need_resched() upon getting to spin. We will first check for need_resched() right after acquiring the MCS lock. If need_resched() is true, then need_resched() triggered while the thread is waiting in the MCS queue (since patch 1 makes the spinner check for need_resched() before entering the queue). In this case, we will allow the thread to have at least 1 try to do mutex_spin_on_owner() regardless of need_resched(). This patch also removes the need_resched() in the outer loop in case we require a few extra spins to observe lock->count == 1, and patch 4 ensures we won't be spinning with lock owner preempted. And if the need_resched() check after acquiring the MCS lock is false, then we won't give the spinner any extra spinning. Signed-off-by: Jason Low <jason.low2@hp.com> --- kernel/locking/mutex.c | 22 ++++++++++++++++------ 1 files changed, 16 insertions(+), 6 deletions(-) diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index cfaaf53..2281a48 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -187,11 +187,11 @@ static inline bool owner_running(struct mutex *lock, struct task_struct *owner) * access and not reliable. 
*/ static noinline -int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner) +int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner, int spin_grace_period) { rcu_read_lock(); while (owner_running(lock, owner)) { - if (need_resched()) + if (need_resched() && !spin_grace_period) break; arch_mutex_cpu_relax(); @@ -428,7 +428,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, struct task_struct *task = current; struct mutex_waiter waiter; unsigned long flags; - int ret; + int ret, spin_grace_period; struct mspin_node node; preempt_disable(); @@ -461,6 +461,12 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, goto slowpath; mspin_lock(MLOCK(lock), &node); + /* + * If need_resched() triggered while queued, then we will still give + * this spinner a chance to spin for the mutex, rather than send this + * immediately to slowpath after waiting for its turn. + */ + spin_grace_period = (need_resched()) ? 1 : 0; for (;;) { struct task_struct *owner; @@ -485,8 +491,12 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, * release the lock or go to sleep. */ owner = ACCESS_ONCE(lock->owner); - if (owner && !mutex_spin_on_owner(lock, owner)) - break; + if (owner) { + if (!mutex_spin_on_owner(lock, owner, spin_grace_period)) + break; + + spin_grace_period = 0; + } if ((atomic_read(&lock->count) == 1) && (atomic_cmpxchg(&lock->count, 1, 0) == 1)) { @@ -510,7 +520,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, * we're an RT task that will live-lock because we won't let * the owner complete. */ - if (!owner && (need_resched() || rt_task(task))) + if (!owner && rt_task(task)) break; /* -- 1.7.1 ^ permalink raw reply related [flat|nested] 26+ messages in thread
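For reference while reading the replies below: a minimal user-space sketch of a
plain (non-cancelable) MCS lock, of the kind the optimistic-spin path above
queues on. This is not the kernel code -- it uses C11 atomics rather than the
kernel's xchg()/cmpxchg()/ACCESS_ONCE() primitives, takes a caller-supplied
node instead of the per-CPU node the patches rely on, and the names are
illustrative only.

#include <stdatomic.h>
#include <stddef.h>

struct mcs_node {
	_Atomic(struct mcs_node *) next;
	atomic_int locked;		/* set to 1 by our predecessor on hand-off */
};

/* The lock itself is just a tail pointer; NULL means unlocked. */
typedef _Atomic(struct mcs_node *) mcs_lock_t;

static void mcs_lock(mcs_lock_t *lock, struct mcs_node *node)
{
	struct mcs_node *prev;

	atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
	atomic_store_explicit(&node->locked, 0, memory_order_relaxed);

	/* Swing the tail to us; whoever was there before is our predecessor. */
	prev = atomic_exchange_explicit(lock, node, memory_order_acq_rel);
	if (!prev)
		return;			/* queue was empty: lock acquired */

	/* Publish ourselves so the predecessor can hand the lock over ... */
	atomic_store_explicit(&prev->next, node, memory_order_release);

	/* ... and spin on our own node until it does. */
	while (!atomic_load_explicit(&node->locked, memory_order_acquire))
		;
}

static void mcs_unlock(mcs_lock_t *lock, struct mcs_node *node)
{
	struct mcs_node *next =
		atomic_load_explicit(&node->next, memory_order_acquire);

	if (!next) {
		/* No successor visible: try to swing the tail back to NULL. */
		struct mcs_node *expect = node;

		if (atomic_compare_exchange_strong_explicit(lock, &expect, NULL,
				memory_order_release, memory_order_relaxed))
			return;
		/* A successor is mid-enqueue; wait for it to link itself. */
		while (!(next = atomic_load_explicit(&node->next,
						     memory_order_acquire)))
			;
	}
	/* Hand the lock directly to the next waiter. */
	atomic_store_explicit(&next->locked, 1, memory_order_release);
}

The point to note for what follows is the unconditional spin on node->locked in
mcs_lock(): once prev->next points at a waiter there is no cheap way for it to
leave the queue, which is the problem the cancelable variant developed below
sets out to solve.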
* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-01-28 19:13 ` [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued Jason Low
@ 2014-01-28 21:07   ` Peter Zijlstra
  2014-01-28 22:51     ` Jason Low
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2014-01-28 21:07 UTC (permalink / raw)
  To: Jason Low
  Cc: mingo, paulmck, Waiman.Long, torvalds, tglx, linux-kernel, riel,
      akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On Tue, Jan 28, 2014 at 11:13:16AM -0800, Jason Low wrote:
> Before a thread attempts mutex spin on owner, it is first added to a queue
> using an MCS lock so that only 1 thread spins on owner at a time. However, when
> the spinner is queued, it is unable to check if it needs to reschedule and
> will remain on the queue. This could cause the spinner to spin longer
> than its allocated time.

what allocated time?

> However, once it is the spinner's turn to spin on
> owner, it would immediately go to slowpath if it need_resched() and gets no spin
> time. In these situations, not only does the spinner take up more time for a
> chance to spin for the mutex, it also ends up not getting to spin once it
> gets its turn.
>
> One solution would be to exit the MCS queue and go to mutex slowpath if
> need_resched(). However, that may require a lot of overhead. For instance, if a
> thread at the end of the queue need_resched(), in order to remove it from the
> queue, we would have to unqueue and requeue all other spinners in front of it.

If you can do that, you can also walk the list and find prev and cmpxchg
the entry out. But I don't think you can do either, as we simply don't
have a head pointer.

> This RFC patch tries to address the issue in another context by avoiding
> situations where spinners immediately get sent to the slowpath on
> need_resched() upon getting to spin.

> We will first check for need_resched()
> right after acquiring the MCS lock. If need_resched() is true, then
> need_resched() triggered while the thread is waiting in the MCS queue (since
> patch 1 makes the spinner check for need_resched() before entering the queue).

> In this case, we will allow the thread to have at least 1 try to do
> mutex_spin_on_owner() regardless of need_resched().

No! We should absolutely not ever willfully ignore need_resched(). Esp.
not for unbounded spins.

> This patch also removes
> the need_resched() in the outer loop in case we require a few extra spins to
> observe lock->count == 1, and patch 4 ensures we won't be spinning with
> lock owner preempted.
>
> And if the need_resched() check after acquiring the MCS lock is false, then
> we won't give the spinner any extra spinning.

But urgh, nasty problem. Lemme ponder this a bit.

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued 2014-01-28 21:07 ` Peter Zijlstra @ 2014-01-28 22:51 ` Jason Low 2014-01-29 11:51 ` Peter Zijlstra 0 siblings, 1 reply; 26+ messages in thread From: Jason Low @ 2014-01-28 22:51 UTC (permalink / raw) To: Peter Zijlstra Cc: mingo, paulmck, Waiman.Long, torvalds, tglx, linux-kernel, riel, akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod On Tue, 2014-01-28 at 22:07 +0100, Peter Zijlstra wrote: > On Tue, Jan 28, 2014 at 11:13:16AM -0800, Jason Low wrote: > > Before a thread attempts mutex spin on owner, it is first added to a queue > > using an MCS lock so that only 1 thread spins on owner at a time. However, when > > the spinner is queued, it is unable to check if it needs to reschedule and > > will remain on the queue. This could cause the spinner to spin longer > > than its allocated time. > > what allocated time? Hi Peter, By "spin longer than its allocated time", I meant to say that the thread can continue spinning/waiting in the MCS queue after need_resched() has been set. > > However, once it is the spinner's turn to spin on > > owner, it would immediately go to slowpath if it need_resched() and gets no spin > > time. In these situations, not only does the spinner take up more time for a > > chance to spin for the mutex, it also ends up not getting to spin once it > > gets its turn. > > > > One solution would be to exit the MCS queue and go to mutex slowpath if > > need_resched(). However, that may require a lot of overhead. For instance, if a > > thread at the end of the queue need_resched(), in order to remove it from the > > queue, we would have to unqueue and requeue all other spinners in front of it. > > If you can do that, you can also walk the list and find prev and cmpxchg > the entry out. But I don't think you can do either, as we simply don't > have a head pointer. > > > This RFC patch tries to address the issue in another context by avoiding > > situations where spinners immediately get sent to the slowpath on > > need_resched() upon getting to spin. > > > We will first check for need_resched() > > right after acquiring the MCS lock. If need_resched() is true, then > > need_resched() triggered while the thread is waiting in the MCS queue (since > > patch 1 makes the spinner check for need_resched() before entering the queue). > > > In this case, we will allow the thread to have at least 1 try to do > > mutex_spin_on_owner() regardless of need_resched(). > > No! We should absolutely not ever willfully ignore need_resched(). Esp. > not for unbounded spins. Ok. This was sort of a proof of concept patch to illustrate the type of performance gains we can get by addressing this issue. > > This patch also removes > > the need_resched() in the outer loop in case we require a few extra spins to > > observe lock->count == 1, and patch 4 ensures we won't be spinning with > > lock owner preempted. > > > > And if the need_resched() check after acquiring the MCS lock is false, then > > we won't give the spinner any extra spinning. > > But urgh, nasty problem. Lemme ponder this a bit. Thanks, Jason ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued 2014-01-28 22:51 ` Jason Low @ 2014-01-29 11:51 ` Peter Zijlstra 2014-01-31 3:29 ` Jason Low 2014-02-05 21:44 ` Waiman Long 0 siblings, 2 replies; 26+ messages in thread From: Peter Zijlstra @ 2014-01-29 11:51 UTC (permalink / raw) To: Jason Low Cc: mingo, paulmck, Waiman.Long, torvalds, tglx, linux-kernel, riel, akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod On Tue, Jan 28, 2014 at 02:51:35PM -0800, Jason Low wrote: > > But urgh, nasty problem. Lemme ponder this a bit. OK, please have a very careful look at the below. It survived a boot with udev -- which usually stresses mutex contention enough to explode (in fact it did a few time when I got the contention/cancel path wrong), however I have not ran anything else on it. The below is an MCS variant that allows relatively cheap unqueueing. But its somewhat tricky and I might have gotten a case wrong, esp. the double concurrent cancel case got my head hurting (I didn't attempt a tripple unqueue). Applies to tip/master but does generate a few (harmless) compile warnings because I didn't fully clean up the mcs_spinlock vs m_spinlock thing. Also, there's a comment in the slowpath that bears consideration. --- kernel/locking/mutex.c | 158 +++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 148 insertions(+), 10 deletions(-) diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index 45fe1b5293d6..4a69da73903c 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -166,6 +166,9 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock) struct task_struct *owner; int retval = 1; + if (need_resched()) + return 0; + rcu_read_lock(); owner = ACCESS_ONCE(lock->owner); if (owner) @@ -358,6 +361,134 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, spin_unlock_mutex(&lock->base.wait_lock, flags); } +struct m_spinlock { + struct m_spinlock *next, *prev; + int locked; /* 1 if lock acquired */ +}; + +/* + * Using a single mcs node per CPU is safe because mutex_lock() should not be + * called from interrupt context and we have preemption disabled over the mcs + * lock usage. + */ +static DEFINE_PER_CPU(struct m_spinlock, m_node); + +static bool m_spin_lock(struct m_spinlock **lock) +{ + struct m_spinlock *node = this_cpu_ptr(&m_node); + struct m_spinlock *prev, *next; + + node->locked = 0; + node->next = NULL; + + node->prev = prev = xchg(lock, node); + if (likely(prev == NULL)) + return true; + + ACCESS_ONCE(prev->next) = node; + + /* + * Normally @prev is untouchable after the above store; because at that + * moment unlock can proceed and wipe the node element from stack. + * + * However, since our nodes are static per-cpu storage, we're + * guaranteed their existence -- this allows us to apply + * cmpxchg in an attempt to undo our queueing. + */ + + while (!smp_load_acquire(&node->locked)) { + if (need_resched()) + goto unqueue; + arch_mutex_cpu_relax(); + } + return true; + +unqueue: + + /* + * Undo our @prev->next assignment; this will make @prev's unlock() + * wait for a next pointer since @lock points to us. + */ + if (cmpxchg(&prev->next, node, NULL) != node) { /* A -> B */ + /* + * @prev->next no longer pointed to us; either we hold the lock + * or @prev cancelled the wait too and we need to reload and + * retry. 
+ */ + if (smp_load_acquire(&node->locked)) + return true; + + /* + * Because we observed the new @prev->next, the smp_wmb() at (B) + * ensures that we must now observe the new @node->prev. + */ + prev = ACCESS_ONCE(node->prev); + goto unqueue; + } + + if (smp_load_acquire(&node->locked)) { + /* + * OOPS, we were too late, we already got the lock. No harm + * done though; @prev is now unused an nobody cares we frobbed + * it. + */ + return true; + } + + /* + * Per the above logic @prev's unlock() will now wait, + * therefore @prev is now stable. + */ + + if (cmpxchg(lock, node, prev) == node) { + /* + * We were still the last queued, we moved @lock back. @prev + * will now observe @lock and will complete its unlock(). + */ + return false; + } + + /* + * We're not the last to be queued, obtain ->next. + */ + + while (!(next = ACCESS_ONCE(node->next))) + arch_mutex_cpu_relax(); + + ACCESS_ONCE(next->prev) = prev; + + /* + * Ensure that @next->prev is written before we write @prev->next, + * this guarantees that when the cmpxchg at (A) fails we must + * observe the new prev value. + */ + smp_wmb(); /* B -> A */ + + /* + * And point @prev to our next, and we're unlinked. We can use an + * non-atomic op because only we modify @prev->next. + */ + ACCESS_ONCE(prev->next) = next; + + return false; +} + +static void m_spin_unlock(struct m_spinlock **lock) +{ + struct m_spinlock *node = this_cpu_ptr(&m_node); + struct m_spinlock *next = ACCESS_ONCE(node->next); + + while (likely(!next)) { + if (likely(cmpxchg(lock, node, NULL) == node)) + return; + + arch_mutex_cpu_relax(); + next = ACCESS_ONCE(node->next); + } + + smp_store_release(&next->locked, 1); +} + /* * Lock a mutex (possibly interruptible), slowpath: */ @@ -400,9 +531,11 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, if (!mutex_can_spin_on_owner(lock)) goto slowpath; + if (!m_spin_lock(&lock->mcs_lock)) + goto slowpath; + for (;;) { struct task_struct *owner; - struct mcs_spinlock node; if (use_ww_ctx && ww_ctx->acquired > 0) { struct ww_mutex *ww; @@ -417,19 +550,16 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, * performed the optimistic spinning cannot be done. */ if (ACCESS_ONCE(ww->ctx)) - goto slowpath; + break; } /* * If there's an owner, wait for it to either * release the lock or go to sleep. */ - mcs_spin_lock(&lock->mcs_lock, &node); owner = ACCESS_ONCE(lock->owner); - if (owner && !mutex_spin_on_owner(lock, owner)) { - mcs_spin_unlock(&lock->mcs_lock, &node); - goto slowpath; - } + if (owner && !mutex_spin_on_owner(lock, owner)) + break; if ((atomic_read(&lock->count) == 1) && (atomic_cmpxchg(&lock->count, 1, 0) == 1)) { @@ -442,11 +572,10 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, } mutex_set_owner(lock); - mcs_spin_unlock(&lock->mcs_lock, &node); + m_spin_unlock(&lock->mcs_lock); preempt_enable(); return 0; } - mcs_spin_unlock(&lock->mcs_lock, &node); /* * When there's no owner, we might have preempted between the @@ -455,7 +584,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, * the owner complete. 
*/ if (!owner && (need_resched() || rt_task(task))) - goto slowpath; + break; /* * The cpu_relax() call is a compiler barrier which forces @@ -465,10 +594,19 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, */ arch_mutex_cpu_relax(); } + m_spin_unlock(&lock->mcs_lock); slowpath: #endif spin_lock_mutex(&lock->wait_lock, flags); + /* + * XXX arguably, when need_resched() is set and there's other pending + * owners we should not try-acquire and simply queue and schedule(). + * + * There's nothing worse than obtaining a lock only to get immediately + * scheduled out. + */ + /* once more, can we acquire the lock? */ if (MUTEX_SHOW_NO_WAITER(lock) && (atomic_xchg(&lock->count, 0) == 1)) goto skip_wait; ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-01-29 11:51 ` Peter Zijlstra
@ 2014-01-31  3:29   ` Jason Low
  2014-01-31 14:09     ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Jason Low @ 2014-01-31 3:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, paulmck, Waiman.Long, torvalds, tglx, linux-kernel, riel,
      akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On Wed, 2014-01-29 at 12:51 +0100, Peter Zijlstra wrote:
> On Tue, Jan 28, 2014 at 02:51:35PM -0800, Jason Low wrote:
> > > But urgh, nasty problem. Lemme ponder this a bit.
>
> OK, please have a very careful look at the below. It survived a boot
> with udev -- which usually stresses mutex contention enough to explode
> (in fact it did a few time when I got the contention/cancel path wrong),
> however I have not ran anything else on it.

I tested this patch on a 2 socket, 8 core machine with the AIM7 fserver
workload. After 100 users, the system gets soft lockups.

Some condition may be causing threads to not leave the "goto unqueue"
loop. I added a debug counter, and threads were able to reach more than
1,000,000,000 "goto unqueue" iterations.

I was also initially wondering whether there can be problems when multiple
threads need_resched() and unqueue at the same time. As an example, 2
nodes that need to reschedule are next to each other in the middle of
the MCS queue. The 1st node executes "while (!(next =
ACCESS_ONCE(node->next)))" and exits the while loop because next is not
NULL. Then, the 2nd node executes its "if (cmpxchg(&prev->next, node,
NULL) != node)". We may then end up in a situation where the node before
the 1st node gets linked with the outdated 2nd node.

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued 2014-01-31 3:29 ` Jason Low @ 2014-01-31 14:09 ` Peter Zijlstra 2014-01-31 20:01 ` Jason Low 2014-02-02 20:02 ` Peter Zijlstra 0 siblings, 2 replies; 26+ messages in thread From: Peter Zijlstra @ 2014-01-31 14:09 UTC (permalink / raw) To: Jason Low Cc: mingo, paulmck, Waiman.Long, torvalds, tglx, linux-kernel, riel, akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod On Thu, Jan 30, 2014 at 07:29:37PM -0800, Jason Low wrote: > On Wed, 2014-01-29 at 12:51 +0100, Peter Zijlstra wrote: > > On Tue, Jan 28, 2014 at 02:51:35PM -0800, Jason Low wrote: > > > > But urgh, nasty problem. Lemme ponder this a bit. > > > > OK, please have a very careful look at the below. It survived a boot > > with udev -- which usually stresses mutex contention enough to explode > > (in fact it did a few time when I got the contention/cancel path wrong), > > however I have not ran anything else on it. > > I tested this patch on a 2 socket, 8 core machine with the AIM7 fserver > workload. After 100 users, the system gets soft lockups. > > Some condition may be causing threads to not leave the "goto unqueue" > loop. I added a debug counter, and threads were able to reach more than > 1,000,000,000 "goto unqueue". > > I also was initially thinking if there can be problems when multiple > threads need_resched() and unqueue at the same time. As an example, 2 > nodes that need to reschedule are next to each other in the middle of > the MCS queue. The 1st node executes "while (!(next = > ACCESS_ONCE(node->next)))" and exits the while loop because next is not > NULL. Then, the 2nd node execute its "if (cmpxchg(&prev->next, node, > NULL) != node)". We may then end up in a situation where the node before > the 1st node gets linked with the outdated 2nd node. Yes indeed, I found two bugs. If you consider the MCS lock with 4 queued nodes: .----. L -> | N4 | `----' ^ V .----. | N3 | `----' ^ V .----. | N2 | `----' ^ V .----. | N1 | `----' And look at the 3 unqueue steps in the below patch, and read 3A to mean, Node 3 ran step A. Both bugs were in Step B, the first was triggered through: 3A,4A,3B,4B In this case the old code would have 3 spinning in @n3->next, however 4B would have moved L to N3 and left @n3->next empty. FAIL. The second was: 2A,2B,3A,2C,3B,3C In this case the cancellation of both 2 and 3 'succeed' and we end up with N1 linked to N3 and N2 linked to N4. Total fail. The first bug was fixed by having Step-B spin on both @n->next and @l just like the unlink() path already did. The second bug was fixed by having Step-B xchg(@n->next, NULL) instead of only reading it; this way the 3A above cannot complete until after 2C and we have a coherent chain back. 
I've downloaded AIM7 from sf.net and I hope I'm running it with 100+ loads but I'm not entirely sure I got this thing right, its not really making progress with or without patch :/ --- --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -166,6 +166,9 @@ static inline int mutex_can_spin_on_owne struct task_struct *owner; int retval = 1; + if (need_resched()) + return 0; + rcu_read_lock(); owner = ACCESS_ONCE(lock->owner); if (owner) @@ -358,6 +361,166 @@ ww_mutex_set_context_fastpath(struct ww_ spin_unlock_mutex(&lock->base.wait_lock, flags); } +struct m_spinlock { + struct m_spinlock *next, *prev; + int locked; /* 1 if lock acquired */ +}; + +/* + * Using a single mcs node per CPU is safe because mutex_lock() should not be + * called from interrupt context and we have preemption disabled over the mcs + * lock usage. + */ +static DEFINE_PER_CPU(struct m_spinlock, m_node); + +static bool m_spin_lock(struct m_spinlock **lock) +{ + struct m_spinlock *node = this_cpu_ptr(&m_node); + struct m_spinlock *prev, *next; + + node->locked = 0; + node->next = NULL; + + node->prev = prev = xchg(lock, node); + if (likely(prev == NULL)) + return true; + + ACCESS_ONCE(prev->next) = node; + + /* + * Normally @prev is untouchable after the above store; because at that + * moment unlock can proceed and wipe the node element from stack. + * + * However, since our nodes are static per-cpu storage, we're + * guaranteed their existence -- this allows us to apply + * cmpxchg in an attempt to undo our queueing. + */ + + while (!smp_load_acquire(&node->locked)) { + if (need_resched()) + goto unqueue; + arch_mutex_cpu_relax(); + } + return true; + +unqueue: + /* + * Step - A -- stabilize @prev + * + * Undo our @prev->next assignment; this will make @prev's + * unlock()/cancel() wait for a next pointer since @lock points to us + * (or later). + */ + + for (;;) { + next = cmpxchg(&prev->next, node, NULL); /* A -> B,C */ + + /* + * Because the unlock() path retains @prev->next (for + * performance) we must check @node->locked after clearing + * @prev->next to see if we raced. + * + * Ordered by the cmpxchg() above and the conditional-store in + * unlock(). + */ + if (smp_load_acquire(&node->locked)) { + /* + * OOPS, we were too late, we already got the lock. No + * harm done though; @prev is now unused an nobody + * cares we frobbed it. + */ + return true; + } + + if (next == node) + break; + + /* + * @prev->next didn't point to us anymore, we didn't own the + * lock, so reload and try again. + * + * Because we observed the new @prev->next, the smp_wmb() at + * (C) ensures that we must now observe the new @node->prev. + */ + prev = ACCESS_ONCE(node->prev); + } + + /* + * Step - B -- stabilize @next + * + * Similar to unlock(), wait for @node->next or move @lock from @node + * back to @prev. + */ + + for (;;) { + if (*lock == node && cmpxchg(lock, node, prev) == node) { + /* + * We were the last queued, we moved @lock back. @prev + * will now observe @lock and will complete its + * unlock()/cancel(). + */ + return false; + } + + /* + * We must xchg() the @node->next value, because if we were to + * leave it in, a concurrent cancel() from @node->next might + * complete Step-A and think its @prev is still valid. + * + * If the concurrent cancel() wins the race, we'll wait for + * either @lock to point to us, through its Step-B, or wait for + * a new @node->next from its Step-C. 
+ */ + next = xchg(&node->next, NULL); /* B -> A */ + if (next) + break; + + arch_mutex_cpu_relax(); + } + + /* + * Step - C -- unlink + * + * @prev is stable because its still waiting for a new @prev->next + * pointer, @next is stable because our @node->next pointer is NULL and + * it will wait in Step-A. + */ + + ACCESS_ONCE(next->prev) = prev; + + /* + * Ensure that @next->prev is written before we write @prev->next, + * this guarantees that when the cmpxchg at (A) fails we must + * observe the new prev value. + */ + smp_wmb(); /* C -> A */ + + /* + * And point @prev to our next, and we're unlinked. + */ + ACCESS_ONCE(prev->next) = next; + + return false; +} + +static void m_spin_unlock(struct m_spinlock **lock) +{ + struct m_spinlock *node = this_cpu_ptr(&m_node); + struct m_spinlock *next; + + for (;;) { + if (likely(cmpxchg(lock, node, NULL) == node)) + return; + + next = ACCESS_ONCE(node->next); + if (unlikely(next)) + break; + + arch_mutex_cpu_relax(); + } + smp_store_release(&next->locked, 1); +} + /* * Lock a mutex (possibly interruptible), slowpath: */ @@ -400,9 +563,11 @@ __mutex_lock_common(struct mutex *lock, if (!mutex_can_spin_on_owner(lock)) goto slowpath; + if (!m_spin_lock(&lock->mcs_lock)) + goto slowpath; + for (;;) { struct task_struct *owner; - struct mcs_spinlock node; if (use_ww_ctx && ww_ctx->acquired > 0) { struct ww_mutex *ww; @@ -417,19 +582,16 @@ __mutex_lock_common(struct mutex *lock, * performed the optimistic spinning cannot be done. */ if (ACCESS_ONCE(ww->ctx)) - goto slowpath; + break; } /* * If there's an owner, wait for it to either * release the lock or go to sleep. */ - mcs_spin_lock(&lock->mcs_lock, &node); owner = ACCESS_ONCE(lock->owner); - if (owner && !mutex_spin_on_owner(lock, owner)) { - mcs_spin_unlock(&lock->mcs_lock, &node); - goto slowpath; - } + if (owner && !mutex_spin_on_owner(lock, owner)) + break; if ((atomic_read(&lock->count) == 1) && (atomic_cmpxchg(&lock->count, 1, 0) == 1)) { @@ -442,11 +604,10 @@ __mutex_lock_common(struct mutex *lock, } mutex_set_owner(lock); - mcs_spin_unlock(&lock->mcs_lock, &node); + m_spin_unlock(&lock->mcs_lock); preempt_enable(); return 0; } - mcs_spin_unlock(&lock->mcs_lock, &node); /* * When there's no owner, we might have preempted between the @@ -455,7 +616,7 @@ __mutex_lock_common(struct mutex *lock, * the owner complete. */ if (!owner && (need_resched() || rt_task(task))) - goto slowpath; + break; /* * The cpu_relax() call is a compiler barrier which forces @@ -465,10 +626,19 @@ __mutex_lock_common(struct mutex *lock, */ arch_mutex_cpu_relax(); } + m_spin_unlock(&lock->mcs_lock); slowpath: #endif spin_lock_mutex(&lock->wait_lock, flags); + /* + * XXX arguably, when need_resched() is set and there's other pending + * owners we should not try-acquire and simply queue and schedule(). + * + * There's nothing worse than obtaining a lock only to get immediately + * scheduled out. + */ + /* once more, can we acquire the lock? */ if (MUTEX_SHOW_NO_WAITER(lock) && (atomic_xchg(&lock->count, 0) == 1)) goto skip_wait; ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-01-31 14:09 ` Peter Zijlstra
@ 2014-01-31 20:01   ` Jason Low
  2014-01-31 20:08     ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Jason Low @ 2014-01-31 20:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jason Low, Ingo Molnar, Paul McKenney, Waiman Long, Linus Torvalds,
      Thomas Gleixner, Linux Kernel Mailing List, Rik van Riel,
      Andrew Morton, Davidlohr Bueso, H. Peter Anvin, Andi Kleen,
      Chandramouleeswaran, Aswin, Norton, Scott J, chegu_vinod

On Fri, Jan 31, 2014 at 6:09 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> I've downloaded AIM7 from sf.net and I hope I'm running it with 100+
> loads but I'm not entirely sure I got this thing right, its not really
> making progress with or without patch :/

Ingo's program http://lkml.org/lkml/2006/1/8/50 using the V option may
be able to generate similar mutex contention.

Currently still getting soft lockups with the updated version.

^ permalink raw reply [flat|nested] 26+ messages in thread
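Where neither AIM7 nor Ingo's test program is at hand, one crude stand-in for
generating in-kernel mutex contention is to have many processes create and
unlink files in one shared directory: on a 3.13-era kernel this serializes on
the parent directory's i_mutex, a struct mutex, and so exercises the
optimistic-spin path being patched here. The sketch below is only such a
stand-in with arbitrary names and parameters, not the program referenced above.

/*
 * Crude stand-in workload (not Ingo's test-mutex.c): many processes creating
 * and unlinking files in one shared directory, contending on the parent
 * directory's i_mutex.  Usage: ./contend [nproc] [iterations]
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int nproc = argc > 1 ? atoi(argv[1]) : 64;
	int iters = argc > 2 ? atoi(argv[2]) : 100000;
	const char *dir = "contend.d";

	mkdir(dir, 0755);

	for (int p = 0; p < nproc; p++) {
		if (fork() == 0) {
			char path[128];

			for (int i = 0; i < iters; i++) {
				snprintf(path, sizeof(path), "%s/f.%d.%d",
					 dir, p, i);
				int fd = open(path, O_CREAT | O_WRONLY, 0644);
				if (fd >= 0)
					close(fd);
				unlink(path);
			}
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}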
* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-01-31 20:01 ` Jason Low
@ 2014-01-31 20:08   ` Peter Zijlstra
  2014-02-02 21:01     ` Jason Low
  2014-02-02 22:02     ` Paul E. McKenney
  0 siblings, 2 replies; 26+ messages in thread
From: Peter Zijlstra @ 2014-01-31 20:08 UTC (permalink / raw)
  To: Jason Low
  Cc: Ingo Molnar, Paul McKenney, Waiman Long, Linus Torvalds,
      Thomas Gleixner, Linux Kernel Mailing List, Rik van Riel,
      Andrew Morton, Davidlohr Bueso, H. Peter Anvin, Andi Kleen,
      Chandramouleeswaran, Aswin, Norton, Scott J, chegu_vinod

On Fri, Jan 31, 2014 at 12:01:37PM -0800, Jason Low wrote:
> On Fri, Jan 31, 2014 at 6:09 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> > I've downloaded AIM7 from sf.net and I hope I'm running it with 100+
> > loads but I'm not entirely sure I got this thing right, its not really
> > making progress with or without patch :/
>
> Ingo's program http://lkml.org/lkml/2006/1/8/50 using the V option may
> be able to generate similar mutex contention.
>
> Currently still getting soft lockups with the updated version.

Bugger.. ok clearly I need to think harder still. I'm fairly sure this
cancelation can work though, just seems tricky to get right :-)

I'll give that proglet from Ingo a go, although that might be Monday ere
I get to it.

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-01-31 20:08 ` Peter Zijlstra
@ 2014-02-02 21:01   ` Jason Low
  2014-02-02 21:12     ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Jason Low @ 2014-02-02 21:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul McKenney, Waiman Long, Linus Torvalds,
      Thomas Gleixner, Linux Kernel Mailing List, Rik van Riel,
      Andrew Morton, Davidlohr Bueso, H. Peter Anvin, Andi Kleen,
      Chandramouleeswaran, Aswin, Norton, Scott J, chegu_vinod

On Fri, 2014-01-31 at 21:08 +0100, Peter Zijlstra wrote:
> On Fri, Jan 31, 2014 at 12:01:37PM -0800, Jason Low wrote:
> > Currently still getting soft lockups with the updated version.
>
> Bugger.. ok clearly I need to think harder still. I'm fairly sure this
> cancelation can work though, just seems tricky to get right :-)

Ok, I believe I have found a race condition between m_spin_lock() and
m_spin_unlock().

In m_spin_unlock(), we do "next = ACCESS_ONCE(node->next)". Then, if
next is not NULL, we proceed to set next->locked to 1.

A thread in m_spin_lock() in the unqueue path could execute
"next = cmpxchg(&prev->next, node, NULL)" after the thread in
m_spin_unlock() accesses its node->next and finds that it is not NULL.
Then, the thread in m_spin_lock() could check !node->locked before
the thread in m_spin_unlock() sets next->locked to 1.

The following additional change was able to solve the initial lockups that
were occurring when running fserver on a 2 socket box.

---
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 9eb4dbe..e71a84a 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -513,8 +513,13 @@ static void m_spin_unlock(struct m_spinlock **lock)
 			return;
 
 		next = ACCESS_ONCE(node->next);
-		if (unlikely(next))
-			break;
+
+		if (unlikely(next)) {
+			next = cmpxchg(&node->next, next, NULL);
+
+			if (next)
+				break;
+		}
 
 		arch_mutex_cpu_relax();
 	}

^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued 2014-02-02 21:01 ` Jason Low @ 2014-02-02 21:12 ` Peter Zijlstra 2014-02-03 18:39 ` Jason Low 0 siblings, 1 reply; 26+ messages in thread From: Peter Zijlstra @ 2014-02-02 21:12 UTC (permalink / raw) To: Jason Low Cc: Ingo Molnar, Paul McKenney, Waiman Long, Linus Torvalds, Thomas Gleixner, Linux Kernel Mailing List, Rik van Riel, Andrew Morton, Davidlohr Bueso, H. Peter Anvin, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J, chegu_vinod On Sun, Feb 02, 2014 at 01:01:23PM -0800, Jason Low wrote: > On Fri, 2014-01-31 at 21:08 +0100, Peter Zijlstra wrote: > > On Fri, Jan 31, 2014 at 12:01:37PM -0800, Jason Low wrote: > > > Currently still getting soft lockups with the updated version. > > > > Bugger.. ok clearly I need to think harder still. I'm fairly sure this > > cancelation can work though, just seems tricky to get right :-) > > Ok, I believe I have found a race condition between m_spin_lock() and > m_spin_unlock(). > > In m_spin_unlock(), we do "next = ACCESS_ONCE(node->next)". Then, if > next is not NULL, we proceed to set next->locked to 1. > > A thread in m_spin_lock() in the unqueue path could execute > "next = cmpxchg(&prev->next, node, NULL)" after the thread in > m_spin_unlock() accesses its node->next and finds that it is not NULL. > Then, the thread in m_spin_lock() could check !node->locked before > the thread in m_spin_unlock() sets next->locked to 1. Yes indeed. How silly of me to not spot that! > The following addition change was able to solve the initial lockups that were > occurring when running fserver on a 2 socket box. > > --- > diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c > index 9eb4dbe..e71a84a 100644 > --- a/kernel/locking/mutex.c > +++ b/kernel/locking/mutex.c > @@ -513,8 +513,13 @@ static void m_spin_unlock(struct m_spinlock **lock) > return; > > next = ACCESS_ONCE(node->next); > - if (unlikely(next)) > - break; > + > + if (unlikely(next)) { > + next = cmpxchg(&node->next, next, NULL); > + > + if (next) The cmpxchg could fail and next still be !NULL I suppose. > + break; > + } The way I wrote that same loop in step-B, is: for (;;) { if (*lock == node && cmpxchg(lock, node, prev) == node) return next = xchg(&node->next, NULL); /* B -> A */ if (next) break; arch_mutex_cpu_relax(); } I suppose we can make that something like: if (node->next) { next = xchg(&node->next, NULL); if (next) break } To avoid the xchg on every loop. I had wanted to avoid the additional locked op in the unlock path, but yes that does make things easier. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued 2014-02-02 21:12 ` Peter Zijlstra @ 2014-02-03 18:39 ` Jason Low 2014-02-03 19:25 ` Peter Zijlstra 0 siblings, 1 reply; 26+ messages in thread From: Jason Low @ 2014-02-03 18:39 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Paul McKenney, Waiman Long, Linus Torvalds, Thomas Gleixner, Linux Kernel Mailing List, Rik van Riel, Andrew Morton, Davidlohr Bueso, H. Peter Anvin, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J, chegu_vinod On Sun, 2014-02-02 at 22:12 +0100, Peter Zijlstra wrote: > The way I wrote that same loop in step-B, is: > > > for (;;) { > if (*lock == node && cmpxchg(lock, node, prev) == node) > return > > next = xchg(&node->next, NULL); /* B -> A */ > if (next) > break; > > arch_mutex_cpu_relax(); > } > > I suppose we can make that something like: > > > if (node->next) { > next = xchg(&node->next, NULL); > if (next) > break > } > > To avoid the xchg on every loop. Ah yes, we want to use xchg() on &node->next. Since the cmpxchg() is now in a loop in the unlock function, an additional (*lock == node) check before the cmpxchg() would also be nice to avoid spinning on cmpxchg() there too. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued 2014-02-03 18:39 ` Jason Low @ 2014-02-03 19:25 ` Peter Zijlstra 2014-02-03 20:55 ` Jason Low 2014-02-04 7:13 ` Jason Low 0 siblings, 2 replies; 26+ messages in thread From: Peter Zijlstra @ 2014-02-03 19:25 UTC (permalink / raw) To: Jason Low Cc: Ingo Molnar, Paul McKenney, Waiman Long, Linus Torvalds, Thomas Gleixner, Linux Kernel Mailing List, Rik van Riel, Andrew Morton, Davidlohr Bueso, H. Peter Anvin, Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J, chegu_vinod On Mon, Feb 03, 2014 at 10:39:20AM -0800, Jason Low wrote: > > To avoid the xchg on every loop. > > Ah yes, we want to use xchg() on &node->next. > > Since the cmpxchg() is now in a loop in the unlock function, an > additional (*lock == node) check before the cmpxchg() would also be nice > to avoid spinning on cmpxchg() there too. Right, I have the below; you can find the patches this depends upon here: http://programming.kicks-ass.net/sekrit/patches.tar.bz2 --- Subject: locking, mutex: Cancelable MCS lock for adaptive spinning From: Peter Zijlstra <peterz@infradead.org> Date: Wed, 29 Jan 2014 12:51:42 +0100 Since we want a task waiting for a mutex_lock() to go to sleep and reschedule on need_resched() we must be able to abort the mcs_spin_lock() around the adaptive spin. Therefore implement a cancelable mcs lock. XXX: anybody got a better name than m_spinlock? Cc: paulmck@linux.vnet.ibm.com Cc: Waiman.Long@hp.com Cc: torvalds@linux-foundation.org Cc: tglx@linutronix.de Cc: riel@redhat.com Cc: akpm@linux-foundation.org Cc: davidlohr@hp.com Cc: hpa@zytor.com Cc: andi@firstfloor.org Cc: aswin@hp.com Cc: scott.norton@hp.com Cc: chegu_vinod@hp.com Cc: mingo@redhat.com Cc: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-7jr68p4f447w2e0ck7y1yl06@git.kernel.org --- include/linux/mutex.h | 4 - kernel/locking/Makefile | 2 kernel/locking/mcs_spinlock.c | 156 ++++++++++++++++++++++++++++++++++++++++++ kernel/locking/mcs_spinlock.h | 18 ++++ kernel/locking/mutex.c | 10 +- 5 files changed, 183 insertions(+), 7 deletions(-) --- a/include/linux/mutex.h +++ b/include/linux/mutex.h @@ -46,7 +46,7 @@ * - detects multi-task circular deadlocks and prints out all affected * locks and tasks (and only those tasks) */ -struct mcs_spinlock; +struct m_spinlock; struct mutex { /* 1: unlocked, 0: locked, negative: locked, possible waiters */ atomic_t count; @@ -56,7 +56,7 @@ struct mutex { struct task_struct *owner; #endif #ifdef CONFIG_MUTEX_SPIN_ON_OWNER - struct mcs_spinlock *mcs_lock; /* Spinner MCS lock */ + struct m_spinlock *m_lock; /* Spinner MCS lock */ #endif #ifdef CONFIG_DEBUG_MUTEXES const char *name; --- a/kernel/locking/Makefile +++ b/kernel/locking/Makefile @@ -1,5 +1,5 @@ -obj-y += mutex.o semaphore.o rwsem.o lglock.o +obj-y += mutex.o semaphore.o rwsem.o lglock.o mcs_spinlock.o ifdef CONFIG_FUNCTION_TRACER CFLAGS_REMOVE_lockdep.o = -pg --- /dev/null +++ b/kernel/locking/mcs_spinlock.c @@ -0,0 +1,156 @@ + +#include <linux/percpu.h> +#include <linux/mutex.h> +#include <linux/sched.h> +#include "mcs_spinlock.h" + +#ifdef CONFIG_SMP + +/* + * Using a single mcs node per CPU is safe because mutex_lock() should not be + * called from interrupt context and we have preemption disabled over the mcs + * lock usage. 
+ */ +static DEFINE_PER_CPU_SHARED_ALIGNED(struct m_spinlock, m_node); + +/* + * Get a stable @node->next pointer, either for unlock() or unqueue() purposes. + * Can return NULL in case we were the last queued and we updated @lock instead. + */ +static inline struct m_spinlock * +m_spin_wait_next(struct m_spinlock **lock, struct m_spinlock *node, + struct m_spinlock *prev) +{ + struct m_spinlock *next = NULL; + + for (;;) { + if (*lock == node && cmpxchg(lock, node, prev) == node) { + /* + * We were the last queued, we moved @lock back. @prev + * will now observe @lock and will complete its + * unlock()/unqueue(). + */ + break; + } + + /* + * We must xchg() the @node->next value, because if we were to + * leave it in, a concurrent unlock()/unqueue() from + * @node->next might complete Step-A and think its @prev is + * still valid. + * + * If the concurrent unlock()/unqueue() wins the race, we'll + * wait for either @lock to point to us, through its Step-B, or + * wait for a new @node->next from its Step-C. + */ + if (node->next) { + next = xchg(&node->next, NULL); + if (next) + break; + } + + arch_mutex_cpu_relax(); + } + + return next; +} + +bool m_spin_lock(struct m_spinlock **lock) +{ + struct m_spinlock *node = this_cpu_ptr(&m_node); + struct m_spinlock *prev, *next; + + node->locked = 0; + node->next = NULL; + + node->prev = prev = xchg(lock, node); + if (likely(prev == NULL)) + return true; + + ACCESS_ONCE(prev->next) = node; + + /* + * Normally @prev is untouchable after the above store; because at that + * moment unlock can proceed and wipe the node element from stack. + * + * However, since our nodes are static per-cpu storage, we're + * guaranteed their existence -- this allows us to apply + * cmpxchg in an attempt to undo our queueing. + */ + + while (!smp_load_acquire(&node->locked)) { + if (need_resched()) + goto unqueue; + arch_mutex_cpu_relax(); + } + return true; + +unqueue: + /* + * Step - A -- stabilize @prev + * + * Undo our @prev->next assignment; this will make @prev's + * unlock()/unqueue() wait for a next pointer since @lock points to us + * (or later). + */ + + for (;;) { + if (prev->next == node && + cmpxchg(&prev->next, node, NULL) == node) + break; + + /* + * We can only fail the cmpxchg() racing against an unlock(), + * in which case we should observe @node->locked becomming + * true. + */ + if (smp_load_acquire(&node->locked)) + return true; + + /* + * Or we race against a concurrent unqueue()'s step-B, in which + * case its step-C will write us a new @node->prev pointer. + */ + prev = ACCESS_ONCE(node->prev); + } + + /* + * Step - B -- stabilize @next + * + * Similar to unlock(), wait for @node->next or move @lock from @node + * back to @prev. + */ + + next = m_spin_wait_next(lock, node, prev); + if (!next) + return false; + + /* + * Step - C -- unlink + * + * @prev is stable because its still waiting for a new @prev->next + * pointer, @next is stable because our @node->next pointer is NULL and + * it will wait in Step-A. 
+ */ + + ACCESS_ONCE(next->prev) = prev; + ACCESS_ONCE(prev->next) = next; + + return false; +} + +void m_spin_unlock(struct m_spinlock **lock) +{ + struct m_spinlock *node = this_cpu_ptr(&m_node); + struct m_spinlock *next; + + if (likely(cmpxchg(lock, node, NULL) == node)) + return; + + next = m_spin_wait_next(lock, node, NULL); + if (next) + ACCESS_ONCE(next->locked) = 1; +} + +#endif + --- a/kernel/locking/mcs_spinlock.h +++ b/kernel/locking/mcs_spinlock.h @@ -109,4 +109,22 @@ void mcs_spin_unlock(struct mcs_spinlock arch_mcs_spin_unlock_contended(&next->locked); } +/* + * Cancellable version of the MCS lock above. + * + * This version can fail acquisition and unqueue a spinner; it assumes no + * nesting. + * + * Intended for adaptive spinning of sleeping locks: + * mutex_lock()/rwsem_down_{read,write}() etc. + */ + +struct m_spinlock { + struct m_spinlock *next, *prev; + int locked; /* 1 if lock acquired */ +}; + +extern bool m_spin_lock(struct m_spinlock **lock); +extern void m_spin_unlock(struct m_spinlock **lock); + #endif /* __LINUX_MCS_SPINLOCK_H */ --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -53,7 +53,7 @@ __mutex_init(struct mutex *lock, const c INIT_LIST_HEAD(&lock->wait_list); mutex_clear_owner(lock); #ifdef CONFIG_MUTEX_SPIN_ON_OWNER - lock->mcs_lock = NULL; + lock->m_lock = NULL; #endif debug_mutex_init(lock, name, key); @@ -403,7 +403,9 @@ __mutex_lock_common(struct mutex *lock, if (!mutex_can_spin_on_owner(lock)) goto slowpath; - mcs_spin_lock(&lock->mcs_lock); + if (!m_spin_lock(&lock->m_lock)) + goto slowpath; + for (;;) { struct task_struct *owner; @@ -442,7 +444,7 @@ __mutex_lock_common(struct mutex *lock, } mutex_set_owner(lock); - mcs_spin_unlock(&lock->mcs_lock); + m_spin_unlock(&lock->m_lock); preempt_enable(); return 0; } @@ -464,7 +466,7 @@ __mutex_lock_common(struct mutex *lock, */ arch_mutex_cpu_relax(); } - mcs_spin_unlock(&lock->mcs_lock); + m_spin_unlock(&lock->m_lock); slowpath: #endif spin_lock_mutex(&lock->wait_lock, flags); ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-02-03 19:25 ` Peter Zijlstra
@ 2014-02-03 20:55   ` Jason Low
  2014-02-03 21:06     ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Jason Low @ 2014-02-03 20:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul McKenney, Waiman Long, Linus Torvalds,
      Thomas Gleixner, Linux Kernel Mailing List, Rik van Riel,
      Andrew Morton, Davidlohr Bueso, H. Peter Anvin, Andi Kleen,
      Chandramouleeswaran, Aswin, Norton, Scott J, chegu_vinod

On Mon, 2014-02-03 at 20:25 +0100, Peter Zijlstra wrote:
> On Mon, Feb 03, 2014 at 10:39:20AM -0800, Jason Low wrote:
> > > To avoid the xchg on every loop.
> >
> > Ah yes, we want to use xchg() on &node->next.
> >
> > Since the cmpxchg() is now in a loop in the unlock function, an
> > additional (*lock == node) check before the cmpxchg() would also be nice
> > to avoid spinning on cmpxchg() there too.
>
> Right, I have the below; you can find the patches this depends upon
> here:
>
> http://programming.kicks-ass.net/sekrit/patches.tar.bz2
>
> ---
> Subject: locking, mutex: Cancelable MCS lock for adaptive spinning
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Wed, 29 Jan 2014 12:51:42 +0100
>
> Since we want a task waiting for a mutex_lock() to go to sleep and
> reschedule on need_resched() we must be able to abort the
> mcs_spin_lock() around the adaptive spin.
>
> Therefore implement a cancelable mcs lock.
>
> XXX: anybody got a better name than m_spinlock?

So I was thinking something along the lines of
mcs_spin_lock_cancelable() as that's essentially what this function
does.

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-02-03 20:55 ` Jason Low
@ 2014-02-03 21:06   ` Peter Zijlstra
  2014-02-03 21:56     ` Jason Low
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2014-02-03 21:06 UTC (permalink / raw)
  To: Jason Low
  Cc: Ingo Molnar, Paul McKenney, Waiman Long, Linus Torvalds,
      Thomas Gleixner, Linux Kernel Mailing List, Rik van Riel,
      Andrew Morton, Davidlohr Bueso, H. Peter Anvin, Andi Kleen,
      Chandramouleeswaran, Aswin, Norton, Scott J, chegu_vinod

On Mon, Feb 03, 2014 at 12:55:34PM -0800, Jason Low wrote:
> On Mon, 2014-02-03 at 20:25 +0100, Peter Zijlstra wrote:
> > XXX: anybody got a better name than m_spinlock?
>
> So I was thinking something along the lines of
> mcs_spin_lock_cancelable() as that's essentially what this function
> does.

sure, but what do we call the data structure that goes with it? Can't
have two struct mcs_spinlock :/

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-02-03 21:06 ` Peter Zijlstra
@ 2014-02-03 21:56   ` Jason Low
  0 siblings, 0 replies; 26+ messages in thread
From: Jason Low @ 2014-02-03 21:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul McKenney, Waiman Long, Linus Torvalds,
      Thomas Gleixner, Linux Kernel Mailing List, Rik van Riel,
      Andrew Morton, Davidlohr Bueso, H. Peter Anvin, Andi Kleen,
      Chandramouleeswaran, Aswin, Norton, Scott J, chegu_vinod

On Mon, 2014-02-03 at 22:06 +0100, Peter Zijlstra wrote:
> On Mon, Feb 03, 2014 at 12:55:34PM -0800, Jason Low wrote:
> > On Mon, 2014-02-03 at 20:25 +0100, Peter Zijlstra wrote:
> > > XXX: anybody got a better name than m_spinlock?
> >
> > So I was thinking something along the lines of
> > mcs_spin_lock_cancelable() as that's essentially what this function
> > does.
>
> sure, but what do we call the data structure that goes with it? Can't
> have two struct mcs_spinlock :/

If this structure is only going to be used for cancelable mcs locking,
what do you think of "struct cancelable_mcs_spinlock"?

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-02-03 19:25 ` Peter Zijlstra
  2014-02-03 20:55 ` Jason Low
@ 2014-02-04  7:13   ` Jason Low
  1 sibling, 0 replies; 26+ messages in thread
From: Jason Low @ 2014-02-04 7:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul McKenney, Waiman Long, Linus Torvalds,
      Thomas Gleixner, Linux Kernel Mailing List, Rik van Riel,
      Andrew Morton, Davidlohr Bueso, H. Peter Anvin, Andi Kleen,
      Chandramouleeswaran, Aswin, Norton, Scott J, chegu_vinod

On Mon, 2014-02-03 at 20:25 +0100, Peter Zijlstra wrote:
> +void m_spin_unlock(struct m_spinlock **lock)
> +{
> +	struct m_spinlock *node = this_cpu_ptr(&m_node);
> +	struct m_spinlock *next;
> +
> +	if (likely(cmpxchg(lock, node, NULL) == node))
> +		return;

At this current point, (node->next != NULL) is a likely scenario. Perhaps
we can also add the following code here:

	next = xchg(&node->next, NULL);
	if (next) {
		ACCESS_ONCE(next->locked) = 1;
		return;
	}

> +	next = m_spin_wait_next(lock, node, NULL);
> +	if (next)
> +		ACCESS_ONCE(next->locked) = 1;
> +}

^ permalink raw reply [flat|nested] 26+ messages in thread
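Combining the m_spin_wait_next() helper from Peter's patch above with the fast
path Jason suggests here, the unlock side would look roughly like the sketch
below. This merely merges the two mails for illustration; it is not a tested
or final patch.

static void m_spin_unlock(struct m_spinlock **lock)
{
	struct m_spinlock *node = this_cpu_ptr(&m_node);
	struct m_spinlock *next;

	/* Fast path: we are still the queue tail, nobody is queued behind us. */
	if (likely(cmpxchg(lock, node, NULL) == node))
		return;

	/*
	 * Second most likely case: a successor has already linked itself in.
	 * The xchg() races with that successor's Step-A cmpxchg(), so either
	 * we hand it the lock here, or it unqueues and we fall through to
	 * wait for whatever replaces it.
	 */
	next = xchg(&node->next, NULL);
	if (next) {
		ACCESS_ONCE(next->locked) = 1;
		return;
	}

	/* Otherwise wait for a stable successor (or for @lock to revert). */
	next = m_spin_wait_next(lock, node, NULL);
	if (next)
		ACCESS_ONCE(next->locked) = 1;
}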
* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-01-31 20:08 ` Peter Zijlstra
  2014-02-02 21:01 ` Jason Low
@ 2014-02-02 22:02   ` Paul E. McKenney
  1 sibling, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2014-02-02 22:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jason Low, Ingo Molnar, Waiman Long, Linus Torvalds,
      Thomas Gleixner, Linux Kernel Mailing List, Rik van Riel,
      Andrew Morton, Davidlohr Bueso, H. Peter Anvin, Andi Kleen,
      Chandramouleeswaran, Aswin, Norton, Scott J, chegu_vinod

On Fri, Jan 31, 2014 at 09:08:25PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 31, 2014 at 12:01:37PM -0800, Jason Low wrote:
> > On Fri, Jan 31, 2014 at 6:09 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > I've downloaded AIM7 from sf.net and I hope I'm running it with 100+
> > > loads but I'm not entirely sure I got this thing right, its not really
> > > making progress with or without patch :/
> >
> > Ingo's program http://lkml.org/lkml/2006/1/8/50 using the V option may
> > be able to generate similar mutex contention.
> >
> > Currently still getting soft lockups with the updated version.
>
> Bugger.. ok clearly I need to think harder still. I'm fairly sure this
> cancelation can work though, just seems tricky to get right :-)

We used to do something similar to avoid passing locks off to tasks that
had been interrupted while spinning, and it was a bit tricky. But we had
it a bit easier, because we didn't actually have to remove the element
from the queue, just bypass it at lock-grant time.

							Thanx, Paul

> I'll give that proglet from Ingo a go, although that might be Monday ere
> I get to it.

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued 2014-01-31 14:09 ` Peter Zijlstra 2014-01-31 20:01 ` Jason Low @ 2014-02-02 20:02 ` Peter Zijlstra 1 sibling, 0 replies; 26+ messages in thread From: Peter Zijlstra @ 2014-02-02 20:02 UTC (permalink / raw) To: Jason Low Cc: mingo, paulmck, Waiman.Long, torvalds, tglx, linux-kernel, riel, akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod On Fri, Jan 31, 2014 at 03:09:41PM +0100, Peter Zijlstra wrote: > +struct m_spinlock { > + struct m_spinlock *next, *prev; > + int locked; /* 1 if lock acquired */ > +}; > + > +/* > + * Using a single mcs node per CPU is safe because mutex_lock() should not be > + * called from interrupt context and we have preemption disabled over the mcs > + * lock usage. > + */ > +static DEFINE_PER_CPU(struct m_spinlock, m_node); > + > +static bool m_spin_lock(struct m_spinlock **lock) > +{ > + struct m_spinlock *node = this_cpu_ptr(&m_node); > + struct m_spinlock *prev, *next; > + > + node->locked = 0; > + node->next = NULL; > + > + node->prev = prev = xchg(lock, node); > + if (likely(prev == NULL)) > + return true; > + > + ACCESS_ONCE(prev->next) = node; > + > + /* > + * Normally @prev is untouchable after the above store; because at that > + * moment unlock can proceed and wipe the node element from stack. > + * > + * However, since our nodes are static per-cpu storage, we're > + * guaranteed their existence -- this allows us to apply > + * cmpxchg in an attempt to undo our queueing. > + */ > + > + while (!smp_load_acquire(&node->locked)) { > + if (need_resched()) > + goto unqueue; > + arch_mutex_cpu_relax(); > + } > + return true; > + > +unqueue: > + /* > + * Step - A -- stabilize @prev > + * > + * Undo our @prev->next assignment; this will make @prev's > + * unlock()/cancel() wait for a next pointer since @lock points to us > + * (or later). > + */ > + > + for (;;) { > + next = cmpxchg(&prev->next, node, NULL); /* A -> B,C */ > + > + /* > + * Because the unlock() path retains @prev->next (for > + * performance) we must check @node->locked after clearing > + * @prev->next to see if we raced. > + * > + * Ordered by the cmpxchg() above and the conditional-store in > + * unlock(). > + */ > + if (smp_load_acquire(&node->locked)) { > + /* > + * OOPS, we were too late, we already got the lock. No > + * harm done though; @prev is now unused an nobody > + * cares we frobbed it. > + */ > + return true; > + } > + > + if (next == node) > + break; > + > + /* > + * @prev->next didn't point to us anymore, we didn't own the > + * lock, so reload and try again. > + * > + * Because we observed the new @prev->next, the smp_wmb() at > + * (C) ensures that we must now observe the new @node->prev. > + */ > + prev = ACCESS_ONCE(node->prev); > + } > + > + /* > + * Step - B -- stabilize @next > + * > + * Similar to unlock(), wait for @node->next or move @lock from @node > + * back to @prev. > + */ > + > + for (;;) { > + if (*lock == node && cmpxchg(lock, node, prev) == node) { > + /* > + * We were the last queued, we moved @lock back. @prev > + * will now observe @lock and will complete its > + * unlock()/cancel(). > + */ > + return false; > + } > + > + /* > + * We must xchg() the @node->next value, because if we were to > + * leave it in, a concurrent cancel() from @node->next might > + * complete Step-A and think its @prev is still valid. 
> + * > + * If the concurrent cancel() wins the race, we'll wait for > + * either @lock to point to us, through its Step-B, or wait for > + * a new @node->next from its Step-C. > + */ > + next = xchg(&node->next, NULL); /* B -> A */ > + if (next) > + break; > + > + arch_mutex_cpu_relax(); > + } > + > + /* > + * Step - C -- unlink > + * > + * @prev is stable because its still waiting for a new @prev->next > + * pointer, @next is stable because our @node->next pointer is NULL and > + * it will wait in Step-A. > + */ > + > + ACCESS_ONCE(next->prev) = prev; > + > + /* > + * Ensure that @next->prev is written before we write @prev->next, > + * this guarantees that when the cmpxchg at (A) fails we must > + * observe the new prev value. > + */ > + smp_wmb(); /* C -> A */ OK, I've definitely stared at this code for too long :/ I don't think the above barrier is right -- or even required for that matter. At this point I can't see anything wrong with the order of either of these stores. If the latter store hits first, an unlock can happen and release the @next entry, which is fine as the Step-A loop can deal with that; if the unlock doesn't happen, we'll simply wait for the first store to become visible before trying to undo later again. If the former store hits first, we'll simply wait for the later store to appear and we'll try to undo it. This obviously doesn't explain lockups, but it does reduce the code to stare at ever so slightly. > + /* > + * And point @prev to our next, and we're unlinked. > + */ > + ACCESS_ONCE(prev->next) = next; > + > + return false; > +} > + > +static void m_spin_unlock(struct m_spinlock **lock) > +{ > + struct m_spinlock *node = this_cpu_ptr(&m_node); > + struct m_spinlock *next; > + > + for (;;) { > + if (likely(cmpxchg(lock, node, NULL) == node)) > + return; > + > + next = ACCESS_ONCE(node->next); > + if (unlikely(next)) > + break; > + > + arch_mutex_cpu_relax(); > + } > + smp_store_release(&next->locked, 1); > +} ^ permalink raw reply [flat|nested] 26+ messages in thread
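For context, a caller-side sketch of how the boolean return from m_spin_lock() might be consumed in the mutex optimistic-spin path; the wrapper name and the lock->mcs_lock field are assumptions for illustration, not code from the patch:

static bool mutex_try_optimistic_spin(struct mutex *lock)
{
	struct task_struct *owner;

	/* false: we unqueued ourselves because need_resched() was set. */
	if (!m_spin_lock(&lock->mcs_lock))
		return false;

	owner = ACCESS_ONCE(lock->owner);
	if (owner && !mutex_spin_on_owner(lock, owner)) {
		m_spin_unlock(&lock->mcs_lock);
		return false;
	}

	/* ... try to take the mutex itself here ... */

	m_spin_unlock(&lock->mcs_lock);
	return true;
}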
* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued 2014-01-29 11:51 ` Peter Zijlstra 2014-01-31 3:29 ` Jason Low @ 2014-02-05 21:44 ` Waiman Long 2014-02-06 14:04 ` Peter Zijlstra 2014-02-06 17:44 ` Jason Low 1 sibling, 2 replies; 26+ messages in thread From: Waiman Long @ 2014-02-05 21:44 UTC (permalink / raw) To: Peter Zijlstra Cc: Jason Low, mingo, paulmck, torvalds, tglx, linux-kernel, riel, akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod On 01/29/2014 06:51 AM, Peter Zijlstra wrote: > On Tue, Jan 28, 2014 at 02:51:35PM -0800, Jason Low wrote: >>> But urgh, nasty problem. Lemme ponder this a bit. > OK, please have a very careful look at the below. It survived a boot > with udev -- which usually stresses mutex contention enough to explode > (in fact it did a few time when I got the contention/cancel path wrong), > however I have not ran anything else on it. > > The below is an MCS variant that allows relatively cheap unqueueing. But > its somewhat tricky and I might have gotten a case wrong, esp. the > double concurrent cancel case got my head hurting (I didn't attempt a > tripple unqueue). > > Applies to tip/master but does generate a few (harmless) compile > warnings because I didn't fully clean up the mcs_spinlock vs m_spinlock > thing. > > Also, there's a comment in the slowpath that bears consideration. > > I have an alternative way of breaking out of the MCS lock waiting queue when need_resched() is set. I overload the locked flag to indicate a skipped node if negative. I run the patch through the AIM7 high-systime workload on a 4-socket server and it seemed to run fine. Please check the following POC patch to see if you have any comment. diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h index e9a4d74..84a0b32 100644 --- a/include/linux/mcs_spinlock.h +++ b/include/linux/mcs_spinlock.h @@ -14,17 +14,45 @@ struct mcs_spinlock { struct mcs_spinlock *next; - int locked; /* 1 if lock acquired */ + int locked; /* 1 if lock acquired, -1 if skipping node */ }; +/* + * The node skipping feature requires the MCS node to be persistent, + * i.e. not allocated on stack. In addition, the MCS node cannot be + * reused until after the locked flag is cleared. The mcs_skip_node() + * macro must be defined before including this header file. + */ +#ifdef mcs_skip_node +#undef arch_mcs_spin_lock_contended +#undef arch_mcs_spin_unlock_contended + +#define arch_mcs_spin_lock_contended(n) \ +do { \ + while (!smp_load_acquire(&(n)->locked)) { \ + if (mcs_skip_node(n)) { \ + if (cmpxchg(&(n)->locked, 0, -1) == 0) \ + return; \ + } \ + arch_mutex_cpu_relax(); \ + }; \ +} while (0) + +#define arch_mcs_spin_unlock_contended(n) \ + if (cmpxchg(&(n)->locked, 0, 1) == 0) \ + return + +#define mcs_set_locked(n, v) (n)->locked = (v) +#else /* mcs_skip_node */ + #ifndef arch_mcs_spin_lock_contended /* * Using smp_load_acquire() provides a memory barrier that ensures * subsequent operations happen after the lock is acquired. */ -#define arch_mcs_spin_lock_contended(l) \ +#define arch_mcs_spin_lock_contended(n) \ do { \ - while (!(smp_load_acquire(l))) \ + while (!(smp_load_acquire(&(n)->locked))) \ arch_mutex_cpu_relax(); \ } while (0) #endif @@ -35,9 +63,12 @@ do { \ * operations in the critical section has been completed before * unlocking. 
*/ -#define arch_mcs_spin_unlock_contended(l) \ - smp_store_release((l), 1) +#define arch_mcs_spin_unlock_contended(n) \ + smp_store_release(&(n)->locked, 1) + +#define mcs_set_locked(n, v) #endif +#endif /* mcs_skip_node */ /* * Note: the smp_load_acquire/smp_store_release pair is not @@ -77,12 +108,13 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node) * value won't be used. If a debug mode is needed to * audit lock status, then set node->locked value here. */ + mcs_set_locked(node, 1); return; } ACCESS_ONCE(prev->next) = node; /* Wait until the lock holder passes the lock down. */ - arch_mcs_spin_lock_contended(&node->locked); + arch_mcs_spin_lock_contended(node); } /* @@ -94,19 +126,35 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node) { struct mcs_spinlock *next = ACCESS_ONCE(node->next); +#ifdef mcs_skip_node +check_next_node: +#endif if (likely(!next)) { /* * Release the lock by setting it to NULL */ - if (likely(cmpxchg(lock, node, NULL) == node)) + if (likely(cmpxchg(lock, node, NULL) == node)) { + mcs_set_locked(node, 0); return; + } /* Wait until the next pointer is set */ while (!(next = ACCESS_ONCE(node->next))) arch_mutex_cpu_relax(); } + /* Clear the lock flag to indicate the node can be used again. */ + mcs_set_locked(node, 0); /* Pass lock to next waiter. */ - arch_mcs_spin_unlock_contended(&next->locked); + arch_mcs_spin_unlock_contended(next); + +#ifdef mcs_skip_node + /* + * The next node should be skipped + */ + node = next; + next = ACCESS_ONCE(node->next); + goto check_next_node; +#endif } #endif /* __LINUX_MCS_SPINLOCK_H */ diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index 45fe1b5..6351eca 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -25,6 +25,10 @@ #include <linux/spinlock.h> #include <linux/interrupt.h> #include <linux/debug_locks.h> +/* + * Allowed an MCS node to be skipped if need_resched() is true. + */ +#define mcs_skip_node(n) need_resched() #include <linux/mcs_spinlock.h> /* @@ -45,6 +49,13 @@ */ #define MUTEX_SHOW_NO_WAITER(mutex) (atomic_read(&(mutex)->count) >= 0) +/* + * Using a single mcs node per CPU is safe because mutex_lock() should not be + * called from interrupt context and we have preemption disabled over the mcs + * lock usage. + */ +static DEFINE_PER_CPU(struct mcs_spinlock, m_node); + void __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key) { @@ -166,6 +177,9 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock) struct task_struct *owner; int retval = 1; + if (need_resched()) + return 0; + rcu_read_lock(); owner = ACCESS_ONCE(lock->owner); if (owner) @@ -370,6 +384,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, struct mutex_waiter waiter; unsigned long flags; int ret; + struct mcs_spinlock *node; preempt_disable(); mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip); @@ -400,9 +415,13 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, if (!mutex_can_spin_on_owner(lock)) goto slowpath; + node = this_cpu_ptr(&m_node); + if (node->locked < 0) + /* MCS node not available yet */ + goto slowpath; + for (;;) { struct task_struct *owner; - struct mcs_spinlock node; if (use_ww_ctx && ww_ctx->acquired > 0) { struct ww_mutex *ww; @@ -424,10 +443,15 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, * If there's an owner, wait for it to either * release the lock or go to sleep. 
*/ - mcs_spin_lock(&lock->mcs_lock, &node); + mcs_spin_lock(&lock->mcs_lock, node); + if (node->locked < 0) + /* + * need_resched() true, no unlock needed + */ + goto slowpath; owner = ACCESS_ONCE(lock->owner); if (owner && !mutex_spin_on_owner(lock, owner)) { - mcs_spin_unlock(&lock->mcs_lock, &node); + mcs_spin_unlock(&lock->mcs_lock, node); goto slowpath; } @@ -442,11 +466,11 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, } mutex_set_owner(lock); - mcs_spin_unlock(&lock->mcs_lock, &node); + mcs_spin_unlock(&lock->mcs_lock, node); preempt_enable(); return 0; } - mcs_spin_unlock(&lock->mcs_lock, &node); + mcs_spin_unlock(&lock->mcs_lock, node); /* * When there's no owner, we might have preempted between the ^ permalink raw reply related [flat|nested] 26+ messages in thread
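A reader's summary of the overloaded locked values in the POC above; the enum and its names are illustrative and do not appear in the patch:

enum m_node_state {
	MCS_WAITING =  0,	/* queued, spinning until the lock is passed */
	MCS_LOCKED  =  1,	/* lock handed to this node */
	MCS_SKIPPED = -1,	/* waiter bailed out on need_resched(); the node
				 * stays queued and cannot be reused until an
				 * unlocker walks past it and resets locked
				 * back to 0 */
};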
* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued 2014-02-05 21:44 ` Waiman Long @ 2014-02-06 14:04 ` Peter Zijlstra 2014-02-06 18:45 ` Waiman Long 2014-02-06 17:44 ` Jason Low 1 sibling, 1 reply; 26+ messages in thread From: Peter Zijlstra @ 2014-02-06 14:04 UTC (permalink / raw) To: Waiman Long Cc: Jason Low, mingo, paulmck, torvalds, tglx, linux-kernel, riel, akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod On Wed, Feb 05, 2014 at 04:44:34PM -0500, Waiman Long wrote: > I have an alternative way of breaking out of the MCS lock waiting queue when > need_resched() is set. I overload the locked flag to indicate a skipped node > if negative. I'm not quite seeing how it works (then again, I've not really read the patch carefully). Suppose you break out; at that point you get queued and go to sleep. Suppose you got woken up while your MCS entry is still 'pending' and magically win the race and acquire the lock. At that point your MCS entry can be re-used while it's still part of the list. It's a fantastically small race window, but I don't see anything that makes it impossible. > I run the patch through the AIM7 high-systime workload on a > 4-socket server and it seemed to run fine. How do people run this AIM7 piece of shit? I let it run for over an hour and it generated exactly 0 numbers, it just sits there eating cpu-time and creating a racket from my pantry. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued 2014-02-06 14:04 ` Peter Zijlstra @ 2014-02-06 18:45 ` Waiman Long 2014-02-06 20:10 ` Norton, Scott J 0 siblings, 1 reply; 26+ messages in thread From: Waiman Long @ 2014-02-06 18:45 UTC (permalink / raw) To: Peter Zijlstra Cc: Jason Low, mingo, paulmck, torvalds, tglx, linux-kernel, riel, akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod On 02/06/2014 09:04 AM, Peter Zijlstra wrote: > On Wed, Feb 05, 2014 at 04:44:34PM -0500, Waiman Long wrote: >> I have an alternative way of breaking out of the MCS lock waiting queue when >> need_resched() is set. I overload the locked flag to indicate a skipped node >> if negative. > I'm not quite seeing how it works (then again, I've not really read the > patch carefully). > > Suppose you break out; at that point you get queued and go to sleep. > Suppose you got woken up while your MCS entry is still 'pending' and > magically win the race and acquire the lock. > > At that point your MCS entry can be re-used while it's still part of the > list. Actually, the MCS node entry cannot be reused until an MCS queue head task traverses that entry and clears the locked flag. So it is possible that the affected CPU won't be able to participate in optimistic spinning for a while if the mutex is heavily contended. > It's a fantastically small race window, but I don't see anything that > makes it impossible. I will send out an official patch with some performance data to solicit further feedback. >> I run the patch through the AIM7 high-systime workload on a >> 4-socket server and it seemed to run fine. > How do people run this AIM7 piece of shit? I let it run for over an hour > and it generated exactly 0 numbers, it just sits there eating cpu-time > and creating a racket from my pantry. AIM7 can be tricky to set up. Fortunately, someone in our team had done the groundwork so I can just grab and run it. -Longman ^ permalink raw reply [flat|nested] 26+ messages in thread
* RE: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued 2014-02-06 18:45 ` Waiman Long @ 2014-02-06 20:10 ` Norton, Scott J 2014-02-10 17:01 ` Peter Zijlstra 0 siblings, 1 reply; 26+ messages in thread From: Norton, Scott J @ 2014-02-06 20:10 UTC (permalink / raw) To: Long, Wai Man, Peter Zijlstra Cc: Low, Jason, mingo@redhat.com, paulmck@linux.vnet.ibm.com, torvalds@linux-foundation.org, tglx@linutronix.de, linux-kernel@vger.kernel.org, riel@redhat.com, akpm@linux-foundation.org, Bueso, Davidlohr, hpa@zytor.com, andi@firstfloor.org, Chandramouleeswaran, Aswin, Vinod, Chegu >> I run the patch through the AIM7 high-systime workload on a >> 4-socket server and it seemed to run fine. > How do people run this AIM7 piece of shit? I let it run for over an hour > and it generated exactly 0 numbers, it just sits there eating cpu-time > and creating a racket from my pantry. ./reaim -s100 -e2000 -t -j100 -i100 -f workfile.high_systime The reaim.config file contains: FILESIZE 10k POOLSIZE 1m DISKDIR /t0 DISKDIR /t1 DISKDIR /t2 DISKDIR /t3 DISKDIR /t4 DISKDIR /t5 DISKDIR /t6 DISKDIR /t7 DISKDIR /t8 DISKDIR /t9 DISKDIR /t10 DISKDIR /t11 DISKDIR /t12 DISKDIR /t13 DISKDIR /t14 DISKDIR /t15 The way Longman uses this is to create 16 ramdisk filesystems through /dev/ram* and then mount those filesystems to the /t* directories. Although you could run it through a regular filesystem also. It will use whatever you place in the reaim.config file as DISKDIR. You can specify one or more DISKDIR directories. Scott ^ permalink raw reply [flat|nested] 26+ messages in thread
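For completeness, one way to create and mount the 16 ramdisk filesystems described above; the filesystem type and ramdisk sizing are assumptions, adjust as needed:

# Assumes the brd ramdisks are large enough, e.g. boot with ramdisk_size=1048576 (1 GB each).
for i in $(seq 0 15); do
	mkdir -p /t$i
	mkfs.ext4 -q /dev/ram$i
	mount /dev/ram$i /t$i
done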
* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued 2014-02-06 20:10 ` Norton, Scott J @ 2014-02-10 17:01 ` Peter Zijlstra 0 siblings, 0 replies; 26+ messages in thread From: Peter Zijlstra @ 2014-02-10 17:01 UTC (permalink / raw) To: Norton, Scott J Cc: Long, Wai Man, Low, Jason, mingo@redhat.com, paulmck@linux.vnet.ibm.com, torvalds@linux-foundation.org, tglx@linutronix.de, linux-kernel@vger.kernel.org, riel@redhat.com, akpm@linux-foundation.org, Bueso, Davidlohr, hpa@zytor.com, andi@firstfloor.org, Chandramouleeswaran, Aswin, Vinod, Chegu On Thu, Feb 06, 2014 at 08:10:02PM +0000, Norton, Scott J wrote: > > How do people run this AIM7 piece of shit? I let it run for over an hour > > and it generated exactly 0 numbers, it just sits there eating cpu-time > > and creating a racket from my pantry. > > ./reaim -s100 -e2000 -t -j100 -i100 -f workfile.high_systime > > The reaim.config file contains: > > FILESIZE 10k > POOLSIZE 1m > DISKDIR /t0 > DISKDIR /t1 > DISKDIR /t2 > DISKDIR /t3 > DISKDIR /t4 > DISKDIR /t5 > DISKDIR /t6 > DISKDIR /t7 > DISKDIR /t8 > DISKDIR /t9 > DISKDIR /t10 > DISKDIR /t11 > DISKDIR /t12 > DISKDIR /t13 > DISKDIR /t14 > DISKDIR /t15 > > The way Longman uses this is to create 16 ramdisk filesystems through > /dev/ram* and then mount those filesystems to the /t* directories. > Although you could run it through a regular filesystem also. It will use > whatever you place in the reaim.config file as DISKDIR. You can specify > one or more DISKDIR directories. OK, and we're back to creating a racket but not producing useful numbers; how long is that crap supposed to run before it gives a number? Surely it can produce a useful number after a few minutes of runtime.. Letting it run for hours is just a waste of time and money. Note that I'm running this on a WSM-EP with (2*6*2) 24 CPUs and 24G of ram. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued 2014-02-05 21:44 ` Waiman Long 2014-02-06 14:04 ` Peter Zijlstra @ 2014-02-06 17:44 ` Jason Low 2014-02-06 18:37 ` Waiman Long 1 sibling, 1 reply; 26+ messages in thread From: Jason Low @ 2014-02-06 17:44 UTC (permalink / raw) To: Waiman Long Cc: Peter Zijlstra, mingo, paulmck, torvalds, tglx, linux-kernel, riel, akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod On Wed, 2014-02-05 at 16:44 -0500, Waiman Long wrote: > On 01/29/2014 06:51 AM, Peter Zijlstra wrote: > > On Tue, Jan 28, 2014 at 02:51:35PM -0800, Jason Low wrote: > >>> But urgh, nasty problem. Lemme ponder this a bit. > > OK, please have a very careful look at the below. It survived a boot > > with udev -- which usually stresses mutex contention enough to explode > > (in fact it did a few time when I got the contention/cancel path wrong), > > however I have not ran anything else on it. > > > > The below is an MCS variant that allows relatively cheap unqueueing. But > > its somewhat tricky and I might have gotten a case wrong, esp. the > > double concurrent cancel case got my head hurting (I didn't attempt a > > tripple unqueue). > > > > Applies to tip/master but does generate a few (harmless) compile > > warnings because I didn't fully clean up the mcs_spinlock vs m_spinlock > > thing. > > > > Also, there's a comment in the slowpath that bears consideration. > > > > > > I have an alternative way of breaking out of the MCS lock waiting queue > when need_resched() is set. I overload the locked flag to indicate a > skipped node if negative. I run the patch through the AIM7 high-systime > workload on a 4-socket server and it seemed to run fine. > > Please check the following POC patch to see if you have any comment. So one of the concerns I had with the approach of skipping nodes was that, under heavy contention, we potentially could cause optimistic spinning to be disabled on CPUs for a while since the nodes can't be used until they have been released. One advantage of the unqueuing method would be that nodes are usable after the spinners exit the MCS queue and go to sleep. Jason ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued 2014-02-06 17:44 ` Jason Low @ 2014-02-06 18:37 ` Waiman Long 0 siblings, 0 replies; 26+ messages in thread From: Waiman Long @ 2014-02-06 18:37 UTC (permalink / raw) To: Jason Low Cc: Peter Zijlstra, mingo, paulmck, torvalds, tglx, linux-kernel, riel, akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod On 02/06/2014 12:44 PM, Jason Low wrote: > On Wed, 2014-02-05 at 16:44 -0500, Waiman Long wrote: >> On 01/29/2014 06:51 AM, Peter Zijlstra wrote: >>> On Tue, Jan 28, 2014 at 02:51:35PM -0800, Jason Low wrote: >>>>> But urgh, nasty problem. Lemme ponder this a bit. >>> OK, please have a very careful look at the below. It survived a boot >>> with udev -- which usually stresses mutex contention enough to explode >>> (in fact it did a few time when I got the contention/cancel path wrong), >>> however I have not ran anything else on it. >>> >>> The below is an MCS variant that allows relatively cheap unqueueing. But >>> its somewhat tricky and I might have gotten a case wrong, esp. the >>> double concurrent cancel case got my head hurting (I didn't attempt a >>> tripple unqueue). >>> >>> Applies to tip/master but does generate a few (harmless) compile >>> warnings because I didn't fully clean up the mcs_spinlock vs m_spinlock >>> thing. >>> >>> Also, there's a comment in the slowpath that bears consideration. >>> >>> >> I have an alternative way of breaking out of the MCS lock waiting queue >> when need_resched() is set. I overload the locked flag to indicate a >> skipped node if negative. I run the patch through the AIM7 high-systime >> workload on a 4-socket server and it seemed to run fine. >> >> Please check the following POC patch to see if you have any comment. > So one of the concerns I had with the approach of skipping nodes was > that, under heavy contention, we potentially could cause optimistic > spinning to be disabled on CPUs for a while since the nodes can't be > used until they have been released. One advantage of the unqueuing > method would be that nodes are usable after the spinners exit the MCS > queue and go to sleep. > > Jason > Under heavy contention, when many threads are trying to access the mutexes using optimistic spinning, this patch can actually reduce the number of wasted CPU cycles spent waiting in the MCS spin loop and let the CPUs do other useful work. So I don't see that as a negative. I think this kind of self-tuning is actually good for the overall throughput of the system. -Longman ^ permalink raw reply [flat|nested] 26+ messages in thread