* [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-01-28 19:13 [PATCH v2 0/5] mutex: Mutex scalability patches Jason Low
@ 2014-01-28 19:13 ` Jason Low
  2014-01-28 21:07   ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Jason Low @ 2014-01-28 19:13 UTC (permalink / raw)
  To: mingo, peterz, paulmck, Waiman.Long, torvalds, tglx, jason.low2
  Cc: linux-kernel, riel, akpm, davidlohr, hpa, andi, aswin,
	scott.norton, chegu_vinod

Before a thread attempts mutex spin on owner, it is first added to a queue
using an MCS lock so that only 1 thread spins on owner at a time. However, when
the spinner is queued, it is unable to check if it needs to reschedule and
will remain on the queue. This could cause the spinner to spin longer
than its allocated time. However, once it is the spinner's turn to spin on
owner, it would immediately go to slowpath if it need_resched() and gets no spin
time. In these situations, not only does the spinner take up more time for a
chance to spin for the mutex, it also ends up not getting to spin once it
gets its turn.

One solution would be to exit the MCS queue and go to mutex slowpath if
need_resched(). However, that may require a lot of overhead. For instance, if a
thread at the end of the queue need_resched(), in order to remove it from the
queue, we would have to unqueue and requeue all other spinners in front of it.

This RFC patch tries to address the issue in another context by avoiding
situations where spinners immediately get sent to the slowpath on
need_resched() upon getting to spin. We will first check for need_resched()
right after acquiring the MCS lock. If need_resched() is true, then
need_resched() triggered while the thread is waiting in the MCS queue (since
patch 1 makes the spinner check for need_resched() before entering the queue).
In this case, we will allow the thread to have at least 1 try to do
mutex_spin_on_owner() regardless of need_resched(). This patch also removes
the need_resched() in the outer loop in case we require a few extra spins to
observe lock->count == 1, and patch 4 ensures we won't be spinning with
lock owner preempted.

And if the need_resched() check after acquiring the MCS lock is false, then
we won't give the spinner any extra spinning.
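
For illustration only (not part of the patch), the intent can be modeled in
plain C11, with an atomic "stop" flag standing in for need_resched() and the
grace flag sampled once, right after the MCS queue is acquired:

#include <stdatomic.h>

/*
 * Model of mutex_spin_on_owner(): spin while the owner runs; normally bail
 * as soon as "stop" is set, unless we were granted one grace pass because
 * "stop" was already set by the time we got our turn to spin.
 */
static int spin_on_owner_model(atomic_int *owner_running, atomic_int *stop,
			       int spin_grace_period)
{
	while (atomic_load(owner_running)) {
		if (atomic_load(stop) && !spin_grace_period)
			return 0;	/* give up, take the slowpath */
		/* cpu_relax() equivalent goes here */
	}
	return 1;	/* owner released the lock; worth trying to acquire */
}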

Signed-off-by: Jason Low <jason.low2@hp.com>
---
 kernel/locking/mutex.c |   22 ++++++++++++++++------
 1 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index cfaaf53..2281a48 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -187,11 +187,11 @@ static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
  * access and not reliable.
  */
 static noinline
-int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
+int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner, int spin_grace_period)
 {
 	rcu_read_lock();
 	while (owner_running(lock, owner)) {
-		if (need_resched())
+		if (need_resched() && !spin_grace_period)
 			break;
 
 		arch_mutex_cpu_relax();
@@ -428,7 +428,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	struct task_struct *task = current;
 	struct mutex_waiter waiter;
 	unsigned long flags;
-	int ret;
+	int ret, spin_grace_period;
 	struct mspin_node node;
 
 	preempt_disable();
@@ -461,6 +461,12 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		goto slowpath;
 
 	mspin_lock(MLOCK(lock), &node);
+	/*
+	 * If need_resched() triggered while queued, then we will still give
+	 * this spinner a chance to spin for the mutex, rather than send this
+	 * immediately to slowpath after waiting for its turn.
+	 */
+	spin_grace_period = (need_resched()) ? 1 : 0;
 	for (;;) {
 		struct task_struct *owner;
 
@@ -485,8 +491,12 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		 * release the lock or go to sleep.
 		 */
 		owner = ACCESS_ONCE(lock->owner);
-		if (owner && !mutex_spin_on_owner(lock, owner))
-			break;
+		if (owner) {
+			if (!mutex_spin_on_owner(lock, owner, spin_grace_period))
+				break;
+
+			spin_grace_period = 0;
+		}
 
 		if ((atomic_read(&lock->count) == 1) &&
 		    (atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
@@ -510,7 +520,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		 * we're an RT task that will live-lock because we won't let
 		 * the owner complete.
 		 */
-		if (!owner && (need_resched() || rt_task(task)))
+		if (!owner && rt_task(task))
 			break;
 
 		/*
-- 
1.7.1



* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-01-28 19:13 ` [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued Jason Low
@ 2014-01-28 21:07   ` Peter Zijlstra
  2014-01-28 22:51     ` Jason Low
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2014-01-28 21:07 UTC (permalink / raw)
  To: Jason Low
  Cc: mingo, paulmck, Waiman.Long, torvalds, tglx, linux-kernel, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On Tue, Jan 28, 2014 at 11:13:16AM -0800, Jason Low wrote:
> Before a thread attempts mutex spin on owner, it is first added to a queue
> using an MCS lock so that only 1 thread spins on owner at a time. However, when
> the spinner is queued, it is unable to check if it needs to reschedule and
> will remain on the queue. This could cause the spinner to spin longer
> than its allocated time. 

what allocated time?

> However, once it is the spinner's turn to spin on
> owner, it would immediately go to slowpath if it need_resched() and gets no spin
> time. In these situations, not only does the spinner take up more time for a
> chance to spin for the mutex, it also ends up not getting to spin once it
> gets its turn.
> 
> One solution would be to exit the MCS queue and go to mutex slowpath if
> need_resched(). However, that may require a lot of overhead. For instance, if a
> thread at the end of the queue need_resched(), in order to remove it from the
> queue, we would have to unqueue and requeue all other spinners in front of it.

If you can do that, you can also walk the list and find prev and cmpxchg
the entry out. But I don't think you can do either, as we simply don't
have a head pointer.
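
For context, a bare-bones MCS queue lock looks roughly like the sketch below
(a userspace C11 model, not the kernel's mcs_spinlock): the lock word is only
a tail pointer, so the queue cannot be walked from the front.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
	_Atomic(struct mcs_node *) next;
	atomic_bool locked;
};

/* The lock itself is just the tail of the queue; there is no head pointer. */
typedef _Atomic(struct mcs_node *) mcs_lock_t;

static void mcs_lock(mcs_lock_t *lock, struct mcs_node *node)
{
	struct mcs_node *prev;

	atomic_store(&node->next, NULL);
	atomic_store(&node->locked, false);

	/* Swing the tail to ourselves; whoever was there is our predecessor. */
	prev = atomic_exchange(lock, node);
	if (!prev)
		return;			/* queue was empty: lock acquired */

	atomic_store(&prev->next, node);
	while (!atomic_load(&node->locked))
		;			/* spin on our own node only */
}

static void mcs_unlock(mcs_lock_t *lock, struct mcs_node *node)
{
	struct mcs_node *next = atomic_load(&node->next);

	if (!next) {
		struct mcs_node *expected = node;

		/* No known successor: try to swing the tail back to empty. */
		if (atomic_compare_exchange_strong(lock, &expected, NULL))
			return;

		/* Someone queued concurrently; wait for them to link in. */
		while (!(next = atomic_load(&node->next)))
			;
	}
	atomic_store(&next->locked, true);
}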

> This RFC patch tries to address the issue in another context by avoiding
> situations where spinners immediately get sent to the slowpath on
> need_resched() upon getting to spin. 

> We will first check for need_resched()
> right after acquiring the MCS lock. If need_resched() is true, then
> need_resched() triggered while the thread is waiting in the MCS queue (since
> patch 1 makes the spinner check for need_resched() before entering the queue).

> In this case, we will allow the thread to have at least 1 try to do
> mutex_spin_on_owner() regardless of need_resched(). 

No! We should absolutely not ever willfully ignore need_resched(). Esp.
not for unbounded spins.

> This patch also removes
> the need_resched() in the outer loop in case we require a few extra spins to
> observe lock->count == 1, and patch 4 ensures we won't be spinning with
> lock owner preempted.
> 
> And if the need_resched() check after acquiring the MCS lock is false, then
> we won't give the spinner any extra spinning.

But urgh, nasty problem. Lemme ponder this a bit.


* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-01-28 21:07   ` Peter Zijlstra
@ 2014-01-28 22:51     ` Jason Low
  2014-01-29 11:51       ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Jason Low @ 2014-01-28 22:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, paulmck, Waiman.Long, torvalds, tglx, linux-kernel, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On Tue, 2014-01-28 at 22:07 +0100, Peter Zijlstra wrote:
> On Tue, Jan 28, 2014 at 11:13:16AM -0800, Jason Low wrote:
> > Before a thread attempts mutex spin on owner, it is first added to a queue
> > using an MCS lock so that only 1 thread spins on owner at a time. However, when
> > the spinner is queued, it is unable to check if it needs to reschedule and
> > will remain on the queue. This could cause the spinner to spin longer
> > than its allocated time. 
> 
> what allocated time?

Hi Peter,

By "spin longer than its allocated time", I meant to say that the thread
can continue spinning/waiting in the MCS queue after need_resched() has
been set.

> > However, once it is the spinner's turn to spin on
> > owner, it would immediately go to slowpath if it need_resched() and gets no spin
> > time. In these situations, not only does the spinner take up more time for a
> > chance to spin for the mutex, it also ends up not getting to spin once it
> > gets its turn.
> > 
> > One solution would be to exit the MCS queue and go to mutex slowpath if
> > need_resched(). However, that may require a lot of overhead. For instance, if a
> > thread at the end of the queue need_resched(), in order to remove it from the
> > queue, we would have to unqueue and requeue all other spinners in front of it.
> 
> If you can do that, you can also walk the list and find prev and cmpxchg
> the entry out. But I don't think you can do either, as we simply don't
> have a head pointer.
> 
> > This RFC patch tries to address the issue in another context by avoiding
> > situations where spinners immediately get sent to the slowpath on
> > need_resched() upon getting to spin. 
> 
> > We will first check for need_resched()
> > right after acquiring the MCS lock. If need_resched() is true, then
> > need_resched() triggered while the thread is waiting in the MCS queue (since
> > patch 1 makes the spinner check for need_resched() before entering the queue).
> 
> > In this case, we will allow the thread to have at least 1 try to do
> > mutex_spin_on_owner() regardless of need_resched(). 
> 
> No! We should absolutely not ever willfully ignore need_resched(). Esp.
> not for unbounded spins.

Ok. This was sort of a proof of concept patch to illustrate the type of
performance gains we can get by addressing this issue.

> > This patch also removes
> > the need_resched() in the outer loop in case we require a few extra spins to
> > observe lock->count == 1, and patch 4 ensures we won't be spinning with
> > lock owner preempted.
> > 
> > And if the need_resched() check after acquiring the MCS lock is false, then
> > we won't give the spinner any extra spinning.
> 
> But urgh, nasty problem. Lemme ponder this a bit.

Thanks,
Jason




* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-01-28 22:51     ` Jason Low
@ 2014-01-29 11:51       ` Peter Zijlstra
  2014-01-31  3:29         ` Jason Low
  2014-02-05 21:44         ` Waiman Long
  0 siblings, 2 replies; 26+ messages in thread
From: Peter Zijlstra @ 2014-01-29 11:51 UTC (permalink / raw)
  To: Jason Low
  Cc: mingo, paulmck, Waiman.Long, torvalds, tglx, linux-kernel, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On Tue, Jan 28, 2014 at 02:51:35PM -0800, Jason Low wrote:
> > But urgh, nasty problem. Lemme ponder this a bit.

OK, please have a very careful look at the below. It survived a boot
with udev -- which usually stresses mutex contention enough to explode
(in fact it did a few times when I got the contention/cancel path wrong),
however I have not run anything else on it.

The below is an MCS variant that allows relatively cheap unqueueing. But
its somewhat tricky and I might have gotten a case wrong, esp. the
double concurrent cancel case got my head hurting (I didn't attempt a
tripple unqueue).

Applies to tip/master but does generate a few (harmless) compile
warnings because I didn't fully clean up the mcs_spinlock vs m_spinlock
thing.

Also, there's a comment in the slowpath that bears consideration.

---
 kernel/locking/mutex.c | 158 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 148 insertions(+), 10 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 45fe1b5293d6..4a69da73903c 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -166,6 +166,9 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
 	struct task_struct *owner;
 	int retval = 1;
 
+	if (need_resched())
+		return 0;
+
 	rcu_read_lock();
 	owner = ACCESS_ONCE(lock->owner);
 	if (owner)
@@ -358,6 +361,134 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock,
 	spin_unlock_mutex(&lock->base.wait_lock, flags);
 }
 
+struct m_spinlock {
+	struct m_spinlock *next, *prev;
+	int locked; /* 1 if lock acquired */
+};
+
+/*
+ * Using a single mcs node per CPU is safe because mutex_lock() should not be
+ * called from interrupt context and we have preemption disabled over the mcs
+ * lock usage.
+ */
+static DEFINE_PER_CPU(struct m_spinlock, m_node);
+
+static bool m_spin_lock(struct m_spinlock **lock)
+{
+	struct m_spinlock *node = this_cpu_ptr(&m_node);
+	struct m_spinlock *prev, *next;
+
+	node->locked = 0;
+	node->next = NULL;
+
+	node->prev = prev = xchg(lock, node);
+	if (likely(prev == NULL))
+		return true;
+
+	ACCESS_ONCE(prev->next) = node;
+
+	/*
+	 * Normally @prev is untouchable after the above store; because at that
+	 * moment unlock can proceed and wipe the node element from stack.
+	 *
+	 * However, since our nodes are static per-cpu storage, we're
+	 * guaranteed their existence -- this allows us to apply
+	 * cmpxchg in an attempt to undo our queueing.
+	 */
+
+	while (!smp_load_acquire(&node->locked)) {
+		if (need_resched())
+			goto unqueue;
+		arch_mutex_cpu_relax();
+	}
+	return true;
+
+unqueue:
+
+	/*
+	 * Undo our @prev->next assignment; this will make @prev's unlock()
+	 * wait for a next pointer since @lock points to us.
+	 */
+	if (cmpxchg(&prev->next, node, NULL) != node) { /* A -> B */
+		/*
+		 * @prev->next no longer pointed to us; either we hold the lock
+		 * or @prev cancelled the wait too and we need to reload and
+		 * retry.
+		 */
+		if (smp_load_acquire(&node->locked))
+			return true;
+
+		/*
+		 * Because we observed the new @prev->next, the smp_wmb() at (B)
+		 * ensures that we must now observe the new @node->prev.
+		 */
+		prev = ACCESS_ONCE(node->prev);
+		goto unqueue;
+	}
+
+	if (smp_load_acquire(&node->locked)) {
+		/*
+		 * OOPS, we were too late, we already got the lock.  No harm
+		 * done though; @prev is now unused and nobody cares we frobbed
+		 * it.
+		 */
+		return true;
+	}
+
+	/*
+	 * Per the above logic @prev's unlock() will now wait,
+	 * therefore @prev is now stable.
+	 */
+
+	if (cmpxchg(lock, node, prev) == node) {
+		/*
+		 * We were still the last queued, we moved @lock back.  @prev
+		 * will now observe @lock and will complete its unlock().
+		 */
+		return false;
+	}
+
+	/*
+	 * We're not the last to be queued, obtain ->next.
+	 */
+
+	while (!(next = ACCESS_ONCE(node->next)))
+		arch_mutex_cpu_relax();
+
+	ACCESS_ONCE(next->prev) = prev;
+
+	/*
+	 * Ensure that @next->prev is written before we write @prev->next,
+	 * this guarantees that when the cmpxchg at (A) fails we must
+	 * observe the new prev value.
+	 */
+	smp_wmb(); /* B -> A */
+
+	/*
+	 * And point @prev to our next, and we're unlinked. We can use an
+	 * non-atomic op because only we modify @prev->next.
+	 */
+	ACCESS_ONCE(prev->next) = next;
+
+	return false;
+}
+
+static void m_spin_unlock(struct m_spinlock **lock)
+{
+	struct m_spinlock *node = this_cpu_ptr(&m_node);
+	struct m_spinlock *next = ACCESS_ONCE(node->next);
+
+	while (likely(!next)) {
+		if (likely(cmpxchg(lock, node, NULL) == node))
+			return;
+
+		arch_mutex_cpu_relax();
+		next = ACCESS_ONCE(node->next);
+	}
+
+	smp_store_release(&next->locked, 1);
+}
+
 /*
  * Lock a mutex (possibly interruptible), slowpath:
  */
@@ -400,9 +531,11 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	if (!mutex_can_spin_on_owner(lock))
 		goto slowpath;
 
+	if (!m_spin_lock(&lock->mcs_lock))
+		goto slowpath;
+
 	for (;;) {
 		struct task_struct *owner;
-		struct mcs_spinlock  node;
 
 		if (use_ww_ctx && ww_ctx->acquired > 0) {
 			struct ww_mutex *ww;
@@ -417,19 +550,16 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 			 * performed the optimistic spinning cannot be done.
 			 */
 			if (ACCESS_ONCE(ww->ctx))
-				goto slowpath;
+				break;
 		}
 
 		/*
 		 * If there's an owner, wait for it to either
 		 * release the lock or go to sleep.
 		 */
-		mcs_spin_lock(&lock->mcs_lock, &node);
 		owner = ACCESS_ONCE(lock->owner);
-		if (owner && !mutex_spin_on_owner(lock, owner)) {
-			mcs_spin_unlock(&lock->mcs_lock, &node);
-			goto slowpath;
-		}
+		if (owner && !mutex_spin_on_owner(lock, owner))
+			break;
 
 		if ((atomic_read(&lock->count) == 1) &&
 		    (atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
@@ -442,11 +572,10 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 			}
 
 			mutex_set_owner(lock);
-			mcs_spin_unlock(&lock->mcs_lock, &node);
+			m_spin_unlock(&lock->mcs_lock);
 			preempt_enable();
 			return 0;
 		}
-		mcs_spin_unlock(&lock->mcs_lock, &node);
 
 		/*
 		 * When there's no owner, we might have preempted between the
@@ -455,7 +584,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		 * the owner complete.
 		 */
 		if (!owner && (need_resched() || rt_task(task)))
-			goto slowpath;
+			break;
 
 		/*
 		 * The cpu_relax() call is a compiler barrier which forces
@@ -465,10 +594,19 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		 */
 		arch_mutex_cpu_relax();
 	}
+	m_spin_unlock(&lock->mcs_lock);
 slowpath:
 #endif
 	spin_lock_mutex(&lock->wait_lock, flags);
 
+	/*
+	 * XXX arguably, when need_resched() is set and there's other pending
+	 * owners we should not try-acquire and simply queue and schedule().
+	 *
+	 * There's nothing worse than obtaining a lock only to get immediately
+	 * scheduled out.
+	 */
+
 	/* once more, can we acquire the lock? */
 	if (MUTEX_SHOW_NO_WAITER(lock) && (atomic_xchg(&lock->count, 0) == 1))
 		goto skip_wait;


* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-01-29 11:51       ` Peter Zijlstra
@ 2014-01-31  3:29         ` Jason Low
  2014-01-31 14:09           ` Peter Zijlstra
  2014-02-05 21:44         ` Waiman Long
  1 sibling, 1 reply; 26+ messages in thread
From: Jason Low @ 2014-01-31  3:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, paulmck, Waiman.Long, torvalds, tglx, linux-kernel, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On Wed, 2014-01-29 at 12:51 +0100, Peter Zijlstra wrote:
> On Tue, Jan 28, 2014 at 02:51:35PM -0800, Jason Low wrote:
> > > But urgh, nasty problem. Lemme ponder this a bit.
> 
> OK, please have a very careful look at the below. It survived a boot
> with udev -- which usually stresses mutex contention enough to explode
> (in fact it did a few times when I got the contention/cancel path wrong),
> however I have not run anything else on it.

I tested this patch on a 2 socket, 8 core machine with the AIM7 fserver
workload. After 100 users, the system gets soft lockups.

Some condition may be causing threads to not leave the "goto unqueue"
loop. I added a debug counter, and threads were able to reach more than
1,000,000,000 "goto unqueue".

I was also initially wondering whether there can be problems when multiple
threads need_resched() and unqueue at the same time. As an example, 2
nodes that need to reschedule are next to each other in the middle of
the MCS queue. The 1st node executes "while (!(next =
ACCESS_ONCE(node->next)))" and exits the while loop because next is not
NULL. Then, the 2nd node executes its "if (cmpxchg(&prev->next, node,
NULL) != node)". We may then end up in a situation where the node before
the 1st node gets linked with the outdated 2nd node.



* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-01-31  3:29         ` Jason Low
@ 2014-01-31 14:09           ` Peter Zijlstra
  2014-01-31 20:01             ` Jason Low
  2014-02-02 20:02             ` Peter Zijlstra
  0 siblings, 2 replies; 26+ messages in thread
From: Peter Zijlstra @ 2014-01-31 14:09 UTC (permalink / raw)
  To: Jason Low
  Cc: mingo, paulmck, Waiman.Long, torvalds, tglx, linux-kernel, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On Thu, Jan 30, 2014 at 07:29:37PM -0800, Jason Low wrote:
> On Wed, 2014-01-29 at 12:51 +0100, Peter Zijlstra wrote:
> > On Tue, Jan 28, 2014 at 02:51:35PM -0800, Jason Low wrote:
> > > > But urgh, nasty problem. Lemme ponder this a bit.
> > 
> > OK, please have a very careful look at the below. It survived a boot
> > with udev -- which usually stresses mutex contention enough to explode
> > (in fact it did a few times when I got the contention/cancel path wrong),
> > however I have not run anything else on it.
> 
> I tested this patch on a 2 socket, 8 core machine with the AIM7 fserver
> workload. After 100 users, the system gets soft lockups.
> 
> Some condition may be causing threads to not leave the "goto unqueue"
> loop. I added a debug counter, and threads were able to reach more than
> 1,000,000,000 "goto unqueue".
> 
> I was also initially wondering whether there can be problems when multiple
> threads need_resched() and unqueue at the same time. As an example, 2
> nodes that need to reschedule are next to each other in the middle of
> the MCS queue. The 1st node executes "while (!(next =
> ACCESS_ONCE(node->next)))" and exits the while loop because next is not
> NULL. Then, the 2nd node executes its "if (cmpxchg(&prev->next, node,
> NULL) != node)". We may then end up in a situation where the node before
> the 1st node gets linked with the outdated 2nd node.

Yes indeed, I found two bugs. If you consider the MCS lock with 4 queued
nodes:

       .----.
 L ->  | N4 |
       `----'
        ^  V
       .----.
       | N3 |
       `----'
        ^  V
       .----.
       | N2 |
       `----'
        ^  V
       .----.
       | N1 |
       `----'

And look at the 3 unqueue steps in the below patch, and read 3A to
mean, Node 3 ran step A.

Both bugs were in Step B, the first was triggered through:

  3A,4A,3B,4B

In this case the old code would have 3 spinning in @n3->next, however
4B would have moved L to N3 and left @n3->next empty. FAIL.

The second was:

  2A,2B,3A,2C,3B,3C

In this case the cancellation of both 2 and 3 'succeed' and we end up
with N1 linked to N3 and N2 linked to N4. Total fail.


The first bug was fixed by having Step-B spin on both @n->next and @l
just like the unlink() path already did.

The second bug was fixed by having Step-B xchg(@n->next, NULL) instead
of only reading it; this way the 3A above cannot complete until after 2C
and we have a coherent chain back.

I've downloaded AIM7 from sf.net and I hope I'm running it with 100+
loads but I'm not entirely sure I got this thing right, it's not really
making progress with or without patch :/

---
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -166,6 +166,9 @@ static inline int mutex_can_spin_on_owne
 	struct task_struct *owner;
 	int retval = 1;
 
+	if (need_resched())
+		return 0;
+
 	rcu_read_lock();
 	owner = ACCESS_ONCE(lock->owner);
 	if (owner)
@@ -358,6 +361,166 @@ ww_mutex_set_context_fastpath(struct ww_
 	spin_unlock_mutex(&lock->base.wait_lock, flags);
 }
 
+struct m_spinlock {
+	struct m_spinlock *next, *prev;
+	int locked; /* 1 if lock acquired */
+};
+
+/*
+ * Using a single mcs node per CPU is safe because mutex_lock() should not be
+ * called from interrupt context and we have preemption disabled over the mcs
+ * lock usage.
+ */
+static DEFINE_PER_CPU(struct m_spinlock, m_node);
+
+static bool m_spin_lock(struct m_spinlock **lock)
+{
+	struct m_spinlock *node = this_cpu_ptr(&m_node);
+	struct m_spinlock *prev, *next;
+
+	node->locked = 0;
+	node->next = NULL;
+
+	node->prev = prev = xchg(lock, node);
+	if (likely(prev == NULL))
+		return true;
+
+	ACCESS_ONCE(prev->next) = node;
+
+	/*
+	 * Normally @prev is untouchable after the above store; because at that
+	 * moment unlock can proceed and wipe the node element from stack.
+	 *
+	 * However, since our nodes are static per-cpu storage, we're
+	 * guaranteed their existence -- this allows us to apply
+	 * cmpxchg in an attempt to undo our queueing.
+	 */
+
+	while (!smp_load_acquire(&node->locked)) {
+		if (need_resched())
+			goto unqueue;
+		arch_mutex_cpu_relax();
+	}
+	return true;
+
+unqueue:
+	/*
+	 * Step - A  -- stabilize @prev
+	 *
+	 * Undo our @prev->next assignment; this will make @prev's
+	 * unlock()/cancel() wait for a next pointer since @lock points to us
+	 * (or later).
+	 */
+
+	for (;;) {
+		next = cmpxchg(&prev->next, node, NULL); /* A -> B,C */
+
+		/*
+		 * Because the unlock() path retains @prev->next (for
+		 * performance) we must check @node->locked after clearing
+		 * @prev->next to see if we raced.
+		 *
+		 * Ordered by the cmpxchg() above and the conditional-store in
+		 * unlock().
+		 */
+		if (smp_load_acquire(&node->locked)) {
+			/*
+			 * OOPS, we were too late, we already got the lock. No
+			 * harm done though; @prev is now unused and nobody
+			 * cares we frobbed it.
+			 */
+			return true;
+		}
+
+		if (next == node)
+			break;
+
+		/*
+		 * @prev->next didn't point to us anymore, we didn't own the
+		 * lock, so reload and try again.
+		 *
+		 * Because we observed the new @prev->next, the smp_wmb() at
+		 * (C) ensures that we must now observe the new @node->prev.
+		 */
+		prev = ACCESS_ONCE(node->prev);
+	}
+
+	/*
+	 * Step - B -- stabilize @next
+	 *
+	 * Similar to unlock(), wait for @node->next or move @lock from @node
+	 * back to @prev.
+	 */
+
+	for (;;) {
+		if (*lock == node && cmpxchg(lock, node, prev) == node) {
+			/*
+			 * We were the last queued, we moved @lock back. @prev
+			 * will now observe @lock and will complete its
+			 * unlock()/cancel().
+			 */
+			return false;
+		}
+
+		/*
+		 * We must xchg() the @node->next value, because if we were to
+		 * leave it in, a concurrent cancel() from @node->next might
+		 * complete Step-A and think its @prev is still valid.
+		 *
+		 * If the concurrent cancel() wins the race, we'll wait for
+		 * either @lock to point to us, through its Step-B, or wait for
+		 * a new @node->next from its Step-C.
+		 */
+		next = xchg(&node->next, NULL); /* B -> A */
+		if (next)
+			break;
+
+		arch_mutex_cpu_relax();
+	}
+
+	/*
+	 * Step - C -- unlink
+	 *
+	 * @prev is stable because its still waiting for a new @prev->next
+	 * pointer, @next is stable because our @node->next pointer is NULL and
+	 * it will wait in Step-A.
+	 */
+
+	ACCESS_ONCE(next->prev) = prev;
+
+	/*
+	 * Ensure that @next->prev is written before we write @prev->next,
+	 * this guarantees that when the cmpxchg at (A) fails we must
+	 * observe the new prev value.
+	 */
+	smp_wmb(); /* C -> A */
+
+	/*
+	 * And point @prev to our next, and we're unlinked.
+	 */
+	ACCESS_ONCE(prev->next) = next;
+
+	return false;
+}
+
+static void m_spin_unlock(struct m_spinlock **lock)
+{
+	struct m_spinlock *node = this_cpu_ptr(&m_node);
+	struct m_spinlock *next;
+
+	for (;;) {
+		if (likely(cmpxchg(lock, node, NULL) == node))
+			return;
+
+		next = ACCESS_ONCE(node->next);
+		if (unlikely(next))
+			break;
+
+		arch_mutex_cpu_relax();
+	}
+	smp_store_release(&next->locked, 1);
+}
+
 /*
  * Lock a mutex (possibly interruptible), slowpath:
  */
@@ -400,9 +563,11 @@ __mutex_lock_common(struct mutex *lock,
 	if (!mutex_can_spin_on_owner(lock))
 		goto slowpath;
 
+	if (!m_spin_lock(&lock->mcs_lock))
+		goto slowpath;
+
 	for (;;) {
 		struct task_struct *owner;
-		struct mcs_spinlock  node;
 
 		if (use_ww_ctx && ww_ctx->acquired > 0) {
 			struct ww_mutex *ww;
@@ -417,19 +582,16 @@ __mutex_lock_common(struct mutex *lock,
 			 * performed the optimistic spinning cannot be done.
 			 */
 			if (ACCESS_ONCE(ww->ctx))
-				goto slowpath;
+				break;
 		}
 
 		/*
 		 * If there's an owner, wait for it to either
 		 * release the lock or go to sleep.
 		 */
-		mcs_spin_lock(&lock->mcs_lock, &node);
 		owner = ACCESS_ONCE(lock->owner);
-		if (owner && !mutex_spin_on_owner(lock, owner)) {
-			mcs_spin_unlock(&lock->mcs_lock, &node);
-			goto slowpath;
-		}
+		if (owner && !mutex_spin_on_owner(lock, owner))
+			break;
 
 		if ((atomic_read(&lock->count) == 1) &&
 		    (atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
@@ -442,11 +604,10 @@ __mutex_lock_common(struct mutex *lock,
 			}
 
 			mutex_set_owner(lock);
-			mcs_spin_unlock(&lock->mcs_lock, &node);
+			m_spin_unlock(&lock->mcs_lock);
 			preempt_enable();
 			return 0;
 		}
-		mcs_spin_unlock(&lock->mcs_lock, &node);
 
 		/*
 		 * When there's no owner, we might have preempted between the
@@ -455,7 +616,7 @@ __mutex_lock_common(struct mutex *lock,
 		 * the owner complete.
 		 */
 		if (!owner && (need_resched() || rt_task(task)))
-			goto slowpath;
+			break;
 
 		/*
 		 * The cpu_relax() call is a compiler barrier which forces
@@ -465,10 +626,19 @@ __mutex_lock_common(struct mutex *lock,
 		 */
 		arch_mutex_cpu_relax();
 	}
+	m_spin_unlock(&lock->mcs_lock);
 slowpath:
 #endif
 	spin_lock_mutex(&lock->wait_lock, flags);
 
+	/*
+	 * XXX arguably, when need_resched() is set and there's other pending
+	 * owners we should not try-acquire and simply queue and schedule().
+	 *
+	 * There's nothing worse than obtaining a lock only to get immediately
+	 * scheduled out.
+	 */
+
 	/* once more, can we acquire the lock? */
 	if (MUTEX_SHOW_NO_WAITER(lock) && (atomic_xchg(&lock->count, 0) == 1))
 		goto skip_wait;


* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-01-31 14:09           ` Peter Zijlstra
@ 2014-01-31 20:01             ` Jason Low
  2014-01-31 20:08               ` Peter Zijlstra
  2014-02-02 20:02             ` Peter Zijlstra
  1 sibling, 1 reply; 26+ messages in thread
From: Jason Low @ 2014-01-31 20:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jason Low, Ingo Molnar, Paul McKenney, Waiman Long,
	Linus Torvalds, Thomas Gleixner, Linux Kernel Mailing List,
	Rik van Riel, Andrew Morton, Davidlohr Bueso, H. Peter Anvin,
	Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J,
	chegu_vinod

On Fri, Jan 31, 2014 at 6:09 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> I've downloaded AIM7 from sf.net and I hope I'm running it with 100+
> loads but I'm not entirely sure I got this thing right, it's not really
> making progress with or without patch :/

Ingo's program http://lkml.org/lkml/2006/1/8/50 using the V option may
be able to generate similar mutex contention.

Currently still getting soft lockups with the updated version.
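
For reference, even something as small as the sketch below (an illustration,
not Ingo's program) piles many tasks onto a single directory's i_mutex by
hammering file create/unlink in one directory:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int nproc = argc > 1 ? atoi(argv[1]) : 100;
	int i;

	for (i = 0; i < nproc; i++) {
		if (fork() == 0) {
			char name[64];
			long n;

			snprintf(name, sizeof(name), "contend-%d", i);
			for (n = 0; n < 1000000; n++) {
				int fd = open(name, O_CREAT | O_RDWR, 0600);

				if (fd >= 0)
					close(fd);
				unlink(name);
			}
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}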


* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-01-31 20:01             ` Jason Low
@ 2014-01-31 20:08               ` Peter Zijlstra
  2014-02-02 21:01                 ` Jason Low
  2014-02-02 22:02                 ` Paul E. McKenney
  0 siblings, 2 replies; 26+ messages in thread
From: Peter Zijlstra @ 2014-01-31 20:08 UTC (permalink / raw)
  To: Jason Low
  Cc: Ingo Molnar, Paul McKenney, Waiman Long, Linus Torvalds,
	Thomas Gleixner, Linux Kernel Mailing List, Rik van Riel,
	Andrew Morton, Davidlohr Bueso, H. Peter Anvin, Andi Kleen,
	Chandramouleeswaran, Aswin, Norton, Scott J, chegu_vinod

On Fri, Jan 31, 2014 at 12:01:37PM -0800, Jason Low wrote:
> On Fri, Jan 31, 2014 at 6:09 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > I've downloaded AIM7 from sf.net and I hope I'm running it with 100+
> > loads but I'm not entirely sure I got this thing right, it's not really
> > making progress with or without patch :/
> 
> Ingo's program http://lkml.org/lkml/2006/1/8/50 using the V option may
> be able to generate similar mutex contention.
> 
> Currently still getting soft lockups with the updated version.

Bugger.. ok clearly I need to think harder still. I'm fairly sure this
cancelation can work though, just seems tricky to get right :-)

I'll give that proglet from Ingo a go, although that might be Monday ere
I get to it.


* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-01-31 14:09           ` Peter Zijlstra
  2014-01-31 20:01             ` Jason Low
@ 2014-02-02 20:02             ` Peter Zijlstra
  1 sibling, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2014-02-02 20:02 UTC (permalink / raw)
  To: Jason Low
  Cc: mingo, paulmck, Waiman.Long, torvalds, tglx, linux-kernel, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On Fri, Jan 31, 2014 at 03:09:41PM +0100, Peter Zijlstra wrote:
> +struct m_spinlock {
> +	struct m_spinlock *next, *prev;
> +	int locked; /* 1 if lock acquired */
> +};
> +
> +/*
> + * Using a single mcs node per CPU is safe because mutex_lock() should not be
> + * called from interrupt context and we have preemption disabled over the mcs
> + * lock usage.
> + */
> +static DEFINE_PER_CPU(struct m_spinlock, m_node);
> +
> +static bool m_spin_lock(struct m_spinlock **lock)
> +{
> +	struct m_spinlock *node = this_cpu_ptr(&m_node);
> +	struct m_spinlock *prev, *next;
> +
> +	node->locked = 0;
> +	node->next = NULL;
> +
> +	node->prev = prev = xchg(lock, node);
> +	if (likely(prev == NULL))
> +		return true;
> +
> +	ACCESS_ONCE(prev->next) = node;
> +
> +	/*
> +	 * Normally @prev is untouchable after the above store; because at that
> +	 * moment unlock can proceed and wipe the node element from stack.
> +	 *
> +	 * However, since our nodes are static per-cpu storage, we're
> +	 * guaranteed their existence -- this allows us to apply
> +	 * cmpxchg in an attempt to undo our queueing.
> +	 */
> +
> +	while (!smp_load_acquire(&node->locked)) {
> +		if (need_resched())
> +			goto unqueue;
> +		arch_mutex_cpu_relax();
> +	}
> +	return true;
> +
> +unqueue:
> +	/*
> +	 * Step - A  -- stabilize @prev
> +	 *
> +	 * Undo our @prev->next assignment; this will make @prev's
> +	 * unlock()/cancel() wait for a next pointer since @lock points to us
> +	 * (or later).
> +	 */
> +
> +	for (;;) {
> +		next = cmpxchg(&prev->next, node, NULL); /* A -> B,C */
> +
> +		/*
> +		 * Because the unlock() path retains @prev->next (for
> +		 * performance) we must check @node->locked after clearing
> +		 * @prev->next to see if we raced.
> +		 *
> +		 * Ordered by the cmpxchg() above and the conditional-store in
> +		 * unlock().
> +		 */
> +		if (smp_load_acquire(&node->locked)) {
> +			/*
> +			 * OOPS, we were too late, we already got the lock. No
> +			 * harm done though; @prev is now unused and nobody
> +			 * cares we frobbed it.
> +			 */
> +			return true;
> +		}
> +
> +		if (next == node)
> +			break;
> +
> +		/*
> +		 * @prev->next didn't point to us anymore, we didn't own the
> +		 * lock, so reload and try again.
> +		 *
> +		 * Because we observed the new @prev->next, the smp_wmb() at
> +		 * (C) ensures that we must now observe the new @node->prev.
> +		 */
> +		prev = ACCESS_ONCE(node->prev);
> +	}
> +
> +	/*
> +	 * Step - B -- stabilize @next
> +	 *
> +	 * Similar to unlock(), wait for @node->next or move @lock from @node
> +	 * back to @prev.
> +	 */
> +
> +	for (;;) {
> +		if (*lock == node && cmpxchg(lock, node, prev) == node) {
> +			/*
> +			 * We were the last queued, we moved @lock back. @prev
> +			 * will now observe @lock and will complete its
> +			 * unlock()/cancel().
> +			 */
> +			return false;
> +		}
> +
> +		/*
> +		 * We must xchg() the @node->next value, because if we were to
> +		 * leave it in, a concurrent cancel() from @node->next might
> +		 * complete Step-A and think its @prev is still valid.
> +		 *
> +		 * If the concurrent cancel() wins the race, we'll wait for
> +		 * either @lock to point to us, through its Step-B, or wait for
> +		 * a new @node->next from its Step-C.
> +		 */
> +		next = xchg(&node->next, NULL); /* B -> A */
> +		if (next)
> +			break;
> +
> +		arch_mutex_cpu_relax();
> +	}
> +
> +	/*
> +	 * Step - C -- unlink
> +	 *
> +	 * @prev is stable because its still waiting for a new @prev->next
> +	 * pointer, @next is stable because our @node->next pointer is NULL and
> +	 * it will wait in Step-A.
> +	 */
> +
> +	ACCESS_ONCE(next->prev) = prev;
> +
> +	/*
> +	 * Ensure that @next->prev is written before we write @prev->next,
> +	 * this guarantees that when the cmpxchg at (A) fails we must
> +	 * observe the new prev value.
> +	 */
> +	smp_wmb(); /* C -> A */

OK, I've definitely stared at this code for too long :/

I don't think the above barrier is right -- or even required for that
matter. At this point I can't see anything wrong with the order of
either of these stores.

If the latter store hits first, an unlock can happen and release the
@next entry, which is fine as the Step-A loop can deal with that; if the
unlock doesn't happen, we'll simply wait for the first store to become
visible before trying to undo later again.

If the former store hits first, we'll simply wait for the later store to
appear and we'll try to undo it.

This obviously doesn't explain lockups, but it does reduce the code to
stare at ever so slightly.

> +	/*
> +	 * And point @prev to our next, and we're unlinked.
> +	 */
> +	ACCESS_ONCE(prev->next) = next;
> +
> +	return false;
> +}
> +
> +static void m_spin_unlock(struct m_spinlock **lock)
> +{
> +	struct m_spinlock *node = this_cpu_ptr(&m_node);
> +	struct m_spinlock *next;
> +
> +	for (;;) {
> +		if (likely(cmpxchg(lock, node, NULL) == node))
> +			return;
> +
> +		next = ACCESS_ONCE(node->next);
> +		if (unlikely(next))
> +			break;
> +
> +		arch_mutex_cpu_relax();
> +	}
> +	smp_store_release(&next->locked, 1);
> +}


* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-01-31 20:08               ` Peter Zijlstra
@ 2014-02-02 21:01                 ` Jason Low
  2014-02-02 21:12                   ` Peter Zijlstra
  2014-02-02 22:02                 ` Paul E. McKenney
  1 sibling, 1 reply; 26+ messages in thread
From: Jason Low @ 2014-02-02 21:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul McKenney, Waiman Long, Linus Torvalds,
	Thomas Gleixner, Linux Kernel Mailing List, Rik van Riel,
	Andrew Morton, Davidlohr Bueso, H. Peter Anvin, Andi Kleen,
	Chandramouleeswaran, Aswin, Norton, Scott J, chegu_vinod

On Fri, 2014-01-31 at 21:08 +0100, Peter Zijlstra wrote:
> On Fri, Jan 31, 2014 at 12:01:37PM -0800, Jason Low wrote:
> > Currently still getting soft lockups with the updated version.
> 
> Bugger.. ok clearly I need to think harder still. I'm fairly sure this
> cancelation can work though, just seems tricky to get right :-)

Ok, I believe I have found a race condition between m_spin_lock() and
m_spin_unlock().

In m_spin_unlock(), we do "next = ACCESS_ONCE(node->next)". Then, if
next is not NULL, we proceed to set next->locked to 1.

A thread in m_spin_lock() in the unqueue path could execute
"next = cmpxchg(&prev->next, node, NULL)" after the thread in
m_spin_unlock() accesses its node->next and finds that it is not NULL.
Then, the thread in m_spin_lock() could check !node->locked before
the thread in m_spin_unlock() sets next->locked to 1.

The following additional change was able to solve the initial lockups that were
occurring when running fserver on a 2 socket box.

---
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 9eb4dbe..e71a84a 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -513,8 +513,13 @@ static void m_spin_unlock(struct m_spinlock **lock)
 			return;
 
 		next = ACCESS_ONCE(node->next);
-		if (unlikely(next))
-			break;
+
+		if (unlikely(next)) {
+			next = cmpxchg(&node->next, next, NULL);
+
+			if (next)
+				break;
+		}
 
 		arch_mutex_cpu_relax();
 	}




* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-02-02 21:01                 ` Jason Low
@ 2014-02-02 21:12                   ` Peter Zijlstra
  2014-02-03 18:39                     ` Jason Low
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2014-02-02 21:12 UTC (permalink / raw)
  To: Jason Low
  Cc: Ingo Molnar, Paul McKenney, Waiman Long, Linus Torvalds,
	Thomas Gleixner, Linux Kernel Mailing List, Rik van Riel,
	Andrew Morton, Davidlohr Bueso, H. Peter Anvin, Andi Kleen,
	Chandramouleeswaran, Aswin, Norton, Scott J, chegu_vinod

On Sun, Feb 02, 2014 at 01:01:23PM -0800, Jason Low wrote:
> On Fri, 2014-01-31 at 21:08 +0100, Peter Zijlstra wrote:
> > On Fri, Jan 31, 2014 at 12:01:37PM -0800, Jason Low wrote:
> > > Currently still getting soft lockups with the updated version.
> > 
> > Bugger.. ok clearly I need to think harder still. I'm fairly sure this
> > cancelation can work though, just seems tricky to get right :-)
> 
> Ok, I believe I have found a race condition between m_spin_lock() and
> m_spin_unlock().
> 
> In m_spin_unlock(), we do "next = ACCESS_ONCE(node->next)". Then, if
> next is not NULL, we proceed to set next->locked to 1.
> 
> A thread in m_spin_lock() in the unqueue path could execute
> "next = cmpxchg(&prev->next, node, NULL)" after the thread in
> m_spin_unlock() accesses its node->next and finds that it is not NULL.
> Then, the thread in m_spin_lock() could check !node->locked before
> the thread in m_spin_unlock() sets next->locked to 1.

Yes indeed. How silly of me to not spot that!

> The following additional change was able to solve the initial lockups that were
> occurring when running fserver on a 2 socket box.
> 
> ---
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 9eb4dbe..e71a84a 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -513,8 +513,13 @@ static void m_spin_unlock(struct m_spinlock **lock)
>  			return;
>  
>  		next = ACCESS_ONCE(node->next);
> -		if (unlikely(next))
> -			break;
> +
> +		if (unlikely(next)) {
> +			next = cmpxchg(&node->next, next, NULL);
> +
> +			if (next)

The cmpxchg could fail and next still be !NULL I suppose.

> +				break;
> +		}


The way I wrote that same loop in step-B, is:


	for (;;) {
		if (*lock == node && cmpxchg(lock, node, prev) == node)
			return

		next = xchg(&node->next, NULL); /* B -> A */
		if (next)
			break;

		arch_mutex_cpu_relax();
	}

I suppose we can make that something like:


	if (node->next) {
		next = xchg(&node->next, NULL);
		if (next)
			break
	}

To avoid the xchg on every loop.

I had wanted to avoid the additional locked op in the unlock path, but
yes that does make things easier.
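
As an aside, the idiom in isolation looks like the below (a userspace C11
sketch, purely illustrative): a plain load filters out the common case so a
failing atomic RMW does not bounce the cache line on every spin iteration.

#include <stdatomic.h>
#include <stddef.h>

/* Swing *slot from old to new only when a cheap read says it can succeed. */
static void *try_swing(_Atomic(void *) *slot, void *old, void *new)
{
	void *expected = old;

	if (atomic_load_explicit(slot, memory_order_relaxed) != old)
		return NULL;		/* read-only test, no exclusive access */

	if (atomic_compare_exchange_strong(slot, &expected, new))
		return old;		/* RMW only when it should succeed */

	return NULL;
}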


* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-01-31 20:08               ` Peter Zijlstra
  2014-02-02 21:01                 ` Jason Low
@ 2014-02-02 22:02                 ` Paul E. McKenney
  1 sibling, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2014-02-02 22:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jason Low, Ingo Molnar, Waiman Long, Linus Torvalds,
	Thomas Gleixner, Linux Kernel Mailing List, Rik van Riel,
	Andrew Morton, Davidlohr Bueso, H. Peter Anvin, Andi Kleen,
	Chandramouleeswaran, Aswin, Norton, Scott J, chegu_vinod

On Fri, Jan 31, 2014 at 09:08:25PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 31, 2014 at 12:01:37PM -0800, Jason Low wrote:
> > On Fri, Jan 31, 2014 at 6:09 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > I've downloaded AIM7 from sf.net and I hope I'm running it with 100+
> > > loads but I'm not entirely sure I got this thing right, it's not really
> > > making progress with or without patch :/
> > 
> > Ingo's program http://lkml.org/lkml/2006/1/8/50 using the V option may
> > be able to generate similar mutex contention.
> > 
> > Currently still getting soft lockups with the updated version.
> 
> Bugger.. ok clearly I need to think harder still. I'm fairly sure this
> cancelation can work though, just seems tricky to get right :-)

We used to do something similar to avoid passing locks off to tasks that
had been interrupted while spinning, and it was a bit tricky.  But we
had it a bit easier, because we didn't actually have to remove the element
from the queue, just bypass it at lock-grant time.

							Thanx, Paul

> I'll give that proglet from Ingo a go, although that might be Monday ere
> I get to it.
> 



* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-02-02 21:12                   ` Peter Zijlstra
@ 2014-02-03 18:39                     ` Jason Low
  2014-02-03 19:25                       ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Jason Low @ 2014-02-03 18:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul McKenney, Waiman Long, Linus Torvalds,
	Thomas Gleixner, Linux Kernel Mailing List, Rik van Riel,
	Andrew Morton, Davidlohr Bueso, H. Peter Anvin, Andi Kleen,
	Chandramouleeswaran, Aswin, Norton, Scott J, chegu_vinod

On Sun, 2014-02-02 at 22:12 +0100, Peter Zijlstra wrote:
> The way I wrote that same loop in step-B, is:
> 
> 
> 	for (;;) {
> 		if (*lock == node && cmpxchg(lock, node, prev) == node)
> 			return
> 
> 		next = xchg(&node->next, NULL); /* B -> A */
> 		if (next)
> 			break;
> 
> 		arch_mutex_cpu_relax();
> 	}
> 
> I suppose we can make that something like:
> 
> 
> 	if (node->next) {
> 		next = xchg(&node->next, NULL);
> 		if (next)
> 			break
> 	}
> 
> To avoid the xchg on every loop.

Ah yes, we want to use xchg() on &node->next.

Since the cmpxchg() is now in a loop in the unlock function, an
additional (*lock == node) check before the cmpxchg() would also be nice
to avoid spinning on cmpxchg() there too.



* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-02-03 18:39                     ` Jason Low
@ 2014-02-03 19:25                       ` Peter Zijlstra
  2014-02-03 20:55                         ` Jason Low
  2014-02-04  7:13                         ` Jason Low
  0 siblings, 2 replies; 26+ messages in thread
From: Peter Zijlstra @ 2014-02-03 19:25 UTC (permalink / raw)
  To: Jason Low
  Cc: Ingo Molnar, Paul McKenney, Waiman Long, Linus Torvalds,
	Thomas Gleixner, Linux Kernel Mailing List, Rik van Riel,
	Andrew Morton, Davidlohr Bueso, H. Peter Anvin, Andi Kleen,
	Chandramouleeswaran, Aswin, Norton, Scott J, chegu_vinod

On Mon, Feb 03, 2014 at 10:39:20AM -0800, Jason Low wrote:
> > To avoid the xchg on every loop.
> 
> Ah yes, we want to use xchg() on &node->next.
> 
> Since the cmpxchg() is now in a loop in the unlock function, an
> additional (*lock == node) check before the cmpxchg() would also be nice
> to avoid spinning on cmpxchg() there too.

Right, I have the below; you can find the patches this depends upon
here:

  http://programming.kicks-ass.net/sekrit/patches.tar.bz2

---
Subject: locking, mutex: Cancelable MCS lock for adaptive spinning
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed, 29 Jan 2014 12:51:42 +0100

Since we want a task waiting for a mutex_lock() to go to sleep and
reschedule on need_resched() we must be able to abort the
mcs_spin_lock() around the adaptive spin.

Therefore implement a cancelable mcs lock.

XXX: anybody got a better name than m_spinlock?

Cc: paulmck@linux.vnet.ibm.com
Cc: Waiman.Long@hp.com
Cc: torvalds@linux-foundation.org
Cc: tglx@linutronix.de
Cc: riel@redhat.com
Cc: akpm@linux-foundation.org
Cc: davidlohr@hp.com
Cc: hpa@zytor.com
Cc: andi@firstfloor.org
Cc: aswin@hp.com
Cc: scott.norton@hp.com
Cc: chegu_vinod@hp.com
Cc: mingo@redhat.com
Cc: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-7jr68p4f447w2e0ck7y1yl06@git.kernel.org
---
 include/linux/mutex.h         |    4 -
 kernel/locking/Makefile       |    2 
 kernel/locking/mcs_spinlock.c |  156 ++++++++++++++++++++++++++++++++++++++++++
 kernel/locking/mcs_spinlock.h |   18 ++++
 kernel/locking/mutex.c        |   10 +-
 5 files changed, 183 insertions(+), 7 deletions(-)

--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -46,7 +46,7 @@
  * - detects multi-task circular deadlocks and prints out all affected
  *   locks and tasks (and only those tasks)
  */
-struct mcs_spinlock;
+struct m_spinlock;
 struct mutex {
 	/* 1: unlocked, 0: locked, negative: locked, possible waiters */
 	atomic_t		count;
@@ -56,7 +56,7 @@ struct mutex {
 	struct task_struct	*owner;
 #endif
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-	struct mcs_spinlock	*mcs_lock;	/* Spinner MCS lock */
+	struct m_spinlock	*m_lock;	/* Spinner MCS lock */
 #endif
 #ifdef CONFIG_DEBUG_MUTEXES
 	const char 		*name;
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -1,5 +1,5 @@
 
-obj-y += mutex.o semaphore.o rwsem.o lglock.o
+obj-y += mutex.o semaphore.o rwsem.o lglock.o mcs_spinlock.o
 
 ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_REMOVE_lockdep.o = -pg
--- /dev/null
+++ b/kernel/locking/mcs_spinlock.c
@@ -0,0 +1,156 @@
+
+#include <linux/percpu.h>
+#include <linux/mutex.h>
+#include <linux/sched.h>
+#include "mcs_spinlock.h"
+
+#ifdef CONFIG_SMP
+
+/*
+ * Using a single mcs node per CPU is safe because mutex_lock() should not be
+ * called from interrupt context and we have preemption disabled over the mcs
+ * lock usage.
+ */
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct m_spinlock, m_node);
+
+/*
+ * Get a stable @node->next pointer, either for unlock() or unqueue() purposes.
+ * Can return NULL in case we were the last queued and we updated @lock instead.
+ */
+static inline struct m_spinlock *
+m_spin_wait_next(struct m_spinlock **lock, struct m_spinlock *node,
+		 struct m_spinlock *prev)
+{
+	struct m_spinlock *next = NULL;
+
+	for (;;) {
+		if (*lock == node && cmpxchg(lock, node, prev) == node) {
+			/*
+			 * We were the last queued, we moved @lock back. @prev
+			 * will now observe @lock and will complete its
+			 * unlock()/unqueue().
+			 */
+			break;
+		}
+
+		/*
+		 * We must xchg() the @node->next value, because if we were to
+		 * leave it in, a concurrent unlock()/unqueue() from
+		 * @node->next might complete Step-A and think its @prev is
+		 * still valid.
+		 *
+		 * If the concurrent unlock()/unqueue() wins the race, we'll
+		 * wait for either @lock to point to us, through its Step-B, or
+		 * wait for a new @node->next from its Step-C.
+		 */
+		if (node->next) {
+			next = xchg(&node->next, NULL);
+			if (next)
+				break;
+		}
+
+		arch_mutex_cpu_relax();
+	}
+
+	return next;
+}
+
+bool m_spin_lock(struct m_spinlock **lock)
+{
+	struct m_spinlock *node = this_cpu_ptr(&m_node);
+	struct m_spinlock *prev, *next;
+
+	node->locked = 0;
+	node->next = NULL;
+
+	node->prev = prev = xchg(lock, node);
+	if (likely(prev == NULL))
+		return true;
+
+	ACCESS_ONCE(prev->next) = node;
+
+	/*
+	 * Normally @prev is untouchable after the above store; because at that
+	 * moment unlock can proceed and wipe the node element from stack.
+	 *
+	 * However, since our nodes are static per-cpu storage, we're
+	 * guaranteed their existence -- this allows us to apply
+	 * cmpxchg in an attempt to undo our queueing.
+	 */
+
+	while (!smp_load_acquire(&node->locked)) {
+		if (need_resched())
+			goto unqueue;
+		arch_mutex_cpu_relax();
+	}
+	return true;
+
+unqueue:
+	/*
+	 * Step - A  -- stabilize @prev
+	 *
+	 * Undo our @prev->next assignment; this will make @prev's
+	 * unlock()/unqueue() wait for a next pointer since @lock points to us
+	 * (or later).
+	 */
+
+	for (;;) {
+		if (prev->next == node &&
+		    cmpxchg(&prev->next, node, NULL) == node)
+			break;
+
+		/*
+		 * We can only fail the cmpxchg() racing against an unlock(),
+		 * in which case we should observe @node->locked becoming
+		 * true.
+		 */
+		if (smp_load_acquire(&node->locked))
+			return true;
+
+		/*
+		 * Or we race against a concurrent unqueue()'s step-B, in which
+		 * case its step-C will write us a new @node->prev pointer.
+		 */
+		prev = ACCESS_ONCE(node->prev);
+	}
+
+	/*
+	 * Step - B -- stabilize @next
+	 *
+	 * Similar to unlock(), wait for @node->next or move @lock from @node
+	 * back to @prev.
+	 */
+
+	next = m_spin_wait_next(lock, node, prev);
+	if (!next)
+		return false;
+
+	/*
+	 * Step - C -- unlink
+	 *
+	 * @prev is stable because its still waiting for a new @prev->next
+	 * pointer, @next is stable because our @node->next pointer is NULL and
+	 * it will wait in Step-A.
+	 */
+
+	ACCESS_ONCE(next->prev) = prev;
+	ACCESS_ONCE(prev->next) = next;
+
+	return false;
+}
+
+void m_spin_unlock(struct m_spinlock **lock)
+{
+	struct m_spinlock *node = this_cpu_ptr(&m_node);
+	struct m_spinlock *next;
+
+	if (likely(cmpxchg(lock, node, NULL) == node))
+		return;
+
+	next = m_spin_wait_next(lock, node, NULL);
+	if (next)
+		ACCESS_ONCE(next->locked) = 1;
+}
+
+#endif
+
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -109,4 +109,22 @@ void mcs_spin_unlock(struct mcs_spinlock
 	arch_mcs_spin_unlock_contended(&next->locked);
 }
 
+/*
+ * Cancellable version of the MCS lock above.
+ *
+ * This version can fail acquisition and unqueue a spinner; it assumes no
+ * nesting.
+ *
+ * Intended for adaptive spinning of sleeping locks:
+ * mutex_lock()/rwsem_down_{read,write}() etc.
+ */
+
+struct m_spinlock {
+	struct m_spinlock *next, *prev;
+	int locked; /* 1 if lock acquired */
+};
+
+extern bool m_spin_lock(struct m_spinlock **lock);
+extern void m_spin_unlock(struct m_spinlock **lock);
+
 #endif /* __LINUX_MCS_SPINLOCK_H */
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -53,7 +53,7 @@ __mutex_init(struct mutex *lock, const c
 	INIT_LIST_HEAD(&lock->wait_list);
 	mutex_clear_owner(lock);
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-	lock->mcs_lock = NULL;
+	lock->m_lock = NULL;
 #endif
 
 	debug_mutex_init(lock, name, key);
@@ -403,7 +403,9 @@ __mutex_lock_common(struct mutex *lock,
 	if (!mutex_can_spin_on_owner(lock))
 		goto slowpath;
 
-	mcs_spin_lock(&lock->mcs_lock);
+	if (!m_spin_lock(&lock->m_lock))
+		goto slowpath;
+
 	for (;;) {
 		struct task_struct *owner;
 
@@ -442,7 +444,7 @@ __mutex_lock_common(struct mutex *lock,
 			}
 
 			mutex_set_owner(lock);
-			mcs_spin_unlock(&lock->mcs_lock);
+			m_spin_unlock(&lock->m_lock);
 			preempt_enable();
 			return 0;
 		}
@@ -464,7 +466,7 @@ __mutex_lock_common(struct mutex *lock,
 		 */
 		arch_mutex_cpu_relax();
 	}
-	mcs_spin_unlock(&lock->mcs_lock);
+	m_spin_unlock(&lock->m_lock);
 slowpath:
 #endif
 	spin_lock_mutex(&lock->wait_lock, flags);


* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-02-03 19:25                       ` Peter Zijlstra
@ 2014-02-03 20:55                         ` Jason Low
  2014-02-03 21:06                           ` Peter Zijlstra
  2014-02-04  7:13                         ` Jason Low
  1 sibling, 1 reply; 26+ messages in thread
From: Jason Low @ 2014-02-03 20:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul McKenney, Waiman Long, Linus Torvalds,
	Thomas Gleixner, Linux Kernel Mailing List, Rik van Riel,
	Andrew Morton, Davidlohr Bueso, H. Peter Anvin, Andi Kleen,
	Chandramouleeswaran, Aswin, Norton, Scott J, chegu_vinod

On Mon, 2014-02-03 at 20:25 +0100, Peter Zijlstra wrote:
> On Mon, Feb 03, 2014 at 10:39:20AM -0800, Jason Low wrote:
> > > To avoid the xchg on every loop.
> > 
> > Ah yes, we want to use xchg() on &node->next.
> > 
> > Since the cmpxchg() is now in a loop in the unlock function, an
> > additional (*lock == node) check before the cmpxchg() would also be nice
> > to avoid spinning on cmpxchg() there too.
> 
> Right, I have the below; you can find the patches this depends upon
> here:
> 
>   http://programming.kicks-ass.net/sekrit/patches.tar.bz2
> 
> ---
> Subject: locking, mutex: Cancelable MCS lock for adaptive spinning
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Wed, 29 Jan 2014 12:51:42 +0100
> 
> Since we want a task waiting for a mutex_lock() to go to sleep and
> reschedule on need_resched() we must be able to abort the
> mcs_spin_lock() around the adaptive spin.
> 
> Therefore implement a cancelable mcs lock.
> 
> XXX: anybody got a better name than m_spinlock?

So I was thinking something along the lines of
mcs_spin_lock_cancelable() as that's essentially what this function
does.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-02-03 20:55                         ` Jason Low
@ 2014-02-03 21:06                           ` Peter Zijlstra
  2014-02-03 21:56                             ` Jason Low
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2014-02-03 21:06 UTC (permalink / raw)
  To: Jason Low
  Cc: Ingo Molnar, Paul McKenney, Waiman Long, Linus Torvalds,
	Thomas Gleixner, Linux Kernel Mailing List, Rik van Riel,
	Andrew Morton, Davidlohr Bueso, H. Peter Anvin, Andi Kleen,
	Chandramouleeswaran, Aswin, Norton, Scott J, chegu_vinod

On Mon, Feb 03, 2014 at 12:55:34PM -0800, Jason Low wrote:
> On Mon, 2014-02-03 at 20:25 +0100, Peter Zijlstra wrote:
> > XXX: anybody got a better name than m_spinlock?
> 
> So I was thinking something along the lines of
> mcs_spin_lock_cancelable() as that's essentially what this function
> does.

sure, but what do we call the data structure that goes with it? Can't
have two struct mcs_spinlock :/

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-02-03 21:06                           ` Peter Zijlstra
@ 2014-02-03 21:56                             ` Jason Low
  0 siblings, 0 replies; 26+ messages in thread
From: Jason Low @ 2014-02-03 21:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul McKenney, Waiman Long, Linus Torvalds,
	Thomas Gleixner, Linux Kernel Mailing List, Rik van Riel,
	Andrew Morton, Davidlohr Bueso, H. Peter Anvin, Andi Kleen,
	Chandramouleeswaran, Aswin, Norton, Scott J, chegu_vinod

On Mon, 2014-02-03 at 22:06 +0100, Peter Zijlstra wrote:
> On Mon, Feb 03, 2014 at 12:55:34PM -0800, Jason Low wrote:
> > On Mon, 2014-02-03 at 20:25 +0100, Peter Zijlstra wrote:
> > > XXX: anybody got a better name than m_spinlock?
> > 
> > So I was thinking something along the lines of
> > mcs_spin_lock_cancelable() as that's essentially what this function
> > does.
> 
> sure, but what do we call the data structure that goes with it? Can't
> have two struct mcs_spinlock :/

If this structure is only going to be used for cancelable mcs locking,
what do you think of "struct cancelable_mcs_spinlock"?



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-02-03 19:25                       ` Peter Zijlstra
  2014-02-03 20:55                         ` Jason Low
@ 2014-02-04  7:13                         ` Jason Low
  1 sibling, 0 replies; 26+ messages in thread
From: Jason Low @ 2014-02-04  7:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul McKenney, Waiman Long, Linus Torvalds,
	Thomas Gleixner, Linux Kernel Mailing List, Rik van Riel,
	Andrew Morton, Davidlohr Bueso, H. Peter Anvin, Andi Kleen,
	Chandramouleeswaran, Aswin, Norton, Scott J, chegu_vinod

On Mon, 2014-02-03 at 20:25 +0100, Peter Zijlstra wrote:

> +void m_spin_unlock(struct m_spinlock **lock)
> +{
> +	struct m_spinlock *node = this_cpu_ptr(&m_node);
> +	struct m_spinlock *next;
> +
> +	if (likely(cmpxchg(lock, node, NULL) == node))
> +		return;

At this point, (node->next != NULL) is a likely scenario.
Perhaps we can also add the following code here:

	next = xchg(&node->next, NULL);
	if (next) {
		ACCESS_ONCE(next->locked) = 1;
		return;
	}

> +	next = m_spin_wait_next(lock, node, NULL);
> +	if (next)
> +		ACCESS_ONCE(next->locked) = 1;
> +}
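
Putting the two together, the unlock path would then look roughly like the
sketch below (illustrative only; m_spin_wait_next(), m_node and the
ACCESS_ONCE() usage are taken from the patch quoted above):

void m_spin_unlock(struct m_spinlock **lock)
{
	struct m_spinlock *node = this_cpu_ptr(&m_node);
	struct m_spinlock *next;

	/* Fast path: we are the last queued node, just clear the tail. */
	if (likely(cmpxchg(lock, node, NULL) == node))
		return;

	/*
	 * Second chance: a successor has most likely already linked itself
	 * behind us, so hand the lock over without going through
	 * m_spin_wait_next().
	 */
	next = xchg(&node->next, NULL);
	if (next) {
		ACCESS_ONCE(next->locked) = 1;
		return;
	}

	/* Otherwise wait for the (unstable) next pointer to settle. */
	next = m_spin_wait_next(lock, node, NULL);
	if (next)
		ACCESS_ONCE(next->locked) = 1;
}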



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-01-29 11:51       ` Peter Zijlstra
  2014-01-31  3:29         ` Jason Low
@ 2014-02-05 21:44         ` Waiman Long
  2014-02-06 14:04           ` Peter Zijlstra
  2014-02-06 17:44           ` Jason Low
  1 sibling, 2 replies; 26+ messages in thread
From: Waiman Long @ 2014-02-05 21:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jason Low, mingo, paulmck, torvalds, tglx, linux-kernel, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On 01/29/2014 06:51 AM, Peter Zijlstra wrote:
> On Tue, Jan 28, 2014 at 02:51:35PM -0800, Jason Low wrote:
>>> But urgh, nasty problem. Lemme ponder this a bit.
> OK, please have a very careful look at the below. It survived a boot
> with udev -- which usually stresses mutex contention enough to explode
> (in fact it did a few time when I got the contention/cancel path wrong),
> however I have not ran anything else on it.
>
> The below is an MCS variant that allows relatively cheap unqueueing. But
> its somewhat tricky and I might have gotten a case wrong, esp. the
> double concurrent cancel case got my head hurting (I didn't attempt a
> tripple unqueue).
>
> Applies to tip/master but does generate a few (harmless) compile
> warnings because I didn't fully clean up the mcs_spinlock vs m_spinlock
> thing.
>
> Also, there's a comment in the slowpath that bears consideration.
>
>

I have an alternative way of breaking out of the MCS lock waiting queue 
when need_resched() is set. I overload the locked flag to indicate a 
skipped node if negative. I ran the patch through the AIM7 high-systime 
workload on a 4-socket server and it seemed to run fine.
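
For reference, the locked-flag states this relies on are roughly (as read
from the patch below):

	/*
	 * node->locked:
	 *    0  - node is free, or queued with its owner still spinning on it
	 *    1  - node holds the lock (acquired directly, or handed over
	 *         by its predecessor)
	 *   -1  - owner saw need_resched() and went to the mutex slowpath;
	 *         a later queue head must skip the node and reset it to 0
	 *         before this CPU can use it for spinning again
	 */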

Please check the following POC patch to see if you have any comments.
diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h
index e9a4d74..84a0b32 100644
--- a/include/linux/mcs_spinlock.h
+++ b/include/linux/mcs_spinlock.h
@@ -14,17 +14,45 @@

  struct mcs_spinlock {
      struct mcs_spinlock *next;
-    int locked; /* 1 if lock acquired */
+    int locked; /* 1 if lock acquired, -1 if skipping node */
  };

+/*
+ * The node skipping feature requires the MCS node to be persistent,
+ * i.e. not allocated on stack. In addition, the MCS node cannot be
+ * reused until after the locked flag is cleared. The mcs_skip_node()
+ * macro must be defined before including this header file.
+ */
+#ifdef mcs_skip_node
+#undef arch_mcs_spin_lock_contended
+#undef arch_mcs_spin_unlock_contended
+
+#define    arch_mcs_spin_lock_contended(n)                    \
+do {                                    \
+    while (!smp_load_acquire(&(n)->locked)) {            \
+        if (mcs_skip_node(n)) {                    \
+            if (cmpxchg(&(n)->locked, 0, -1) == 0)        \
+                return;                    \
+        }                            \
+        arch_mutex_cpu_relax();                    \
+    };                                \
+} while (0)
+
+#define    arch_mcs_spin_unlock_contended(n)                \
+    if (cmpxchg(&(n)->locked, 0, 1) == 0)                \
+        return
+
+#define mcs_set_locked(n, v)    (n)->locked = (v)
+#else    /* mcs_skip_node */
+
  #ifndef arch_mcs_spin_lock_contended
  /*
   * Using smp_load_acquire() provides a memory barrier that ensures
   * subsequent operations happen after the lock is acquired.
   */
-#define arch_mcs_spin_lock_contended(l)                    \
+#define arch_mcs_spin_lock_contended(n)                    \
  do {                                    \
-    while (!(smp_load_acquire(l)))                    \
+    while (!(smp_load_acquire(&(n)->locked)))            \
          arch_mutex_cpu_relax();                    \
  } while (0)
  #endif
@@ -35,9 +63,12 @@ do {                                \
   * operations in the critical section has been completed before
   * unlocking.
   */
-#define arch_mcs_spin_unlock_contended(l)                \
-    smp_store_release((l), 1)
+#define arch_mcs_spin_unlock_contended(n)                \
+    smp_store_release(&(n)->locked, 1)
+
+#define mcs_set_locked(n, v)
  #endif
+#endif    /* mcs_skip_node */

  /*
   * Note: the smp_load_acquire/smp_store_release pair is not
@@ -77,12 +108,13 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
           * value won't be used. If a debug mode is needed to
           * audit lock status, then set node->locked value here.
           */
+        mcs_set_locked(node, 1);
          return;
      }
      ACCESS_ONCE(prev->next) = node;

      /* Wait until the lock holder passes the lock down. */
-    arch_mcs_spin_lock_contended(&node->locked);
+    arch_mcs_spin_lock_contended(node);
  }

  /*
@@ -94,19 +126,35 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
  {
      struct mcs_spinlock *next = ACCESS_ONCE(node->next);

+#ifdef mcs_skip_node
+check_next_node:
+#endif
      if (likely(!next)) {
          /*
           * Release the lock by setting it to NULL
           */
-        if (likely(cmpxchg(lock, node, NULL) == node))
+        if (likely(cmpxchg(lock, node, NULL) == node)) {
+            mcs_set_locked(node, 0);
              return;
+        }
          /* Wait until the next pointer is set */
          while (!(next = ACCESS_ONCE(node->next)))
              arch_mutex_cpu_relax();
      }
+    /* Clear the lock flag to indicate the node can be used again. */
+    mcs_set_locked(node, 0);

      /* Pass lock to next waiter. */
-    arch_mcs_spin_unlock_contended(&next->locked);
+    arch_mcs_spin_unlock_contended(next);
+
+#ifdef mcs_skip_node
+    /*
+     * The next node was marked as skipped (locked < 0); clear it and
+     * pass the lock on to the node after it.
+     */
+    node = next;
+    next = ACCESS_ONCE(node->next);
+    goto check_next_node;
+#endif
  }

  #endif /* __LINUX_MCS_SPINLOCK_H */
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 45fe1b5..6351eca 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -25,6 +25,10 @@
  #include <linux/spinlock.h>
  #include <linux/interrupt.h>
  #include <linux/debug_locks.h>
+/*
+ * Allow an MCS node to be skipped if need_resched() is true.
+ */
+#define mcs_skip_node(n)    need_resched()
  #include <linux/mcs_spinlock.h>

  /*
@@ -45,6 +49,13 @@
   */
  #define    MUTEX_SHOW_NO_WAITER(mutex)    (atomic_read(&(mutex)->count) >= 0)

+/*
+ * Using a single mcs node per CPU is safe because mutex_lock() should not be
+ * called from interrupt context and we have preemption disabled over the mcs
+ * lock usage.
+ */
+static DEFINE_PER_CPU(struct mcs_spinlock, m_node);
+
  void
  __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
  {
@@ -166,6 +177,9 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
      struct task_struct *owner;
      int retval = 1;

+    if (need_resched())
+        return 0;
+
      rcu_read_lock();
      owner = ACCESS_ONCE(lock->owner);
      if (owner)
@@ -370,6 +384,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
      struct mutex_waiter waiter;
      unsigned long flags;
      int ret;
+    struct mcs_spinlock  *node;

      preempt_disable();
      mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
@@ -400,9 +415,13 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
      if (!mutex_can_spin_on_owner(lock))
          goto slowpath;

+    node = this_cpu_ptr(&m_node);
+    if (node->locked < 0)
+        /* MCS node not available yet */
+        goto slowpath;
+
      for (;;) {
          struct task_struct *owner;
-        struct mcs_spinlock  node;

          if (use_ww_ctx && ww_ctx->acquired > 0) {
              struct ww_mutex *ww;
@@ -424,10 +443,15 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
           * If there's an owner, wait for it to either
           * release the lock or go to sleep.
           */
-        mcs_spin_lock(&lock->mcs_lock, &node);
+        mcs_spin_lock(&lock->mcs_lock, node);
+        if (node->locked < 0)
+            /*
+             * need_resched() true, no unlock needed
+             */
+            goto slowpath;
          owner = ACCESS_ONCE(lock->owner);
          if (owner && !mutex_spin_on_owner(lock, owner)) {
-            mcs_spin_unlock(&lock->mcs_lock, &node);
+            mcs_spin_unlock(&lock->mcs_lock, node);
              goto slowpath;
          }

@@ -442,11 +466,11 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
              }

              mutex_set_owner(lock);
-            mcs_spin_unlock(&lock->mcs_lock, &node);
+            mcs_spin_unlock(&lock->mcs_lock, node);
              preempt_enable();
              return 0;
          }
-        mcs_spin_unlock(&lock->mcs_lock, &node);
+        mcs_spin_unlock(&lock->mcs_lock, node);

          /*
           * When there's no owner, we might have preempted between the




^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-02-05 21:44         ` Waiman Long
@ 2014-02-06 14:04           ` Peter Zijlstra
  2014-02-06 18:45             ` Waiman Long
  2014-02-06 17:44           ` Jason Low
  1 sibling, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2014-02-06 14:04 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jason Low, mingo, paulmck, torvalds, tglx, linux-kernel, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On Wed, Feb 05, 2014 at 04:44:34PM -0500, Waiman Long wrote:
> I have an alternative way of breaking out of the MCS lock waiting queue when
> need_resched() is set. I overload the locked flag to indicate a skipped node
> if negative. 

I'm not quite seeing how it works (then again, I've not really read the
patch carefully).

Suppose you break out; at that point you get queued and go to sleep.
Suppose you got woken up while your MCS entry is still 'pending' and
magically win the race and acquire the lock.

At that point your MCS entry can be re-used while it's still part of the
list.

It's a fantastically small race window, but I don't see anything that
makes it impossible.
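
Roughly the interleaving in question (CPU/task labels purely illustrative):

	/*
	 *  waiter on CPU A (per-CPU node N)     rest of the MCS queue
	 *  --------------------------------     ---------------------
	 *  queued, sees need_resched(),
	 *  marks N skipped, drops into the
	 *  mutex slowpath and sleeps
	 *                                        the queue head has not yet
	 *                                        walked past / cleared N
	 *  woken by an unlock, wins the race,
	 *  acquires the mutex, returns
	 *  a later mutex_lock() on CPU A then
	 *  wants to reuse N while N is still
	 *  linked into the old queue
	 */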

> I run the patch through the AIM7 high-systime workload on a
> 4-socket server and it seemed to run fine.

How do people run this AIM7 piece of shit? I let it run for over an hour
and it generated exactly 0 numbers, it just sits there eating cpu-time
and creating a racket from my pantry.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to  spin_on_owner if need_resched() triggered while queued
@ 2014-02-06 14:52 Daniel J Blueman
  0 siblings, 0 replies; 26+ messages in thread
From: Daniel J Blueman @ 2014-02-06 14:52 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, Waiman Long

On Thursday, 6 February 2014 22:10:01 UTC+8, Peter Zijlstra  wrote:
 > On Wed, Feb 05, 2014 at 04:44:34PM -0500, Waiman Long wrote:
 > > I have an alternative way of breaking out of the MCS lock waiting queue when
 > > need_resched() is set. I overload the locked flag to indicate a skipped node
 > > if negative.
 >
 > I'm not quite seeing how it works (then again, I've not really read the
 > patch carefully).
 >
 > Suppose you break out; at that point you get queued and go to sleep.
 > Suppose you got woken up while you MCS entry is still 'pending' and
 > magically win the race and acquire the lock.
 >
 > At that point your MCS entry can be re-used while its still part of the
 > list.
 >
 > Its a fantastically small race window, but I don't see anything that
 > makes it impossible.
 >
 > > I run the patch through the AIM7 high-systime workload on a
 > > 4-socket server and it seemed to run fine.
 >
 > How do people run this AIM7 piece of shit? I let it run for over an hour
 > and it generated exactly 0 numbers, it just sits there eating cpu-time
 > and creating a racket from my pantry.

Without any better advice, I was building the OSDL AIM7 [1], tweaking 
DISKDIR in data/reaim.config and running (e.g. on a 384-core setup):
$ src/reaim -c data/reaim.config -f data/workfile.compute -i 16 -e 384

Thanks,
   Daniel

[1] http://sourceforge.net/projects/re-aim-7/
-- 
Daniel J Blueman
Principal Software Engineer, Numascale

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-02-05 21:44         ` Waiman Long
  2014-02-06 14:04           ` Peter Zijlstra
@ 2014-02-06 17:44           ` Jason Low
  2014-02-06 18:37             ` Waiman Long
  1 sibling, 1 reply; 26+ messages in thread
From: Jason Low @ 2014-02-06 17:44 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, mingo, paulmck, torvalds, tglx, linux-kernel,
	riel, akpm, davidlohr, hpa, andi, aswin, scott.norton,
	chegu_vinod

On Wed, 2014-02-05 at 16:44 -0500, Waiman Long wrote:
> On 01/29/2014 06:51 AM, Peter Zijlstra wrote:
> > On Tue, Jan 28, 2014 at 02:51:35PM -0800, Jason Low wrote:
> >>> But urgh, nasty problem. Lemme ponder this a bit.
> > OK, please have a very careful look at the below. It survived a boot
> > with udev -- which usually stresses mutex contention enough to explode
> > (in fact it did a few time when I got the contention/cancel path wrong),
> > however I have not ran anything else on it.
> >
> > The below is an MCS variant that allows relatively cheap unqueueing. But
> > its somewhat tricky and I might have gotten a case wrong, esp. the
> > double concurrent cancel case got my head hurting (I didn't attempt a
> > tripple unqueue).
> >
> > Applies to tip/master but does generate a few (harmless) compile
> > warnings because I didn't fully clean up the mcs_spinlock vs m_spinlock
> > thing.
> >
> > Also, there's a comment in the slowpath that bears consideration.
> >
> >
> 
> I have an alternative way of breaking out of the MCS lock waiting queue 
> when need_resched() is set. I overload the locked flag to indicate a 
> skipped node if negative. I run the patch through the AIM7 high-systime 
> workload on a 4-socket server and it seemed to run fine.
> 
> Please check the following POC patch to see if you have any comment.

So one of the concerns I had with the approach of skipping nodes was
that, under heavy contention, we potentially could cause optimistic
spinning to be disabled on CPUs for a while since the nodes can't be
used until they have been released. One advantage of the unqueuing
method would be that nodes are usable after the spinners exit the MCS
queue and go to sleep.

Jason


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-02-06 17:44           ` Jason Low
@ 2014-02-06 18:37             ` Waiman Long
  0 siblings, 0 replies; 26+ messages in thread
From: Waiman Long @ 2014-02-06 18:37 UTC (permalink / raw)
  To: Jason Low
  Cc: Peter Zijlstra, mingo, paulmck, torvalds, tglx, linux-kernel,
	riel, akpm, davidlohr, hpa, andi, aswin, scott.norton,
	chegu_vinod

On 02/06/2014 12:44 PM, Jason Low wrote:
> On Wed, 2014-02-05 at 16:44 -0500, Waiman Long wrote:
>> On 01/29/2014 06:51 AM, Peter Zijlstra wrote:
>>> On Tue, Jan 28, 2014 at 02:51:35PM -0800, Jason Low wrote:
>>>>> But urgh, nasty problem. Lemme ponder this a bit.
>>> OK, please have a very careful look at the below. It survived a boot
>>> with udev -- which usually stresses mutex contention enough to explode
>>> (in fact it did a few time when I got the contention/cancel path wrong),
>>> however I have not ran anything else on it.
>>>
>>> The below is an MCS variant that allows relatively cheap unqueueing. But
>>> its somewhat tricky and I might have gotten a case wrong, esp. the
>>> double concurrent cancel case got my head hurting (I didn't attempt a
>>> tripple unqueue).
>>>
>>> Applies to tip/master but does generate a few (harmless) compile
>>> warnings because I didn't fully clean up the mcs_spinlock vs m_spinlock
>>> thing.
>>>
>>> Also, there's a comment in the slowpath that bears consideration.
>>>
>>>
>> I have an alternative way of breaking out of the MCS lock waiting queue
>> when need_resched() is set. I overload the locked flag to indicate a
>> skipped node if negative. I run the patch through the AIM7 high-systime
>> workload on a 4-socket server and it seemed to run fine.
>>
>> Please check the following POC patch to see if you have any comment.
> So one of the concerns I had with the approach of skipping nodes was
> that, under heavy contention, we potentially could cause optimistic
> spinning to be disabled on CPUs for a while since the nodes can't be
> used until they have been released. One advantage of the unqueuing
> method would be that nodes are usable after the spinners exit the MCS
> queue and go to sleep.
>
> Jason
>

Under heavy contention, when many threads are trying to access the 
mutexes using optimistic spinning, this patch can actually reduce the 
number of wasted CPU cycles waiting in the MCS spin loop and let the 
CPUs do other useful work. So I don't see that as a negative. I think 
this kind of self-tuning is actually good for the overall throughput of 
the system.

-Longman

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-02-06 14:04           ` Peter Zijlstra
@ 2014-02-06 18:45             ` Waiman Long
  2014-02-06 20:10               ` Norton, Scott J
  0 siblings, 1 reply; 26+ messages in thread
From: Waiman Long @ 2014-02-06 18:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jason Low, mingo, paulmck, torvalds, tglx, linux-kernel, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On 02/06/2014 09:04 AM, Peter Zijlstra wrote:
> On Wed, Feb 05, 2014 at 04:44:34PM -0500, Waiman Long wrote:
>> I have an alternative way of breaking out of the MCS lock waiting queue when
>> need_resched() is set. I overload the locked flag to indicate a skipped node
>> if negative.
> I'm not quite seeing how it works (then again, I've not really read the
> patch carefully).
>
> Suppose you break out; at that point you get queued and go to sleep.
> Suppose you got woken up while you MCS entry is still 'pending' and
> magically win the race and acquire the lock.
>
> At that point your MCS entry can be re-used while its still part of the
> list.

Actually the MCS node entry cannot be reused until an MCS queue head 
task traverses that entry and clears the locked flag. So it is possible 
that the affected CPU won't be able to participate in optimistic 
spinning for a while if the mutex is heavily contended.
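
For reference, this is enforced by the guard in the POC patch earlier in
the thread -- a node whose locked flag is still negative is simply not
handed out for spinning again:

	node = this_cpu_ptr(&m_node);
	if (node->locked < 0)
		/* MCS node not available yet */
		goto slowpath;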

> Its a fantastically small race window, but I don't see anything that
> makes it impossible.

I will send out an official patch with some performance data to solicit 
further feedback.

>> I run the patch through the AIM7 high-systime workload on a
>> 4-socket server and it seemed to run fine.
> How do people run this AIM7 piece of shit? I let it run for over an hour
> and it generated exactly 0 numbers, it just sits there eating cpu-time
> and creating a racket from my pantry.

AIM7 can be tricky to set up. Fortunately, someone in our team had done 
the groundwork, so I could just grab and run it.

-Longman

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-02-06 18:45             ` Waiman Long
@ 2014-02-06 20:10               ` Norton, Scott J
  2014-02-10 17:01                 ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Norton, Scott J @ 2014-02-06 20:10 UTC (permalink / raw)
  To: Long, Wai Man, Peter Zijlstra
  Cc: Low, Jason, mingo@redhat.com, paulmck@linux.vnet.ibm.com,
	torvalds@linux-foundation.org, tglx@linutronix.de,
	linux-kernel@vger.kernel.org, riel@redhat.com,
	akpm@linux-foundation.org, Bueso, Davidlohr, hpa@zytor.com,
	andi@firstfloor.org, Chandramouleeswaran, Aswin, Vinod, Chegu

>> I run the patch through the AIM7 high-systime workload on a
>> 4-socket server and it seemed to run fine.
> How do people run this AIM7 piece of shit? I let it run for over an hour
> and it generated exactly 0 numbers, it just sits there eating cpu-time
> and creating a racket from my pantry.

./reaim -s100  -e2000 -t -j100 -i100 -f workfile.high_systime

The reaim.config file contains:

FILESIZE 10k
POOLSIZE 1m
DISKDIR /t0
DISKDIR /t1
DISKDIR /t2
DISKDIR /t3
DISKDIR /t4
DISKDIR /t5
DISKDIR /t6
DISKDIR /t7
DISKDIR /t8
DISKDIR /t9
DISKDIR /t10
DISKDIR /t11
DISKDIR /t12
DISKDIR /t13
DISKDIR /t14
DISKDIR /t15

The way Longman uses this is to create 16 ramdisk filesystems through
/dev/ram* and then mount those filesystems on the /t* directories,
although you could also run it against a regular filesystem. It will use
whatever you place in the reaim.config file as DISKDIR. You can specify
one or more DISKDIR directories.


Scott


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued
  2014-02-06 20:10               ` Norton, Scott J
@ 2014-02-10 17:01                 ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2014-02-10 17:01 UTC (permalink / raw)
  To: Norton, Scott J
  Cc: Long, Wai Man, Low, Jason, mingo@redhat.com,
	paulmck@linux.vnet.ibm.com, torvalds@linux-foundation.org,
	tglx@linutronix.de, linux-kernel@vger.kernel.org, riel@redhat.com,
	akpm@linux-foundation.org, Bueso, Davidlohr, hpa@zytor.com,
	andi@firstfloor.org, Chandramouleeswaran, Aswin, Vinod, Chegu

On Thu, Feb 06, 2014 at 08:10:02PM +0000, Norton, Scott J wrote:
> > How do people run this AIM7 piece of shit? I let it run for over an hour
> > and it generated exactly 0 numbers, it just sits there eating cpu-time
> > and creating a racket from my pantry.
> 
> ./reaim -s100  -e2000 -t -j100 -i100 -f workfile.high_systime
> 
> The reaim.config file contains:
> 
> FILESIZE 10k
> POOLSIZE 1m
> DISKDIR /t0
> DISKDIR /t1
> DISKDIR /t2
> DISKDIR /t3
> DISKDIR /t4
> DISKDIR /t5
> DISKDIR /t6
> DISKDIR /t7
> DISKDIR /t8
> DISKDIR /t9
> DISKDIR /t10
> DISKDIR /t11
> DISKDIR /t12
> DISKDIR /t13
> DISKDIR /t14
> DISKDIR /t15
> 
> The way Longman uses this is to create 16 ramdisk filesystems through
> /dev/ram* and then mount those filesystems to the /t* directories.
> Although you could run it through a regular filesystem also. It will use
> whatever you place in the reaim.config file as DISKDIR. You can specify
> one or more DISKDIR directories.

OK, and we're back to creating a racket but not producing useful
numbers; how long is that crap supposed to run before it gives a number?

Surely it can produce a useful number after a few minutes of runtime..
Letting it run for hours is just a waste of time and money.

Note that I'm running this on a WSM-EP with (2*6*2) 24 CPUs and 24G of ram.

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2014-02-10 17:02 UTC | newest]

Thread overview: 26+ messages
2014-02-06 14:52 [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued Daniel J Blueman
  -- strict thread matches above, loose matches on Subject: below --
2014-01-28 19:13 [PATCH v2 0/5] mutex: Mutex scalability patches Jason Low
2014-01-28 19:13 ` [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued Jason Low
2014-01-28 21:07   ` Peter Zijlstra
2014-01-28 22:51     ` Jason Low
2014-01-29 11:51       ` Peter Zijlstra
2014-01-31  3:29         ` Jason Low
2014-01-31 14:09           ` Peter Zijlstra
2014-01-31 20:01             ` Jason Low
2014-01-31 20:08               ` Peter Zijlstra
2014-02-02 21:01                 ` Jason Low
2014-02-02 21:12                   ` Peter Zijlstra
2014-02-03 18:39                     ` Jason Low
2014-02-03 19:25                       ` Peter Zijlstra
2014-02-03 20:55                         ` Jason Low
2014-02-03 21:06                           ` Peter Zijlstra
2014-02-03 21:56                             ` Jason Low
2014-02-04  7:13                         ` Jason Low
2014-02-02 22:02                 ` Paul E. McKenney
2014-02-02 20:02             ` Peter Zijlstra
2014-02-05 21:44         ` Waiman Long
2014-02-06 14:04           ` Peter Zijlstra
2014-02-06 18:45             ` Waiman Long
2014-02-06 20:10               ` Norton, Scott J
2014-02-10 17:01                 ` Peter Zijlstra
2014-02-06 17:44           ` Jason Low
2014-02-06 18:37             ` Waiman Long
