* [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI
From: Darren Hart @ 2010-07-13  8:03 UTC (permalink / raw)
To: lkml
Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Eric Dumazet, John Kacur,
    Steven Rostedt, Mike Galbraith, linux-rt-users

Thanks to Thomas, Steven, and Mike for hashing this out with me. After an
IRC discussion with Thomas, I put the following together. It resolves the
issue for me. Mike, please test and let us know if it fixes it for you.
A couple of points of discussion before we commit this:

The use of the new state flag, PI_WAKEUP_INPROGRESS, is pretty ugly. Would
a new task_pi_blocked_on_valid() method be preferred (in rtmutex.c)?

The new WARN_ON() in task_blocks_on_rt_mutex() is complex. It didn't exist
before and we've now closed this gap; should we just drop it?

I've added a couple of BUG_ON()s in futex_wait_requeue_pi() dealing with
the race with requeue and q.lock_ptr. I'd like to leave this for the time
being if nobody strongly objects.

Thanks,
Darren

From 93fd3bb97800ebf5e5c1a6a85937bab93256dd42 Mon Sep 17 00:00:00 2001
From: Darren Hart <dvhltc@us.ibm.com>
Date: Fri, 9 Jul 2010 17:50:23 -0400
Subject: [PATCH 1/2] futex: protect against pi_blocked_on corruption during requeue PI

The requeue_pi mechanism introduced proxy locking of the rtmutex. This
creates a scenario where a task can wake up not knowing it has been
enqueued on an rtmutex. Blocking on the hb->lock can then overwrite a
valid value in current->pi_blocked_on, leading to an inconsistent state.
Prevent overwriting pi_blocked_on by serializing on the waiter's pi_lock (a raw_spinlock) and using the new PI_WAKEUP_INPROGRESS state flag to indicate a waiter that has been woken by a timeout or signal. This prevents the rtmutex code from adding the waiter to the rtmutex wait list, returning EAGAIN to futex_requeue(), which will in turn ignore the waiter during a requeue. Care is taken to allow current to block on locks even if PI_WAKEUP_INPROGRESS is set. During normal wakeup, this results in one less hb->lock protected section. In the pre-requeue-timeout-or-signal wakeup, this removes the "greedy locking" behavior, no attempt will be made to acquire the lock. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: John Kacur <jkacur@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Galbraith <efault@gmx.de> --- kernel/futex.c | 50 +++++++++++++++++++++++++++++++++------------- kernel/rtmutex.c | 45 ++++++++++++++++++++++++++++++++++------- kernel/rtmutex_common.h | 1 + kernel/sched.c | 5 +++- 4 files changed, 78 insertions(+), 23 deletions(-) diff --git a/kernel/futex.c b/kernel/futex.c index a6cec32..c92978d 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1336,6 +1336,9 @@ retry_private: requeue_pi_wake_futex(this, &key2, hb2); drop_count++; continue; + } else if (ret == -EAGAIN) { + /* Waiter woken by timeout or signal. 
*/ + continue; } else if (ret) { /* -EDEADLK */ this->pi_state = NULL; @@ -2211,9 +2214,9 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared, int clockrt, u32 __user *uaddr2) { struct hrtimer_sleeper timeout, *to = NULL; + struct futex_hash_bucket *hb, *hb2; struct rt_mutex_waiter rt_waiter; struct rt_mutex *pi_mutex = NULL; - struct futex_hash_bucket *hb; union futex_key key2; struct futex_q q; int res, ret; @@ -2255,18 +2258,33 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared, /* Queue the futex_q, drop the hb lock, wait for wakeup. */ futex_wait_queue_me(hb, &q, to); - spin_lock(&hb->lock); - ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to); - spin_unlock(&hb->lock); - if (ret) - goto out_put_keys; - /* - * In order for us to be here, we know our q.key == key2, and since - * we took the hb->lock above, we also know that futex_requeue() has - * completed and we no longer have to concern ourselves with a wakeup - * race with the atomic proxy lock acquition by the requeue code. + * Avoid races with requeue and trying to block on two mutexes + * (hb->lock and uaddr2's rtmutex) by serializing access to + * pi_blocked_on with pi_lock and setting PI_BLOCKED_ON_PENDING. + */ + raw_spin_lock(¤t->pi_lock); + if (current->pi_blocked_on) { + raw_spin_unlock(¤t->pi_lock); + } else { + current->pi_blocked_on = (struct rt_mutex_waiter *)PI_WAKEUP_INPROGRESS; + raw_spin_unlock(¤t->pi_lock); + + spin_lock(&hb->lock); + ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to); + spin_unlock(&hb->lock); + if (ret) + goto out_put_keys; + } + + /* + * In order to be here, we have either been requeued, are in the process + * of being requeued, or requeue successfully acquired uaddr2 on our + * behalf. If pi_blocked_on was non-null above, we may be racing with a + * requeue. Do not rely on q->lock_ptr to be hb2->lock until after + * blocking on hb->lock or hb2->lock. 
*/ + hb2 = hash_futex(&key2); /* Check if the requeue code acquired the second futex for us. */ if (!q.rt_waiter) { @@ -2275,10 +2293,12 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared, * did a lock-steal - fix up the PI-state in that case. */ if (q.pi_state && (q.pi_state->owner != current)) { - spin_lock(q.lock_ptr); + spin_lock(&hb2->lock); + BUG_ON(&hb2->lock != q.lock_ptr); + ret = fixup_pi_state_owner(uaddr2, &q, current, fshared); - spin_unlock(q.lock_ptr); + spin_unlock(&hb2->lock); } } else { /* @@ -2291,7 +2311,9 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared, ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter, 1); debug_rt_mutex_free_waiter(&rt_waiter); - spin_lock(q.lock_ptr); + spin_lock(&hb2->lock); + BUG_ON(&hb2->lock != q.lock_ptr); + /* * Fixup the pi_state owner and possibly acquire the lock if we * haven't already. diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c index 23dd443..0399108 100644 --- a/kernel/rtmutex.c +++ b/kernel/rtmutex.c @@ -227,7 +227,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task, * reached or the state of the chain has changed while we * dropped the locks. */ - if (!waiter || !waiter->task) + if (!waiter || (long)waiter == PI_WAKEUP_INPROGRESS || !waiter->task) goto out_unlock_pi; /* @@ -448,6 +448,21 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock, int chain_walk = 0, res; raw_spin_lock(&task->pi_lock); + + /* + * In the case of futex requeue PI, this will be a proxy lock. The task + * will wake unaware that it is enqueueed on this lock. Avoid blocking + * on two locks and corrupting pi_blocked_on via the + * PI_WAKEUP_INPROGRESS flag. futex_wait_requeue_pi() sets this when it + * wakes up before requeue (due to a signal or timeout). Do not enqueue + * the task if PI_WAKEUP_INPROGRESS is set. 
+ */ + if (task != current && + (long)task->pi_blocked_on == PI_WAKEUP_INPROGRESS) { + raw_spin_unlock(&task->pi_lock); + return -EAGAIN; + } + __rt_mutex_adjust_prio(task); waiter->task = task; waiter->lock = lock; @@ -459,6 +474,15 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock, top_waiter = rt_mutex_top_waiter(lock); plist_add(&waiter->list_entry, &lock->wait_list); + /* + * Tasks can only block on one lock at a time. In the case of futex + * requeue PI, if task == current it may have set PI_WAKEUP_INPROGRESS + * to prevent requeue, but it will still need to acquire locks on its + * way out of futex_wait_requeue_pi(). + */ + WARN_ON(task->pi_blocked_on != NULL && + (task != current || (long)task->pi_blocked_on != PI_WAKEUP_INPROGRESS)); + task->pi_blocked_on = waiter; raw_spin_unlock(&task->pi_lock); @@ -469,7 +493,8 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock, plist_add(&waiter->pi_list_entry, &owner->pi_waiters); __rt_mutex_adjust_prio(owner); - if (owner->pi_blocked_on) + if (owner->pi_blocked_on && + (long)owner->pi_blocked_on != PI_WAKEUP_INPROGRESS) chain_walk = 1; raw_spin_unlock(&owner->pi_lock); } @@ -579,9 +604,11 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate) raw_spin_lock(&pendowner->pi_lock); - WARN_ON(!pendowner->pi_blocked_on); - WARN_ON(pendowner->pi_blocked_on != waiter); - WARN_ON(pendowner->pi_blocked_on->lock != lock); + if (!WARN_ON(!pendowner->pi_blocked_on) && + !WARN_ON((long)pendowner->pi_blocked_on == PI_WAKEUP_INPROGRESS)) { + WARN_ON(pendowner->pi_blocked_on != waiter); + WARN_ON(pendowner->pi_blocked_on->lock != lock); + } pendowner->pi_blocked_on = NULL; @@ -624,7 +651,8 @@ static void remove_waiter(struct rt_mutex *lock, } __rt_mutex_adjust_prio(owner); - if (owner->pi_blocked_on) + if (owner->pi_blocked_on && + (long)owner->pi_blocked_on != PI_WAKEUP_INPROGRESS) chain_walk = 1; raw_spin_unlock(&owner->pi_lock); @@ -658,7 +686,8 @@ void rt_mutex_adjust_pi(struct task_struct 
*task) raw_spin_lock_irqsave(&task->pi_lock, flags); waiter = task->pi_blocked_on; - if (!waiter || waiter->list_entry.prio == task->prio) { + if (!waiter || (long)waiter == PI_WAKEUP_INPROGRESS || + waiter->list_entry.prio == task->prio) { raw_spin_unlock_irqrestore(&task->pi_lock, flags); return; } @@ -1527,7 +1556,7 @@ int rt_mutex_start_proxy_lock(struct rt_mutex *lock, ret = task_blocks_on_rt_mutex(lock, waiter, task, detect_deadlock, flags); - if (ret && !waiter->task) { + if (ret == -EDEADLK && !waiter->task) { /* * Reset the return value. We might have * returned with -EDEADLK and the owner diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h index 4df690c..94a856f 100644 --- a/kernel/rtmutex_common.h +++ b/kernel/rtmutex_common.h @@ -115,6 +115,7 @@ static inline unsigned long rt_mutex_owner_pending(struct rt_mutex *lock) /* * PI-futex support (proxy locking functions, etc.): */ +#define PI_WAKEUP_INPROGRESS 1 extern struct task_struct *rt_mutex_next_owner(struct rt_mutex *lock); extern void rt_mutex_init_proxy_locked(struct rt_mutex *lock, struct task_struct *proxy_owner); diff --git a/kernel/sched.c b/kernel/sched.c index aa5dced..9d4337e 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -83,6 +83,8 @@ #define CREATE_TRACE_POINTS #include <trace/events/sched.h> +#include "rtmutex_common.h" + /* * Convert user-nice values [ -20 ... 0 ... 19 ] * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ], @@ -6377,7 +6379,8 @@ void task_setprio(struct task_struct *p, int prio) */ if (unlikely(p == rq->idle)) { WARN_ON(p != rq->curr); - WARN_ON(p->pi_blocked_on); + WARN_ON(p->pi_blocked_on && + (long)p->pi_blocked_on != PI_WAKEUP_INPROGRESS); goto out_unlock; } -- 1.7.0.4 ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI 2010-07-13 8:03 ` [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI Darren Hart @ 2010-07-13 9:25 ` Thomas Gleixner 2010-07-13 10:28 ` Thomas Gleixner 2010-07-13 9:58 ` [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI Thomas Gleixner 1 sibling, 1 reply; 8+ messages in thread From: Thomas Gleixner @ 2010-07-13 9:25 UTC (permalink / raw) To: Darren Hart Cc: lkml, , Peter Zijlstra, Ingo Molnar, Eric Dumazet, John Kacur, Steven Rostedt, Mike Galbraith, linux-rt-users On Tue, 13 Jul 2010, Darren Hart wrote: > Thanks to Thomas, Steven, and Mike for hashing this over me. After an > IRC discussion with Thomas, I put the following together. It resolves > the issue for me, Mike please test and let us know if it fixes it for > you. A couple of points of discussion before we commit this: > > The use of the new state flag, PI_WAKEUP_INPROGRESS, is pretty ugly. > Would a new task_pi_blocked_on_valid() method be preferred (in > rtmutex.c)? > > The new WARN_ON() in task_blocks_on_rt_mutex() is complex. It didn't > exist before and we've now closed this gap, should we just drop it? We can simplify it to: WARN_ON(task->pi_blocked_on && task->pi_blocked_on != PI_WAKEUP_INPROGRESS); We check for !=current and PI_WAKEUP_INPROGRESS just above. > I've added a couple BUG_ON()s in futex_wait_requeue_pi() dealing with > the race with requeue and q.lock_ptr. I'd like to leave this for the > time being if nobody strongly objects. > - > /* > - * In order for us to be here, we know our q.key == key2, and since > - * we took the hb->lock above, we also know that futex_requeue() has > - * completed and we no longer have to concern ourselves with a wakeup > - * race with the atomic proxy lock acquition by the requeue code. 
> + * Avoid races with requeue and trying to block on two mutexes > + * (hb->lock and uaddr2's rtmutex) by serializing access to > + * pi_blocked_on with pi_lock and setting PI_BLOCKED_ON_PENDING. > + */ > + raw_spin_lock(¤t->pi_lock); Needs to be raw_spin_lock_irq() > + if (current->pi_blocked_on) { > + raw_spin_unlock(¤t->pi_lock); > + } else { > + current->pi_blocked_on = (struct rt_mutex_waiter *)PI_WAKEUP_INPROGRESS; #define PI_WAKEUP_INPROGRESS ((struct rt_mutex_waiter *) 1) perhaps ? That gets rid of all type casts > + raw_spin_unlock(¤t->pi_lock); > + > + spin_lock(&hb->lock); We need to cleanup current->pi_blocked_on here. If we succeed in the hb->lock fast path then we might leak the PI_WAKEUP_INPROGRESS to user space and the next requeue will fail. > diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c > index 23dd443..0399108 100644 > --- a/kernel/rtmutex.c > +++ b/kernel/rtmutex.c > @@ -227,7 +227,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task, > * reached or the state of the chain has changed while we > * dropped the locks. > */ > - if (!waiter || !waiter->task) > + if (!waiter || (long)waiter == PI_WAKEUP_INPROGRESS || !waiter->task) > goto out_unlock_pi; Why do we need that check ? Either the requeue succeeded then task->pi_blocked_on is set to the real waiter or the wakeup won and we are in no lock chain. If we ever find a waiter with PI_WAKEUP_INPROGRESS set in rt_mutex_adjust_prio_chain() then it's a bug nothing else. 
> @@ -469,7 +493,8 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock, > plist_add(&waiter->pi_list_entry, &owner->pi_waiters); > > __rt_mutex_adjust_prio(owner); > - if (owner->pi_blocked_on) > + if (owner->pi_blocked_on && > + (long)owner->pi_blocked_on != PI_WAKEUP_INPROGRESS) Again, that can never happen > chain_walk = 1; > raw_spin_unlock(&owner->pi_lock); > } > @@ -579,9 +604,11 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate) > > raw_spin_lock(&pendowner->pi_lock); > > - WARN_ON(!pendowner->pi_blocked_on); > - WARN_ON(pendowner->pi_blocked_on != waiter); > - WARN_ON(pendowner->pi_blocked_on->lock != lock); > + if (!WARN_ON(!pendowner->pi_blocked_on) && > + !WARN_ON((long)pendowner->pi_blocked_on == PI_WAKEUP_INPROGRESS)) { Ditto > + WARN_ON(pendowner->pi_blocked_on != waiter); > + WARN_ON(pendowner->pi_blocked_on->lock != lock); > + } > > pendowner->pi_blocked_on = NULL; > > @@ -624,7 +651,8 @@ static void remove_waiter(struct rt_mutex *lock, > } > __rt_mutex_adjust_prio(owner); > > - if (owner->pi_blocked_on) > + if (owner->pi_blocked_on && > + (long)owner->pi_blocked_on != PI_WAKEUP_INPROGRESS) > chain_walk = 1; Same here. > raw_spin_unlock(&owner->pi_lock); > @@ -658,7 +686,8 @@ void rt_mutex_adjust_pi(struct task_struct *task) > raw_spin_lock_irqsave(&task->pi_lock, flags); > > waiter = task->pi_blocked_on; > - if (!waiter || waiter->list_entry.prio == task->prio) { > + if (!waiter || (long)waiter == PI_WAKEUP_INPROGRESS || > + waiter->list_entry.prio == task->prio) { And here > /* > * Convert user-nice values [ -20 ... 0 ... 19 ] > * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ], > @@ -6377,7 +6379,8 @@ void task_setprio(struct task_struct *p, int prio) > */ > if (unlikely(p == rq->idle)) { > WARN_ON(p != rq->curr); > - WARN_ON(p->pi_blocked_on); > + WARN_ON(p->pi_blocked_on && > + (long)p->pi_blocked_on != PI_WAKEUP_INPROGRESS); Yuck. Paranoia ? If we ever requeue idle, then ..... 
I'm going to cleanup the stuff and send out a new patch for Mike to test. Thanks, tglx ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI 2010-07-13 9:25 ` Thomas Gleixner @ 2010-07-13 10:28 ` Thomas Gleixner 2010-07-13 11:52 ` [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI -V2 Thomas Gleixner 0 siblings, 1 reply; 8+ messages in thread From: Thomas Gleixner @ 2010-07-13 10:28 UTC (permalink / raw) To: Darren Hart Cc: lkml, , Peter Zijlstra, Ingo Molnar, Eric Dumazet, John Kacur, Steven Rostedt, Mike Galbraith, linux-rt-users On Tue, 13 Jul 2010, Thomas Gleixner wrote: > On Tue, 13 Jul 2010, Darren Hart wrote: > > > --- a/kernel/rtmutex.c > > +++ b/kernel/rtmutex.c > > @@ -227,7 +227,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task, > > * reached or the state of the chain has changed while we > > * dropped the locks. > > */ > > - if (!waiter || !waiter->task) > > + if (!waiter || (long)waiter == PI_WAKEUP_INPROGRESS || !waiter->task) > > goto out_unlock_pi; > > Why do we need that check ? Either the requeue succeeded then > task->pi_blocked_on is set to the real waiter or the wakeup won and > we are in no lock chain. > > If we ever find a waiter with PI_WAKEUP_INPROGRESS set in > rt_mutex_adjust_prio_chain() then it's a bug nothing else. Grrr, I'm wrong. If we take hb->lock in the fast path then something else might try to boost us and trip over this :( This code causes braindamage. I really wonder whether we need to remove it according to the "United Nations Convention against Torture and Other Cruel, Inhuman or Degrading Treatment or Punishment". > > @@ -6377,7 +6379,8 @@ void task_setprio(struct task_struct *p, int prio) > > */ > > if (unlikely(p == rq->idle)) { > > WARN_ON(p != rq->curr); > > - WARN_ON(p->pi_blocked_on); > > + WARN_ON(p->pi_blocked_on && > > + (long)p->pi_blocked_on != PI_WAKEUP_INPROGRESS); > > Yuck. Paranoia ? If we ever requeue idle, then ..... At least one which is bogus :) Thanks, tglx ^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI -V2 2010-07-13 10:28 ` Thomas Gleixner @ 2010-07-13 11:52 ` Thomas Gleixner 2010-07-13 15:57 ` Mike Galbraith ` (2 more replies) 0 siblings, 3 replies; 8+ messages in thread From: Thomas Gleixner @ 2010-07-13 11:52 UTC (permalink / raw) To: Darren Hart Cc: lkml, , Peter Zijlstra, Ingo Molnar, Eric Dumazet, John Kacur, Steven Rostedt, Mike Galbraith, linux-rt-users On Tue, 13 Jul 2010, Thomas Gleixner wrote: > > This code causes braindamage. I really wonder whether we need to > remove it according to the "United Nations Convention against Torture > and Other Cruel, Inhuman or Degrading Treatment or Punishment". > Ok, finally managed to twist my brain around it. Mike, can you give it a test ride ? Thanks, tglx --------> Subject: futex: Protect against pi_blocked_on corruption during requeue PI From: Darren Hart <dvhltc@us.ibm.com> Date: Fri, 9 Jul 2010 17:50:23 -0400 The requeue_pi mechanism introduced proxy locking of the rtmutex. This creates a scenario where a task can wakeup, not knowing it has been enqueued on an rtmutex. Blocking on an hb->lock() can overwrite a valid value in current->pi_blocked_on, leading to an inconsistent state. Prevent overwriting pi_blocked_on by serializing on the waiter's pi_lock and using the new PI_WAKEUP_INPROGRESS state flag to indicate a waiter that has been woken by a timeout or signal. This prevents the rtmutex code from adding the waiter to the rtmutex wait list, returning EAGAIN to futex_requeue(), which will in turn ignore the waiter during a requeue. Care is taken to allow current to block on locks even if PI_WAKEUP_INPROGRESS is set. During normal wakeup, this results in one less hb->lock protected section. In the pre-requeue-timeout-or-signal wakeup, this removes the "greedy locking" behavior, no attempt will be made to acquire the lock. 
[ tglx: take pi_lock with lock_irq(), removed paranoid warning, plugged pi_state and pi_blocked_on leak, adjusted some comments ] Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: John Kacur <jkacur@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <4C3C1DCF.9090509@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- kernel/futex.c | 50 +++++++++++++++++++++++++++++++++------------- kernel/rtmutex.c | 45 ++++++++++++++++++++++++++++++++++------- kernel/rtmutex_common.h | 1 + kernel/sched.c | 5 +++- 4 files changed, 78 insertions(+), 23 deletions(-) Index: linux-2.6-tip/kernel/futex.c =================================================================== --- linux-2.6-tip.orig/kernel/futex.c +++ linux-2.6-tip/kernel/futex.c @@ -1336,6 +1336,16 @@ retry_private: requeue_pi_wake_futex(this, &key2, hb2); drop_count++; continue; + } else if (ret == -EAGAIN) { + /* + * Waiter was woken by timeout or + * signal and has set pi_blocked_on to + * PI_WAKEUP_INPROGRESS before we + * tried to enqueue it on the rtmutex. + */ + this->pi_state = NULL; + free_pi_state(pi_state); + continue; } else if (ret) { /* -EDEADLK */ this->pi_state = NULL; @@ -2211,9 +2221,9 @@ static int futex_wait_requeue_pi(u32 __u int clockrt, u32 __user *uaddr2) { struct hrtimer_sleeper timeout, *to = NULL; + struct futex_hash_bucket *hb, *hb2; struct rt_mutex_waiter rt_waiter; struct rt_mutex *pi_mutex = NULL; - struct futex_hash_bucket *hb; union futex_key key2; struct futex_q q; int res, ret; @@ -2255,18 +2265,51 @@ static int futex_wait_requeue_pi(u32 __u /* Queue the futex_q, drop the hb lock, wait for wakeup. 
*/ futex_wait_queue_me(hb, &q, to); - spin_lock(&hb->lock); - ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to); - spin_unlock(&hb->lock); - if (ret) - goto out_put_keys; + /* + * Avoid races with requeue and trying to block on two mutexes + * (hb->lock and uaddr2's rtmutex) by serializing access to + * pi_blocked_on with pi_lock. + */ + raw_spin_lock_irq(¤t->pi_lock); + if (current->pi_blocked_on) { + /* Requeue happened already */ + raw_spin_unlock_irq(¤t->pi_lock); + } else { + /* + * Setting pi_blocked_on to PI_WAKEUP_INPROGRESS + * prevents a concurrent requeue from enqueuein us on + * the uaddr2 rtmutex. After that we can safely + * acquire (and possibly block on) hb->lock. + */ + current->pi_blocked_on = PI_WAKEUP_INPROGRESS; + raw_spin_unlock_irq(¤t->pi_lock); + + spin_lock(&hb->lock); + + /* + * Clean up pi_blocked_on. We might leak it otherwise + * when we succeeded with the hb->lock in the fast + * path. + */ + raw_spin_lock_irq(¤t->pi_lock); + current->pi_blocked_on = NULL; + raw_spin_unlock_irq(¤t->pi_lock); + + ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to); + spin_unlock(&hb->lock); + if (ret) + goto out_put_keys; + } /* - * In order for us to be here, we know our q.key == key2, and since - * we took the hb->lock above, we also know that futex_requeue() has - * completed and we no longer have to concern ourselves with a wakeup - * race with the atomic proxy lock acquition by the requeue code. + * In order to be here, we have either been requeued, are in + * the process of being requeued, or requeue successfully + * acquired uaddr2 on our behalf. If pi_blocked_on was + * non-null above, we may be racing with a requeue. Do not + * rely on q->lock_ptr to be hb2->lock until after blocking on + * hb->lock or hb2->lock. */ + hb2 = hash_futex(&key2); /* Check if the requeue code acquired the second futex for us. 
*/ if (!q.rt_waiter) { @@ -2275,10 +2318,12 @@ static int futex_wait_requeue_pi(u32 __u * did a lock-steal - fix up the PI-state in that case. */ if (q.pi_state && (q.pi_state->owner != current)) { - spin_lock(q.lock_ptr); + spin_lock(&hb2->lock); + BUG_ON(&hb2->lock != q.lock_ptr); + ret = fixup_pi_state_owner(uaddr2, &q, current, fshared); - spin_unlock(q.lock_ptr); + spin_unlock(&hb2->lock); } } else { /* @@ -2291,7 +2336,9 @@ static int futex_wait_requeue_pi(u32 __u ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter, 1); debug_rt_mutex_free_waiter(&rt_waiter); - spin_lock(q.lock_ptr); + spin_lock(&hb2->lock); + BUG_ON(&hb2->lock != q.lock_ptr); + /* * Fixup the pi_state owner and possibly acquire the lock if we * haven't already. Index: linux-2.6-tip/kernel/rtmutex.c =================================================================== --- linux-2.6-tip.orig/kernel/rtmutex.c +++ linux-2.6-tip/kernel/rtmutex.c @@ -82,6 +82,11 @@ static void fixup_rt_mutex_waiters(struc clear_rt_mutex_waiters(lock); } +static inline int rt_mutex_real_waiter(struct rt_mutex_waiter *waiter) +{ + return waiter && waiter != PI_WAKEUP_INPROGRESS; +} + /* * We can speed up the acquire/release, if the architecture * supports cmpxchg and if there's no debugging state to be set up @@ -227,7 +232,7 @@ static int rt_mutex_adjust_prio_chain(st * reached or the state of the chain has changed while we * dropped the locks. */ - if (!waiter || !waiter->task) + if (!rt_mutex_real_waiter(waiter) || !waiter->task) goto out_unlock_pi; /* @@ -448,6 +453,23 @@ static int task_blocks_on_rt_mutex(struc int chain_walk = 0, res; raw_spin_lock(&task->pi_lock); + + /* + * In the case of futex requeue PI, this will be a proxy + * lock. The task will wake unaware that it is enqueueed on + * this lock. Avoid blocking on two locks and corrupting + * pi_blocked_on via the PI_WAKEUP_INPROGRESS + * flag. futex_wait_requeue_pi() sets this when it wakes up + * before requeue (due to a signal or timeout). 
Do not enqueue + * the task if PI_WAKEUP_INPROGRESS is set. + */ + if (task != current && task->pi_blocked_on == PI_WAKEUP_INPROGRESS) { + raw_spin_unlock(&task->pi_lock); + return -EAGAIN; + } + + BUG_ON(rt_mutex_real_waiter(task->pi_blocked_on)); + __rt_mutex_adjust_prio(task); waiter->task = task; waiter->lock = lock; @@ -469,7 +491,7 @@ static int task_blocks_on_rt_mutex(struc plist_add(&waiter->pi_list_entry, &owner->pi_waiters); __rt_mutex_adjust_prio(owner); - if (owner->pi_blocked_on) + if (rt_mutex_real_waiter(owner->pi_blocked_on)) chain_walk = 1; raw_spin_unlock(&owner->pi_lock); } @@ -624,7 +646,7 @@ static void remove_waiter(struct rt_mute } __rt_mutex_adjust_prio(owner); - if (owner->pi_blocked_on) + if (rt_mutex_real_waiter(owner->pi_blocked_on)) chain_walk = 1; raw_spin_unlock(&owner->pi_lock); @@ -658,7 +680,8 @@ void rt_mutex_adjust_pi(struct task_stru raw_spin_lock_irqsave(&task->pi_lock, flags); waiter = task->pi_blocked_on; - if (!waiter || waiter->list_entry.prio == task->prio) { + if (!rt_mutex_real_waiter(waiter) || + waiter->list_entry.prio == task->prio) { raw_spin_unlock_irqrestore(&task->pi_lock, flags); return; } @@ -1527,7 +1550,7 @@ int rt_mutex_start_proxy_lock(struct rt_ ret = task_blocks_on_rt_mutex(lock, waiter, task, detect_deadlock, flags); - if (ret && !waiter->task) { + if (ret == -EDEADLK && !waiter->task) { /* * Reset the return value. 
We might have * returned with -EDEADLK and the owner Index: linux-2.6-tip/kernel/rtmutex_common.h =================================================================== --- linux-2.6-tip.orig/kernel/rtmutex_common.h +++ linux-2.6-tip/kernel/rtmutex_common.h @@ -115,6 +115,9 @@ static inline unsigned long rt_mutex_own /* * PI-futex support (proxy locking functions, etc.): */ + +#define PI_WAKEUP_INPROGRESS ((struct rt_mutex_waiter *) 1) + extern struct task_struct *rt_mutex_next_owner(struct rt_mutex *lock); extern void rt_mutex_init_proxy_locked(struct rt_mutex *lock, struct task_struct *proxy_owner); ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI -V2 2010-07-13 11:52 ` [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI -V2 Thomas Gleixner @ 2010-07-13 15:57 ` Mike Galbraith 2010-07-13 18:59 ` Darren Hart 2010-07-18 8:32 ` Mike Galbraith 2 siblings, 0 replies; 8+ messages in thread From: Mike Galbraith @ 2010-07-13 15:57 UTC (permalink / raw) To: Thomas Gleixner Cc: Darren Hart, lkml,, Peter Zijlstra, Ingo Molnar, Eric Dumazet, John Kacur, Steven Rostedt, linux-rt-users On Tue, 2010-07-13 at 13:52 +0200, Thomas Gleixner wrote: > On Tue, 13 Jul 2010, Thomas Gleixner wrote: > > > > This code causes braindamage. I really wonder whether we need to > > remove it according to the "United Nations Convention against Torture > > and Other Cruel, Inhuman or Degrading Treatment or Punishment". > > > > Ok, finally managed to twist my brain around it. Mike, can you give it > a test ride ? I'm away from box until Friday evening. I'll test, but it'll likely be ancient history by then. -Mike ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI -V2 2010-07-13 11:52 ` [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI -V2 Thomas Gleixner 2010-07-13 15:57 ` Mike Galbraith @ 2010-07-13 18:59 ` Darren Hart 2010-07-18 8:32 ` Mike Galbraith 2 siblings, 0 replies; 8+ messages in thread From: Darren Hart @ 2010-07-13 18:59 UTC (permalink / raw) To: Thomas Gleixner Cc: lkml, , Peter Zijlstra, Ingo Molnar, Eric Dumazet, John Kacur, Steven Rostedt, Mike Galbraith, linux-rt-users On 07/13/2010 04:52 AM, Thomas Gleixner wrote: > On Tue, 13 Jul 2010, Thomas Gleixner wrote: >> >> This code causes braindamage. I really wonder whether we need to >> remove it according to the "United Nations Convention against Torture >> and Other Cruel, Inhuman or Degrading Treatment or Punishment". >> > > Ok, finally managed to twist my brain around it. Mike, can you give it > a test ride ? Since Mike is out I built this version and ran it a few times. I saw a 100% reproduction rate previously. I haven't seen any errors in a handful of runs (~10) with this patch. I do still see the hrtimer latency message on the first run, so this is likely unrelated to the issue we're addressing: Jul 13 14:47:59 elm9m94 kernel: hrtimer: interrupt took 123924 ns As for Thomas's changes, only a couple nitpics below: > @@ -2255,18 +2265,51 @@ static int futex_wait_requeue_pi(u32 __u > /* Queue the futex_q, drop the hb lock, wait for wakeup. */ > futex_wait_queue_me(hb,&q, to); > > - spin_lock(&hb->lock); > - ret = handle_early_requeue_pi_wakeup(hb,&q,&key2, to); > - spin_unlock(&hb->lock); > - if (ret) > - goto out_put_keys; > + /* > + * Avoid races with requeue and trying to block on two mutexes > + * (hb->lock and uaddr2's rtmutex) by serializing access to > + * pi_blocked_on with pi_lock. > + */ > + raw_spin_lock_irq(¤t->pi_lock); > + if (current->pi_blocked_on) { > + /* Requeue happened already */ This comment isn't quite accurate. 
The requeue may be in progress, which means the q.lock_ptr is not trustworthy as noted below. Consider: /* * We have been requeued, or are in the process * of being requeued. */ > + raw_spin_unlock_irq(¤t->pi_lock); > + } else { > + /* > + * Setting pi_blocked_on to PI_WAKEUP_INPROGRESS > + * prevents a concurrent requeue from enqueuein us on s/enqueuein/enqueueing/ not that my dictionary thinks either one of them are words ;-) -- Darren Hart IBM Linux Technology Center Real-Time Linux Team ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI -V2 2010-07-13 11:52 ` [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI -V2 Thomas Gleixner 2010-07-13 15:57 ` Mike Galbraith 2010-07-13 18:59 ` Darren Hart @ 2010-07-18 8:32 ` Mike Galbraith 2 siblings, 0 replies; 8+ messages in thread From: Mike Galbraith @ 2010-07-18 8:32 UTC (permalink / raw) To: Thomas Gleixner Cc: Darren Hart, lkml,, Peter Zijlstra, Ingo Molnar, Eric Dumazet, John Kacur, Steven Rostedt, linux-rt-users On Tue, 2010-07-13 at 13:52 +0200, Thomas Gleixner wrote: > On Tue, 13 Jul 2010, Thomas Gleixner wrote: > Ok, finally managed to twist my brain around it. Mike, can you give it > a test ride ? It's a bit late, but I took it out for a spin, no problem noted. -Mike ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI 2010-07-13 8:03 ` [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI Darren Hart 2010-07-13 9:25 ` Thomas Gleixner @ 2010-07-13 9:58 ` Thomas Gleixner 1 sibling, 0 replies; 8+ messages in thread From: Thomas Gleixner @ 2010-07-13 9:58 UTC (permalink / raw) To: Darren Hart Cc: lkml, , Peter Zijlstra, Ingo Molnar, Eric Dumazet, John Kacur, Steven Rostedt, Mike Galbraith, linux-rt-users On Tue, 13 Jul 2010, Darren Hart wrote: > diff --git a/kernel/futex.c b/kernel/futex.c > index a6cec32..c92978d 100644 > --- a/kernel/futex.c > +++ b/kernel/futex.c > @@ -1336,6 +1336,9 @@ retry_private: > requeue_pi_wake_futex(this, &key2, hb2); > drop_count++; > continue; > + } else if (ret == -EAGAIN) { > + /* Waiter woken by timeout or signal. */ This leaks the pi_state. ^ permalink raw reply [flat|nested] 8+ messages in thread