Re: [TIP][RFC 6/7] futex: add requeue_pi calls

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Darren Hart <dvhltc@us.ibm.com>
To: "lkml, " <linux-kernel@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Steven Rostedt <rostedt@goodmis.org>,
	Sripathi Kodi <sripathik@in.ibm.com>,
	John Stultz <johnstul@linux.vnet.ibm.com>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [TIP][RFC 6/7] futex: add requeue_pi calls
Date: Thu, 05 Mar 2009 17:42:31 -0800	[thread overview]
Message-ID: <49B07F87.2020808@us.ibm.com> (raw)
In-Reply-To: <49B00302.5000607@us.ibm.com>

Darren Hart wrote:
> Darren Hart wrote:
>> Darren Hart wrote:
>>> From: Darren Hart <dvhltc@us.ibm.com>
>>>
>>> PI Futexes must have an owner at all times, so the standard requeue 
>>> commands
>>> aren't sufficient.  The new commands properly manage pi futex 
>>> ownership by
>>> ensuring a futex with waiters has an owner at all times.  Once 
>>> complete these
>>> patches will allow glibc to properly handle pi mutexes with 
>>> pthread_condvars.
>>>
>>> The approach taken here is to create two new futex op codes:
>>>
>>> FUTEX_WAIT_REQUEUE_PI:
>>> Threads will use this op code to wait on a futex (such as a non-pi 
>>> waitqueue)
>>> and wake after they have been requeued to a pi futex.  Prior to 
>>> returning to
>>> userspace, they will take this pi futex (and the underlying rt_mutex).
>>>
>>> futex_wait_requeue_pi() is currently the result of a high speed 
>>> collision
>>> between futex_wait and futex_lock_pi (with the first part of 
>>> futex_lock_pi
>>> being done by futex_requeue_pi_init() on behalf of the waiter).
>>>
>>> FUTEX_REQUEUE_PI:
>>> This call must be used to wake threads waiting with 
>>> FUTEX_WAIT_REQUEUE_PI,
>>> regardless of how many threads the caller intends to wake or requeue.
>>> pthread_cond_broadcast should call this with nr_wake=1 and 
>>> nr_requeue=-1 (all).
>>> pthread_cond_signal should call this with nr_wake=1 and 
>>> nr_requeue=0.  The
>>> reason being we need both callers to get the benefit of the
>>> futex_requeue_pi_init() routine which will prepare the top_waiter 
>>> (the thread
>>> to be woken) to take possesion of the pi futex by setting 
>>> FUTEX_WAITERS and
>>> preparing the futex_q.pi_state.  futex_requeue() also enqueues the 
>>> top_waiter
>>> on the rt_mutex via rt_mutex_start_proxy_lock(). If 
>>> pthread_cond_signal used
>>> FUTEX_WAKE, we would have a similar race window where the caller can 
>>> return and
>>> release the mutex before the waiters can fully wake, potentially 
>>> leaving the
>>> rt_mutex with waiters but no owner.
>>>
>>> We hit a failed paging request running the testcase (7/7) in a loop
>>> (only takes a few minutes at most to hit on my 8way x86_64 test
>>> machine).  It appears to be the result of splitting rt_mutex_slowlock()
>>> across two execution contexts by means of rt_mutex_start_proxy_lock()
>>> and rt_mutex_finish_proxy_lock().  The former calls
>>> task_blocks_on_rt_mutex() on behalf of the waiting task prior to
>>> requeuing and waking it by the requeueing thread.  The latter is
>>> executed upon wakeup by the waiting thread which somehow manages to call
>>> the new __rt_mutex_slowlock() with waiter->task != NULL and still
>>> succeed with try_to_take_lock(), this leads to corruption of the plists
>>> and an eventual failed paging request.  See 7/7 for the rather crude
>>> testcase that causes this.  Any tips on where this race might be
>>> occuring are welcome.

<snip>

> 
> I've updated my tracing and can show that rt_mutex_start_proxy_lock() is 
> not setting RT_MUTEX_HAS_WAITERS like it should be:
> 
> ------------[ cut here ]------------
> kernel BUG at kernel/rtmutex.c:646!
> invalid opcode: 0000 [#1] PREEMPT SMP
> last sysfs file: 
> /sys/devices/pci0000:00/0000:00:03.0/0000:01:00.0/host0/port-0:
> 0/end_device-0:0/target0:0:0/0:0:0:0/vendor
> Dumping ftrace buffer:
> ---------------------------------
>   <...>-3793    1d..3 558351872us : lookup_pi_state: allocating a new pi 
> state
>   <...>-3793    1d..3 558351876us : lookup_pi_state: initial rt_mutex 
> owner: ffff88023d9486c0
>   <...>-3793    1...2 558351877us : futex_requeue: futex_lock_pi_atomic 
> returned: 0
>   <...>-3793    1...2 558351877us : futex_requeue: futex_requeue_pi_init 
> returned: 0
>   <...>-3793    1...3 558351879us : rt_mutex_start_proxy_lock: 
> task_blocks_on_rt_mutex returned 0
>   <...>-3793    1...3 558351880us : rt_mutex_start_proxy_lock: lock has 
> waiterflag: 0
>   <...>-3793    1...1 558351888us : rt_mutex_unlock: unlocking 
> ffff88023b5f6950
>   <...>-3793    1...1 558351888us : rt_mutex_unlock: lock waiter flag: 0
>   <...>-3793    1...1 558351889us : rt_mutex_unlock: unlocked 
> ffff88023b5f6950
>   <...>-3783    0...1 558351893us : __rt_mutex_slowlock: waiter->task is 
> ffff88023c872440
>   <...>-3783    0...1 558351897us : try_to_take_rt_mutex: assigned 
> rt_mutex (ffff88023b5f6950) owner to current ffff88023c872440
>   <...>-3783    0...1 558351897us : __rt_mutex_slowlock: got the lock
> ---------------------------------
> 
> I'll start digging into why that's happening, but I wanted to share the 
> trace output.
> 

As it turns out I missed setting RT_MUTEX_HAS_WAITERS on the rt_mutex in
rt_mutex_start_proxy_lock() - seems awfully silly in retrospect - but a
little non-obvious while writing it.  I added mark_rt_mutex_waiters()
after the call to task_blocks_on_rt_mutex() and the test has completed
more than 400 iterations successfully (it would fail after no more than
2 most of the time before).

Steven, there are several ways to set RT_MUTEX_HAS_WAITERS - but this
seemed like a reasonable approach, would you agree?  Since I'm holding
the wait_lock I don't technically need the atomic cmpxchg and could
probably just set it explicity - do you have a preference?


RFC: rt_mutex: add proxy lock routines

From: Darren Hart <dvhltc@us.ibm.com>

This patch is required for the first half of requeue_pi to function.  It
basically splits rt_mutex_slowlock() right down the middle, just before the
first call to schedule().

This patch uses a new futex_q field, rt_waiter, for now.  I think
I should be able to use task->pi_blocked_on in a future versino of this patch.

V6: -add mark_rt_mutex_waiters() to rt_mutex_start_procy_lock() to avoid
     the race condition evident in previous versions
V5: -remove EXPORT_SYMBOL_GPL from the new routines
    -minor cleanups
V4: -made detect_deadlock a parameter to rt_mutex_enqueue_task
    -refactored rt_mutex_slowlock to share code with new functions
    -renamed rt_mutex_enqueue_task and rt_mutex_handle_wakeup to
     rt_mutex_start_proxy_lock and rt_mutex_finish_proxy_lock, respectively

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sripathi Kodi <sripathik@in.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
---

 kernel/rtmutex.c        |  193 +++++++++++++++++++++++++++++++++++++----------
 kernel/rtmutex_common.h |    8 ++
 2 files changed, 161 insertions(+), 40 deletions(-)


diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 69d9cb9..f438362 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -411,6 +411,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock)
  */
 static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 				   struct rt_mutex_waiter *waiter,
+				   struct task_struct *task,
 				   int detect_deadlock)
 {
 	struct task_struct *owner = rt_mutex_owner(lock);
@@ -418,21 +419,21 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 	unsigned long flags;
 	int chain_walk = 0, res;
 
-	spin_lock_irqsave(&current->pi_lock, flags);
-	__rt_mutex_adjust_prio(current);
-	waiter->task = current;
+	spin_lock_irqsave(&task->pi_lock, flags);
+	__rt_mutex_adjust_prio(task);
+	waiter->task = task;
 	waiter->lock = lock;
-	plist_node_init(&waiter->list_entry, current->prio);
-	plist_node_init(&waiter->pi_list_entry, current->prio);
+	plist_node_init(&waiter->list_entry, task->prio);
+	plist_node_init(&waiter->pi_list_entry, task->prio);
 
 	/* Get the top priority waiter on the lock */
 	if (rt_mutex_has_waiters(lock))
 		top_waiter = rt_mutex_top_waiter(lock);
 	plist_add(&waiter->list_entry, &lock->wait_list);
 
-	current->pi_blocked_on = waiter;
+	task->pi_blocked_on = waiter;
 
-	spin_unlock_irqrestore(&current->pi_lock, flags);
+	spin_unlock_irqrestore(&task->pi_lock, flags);
 
 	if (waiter == rt_mutex_top_waiter(lock)) {
 		spin_lock_irqsave(&owner->pi_lock, flags);
@@ -460,7 +461,7 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 	spin_unlock(&lock->wait_lock);
 
 	res = rt_mutex_adjust_prio_chain(owner, detect_deadlock, lock, waiter,
-					 current);
+					 task);
 
 	spin_lock(&lock->wait_lock);
 
@@ -605,37 +606,25 @@ void rt_mutex_adjust_pi(struct task_struct *task)
 	rt_mutex_adjust_prio_chain(task, 0, NULL, NULL, task);
 }
 
-/*
- * Slow path lock function:
+/**
+ * __rt_mutex_slowlock - perform the wait-wake-try-to-take loop
+ * @lock		 the rt_mutex to take
+ * @state:		 the state the task should block in (TASK_INTERRUPTIBLE
+ * 			 or TASK_UNINTERRUPTIBLE)
+ * @timeout:		 the pre-initialized and started timer, or NULL for none
+ * @waiter:		 the pre-initialized rt_mutex_waiter
+ * @detect_deadlock:	 passed to task_blocks_on_rt_mutex
+ *
+ * lock->wait_lock must be held by the caller.
  */
 static int __sched
-rt_mutex_slowlock(struct rt_mutex *lock, int state,
-		  struct hrtimer_sleeper *timeout,
-		  int detect_deadlock)
+__rt_mutex_slowlock(struct rt_mutex *lock, int state,
+		    struct hrtimer_sleeper *timeout,
+		    struct rt_mutex_waiter *waiter,
+		    int detect_deadlock)
 {
-	struct rt_mutex_waiter waiter;
 	int ret = 0;
 
-	debug_rt_mutex_init_waiter(&waiter);
-	waiter.task = NULL;
-
-	spin_lock(&lock->wait_lock);
-
-	/* Try to acquire the lock again: */
-	if (try_to_take_rt_mutex(lock)) {
-		spin_unlock(&lock->wait_lock);
-		return 0;
-	}
-
-	set_current_state(state);
-
-	/* Setup the timer, when timeout != NULL */
-	if (unlikely(timeout)) {
-		hrtimer_start_expires(&timeout->timer, HRTIMER_MODE_ABS);
-		if (!hrtimer_active(&timeout->timer))
-			timeout->task = NULL;
-	}
-
 	for (;;) {
 		/* Try to acquire the lock: */
 		if (try_to_take_rt_mutex(lock))
@@ -656,19 +645,19 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 		}
 
 		/*
-		 * waiter.task is NULL the first time we come here and
+		 * waiter->task is NULL the first time we come here and
 		 * when we have been woken up by the previous owner
 		 * but the lock got stolen by a higher prio task.
 		 */
-		if (!waiter.task) {
-			ret = task_blocks_on_rt_mutex(lock, &waiter,
+		if (!waiter->task) {
+			ret = task_blocks_on_rt_mutex(lock, waiter, current,
 						      detect_deadlock);
 			/*
 			 * If we got woken up by the owner then start loop
 			 * all over without going into schedule to try
 			 * to get the lock now:
 			 */
-			if (unlikely(!waiter.task)) {
+			if (unlikely(!waiter->task)) {
 				/*
 				 * Reset the return value. We might
 				 * have returned with -EDEADLK and the
@@ -684,15 +673,52 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 
 		spin_unlock(&lock->wait_lock);
 
-		debug_rt_mutex_print_deadlock(&waiter);
+		debug_rt_mutex_print_deadlock(waiter);
 
-		if (waiter.task)
+		if (waiter->task)
 			schedule_rt_mutex(lock);
 
 		spin_lock(&lock->wait_lock);
 		set_current_state(state);
 	}
 
+	return ret;
+}
+
+/*
+ * Slow path lock function:
+ */
+static int __sched
+rt_mutex_slowlock(struct rt_mutex *lock, int state,
+		  struct hrtimer_sleeper *timeout,
+		  int detect_deadlock)
+{
+	struct rt_mutex_waiter waiter;
+	int ret = 0;
+
+	debug_rt_mutex_init_waiter(&waiter);
+	waiter.task = NULL;
+
+	spin_lock(&lock->wait_lock);
+
+	/* Try to acquire the lock again: */
+	if (try_to_take_rt_mutex(lock)) {
+		spin_unlock(&lock->wait_lock);
+		return 0;
+	}
+
+	set_current_state(state);
+
+	/* Setup the timer, when timeout != NULL */
+	if (unlikely(timeout)) {
+		hrtimer_start_expires(&timeout->timer, HRTIMER_MODE_ABS);
+		if (!hrtimer_active(&timeout->timer))
+			timeout->task = NULL;
+	}
+
+	ret = __rt_mutex_slowlock(lock, state, timeout, &waiter,
+				  detect_deadlock);
+
 	set_current_state(TASK_RUNNING);
 
 	if (unlikely(waiter.task))
@@ -986,6 +1012,42 @@ void rt_mutex_proxy_unlock(struct rt_mutex *lock,
 }
 
 /**
+ * rt_mutex_start_proxy_lock - prepare another task to take the lock
+ *
+ * @lock:		the rt_mutex to take
+ * @waiter:		the rt_mutex_waiter initialized by the waiter
+ * @task:		the task to prepare
+ * @detext_deadlock:	passed to task_blocks_on_rt_mutex
+ *
+ * The lock should have an owner, and it should not be task.
+ * Special API call for FUTEX_REQUEUE_PI support.
+ */
+int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
+			      struct rt_mutex_waiter *waiter,
+			      struct task_struct *task, int detect_deadlock)
+{
+	int ret;
+
+	spin_lock(&lock->wait_lock);
+	ret = task_blocks_on_rt_mutex(lock, waiter, task, detect_deadlock);
+	mark_rt_mutex_waiters(lock);
+	if (ret && !waiter->task) {
+		/*
+		 * Reset the return value. We might have
+		 * returned with -EDEADLK and the owner
+		 * released the lock while we were walking the
+		 * pi chain.  Let the waiter sort it out.
+		 */
+		ret = 0;
+	}
+	spin_unlock(&lock->wait_lock);
+
+	debug_rt_mutex_print_deadlock(waiter);
+
+	return ret;
+}
+
+/**
  * rt_mutex_next_owner - return the next owner of the lock
  *
  * @lock: the rt lock query
@@ -1004,3 +1066,54 @@ struct task_struct *rt_mutex_next_owner(struct rt_mutex *lock)
 
 	return rt_mutex_top_waiter(lock)->task;
 }
+
+/**
+ * rt_mutex_finish_proxy_lock - Complete the taking of the lock initialized on
+ *                              our behalf by another thread.
+ * @lock: the rt_mutex we were woken on
+ * @to: the timeout, null if none. hrtimer should already have been started.
+ * @waiter: the pre-initialized rt_mutex_waiter
+ * @detect_deadlock: for use by __rt_mutex_slowlock
+ *
+ * Special API call for PI-futex requeue support
+ */
+int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
+			       struct hrtimer_sleeper *to,
+			       struct rt_mutex_waiter *waiter,
+			       int detect_deadlock)
+{
+	int ret;
+
+	if (waiter->task)
+		schedule_rt_mutex(lock);
+
+	spin_lock(&lock->wait_lock);
+
+	set_current_state(TASK_INTERRUPTIBLE);
+
+	ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter,
+				  detect_deadlock);
+
+	set_current_state(TASK_RUNNING);
+
+	if (unlikely(waiter->task))
+		remove_waiter(lock, waiter);
+
+	/*
+	 * try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
+	 * have to fix that up.
+	 */
+	fixup_rt_mutex_waiters(lock);
+
+	spin_unlock(&lock->wait_lock);
+
+	/*
+	 * Readjust priority, when we did not get the lock. We might have been
+	 * the pending owner and boosted. Since we did not take the lock, the
+	 * PI boost has to go.
+	 */
+	if (unlikely(ret))
+		rt_mutex_adjust_prio(current);
+
+	return ret;
+}
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index e124bf5..97a2f81 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -120,6 +120,14 @@ extern void rt_mutex_init_proxy_locked(struct rt_mutex *lock,
 				       struct task_struct *proxy_owner);
 extern void rt_mutex_proxy_unlock(struct rt_mutex *lock,
 				  struct task_struct *proxy_owner);
+extern int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
+				     struct rt_mutex_waiter *waiter,
+				     struct task_struct *task,
+				     int detect_deadlock);
+extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
+				      struct hrtimer_sleeper *to,
+				      struct rt_mutex_waiter *waiter,
+				      int detect_deadlock);
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 # include "rtmutex-debug.h"

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

next prev parent reply	other threads:[~2009-03-06  1:42 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-03-03  0:02 [TIP][RFC 0/7] requeue pi implemenation Darren Hart
2009-03-03  0:09 ` [TIP][RFC 1/7] futex: futex_wait_queue_me() Darren Hart
2009-03-03  0:11 ` [TIP][RFC 2/7] futex: futex_top_waiter() Darren Hart
2009-03-07 15:16   ` Thomas Gleixner
2009-03-09 18:04     ` Darren Hart
2009-03-03  0:13 ` [TIP][RFC 3/7] futex: futex_lock_pi_atomic() Darren Hart
2009-03-03 13:03   ` Peter Zijlstra
2009-03-03 17:29     ` Darren Hart
2009-03-03  0:14 ` [TIP][RFC 4/7] futex: finish_futex_lock_pi() Darren Hart
2009-03-07 15:30   ` Thomas Gleixner
2009-03-09 18:05     ` Darren Hart
2009-03-03  0:16 ` [TIP][RFC 5/7] rt_mutex: add proxy lock routines Darren Hart
2009-03-07 15:44   ` Thomas Gleixner
2009-03-09 18:31     ` Darren Hart
2009-03-03  0:20 ` [TIP][RFC 6/7] futex: add requeue_pi calls Darren Hart
2009-03-04  7:53   ` Darren Hart
2009-03-05 16:51     ` Darren Hart
2009-03-06  1:42       ` Darren Hart [this message]
2009-03-06  2:21         ` Steven Rostedt
2009-03-06  5:27           ` Darren Hart
2009-03-07 15:50             ` Thomas Gleixner
2009-03-09 19:55               ` Darren Hart
2009-03-07  6:03         ` Sripathi Kodi
2009-03-09  9:48   ` Thomas Gleixner
2009-03-10  4:50     ` Darren Hart
2009-03-10 13:39       ` Thomas Gleixner
2009-03-03  0:23 ` [TIP][RFC 7/7] requeue pi testcase Darren Hart

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:69d9cb9 dfblob:f438362 dfblob:e124bf5 dfblob:97a2f81 )
 OR (
bs:"Re: [TIP][RFC 6/7] futex: add requeue_pi calls" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=49B07F87.2020808@us.ibm.com \
    --to=dvhltc@us.ibm.com \
    --cc=johnstul@linux.vnet.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=sripathik@in.ibm.com \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.