* rt_mutex_timed_lock() vs hrtimer_wakeup() race ?
@ 2006-07-30 4:36 Oleg Nesterov
2006-07-30 22:23 ` Steven Rostedt
2006-08-01 7:58 ` Thomas Gleixner
0 siblings, 2 replies; 8+ messages in thread
From: Oleg Nesterov @ 2006-07-30 4:36 UTC (permalink / raw)
To: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Esben Nielsen; +Cc: linux-kernel
I am trying to get some understanding of rt_mutex, but I'm afraid it's
not possible without your help...
	// runs on CPU 0
	rt_mutex_slowlock:

		// main loop
		for (;;) {
			...
			if (timeout && !timeout->task) {
				ret = -ETIMEDOUT;
				break;
			}
			...
			schedule();
			...
			set_current_state(state);
		}
What if timeout->timer fires on CPU 1 right before set_current_state()?
hrtimer_wakeup() does:
	timeout->task = NULL;		<----- [1]

	spin_lock(runqueues->lock);

	task->state = TASK_RUNNING;	<----- [2]

	(task->array != NULL, so try_to_wake_up() just goes to out_running)
If my understanding is correct, [1] may slip into the critical section (because
spin_lock() is not a wmb), so that CPU 0 will see [2] but not [1]. In that
case rt_mutex_slowlock() can miss the timeout (the mb() in set_current_state()
doesn't help).
Of course, this race (even _if_ I am right) is purely theoretical, but we
probably need an smp_wmb() after [1] in hrtimer_wakeup().
Note that do_nanosleep() is ok: hrtimer_base->lock provides the necessary
serialization. In fact, I think it can use __set_current_state(), because
both hrtimer_start() and run_hrtimer_queue() lock/unlock base->lock.
Another question: task_blocks_on_rt_mutex() does get_task_struct() and checks
owner->pi_blocked_on != NULL under owner->pi_lock. Why? The RT_MUTEX_HAS_WAITERS
bit is set and we are holding ->wait_lock, so the 'owner' can't go away until
we drop ->wait_lock. I think we can drop owner->pi_lock right after
__rt_mutex_adjust_prio(owner); we can't miss owner->pi_blocked_on != NULL
if it was true before we took owner->pi_lock, and that is the case we should
worry about, yes?
In other words (because I myself can't parse the paragraph above :), could
you explain to me why this patch is not correct:
--- rtmutex.c~	2006-07-30 05:15:38.000000000 +0400
+++ rtmutex.c	2006-07-30 05:41:44.000000000 +0400
@@ -407,7 +407,7 @@ static int task_blocks_on_rt_mutex(struc
 	struct task_struct *owner = rt_mutex_owner(lock);
 	struct rt_mutex_waiter *top_waiter = waiter;
 	unsigned long flags;
-	int boost = 0, res;
+	int res;

 	spin_lock_irqsave(&current->pi_lock, flags);
 	__rt_mutex_adjust_prio(current);
@@ -431,24 +431,20 @@ static int task_blocks_on_rt_mutex(struc
 		plist_add(&waiter->pi_list_entry, &owner->pi_waiters);

 		__rt_mutex_adjust_prio(owner);
-		if (owner->pi_blocked_on) {
-			boost = 1;
-			/* gets dropped in rt_mutex_adjust_prio_chain()! */
-			get_task_struct(owner);
-		}
 		spin_unlock_irqrestore(&owner->pi_lock, flags);
+
+		if (owner->pi_blocked_on)
+			goto boost;
 	}
 	else if (debug_rt_mutex_detect_deadlock(waiter, detect_deadlock)) {
-		spin_lock_irqsave(&owner->pi_lock, flags);
-		if (owner->pi_blocked_on) {
-			boost = 1;
-			/* gets dropped in rt_mutex_adjust_prio_chain()! */
-			get_task_struct(owner);
-		}
-		spin_unlock_irqrestore(&owner->pi_lock, flags);
+		if (owner->pi_blocked_on)
+			goto boost;
 	}
-	if (!boost)
-		return 0;
+
+	return 0;
+boost:
+	/* gets dropped in rt_mutex_adjust_prio_chain()! */
+	get_task_struct(owner);

 	spin_unlock(&lock->wait_lock);
----------------------------------------------------------
The same question for remove_waiter()/rt_mutex_adjust_pi().
The last (stupid) one,
	wake_up_new_task:

		if (unlikely(!current->array))
			__activate_task(p, rq);
(offtopic) Is it really possible to have current->array == NULL here?
		else {
			p->prio = current->prio;
What if current was pi-boosted so that rt_prio(current->prio) == 1?
Who will de-boost the child?
			p->normal_prio = current->normal_prio;
Why? p->normal_prio was already calculated by effective_prio() above; could
you explain why that value is not ok?
Thanks,
Oleg.
^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: rt_mutex_timed_lock() vs hrtimer_wakeup() race ?
From: Steven Rostedt @ 2006-07-30 22:23 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Ingo Molnar, Thomas Gleixner, Esben Nielsen, linux-kernel

On Sun, 2006-07-30 at 08:36 +0400, Oleg Nesterov wrote:
> I am trying to get some understanding of rt_mutex, but I'm afraid it's
> not possible without your help...

Well, I'll give it a try :)

> 	// runs on CPU 0
> 	rt_mutex_slowlock:
>
> 		// main loop
> 		for (;;) {
> 			...
> 			if (timeout && !timeout->task) {
> 				ret = -ETIMEDOUT;
> 				break;
> 			}
> 			...
> 			schedule();
> 			...
> 			set_current_state(state);
> 		}
>
> What if timeout->timer fires on CPU 1 right before set_current_state()?
>
> hrtimer_wakeup() does:
>
> 	timeout->task = NULL;		<----- [1]
>
> 	spin_lock(runqueues->lock);
>
> 	task->state = TASK_RUNNING;	<----- [2]
>
> 	(task->array != NULL, so try_to_wake_up() just goes to out_running)
>
> If my understanding is correct, [1] may slip into the critical section
> (because spin_lock() is not a wmb), so that CPU 0 will see [2] but not
> [1]. In that case rt_mutex_slowlock() can miss the timeout (the mb() in
> set_current_state() doesn't help).

Hmm, that spin_lock and task->state = TASK_RUNNING is in try_to_wake_up,
and at the bottom too.  I don't think there's a CPU that will slip a
write after a call to a function and several branches.  I guess it could
happen.  I'm not an expert here.  We're only dealing with SMP issues
here, and not worried about PCI bus write outs or anything else.

Does anyone know of an SMP machine that could have a CPU do a write to
memory and postpone it till after several branches?

> Of course, this race (even _if_ I am right) is purely theoretical, but
> we probably need an smp_wmb() after [1] in hrtimer_wakeup().

Theoretically, you may be right, and I'm not sure how much a smp_wmb
would hurt to prevent a theoretical problem.

> Note that do_nanosleep() is ok: hrtimer_base->lock provides the
> necessary serialization. In fact, I think it can use
> __set_current_state(), because both hrtimer_start() and
> run_hrtimer_queue() lock/unlock base->lock.

Hmm, that's probably true too.  But this really is a micro-optimization,
since do_nanosleep isn't called that often (it's a syscall and not
called at every schedule or interrupt).

> Another question: task_blocks_on_rt_mutex() does get_task_struct() and
> checks owner->pi_blocked_on != NULL under owner->pi_lock. Why? The
> RT_MUTEX_HAS_WAITERS bit is set and we are holding ->wait_lock, so the
> 'owner' can't go away until we drop ->wait_lock.

That's probably true that the owner can't disappear before we let go of
the wait_lock, since the owner should not be disappearing while holding
locks.  But you are missing the point of why we are grabbing the
pi_lock.  We are preventing a race that can have us do unneeded work
(see below).

> I think we can drop owner->pi_lock right after
> __rt_mutex_adjust_prio(owner); we can't miss owner->pi_blocked_on !=
> NULL if it was true before we took owner->pi_lock, and that is the case
> we should worry about, yes?
>
> In other words (because I myself can't parse the paragraph above :),
> could you explain to me why this patch is not correct:
>
> --- rtmutex.c~	2006-07-30 05:15:38.000000000 +0400
> +++ rtmutex.c	2006-07-30 05:41:44.000000000 +0400
> @@ -407,7 +407,7 @@ static int task_blocks_on_rt_mutex(struc
>  	struct task_struct *owner = rt_mutex_owner(lock);
>  	struct rt_mutex_waiter *top_waiter = waiter;
>  	unsigned long flags;
> -	int boost = 0, res;
> +	int res;
>
>  	spin_lock_irqsave(&current->pi_lock, flags);
>  	__rt_mutex_adjust_prio(current);
> @@ -431,24 +431,20 @@ static int task_blocks_on_rt_mutex(struc
>  		plist_add(&waiter->pi_list_entry, &owner->pi_waiters);
>
>  		__rt_mutex_adjust_prio(owner);
> -		if (owner->pi_blocked_on) {
> -			boost = 1;
> -			/* gets dropped in rt_mutex_adjust_prio_chain()! */
> -			get_task_struct(owner);
> -		}
>  		spin_unlock_irqrestore(&owner->pi_lock, flags);
> +
> +		if (owner->pi_blocked_on)
> +			goto boost;

OK, this is a little complicated.  We might have just upped the owner's
prio with the __rt_mutex_adjust_prio(owner).  Now we want to know if the
owner is blocked on a lock or not.  If it isn't at this moment, we don't
want to add the extra time of calling rt_mutex_adjust_prio_chain.

If we do the owner->pi_blocked_on test after letting go of the
owner->pi_lock, the owner could then become blocked on a lock.  This
isn't a bug in logic, that is, it doesn't break the code, but since the
owner has already been boosted, when it blocks it will do the work of
rt_mutex_adjust_prio_chain too.  So we run the risk of calling
rt_mutex_adjust_prio_chain when we really didn't need to, because the
owner already had its prio boosted when it blocked.

Make sense?

> 	}
> 	else if (debug_rt_mutex_detect_deadlock(waiter, detect_deadlock)) {
> -		spin_lock_irqsave(&owner->pi_lock, flags);
> -		if (owner->pi_blocked_on) {
> -			boost = 1;
> -			/* gets dropped in rt_mutex_adjust_prio_chain()! */
> -			get_task_struct(owner);
> -		}
> -		spin_unlock_irqrestore(&owner->pi_lock, flags);
> +		if (owner->pi_blocked_on)
> +			goto boost;
> 	}
> -	if (!boost)
> -		return 0;
> +
> +	return 0;
> +boost:
> +	/* gets dropped in rt_mutex_adjust_prio_chain()! */
> +	get_task_struct(owner);

So the protection was probably more on that boost = 1, and if we _are_
going to boost then we should grab the lock.  But I think you are right
that the get_task_struct is already protected by the wait_lock, so
perhaps this patch would work:

Index: linux-2.6.18-rc2/kernel/rtmutex.c
===================================================================
--- linux-2.6.18-rc2.orig/kernel/rtmutex.c	2006-07-30 18:04:12.000000000 -0400
+++ linux-2.6.18-rc2/kernel/rtmutex.c	2006-07-30 18:07:08.000000000 -0400
@@ -433,25 +433,26 @@ static int task_blocks_on_rt_mutex(struc
 		plist_add(&waiter->pi_list_entry, &owner->pi_waiters);

 		__rt_mutex_adjust_prio(owner);
-		if (owner->pi_blocked_on) {
+		if (owner->pi_blocked_on)
 			boost = 1;
-			/* gets dropped in rt_mutex_adjust_prio_chain()! */
-			get_task_struct(owner);
-		}
 		spin_unlock_irqrestore(&owner->pi_lock, flags);
 	}
 	else if (debug_rt_mutex_detect_deadlock(waiter, detect_deadlock)) {
 		spin_lock_irqsave(&owner->pi_lock, flags);
-		if (owner->pi_blocked_on) {
+		if (owner->pi_blocked_on)
 			boost = 1;
-			/* gets dropped in rt_mutex_adjust_prio_chain()! */
-			get_task_struct(owner);
-		}
 		spin_unlock_irqrestore(&owner->pi_lock, flags);
 	}
 	if (!boost)
 		return 0;

+	/*
+	 * The owner can't disappear while holding a lock,
+	 * so the owner struct is protected by wait_lock.
+	 * Gets dropped in rt_mutex_adjust_prio_chain()!
+	 */
+	get_task_struct(owner);
+
 	spin_unlock(&lock->wait_lock);

 	res = rt_mutex_adjust_prio_chain(owner, detect_deadlock, lock, waiter,

> 	spin_unlock(&lock->wait_lock);
>
> ----------------------------------------------------------
> The same question for remove_waiter()/rt_mutex_adjust_pi().

Same answer. :)

> The last (stupid) one,

Well, the other questions were definitely not stupid!

> 	wake_up_new_task:
>
> 		if (unlikely(!current->array))
> 			__activate_task(p, rq);
>
> (offtopic) Is it really possible to have current->array == NULL here?

You know, I think you're right.  I don't think current->array can ever
be NULL outside of schedule.  And here we are even holding the rq locks!
Perhaps this can happen on startup.  Is the init thread on any run queue
on system boot up?

> 		else {
> 			p->prio = current->prio;
>
> What if current was pi-boosted so that rt_prio(current->prio) == 1?
> Who will de-boost the child?

Good point!  Perhaps this can cause a prio leak.

> 			p->normal_prio = current->normal_prio;
>
> Why? p->normal_prio was already calculated by effective_prio() above;
> could you explain why that value is not ok?

I don't know this answer.  So I can't help you here.

Seems that these were not "stupid" questions after all. ;)

-- Steve
* Re: rt_mutex_timed_lock() vs hrtimer_wakeup() race ?
From: Oleg Nesterov @ 2006-08-01 0:12 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Ingo Molnar, Thomas Gleixner, Esben Nielsen, linux-kernel

On 07/30, Steven Rostedt wrote:
>
> On Sun, 2006-07-30 at 08:36 +0400, Oleg Nesterov wrote:
> >
> > Another question: task_blocks_on_rt_mutex() does get_task_struct() and
> > checks owner->pi_blocked_on != NULL under owner->pi_lock. Why? The
> > RT_MUTEX_HAS_WAITERS bit is set and we are holding ->wait_lock, so the
> > 'owner' can't go away until we drop ->wait_lock.
>
> That's probably true that the owner can't disappear before we let go of
> the wait_lock, since the owner should not be disappearing while holding
> locks.  But you are missing the point of why we are grabbing the
> pi_lock.  We are preventing a race that can have us do unneeded work
> (see below).

Yes, I see. But ...

> ===================================================================
> --- linux-2.6.18-rc2.orig/kernel/rtmutex.c	2006-07-30 18:04:12.000000000 -0400
> +++ linux-2.6.18-rc2/kernel/rtmutex.c	2006-07-30 18:07:08.000000000 -0400
> @@ -433,25 +433,26 @@ static int task_blocks_on_rt_mutex(struc
> ...
>  	else if (debug_rt_mutex_detect_deadlock(waiter, detect_deadlock)) {
>  		spin_lock_irqsave(&owner->pi_lock, flags);
> -		if (owner->pi_blocked_on) {
> +		if (owner->pi_blocked_on)
>  			boost = 1;
> -			/* gets dropped in rt_mutex_adjust_prio_chain()! */
> -			get_task_struct(owner);
> -		}
>  		spin_unlock_irqrestore(&owner->pi_lock, flags);

In that case ->pi_lock can't buy anything. With or without ->pi_lock this
check is equally racy, so spin_lock() only adds unneeded work?

Thanks!

Oleg.
* Re: rt_mutex_timed_lock() vs hrtimer_wakeup() race ?
From: Steven Rostedt @ 2006-07-31 20:47 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Ingo Molnar, Thomas Gleixner, Esben Nielsen, linux-kernel

On Tue, 2006-08-01 at 04:12 +0400, Oleg Nesterov wrote:
> On 07/30, Steven Rostedt wrote:
> >
> > That's probably true that the owner can't disappear before we let go
> > of the wait_lock, since the owner should not be disappearing while
> > holding locks.  But you are missing the point of why we are grabbing
> > the pi_lock.  We are preventing a race that can have us do unneeded
> > work (see below).
>
> Yes, I see. But ...
>
> > ===================================================================
> > --- linux-2.6.18-rc2.orig/kernel/rtmutex.c	2006-07-30 18:04:12.000000000 -0400
> > +++ linux-2.6.18-rc2/kernel/rtmutex.c	2006-07-30 18:07:08.000000000 -0400
> > @@ -433,25 +433,26 @@ static int task_blocks_on_rt_mutex(struc
> > ...
> >  	else if (debug_rt_mutex_detect_deadlock(waiter, detect_deadlock)) {
> >  		spin_lock_irqsave(&owner->pi_lock, flags);
> > -		if (owner->pi_blocked_on) {
> > +		if (owner->pi_blocked_on)
> >  			boost = 1;
> > -			/* gets dropped in rt_mutex_adjust_prio_chain()! */
> > -			get_task_struct(owner);
> > -		}
> >  		spin_unlock_irqrestore(&owner->pi_lock, flags);
>
> In that case ->pi_lock can't buy anything. With or without ->pi_lock
> this check is equally racy, so spin_lock() only adds unneeded work?

Crap!  I just did a blind change there.  The first one does matter, but
this is for debugging.

Hmm, actually I would just remove the owner-blocked check altogether and
do the boost = 1 to force the chain walk.  It's for debugging anyway.
So that probably could just look something like this:

	else if (debug_rt_mutex_detect_deadlock(waiter, detect_deadlock))
		boost = 1;

The "boost" here is a misnomer.  It probably would be better to call it
"walk" or "chain_walk".

-- Steve
* Re: rt_mutex_timed_lock() vs hrtimer_wakeup() race ?
From: Thomas Gleixner @ 2006-08-01 7:58 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Ingo Molnar, Steven Rostedt, Esben Nielsen, linux-kernel

On Sun, 2006-07-30 at 08:36 +0400, Oleg Nesterov wrote:
> Another question: task_blocks_on_rt_mutex() does get_task_struct() and
> checks owner->pi_blocked_on != NULL under owner->pi_lock. Why? The
> RT_MUTEX_HAS_WAITERS bit is set and we are holding ->wait_lock, so the
> 'owner' can't go away until we drop ->wait_lock. I think we can drop
> owner->pi_lock right after __rt_mutex_adjust_prio(owner); we can't miss
> owner->pi_blocked_on != NULL if it was true before we took
> owner->pi_lock, and that is the case we should worry about, yes?

No.

We hold lock->wait_lock. The owner of this lock can be blocked itself,
which makes it necessary to do the chain walk. The indicator is
owner->pi_blocked_on. This field is only protected by owner->pi_lock.

If we look at this field outside of owner->pi_lock, then we might miss a
chain walk:

	CPU 0					CPU 1

	lock lock->wait_lock

	block_on_rt_mutex()

	lock current->pi_lock
	current->pi_blocked_on = waiter(lock)
	unlock current->pi_lock

	boost = owner->pi_blocked_on
						owner blocks on lock2
						lock owner->pi_lock
						owner->pi_blocked_on = waiter(lock2)
						unlock owner->pi_lock
	lock owner->pi_lock
	enqueue waiter
	adjust prio
	unlock owner->pi_lock
	unlock lock->wait_lock

	if boost
		do_chain_walk()

->pi_blocked_on is protected by ->pi_lock and has to be read/modified
under this lock. That way we can not miss a chain walk:

	CPU 0					CPU 1

	lock lock->wait_lock

	block_on_rt_mutex()

	lock current->pi_lock
	current->pi_blocked_on = waiter(lock)
	unlock current->pi_lock

						owner blocks on lock2
						lock owner->pi_lock
						owner->pi_blocked_on = waiter(lock2)
						unlock owner->pi_lock
	lock owner->pi_lock
	enqueue waiter
	adjust prio
	boost = owner->pi_blocked_on
	unlock owner->pi_lock
	unlock lock->wait_lock

	if boost
		do_chain_walk()

	CPU 0					CPU 1

	lock lock->wait_lock

	block_on_rt_mutex()

	lock current->pi_lock
	current->pi_blocked_on = waiter(lock)
	unlock current->pi_lock

	lock owner->pi_lock
	enqueue waiter
	adjust prio of owner
	boost = owner->pi_blocked_on
	unlock owner->pi_lock
	unlock lock->wait_lock

						owner blocks on lock2
						lock owner->pi_lock
						owner->pi_blocked_on = waiter(lock2)
						unlock owner->pi_lock
	if boost
		do_chain_walk()

						owner propagates the priority to owner(lock2)

get_task_struct() is there for a different reason. We have to protect
the gap where we hold _no_ lock in rt_mutex_adjust_prio_chain:

retry:
	lock task->pi_lock
	block = task->pi_blocked_on->lock
	if (!trylock block->wait_lock) {
		unlock task->pi_lock

		-> task could go away right here, if we do not hold a reference

		cpu_relax()
		goto retry
	}

	tglx
* Re: rt_mutex_timed_lock() vs hrtimer_wakeup() race ?
From: Steven Rostedt @ 2006-08-01 12:07 UTC (permalink / raw)
To: tglx; +Cc: Oleg Nesterov, Ingo Molnar, Esben Nielsen, linux-kernel

On Tue, 2006-08-01 at 09:58 +0200, Thomas Gleixner wrote:
> We hold lock->wait_lock. The owner of this lock can be blocked itself,
> which makes it necessary to do the chain walk. The indicator is
> owner->pi_blocked_on. This field is only protected by owner->pi_lock.
>
> If we look at this field outside of owner->pi_lock, then we might miss a
> chain walk.

Actually Thomas, not counting the debug case, his patch won't miss a
chain walk.  That is because the boost is read _after_ the owner's prio
is adjusted.  So the only thing the lock is doing for us is to prevent
us from walking the chain twice for the same lock grab.  (btw, I'm
looking at 2.6.18-rc2, and not the -rt patch, just to make things
clear.)

> 	CPU 0					CPU 1
>
> 	lock lock->wait_lock
>
> 	block_on_rt_mutex()
>
> 	lock current->pi_lock
> 	current->pi_blocked_on = waiter(lock)
> 	unlock current->pi_lock
>
> 	boost = owner->pi_blocked_on
> 						owner blocks on lock2
> 						lock owner->pi_lock
> 						owner->pi_blocked_on = waiter(lock2)
> 						unlock owner->pi_lock
> 	lock owner->pi_lock
> 	enqueue waiter
> 	adjust prio
> 	unlock owner->pi_lock
> 	unlock lock->wait_lock
>
> 	if boost
> 		do_chain_walk()
>
> ->pi_blocked_on is protected by ->pi_lock and has to be read/modified

Something does not have to be read under a lock unless we are actually
doing something with it while still under the lock.  But this code
doesn't do that.  It reads boost = owner->pi_blocked_on and then
releases the lock.  The only other thing that this read is protected
with is the adjusting of the prio.

So as I said, we are protecting against doing more work, and that is
all.  We won't miss any chain walk by moving the boost check outside the
lock, as long as we do the boost check _after_ we adjusted the owner's
prio and the blocking task is on the owner's/lock's wait queue.

> under this lock. That way we can not miss a chain walk:
>
> 	CPU 0					CPU 1
>
> 	lock lock->wait_lock
>
> 	block_on_rt_mutex()
>
> 	lock current->pi_lock
> 	current->pi_blocked_on = waiter(lock)
> 	unlock current->pi_lock
>
> 						owner blocks on lock2
> 						lock owner->pi_lock
> 						owner->pi_blocked_on = waiter(lock2)
> 						unlock owner->pi_lock
> 	lock owner->pi_lock
> 	enqueue waiter
> 	adjust prio
> 	boost = owner->pi_blocked_on
> 	unlock owner->pi_lock
> 	unlock lock->wait_lock
>
> 	if boost
> 		do_chain_walk()

BTW, you are showing a lot of CPU0 vs CPU1 code here without explaining
what you are showing.  Please point out where exactly a chain walk is
missed, and how the locking of the boost variable helps.  All I see is
that it prevents the chain walk from happening twice.

Reading a variable under a lock and then releasing the lock does
nothing, unless that variable might be a NULL dereference, and that
would mean that the owner struct disappeared.  Which can't happen,
because we still have the wait_lock (see below).

> 	CPU 0					CPU 1
>
> 	lock lock->wait_lock
>
> 	block_on_rt_mutex()
>
> 	lock current->pi_lock
> 	current->pi_blocked_on = waiter(lock)
> 	unlock current->pi_lock
>
> 	lock owner->pi_lock
> 	enqueue waiter
> 	adjust prio of owner
> 	boost = owner->pi_blocked_on
> 	unlock owner->pi_lock
> 	unlock lock->wait_lock
>
> 						owner blocks on lock2
> 						lock owner->pi_lock
> 						owner->pi_blocked_on = waiter(lock2)
> 						unlock owner->pi_lock
> 	if boost
> 		do_chain_walk()
>
> 						owner propagates the priority to owner(lock2)
>
> get_task_struct() is there for a different reason. We have to protect
> the gap where we hold _no_ lock in rt_mutex_adjust_prio_chain:

I think Oleg realized this.  But the question was why it is done with
the pi_lock held.  It is just as good to do it with the wait_lock held,
because the owner can't disappear while holding a lock, or that in
itself is a bug.  So we can push that get_task_struct to just before the
wait_lock is released.

-- Steve

> retry:
> 	lock task->pi_lock
> 	block = task->pi_blocked_on->lock
> 	if (!trylock block->wait_lock) {
> 		unlock task->pi_lock
>
> 		-> task could go away right here, if we do not hold a reference
>
> 		cpu_relax()
> 		goto retry
> 	}
>
> 	tglx
* Re: rt_mutex_timed_lock() vs hrtimer_wakeup() race ?
From: Thomas Gleixner @ 2006-08-01 12:52 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Oleg Nesterov, Ingo Molnar, Esben Nielsen, linux-kernel

On Tue, 2006-08-01 at 08:07 -0400, Steven Rostedt wrote:
> > We hold lock->wait_lock. The owner of this lock can be blocked itself,
> > which makes it necessary to do the chain walk. The indicator is
> > owner->pi_blocked_on. This field is only protected by owner->pi_lock.
> >
> > If we look at this field outside of owner->pi_lock, then we might miss
> > a chain walk.
>
> Actually Thomas, not counting the debug case, his patch won't miss a
> chain walk.  That is because the boost is read _after_ the owner's prio
> is adjusted.  So the only thing the lock is doing for us is to prevent
> us from walking the chain twice for the same lock grab.  (btw, I'm
> looking at 2.6.18-rc2, and not the -rt patch, just to make things
> clear.)

So what do we win when we drop the lock before we check for boosting?
In the worst case we do a redundant chain walk.

	tglx
* Re: rt_mutex_timed_lock() vs hrtimer_wakeup() race ?
From: Steven Rostedt @ 2006-08-01 13:21 UTC (permalink / raw)
To: tglx; +Cc: Oleg Nesterov, Ingo Molnar, Esben Nielsen, linux-kernel

On Tue, 2006-08-01 at 14:52 +0200, Thomas Gleixner wrote:
> On Tue, 2006-08-01 at 08:07 -0400, Steven Rostedt wrote:
> > Actually Thomas, not counting the debug case, his patch won't miss a
> > chain walk.  That is because the boost is read _after_ the owner's
> > prio is adjusted.  So the only thing the lock is doing for us is to
> > prevent us from walking the chain twice for the same lock grab.
> > (btw, I'm looking at 2.6.18-rc2, and not the -rt patch, just to make
> > things clear.)
>
> So what do we win when we drop the lock before we check for boosting?
> In the worst case we do a redundant chain walk.

I'm not saying that it was correct to drop the lock before checking for
boosting.  I'm just saying that it won't miss a chain walk.  But we
might do two of them.

I think I'll send my patch again, which fixes this up a little.  I'll
make two patches: one for mainline, and one for -rt.  In the debug
case, we really don't even need to grab the lock.  (see patch -- coming
soon)

-- Steve
end of thread, other threads: [~2006-08-01 13:21 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2006-07-30  4:36 rt_mutex_timed_lock() vs hrtimer_wakeup() race ? Oleg Nesterov
2006-07-30 22:23 ` Steven Rostedt
2006-08-01  0:12   ` Oleg Nesterov
2006-07-31 20:47     ` Steven Rostedt
2006-08-01  7:58 ` Thomas Gleixner
2006-08-01 12:07   ` Steven Rostedt
2006-08-01 12:52     ` Thomas Gleixner
2006-08-01 13:21       ` Steven Rostedt