PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
@ 2006-11-05 19:34 Oleg Nesterov
  2006-11-05 22:29 ` Linus Torvalds
  2006-11-06 12:31 ` Steven Rostedt
  0 siblings, 2 replies; 17+ messages in thread
From: Oleg Nesterov @ 2006-11-05 19:34 UTC (permalink / raw)
  To: Thomas Gleixner, Steven Rostedt
  Cc: Andrew Morton, Linus Torvalds, linux-kernel

When task->array != NULL, try_to_wake_up() just goes to "out_running" and sets
task->state = TASK_RUNNING.

In that case hrtimer_wakeup() does:

	timeout->task = NULL;		<----- [1]

	spin_lock(runqueues->lock);

	task->state = TASK_RUNNING;	<----- [2]

from Documentation/memory-barriers.txt

	Memory operations that occur before a LOCK operation may appear to
	happen after it completes.

This means that [2] may be completed before [1], and

CPU_0							CPU_1
rt_mutex_slowlock:

for (;;) {
	...
		if (timeout && !timeout->task)
			return -ETIMEDOUT;
	...

	schedule();
							hrtimer_wakeup() sets
	...						task->state = TASK_RUNNING,
							but "timeout->task = NULL"
							is not completed
	set_current_state(TASK_INTERRUPTIBLE);
}

we can miss a timeout.

Of course, this all is scholasticism, this can't happen in practice, but
may be this patch makes sense as a documentation update.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>

--- STATS/kernel/hrtimer.c~1_hrtw	2006-10-22 18:24:03.000000000 +0400
+++ STATS/kernel/hrtimer.c	2006-11-05 22:32:36.000000000 +0300
@@ -662,9 +662,12 @@ static int hrtimer_wakeup(struct hrtimer
 		container_of(timer, struct hrtimer_sleeper, timer);
 	struct task_struct *task = t->task;
 
-	t->task = NULL;
-	if (task)
+	if (task) {
+		t->task = NULL;
+		/* must be visible before task->state = TASK_RUNNING */
+		smp_wmb();
 		wake_up_process(task);
+	}
 
 	return HRTIMER_NORESTART;
 }


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
  2006-11-05 19:34 PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock() Oleg Nesterov
@ 2006-11-05 22:29 ` Linus Torvalds
  2006-11-05 22:53   ` Steven Rostedt
                     ` (2 more replies)
  2006-11-06 12:31 ` Steven Rostedt
  1 sibling, 3 replies; 17+ messages in thread
From: Linus Torvalds @ 2006-11-05 22:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Thomas Gleixner, Steven Rostedt, Andrew Morton, linux-kernel

On Sun, 5 Nov 2006, Oleg Nesterov wrote:
>
> When task->array != NULL, try_to_wake_up() just goes to "out_running" and sets
> task->state = TASK_RUNNING.
> 
> In that case hrtimer_wakeup() does:
> 
> 	timeout->task = NULL;		<----- [1]
> 
> 	spin_lock(runqueues->lock);
> 
> 	task->state = TASK_RUNNING;	<----- [2]
> 
> from Documentation/memory-barriers.txt
> 
> 	Memory operations that occur before a LOCK operation may appear to
> 	happen after it completes.
> 
> This means that [2] may be completed before [1], and

Yes. On x86 (and x86-64) you'll never see this, because writes are always 
seen in order regardless, and in addition, the spin_lock is actually 
totally serializing anyway. On most other architectures, the spin_lock 
will serialize all the writes too, but it's not guaranteed, so in theory 
you're right. I suspect no actual architecture will do this, but hey, 
when talking memory ordering, safe is a lot better than sorry.

That said, since "task->state" in only tested _inside_ the runqueue lock, 
there is no race that I can see. Since we've gotten the runqueue lock in 
order to even check task-state, the processor that _sets_ task state must 
not only have done the "spin_lock()", it must also have done the 
"spin_unlock()", and _that_ will not allow either the timeout or the task 
state to haev leaked out from under it (because that would imply that the 
critical region leaked out too).

So I don't think the race exists anyway - the schedule() will return 
immediately (because it will see TASK_RUNNING), and we'll just retry.

		Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
  2006-11-05 22:29 ` Linus Torvalds
@ 2006-11-05 22:53   ` Steven Rostedt
  2006-11-05 23:08     ` Oleg Nesterov
  2006-11-06  3:08     ` Linus Torvalds
  2006-11-05 22:53   ` Oleg Nesterov
  2006-11-06  8:57   ` Benjamin Herrenschmidt
  2 siblings, 2 replies; 17+ messages in thread
From: Steven Rostedt @ 2006-11-05 22:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Thomas Gleixner, Andrew Morton, linux-kernel


On Sun, 5 Nov 2006, Linus Torvalds wrote:

>
> That said, since "task->state" in only tested _inside_ the runqueue lock,
> there is no race that I can see. Since we've gotten the runqueue lock in
> order to even check task-state, the processor that _sets_ task state must
> not only have done the "spin_lock()", it must also have done the
> "spin_unlock()", and _that_ will not allow either the timeout or the task
> state to haev leaked out from under it (because that would imply that the
> critical region leaked out too).
>
> So I don't think the race exists anyway - the schedule() will return
> immediately (because it will see TASK_RUNNING), and we'll just retry.
>

This whole situation is very theoretical, but I think this actually can
happen *theoretically*.


OK, the spin_lock doesn't do any serialization, but the unlock does. But
the problem can happen before the unlock. Because of the loop.

CPU 1                                    CPU 2

    task_rq_lock()

    p->state = TASK_RUNNING;


                                      (from bottom of for loop)
                                      set_current_state(TASK_INTERRUPTIBLE);

                                    for (;;) {  (looping)

                                      if (timeout && !timeout->task)


   (now CPU implements)
   t->task = NULL

   task_rq_unlock();

                                   schedule() (with state == TASK_INTERRUPTIBLE)


Again, this is very theoretical, and I don't even think that this can
happen if you tried to make it.  But I guess if hardware were to change in
the future with the same rules that we have today with barriers, that this
can be a race.

-- Steve



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
  2006-11-05 22:53   ` Steven Rostedt
@ 2006-11-05 23:08     ` Oleg Nesterov
  2006-11-06  3:08     ` Linus Torvalds
  1 sibling, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2006-11-05 23:08 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Thomas Gleixner, Andrew Morton, linux-kernel

On 11/05, Steven Rostedt wrote:
> 
> On Sun, 5 Nov 2006, Linus Torvalds wrote:
> 
> >
> > That said, since "task->state" in only tested _inside_ the runqueue lock,
> > there is no race that I can see. Since we've gotten the runqueue lock in
> > order to even check task-state, the processor that _sets_ task state must
> > not only have done the "spin_lock()", it must also have done the
> > "spin_unlock()", and _that_ will not allow either the timeout or the task
> > state to haev leaked out from under it (because that would imply that the
> > critical region leaked out too).
> >
> > So I don't think the race exists anyway - the schedule() will return
> > immediately (because it will see TASK_RUNNING), and we'll just retry.
> >
> 
> This whole situation is very theoretical, but I think this actually can
> happen *theoretically*.
> 
> 
> OK, the spin_lock doesn't do any serialization, but the unlock does. But
> the problem can happen before the unlock. Because of the loop.

Yes, yes, thanks.

( Actually, this was more "is my understanding correct?" than a patch )

Thanks!

> CPU 1                                    CPU 2
> 
>     task_rq_lock()
> 
>     p->state = TASK_RUNNING;
> 
> 
>                                       (from bottom of for loop)
>                                       set_current_state(TASK_INTERRUPTIBLE);
> 
>                                     for (;;) {  (looping)
> 
>                                       if (timeout && !timeout->task)
> 
> 
>    (now CPU implements)
>    t->task = NULL
> 
>    task_rq_unlock();
> 
>                                    schedule() (with state == TASK_INTERRUPTIBLE)
> 
> 
> Again, this is very theoretical, and I don't even think that this can
> happen if you tried to make it.  But I guess if hardware were to change in
> the future with the same rules that we have today with barriers, that this
> can be a race.
> 
> -- Steve
> 
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
  2006-11-05 22:53   ` Steven Rostedt
  2006-11-05 23:08     ` Oleg Nesterov
@ 2006-11-06  3:08     ` Linus Torvalds
  2006-11-06 12:09       ` Oleg Nesterov
  1 sibling, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2006-11-06  3:08 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Oleg Nesterov, Thomas Gleixner, Andrew Morton, linux-kernel



On Sun, 5 Nov 2006, Steven Rostedt wrote:
> 
> This whole situation is very theoretical, but I think this actually can
> happen *theoretically*.
> 
> OK, the spin_lock doesn't do any serialization, but the unlock does. But
> the problem can happen before the unlock. Because of the loop.
> 
> CPU 1                                    CPU 2
> 
>     task_rq_lock()
> 
>     p->state = TASK_RUNNING;
> 
> 
>                                       (from bottom of for loop)
>                                       set_current_state(TASK_INTERRUPTIBLE);
> 
>                                     for (;;) {  (looping)
> 
>                                       if (timeout && !timeout->task)
> 
> 
>    (now CPU implements)
>    t->task = NULL
> 
>    task_rq_unlock();
> 
>                                    schedule() (with state == TASK_INTERRUPTIBLE)

Yeah, that seems a real bug. You _always_ need to actually do the thing 
that you wait for _before_ you want it up. That's what all the scheduling 
primitives depend on - you can't wake people up first, and then set the 
condition variable.

So if a rt_mutex depeds on something that is set inside the rq-lock, it 
needs to get the task rw-lock in order to check it.

		Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
  2006-11-06  3:08     ` Linus Torvalds
@ 2006-11-06 12:09       ` Oleg Nesterov
  2006-11-06 12:26         ` Steven Rostedt
  0 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2006-11-06 12:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Thomas Gleixner, Andrew Morton, linux-kernel

On 11/05, Linus Torvalds wrote:
> 
> On Sun, 5 Nov 2006, Steven Rostedt wrote:
> > 
> > This whole situation is very theoretical, but I think this actually can
> > happen *theoretically*.
> > 
> > OK, the spin_lock doesn't do any serialization, but the unlock does. But
> > the problem can happen before the unlock. Because of the loop.
> > 
> > CPU 1                                    CPU 2
> > 
> >     task_rq_lock()
> > 
> >     p->state = TASK_RUNNING;
> > 
> > 
> >                                       (from bottom of for loop)
> >                                       set_current_state(TASK_INTERRUPTIBLE);
> > 
> >                                     for (;;) {  (looping)
> > 
> >                                       if (timeout && !timeout->task)
> > 
> > 
> >    (now CPU implements)
> >    t->task = NULL
> > 
> >    task_rq_unlock();
> > 
> >                                    schedule() (with state == TASK_INTERRUPTIBLE)
> 
> Yeah, that seems a real bug. You _always_ need to actually do the thing 
> that you wait for _before_ you want it up. That's what all the scheduling 
> primitives depend on - you can't wake people up first, and then set the 
> condition variable.
> 
> So if a rt_mutex depeds on something that is set inside the rq-lock, it 
> needs to get the task rw-lock in order to check it.

No, rt_mutex is fine (I think).

My changelog was very unclean and confusing, I'll try again. What we are
doing is:

	rt_mutex_slowlock:

		task->state = TASK_INTERRUPTIBLE;

		mb();

		if (CONDITION)
			return -ETIMEDOUT;

		schedule();

This is common and correct.

	hrtimer_wakeup:

		CONDITION = 1;			// [1]

		spin_lock(rq->lock);

		task->state = TASK_RUNNING;	// [2]

This needs 'wmb()' between [1] and [2] unless spin_lock() garantees memory
ordering. Of course, rt_mutex can take rq->lock to solve this, but I don't
think it would be right.

Oleg.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
  2006-11-06 12:09       ` Oleg Nesterov
@ 2006-11-06 12:26         ` Steven Rostedt
  0 siblings, 0 replies; 17+ messages in thread
From: Steven Rostedt @ 2006-11-06 12:26 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Thomas Gleixner, Andrew Morton, linux-kernel

On Mon, 6 Nov 2006, Oleg Nesterov wrote:

> On 11/05, Linus Torvalds wrote:
> >
> > Yeah, that seems a real bug. You _always_ need to actually do the thing
> > that you wait for _before_ you want it up. That's what all the scheduling
> > primitives depend on - you can't wake people up first, and then set the
> > condition variable.
> >
> > So if a rt_mutex depeds on something that is set inside the rq-lock, it
> > needs to get the task rw-lock in order to check it.
>
> No, rt_mutex is fine (I think).

I agree.  The problem isn't with rt_mutex.  It rt_mutex doesn't depend on
that condition to break out of the loop. That condition is an extra
condition that's there to stop in the case we timed out before the owner
released the lock. The real condition to break out of the loop is the
owner releasing the lock, and there's no known races there.

So the time out is what needs to make sure that rt_mutex sees the event.

>
> My changelog was very unclean and confusing, I'll try again. What we are
> doing is:
>
> 	rt_mutex_slowlock:
>
> 		task->state = TASK_INTERRUPTIBLE;
>
> 		mb();
>
> 		if (CONDITION)
> 			return -ETIMEDOUT;
>
> 		schedule();
>
> This is common and correct.
>
> 	hrtimer_wakeup:
>
> 		CONDITION = 1;			// [1]

Right! The changing of the condition isn't even in any spin lock. But that
condition needs be be completed before the changes done within the spin
lock, but unfortunately, the spin lock itself doesn't guarantee anything.

>
> 		spin_lock(rq->lock);
>
> 		task->state = TASK_RUNNING;	// [2]
>
> This needs 'wmb()' between [1] and [2] unless spin_lock() garantees memory
> ordering. Of course, rt_mutex can take rq->lock to solve this, but I don't
> think it would be right.

Correct, the solution should NOT change rt_mutex.  Your original patch is
the correct approach.

-- Steve

>
> Oleg.
>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
  2006-11-05 22:29 ` Linus Torvalds
  2006-11-05 22:53   ` Steven Rostedt
@ 2006-11-05 22:53   ` Oleg Nesterov
  2006-11-06  8:57   ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2006-11-05 22:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Steven Rostedt, Andrew Morton, linux-kernel

On 11/05, Linus Torvalds wrote:
> 
> 
> On Sun, 5 Nov 2006, Oleg Nesterov wrote:
> >
> > When task->array != NULL, try_to_wake_up() just goes to "out_running" and sets
> > task->state = TASK_RUNNING.
> > 
> > In that case hrtimer_wakeup() does:
> > 
> > 	timeout->task = NULL;		<----- [1]
> > 
> > 	spin_lock(runqueues->lock);
> > 
> > 	task->state = TASK_RUNNING;	<----- [2]
> > 
> > from Documentation/memory-barriers.txt
> > 
> > 	Memory operations that occur before a LOCK operation may appear to
> > 	happen after it completes.
> > 
> > This means that [2] may be completed before [1], and
> 
> Yes. On x86 (and x86-64) you'll never see this, because writes are always 
> seen in order regardless, and in addition, the spin_lock is actually 
> totally serializing anyway. On most other architectures, the spin_lock 
> will serialize all the writes too, but it's not guaranteed, so in theory 
> you're right. I suspect no actual architecture will do this, but hey, 
> when talking memory ordering, safe is a lot better than sorry.
> 
> That said, since "task->state" in only tested _inside_ the runqueue lock, 
> there is no race that I can see. Since we've gotten the runqueue lock in 
> order to even check task-state, the processor that _sets_ task state must 
> not only have done the "spin_lock()", it must also have done the 
> "spin_unlock()", and _that_ will not allow either the timeout or the task 
> state to haev leaked out from under it (because that would imply that the 
> critical region leaked out too).
> 
> So I don't think the race exists anyway - the schedule() will return 
> immediately (because it will see TASK_RUNNING), and we'll just retry.

schedule() will see TASK_INTERRUPTIBLE. hrtimer_wakeup() sets TASK_RUNNING,
rt_mutex_slowlock() does set_current_state(TASK_INTERRUPTIBLE) after that.
schedule() takes runqueue lock, yes, but we are testing timeout->task before.

Oleg.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
  2006-11-05 22:29 ` Linus Torvalds
  2006-11-05 22:53   ` Steven Rostedt
  2006-11-05 22:53   ` Oleg Nesterov
@ 2006-11-06  8:57   ` Benjamin Herrenschmidt
  2006-11-06 12:35     ` Steven Rostedt
  2 siblings, 1 reply; 17+ messages in thread
From: Benjamin Herrenschmidt @ 2006-11-06  8:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Thomas Gleixner, Steven Rostedt, Andrew Morton,
	linux-kernel


> Yes. On x86 (and x86-64) you'll never see this, because writes are always 
> seen in order regardless, and in addition, the spin_lock is actually 
> totally serializing anyway. On most other architectures, the spin_lock 
> will serialize all the writes too, but it's not guaranteed, so in theory 
> you're right. I suspect no actual architecture will do this, but hey, 
> when talking memory ordering, safe is a lot better than sorry.

PowerPC doesn't serialize the writes on spin_lock, only on spin_unlock.

(That is, previous writes can "leak" into the lock, but writes done
before the unlock can't leak out of the spinlock).

Now, I've just glanced at the thread, so I don't know if that's relevant
to the problems you guys are talking about :-)

Ben.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
  2006-11-06  8:57   ` Benjamin Herrenschmidt
@ 2006-11-06 12:35     ` Steven Rostedt
  2006-11-06 20:53       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 17+ messages in thread
From: Steven Rostedt @ 2006-11-06 12:35 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Andrew Morton,
	linux-kernel

On Mon, 6 Nov 2006, Benjamin Herrenschmidt wrote:

>
> > Yes. On x86 (and x86-64) you'll never see this, because writes are always
> > seen in order regardless, and in addition, the spin_lock is actually
> > totally serializing anyway. On most other architectures, the spin_lock
> > will serialize all the writes too, but it's not guaranteed, so in theory
> > you're right. I suspect no actual architecture will do this, but hey,
> > when talking memory ordering, safe is a lot better than sorry.
>
> PowerPC doesn't serialize the writes on spin_lock, only on spin_unlock.
>
> (That is, previous writes can "leak" into the lock, but writes done
> before the unlock can't leak out of the spinlock).
>
> Now, I've just glanced at the thread, so I don't know if that's relevant
> to the problems you guys are talking about :-)
>

It is relevant.  In powerpc, can one write happen before another write?


  x = 1;
  barrier();  (only compiler barrier)
  b = 2;


And have CPU 2 see b=2 before seeing x=1?

If so, then I guess this is indeed a bug on powerpc.

-- Steve


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
  2006-11-06 12:35     ` Steven Rostedt
@ 2006-11-06 20:53       ` Benjamin Herrenschmidt
  2006-11-06 21:17         ` Steven Rostedt
  0 siblings, 1 reply; 17+ messages in thread
From: Benjamin Herrenschmidt @ 2006-11-06 20:53 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Andrew Morton,
	linux-kernel

On Mon, 2006-11-06 at 07:35 -0500, Steven Rostedt wrote:

> It is relevant.  In powerpc, can one write happen before another write?
> 
> 
>   x = 1;
>   barrier();  (only compiler barrier)
>   b = 2;
> 
> 
> And have CPU 2 see b=2 before seeing x=1?

Yes. Definitely.

> If so, then I guess this is indeed a bug on powerpc.

Ben.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
  2006-11-06 20:53       ` Benjamin Herrenschmidt
@ 2006-11-06 21:17         ` Steven Rostedt
  2006-11-06 21:41           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 17+ messages in thread
From: Steven Rostedt @ 2006-11-06 21:17 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Andrew Morton,
	linux-kernel


On Tue, 7 Nov 2006, Benjamin Herrenschmidt wrote:

> On Mon, 2006-11-06 at 07:35 -0500, Steven Rostedt wrote:
>
> > It is relevant.  In powerpc, can one write happen before another write?
> >
> >
> >   x = 1;
> >   barrier();  (only compiler barrier)
> >   b = 2;
> >
> >
> > And have CPU 2 see b=2 before seeing x=1?
>
> Yes. Definitely.

OK, I see in powerpc, that spin lock calls isync. This just clears the
pipeline. It doesn't touch the loads and stores, right?

So basically saying this:

   x=1;
   asm ("isync");
   b=2;

Would that have the same problem too?  Where another CPU can see x=1
before seeing b=2?

Thanks!

-- Steve


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
  2006-11-06 21:17         ` Steven Rostedt
@ 2006-11-06 21:41           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 17+ messages in thread
From: Benjamin Herrenschmidt @ 2006-11-06 21:41 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Andrew Morton,
	linux-kernel

On Mon, 2006-11-06 at 16:17 -0500, Steven Rostedt wrote:
> On Tue, 7 Nov 2006, Benjamin Herrenschmidt wrote:
> 
> > On Mon, 2006-11-06 at 07:35 -0500, Steven Rostedt wrote:
> >
> > > It is relevant.  In powerpc, can one write happen before another write?
> > >
> > >
> > >   x = 1;
> > >   barrier();  (only compiler barrier)
> > >   b = 2;
> > >
> > >
> > > And have CPU 2 see b=2 before seeing x=1?
> >
> > Yes. Definitely.
> 
> OK, I see in powerpc, that spin lock calls isync. This just clears the
> pipeline. It doesn't touch the loads and stores, right?

Yes. That isync is to prevent loads to be speculated accross spin_lock,
thus leaking out of the lock by the top. In fact, it doesn't act on the
load per-se but it prevent speculative execution accross the conditional
branch in the spin_lock. 

> So basically saying this:
> 
>    x=1;
>    asm ("isync");
>    b=2;
> 
> Would that have the same problem too?  Where another CPU can see x=1
> before seeing b=2?

Yes.

What isync provides is

a = *foo
spin_lock_loop_with_conditional_branch
isync
b = *bar

It prevents the read of b from being speculated by the CPU ahead of a

Ben.




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
  2006-11-05 19:34 PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock() Oleg Nesterov
  2006-11-05 22:29 ` Linus Torvalds
@ 2006-11-06 12:31 ` Steven Rostedt
  2006-11-06 14:15   ` Oleg Nesterov
  1 sibling, 1 reply; 17+ messages in thread
From: Steven Rostedt @ 2006-11-06 12:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Thomas Gleixner, Andrew Morton, Linus Torvalds, LKML, Ingo Molnar



On Sun, 5 Nov 2006, Oleg Nesterov wrote:

> When task->array != NULL, try_to_wake_up() just goes to "out_running" and sets
> task->state = TASK_RUNNING.
>
> In that case hrtimer_wakeup() does:
>
> 	timeout->task = NULL;		<----- [1]
>
> 	spin_lock(runqueues->lock);
>
> 	task->state = TASK_RUNNING;	<----- [2]
>
> from Documentation/memory-barriers.txt
>
> 	Memory operations that occur before a LOCK operation may appear to
> 	happen after it completes.
>
> This means that [2] may be completed before [1], and
>
> CPU_0							CPU_1
> rt_mutex_slowlock:
>
> for (;;) {
> 	...
> 		if (timeout && !timeout->task)
> 			return -ETIMEDOUT;
> 	...
>
> 	schedule();
> 							hrtimer_wakeup() sets
> 	...						task->state = TASK_RUNNING,
> 							but "timeout->task = NULL"
> 							is not completed
> 	set_current_state(TASK_INTERRUPTIBLE);
> }
>
> we can miss a timeout.
>
> Of course, this all is scholasticism, this can't happen in practice, but
> may be this patch makes sense as a documentation update.
>
> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
>
> --- STATS/kernel/hrtimer.c~1_hrtw	2006-10-22 18:24:03.000000000 +0400
> +++ STATS/kernel/hrtimer.c	2006-11-05 22:32:36.000000000 +0300
> @@ -662,9 +662,12 @@ static int hrtimer_wakeup(struct hrtimer
>  		container_of(timer, struct hrtimer_sleeper, timer);
>  	struct task_struct *task = t->task;
>
> -	t->task = NULL;
> -	if (task)
> +	if (task) {
> +		t->task = NULL;
> +		/* must be visible before task->state = TASK_RUNNING */
> +		smp_wmb();
>  		wake_up_process(task);
> +	}
>
>  	return HRTIMER_NORESTART;
>  }
>
>

Ingo, Thomas, could you ack this too, or Nack it with an explaination.

Ingo, you might want to read the whole thread here.

Acked-by: Steven Rostedt <rostedt@goodmis.org>

-- Steve


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
  2006-11-06 12:31 ` Steven Rostedt
@ 2006-11-06 14:15   ` Oleg Nesterov
  2006-11-06 14:11     ` Steven Rostedt
  0 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2006-11-06 14:15 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Andrew Morton, Linus Torvalds, LKML, Ingo Molnar

On 11/06, Steven Rostedt wrote:
> 
> Acked-by: Steven Rostedt <rostedt@goodmis.org>

Thanks.

But on the other hand we probably have a similar code (set condition +
wake_up_process()) in other places too, and __wake_up(wait_queue_head_t)
has (in theory) the same problem. Probably we can add something like

	smp_wmb_unless_spin_lock_implies_memory_barrier_on_this_arch()

somewhere in try_to_wake_up(). I dunno.

Oleg.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
  2006-11-06 14:15   ` Oleg Nesterov
@ 2006-11-06 14:11     ` Steven Rostedt
  2006-11-06 15:05       ` Steven Rostedt
  0 siblings, 1 reply; 17+ messages in thread
From: Steven Rostedt @ 2006-11-06 14:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Thomas Gleixner, Andrew Morton, Linus Torvalds, LKML, Ingo Molnar

On Mon, 6 Nov 2006, Oleg Nesterov wrote:

> On 11/06, Steven Rostedt wrote:
> >
> > Acked-by: Steven Rostedt <rostedt@goodmis.org>
>
> Thanks.
>
> But on the other hand we probably have a similar code (set condition +
> wake_up_process()) in other places too, and __wake_up(wait_queue_head_t)
> has (in theory) the same problem. Probably we can add something like
>
> 	smp_wmb_unless_spin_lock_implies_memory_barrier_on_this_arch()

Probably smp_wmb_before_spin_lock would be a good name.

>
> somewhere in try_to_wake_up(). I dunno.
>

But I don't think it belongs in try_to_wake_up. That's just assuming too
much in that function. The event to wake up could simply be an interrupt
took place, where the smb_wmb would not be needed.

As Linus has stated.  Most cases where a process is going to sleep on an
event, it had better check to make sure that the event didn't happen
between the time it set itself to something other than TASK_RUNNING and
when it calls schedule.

I guess what this points out is that in such a case, the event maker must
make sure that the event is visible before it wakes up the process that's
waiting on the event.  But putting that logic into try_to_wake_up, IMHO is
an overkill.

But...

Doing a quick search on wake_up, I came across loop.c (also my first hit).

Here we have: (some funcs pulled in).

loop_make_request:
  lo->lo_bio = lo->lo_biotail = bio;
  wake_up(&lo->lo_event);

And with wait_event_interruptible(lo->lo_event, lo->lo_bio)
we have:

  prepare_to_wait(..., TASK_INTERRUPTIBLE);
  if (lo->bio)
     break;
  schedule_timeout();

So this probably also has the same bug!

There's nothing to prevent the CPU from showing lo->bio has changed
*after* setting wake_up to TASK_RUNNING.

So maybe this *is* a race condition that is all over the kernel for other
architectures!

-- Steve

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
  2006-11-06 14:11     ` Steven Rostedt
@ 2006-11-06 15:05       ` Steven Rostedt
  0 siblings, 0 replies; 17+ messages in thread
From: Steven Rostedt @ 2006-11-06 15:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Thomas Gleixner, Andrew Morton, Linus Torvalds, LKML, Ingo Molnar,
	sct


On Mon, 6 Nov 2006, Steven Rostedt wrote:

> But...
>
> Doing a quick search on wake_up, I came across loop.c (also my first hit).
>
> Here we have: (some funcs pulled in).
>
> loop_make_request:
>   lo->lo_bio = lo->lo_biotail = bio;
>   wake_up(&lo->lo_event);
>
> And with wait_event_interruptible(lo->lo_event, lo->lo_bio)
> we have:
>
>   prepare_to_wait(..., TASK_INTERRUPTIBLE);
>   if (lo->bio)
>      break;
>   schedule_timeout();
>
>
> So this probably also has the same bug!
>
> There's nothing to prevent the CPU from showing lo->bio has changed
> *after* setting wake_up to TASK_RUNNING.
>
> So maybe this *is* a race condition that is all over the kernel for other
> architectures!

Talking this over with Stephen Tweedie, he pointed out to me that the
setting of the task state is done within the wait queue locks, and thus,
the above is not a problem.

But the old paradigm of:

  add_wait_queue();
  for (;;) {
	task->state = TASK_INTERRUPTIBLE;
	if (condition)
		break;
	schedule();
  }

Can still be a bug, since the other CPU might hold off the condition
to after it set the state to TASK_RUNNING.

-- Steve

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2006-11-06 21:41 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-11-05 19:34 PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock() Oleg Nesterov
2006-11-05 22:29 ` Linus Torvalds
2006-11-05 22:53   ` Steven Rostedt
2006-11-05 23:08     ` Oleg Nesterov
2006-11-06  3:08     ` Linus Torvalds
2006-11-06 12:09       ` Oleg Nesterov
2006-11-06 12:26         ` Steven Rostedt
2006-11-05 22:53   ` Oleg Nesterov
2006-11-06  8:57   ` Benjamin Herrenschmidt
2006-11-06 12:35     ` Steven Rostedt
2006-11-06 20:53       ` Benjamin Herrenschmidt
2006-11-06 21:17         ` Steven Rostedt
2006-11-06 21:41           ` Benjamin Herrenschmidt
2006-11-06 12:31 ` Steven Rostedt
2006-11-06 14:15   ` Oleg Nesterov
2006-11-06 14:11     ` Steven Rostedt
2006-11-06 15:05       ` Steven Rostedt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox