* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
2006-11-05 22:29 ` Linus Torvalds
@ 2006-11-05 22:53 ` Steven Rostedt
2006-11-05 23:08 ` Oleg Nesterov
2006-11-06 3:08 ` Linus Torvalds
2006-11-05 22:53 ` Oleg Nesterov
2006-11-06 8:57 ` Benjamin Herrenschmidt
2 siblings, 2 replies; 17+ messages in thread
From: Steven Rostedt @ 2006-11-05 22:53 UTC (permalink / raw)
To: Linus Torvalds
Cc: Oleg Nesterov, Thomas Gleixner, Andrew Morton, linux-kernel
On Sun, 5 Nov 2006, Linus Torvalds wrote:
>
> That said, since "task->state" is only tested _inside_ the runqueue lock,
> there is no race that I can see. Since we've gotten the runqueue lock in
> order to even check task->state, the processor that _sets_ task state must
> not only have done the "spin_lock()", it must also have done the
> "spin_unlock()", and _that_ will not allow either the timeout or the task
> state to have leaked out from under it (because that would imply that the
> critical region leaked out too).
>
> So I don't think the race exists anyway - the schedule() will return
> immediately (because it will see TASK_RUNNING), and we'll just retry.
>
This whole situation is very theoretical, but I think this actually can
happen *theoretically*.
OK, the spin_lock doesn't do any serialization, but the unlock does. But
the problem can happen before the unlock, because of the loop.
CPU 1                                   CPU 2
-----                                   -----
task_rq_lock()

p->state = TASK_RUNNING;
                                        (from bottom of for loop)
                                        set_current_state(TASK_INTERRUPTIBLE);

                                        for (;;) { (looping)

                                        if (timeout && !timeout->task)

(now CPU implements)
t->task = NULL

task_rq_unlock();
                                        schedule() (with state == TASK_INTERRUPTIBLE)
Again, this is very theoretical, and I don't think this can happen even if
you tried to make it happen. But if hardware were to change in the future
while keeping only the ordering rules our barriers guarantee today, this
could become a real race.
-- Steve
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
2006-11-05 22:53 ` Steven Rostedt
@ 2006-11-05 23:08 ` Oleg Nesterov
2006-11-06 3:08 ` Linus Torvalds
1 sibling, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2006-11-05 23:08 UTC (permalink / raw)
To: Steven Rostedt
Cc: Linus Torvalds, Thomas Gleixner, Andrew Morton, linux-kernel
On 11/05, Steven Rostedt wrote:
>
> On Sun, 5 Nov 2006, Linus Torvalds wrote:
>
> >
> > That said, since "task->state" is only tested _inside_ the runqueue lock,
> > there is no race that I can see. Since we've gotten the runqueue lock in
> > order to even check task->state, the processor that _sets_ task state must
> > not only have done the "spin_lock()", it must also have done the
> > "spin_unlock()", and _that_ will not allow either the timeout or the task
> > state to have leaked out from under it (because that would imply that the
> > critical region leaked out too).
> >
> > So I don't think the race exists anyway - the schedule() will return
> > immediately (because it will see TASK_RUNNING), and we'll just retry.
> >
>
> This whole situation is very theoretical, but I think this actually can
> happen *theoretically*.
>
>
> OK, the spin_lock doesn't do any serialization, but the unlock does. But
> the problem can happen before the unlock, because of the loop.
Yes, yes, thanks.
( Actually, this was more "is my understanding correct?" than a patch )
Thanks!
> CPU 1                                   CPU 2
> -----                                   -----
> task_rq_lock()
>
> p->state = TASK_RUNNING;
>                                         (from bottom of for loop)
>                                         set_current_state(TASK_INTERRUPTIBLE);
>
>                                         for (;;) { (looping)
>
>                                         if (timeout && !timeout->task)
>
> (now CPU implements)
> t->task = NULL
>
> task_rq_unlock();
>                                         schedule() (with state == TASK_INTERRUPTIBLE)
>
>
> Again, this is very theoretical, and I don't even think that this can
> happen if you tried to make it. But I guess if hardware were to change in
> the future with the same rules that we have today with barriers, that this
> can be a race.
>
> -- Steve
>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
2006-11-05 22:53 ` Steven Rostedt
2006-11-05 23:08 ` Oleg Nesterov
@ 2006-11-06 3:08 ` Linus Torvalds
2006-11-06 12:09 ` Oleg Nesterov
1 sibling, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2006-11-06 3:08 UTC (permalink / raw)
To: Steven Rostedt
Cc: Oleg Nesterov, Thomas Gleixner, Andrew Morton, linux-kernel
On Sun, 5 Nov 2006, Steven Rostedt wrote:
>
> This whole situation is very theoretical, but I think this actually can
> happen *theoretically*.
>
> OK, the spin_lock doesn't do any serialization, but the unlock does. But
> the problem can happen before the unlock, because of the loop.
>
> CPU 1                                   CPU 2
> -----                                   -----
> task_rq_lock()
>
> p->state = TASK_RUNNING;
>                                         (from bottom of for loop)
>                                         set_current_state(TASK_INTERRUPTIBLE);
>
>                                         for (;;) { (looping)
>
>                                         if (timeout && !timeout->task)
>
> (now CPU implements)
> t->task = NULL
>
> task_rq_unlock();
>                                         schedule() (with state == TASK_INTERRUPTIBLE)
Yeah, that seems a real bug. You _always_ need to actually do the thing
that you wait for _before_ you wake it up. That's what all the scheduling
primitives depend on - you can't wake people up first, and then set the
condition variable.

So if a rt_mutex depends on something that is set inside the rq-lock, it
needs to take the task's rq-lock in order to check it.
Linus
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
2006-11-06 3:08 ` Linus Torvalds
@ 2006-11-06 12:09 ` Oleg Nesterov
2006-11-06 12:26 ` Steven Rostedt
0 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2006-11-06 12:09 UTC (permalink / raw)
To: Linus Torvalds
Cc: Steven Rostedt, Thomas Gleixner, Andrew Morton, linux-kernel
On 11/05, Linus Torvalds wrote:
>
> On Sun, 5 Nov 2006, Steven Rostedt wrote:
> >
> > This whole situation is very theoretical, but I think this actually can
> > happen *theoretically*.
> >
> > OK, the spin_lock doesn't do any serialization, but the unlock does. But
> > the problem can happen before the unlock, because of the loop.
> >
> > CPU 1                                   CPU 2
> > -----                                   -----
> > task_rq_lock()
> >
> > p->state = TASK_RUNNING;
> >                                         (from bottom of for loop)
> >                                         set_current_state(TASK_INTERRUPTIBLE);
> >
> >                                         for (;;) { (looping)
> >
> >                                         if (timeout && !timeout->task)
> >
> > (now CPU implements)
> > t->task = NULL
> >
> > task_rq_unlock();
> >                                         schedule() (with state == TASK_INTERRUPTIBLE)
>
> Yeah, that seems a real bug. You _always_ need to actually do the thing
> that you wait for _before_ you wake it up. That's what all the scheduling
> primitives depend on - you can't wake people up first, and then set the
> condition variable.
>
> So if a rt_mutex depends on something that is set inside the rq-lock, it
> needs to take the task's rq-lock in order to check it.
No, rt_mutex is fine (I think).
My changelog was very unclear and confusing, I'll try again. What we are
doing is:
rt_mutex_slowlock:

	task->state = TASK_INTERRUPTIBLE;
	mb();
	if (CONDITION)
		return -ETIMEDOUT;
	schedule();

This is common and correct.
hrtimer_wakeup:

	CONDITION = 1;			// [1]
	spin_lock(rq->lock);
	task->state = TASK_RUNNING;	// [2]

This needs 'wmb()' between [1] and [2] unless spin_lock() guarantees memory
ordering. Of course, rt_mutex can take rq->lock to solve this, but I don't
think it would be right.
Oleg.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
2006-11-06 12:09 ` Oleg Nesterov
@ 2006-11-06 12:26 ` Steven Rostedt
0 siblings, 0 replies; 17+ messages in thread
From: Steven Rostedt @ 2006-11-06 12:26 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, Thomas Gleixner, Andrew Morton, linux-kernel
On Mon, 6 Nov 2006, Oleg Nesterov wrote:
> On 11/05, Linus Torvalds wrote:
> >
> > Yeah, that seems a real bug. You _always_ need to actually do the thing
> > that you wait for _before_ you wake it up. That's what all the scheduling
> > primitives depend on - you can't wake people up first, and then set the
> > condition variable.
> >
> > So if a rt_mutex depends on something that is set inside the rq-lock, it
> > needs to take the task's rq-lock in order to check it.
>
> No, rt_mutex is fine (I think).
I agree. The problem isn't with rt_mutex. The rt_mutex doesn't depend on
that condition to break out of the loop; that condition is an extra check
that's there to stop waiting in case we timed out before the owner
released the lock. The real condition for breaking out of the loop is the
owner releasing the lock, and there are no known races there.

So the timeout is what needs to make sure that rt_mutex sees the event.
>
> My changelog was very unclear and confusing, I'll try again. What we are
> doing is:
>
> rt_mutex_slowlock:
>
> 	task->state = TASK_INTERRUPTIBLE;
> 	mb();
> 	if (CONDITION)
> 		return -ETIMEDOUT;
> 	schedule();
>
> This is common and correct.
>
> hrtimer_wakeup:
>
> 	CONDITION = 1;			// [1]
Right! The changing of the condition isn't even inside any spin lock. But
that change needs to be completed before the changes done within the spin
lock, and unfortunately, the spin lock itself doesn't guarantee that
ordering.
>
> 	spin_lock(rq->lock);
>
> 	task->state = TASK_RUNNING;	// [2]
>
> This needs 'wmb()' between [1] and [2] unless spin_lock() guarantees memory
> ordering. Of course, rt_mutex can take rq->lock to solve this, but I don't
> think it would be right.
Correct, the solution should NOT change rt_mutex. Your original patch is
the correct approach.
-- Steve
>
> Oleg.
>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
2006-11-05 22:29 ` Linus Torvalds
2006-11-05 22:53 ` Steven Rostedt
@ 2006-11-05 22:53 ` Oleg Nesterov
2006-11-06 8:57 ` Benjamin Herrenschmidt
2 siblings, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2006-11-05 22:53 UTC (permalink / raw)
To: Linus Torvalds
Cc: Thomas Gleixner, Steven Rostedt, Andrew Morton, linux-kernel
On 11/05, Linus Torvalds wrote:
>
>
> On Sun, 5 Nov 2006, Oleg Nesterov wrote:
> >
> > When task->array != NULL, try_to_wake_up() just goes to "out_running" and sets
> > task->state = TASK_RUNNING.
> >
> > In that case hrtimer_wakeup() does:
> >
> > timeout->task = NULL; <----- [1]
> >
> > spin_lock(runqueues->lock);
> >
> > task->state = TASK_RUNNING; <----- [2]
> >
> > from Documentation/memory-barriers.txt
> >
> > Memory operations that occur before a LOCK operation may appear to
> > happen after it completes.
> >
> > This means that [2] may be completed before [1], and
>
> Yes. On x86 (and x86-64) you'll never see this, because writes are always
> seen in order regardless, and in addition, the spin_lock is actually
> totally serializing anyway. On most other architectures, the spin_lock
> will serialize all the writes too, but it's not guaranteed, so in theory
> you're right. I suspect no actual architecture will do this, but hey,
> when talking memory ordering, safe is a lot better than sorry.
>
> That said, since "task->state" is only tested _inside_ the runqueue lock,
> there is no race that I can see. Since we've gotten the runqueue lock in
> order to even check task->state, the processor that _sets_ task state must
> not only have done the "spin_lock()", it must also have done the
> "spin_unlock()", and _that_ will not allow either the timeout or the task
> state to have leaked out from under it (because that would imply that the
> critical region leaked out too).
>
> So I don't think the race exists anyway - the schedule() will return
> immediately (because it will see TASK_RUNNING), and we'll just retry.
schedule() will see TASK_INTERRUPTIBLE. hrtimer_wakeup() sets TASK_RUNNING,
but rt_mutex_slowlock() does set_current_state(TASK_INTERRUPTIBLE) after
that. schedule() takes the runqueue lock, yes, but we test timeout->task
before that.
Oleg.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
2006-11-05 22:29 ` Linus Torvalds
2006-11-05 22:53 ` Steven Rostedt
2006-11-05 22:53 ` Oleg Nesterov
@ 2006-11-06 8:57 ` Benjamin Herrenschmidt
2006-11-06 12:35 ` Steven Rostedt
2 siblings, 1 reply; 17+ messages in thread
From: Benjamin Herrenschmidt @ 2006-11-06 8:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: Oleg Nesterov, Thomas Gleixner, Steven Rostedt, Andrew Morton,
linux-kernel
> Yes. On x86 (and x86-64) you'll never see this, because writes are always
> seen in order regardless, and in addition, the spin_lock is actually
> totally serializing anyway. On most other architectures, the spin_lock
> will serialize all the writes too, but it's not guaranteed, so in theory
> you're right. I suspect no actual architecture will do this, but hey,
> when talking memory ordering, safe is a lot better than sorry.
PowerPC doesn't serialize the writes on spin_lock, only on spin_unlock.
(That is, previous writes can "leak" into the lock, but writes done
before the unlock can't leak out of the spinlock).
Now, I've just glanced at the thread, so I don't know if that's relevant
to the problems you guys are talking about :-)
Ben.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
2006-11-06 8:57 ` Benjamin Herrenschmidt
@ 2006-11-06 12:35 ` Steven Rostedt
2006-11-06 20:53 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 17+ messages in thread
From: Steven Rostedt @ 2006-11-06 12:35 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Andrew Morton,
linux-kernel
On Mon, 6 Nov 2006, Benjamin Herrenschmidt wrote:
>
> > Yes. On x86 (and x86-64) you'll never see this, because writes are always
> > seen in order regardless, and in addition, the spin_lock is actually
> > totally serializing anyway. On most other architectures, the spin_lock
> > will serialize all the writes too, but it's not guaranteed, so in theory
> > you're right. I suspect no actual architecture will do this, but hey,
> > when talking memory ordering, safe is a lot better than sorry.
>
> PowerPC doesn't serialize the writes on spin_lock, only on spin_unlock.
>
> (That is, previous writes can "leak" into the lock, but writes done
> before the unlock can't leak out of the spinlock).
>
> Now, I've just glanced at the thread, so I don't know if that's relevant
> to the problems you guys are talking about :-)
>
It is relevant. On powerpc, can one write become visible before another write?

	x = 1;
	barrier();	/* compiler barrier only */
	b = 2;

And have CPU 2 see b=2 before seeing x=1?
If so, then I guess this is indeed a bug on powerpc.
-- Steve
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
2006-11-06 12:35 ` Steven Rostedt
@ 2006-11-06 20:53 ` Benjamin Herrenschmidt
2006-11-06 21:17 ` Steven Rostedt
0 siblings, 1 reply; 17+ messages in thread
From: Benjamin Herrenschmidt @ 2006-11-06 20:53 UTC (permalink / raw)
To: Steven Rostedt
Cc: Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Andrew Morton,
linux-kernel
On Mon, 2006-11-06 at 07:35 -0500, Steven Rostedt wrote:
> It is relevant. On powerpc, can one write become visible before another write?
>
> 	x = 1;
> 	barrier();	/* compiler barrier only */
> 	b = 2;
>
> And have CPU 2 see b=2 before seeing x=1?
Yes. Definitely.
> If so, then I guess this is indeed a bug on powerpc.
Ben.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
2006-11-06 20:53 ` Benjamin Herrenschmidt
@ 2006-11-06 21:17 ` Steven Rostedt
2006-11-06 21:41 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 17+ messages in thread
From: Steven Rostedt @ 2006-11-06 21:17 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Andrew Morton,
linux-kernel
On Tue, 7 Nov 2006, Benjamin Herrenschmidt wrote:
> On Mon, 2006-11-06 at 07:35 -0500, Steven Rostedt wrote:
>
> > It is relevant. On powerpc, can one write become visible before another write?
> >
> > 	x = 1;
> > 	barrier();	/* compiler barrier only */
> > 	b = 2;
> >
> > And have CPU 2 see b=2 before seeing x=1?
>
> Yes. Definitely.
OK, I see that on powerpc, spin_lock calls isync. This just clears the
pipeline; it doesn't order the loads and stores, right?
So basically, saying this:

	x = 1;
	asm("isync");
	b = 2;

Would that have the same problem too, where another CPU can see b=2
before seeing x=1?
Thanks!
-- Steve
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: PATCH? hrtimer_wakeup: fix a theoretical race wrt rt_mutex_slowlock()
2006-11-06 21:17 ` Steven Rostedt
@ 2006-11-06 21:41 ` Benjamin Herrenschmidt
0 siblings, 0 replies; 17+ messages in thread
From: Benjamin Herrenschmidt @ 2006-11-06 21:41 UTC (permalink / raw)
To: Steven Rostedt
Cc: Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Andrew Morton,
linux-kernel
On Mon, 2006-11-06 at 16:17 -0500, Steven Rostedt wrote:
> On Tue, 7 Nov 2006, Benjamin Herrenschmidt wrote:
>
> > On Mon, 2006-11-06 at 07:35 -0500, Steven Rostedt wrote:
> >
> > > It is relevant. On powerpc, can one write become visible before another write?
> > >
> > > 	x = 1;
> > > 	barrier();	/* compiler barrier only */
> > > 	b = 2;
> > >
> > > And have CPU 2 see b=2 before seeing x=1?
> >
> > Yes. Definitely.
>
> OK, I see that on powerpc, spin_lock calls isync. This just clears the
> pipeline; it doesn't order the loads and stores, right?
Yes. That isync is there to prevent loads from being speculated across
spin_lock, thus leaking out of the lock at the top. In fact, it doesn't
act on the load per se; it prevents speculative execution across the
conditional branch in the spin_lock.
> So basically, saying this:
>
> 	x = 1;
> 	asm("isync");
> 	b = 2;
>
> Would that have the same problem too, where another CPU can see b=2
> before seeing x=1?
Yes.
What isync provides is:

	a = *foo;
	spin_lock_loop_with_conditional_branch
	isync
	b = *bar;

It prevents the read of b from being speculated by the CPU ahead of a.
Ben.
^ permalink raw reply [flat|nested] 17+ messages in thread