public inbox for linux-kernel@vger.kernel.org
* 3.9.x:  Possible race related to stop_machine leads to lockup.
@ 2013-06-04 21:18 Ben Greear
  2013-06-04 22:13 ` Ben Greear
  0 siblings, 1 reply; 4+ messages in thread
From: Ben Greear @ 2013-06-04 21:18 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Rusty Russell, Thomas Gleixner

I've been trying to figure out why I see the migration/* processes
hang in a busy loop....

While reading the stop_machine.c file, I think I might have an
answer.

The set_state() function sets thread_ack to the current number
of threads.  Each thread's state machine then decrements it; the
last thread to reach zero bumps the state to the next level.  This
lets each CPU step through the states in lock-step, it seems.

But, from what I can tell, __stop_machine() can
(re)set the state to STOPMACHINE_PREPARE while the migration
processes are still in their loop.  That would explain why they
sometimes loop forever.

Does this make sense?

Any ideas on how to fix this properly?

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: 3.9.x:  Possible race related to stop_machine leads to lockup.
  2013-06-04 21:18 3.9.x: Possible race related to stop_machine leads to lockup Ben Greear
@ 2013-06-04 22:13 ` Ben Greear
  2013-06-05  4:41   ` Rusty Russell
  0 siblings, 1 reply; 4+ messages in thread
From: Ben Greear @ 2013-06-04 22:13 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Rusty Russell, Thomas Gleixner

On 06/04/2013 02:18 PM, Ben Greear wrote:
> I've been trying to figure out why I see the migration/* processes
> hang in a busy loop....
>
> While reading the stop_machine.c file, I think I might have an
> answer.
>
> The set_state() function sets thread_ack to the current number
> of threads.  Each thread's state machine then decrements it; the
> last thread to reach zero bumps the state to the next level.  This
> lets each CPU step through the states in lock-step, it seems.
>
> But, from what I can tell, __stop_machine() can
> (re)set the state to STOPMACHINE_PREPARE while the migration
> processes are still in their loop.  That would explain why they
> sometimes loop forever.
>
> Does this make sense?

Err, no..that doesn't make sense.  'smdata' is on the stack.

More printk debugging makes it look like one thread just
never notices that smdata->state has been updated by another
thread.

There is this comment... maybe cpu_relax() only does the "chill out"
part, and we need something else to make sure smdata->state is
freshly read from the other CPU's cache?

		/* Chill out and ensure we re-read stopmachine_state. */
		cpu_relax();
		if (smdata->state != curstate) {

Gah..way out of my league :P

Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: 3.9.x:  Possible race related to stop_machine leads to lockup.
  2013-06-04 22:13 ` Ben Greear
@ 2013-06-05  4:41   ` Rusty Russell
  2013-06-05 15:11     ` Ben Greear
  0 siblings, 1 reply; 4+ messages in thread
From: Rusty Russell @ 2013-06-05  4:41 UTC (permalink / raw)
  To: Ben Greear, Linux Kernel Mailing List; +Cc: Thomas Gleixner, Tejun Heo

Ben Greear <greearb@candelatech.com> writes:
> On 06/04/2013 02:18 PM, Ben Greear wrote:
>> I've been trying to figure out why I see the migration/* processes
>> hang in a busy loop....
>>
>> While reading the stop_machine.c file, I think I might have an
>> answer.
>>
>> The set_state() function sets thread_ack to the current number
>> of threads.  Each thread's state machine then decrements it; the
>> last thread to reach zero bumps the state to the next level.  This
>> lets each CPU step through the states in lock-step, it seems.
>>
>> But, from what I can tell, __stop_machine() can
>> (re)set the state to STOPMACHINE_PREPARE while the migration
>> processes are still in their loop.  That would explain why they
>> sometimes loop forever.
>>
>> Does this make sense?
>
> Err, no..that doesn't make sense.  'smdata' is on the stack.
>
> More printk debugging makes it look like one thread just
> never notices that smdata->state has been updated by another
> thread.
>
> There is this comment... maybe cpu_relax() only does the "chill out"
> part, and we need something else to make sure smdata->state is
> freshly read from the other CPU's cache?
>
> 		/* Chill out and ensure we re-read stopmachine_state. */
> 		cpu_relax();
> 		if (smdata->state != curstate) {
>
> Gah..way out of my league :P

What architecture?  Maybe someone didn't get the memo; cpu_relax()
should be a read barrier.

Cheers,
Rusty.


* Re: 3.9.x:  Possible race related to stop_machine leads to lockup.
  2013-06-05  4:41   ` Rusty Russell
@ 2013-06-05 15:11     ` Ben Greear
  0 siblings, 0 replies; 4+ messages in thread
From: Ben Greear @ 2013-06-05 15:11 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Linux Kernel Mailing List, Thomas Gleixner, Tejun Heo

On 06/04/2013 09:41 PM, Rusty Russell wrote:
> Ben Greear <greearb@candelatech.com> writes:
>> On 06/04/2013 02:18 PM, Ben Greear wrote:
>>> I've been trying to figure out why I see the migration/* processes
>>> hang in a busy loop....
>>>
>>> While reading the stop_machine.c file, I think I might have an
>>> answer.
>>>
>>> The set_state() function sets thread_ack to the current number
>>> of threads.  Each thread's state machine then decrements it; the
>>> last thread to reach zero bumps the state to the next level.  This
>>> lets each CPU step through the states in lock-step, it seems.
>>>
>>> But, from what I can tell, __stop_machine() can
>>> (re)set the state to STOPMACHINE_PREPARE while the migration
>>> processes are still in their loop.  That would explain why they
>>> sometimes loop forever.
>>>
>>> Does this make sense?
>>
>> Err, no..that doesn't make sense.  'smdata' is on the stack.
>>
>> More printk debugging makes it look like one thread just
>> never notices that smdata->state has been updated by another
>> thread.
>>
>> There is this comment... maybe cpu_relax() only does the "chill out"
>> part, and we need something else to make sure smdata->state is
>> freshly read from the other CPU's cache?
>>
>> 		/* Chill out and ensure we re-read stopmachine_state. */
>> 		cpu_relax();
>> 		if (smdata->state != curstate) {
>>
>> Gah..way out of my league :P
>
> What architecture?  Maybe someone didn't get the memo; cpu_relax()
> should be a read barrier.

I tried making it an smp read barrier, and tried using atomic_t for the state
object.  No big help.

Latest theory is that one thread gets stuck handling IRQs while the rest of
the CPUs have disabled IRQs, and that one CPU/thread never gets back to the
CPU-shutdown state machine.

I'll post a more complete debugging patch later today, and try to find
a better way to reproduce it.

Thanks,
Ben
>
> Cheers,
> Rusty.
>


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



end of thread, other threads:[~2013-06-05 15:14 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-04 21:18 3.9.x: Possible race related to stop_machine leads to lockup Ben Greear
2013-06-04 22:13 ` Ben Greear
2013-06-05  4:41   ` Rusty Russell
2013-06-05 15:11     ` Ben Greear
