Re: 3.9.x: Possible race related to stop_machine leads to lockup.

From: Ben Greear <greearb@candelatech.com>
To: Rusty Russell <rusty@rustcorp.com.au>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>, Tejun Heo <tj@kernel.org>
Subject: Re: 3.9.x:  Possible race related to stop_machine leads to lockup.
Date: Wed, 05 Jun 2013 08:11:05 -0700
Message-ID: <51AF5509.1070706@candelatech.com>
In-Reply-To: <87mwr5rwxo.fsf@rustcorp.com.au>

On 06/04/2013 09:41 PM, Rusty Russell wrote:
> Ben Greear <greearb@candelatech.com> writes:
>> On 06/04/2013 02:18 PM, Ben Greear wrote:
>>> I've been trying to figure out why I see the migration/* processes
>>> hang in a busy loop....
>>>
>>> While reading the stop_machine.c file, I think I might have an
>>> answer.
>>>
>>> The set_state() method sets the thread_ack to the current number
>>> of threads.  Each thread's state machine then decrements it down to
>>> zero where it bumps the state to the next level.  This lets each
>>> cpu stop in lock-step it seems.
>>>
>>> But, from what I can tell, the __stop_machine() method can
>>> (re)set the state to STOPMACHINE_PREPARE while the migration
>>> processes are in their loop.  That would explain why they sometimes
>>> loop forever.
>>>
>>> Does this make sense?
>>
>> Err, no..that doesn't make sense.  'smdata' is on the stack.
>>
>> More printk debugging makes it look like one thread just
>> never notices that smdata->state has been updated by another
>> thread.
>>
>> There is this comment..maybe cpu_relax only does the chill out part
>> and we need something else to make sure smdata->state is freshly
>> read from the other CPU's cache?
>>
>> 		/* Chill out and ensure we re-read stopmachine_state. */
>> 		cpu_relax();
>> 		if (smdata->state != curstate) {
>>
>> Gah..way out of my league :P
>
> What architecture?  Maybe someone didn't get the memo; cpu_relax()
> should be a read barrier.

I tried making it an SMP read barrier, and tried using atomic_t for the state
object.  Neither helped much.

Latest theory is that one thread gets stuck servicing IRQs while the rest of
the CPUs have IRQs disabled, so that one CPU/thread never gets back to the
cpu shutdown state machine.
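
To make the theory concrete, here is a rough single-threaded sketch of the
ack-counting scheme described earlier in the thread.  It is not the code from
stop_machine.c: the SM_* state names, MAX_CPUS, simulate() and the stuck_cpu
knob are made up for illustration, and the round-robin loop only stands in for
CPUs polling in parallel.  It just shows that if one participant stops polling,
thread_ack never reaches zero and the shared state never advances.

```c
#include <assert.h>

/* Illustrative states; names are assumptions, not the kernel's enum. */
enum sm_state {
    SM_NONE,
    SM_PREPARE,
    SM_DISABLE_IRQ,
    SM_RUN,
    SM_EXIT,
};

#define MAX_CPUS 8

struct sm_data {
    enum sm_state state;  /* shared state that every CPU polls */
    int num_threads;      /* number of participating CPUs */
    int thread_ack;       /* CPUs that still have to ack the current state */
};

/* Reset the ack counter to the thread count, then publish the new state. */
static void set_state(struct sm_data *sm, enum sm_state newstate)
{
    sm->thread_ack = sm->num_threads;
    sm->state = newstate;
}

/* The last CPU to ack bumps everybody to the next state. */
static void ack_state(struct sm_data *sm)
{
    if (--sm->thread_ack == 0)
        set_state(sm, (enum sm_state)(sm->state + 1));
}

/*
 * Give every CPU one turn at the polling loop per pass.
 * stuck_cpu (hypothetical; -1 for none) stops polling once the shared
 * state reaches SM_DISABLE_IRQ, modelling a CPU that never returns to
 * the state machine.  Returns the shared state when the run ends.
 */
enum sm_state simulate(int ncpus, int stuck_cpu, int max_passes)
{
    struct sm_data sm = { .state = SM_NONE, .num_threads = ncpus };
    enum sm_state curstate[MAX_CPUS];
    int cpu, pass;

    assert(ncpus >= 1 && ncpus <= MAX_CPUS);

    for (cpu = 0; cpu < ncpus; cpu++)
        curstate[cpu] = SM_NONE;

    set_state(&sm, SM_PREPARE);

    for (pass = 0; pass < max_passes && sm.state != SM_EXIT; pass++) {
        for (cpu = 0; cpu < ncpus; cpu++) {
            /* The stuck CPU never gets back to the loop. */
            if (cpu == stuck_cpu && sm.state >= SM_DISABLE_IRQ)
                continue;

            /* Each CPU acks a state only once, when it first sees it. */
            if (sm.state != curstate[cpu]) {
                curstate[cpu] = sm.state;
                ack_state(&sm);
            }
        }
    }
    return sm.state;
}
```

With four CPUs and nobody stuck, simulate(4, -1, 100) walks through to
SM_EXIT.  With CPU 2 stuck, simulate(4, 2, 100) stays at SM_DISABLE_IRQ no
matter how many passes the other CPUs spin, which would look like the
migration/* busy loop from the outside.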

I'll post a more complete debugging patch later today, and try to find
a better way to reproduce it.

Thanks,
Ben
>
> Cheers,
> Rusty.
>


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
