From: Juergen Gross <jgross@suse.com>
To: Stefan Bader <stefan.bader@canonical.com>,
Andrew Cooper <andrew.cooper3@citrix.com>,
"xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>
Subject: Re: Shutdown panic in disable_nonboot_cpus after cpupool-numa-split
Date: Mon, 07 Jul 2014 16:28:18 +0200 [thread overview]
Message-ID: <53BAAE82.2090200@suse.com> (raw)
In-Reply-To: <53BAA9DD.4040403@canonical.com>
On 07/07/2014 04:08 PM, Stefan Bader wrote:
> On 07.07.2014 15:03, Jürgen Groß wrote:
>> On 07/07/2014 02:49 PM, Stefan Bader wrote:
>>> On 07.07.2014 14:38, Jürgen Groß wrote:
>>>> On 07/07/2014 02:00 PM, Andrew Cooper wrote:
>>>>> On 07/07/14 12:33, Stefan Bader wrote:
>>>>>> I recently noticed that I get a panic (rebooting the system) on shutdown in
>>>>>> some
>>>>> > cases. This happened only on my AMD system and also not all the time.
>>>>> Finally
>>>>> > realized that it is related to the use of using cpupool-numa-split
>>>>> > (libxl with xen-4.4 maybe, but not 100% sure 4.3 as well).
>>>>> >
>>>>> > What happens is that on shutdown the hypervisor runs
>>>>> disable_nonboot_cpus which
>>>>> > call cpu_down for each online cpu. There is a BUG_ON in the code for
>>>>> the case of
>>>>> > cpu_down returning -EBUSY. This happens in my case as soon as the
>>>>> first cpu that
>>>>> > has been moved to pool-1 by cpupool-numa-split is attempted. The error is
>>>>> > returned by running the notifier_call_chain and I suspect that ends
>>>>> up calling
>>>>> > cpupool_cpu_remove which always returns EBUSY for cpus not in pool0.
>>>>> >
>>>>> > I am not sure which end needs to be fixed but looping over all online
>>>>> cpus in
>>>>> > disable_nonboot_cpus sounds plausible. So maybe the check for pool-0 in
>>>>> > cpupool_cpu_remove is wrong...?
>>>>> >
>>>>> > -Stefan
>>>>>
>>>>> Hmm yes - this looks completely broken.
>>>>>
>>>>> cpupool_cpu_remove() only has a single caller which is from cpu_down(),
>>>>> and will unconditionally fail for cpus outside of the default pool.
>>>>>
>>>>> It is not obvious at all how this is supposed to work, and the comment
>>>>> beside cpupool_cpu_remove() doesn't help.
>>>>>
>>>>> Can you try the following (only compile tested) patch, which looks
>>>>> plausibly like it might DTRT. The for_each_cpupool() is a little nastly
>>>>> but there appears to be no cpu_to_cpupool mapping available.
>>>>
>>>> Your patch has the disadvantage to support hot-unplug of the last cpu in
>>>> a cpupool. The following should work, however:
>>>
>>> Disadvantage and support sounded a bit confusing. But I think it means
>>> hot-unplugging the last cpu of a pool is bad and should not be working.
>>
>> Correct.
>>
>>>
>>>>
>>>> diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
>>>> index 4a0e569..73249d3 100644
>>>> --- a/xen/common/cpupool.c
>>>> +++ b/xen/common/cpupool.c
>>>> @@ -471,12 +471,24 @@ static void cpupool_cpu_add(unsigned int cpu)
>>>> */
>>>> static int cpupool_cpu_remove(unsigned int cpu)
>>>> {
>>>> - int ret = 0;
>>>> + int ret = -EBUSY;
>>>> + struct cpupool **c;
>>>>
>>>> spin_lock(&cpupool_lock);
>>>> - if ( !cpumask_test_cpu(cpu, cpupool0->cpu_valid))
>>>> - ret = -EBUSY;
>>>> + if ( cpumask_test_cpu(cpu, cpupool0->cpu_valid) )
>>>> + ret = 0;
>>>> else
>>>> + {
>>>> + for_each_cpupool(c)
>>>> + {
>>>> + if ( cpumask_test_cpu(cpu, (*c)->cpu_suspended ) )
>>>
>>> The rest seems to keep the semantics the same as before (though does that mean
>>> unplugging the last cpu of pool-0 is ok?) But why testing for suspended here to
>>> succeed (and not valid)?
>>
>> Testing valid would again enable to remove the last cpu of a cpupool in
>> case of hotplugging. cpu_suspended is set if all cpus are to be removed
>> due to shutdown, suspend to ram/disk, ...
>
> Ah, ok. Thanks for the detail explanation. So I was trying this change in
> parallel and can confirm that it gets rid of the panic on shutdown. But when I
> try to offline any cpu in pool1 (if echoing 0 into /sys/devices/xen_cpu/xen_cpu?
> is the correct method) I always get EBUSY. IOW I cannot hot-unplug any cpu that
> is in a pool other than 0. It does only work after removing it from pool1, then
> add it to pool0 and then echo 0 into online.
That's how it was designed some years ago. I don't want to change the
behavior in the hypervisor. Adding some tool support could make sense,
however.
Juergen
next prev parent reply other threads:[~2014-07-07 14:28 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-07-07 11:33 Shutdown panic in disable_nonboot_cpus after cpupool-numa-split Stefan Bader
2014-07-07 12:00 ` Andrew Cooper
2014-07-07 12:38 ` Jürgen Groß
2014-07-07 12:49 ` Stefan Bader
2014-07-07 13:03 ` Jürgen Groß
2014-07-07 14:08 ` Stefan Bader
2014-07-07 14:28 ` Juergen Gross [this message]
2014-07-07 14:43 ` Stefan Bader
2014-07-28 8:36 ` Stefan Bader
2014-07-28 8:50 ` Jürgen Groß
2014-07-28 9:02 ` Stefan Bader
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=53BAAE82.2090200@suse.com \
--to=jgross@suse.com \
--cc=andrew.cooper3@citrix.com \
--cc=stefan.bader@canonical.com \
--cc=xen-devel@lists.xensource.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.