xen-devel.lists.xenproject.org archive mirror
* Shutdown panic in disable_nonboot_cpus after cpupool-numa-split
@ 2014-07-07 11:33 Stefan Bader
  2014-07-07 12:00 ` Andrew Cooper
  0 siblings, 1 reply; 11+ messages in thread
From: Stefan Bader @ 2014-07-07 11:33 UTC (permalink / raw)
  To: xen-devel@lists.xensource.com


I recently noticed that I get a panic (rebooting the system) on shutdown in some
cases. This happened only on my AMD system, and also not all the time. Finally I
realized that it is related to the use of cpupool-numa-split
(libxl with xen-4.4; maybe 4.3 as well, but I am not 100% sure).

What happens is that on shutdown the hypervisor runs disable_nonboot_cpus, which
calls cpu_down for each online cpu. There is a BUG_ON in the code for the case of
cpu_down returning -EBUSY. This happens in my case as soon as the first cpu that
has been moved to pool-1 by cpupool-numa-split is attempted. The error is
returned by running the notifier_call_chain, and I suspect that ends up calling
cpupool_cpu_remove, which always returns -EBUSY for cpus not in pool0.
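
For anyone following along, the dispatch I am suspecting looks roughly like
this (a simplified sketch from my reading of xen/common/cpupool.c, not a
verbatim copy; names and details may differ in the actual source):

/* Sketch: cpupool registers a CPU notifier. On CPU_DOWN_PREPARE it calls
 * cpupool_cpu_remove(); a non-zero result is converted back into an error
 * for cpu_down(), which is where the -EBUSY ends up coming from. */
static int cpu_callback(
    struct notifier_block *nfb, unsigned long action, void *hcpu)
{
    unsigned int cpu = (unsigned long)hcpu;
    int rc = 0;

    switch ( action )
    {
    case CPU_DOWN_FAILED:
    case CPU_ONLINE:
        cpupool_cpu_add(cpu);
        break;
    case CPU_DOWN_PREPARE:
        rc = cpupool_cpu_remove(cpu);  /* -EBUSY for cpus not in Pool-0 */
        break;
    default:
        break;
    }

    return !rc ? NOTIFY_DONE : notifier_from_errno(rc);
}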

I am not sure which end needs to be fixed but looping over all online cpus in
disable_nonboot_cpus sounds plausible. So maybe the check for pool-0 in
cpupool_cpu_remove is wrong...?

-Stefan

[I switched around printk and BUG_ON to actually see the offending cpu]
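
The loop that hits the BUG_ON looks roughly like this (paraphrased from
xen/common/cpu.c rather than quoted verbatim, and trimmed); my debug change
just moves the printk ahead of the BUG_ON so the offending cpu shows up in
the log before the machine goes down:

void disable_nonboot_cpus(void)
{
    int cpu, error;

    BUG_ON(smp_processor_id() != 0);

    for_each_online_cpu ( cpu )
    {
        if ( cpu == 0 )
            continue;

        error = cpu_down(cpu);
        if ( error )
        {
            printk("Error taking CPU%d down: %d\n", cpu, error);
            BUG_ON(error == -EBUSY);   /* normally sits before the printk */
            break;
        }
        /* (bookkeeping of the frozen cpus omitted here) */
    }
}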

(XEN) mydbg: after notifier_call_chain in cpu_down
(XEN) Error taking CPU4 down: -16
(XEN) Xen BUG at cpu.c:196 [@190 normally]
(XEN) ----[ Xen-4.4.0  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82d08010184f>] disable_nonboot_cpus+0xff/0x110
(XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor
(XEN) rax: ffff82d0802f8320   rbx: 00000000fffffff0   rcx: 0000000000000000
(XEN) rdx: ffff82d0802b0000   rsi: 000000000000000a   rdi: ffff82d080267638
(XEN) rbp: 0000000000000004   rsp: ffff82d0802b7e50   r8:  ffff83041ff90000
(XEN) r9:  0000000000010000   r10: 0000000000000001   r11: 0000000000000002
(XEN) r12: 0000000000000005   r13: ffff82d0802e2620   r14: 0000000000000003
(XEN) r15: ffff82d0802e2620   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) cr3: 00000000dfc65000   cr2: ffff88002acdb798
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff82d0802b7e50:
(XEN)    ffff82d0802e2620 0000000000000000 ffff82d0802f8320 ffff82d08019eb82
(XEN)    efff82d0802f8380 ffff8300dfcff000 ffff8300dfcff000 ffff83040dca40a0
(XEN)    ffff8300dfcff000 ffff82d0802f8308 ffff82d0802e2620 0000000000000003
(XEN)    ffff82d0802e2620 ffff82d0801054be ffff8300dfcff180 ffff82d0802f8400
(XEN)    ffff82d0802f8410 ffff82d080129970 0000000000000000 ffff82d080129c69
(XEN)    ffff82d0802b0000 ffff8300dfcff000 00000000ffffffff ffff82d08015bd2b
(XEN)    ffff8300dfafe000 00000000fee1dead 00007fada0b3fc8c 0000000000002001
(XEN)    0000000000000005 ffff880029717d78 0000000000000000 0000000000000246
(XEN)    00000000ffff0000 0000000000000000 0000000000000005 0000000000000000
(XEN)    ffffffff810010ea 0000000000002001 0000000000003401 ffff880029717ce0
(XEN)    0000010000000000 ffffffff810010ea 000000000000e033 0000000000000246
(XEN)    ffff880029717cc8 000000000000e02b 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 ffff8300dfafe000
(XEN)    0000000000000000 0000000000000000
(XEN) Xen call trace:
(XEN)    [<ffff82d08010184f>] disable_nonboot_cpus+0xff/0x110
(XEN)    [<ffff82d08019eb82>] enter_state_helper+0xc2/0x3c0
(XEN)    [<ffff82d0801054be>] continue_hypercall_tasklet_handler+0xbe/0xd0
(XEN)    [<ffff82d080129970>] do_tasklet_work+0x60/0xa0
(XEN)    [<ffff82d080129c69>] do_tasklet+0x59/0x90
(XEN)    [<ffff82d08015bd2b>] idle_loop+0x1b/0x50
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Xen BUG at cpu.c:196
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...



* Re: Shutdown panic in disable_nonboot_cpus after cpupool-numa-split
  2014-07-07 11:33 Shutdown panic in disable_nonboot_cpus after cpupool-numa-split Stefan Bader
@ 2014-07-07 12:00 ` Andrew Cooper
  2014-07-07 12:38   ` Jürgen Groß
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Cooper @ 2014-07-07 12:00 UTC (permalink / raw)
  To: Stefan Bader, xen-devel@lists.xensource.com


On 07/07/14 12:33, Stefan Bader wrote:
> I recently noticed that I get a panic (rebooting the system) on shutdown in some
> cases. This happened only on my AMD system and also not all the time.
Finally
> realized that it is related to the use of using cpupool-numa-split
> (libxl with xen-4.4 maybe, but not 100% sure 4.3 as well).
>
> What happens is that on shutdown the hypervisor runs
disable_nonboot_cpus which
> call cpu_down for each online cpu. There is a BUG_ON in the code for
the case of
> cpu_down returning -EBUSY. This happens in my case as soon as the
first cpu that
> has been moved to pool-1 by cpupool-numa-split is attempted. The error is
> returned by running the notifier_call_chain and I suspect that ends up
calling
> cpupool_cpu_remove which always returns EBUSY for cpus not in pool0.
>
> I am not sure which end needs to be fixed but looping over all online
cpus in
> disable_nonboot_cpus sounds plausible. So maybe the check for pool-0 in
> cpupool_cpu_remove is wrong...?
>
> -Stefan

Hmm yes - this looks completely broken.

cpupool_cpu_remove() has only a single caller, cpu_down(), and will
unconditionally fail for cpus outside of the default pool.

It is not obvious at all how this is supposed to work, and the comment
beside cpupool_cpu_remove() doesn't help.

Can you try the following (only compile-tested) patch, which looks
plausibly like it might DTRT?  The for_each_cpupool() is a little nasty,
but there appears to be no cpu_to_cpupool mapping available.
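
For reference, for_each_cpupool() is just a walk over the global list of
pools, roughly the following (going from memory rather than quoting
cpupool.c exactly):

/* Iterates with a struct cpupool ** so that callers can also unlink
 * entries from the singly linked list of pools. */
#define for_each_cpupool(ptr) \
    for ( (ptr) = &cpupool_list; *(ptr) != NULL; (ptr) = &((*(ptr))->next) )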

~Andrew

---8<-----
diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
index 4a0e569..a1af1ea 100644
--- a/xen/common/cpupool.c
+++ b/xen/common/cpupool.c
@@ -472,12 +472,12 @@ static void cpupool_cpu_add(unsigned int cpu)
 static int cpupool_cpu_remove(unsigned int cpu)
 {
     int ret = 0;
+    struct cpupool **c;
 
     spin_lock(&cpupool_lock);
-    if ( !cpumask_test_cpu(cpu, cpupool0->cpu_valid))
-        ret = -EBUSY;
-    else
-        cpumask_set_cpu(cpu, &cpupool_locked_cpus);
+    for_each_cpupool(c)
+        cpumask_clear_cpu(cpu, (*c)->cpu_valid);
+    cpumask_set_cpu(cpu, &cpupool_locked_cpus);
     spin_unlock(&cpupool_lock);
 
     return ret;




* Re: Shutdown panic in disable_nonboot_cpus after cpupool-numa-split
  2014-07-07 12:00 ` Andrew Cooper
@ 2014-07-07 12:38   ` Jürgen Groß
  2014-07-07 12:49     ` Stefan Bader
  0 siblings, 1 reply; 11+ messages in thread
From: Jürgen Groß @ 2014-07-07 12:38 UTC (permalink / raw)
  To: Andrew Cooper, Stefan Bader, xen-devel@lists.xensource.com

On 07/07/2014 02:00 PM, Andrew Cooper wrote:
> On 07/07/14 12:33, Stefan Bader wrote:
>> I recently noticed that I get a  panic (rebooting the system) on shutdown in some
>  > cases. This happened only on my AMD system and also not all the time.
> Finally
>  > realized that it is related to the use of using cpupool-numa-split
>  > (libxl with xen-4.4 maybe, but not 100% sure 4.3 as well).
>  >
>  > What happens is that on shutdown the hypervisor runs
> disable_nonboot_cpus which
>  > call cpu_down for each online cpu. There is a BUG_ON in the code for
> the case of
>  > cpu_down returning -EBUSY. This happens in my case as soon as the
> first cpu that
>  > has been moved to pool-1 by cpupool-numa-split is attempted. The error is
>  > returned by running the notifier_call_chain and I suspect that ends
> up calling
>  > cpupool_cpu_remove which always returns EBUSY for cpus not in pool0.
>  >
>  > I am not sure which end needs to be fixed but looping over all online
> cpus in
>  > disable_nonboot_cpus sounds plausible. So maybe the check for pool-0 in
>  > cpupool_cpu_remove is wrong...?
>  >
>  > -Stefan
>
> Hmm yes - this looks completely broken.
>
> cpupool_cpu_remove() only has a single caller which is from cpu_down(),
> and will unconditionally fail for cpus outside of the default pool.
>
> It is not obvious at all how this is supposed to work, and the comment
> beside cpupool_cpu_remove() doesn't help.
>
> Can you try the following (only compile tested) patch, which looks
> plausibly like it might DTRT.  The for_each_cpupool() is a little nastly
> but there appears to be no cpu_to_cpupool mapping available.

Your patch has the disadvantage to support hot-unplug of the last cpu in
a cpupool. The following should work, however:

diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
index 4a0e569..73249d3 100644
--- a/xen/common/cpupool.c
+++ b/xen/common/cpupool.c
@@ -471,12 +471,24 @@ static void cpupool_cpu_add(unsigned int cpu)
   */
  static int cpupool_cpu_remove(unsigned int cpu)
  {
-    int ret = 0;
+    int ret = -EBUSY;
+    struct cpupool **c;

      spin_lock(&cpupool_lock);
-    if ( !cpumask_test_cpu(cpu, cpupool0->cpu_valid))
-        ret = -EBUSY;
+    if ( cpumask_test_cpu(cpu, cpupool0->cpu_valid) )
+        ret = 0;
      else
+    {
+        for_each_cpupool(c)
+        {
+            if ( cpumask_test_cpu(cpu, (*c)->cpu_suspended ) )
+            {
+                ret = 0;
+                break;
+            }
+        }
+    }
+    if ( !ret )
          cpumask_set_cpu(cpu, &cpupool_locked_cpus);
      spin_unlock(&cpupool_lock);
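
For readability, with this change applied the function would read roughly as
follows (reconstructed from the diff above plus the unchanged tail visible
in Andrew's patch; whitespace approximated):

static int cpupool_cpu_remove(unsigned int cpu)
{
    int ret = -EBUSY;
    struct cpupool **c;

    spin_lock(&cpupool_lock);
    /* cpus in Pool-0 may always be taken down. */
    if ( cpumask_test_cpu(cpu, cpupool0->cpu_valid) )
        ret = 0;
    else
    {
        /* Outside Pool-0: only ok if the cpu is marked in cpu_suspended. */
        for_each_cpupool(c)
        {
            if ( cpumask_test_cpu(cpu, (*c)->cpu_suspended) )
            {
                ret = 0;
                break;
            }
        }
    }
    if ( !ret )
        cpumask_set_cpu(cpu, &cpupool_locked_cpus);
    spin_unlock(&cpupool_lock);

    return ret;
}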



Juergen


* Re: Shutdown panic in disable_nonboot_cpus after cpupool-numa-split
  2014-07-07 12:38   ` Jürgen Groß
@ 2014-07-07 12:49     ` Stefan Bader
  2014-07-07 13:03       ` Jürgen Groß
  0 siblings, 1 reply; 11+ messages in thread
From: Stefan Bader @ 2014-07-07 12:49 UTC (permalink / raw)
  To: Jürgen Groß, Andrew Cooper,
	xen-devel@lists.xensource.com


On 07.07.2014 14:38, Jürgen Groß wrote:
> On 07/07/2014 02:00 PM, Andrew Cooper wrote:
>> On 07/07/14 12:33, Stefan Bader wrote:
>>> I recently noticed that I get a  panic (rebooting the system) on shutdown in
>>> some
>>  > cases. This happened only on my AMD system and also not all the time.
>> Finally
>>  > realized that it is related to the use of using cpupool-numa-split
>>  > (libxl with xen-4.4 maybe, but not 100% sure 4.3 as well).
>>  >
>>  > What happens is that on shutdown the hypervisor runs
>> disable_nonboot_cpus which
>>  > call cpu_down for each online cpu. There is a BUG_ON in the code for
>> the case of
>>  > cpu_down returning -EBUSY. This happens in my case as soon as the
>> first cpu that
>>  > has been moved to pool-1 by cpupool-numa-split is attempted. The error is
>>  > returned by running the notifier_call_chain and I suspect that ends
>> up calling
>>  > cpupool_cpu_remove which always returns EBUSY for cpus not in pool0.
>>  >
>>  > I am not sure which end needs to be fixed but looping over all online
>> cpus in
>>  > disable_nonboot_cpus sounds plausible. So maybe the check for pool-0 in
>>  > cpupool_cpu_remove is wrong...?
>>  >
>>  > -Stefan
>>
>> Hmm yes - this looks completely broken.
>>
>> cpupool_cpu_remove() only has a single caller which is from cpu_down(),
>> and will unconditionally fail for cpus outside of the default pool.
>>
>> It is not obvious at all how this is supposed to work, and the comment
>> beside cpupool_cpu_remove() doesn't help.
>>
>> Can you try the following (only compile tested) patch, which looks
>> plausibly like it might DTRT.  The for_each_cpupool() is a little nastly
>> but there appears to be no cpu_to_cpupool mapping available.
> 
> Your patch has the disadvantage to support hot-unplug of the last cpu in
> a cpupool. The following should work, however:

Disadvantage and support sounded a bit confusing. But I think it means
hot-unplugging the last cpu of a pool is bad and should not be working.

> 
> diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
> index 4a0e569..73249d3 100644
> --- a/xen/common/cpupool.c
> +++ b/xen/common/cpupool.c
> @@ -471,12 +471,24 @@ static void cpupool_cpu_add(unsigned int cpu)
>   */
>  static int cpupool_cpu_remove(unsigned int cpu)
>  {
> -    int ret = 0;
> +    int ret = -EBUSY;
> +    struct cpupool **c;
> 
>      spin_lock(&cpupool_lock);
> -    if ( !cpumask_test_cpu(cpu, cpupool0->cpu_valid))
> -        ret = -EBUSY;
> +    if ( cpumask_test_cpu(cpu, cpupool0->cpu_valid) )
> +        ret = 0;
>      else
> +    {
> +        for_each_cpupool(c)
> +        {
> +            if ( cpumask_test_cpu(cpu, (*c)->cpu_suspended ) )

The rest seems to keep the semantics the same as before (though does that mean
unplugging the last cpu of pool-0 is ok?) But why testing for suspended here to
succeed (and not valid)?


> +            {
> +                ret = 0;
> +                break;
> +            }
> +        }
> +    }
> +    if ( !ret )
>          cpumask_set_cpu(cpu, &cpupool_locked_cpus);
>      spin_unlock(&cpupool_lock);
> 
> 
> 
> Juergen




* Re: Shutdown panic in disable_nonboot_cpus after cpupool-numa-split
  2014-07-07 12:49     ` Stefan Bader
@ 2014-07-07 13:03       ` Jürgen Groß
  2014-07-07 14:08         ` Stefan Bader
  0 siblings, 1 reply; 11+ messages in thread
From: Jürgen Groß @ 2014-07-07 13:03 UTC (permalink / raw)
  To: Stefan Bader, Andrew Cooper, xen-devel@lists.xensource.com

On 07/07/2014 02:49 PM, Stefan Bader wrote:
> On 07.07.2014 14:38, Jürgen Groß wrote:
>> On 07/07/2014 02:00 PM, Andrew Cooper wrote:
>>> On 07/07/14 12:33, Stefan Bader wrote:
>>>> I recently noticed that I get a  panic (rebooting the system) on shutdown in
>>>> some
>>>   > cases. This happened only on my AMD system and also not all the time.
>>> Finally
>>>   > realized that it is related to the use of using cpupool-numa-split
>>>   > (libxl with xen-4.4 maybe, but not 100% sure 4.3 as well).
>>>   >
>>>   > What happens is that on shutdown the hypervisor runs
>>> disable_nonboot_cpus which
>>>   > call cpu_down for each online cpu. There is a BUG_ON in the code for
>>> the case of
>>>   > cpu_down returning -EBUSY. This happens in my case as soon as the
>>> first cpu that
>>>   > has been moved to pool-1 by cpupool-numa-split is attempted. The error is
>>>   > returned by running the notifier_call_chain and I suspect that ends
>>> up calling
>>>   > cpupool_cpu_remove which always returns EBUSY for cpus not in pool0.
>>>   >
>>>   > I am not sure which end needs to be fixed but looping over all online
>>> cpus in
>>>   > disable_nonboot_cpus sounds plausible. So maybe the check for pool-0 in
>>>   > cpupool_cpu_remove is wrong...?
>>>   >
>>>   > -Stefan
>>>
>>> Hmm yes - this looks completely broken.
>>>
>>> cpupool_cpu_remove() only has a single caller which is from cpu_down(),
>>> and will unconditionally fail for cpus outside of the default pool.
>>>
>>> It is not obvious at all how this is supposed to work, and the comment
>>> beside cpupool_cpu_remove() doesn't help.
>>>
>>> Can you try the following (only compile tested) patch, which looks
>>> plausibly like it might DTRT.  The for_each_cpupool() is a little nastly
>>> but there appears to be no cpu_to_cpupool mapping available.
>>
>> Your patch has the disadvantage to support hot-unplug of the last cpu in
>> a cpupool. The following should work, however:
>
> Disadvantage and support sounded a bit confusing. But I think it means
> hot-unplugging the last cpu of a pool is bad and should not be working.

Correct.

>
>>
>> diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
>> index 4a0e569..73249d3 100644
>> --- a/xen/common/cpupool.c
>> +++ b/xen/common/cpupool.c
>> @@ -471,12 +471,24 @@ static void cpupool_cpu_add(unsigned int cpu)
>>    */
>>   static int cpupool_cpu_remove(unsigned int cpu)
>>   {
>> -    int ret = 0;
>> +    int ret = -EBUSY;
>> +    struct cpupool **c;
>>
>>       spin_lock(&cpupool_lock);
>> -    if ( !cpumask_test_cpu(cpu, cpupool0->cpu_valid))
>> -        ret = -EBUSY;
>> +    if ( cpumask_test_cpu(cpu, cpupool0->cpu_valid) )
>> +        ret = 0;
>>       else
>> +    {
>> +        for_each_cpupool(c)
>> +        {
>> +            if ( cpumask_test_cpu(cpu, (*c)->cpu_suspended ) )
>
> The rest seems to keep the semantics the same as before (though does that mean
> unplugging the last cpu of pool-0 is ok?) But why testing for suspended here to
> succeed (and not valid)?

Testing cpu_valid would again make it possible to remove the last cpu of a
cpupool via hotplug. cpu_suspended is set if all cpus are to be removed
due to shutdown, suspend to ram/disk, ...

Juergen

>
>
>> +            {
>> +                ret = 0;
>> +                break;
>> +            }
>> +        }
>> +    }
>> +    if ( !ret )
>>           cpumask_set_cpu(cpu, &cpupool_locked_cpus);
>>       spin_unlock(&cpupool_lock);
>>
>>
>>
>> Juergen


* Re: Shutdown panic in disable_nonboot_cpus after cpupool-numa-split
  2014-07-07 13:03       ` Jürgen Groß
@ 2014-07-07 14:08         ` Stefan Bader
  2014-07-07 14:28           ` Juergen Gross
  0 siblings, 1 reply; 11+ messages in thread
From: Stefan Bader @ 2014-07-07 14:08 UTC (permalink / raw)
  To: Jürgen Groß, Andrew Cooper,
	xen-devel@lists.xensource.com


On 07.07.2014 15:03, Jürgen Groß wrote:
> On 07/07/2014 02:49 PM, Stefan Bader wrote:
>> On 07.07.2014 14:38, Jürgen Groß wrote:
>>> On 07/07/2014 02:00 PM, Andrew Cooper wrote:
>>>> On 07/07/14 12:33, Stefan Bader wrote:
>>>>> I recently noticed that I get a  panic (rebooting the system) on shutdown in
>>>>> some
>>>>   > cases. This happened only on my AMD system and also not all the time.
>>>> Finally
>>>>   > realized that it is related to the use of using cpupool-numa-split
>>>>   > (libxl with xen-4.4 maybe, but not 100% sure 4.3 as well).
>>>>   >
>>>>   > What happens is that on shutdown the hypervisor runs
>>>> disable_nonboot_cpus which
>>>>   > call cpu_down for each online cpu. There is a BUG_ON in the code for
>>>> the case of
>>>>   > cpu_down returning -EBUSY. This happens in my case as soon as the
>>>> first cpu that
>>>>   > has been moved to pool-1 by cpupool-numa-split is attempted. The error is
>>>>   > returned by running the notifier_call_chain and I suspect that ends
>>>> up calling
>>>>   > cpupool_cpu_remove which always returns EBUSY for cpus not in pool0.
>>>>   >
>>>>   > I am not sure which end needs to be fixed but looping over all online
>>>> cpus in
>>>>   > disable_nonboot_cpus sounds plausible. So maybe the check for pool-0 in
>>>>   > cpupool_cpu_remove is wrong...?
>>>>   >
>>>>   > -Stefan
>>>>
>>>> Hmm yes - this looks completely broken.
>>>>
>>>> cpupool_cpu_remove() only has a single caller which is from cpu_down(),
>>>> and will unconditionally fail for cpus outside of the default pool.
>>>>
>>>> It is not obvious at all how this is supposed to work, and the comment
>>>> beside cpupool_cpu_remove() doesn't help.
>>>>
>>>> Can you try the following (only compile tested) patch, which looks
>>>> plausibly like it might DTRT.  The for_each_cpupool() is a little nastly
>>>> but there appears to be no cpu_to_cpupool mapping available.
>>>
>>> Your patch has the disadvantage to support hot-unplug of the last cpu in
>>> a cpupool. The following should work, however:
>>
>> Disadvantage and support sounded a bit confusing. But I think it means
>> hot-unplugging the last cpu of a pool is bad and should not be working.
> 
> Correct.
> 
>>
>>>
>>> diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
>>> index 4a0e569..73249d3 100644
>>> --- a/xen/common/cpupool.c
>>> +++ b/xen/common/cpupool.c
>>> @@ -471,12 +471,24 @@ static void cpupool_cpu_add(unsigned int cpu)
>>>    */
>>>   static int cpupool_cpu_remove(unsigned int cpu)
>>>   {
>>> -    int ret = 0;
>>> +    int ret = -EBUSY;
>>> +    struct cpupool **c;
>>>
>>>       spin_lock(&cpupool_lock);
>>> -    if ( !cpumask_test_cpu(cpu, cpupool0->cpu_valid))
>>> -        ret = -EBUSY;
>>> +    if ( cpumask_test_cpu(cpu, cpupool0->cpu_valid) )
>>> +        ret = 0;
>>>       else
>>> +    {
>>> +        for_each_cpupool(c)
>>> +        {
>>> +            if ( cpumask_test_cpu(cpu, (*c)->cpu_suspended ) )
>>
>> The rest seems to keep the semantics the same as before (though does that mean
>> unplugging the last cpu of pool-0 is ok?) But why testing for suspended here to
>> succeed (and not valid)?
> 
> Testing valid would again enable to remove the last cpu of a cpupool in
> case of hotplugging. cpu_suspended is set if all cpus are to be removed
> due to shutdown, suspend to ram/disk, ...

Ah, ok. Thanks for the detailed explanation. So I was trying this change in
parallel and can confirm that it gets rid of the panic on shutdown. But when I
try to offline any cpu in pool1 (if echoing 0 into /sys/devices/xen_cpu/xen_cpu?
is the correct method) I always get EBUSY. IOW I cannot hot-unplug any cpu that
is in a pool other than 0. It only works after removing the cpu from pool1,
then adding it to pool0, and then echoing 0 into online.

-Stefan


> 
> Juergen
> 
>>
>>
>>> +            {
>>> +                ret = 0;
>>> +                break;
>>> +            }
>>> +        }
>>> +    }
>>> +    if ( !ret )
>>>           cpumask_set_cpu(cpu, &cpupool_locked_cpus);
>>>       spin_unlock(&cpupool_lock);
>>>
>>>
>>>
>>> Juergen
> 




* Re: Shutdown panic in disable_nonboot_cpus after cpupool-numa-split
  2014-07-07 14:08         ` Stefan Bader
@ 2014-07-07 14:28           ` Juergen Gross
  2014-07-07 14:43             ` Stefan Bader
  0 siblings, 1 reply; 11+ messages in thread
From: Juergen Gross @ 2014-07-07 14:28 UTC (permalink / raw)
  To: Stefan Bader, Andrew Cooper, xen-devel@lists.xensource.com

On 07/07/2014 04:08 PM, Stefan Bader wrote:
> On 07.07.2014 15:03, Jürgen Groß wrote:
>> On 07/07/2014 02:49 PM, Stefan Bader wrote:
>>> On 07.07.2014 14:38, Jürgen Groß wrote:
>>>> On 07/07/2014 02:00 PM, Andrew Cooper wrote:
>>>>> On 07/07/14 12:33, Stefan Bader wrote:
>>>>>> I recently noticed that I get a  panic (rebooting the system) on shutdown in
>>>>>> some
>>>>>    > cases. This happened only on my AMD system and also not all the time.
>>>>> Finally
>>>>>    > realized that it is related to the use of using cpupool-numa-split
>>>>>    > (libxl with xen-4.4 maybe, but not 100% sure 4.3 as well).
>>>>>    >
>>>>>    > What happens is that on shutdown the hypervisor runs
>>>>> disable_nonboot_cpus which
>>>>>    > call cpu_down for each online cpu. There is a BUG_ON in the code for
>>>>> the case of
>>>>>    > cpu_down returning -EBUSY. This happens in my case as soon as the
>>>>> first cpu that
>>>>>    > has been moved to pool-1 by cpupool-numa-split is attempted. The error is
>>>>>    > returned by running the notifier_call_chain and I suspect that ends
>>>>> up calling
>>>>>    > cpupool_cpu_remove which always returns EBUSY for cpus not in pool0.
>>>>>    >
>>>>>    > I am not sure which end needs to be fixed but looping over all online
>>>>> cpus in
>>>>>    > disable_nonboot_cpus sounds plausible. So maybe the check for pool-0 in
>>>>>    > cpupool_cpu_remove is wrong...?
>>>>>    >
>>>>>    > -Stefan
>>>>>
>>>>> Hmm yes - this looks completely broken.
>>>>>
>>>>> cpupool_cpu_remove() only has a single caller which is from cpu_down(),
>>>>> and will unconditionally fail for cpus outside of the default pool.
>>>>>
>>>>> It is not obvious at all how this is supposed to work, and the comment
>>>>> beside cpupool_cpu_remove() doesn't help.
>>>>>
>>>>> Can you try the following (only compile tested) patch, which looks
>>>>> plausibly like it might DTRT.  The for_each_cpupool() is a little nastly
>>>>> but there appears to be no cpu_to_cpupool mapping available.
>>>>
>>>> Your patch has the disadvantage to support hot-unplug of the last cpu in
>>>> a cpupool. The following should work, however:
>>>
>>> Disadvantage and support sounded a bit confusing. But I think it means
>>> hot-unplugging the last cpu of a pool is bad and should not be working.
>>
>> Correct.
>>
>>>
>>>>
>>>> diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
>>>> index 4a0e569..73249d3 100644
>>>> --- a/xen/common/cpupool.c
>>>> +++ b/xen/common/cpupool.c
>>>> @@ -471,12 +471,24 @@ static void cpupool_cpu_add(unsigned int cpu)
>>>>     */
>>>>    static int cpupool_cpu_remove(unsigned int cpu)
>>>>    {
>>>> -    int ret = 0;
>>>> +    int ret = -EBUSY;
>>>> +    struct cpupool **c;
>>>>
>>>>        spin_lock(&cpupool_lock);
>>>> -    if ( !cpumask_test_cpu(cpu, cpupool0->cpu_valid))
>>>> -        ret = -EBUSY;
>>>> +    if ( cpumask_test_cpu(cpu, cpupool0->cpu_valid) )
>>>> +        ret = 0;
>>>>        else
>>>> +    {
>>>> +        for_each_cpupool(c)
>>>> +        {
>>>> +            if ( cpumask_test_cpu(cpu, (*c)->cpu_suspended ) )
>>>
>>> The rest seems to keep the semantics the same as before (though does that mean
>>> unplugging the last cpu of pool-0 is ok?) But why testing for suspended here to
>>> succeed (and not valid)?
>>
>> Testing valid would again enable to remove the last cpu of a cpupool in
>> case of hotplugging. cpu_suspended is set if all cpus are to be removed
>> due to shutdown, suspend to ram/disk, ...
>
> Ah, ok. Thanks for the detail explanation. So I was trying this change in
> parallel and can confirm that it gets rid of the panic on shutdown. But when I
> try to offline any cpu in pool1 (if echoing 0 into /sys/devices/xen_cpu/xen_cpu?
> is the correct method) I always get EBUSY. IOW I cannot hot-unplug any cpu that
> is in a pool other than 0. It does only work after removing it from pool1, then
> add it to pool0 and then echo 0 into online.

That's how it was designed some years ago. I don't want to change the
behavior in the hypervisor. Adding some tool support could make sense,
however.

Juergen


* Re: Shutdown panic in disable_nonboot_cpus after cpupool-numa-split
  2014-07-07 14:28           ` Juergen Gross
@ 2014-07-07 14:43             ` Stefan Bader
  2014-07-28  8:36               ` Stefan Bader
  0 siblings, 1 reply; 11+ messages in thread
From: Stefan Bader @ 2014-07-07 14:43 UTC (permalink / raw)
  To: Juergen Gross, Andrew Cooper, xen-devel@lists.xensource.com


On 07.07.2014 16:28, Juergen Gross wrote:
> On 07/07/2014 04:08 PM, Stefan Bader wrote:
>> On 07.07.2014 15:03, Jürgen Groß wrote:
>>> On 07/07/2014 02:49 PM, Stefan Bader wrote:
>>>> On 07.07.2014 14:38, Jürgen Groß wrote:
>>>>> On 07/07/2014 02:00 PM, Andrew Cooper wrote:
>>>>>> On 07/07/14 12:33, Stefan Bader wrote:
>>>>>>> I recently noticed that I get a  panic (rebooting the system) on shutdown in
>>>>>>> some
>>>>>>    > cases. This happened only on my AMD system and also not all the time.
>>>>>> Finally
>>>>>>    > realized that it is related to the use of using cpupool-numa-split
>>>>>>    > (libxl with xen-4.4 maybe, but not 100% sure 4.3 as well).
>>>>>>    >
>>>>>>    > What happens is that on shutdown the hypervisor runs
>>>>>> disable_nonboot_cpus which
>>>>>>    > call cpu_down for each online cpu. There is a BUG_ON in the code for
>>>>>> the case of
>>>>>>    > cpu_down returning -EBUSY. This happens in my case as soon as the
>>>>>> first cpu that
>>>>>>    > has been moved to pool-1 by cpupool-numa-split is attempted. The
>>>>>> error is
>>>>>>    > returned by running the notifier_call_chain and I suspect that ends
>>>>>> up calling
>>>>>>    > cpupool_cpu_remove which always returns EBUSY for cpus not in pool0.
>>>>>>    >
>>>>>>    > I am not sure which end needs to be fixed but looping over all online
>>>>>> cpus in
>>>>>>    > disable_nonboot_cpus sounds plausible. So maybe the check for pool-0 in
>>>>>>    > cpupool_cpu_remove is wrong...?
>>>>>>    >
>>>>>>    > -Stefan
>>>>>>
>>>>>> Hmm yes - this looks completely broken.
>>>>>>
>>>>>> cpupool_cpu_remove() only has a single caller which is from cpu_down(),
>>>>>> and will unconditionally fail for cpus outside of the default pool.
>>>>>>
>>>>>> It is not obvious at all how this is supposed to work, and the comment
>>>>>> beside cpupool_cpu_remove() doesn't help.
>>>>>>
>>>>>> Can you try the following (only compile tested) patch, which looks
>>>>>> plausibly like it might DTRT.  The for_each_cpupool() is a little nastly
>>>>>> but there appears to be no cpu_to_cpupool mapping available.
>>>>>
>>>>> Your patch has the disadvantage to support hot-unplug of the last cpu in
>>>>> a cpupool. The following should work, however:
>>>>
>>>> Disadvantage and support sounded a bit confusing. But I think it means
>>>> hot-unplugging the last cpu of a pool is bad and should not be working.
>>>
>>> Correct.
>>>
>>>>
>>>>>
>>>>> diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
>>>>> index 4a0e569..73249d3 100644
>>>>> --- a/xen/common/cpupool.c
>>>>> +++ b/xen/common/cpupool.c
>>>>> @@ -471,12 +471,24 @@ static void cpupool_cpu_add(unsigned int cpu)
>>>>>     */
>>>>>    static int cpupool_cpu_remove(unsigned int cpu)
>>>>>    {
>>>>> -    int ret = 0;
>>>>> +    int ret = -EBUSY;
>>>>> +    struct cpupool **c;
>>>>>
>>>>>        spin_lock(&cpupool_lock);
>>>>> -    if ( !cpumask_test_cpu(cpu, cpupool0->cpu_valid))
>>>>> -        ret = -EBUSY;
>>>>> +    if ( cpumask_test_cpu(cpu, cpupool0->cpu_valid) )
>>>>> +        ret = 0;
>>>>>        else
>>>>> +    {
>>>>> +        for_each_cpupool(c)
>>>>> +        {
>>>>> +            if ( cpumask_test_cpu(cpu, (*c)->cpu_suspended ) )
>>>>
>>>> The rest seems to keep the semantics the same as before (though does that mean
>>>> unplugging the last cpu of pool-0 is ok?) But why testing for suspended here to
>>>> succeed (and not valid)?
>>>
>>> Testing valid would again enable to remove the last cpu of a cpupool in
>>> case of hotplugging. cpu_suspended is set if all cpus are to be removed
>>> due to shutdown, suspend to ram/disk, ...
>>
>> Ah, ok. Thanks for the detail explanation. So I was trying this change in
>> parallel and can confirm that it gets rid of the panic on shutdown. But when I
>> try to offline any cpu in pool1 (if echoing 0 into /sys/devices/xen_cpu/xen_cpu?
>> is the correct method) I always get EBUSY. IOW I cannot hot-unplug any cpu that
>> is in a pool other than 0. It does only work after removing it from pool1, then
>> add it to pool0 and then echo 0 into online.
> 
> That's how it was designed some years ago. I don't want to change the
> behavior in the hypervisor. Adding some tool support could make sense,
> however.

Ok, so in that case everything works as expected; the change fixes the
currently broken shutdown and could be properly submitted for inclusion (with my
Tested-by).

-Stefan
> 
> Juergen
> 




* Re: Shutdown panic in disable_nonboot_cpus after cpupool-numa-split
  2014-07-07 14:43             ` Stefan Bader
@ 2014-07-28  8:36               ` Stefan Bader
  2014-07-28  8:50                 ` Jürgen Groß
  0 siblings, 1 reply; 11+ messages in thread
From: Stefan Bader @ 2014-07-28  8:36 UTC (permalink / raw)
  To: Juergen Gross, Andrew Cooper, xen-devel@lists.xensource.com


On 07.07.2014 16:43, Stefan Bader wrote:
> On 07.07.2014 16:28, Juergen Gross wrote:
>> On 07/07/2014 04:08 PM, Stefan Bader wrote:
>>> On 07.07.2014 15:03, Jürgen Groß wrote:
>>>> On 07/07/2014 02:49 PM, Stefan Bader wrote:
>>>>> On 07.07.2014 14:38, Jürgen Groß wrote:
>>>>>> On 07/07/2014 02:00 PM, Andrew Cooper wrote:
>>>>>>> On 07/07/14 12:33, Stefan Bader wrote:
>>>>>>>> I recently noticed that I get a  panic (rebooting the system) on shutdown in
>>>>>>>> some
>>>>>>>    > cases. This happened only on my AMD system and also not all the time.
>>>>>>> Finally
>>>>>>>    > realized that it is related to the use of using cpupool-numa-split
>>>>>>>    > (libxl with xen-4.4 maybe, but not 100% sure 4.3 as well).
>>>>>>>    >
>>>>>>>    > What happens is that on shutdown the hypervisor runs
>>>>>>> disable_nonboot_cpus which
>>>>>>>    > call cpu_down for each online cpu. There is a BUG_ON in the code for
>>>>>>> the case of
>>>>>>>    > cpu_down returning -EBUSY. This happens in my case as soon as the
>>>>>>> first cpu that
>>>>>>>    > has been moved to pool-1 by cpupool-numa-split is attempted. The
>>>>>>> error is
>>>>>>>    > returned by running the notifier_call_chain and I suspect that ends
>>>>>>> up calling
>>>>>>>    > cpupool_cpu_remove which always returns EBUSY for cpus not in pool0.
>>>>>>>    >
>>>>>>>    > I am not sure which end needs to be fixed but looping over all online
>>>>>>> cpus in
>>>>>>>    > disable_nonboot_cpus sounds plausible. So maybe the check for pool-0 in
>>>>>>>    > cpupool_cpu_remove is wrong...?
>>>>>>>    >
>>>>>>>    > -Stefan
>>>>>>>
>>>>>>> Hmm yes - this looks completely broken.
>>>>>>>
>>>>>>> cpupool_cpu_remove() only has a single caller which is from cpu_down(),
>>>>>>> and will unconditionally fail for cpus outside of the default pool.
>>>>>>>
>>>>>>> It is not obvious at all how this is supposed to work, and the comment
>>>>>>> beside cpupool_cpu_remove() doesn't help.
>>>>>>>
>>>>>>> Can you try the following (only compile tested) patch, which looks
>>>>>>> plausibly like it might DTRT.  The for_each_cpupool() is a little nastly
>>>>>>> but there appears to be no cpu_to_cpupool mapping available.
>>>>>>
>>>>>> Your patch has the disadvantage to support hot-unplug of the last cpu in
>>>>>> a cpupool. The following should work, however:
>>>>>
>>>>> Disadvantage and support sounded a bit confusing. But I think it means
>>>>> hot-unplugging the last cpu of a pool is bad and should not be working.
>>>>
>>>> Correct.
>>>>
>>>>>
>>>>>>
>>>>>> diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
>>>>>> index 4a0e569..73249d3 100644
>>>>>> --- a/xen/common/cpupool.c
>>>>>> +++ b/xen/common/cpupool.c
>>>>>> @@ -471,12 +471,24 @@ static void cpupool_cpu_add(unsigned int cpu)
>>>>>>     */
>>>>>>    static int cpupool_cpu_remove(unsigned int cpu)
>>>>>>    {
>>>>>> -    int ret = 0;
>>>>>> +    int ret = -EBUSY;
>>>>>> +    struct cpupool **c;
>>>>>>
>>>>>>        spin_lock(&cpupool_lock);
>>>>>> -    if ( !cpumask_test_cpu(cpu, cpupool0->cpu_valid))
>>>>>> -        ret = -EBUSY;
>>>>>> +    if ( cpumask_test_cpu(cpu, cpupool0->cpu_valid) )
>>>>>> +        ret = 0;
>>>>>>        else
>>>>>> +    {
>>>>>> +        for_each_cpupool(c)
>>>>>> +        {
>>>>>> +            if ( cpumask_test_cpu(cpu, (*c)->cpu_suspended ) )
>>>>>
>>>>> The rest seems to keep the semantics the same as before (though does that mean
>>>>> unplugging the last cpu of pool-0 is ok?) But why testing for suspended here to
>>>>> succeed (and not valid)?
>>>>
>>>> Testing valid would again enable to remove the last cpu of a cpupool in
>>>> case of hotplugging. cpu_suspended is set if all cpus are to be removed
>>>> due to shutdown, suspend to ram/disk, ...
>>>
>>> Ah, ok. Thanks for the detail explanation. So I was trying this change in
>>> parallel and can confirm that it gets rid of the panic on shutdown. But when I
>>> try to offline any cpu in pool1 (if echoing 0 into /sys/devices/xen_cpu/xen_cpu?
>>> is the correct method) I always get EBUSY. IOW I cannot hot-unplug any cpu that
>>> is in a pool other than 0. It does only work after removing it from pool1, then
>>> add it to pool0 and then echo 0 into online.
>>
>> That's how it was designed some years ago. I don't want to change the
>> behavior in the hypervisor. Adding some tool support could make sense,
>> however.
> 
> Ok, so in that case everything works as expected and the change fixes the
> currently broken shutdown and could be properly submitted for inclusion (with my
> tested-by).
> 

Does this need anything further from my side? I could re-submit the whole patch,
but since it is Juergen's work that felt a little rude.

-Stefan




* Re: Shutdown panic in disable_nonboot_cpus after cpupool-numa-split
  2014-07-28  8:36               ` Stefan Bader
@ 2014-07-28  8:50                 ` Jürgen Groß
  2014-07-28  9:02                   ` Stefan Bader
  0 siblings, 1 reply; 11+ messages in thread
From: Jürgen Groß @ 2014-07-28  8:50 UTC (permalink / raw)
  To: Stefan Bader, Andrew Cooper, xen-devel@lists.xensource.com

On 07/28/2014 10:36 AM, Stefan Bader wrote:
> On 07.07.2014 16:43, Stefan Bader wrote:
>> On 07.07.2014 16:28, Juergen Gross wrote:
>>> On 07/07/2014 04:08 PM, Stefan Bader wrote:
>>>> On 07.07.2014 15:03, Jürgen Groß wrote:
>>>>> On 07/07/2014 02:49 PM, Stefan Bader wrote:
>>>>>> On 07.07.2014 14:38, Jürgen Groß wrote:
>>>>>>> On 07/07/2014 02:00 PM, Andrew Cooper wrote:
>>>>>>>> On 07/07/14 12:33, Stefan Bader wrote:
>>>>>>>>> I recently noticed that I get a  panic (rebooting the system) on shutdown in
>>>>>>>>> some
>>>>>>>>     > cases. This happened only on my AMD system and also not all the time.
>>>>>>>> Finally
>>>>>>>>     > realized that it is related to the use of using cpupool-numa-split
>>>>>>>>     > (libxl with xen-4.4 maybe, but not 100% sure 4.3 as well).
>>>>>>>>     >
>>>>>>>>     > What happens is that on shutdown the hypervisor runs
>>>>>>>> disable_nonboot_cpus which
>>>>>>>>     > call cpu_down for each online cpu. There is a BUG_ON in the code for
>>>>>>>> the case of
>>>>>>>>     > cpu_down returning -EBUSY. This happens in my case as soon as the
>>>>>>>> first cpu that
>>>>>>>>     > has been moved to pool-1 by cpupool-numa-split is attempted. The
>>>>>>>> error is
>>>>>>>>     > returned by running the notifier_call_chain and I suspect that ends
>>>>>>>> up calling
>>>>>>>>     > cpupool_cpu_remove which always returns EBUSY for cpus not in pool0.
>>>>>>>>     >
>>>>>>>>     > I am not sure which end needs to be fixed but looping over all online
>>>>>>>> cpus in
>>>>>>>>     > disable_nonboot_cpus sounds plausible. So maybe the check for pool-0 in
>>>>>>>>     > cpupool_cpu_remove is wrong...?
>>>>>>>>     >
>>>>>>>>     > -Stefan
>>>>>>>>
>>>>>>>> Hmm yes - this looks completely broken.
>>>>>>>>
>>>>>>>> cpupool_cpu_remove() only has a single caller which is from cpu_down(),
>>>>>>>> and will unconditionally fail for cpus outside of the default pool.
>>>>>>>>
>>>>>>>> It is not obvious at all how this is supposed to work, and the comment
>>>>>>>> beside cpupool_cpu_remove() doesn't help.
>>>>>>>>
>>>>>>>> Can you try the following (only compile tested) patch, which looks
>>>>>>>> plausibly like it might DTRT.  The for_each_cpupool() is a little nastly
>>>>>>>> but there appears to be no cpu_to_cpupool mapping available.
>>>>>>>
>>>>>>> Your patch has the disadvantage to support hot-unplug of the last cpu in
>>>>>>> a cpupool. The following should work, however:
>>>>>>
>>>>>> Disadvantage and support sounded a bit confusing. But I think it means
>>>>>> hot-unplugging the last cpu of a pool is bad and should not be working.
>>>>>
>>>>> Correct.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
>>>>>>> index 4a0e569..73249d3 100644
>>>>>>> --- a/xen/common/cpupool.c
>>>>>>> +++ b/xen/common/cpupool.c
>>>>>>> @@ -471,12 +471,24 @@ static void cpupool_cpu_add(unsigned int cpu)
>>>>>>>      */
>>>>>>>     static int cpupool_cpu_remove(unsigned int cpu)
>>>>>>>     {
>>>>>>> -    int ret = 0;
>>>>>>> +    int ret = -EBUSY;
>>>>>>> +    struct cpupool **c;
>>>>>>>
>>>>>>>         spin_lock(&cpupool_lock);
>>>>>>> -    if ( !cpumask_test_cpu(cpu, cpupool0->cpu_valid))
>>>>>>> -        ret = -EBUSY;
>>>>>>> +    if ( cpumask_test_cpu(cpu, cpupool0->cpu_valid) )
>>>>>>> +        ret = 0;
>>>>>>>         else
>>>>>>> +    {
>>>>>>> +        for_each_cpupool(c)
>>>>>>> +        {
>>>>>>> +            if ( cpumask_test_cpu(cpu, (*c)->cpu_suspended ) )
>>>>>>
>>>>>> The rest seems to keep the semantics the same as before (though does that mean
>>>>>> unplugging the last cpu of pool-0 is ok?) But why testing for suspended here to
>>>>>> succeed (and not valid)?
>>>>>
>>>>> Testing valid would again enable to remove the last cpu of a cpupool in
>>>>> case of hotplugging. cpu_suspended is set if all cpus are to be removed
>>>>> due to shutdown, suspend to ram/disk, ...
>>>>
>>>> Ah, ok. Thanks for the detail explanation. So I was trying this change in
>>>> parallel and can confirm that it gets rid of the panic on shutdown. But when I
>>>> try to offline any cpu in pool1 (if echoing 0 into /sys/devices/xen_cpu/xen_cpu?
>>>> is the correct method) I always get EBUSY. IOW I cannot hot-unplug any cpu that
>>>> is in a pool other than 0. It does only work after removing it from pool1, then
>>>> add it to pool0 and then echo 0 into online.
>>>
>>> That's how it was designed some years ago. I don't want to change the
>>> behavior in the hypervisor. Adding some tool support could make sense,
>>> however.
>>
>> Ok, so in that case everything works as expected and the change fixes the
>> currently broken shutdown and could be properly submitted for inclusion (with my
>> tested-by).
>>
>
> Is this needing something from my side to do? I could re-submit the whole patch
> but it since it is Juergen's work it felt a little rude to do so.

Patch is already in xen-unstable/staging.


Juergen


* Re: Shutdown panic in disable_nonboot_cpus after cpupool-numa-split
  2014-07-28  8:50                 ` Jürgen Groß
@ 2014-07-28  9:02                   ` Stefan Bader
  0 siblings, 0 replies; 11+ messages in thread
From: Stefan Bader @ 2014-07-28  9:02 UTC (permalink / raw)
  To: Jürgen Groß, Andrew Cooper,
	xen-devel@lists.xensource.com


On 28.07.2014 10:50, Jürgen Groß wrote:
> On 07/28/2014 10:36 AM, Stefan Bader wrote:
>> On 07.07.2014 16:43, Stefan Bader wrote:
>>> On 07.07.2014 16:28, Juergen Gross wrote:
>>>> On 07/07/2014 04:08 PM, Stefan Bader wrote:
>>>>> On 07.07.2014 15:03, Jürgen Groß wrote:
>>>>>> On 07/07/2014 02:49 PM, Stefan Bader wrote:
>>>>>>> On 07.07.2014 14:38, Jürgen Groß wrote:
>>>>>>>> On 07/07/2014 02:00 PM, Andrew Cooper wrote:
>>>>>>>>> On 07/07/14 12:33, Stefan Bader wrote:
>>>>>>>>>> I recently noticed that I get a  panic (rebooting the system) on
>>>>>>>>>> shutdown in
>>>>>>>>>> some
>>>>>>>>>     > cases. This happened only on my AMD system and also not all the
>>>>>>>>> time.
>>>>>>>>> Finally
>>>>>>>>>     > realized that it is related to the use of using cpupool-numa-split
>>>>>>>>>     > (libxl with xen-4.4 maybe, but not 100% sure 4.3 as well).
>>>>>>>>>     >
>>>>>>>>>     > What happens is that on shutdown the hypervisor runs
>>>>>>>>> disable_nonboot_cpus which
>>>>>>>>>     > call cpu_down for each online cpu. There is a BUG_ON in the code for
>>>>>>>>> the case of
>>>>>>>>>     > cpu_down returning -EBUSY. This happens in my case as soon as the
>>>>>>>>> first cpu that
>>>>>>>>>     > has been moved to pool-1 by cpupool-numa-split is attempted. The
>>>>>>>>> error is
>>>>>>>>>     > returned by running the notifier_call_chain and I suspect that ends
>>>>>>>>> up calling
>>>>>>>>>     > cpupool_cpu_remove which always returns EBUSY for cpus not in pool0.
>>>>>>>>>     >
>>>>>>>>>     > I am not sure which end needs to be fixed but looping over all
>>>>>>>>> online
>>>>>>>>> cpus in
>>>>>>>>>     > disable_nonboot_cpus sounds plausible. So maybe the check for
>>>>>>>>> pool-0 in
>>>>>>>>>     > cpupool_cpu_remove is wrong...?
>>>>>>>>>     >
>>>>>>>>>     > -Stefan
>>>>>>>>>
>>>>>>>>> Hmm yes - this looks completely broken.
>>>>>>>>>
>>>>>>>>> cpupool_cpu_remove() only has a single caller which is from cpu_down(),
>>>>>>>>> and will unconditionally fail for cpus outside of the default pool.
>>>>>>>>>
>>>>>>>>> It is not obvious at all how this is supposed to work, and the comment
>>>>>>>>> beside cpupool_cpu_remove() doesn't help.
>>>>>>>>>
>>>>>>>>> Can you try the following (only compile tested) patch, which looks
>>>>>>>>> plausibly like it might DTRT.  The for_each_cpupool() is a little nastly
>>>>>>>>> but there appears to be no cpu_to_cpupool mapping available.
>>>>>>>>
>>>>>>>> Your patch has the disadvantage to support hot-unplug of the last cpu in
>>>>>>>> a cpupool. The following should work, however:
>>>>>>>
>>>>>>> Disadvantage and support sounded a bit confusing. But I think it means
>>>>>>> hot-unplugging the last cpu of a pool is bad and should not be working.
>>>>>>
>>>>>> Correct.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
>>>>>>>> index 4a0e569..73249d3 100644
>>>>>>>> --- a/xen/common/cpupool.c
>>>>>>>> +++ b/xen/common/cpupool.c
>>>>>>>> @@ -471,12 +471,24 @@ static void cpupool_cpu_add(unsigned int cpu)
>>>>>>>>      */
>>>>>>>>     static int cpupool_cpu_remove(unsigned int cpu)
>>>>>>>>     {
>>>>>>>> -    int ret = 0;
>>>>>>>> +    int ret = -EBUSY;
>>>>>>>> +    struct cpupool **c;
>>>>>>>>
>>>>>>>>         spin_lock(&cpupool_lock);
>>>>>>>> -    if ( !cpumask_test_cpu(cpu, cpupool0->cpu_valid))
>>>>>>>> -        ret = -EBUSY;
>>>>>>>> +    if ( cpumask_test_cpu(cpu, cpupool0->cpu_valid) )
>>>>>>>> +        ret = 0;
>>>>>>>>         else
>>>>>>>> +    {
>>>>>>>> +        for_each_cpupool(c)
>>>>>>>> +        {
>>>>>>>> +            if ( cpumask_test_cpu(cpu, (*c)->cpu_suspended ) )
>>>>>>>
>>>>>>> The rest seems to keep the semantics the same as before (though does that
>>>>>>> mean
>>>>>>> unplugging the last cpu of pool-0 is ok?) But why testing for suspended
>>>>>>> here to
>>>>>>> succeed (and not valid)?
>>>>>>
>>>>>> Testing valid would again enable to remove the last cpu of a cpupool in
>>>>>> case of hotplugging. cpu_suspended is set if all cpus are to be removed
>>>>>> due to shutdown, suspend to ram/disk, ...
>>>>>
>>>>> Ah, ok. Thanks for the detail explanation. So I was trying this change in
>>>>> parallel and can confirm that it gets rid of the panic on shutdown. But when I
>>>>> try to offline any cpu in pool1 (if echoing 0 into
>>>>> /sys/devices/xen_cpu/xen_cpu?
>>>>> is the correct method) I always get EBUSY. IOW I cannot hot-unplug any cpu
>>>>> that
>>>>> is in a pool other than 0. It does only work after removing it from pool1,
>>>>> then
>>>>> add it to pool0 and then echo 0 into online.
>>>>
>>>> That's how it was designed some years ago. I don't want to change the
>>>> behavior in the hypervisor. Adding some tool support could make sense,
>>>> however.
>>>
>>> Ok, so in that case everything works as expected and the change fixes the
>>> currently broken shutdown and could be properly submitted for inclusion (with my
>>> tested-by).
>>>
>>
>> Is this needing something from my side to do? I could re-submit the whole patch
>> but it since it is Juergen's work it felt a little rude to do so.
> 
> Patch is already in xen-unstable/staging.

Argh, so it is. I always seem to forget about this branch. So I only checked
master. :/

Thanks,
Stefan
> 
> 
> Juergen
> 




end of thread, other threads:[~2014-07-28  9:02 UTC | newest]

Thread overview: 11+ messages
2014-07-07 11:33 Shutdown panic in disable_nonboot_cpus after cpupool-numa-split Stefan Bader
2014-07-07 12:00 ` Andrew Cooper
2014-07-07 12:38   ` Jürgen Groß
2014-07-07 12:49     ` Stefan Bader
2014-07-07 13:03       ` Jürgen Groß
2014-07-07 14:08         ` Stefan Bader
2014-07-07 14:28           ` Juergen Gross
2014-07-07 14:43             ` Stefan Bader
2014-07-28  8:36               ` Stefan Bader
2014-07-28  8:50                 ` Jürgen Groß
2014-07-28  9:02                   ` Stefan Bader
