From: Andrew Cooper
Subject: Re: Shutdown panic in disable_nonboot_cpus after cpupool-numa-split
Date: Mon, 7 Jul 2014 13:00:17 +0100
Message-ID: <53BA8BD1.4020506@citrix.com>
In-Reply-To: <53BA857A.8070608@canonical.com>
To: Stefan Bader, xen-devel@lists.xensource.com
List-Id: xen-devel@lists.xenproject.org

On 07/07/14 12:33, Stefan Bader wrote:
> I recently noticed that I get a panic (rebooting the system) on shutdown in some
> cases. This happened only on my AMD system and also not all the time. Finally
> realized that it is related to the use of cpupool-numa-split
> (libxl with xen-4.4; maybe 4.3 as well, but not 100% sure).
>
> What happens is that on shutdown the hypervisor runs disable_nonboot_cpus which
> calls cpu_down for each online cpu. There is a BUG_ON in the code for the case of
> cpu_down returning -EBUSY. This happens in my case as soon as the first cpu that
> has been moved to pool-1 by cpupool-numa-split is attempted. The error is
> returned by running the notifier_call_chain, and I suspect that ends up calling
> cpupool_cpu_remove, which always returns -EBUSY for cpus not in pool0.
>
> I am not sure which end needs to be fixed but looping over all online cpus in
> disable_nonboot_cpus sounds plausible. So maybe the check for pool-0 in
> cpupool_cpu_remove is wrong...?
>
> -Stefan


Hmm yes - this looks completely broken.

cpupool_cpu_remove() has only a single caller, reached via cpu_down(), and will unconditionally fail for cpus outside of the default pool.

It is not obvious at all how this is supposed to work, and the comment beside cpupool_cpu_remove() doesn't help.
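
To make the failure mode concrete, here is a minimal, self-contained model of the path described above - it is not Xen code (the cpu count and pool assignment are invented for illustration), just a sketch of disable_nonboot_cpus() taking every non-boot cpu down and treating -EBUSY from the cpupool path as fatal:

#include <assert.h>
#include <errno.h>
#include <stdio.h>

#define NR_CPUS 4

/* Model: cpus 0-1 stayed in pool0, cpus 2-3 were moved to pool1 by
 * cpupool-numa-split. */
static int cpu_in_pool0[NR_CPUS] = { 1, 1, 0, 0 };

/* Stands in for the CPU_DOWN_PREPARE notifier ending up in
 * cpupool_cpu_remove(), which currently refuses any cpu outside pool0. */
static int cpupool_cpu_remove(unsigned int cpu)
{
    return cpu_in_pool0[cpu] ? 0 : -EBUSY;
}

static int cpu_down(unsigned int cpu)
{
    return cpupool_cpu_remove(cpu);
}

/* Stands in for disable_nonboot_cpus(): every non-boot cpu is taken down,
 * and -EBUSY is treated as a bug. */
static void disable_nonboot_cpus(void)
{
    for ( unsigned int cpu = 1; cpu < NR_CPUS; cpu++ )
    {
        int error = cpu_down(cpu);

        printf("cpu_down(%u) = %d\n", cpu, error);
        assert(error != -EBUSY);   /* stands in for the BUG_ON() */
    }
}

int main(void)
{
    disable_nonboot_cpus();        /* trips at cpu 2, the first pool1 cpu */
    return 0;
}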

Can you try the following (only compile tested) patch, which looks like it plausibly might DTRT?  The for_each_cpupool() is a little nasty, but there appears to be no cpu_to_cpupool mapping available.

~Andrew

---8<-----
diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
index 4a0e569..a1af1ea 100644
--- a/xen/common/cpupool.c
+++ b/xen/common/cpupool.c
@@ -472,12 +472,12 @@ static void cpupool_cpu_add(unsigned int cpu)
 static int cpupool_cpu_remove(unsigned int cpu)
 {
     int ret = 0;
+    struct cpupool **c;
 
     spin_lock(&cpupool_lock);
-    if ( !cpumask_test_cpu(cpu, cpupool0->cpu_valid))
-        ret = -EBUSY;
-    else
-        cpumask_set_cpu(cpu, &cpupool_locked_cpus);
+    for_each_cpupool(c)
+        cpumask_clear_cpu(cpu, (*c)->cpu_valid);
+    cpumask_set_cpu(cpu, &cpupool_locked_cpus);
     spin_unlock(&cpupool_lock);
 
     return ret;


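For anyone reading along: for_each_cpupool() walks the global list of pools via a pointer-to-pointer, roughly along the following lines (quoted from memory, so treat the exact definition as an assumption and check xen/common/cpupool.c), which is why the cpumask_clear_cpu() above clears the cpu from every pool, cpupool0 included:

/* Sketch only - see xen/common/cpupool.c for the real definitions. */
static struct cpupool *cpupool_list;   /* linked list of all cpupools */

#define for_each_cpupool(ptr)    \
    for ( (ptr) = &cpupool_list; *(ptr) != NULL; (ptr) = &((*(ptr))->next) )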