From mboxrd@z Thu Jan 1 00:00:00 1970
From: George Dunlap
Subject: Re: Xen crashing when killing a domain with no VCPUs allocated
Date: Mon, 21 Jul 2014 11:49:18 +0100
Message-ID: <53CCF02E.7000607@eu.citrix.com>
References: <53C920DD.6060300@linaro.org> <1405701560.14973.1.camel@kazak.uk.xensource.com> <53C982FF.7070608@linaro.org> <53CCEC64.7040304@eu.citrix.com> <53CCEEA3.5080305@citrix.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
In-Reply-To: <53CCEEA3.5080305@citrix.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Andrew Cooper, Julien Grall, Ian Campbell
Cc: jgross@suse.com, Stefano Stabellini, Dario Faggioli, Tim Deegan, george.dunlap@citrix.com, xen-devel
List-Id: xen-devel@lists.xenproject.org

On 07/21/2014 11:42 AM, Andrew Cooper wrote:
> On 21/07/14 11:33, George Dunlap wrote:
>> On 07/18/2014 09:26 PM, Julien Grall wrote:
>>> On 18/07/14 17:39, Ian Campbell wrote:
>>>> On Fri, 2014-07-18 at 14:27 +0100, Julien Grall wrote:
>>>>> Hi all,
>>>>>
>>>>> I've been playing with the function alloc_vcpu on ARM, and I hit
>>>>> one case where this function can fail.
>>>>>
>>>>> During domain creation, the toolstack will call DOMCTL_max_vcpus,
>>>>> which may fail, for instance because alloc_vcpu didn't succeed. In
>>>>> this case the toolstack will call DOMCTL_domaindestroy, and I got
>>>>> the stack trace below.
>>>>>
>>>>> It can be reproduced on Xen 4.5 (and I suspect also Xen 4.4) by
>>>>> returning an error in vcpu_initialize.
>>>>>
>>>>> I'm not sure how to correctly fix it.
>>>> I think a simple check at the head of the function would be ok.
>>>>
>>>> Alternatively perhaps in sched_move_domain, which could either detect
>>>> this or could detect a domain in pool0 being moved to pool0 and short
>>>> circuit.
>>> I was thinking about the small fix below. If it's fine for everyone,
>>> I can send a patch next week.
>>>
>>> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
>>> index e9eb0bc..c44d047 100644
>>> --- a/xen/common/schedule.c
>>> +++ b/xen/common/schedule.c
>>> @@ -311,7 +311,7 @@ int sched_move_domain(struct domain *d, struct cpupool *c)
>>>      }
>>>      /* Do we have vcpus already? If not, no need to update node-affinity */
>>> -    if ( d->vcpu )
>>> +    if ( d->vcpu && d->vcpu[0] != NULL )
>>>          domain_update_node_affinity(d);
>> So is the problem that we're allocating the vcpu array area, but not
>> putting any vcpus in it?
> The problem (as I recall) was that domain_create() got midway through
> and alloc_vcpu(0) failed with -ENOMEM. Following that failure, the
> toolstack called domain_destroy().
>
> Having d->vcpu properly allocated and containing fully NULL pointers is
> a valid position to be in, especially in error or teardown paths.
>
>> Overall it seems like those checks for the existence of vcpus should be
>> moved into domain_update_node_affinity(). The ASSERT() there I think
>> is just a sanity check to make sure we're not getting a ridiculous
>> result out of our calculation; but of course if there actually are no
>> vcpus, it's not ridiculous at all.
>>
>> One solution might be to change the ASSERT to
>> ASSERT(!cpumask_empty(dom_cpumask) || !d->vcpu || !d->vcpu[0]). Then
>> we could probably even remove the d->vcpu conditional when calling it.
> If you were going along this line, the pointer checks are substantially
> less expensive than cpumask_empty(), so the ||'s should be reordered.
> However, I am not convinced that it is necessarily the best solution,
> given my previous observation.

Er, I was with you until the last part. What's wrong with changing the
assert from "Make sure I have *something* in there" to "Make sure I have
*something* in there *if I have any vcpus*"?

That seems to be accepting that having d->vcpu allocated but full of
null pointers is a valid condition.

 -George
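
For reference, a minimal standalone sketch (not the actual Xen code;
struct domain, struct vcpu and the mask below are simplified stand-ins
for the real hypervisor types) of the relaxed assertion under
discussion, with the cheap pointer checks ordered ahead of the mask
check as Andrew suggests:

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct vcpu { int vcpu_id; };

struct domain {
    struct vcpu **vcpu;     /* NULL, or an array that may hold only NULLs */
    uint64_t node_affinity; /* simplified stand-in for the computed cpumask */
};

/* An empty affinity mask is acceptable when the domain has no vcpus yet. */
static void check_node_affinity(const struct domain *d)
{
    /* Pointer checks first: cheaper than scanning a real cpumask. */
    assert(!d->vcpu || !d->vcpu[0] || d->node_affinity != 0);
}

int main(void)
{
    struct vcpu *no_vcpus[1] = { NULL };
    struct domain early = { .vcpu = no_vcpus, .node_affinity = 0 };

    /* d->vcpu allocated but full of NULL pointers: the check passes. */
    check_node_affinity(&early);
    return 0;
}

It uses plain assert() and a flat integer mask so it compiles outside
the hypervisor; in Xen the mask check would be cpumask_empty() on the
node-affinity mask computed in domain_update_node_affinity().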