From mboxrd@z Thu Jan 1 00:00:00 1970
From: George Dunlap
Subject: Re: Xen crashing when killing a domain with no VCPUs allocated
Date: Mon, 21 Jul 2014 11:49:18 +0100
Message-ID: <53CCF02E.7000607@eu.citrix.com>
References: <53C920DD.6060300@linaro.org> <1405701560.14973.1.camel@kazak.uk.xensource.com> <53C982FF.7070608@linaro.org> <53CCEC64.7040304@eu.citrix.com> <53CCEEA3.5080305@citrix.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
In-Reply-To: <53CCEEA3.5080305@citrix.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Andrew Cooper, Julien Grall, Ian Campbell
Cc: jgross@suse.com, Stefano Stabellini, Dario Faggioli, Tim Deegan, george.dunlap@citrix.com, xen-devel
List-Id: xen-devel@lists.xenproject.org

On 07/21/2014 11:42 AM, Andrew Cooper wrote:
> On 21/07/14 11:33, George Dunlap wrote:
>> On 07/18/2014 09:26 PM, Julien Grall wrote:
>>> On 18/07/14 17:39, Ian Campbell wrote:
>>>> On Fri, 2014-07-18 at 14:27 +0100, Julien Grall wrote:
>>>>> Hi all,
>>>>>
>>>>> I've been playing with the function alloc_vcpu on ARM, and I hit
>>>>> one case where this function can fail.
>>>>>
>>>>> During domain creation, the toolstack will call DOMCTL_max_vcpus,
>>>>> which may fail, for instance because alloc_vcpu didn't succeed. In
>>>>> this case the toolstack will call DOMCTL_domaindestroy, and I got
>>>>> the stack trace below.
>>>>>
>>>>> It can be reproduced on Xen 4.5 (and I suspect also Xen 4.4) by
>>>>> returning an error in vcpu_initialize.
>>>>>
>>>>> I'm not sure how to correctly fix it.
>>>> I think a simple check at the head of the function would be ok.
>>>>
>>>> Alternatively perhaps in sched_move_domain, which could either detect
>>>> this or could detect a domain in pool0 being moved to pool0 and short
>>>> circuit.
>>> I was thinking about the small fix below. If it's fine for everyone,
>>> I can send a patch next week.
>>>
>>> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
>>> index e9eb0bc..c44d047 100644
>>> --- a/xen/common/schedule.c
>>> +++ b/xen/common/schedule.c
>>> @@ -311,7 +311,7 @@ int sched_move_domain(struct domain *d, struct cpupool *c)
>>>      }
>>>      /* Do we have vcpus already? If not, no need to update node-affinity */
>>> -    if ( d->vcpu )
>>> +    if ( d->vcpu && d->vcpu[0] != NULL )
>>>          domain_update_node_affinity(d);
>> So is the problem that we're allocating the vcpu array area, but not
>> putting any vcpus in it?
> The problem (as I recall) was that domain_create() got midway through
> and alloc_vcpu(0) failed with -ENOMEM. Following that failure, the
> toolstack called domain_destroy().
>
> Having d->vcpu properly allocated and containing fully NULL pointers is
> a valid position to be in, especially in error or teardown paths.
>
>> Overall it seems like those checks for the existence of vcpus should be
>> moved into domain_update_node_affinity(). The ASSERT() there I think
>> is just a sanity check to make sure we're not getting a ridiculous
>> result out of our calculation; but of course if there actually are no
>> vcpus, it's not ridiculous at all.
>>
>> One solution might be to change the ASSERT to
>> ASSERT(!cpumask_empty(dom_cpumask) || !d->vcpu || !d->vcpu[0]). Then
>> we could probably even remove the d->vcpu conditional when calling it.
> If you were going along this line, the pointer checks are substantially
> less expensive than cpumask_empty(), so the ||'s should be reordered.
> However, I am not convinced that it is necessarily the best solution,
> given my previous observation.

Er, I was with you until the last part. What's wrong with changing the
assert from "Make sure I have *something* in there" to "Make sure I have
*something* in there *if I have any vcpus*"?

That seems to be accepting that having d->vcpu allocated but full of
null pointers is a valid condition.

 -George
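
For reference, a minimal standalone sketch (not the actual Xen code;
struct domain, struct vcpu and the mask below are simplified stand-ins
for the real hypervisor types) of the relaxed assertion under
discussion, with the cheap pointer checks ordered ahead of the mask
check as Andrew suggests:

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct vcpu { int vcpu_id; };

struct domain {
    struct vcpu **vcpu;     /* NULL, or an array that may hold only NULLs */
    uint64_t node_affinity; /* simplified stand-in for the computed cpumask */
};

/* An empty affinity mask is acceptable when the domain has no vcpus yet. */
static void check_node_affinity(const struct domain *d)
{
    /* Pointer checks first: cheaper than scanning a real cpumask. */
    assert(!d->vcpu || !d->vcpu[0] || d->node_affinity != 0);
}

int main(void)
{
    struct vcpu *no_vcpus[1] = { NULL };
    struct domain early = { .vcpu = no_vcpus, .node_affinity = 0 };

    /* d->vcpu allocated but full of NULL pointers: the check passes. */
    check_node_affinity(&early);
    return 0;
}

It uses plain assert() and a flat integer mask so it compiles outside
the hypervisor; in Xen the mask check would be cpumask_empty() on the
node-affinity mask computed in domain_update_node_affinity().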