From mboxrd@z Thu Jan  1 00:00:00 1970
From: Juergen Gross <jgross@suse.com>
Subject: Re: [PATCH 1/4] expand x86 arch_shared_info to support
 linear p2m list
Date: Tue, 18 Nov 2014 06:33:25 +0100
Message-ID: <546ADA25.4000709@suse.com>
References: <1415957846-22703-1-git-send-email-jgross@suse.com>	<1415957846-22703-2-git-send-email-jgross@suse.com>	<5465EA63.3010007@citrix.com>	<5465FB34.9010606@suse.com>	<54660A16.2090006@citrix.com>	<54660E5C.8030107@suse.com>	<546618D9.5070200@citrix.com>	<54662096.6060603@suse.com>
	<546628F7.4000008@citrix.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="windows-1252"; Format="flowed"
Content-Transfer-Encoding: quoted-printable
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <546628F7.4000008@citrix.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Andrew Cooper <andrew.cooper3@citrix.com>, xen-devel@lists.xensource.com, jbeulich@suse.com, konrad.wilk@oracle.com, david.vrabel@citrix.com
List-Id: xen-devel@lists.xenproject.org

On 11/14/2014 05:08 PM, Andrew Cooper wrote:
> On 14/11/14 15:32, Juergen Gross wrote:
>> On 11/14/2014 03:59 PM, Andrew Cooper wrote:
>>> On 14/11/14 14:14, Jürgen Groß wrote:
>>>> On 11/14/2014 02:56 PM, Andrew Cooper wrote:
>>>>> On 14/11/14 12:53, Juergen Gross wrote:
>>>>>> On 11/14/2014 12:41 PM, Andrew Cooper wrote:
>>>>>>> On 14/11/14 09:37, Juergen Gross wrote:
>>>>>>>> The x86 struct arch_shared_info field pfn_to_mfn_frame_list_list
>>>>>>>> currently contains the mfn of the top level page frame of the
>>>>>>>> 3 level p2m tree, which is used by the Xen tools during saving
>>>>>>>> and restoring (and live migration) of pv domains and for crash
>>>>>>>> dump analysis. With three levels of the p2m tree it is possible
>>>>>>>> to support up to 512 GB of RAM for a 64 bit pv domain.
>>>>>>>>
>>>>>>>> A 32 bit pv domain can support more, as each memory page can
>>>>>>>> hold 1024 instead of 512 entries, leading to a limit of 4 TB.
>>>>>>>>
>>>>>>>> To be able to support more RAM on x86-64, switch to a virtually
>>>>>>>> mapped p2m list.
>>>>>>>>
>>>>>>>> This patch expands struct arch_shared_info with a new p2m list
>>>>>>>> virtual address and the mfn of the page table root. The new
>>>>>>>> information is indicated by the domain to be valid by storing
>>>>>>>> ~0UL into pfn_to_mfn_frame_list_list. The hypervisor indicates
>>>>>>>> usability of this feature by a new flag XENFEAT_virtual_p2m.
>>>>>>>
>>>>>>> How do you envisage this being used?  Are you expecting the
>>>>>>> tools to do manual pagetable walks using xc_map_foreign_xxx() ?
>>>>>>
>>>>>> Yes. Not very different compared to today's mapping via the 3 level
>>>>>> p2m tree. Just another entry format, 4 instead of 3 levels and
>>>>>> starting at an offset.
>>>>>
>>>>> Yes - David and I were discussing this over lunch, and it is not
>>>>> actually very different.
>>>>>
>>>>> In reality, how likely is it that the pages backing this virtual
>>>>> linear array change?
>>>>
>>>> Very unlikely, I think. But not impossible.
>>>>
>>>>> One issue currently is that, during the live part of migration, the
>>>>> toolstack has no way of working out whether the structure of the
>>>>> p2m has changed (intermediate leaves rearranged, or the length
>>>>> increasing).
>>>>>
>>>>> In the case that the VM does change the structure of the p2m under the
>>>>> feet of the toolstack, migration will either blow up in a
>>>>> non-subtle way with a p2m/m2p mismatch, or in a subtle way with the
>>>>> receiving side copying the new p2m over the wrong part of the new
>>>>> domain.
>>>>>
>>>>> I am wondering whether, with this new p2m method, we can take
>>>>> sufficient steps to be able to guarantee mishaps like this can't
>>>>> occur.
>>>>
>>>> This should be easy: I could add a counter in arch_shared_info which is
>>>> incremented whenever a p2m mapping is being changed. The toolstack
>>>> could compare the counter values before start and at end of migration
>>>> and redo the migration (or fail) if they are different. In order to
>>>> avoid races I would have to increment the counter before and after
>>>> changing the mapping.
>>>>
>>>
>>> That is insufficient I believe.
>>>
>>> Consider:
>>>
>>> * Toolstack walks pagetables and maps the frames containing the
>>> linear p2m
>>> * Live migration starts
>>> * VM remaps a frame in the middle of the linear p2m
>>> * Live migration continues, but the toolstack has a stale frame in the
>>> middle of its view of the p2m.
>>
>> This would be covered by my suggestion. At the end of the memory
>> transfer (with some bogus contents) the toolstack would discover the
>> change of the p2m structure and either fail the migration or start it
>> from the beginning and thus overwriting the bogus frames.
>
> Checking after pause is too late.  The content of the p2m is used to verify
> each frame being sent on the wire, so is in active use for the entire
> duration of live migration.
>
> If the toolstack starts verifying frames being sent using information
> from a stale p2m, the best that can be hoped for is that the toolstack
> declares that the p2m and m2p are inconsistent and aborts the migration.
>
>>
>>> As the p2m is almost never expected to change, I think it might be
>>> better to have a flag the toolstack can set to say "The toolstack is
>>> peeking at your p2m behind your back - you must not change its
>>> structure."
>>
>> Be careful here: changes of the structure can be due to two scenarios:
>> - ballooning (invalid entries being populated): this is no problem, as
>>    we can stop the ballooning during live migration.
>> - mapping of grant pages e.g. in a stub domain (first map in an area
>>    formerly marked as invalid): you can't stop this, as the stub domain
>>    has to do some work. Here a restart of the migration should work, as
>>    the p2m structure change can only happen once for each affected p2m
>>    page.
>
> Migration is not at all possible with a domain referencing foreign frames.
>
> The live part can cope with foreign frames referenced in the ptes.  As
> part of the pause handling in the VM, the frontends must unmap any
> grants they have.  After pause, any remaining foreign frames cause a
> migration failure.
>
>>
>>> Having just thought this through, I think there is also a race condition
>>> between a VM changing an entry in the p2m, and the toolstack doing
>>> verifications of frames being sent.
>>
>> Okay, so the flag you mentioned should just prohibit changes in the
>> p2m list related to memory frames of the affected domain: ballooning
>> up or down, or rearranging the memory layout (does this happen today?).
>> Mapping and unmapping of grant pages should be still allowed.
>
> HVM guests don't have any of their p2m updates represented in the
> logdirty bitmap, so ballooning an HVM guest during migrate leads to
> unexpected holes or lack of holes on the resuming side, leading to a
> very confused balloon driver.
>
> At the time I had not found a problem with PV guests, but it is now
> clear that there is a period of time when a guest is altering its p2m
> where the p2m and m2p are out of sync, which will cause a migration
> failure if the toolstack observes this artefact.

So ballooning should be disabled during migration. I think this should
be handled via callbacks triggered by xenstore: one at start of
migration to stop ballooning and one at end to restart it. I wouldn't
want to tie this functionality to the p2m list structure, as it is
not related to it.
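
A rough sketch of the guest side of what I have in mind. All names here
are made up for illustration, none of this is existing Xen or Linux code,
and the xenstore watch plumbing that would trigger the two callbacks is
left out entirely. The only point is that a balloon request arriving
during migration is remembered and applied after the end callback:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical balloon state, not a real Xen or Linux structure. */
struct balloon_state {
    bool migration_active;  /* true between the start and end callbacks */
    long current_pages;     /* pages the guest currently owns */
    long target_pages;      /* most recently requested target */
};

/* Move to the target unless migration is running; report whether we did. */
static bool balloon_apply(struct balloon_state *b)
{
    assert(b != NULL);
    if (b->migration_active)
        return false;       /* frozen: leave the p2m contents alone */
    b->current_pages = b->target_pages;
    return true;
}

/* Record a new target; it is only applied right away if allowed. */
static bool balloon_set_target(struct balloon_state *b, long pages)
{
    assert(b != NULL);
    b->target_pages = pages;
    return balloon_apply(b);
}

/* Assumed callback for the "migration starts" notification. */
static void on_migration_start(struct balloon_state *b)
{
    assert(b != NULL);
    b->migration_active = true;
}

/* Assumed callback for the "migration ended" notification. */
static void on_migration_end(struct balloon_state *b)
{
    assert(b != NULL);
    b->migration_active = false;
    balloon_apply(b);       /* catch up with any deferred target */
}
```

Nothing in this sketch touches the p2m list structure, which is why I
would keep the two topics separate.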

Juergen