From mboxrd@z Thu Jan 1 00:00:00 1970
From: Juergen Gross
Subject: Re: [PATCH 1/4] expand x86 arch_shared_info to support linear p2m list
Date: Tue, 18 Nov 2014 06:33:25 +0100
Message-ID: <546ADA25.4000709@suse.com>
References: <1415957846-22703-1-git-send-email-jgross@suse.com>
 <1415957846-22703-2-git-send-email-jgross@suse.com>
 <5465EA63.3010007@citrix.com> <5465FB34.9010606@suse.com>
 <54660A16.2090006@citrix.com> <54660E5C.8030107@suse.com>
 <546618D9.5070200@citrix.com> <54662096.6060603@suse.com>
 <546628F7.4000008@citrix.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="windows-1252"; Format="flowed"
Content-Transfer-Encoding: quoted-printable
In-Reply-To: <546628F7.4000008@citrix.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Andrew Cooper, xen-devel@lists.xensource.com, jbeulich@suse.com,
 konrad.wilk@oracle.com, david.vrabel@citrix.com
List-Id: xen-devel@lists.xenproject.org

On 11/14/2014 05:08 PM, Andrew Cooper wrote:
> On 14/11/14 15:32, Juergen Gross wrote:
>> On 11/14/2014 03:59 PM, Andrew Cooper wrote:
>>> On 14/11/14 14:14, Jürgen Groß wrote:
>>>> On 11/14/2014 02:56 PM, Andrew Cooper wrote:
>>>>> On 14/11/14 12:53, Juergen Gross wrote:
>>>>>> On 11/14/2014 12:41 PM, Andrew Cooper wrote:
>>>>>>> On 14/11/14 09:37, Juergen Gross wrote:
>>>>>>>> The x86 struct arch_shared_info field pfn_to_mfn_frame_list_list
>>>>>>>> currently contains the mfn of the top level page frame of the 3 level
>>>>>>>> p2m tree, which is used by the Xen tools during saving and restoring
>>>>>>>> (and live migration) of pv domains and for crash dump analysis. With
>>>>>>>> three levels of the p2m tree it is possible to support up to 512 GB of
>>>>>>>> RAM for a 64 bit pv domain.
>>>>>>>>
>>>>>>>> A 32 bit pv domain can support more, as each memory page can hold 1024
>>>>>>>> instead of 512 entries, leading to a limit of 4 TB.
>>>>>>>>
>>>>>>>> To be able to support more RAM on x86-64 switch to a virtual mapped
>>>>>>>> p2m list.
>>>>>>>>
>>>>>>>> This patch expands struct arch_shared_info with a new p2m list virtual
>>>>>>>> address and the mfn of the page table root. The new information is
>>>>>>>> indicated by the domain to be valid by storing ~0UL into
>>>>>>>> pfn_to_mfn_frame_list_list. The hypervisor indicates usability of this
>>>>>>>> feature by a new flag XENFEAT_virtual_p2m.
>>>>>>>
>>>>>>> How do you envisage this being used? Are you expecting the tools to do
>>>>>>> manual pagetable walks using xc_map_foreign_xxx()?
>>>>>>
>>>>>> Yes. Not very different compared to today's mapping via the 3 level
>>>>>> p2m tree. Just another entry format, 4 instead of 3 levels and starting
>>>>>> at an offset.
>>>>>
>>>>> Yes - David and I were discussing this over lunch, and it is not
>>>>> actually very different.
>>>>>
>>>>> In reality, how likely is it that the pages backing this virtual linear
>>>>> array change?
>>>>
>>>> Very unlikely, I think. But not impossible.
>>>>
>>>>> One issue currently is that, during the live part of migration, the
>>>>> toolstack has no way of working out whether the structure of the p2m has
>>>>> changed (intermediate leaves rearranged, or the length increasing).
>>>>>
>>>>> In the case that the VM does change the structure of the p2m under the
>>>>> feet of the toolstack, migration will either blow up in a non-subtle way
>>>>> with a p2m/m2p mismatch, or in a subtle way with the receiving side
>>>>> copying the new p2m over the wrong part of the new domain.
>>>>>
>>>>> I am wondering whether, with this new p2m method, we can take sufficient
>>>>> steps to be able to guarantee mishaps like this can't occur.
>>>>
>>>> This should be easy: I could add a counter in arch_shared_info which is
>>>> incremented whenever a p2m mapping is being changed. The toolstack could
>>>> compare the counter values before start and at end of migration and redo
>>>> the migration (or fail) if they are different. In order to avoid races
>>>> I would have to increment the counter before and after changing the
>>>> mapping.
>>>>
>>>
>>> That is insufficient I believe.
>>>
>>> Consider:
>>>
>>> * Toolstack walks pagetables and maps the frames containing the
>>>   linear p2m
>>> * Live migration starts
>>> * VM remaps a frame in the middle of the linear p2m
>>> * Live migration continues, but the toolstack has a stale frame in the
>>>   middle of its view of the p2m.
>>
>> This would be covered by my suggestion. At the end of the memory
>> transfer (with some bogus contents) the toolstack would discover the
>> change of the p2m structure and either fail the migration or start it
>> from the beginning and thus overwriting the bogus frames.
>
> Checking after pause is too late. The content of the p2m is used to verify
> each frame being sent on the wire, so is in active use for the entire
> duration of live migration.
>
> If the toolstack starts verifying frames being sent using information
> from a stale p2m, the best that can be hoped for is that the toolstack
> declares that the p2m and m2p are inconsistent and aborts the migration.
>
>>
>>> As the p2m is almost never expected to change, I think it might be
>>> better to have a flag the toolstack can set to say "The toolstack is
>>> peeking at your p2m behind your back - you must not change its
>>> structure."
>>
>> Be careful here: changes of the structure can be due to two scenarios:
>> - ballooning (invalid entries being populated): this is no problem, as
>>   we can stop the ballooning during live migration.
>> - mapping of grant pages e.g.
>>   in a stub domain (first map in an area formerly marked as invalid):
>>   you can't stop this, as the stub domain has to do some work. Here a
>>   restart of the migration should work, as the p2m structure change can
>>   only happen once for each affected p2m page.
>
> Migration is not at all possible with a domain referencing foreign frames.
>
> The live part can cope with foreign frames referenced in the ptes. As
> part of the pause handling in the VM, the frontends must unmap any
> grants they have. After pause, any remaining foreign frames cause a
> migration failure.
>
>>
>>> Having just thought this through, I think there is also a race condition
>>> between a VM changing an entry in the p2m, and the toolstack doing
>>> verifications of frames being sent.
>>
>> Okay, so the flag you mentioned should just prohibit changes in the
>> p2m list related to memory frames of the affected domain: ballooning
>> up or down, or rearranging the memory layout (does this happen today?).
>> Mapping and unmapping of grant pages should still be allowed.
>
> HVM guests don't have any of their p2m updates represented in the
> logdirty bitmap, so ballooning an HVM guest during migrate leads to
> unexpected holes or lack of holes on the resuming side, leading to a
> very confused balloon driver.
>
> At the time I had not found a problem with PV guests, but it is now
> clear that there is a period of time when a guest is altering its p2m
> where the p2m and m2p are out of sync, which will cause a migration
> failure if the toolstack observes this artefact.

So ballooning should be disabled during migration. I think this should
be handled via callbacks triggered by xenstore: one at start of
migration to stop ballooning and one at end to restart it.

I wouldn't want to tie this functionality to the p2m list structure, as
it is not related to it.


Juergen
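
For illustration, the counter scheme quoted earlier in this mail
(increment before and after changing a p2m mapping, compare on the
toolstack side) could be modelled roughly as below. This is only a
sketch under assumptions: SharedInfo, p2m_generation, guest_update_p2m,
toolstack_snapshot and snapshot_still_valid are hypothetical names, not
part of the patch or of any Xen interface. The model is single-threaded
Python and ignores memory barriers. Since a single check at the end of
migration was called too late in the thread, the reader here validates
the counter around every snapshot and can re-check it before each use.

```python
class SharedInfo:
    """Hypothetical stand-in for the shared page: a p2m list plus a counter."""

    def __init__(self, p2m):
        self.p2m = list(p2m)
        # Hypothetical field: even means stable, odd means update in progress.
        self.p2m_generation = 0


def guest_update_p2m(shared, pfn, mfn):
    """Guest side: bump the counter before and after changing a mapping."""
    shared.p2m_generation += 1  # now odd: readers must not trust the p2m
    shared.p2m[pfn] = mfn
    shared.p2m_generation += 1  # even again, but different from before


def toolstack_snapshot(shared, max_retries=3):
    """Toolstack side: copy the p2m and return it with the generation seen."""
    for _ in range(max_retries):
        before = shared.p2m_generation
        if before % 2:
            continue  # an update is in progress, try again
        view = list(shared.p2m)
        if shared.p2m_generation == before:
            return before, view
    raise RuntimeError("p2m kept changing, giving up")


def snapshot_still_valid(shared, generation):
    """True if no p2m update started since the snapshot was taken."""
    return shared.p2m_generation == generation
```

A caller would take a snapshot, then call snapshot_still_valid() before
trusting the copied view for each batch of frames, and redo the snapshot
(or fail the migration) when it returns False.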