From: Juergen Gross <jgross@suse.com>
To: David Vrabel <david.vrabel@citrix.com>,
Andrew Cooper <andrew.cooper3@citrix.com>,
ian.campbell@citrix.com, ian.jackson@eu.citrix.com,
jbeulich@suse.com, keir@xen.org, tim@xen.org,
xen-devel@lists.xen.org
Subject: Re: [PATCH V3 1/1] expand x86 arch_shared_info to support >3 level p2m tree
Date: Tue, 16 Sep 2014 05:52:43 +0200
Message-ID: <5417B40B.4000703@suse.com>
In-Reply-To: <5416F809.7060509@citrix.com>
On 09/15/2014 04:30 PM, David Vrabel wrote:
> On 15/09/14 11:46, Juergen Gross wrote:
>> On 09/15/2014 12:30 PM, David Vrabel wrote:
>>> On 15/09/14 10:52, Juergen Gross wrote:
>>>> On 09/15/2014 11:44 AM, David Vrabel wrote:
>>>>> On 15/09/14 09:52, Juergen Gross wrote:
>>>>>> On 09/15/2014 10:29 AM, Andrew Cooper wrote:
>>>>>>>
>>>>>>> On 12/09/2014 11:31, Juergen Gross wrote:
>>>>>>>> On 09/09/2014 12:49 PM, Juergen Gross wrote:
>>>>>>>>> On 09/09/2014 12:27 PM, Andrew Cooper wrote:
>>>>>>>>>> On 09/09/14 10:58, Juergen Gross wrote:
>>>>>>>>>>> The x86 struct arch_shared_info field pfn_to_mfn_frame_list_list
>>>>>>>>>>> currently contains the mfn of the top level page frame of the 3 level
>>>>>>>>>>> p2m tree, which is used by the Xen tools during saving and restoring
>>>>>>>>>>> (and live migration) of pv domains. With three levels of the p2m tree
>>>>>>>>>>> it is possible to support up to 512 GB of RAM for a 64 bit pv domain.
>>>>>>>>>>> A 32 bit pv domain can support more, as each memory page can hold 1024
>>>>>>>>>>> instead of 512 entries, leading to a limit of 4 TB. To be able to
>>>>>>>>>>> support more RAM on x86-64 an additional level is to be added.
>>>>>>>>>>>
>>>>>>>>>>> This patch expands struct arch_shared_info with a new p2m tree root
>>>>>>>>>>> and the number of levels of the p2m tree. The new information is
>>>>>>>>>>> indicated by the domain to be valid by storing ~0UL into
>>>>>>>>>>> pfn_to_mfn_frame_list_list (this should be done only if more than
>>>>>>>>>>> three levels are needed, of course).
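
As a quick check of the limits quoted above, the reach of an N-level p2m
tree can be worked out from the page size and the entry size. This is only
a minimal sketch, assuming 4 KiB pages and one machine-word entry per slot
(8 bytes on 64 bit, 4 bytes on 32 bit). The name p2m_max_bytes is made up
for illustration and is not part of the patch.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB x86 pages

GiB = 1 << 30
TiB = 1 << 40


def p2m_max_bytes(levels, entry_size):
    """Return the amount of guest RAM an N-level p2m tree can describe.

    Each page of the tree holds PAGE_SIZE // entry_size entries, so a
    tree with `levels` levels covers entries_per_page ** levels pfns,
    and each pfn stands for one page of guest memory.
    """
    entries_per_page = PAGE_SIZE // entry_size
    return entries_per_page ** levels * PAGE_SIZE


print(p2m_max_bytes(3, 8) // GiB)  # 64 bit pv domain, 3 levels, in GiB
print(p2m_max_bytes(3, 4) // TiB)  # 32 bit pv domain, 3 levels, in TiB
print(p2m_max_bytes(4, 8) // TiB)  # 64 bit pv domain, 4 levels, in TiB
```

Under these assumptions the three calls give 512 GiB, 4 TiB and 256 TiB,
which matches the 512 GB and 4 TB figures in the patch description.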
>>>>>>>>>>
>>>>>>>>>> A small domain feeling a little tight on space could easily opt for
>>>>>>>>>> a 2 or even 1 level p2m. (After all, one advantage of virt is to cram
>>>>>>>>>> many small VMs into a server).
>>>>>>>>>>
>>>>>>>>>> How is xen and toolstack support for n-level p2ms going to be
>>>>>>>>>> advertised to guests? Simply assuming the toolstack is capable of
>>>>>>>>>> dealing with this new scheme won't work with a new pv guest running
>>>>>>>>>> on an older Xen.
>>>>>>>>>
>>>>>>>>> Is it really worth doing such an optimization? This would save only
>>>>>>>>> very few pages.
>>>>>>>>>
>>>>>>>>> If you think it should be done we can add another SIF_* flag to
>>>>>>>>> start_info->flags. In this case a domain using this feature could not
>>>>>>>>> be migrated to a server with old tools, however. So we would probably
>>>>>>>>> end up with the need to be able to suppress that flag on a per-domain
>>>>>>>>> basis.
>>>>>>>>
>>>>>>>> Any further comments?
>>>>>>>>
>>>>>>>> Which way should I go?
>>>>>>>>
>>>>>>>
>>>>>>> There are two approaches, with different up/downsides
>>>>>>>
>>>>>>> 1) continue to use the old method, and use the new method only when
>>>>>>> absolutely required. This will function, but on old toolstacks, suffer
>>>>>>> migration/suspend failures when the toolstack fails to find the p2m.
>>>>>>>
>>>>>>> 2) Provide a Xen feature flag indicating the presence of N-level p2m
>>>>>>> support. Guests which can see this flag are free to use N-level, and
>>>>>>> guests which can't are not.
>>>>>>>
>>>>>>> Ultimately, giving more than 512GB to a current 64bit PV domain is not
>>>>>>> going to work, and the choice above depends on which failure mode you
>>>>>>> wish a new/old mix to have.
>>>>>>
>>>>>> I'd prefer solution 1), as it will enable Dom0 with more than 512 GB
>>>>>> without requiring a change of any Xen component. Additionally large
>>>>>> domains can be started by users who don't care for migrating or
>>>>>> suspending them.
>>>>>>
>>>>>> So I'd rather keep my patch as posted.
>>>>>
>>>>> PV guests can have extra memory added, beyond their initial limit.
>>>>> Supporting this would require option 2.
>>>>
>>>> I don't see why this should require option 2.
>>>
>>> Um...
>>>
>>>> Option 1 only prohibits suspending/migrating a domain with more than
>>>> 512 GB.
>>>
>>> ...this is the reason.
>>>
>>> With the exception of VMs that have assigned direct access to hardware,
>>> migration is an essential feature and must be supported.
>>
>> So you'd prefer:
>>
>> 1) >512GB pv-domains (including Dom0) will be supported only with new
>> Xen (4.6?), no matter if the user requires migration to be supported
>
> Yes. >512 GiB and not being able to migrate are not obviously related
> from the point of view of the end user (unlike assigning a PCI device).
>
> Failing at domain save time is most likely too late for the end user.
What would you think about the following compromise?

We add a flag that indicates support of multi-level p2m. Additionally,
the Linux kernel can ignore the flag not being set, either if started as
Dom0 or if told to do so via a kernel parameter.
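
The rule I have in mind could be sketched roughly like this. All names
below (may_use_multilevel_p2m and its three arguments) are hypothetical:
neither the flag nor the kernel parameter exists yet, and this is only
meant to illustrate the decision, not how it would be coded in the kernel.

```python
def may_use_multilevel_p2m(xen_has_flag, is_dom0, forced_by_param):
    """Decide whether a pv kernel may set up a p2m tree with >3 levels.

    xen_has_flag:    Xen/toolstack advertise multi-level p2m support
    is_dom0:         the kernel was started as Dom0
    forced_by_param: the admin requested it via a (hypothetical)
                     kernel parameter
    """
    if xen_has_flag:
        # Tools understand the new layout, so save/migrate keeps working.
        return True
    # Flag not set: go ahead only where the user has accepted that the
    # domain cannot be saved or migrated by old tools.
    return is_dom0 or forced_by_param


# A domU on an old Xen without the override stays at 3 levels.
print(may_use_multilevel_p2m(False, False, False))
# Dom0 on an old Xen may still use more than 3 levels.
print(may_use_multilevel_p2m(False, True, False))
```

With this rule a domU only loses the ability to be migrated if the admin
explicitly asked for it, while Dom0 is never saved or migrated anyway.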
>
>> to:
>>
>> 2) >512GB pv-domains (especially Dom0 and VMs with direct hw access) can
>> be started on current Xen versions, migration is possible only if Xen
>> is new (4.6?)
>
> There's also my preferred option:
>
> 3) >512 GiB PV domains are not supported. Large guests must be PVH or
> PVHVM.
In theory okay, but not right now, I think. PVH Dom0 is not
production-ready.
Juergen
>
>> What is the common use case for migration? I doubt it is used very often
>> for really huge domains.
>
> XenServer uses it for server pool upgrades with no VM downtime.
>
> Also, today's huge VM is tomorrow's common size.
>
> David
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>