Re: Design question for PV superpage support

From: Jeremy Fitzhardinge <jeremy@goop.org>
To: Mick.Jordan@sun.com
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>,
	Dave McCracken <dcm@mccr.org>,
	Xen Developers List <xen-devel@lists.xensource.com>
Subject: Re: Design question for PV superpage support
Date: Tue, 03 Mar 2009 09:23:36 -0800
Message-ID: <49AD6798.8040303@goop.org>
In-Reply-To: <49AD6397.9030803@Sun.COM>

Mick Jordan wrote:
> On 03/03/09 06:33, Dan Magenheimer wrote:
>>> In general, I think the guest should assume that large page 
>>> mappings are 
>>> merely an optimization that (a) might not be possible on domain start 
>>> due to machine memory fragmentation and (b) that this condition might 
>>> also occur on restore. Given these, it must always be prepared to 
>>> function with 4K pages, which implies that it would need to preserve 
>>> enough page table frame memory to be able to revert from large 
>>> to small pages.
>>>
>>> Mick
>>>     
>>
>> Do you disagree with my assertion that use of 2MB pages is
>> almost always an attempt to eke out a performance improvement,
>> that emulating 2MB pages with fragmented 4KB pages is likely
>> slower than just using 4KB pages to start with, and thus
>> that "must always be prepared to function with 4KB pages"
>> should NOT occur silently (if at all)?
>>   
> I agree with the first statement. I'm not sure what you mean by 
> "emulate 2MB pages with fragmented 4K pages" unless you assume nested 
> page table support or you just mean falling back to 4K pages. As for 
> whether a change should be silent, I'm less clear on that. I certainly 
> wouldn't consider it a fatal condition requiring domain termination. 
> That position is consistent with the "optimization not correctness" 
> view of using large tables. However, a guest might want to indicate in 
> some way that it has downgraded.

The tradeoff is between the performance gain one might get from using 
large pages vs the intrusiveness of changes to a PV kernel.  Given that 
when paravirtualizing this we're going to be making small changes to the 
kernel's existing large page support, rather than adding a new or 
separate large-page mechanism, we need to make sure that as many of the 
guest's existing assumptions as possible can be satisfied.

The requirement that a guest be able to come up with enough L1 pagetable 
pages to be able to map all the shattered 2M mappings at any time 
definitely doesn't fall into that category.  You'd need to:

   1. Have an interface for Xen to tell the guest which pages need to be
      remapped.  Presumably this would be in terms of once-contiguous
      pfn ranges which are now backed with discontiguous mfns.
   2. Get the guest to remap those pfns to the new mfns, which will
      require walking every pagetable of every process searching for
      those pfns, and allocating memory for the new pagetable level.
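
To put a rough number on that reserve: on x86-64 (or PAE), a 2M mapping
covers 512 4k pages, and one L1 pagetable page holds 512 8-byte entries,
so shattering each 2M mapping costs one L1 page.  A minimal sketch of the
arithmetic and of filling one L1 page, assuming those entry sizes; the
helper names and the flags argument are hypothetical, not an existing
Xen or Linux interface:

```c
#include <stdint.h>

enum {
    PAGE_SHIFT      = 12,   /* 4k base pages */
    SUPERPAGE_SHIFT = 21,   /* 2M superpages */
    PTES_PER_L1     = 1 << (SUPERPAGE_SHIFT - PAGE_SHIFT)   /* 512 */
};

/* Hypothetical helper: L1 pagetable pages needed to shatter
 * nr_superpages 2M mappings (one L1 page per 2M mapping). */
static inline uint64_t l1_pages_to_shatter(uint64_t nr_superpages)
{
    return nr_superpages;
}

/* Memory the guest would have to hold in reserve for those L1 pages. */
static inline uint64_t l1_reserve_bytes(uint64_t nr_superpages)
{
    return l1_pages_to_shatter(nr_superpages) << PAGE_SHIFT;
}

/* Hypothetical helper: build the 512 4k entries that replace one 2M
 * mapping, given the (possibly discontiguous) mfns now backing it.
 * Assumes the usual x86 layout: frame number above bit 12, flags below. */
static void fill_l1_from_mfns(uint64_t l1[PTES_PER_L1],
                              const uint64_t mfns[PTES_PER_L1],
                              uint64_t flags)
{
    for (int i = 0; i < PTES_PER_L1; i++)
        l1[i] = (mfns[i] << PAGE_SHIFT) | flags;
}
```

Under these assumptions, a guest mapping 1GB with 2M pages would need to
keep 512 L1 pages (2MB) in reserve, before counting the cost of the
pagetable walk in step 2.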

However the main use of 2M mappings in Linux is to map the kernel text 
and data.  That's clearly not going to be possible if we need to run 
kernel code to put things together after a restore.  Hm, given that, I 
guess we could just kludge it into hugetlbfs, but it really does make it 
a very narrow set of users.

>> BTW, thinking ahead to ballooning with 2MB pages, are we prepared
>> to assume that a relinquished 2MB page can't be fragmented?
>> While this may be appealing for systems where nearly all
>> guests are using 2MB pages, systems where the 2MB guest is
>> an odd duck might suffer substantially by making that
>> assumption.
>>   
> Agreed. All of this really only becomes an issue when memory is 
> overcommitted. Unfortunately, that is precisely when 2MB machine 
> contiguous pages are likely to be difficult to find.

If 2M pages are becoming more important, then we should change Xen to do 
all domain allocations in 2M units, while reserving separate superpages 
specifically for fragmenting into 4k allocations.  It's certainly 
sensible to always round a domain's initial size up to 2M (most will 
already be a 2M multiple, I suspect).  Ballooning is the obvious exception, 
but I would argue that ballooning in less than 2M units is a lot of 
fiddly makework.  The difference between giving a domain 128MB vs 
126MB is already pretty trivial; a 4k change in domain size 
is laughably small.
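
The rounding involved is simple.  A minimal sketch, assuming 4k base
pages and 2M superpages; the function names are hypothetical and not
taken from Xen's allocator or the balloon driver:

```c
#include <stdint.h>

#define PAGE_SHIFT      12
#define SUPERPAGE_SHIFT 21
#define SUPERPAGE_SIZE  (UINT64_C(1) << SUPERPAGE_SHIFT)

/* Round a domain size in bytes up to the next 2M multiple. */
static inline uint64_t round_up_to_superpage(uint64_t bytes)
{
    return (bytes + SUPERPAGE_SIZE - 1) & ~(SUPERPAGE_SIZE - 1);
}

/* Whole 2M units contained in a balloon request expressed in 4k pages;
 * any remainder smaller than 2M is simply not ballooned. */
static inline uint64_t balloon_superpages(uint64_t nr_4k_pages)
{
    return nr_4k_pages >> (SUPERPAGE_SHIFT - PAGE_SHIFT);
}
```

With this, a 126MB domain stays at 126MB (already a 2M multiple), while
126MB plus a single 4k page rounds up to 128MB.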

(Now Keir brings up all the difficulties...)

    J
