From: Wei Huang <wei.huang2@amd.com>
To: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>,
xen-devel@lists.xensource.com, keir.fraser@eu.citrix.com,
Tim Deegan <Tim.Deegan@citrix.com>
Subject: Re: [RFC][Patches] Xen 1GB Page Table Support
Date: Thu, 19 Mar 2009 13:51:58 -0500
Message-ID: <49C2944E.90508@amd.com>
In-Reply-To: <ca3b7d14-02b5-4f93-a559-5e8a532810e1@default>
Dan,

Thanks for your comments. I am not sure which splintering overhead you
are referring to. I can think of three areas:

1. Splintering in page allocation
In this case, Xen fails to allocate the requested page order, so it
falls back to smaller pages to set up the p2m table. The overhead is
O(guest_mem_size), which is a one-time cost.

2. P2M splits large pages into smaller pages
This is one-directional because we don't merge smaller pages back into
large ones. The worst case is splitting all guest large pages, so the
overhead is O(total_large_page_mem). In the long run, the overhead will
converge to 0 because it is one-directional. Note that this overhead
also covers the case where the PoD feature is enabled.

3. CPU splintering
If the CPU does not support 1GB pages, it automatically splinters them
into smaller ones (such as 2MB). In this case, the overhead is always
there. But 1) this only happens on a small number of old chips; 2) I
believe that it is still faster than 4K pages. CPUID (1gb feature and
1gb TLB entries) can be used to detect and stop this problem, if we
don't really like it.
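
The CPUID check mentioned in item 3 could look roughly like the sketch
below. It relies on the documented x86 convention that CPUID leaf
0x80000001 reports 1GB page support in EDX bit 26 (Page1GB on AMD,
PDPE1GB on Intel). The helper names cpu_has_page1gb() and
max_superpage_order() are made up for illustration and are not Xen
functions. Checking the number of 1GB TLB entries would need a further
query (leaf 0x80000019 on AMD), which is not shown here.

```c
#include <stdint.h>

/* CPUID leaf that carries the extended feature flags. */
#define CPUID_EXT_FEATURES_LEAF 0x80000001u
/* EDX bit 26 of that leaf: 1GB page support (Page1GB / PDPE1GB). */
#define CPUID_EDX_PAGE1GB_BIT   26

/*
 * Hypothetical helper (not a Xen function): returns 1 if the EDX value
 * read from CPUID leaf 0x80000001 advertises 1GB pages, 0 otherwise.
 * With GCC or Clang on x86, EDX can be read via __get_cpuid() from
 * <cpuid.h>; it is passed in here so the logic stays portable.
 */
static int cpu_has_page1gb(uint32_t ext_features_edx)
{
    return (int)((ext_features_edx >> CPUID_EDX_PAGE1GB_BIT) & 1u);
}

/*
 * Hypothetical policy helper: pick the largest page order to request
 * for a guest. Orders count 4KB pages: 18 = 1GB, 9 = 2MB.
 */
static unsigned int max_superpage_order(uint32_t ext_features_edx)
{
    return cpu_has_page1gb(ext_features_edx) ? 18u : 9u;
}
```

With a check like this, 1GB mappings would only be requested when the
feature bit is set, and 2MB would remain the largest order otherwise.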

I agree with your concerns. Customers should have the right to make
their own decision. But that requires the new feature to be enabled in
the first place. For a lot of benchmarks, splintering overhead can be
offset by the benefits of huge pages. SPECJBB is a good example of using
large pages (see Ben Serebrin's presentation at Xen Summit). With that
said, I agree with the idea of adding a new option to the guest
configuration file.

-Wei

Dan Magenheimer wrote:
> I'd like to reiterate my argument raised in a previous
> discussion of hugepages: Just because this CAN be made
> to work, doesn't imply that it SHOULD be made to work.
> Real users use larger pages in their OS for the sole
> reason that they expect a performance improvement.
> If it magically works, but works slow (and possibly
> slower than if the OS had just used small pages to
> start with), this is likely to lead to unsatisfied
> customers, and perhaps allegations such as "Xen sucks
> when running databases".
>
> So, please, let's think this through before implementing
> it just because we can. At a minimum, an administrator
> should be somehow warned if large pages are getting splintered.
>
> And if it's going in over my objection, please tie it to
> a boot option that defaults off so administrator action
> is required to allow silent splintering.
>
> My two cents...
> Dan
>
>> -----Original Message-----
>> From: Huang2, Wei [mailto:Wei.Huang2@amd.com]
>> Sent: Thursday, March 19, 2009 2:07 AM
>> To: George Dunlap
>> Cc: xen-devel@lists.xensource.com; keir.fraser@eu.citrix.com;
>> Tim Deegan
>> Subject: RE: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
>>
>>
>> Here are patches using the middle approach. It handles 1GB pages in PoD
>> by remapping 1GB with 2MB pages & retry. I also added code for 1GB
>> detection. Please comment.
>>
>> Thanks a lot,
>>
>> -Wei
>>
>> -----Original Message-----
>> From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of George
>> Dunlap
>> Sent: Wednesday, March 18, 2009 12:20 PM
>> To: Huang2, Wei
>> Cc: xen-devel@lists.xensource.com; keir.fraser@eu.citrix.com;
>> Tim Deegan
>> Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
>>
>> Thanks for doing this work, Wei -- especially all the extra effort for
>> the PoD integration.
>>
>> One question: How well would you say you've tested the PoD
>> functionality? Or to put it the other way, how much do I need to
>> prioritize testing this before the 3.4 release?
>>
>> It wouldn't be a bad idea to do as you suggested, and break things
>> into 2 meg pages for the PoD case. In order to take the best
>> advantage of this in a PoD scenario, you'd need to have a balloon
>> driver that could allocate 1G of continuous *guest* p2m space, which
>> seems a bit optimistic at this point...
>>
>> -George
>>
>> 2009/3/18 Huang2, Wei <Wei.Huang2@amd.com>:
>>> Current Xen supports 2MB super pages for NPT/EPT. The attached patches
>>> extend this feature to support 1GB pages. The PoD (populate-on-demand)
>>> introduced by George Dunlap made P2M modification harder. I tried to
>>> preserve existing PoD design by introducing a 1GB PoD cache list.
>>>
>>> Note that 1GB PoD can be dropped if we don't care about 1GB when PoD is
>>> enabled. In this case, we can just split 1GB PDPE into 512x2MB PDE entries
>>> and grab pages from PoD super list. That can pretty much make
>>> 1gb_p2m_pod.patch go away.
>>>
>>>
>>>
>>> Any comment/suggestion on design idea will be appreciated.
>>>
>>>
>>>
>>> Thanks,
>>>
>>>
>>>
>>> -Wei
>>>
>>>
>>>
>>>
>>>
>>> The following is the description:
>>>
>>> === 1gb_tools.patch ===
>>>
>>> Extend existing setup_guest() function. Basically, it tries to allocate 1GB
>>> pages whenever available. If this request fails, it falls back to 2MB. If
>>> both fail, then 4KB pages will be used.
>>>
>>>
>>>
>>> === 1gb_p2m.patch ===
>>>
>>> * p2m_next_level()
>>>
>>> Check PSE bit of L3 page table entry. If 1GB is found (PSE=1), we split 1GB
>>> into 512 2MB pages.
>>>
>>>
>>>
>>> * p2m_set_entry()
>>>
>>> Configure the PSE bit of L3 P2M table if page order == 18 (1GB).
>>>
>>>
>>>
>>> * p2m_gfn_to_mfn()
>>>
>>> Add support for 1GB case when doing gfn to mfn translation. When L3 entry is
>>> marked as POPULATE_ON_DEMAND, we call p2m_pod_demand_populate(). Otherwise,
>>> we do the regular address translation (gfn ==> mfn).
>>>
>>>
>>>
>>> * p2m_gfn_to_mfn_current()
>>>
>>> This is similar to p2m_gfn_to_mfn(). When L3 entry is marked as
>>> POPULATE_ON_DEMAND, it demands a populate using p2m_pod_demand_populate().
>>> Otherwise, it does a normal translation. 1GB page is taken into
>>> consideration.
>>>
>>>
>>>
>>> * set_p2m_entry()
>>>
>>> Request 1GB page
>>>
>>>
>>>
>>> * audit_p2m()
>>>
>>> Support 1GB while auditing p2m table.
>>>
>>>
>>>
>>> * p2m_change_type_global()
>>>
>>> Deal with 1GB page when changing global page type.
>>>
>>>
>>>
>>> === 1gb_p2m_pod.patch ===
>>>
>>> * xen/include/asm-x86/p2m.h
>>>
>>> Minor change to deal with PoD. It separates super page cache list into 2MB
>>> and 1GB lists. Similarly, we record last gpfn of sweeping for both 2MB and
>>> 1GB.
>>>
>>>
>>>
>>> * p2m_pod_cache_add()
>>>
>>> Check page order and add 1GB super page into PoD 1GB cache list.
>>>
>>>
>>>
>>> * p2m_pod_cache_get()
>>>
>>> Grab a page from cache list. It tries to break 1GB page into 512 2MB pages
>>> if 2MB PoD list is empty. Similarly, 4KB can be requested from super pages.
>>> The breaking order is 2MB then 1GB.
>>>
>>>
>>>
>>> * p2m_pod_cache_target()
>>>
>>> This function is used to set PoD cache size. To increase PoD target, we try
>>> to allocate 1GB from xen domheap. If this fails, we try 2MB. If both fail,
>>> we try 4KB which is guaranteed to work.
>>>
>>>
>>>
>>> To decrease the target, we use a similar approach. We first try to free 1GB
>>> pages from 1GB PoD cache list. If such request fails, we try 2MB PoD cache
>>> list. If both fail, we try 4KB list.
>>>
>>>
>>>
>>> * p2m_pod_zero_check_superpage_1gb()
>>>
>>> This adds a new function to check for 1GB page. This function is similar to
>>> p2m_pod_zero_check_superpage_2mb().
>>>
>>>
>>>
>>> * p2m_pod_zero_check_superpage_1gb()
>>>
>>> We add a new function to sweep 1GB page from guest memory. This is the same
>>> as p2m_pod_zero_check_superpage_2mb().
>>>
>>>
>>>
>>> * p2m_pod_demand_populate()
>>>
>>> The trick of this function is to do remap_and_retry if p2m_pod_cache_get()
>>> fails. In that case, this function splits the p2m table entry
>>> into smaller ones (e.g. 1GB ==> 2MB or 2MB ==> 4KB). That can guarantee
>>> populate demands always work.
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@lists.xensource.com
>>> http://lists.xensource.com/xen-devel
>>>
>>>
>>
>