From: Wei Huang
Subject: Re: [RFC][Patches] Xen 1GB Page Table Support
Date: Fri, 20 Mar 2009 08:56:25 -0500
Message-ID: <49C3A089.3070609@amd.com>
To: Dan Magenheimer
Cc: George Dunlap, Tim Deegan, xen-devel@lists.xensource.com, Keir Fraser
List-Id: xen-devel@lists.xenproject.org

Dan,

I agree on the order: A > C >= B > D. Generally, super pages should
perform better than small pages. In reality, the difference between
B & C is subtle. It depends on how the TLB is designed and whether
TLB flushes happen frequently.

-Wei

Dan Magenheimer wrote:
> Interesting. And non-intuitive. I think you are saying
> that, at least theoretically (and using your ABCD, not
> my ABC below), A is always faster than (B | C), and
> (B | C) is always faster than D. Taking into
> account the fact that the TLB size is fixed (I think),
> C will always be faster than B and never slower than D.
>
> So if the theory proves true, that does seem to eliminate
> my objection.
>
> Thanks,
> Dan
>
>> -----Original Message-----
>> From: George Dunlap [mailto:george.dunlap@eu.citrix.com]
>> Sent: Friday, March 20, 2009 3:46 AM
>> To: Dan Magenheimer
>> Cc: Wei Huang; xen-devel@lists.xensource.com; Keir Fraser; Tim Deegan
>> Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
>>
>> Dan,
>>
>> Don't forget that this is about the p2m table, which is (if I
>> understand correctly) orthogonal to what the guest pagetables are doing.
>> So the scenario, if HAP is used, would be:
>>
>> A) DB code uses 2MB pages in guest PTs, OS assumes 2MB pages,
>>    guest PTs use 2MB pages, P2M uses 2MB pages
>>    - A tlb miss requires 3 * 3 = 9 reads (Assuming 64-bit guest)
>> B) DB code uses 2MB pages, OS uses 2MB pages, p2m uses 4K pages
>>    - A tlb miss requires 3 * 4 = 12 reads
>> C) DB code uses 4k pages, OS uses 4k pages, p2m uses 2MB pages
>>    - A tlb miss requires 4 * 3 = 12 reads
>> D) DB code uses 4k pages, OS uses 4k pages, p2m uses 4k pages
>>    - A tlb miss requires 4 * 4 = 16 reads
>>
>> And adding the 1G p2m entries will change the multiplier from 3 to 2
>> (i.e., 3*2 = 6 reads for superpages, 4*2 = 8 reads for 4k
>> guest pages).
>>
>> (Those who are more familiar with the hardware, please correct me if
>> I've made some mistakes or oversimplified things.)
>>
>> So adding 1G pages to the p2m table shouldn't change expectations of
>> the guest OS in any case. Using it will benefit the guest to the same
>> degree whether the guest is using 4k, 2MB, or 1G pages. (If I
>> understand correctly.)
>>
>> -George
>>
>> Dan Magenheimer wrote:
>>> Hi Wei --
>>>
>>> I'm not worried about the overhead of the splintering, I'm
>>> worried about the "hidden overhead" every time a "silent
>>> splinter" is used.
>>>
>>> Let's assume three scenarios (and for now use 2MB pages though
>>> the same concerns can be extended to 1GB and/or mixed 2MB/1GB):
>>>
>>> A) DB code assumes 2MB pages, OS assumes 2MB pages, Xen provides
>>>    only 2MB pages (no splintering occurs)
>>> B) DB code assumes 2MB pages, OS assumes 2MB pages, Xen provides
>>>    only 4KB pages (because of fragmentation, all 2MB pages have
>>>    been splintered)
>>> C) DB code assumes 4KB pages, OS assumes 4KB pages, Xen provides
>>>    4KB pages
>>>
>>> Now run some benchmarks. Clearly one would assume that A is
>>> faster than both B and C. The question is: Is B faster or slower
>>> than C?
>>>
>>> If B is always faster than C, then I have less objection to
>>> "silent splintering". But if B is sometimes (or maybe always?)
>>> slower than C, that's a big issue because a user has gone through
>>> the effort of choosing a better-performing system configuration
>>> for their software (2MB DB on 2MB OS), but it actually performs
>>> worse than if they had chosen the "lower performing" configuration.
>>> And, worse, it will likely degrade over time so performance
>>> might be fine when the 2MB-DB-on-2MB-OS guest is launched
>>> but get much worse when it is paused, save/restored, migrated,
>>> or hot-failed. So even if B is only slightly faster than C,
>>> if B is much slower than A, this is a problem.
>>>
>>> Does that make sense?
>>>
>>> Some suggestions:
>>> 1) If it is possible for an administrator to determine how many
>>>    large pages (both 2MB and 1GB) were requested by each domain
>>>    and how many are currently whole-vs-splintered, that would help.
>>> 2) We may need some form of memory defragmenter
>>>
>>>> -----Original Message-----
>>>> From: Wei Huang [mailto:wei.huang2@amd.com]
>>>> Sent: Thursday, March 19, 2009 12:52 PM
>>>> To: Dan Magenheimer
>>>> Cc: George Dunlap; xen-devel@lists.xensource.com;
>>>> keir.fraser@eu.citrix.com; Tim Deegan
>>>> Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
>>>>
>>>> Dan,
>>>>
>>>> Thanks for your comments. I am not sure about which splintering
>>>> overhead you are referring to. I can think of three areas:
>>>>
>>>> 1. splintering in page allocation
>>>> In this case, Xen fails to allocate the requested page order. So it
>>>> falls back to smaller pages to set up the p2m table. The overhead is
>>>> O(guest_mem_size), which is a one-time deal.
>>>>
>>>> 2. P2M splits large page into smaller pages
>>>> This is one-directional because we don't merge smaller pages into
>>>> large ones. The worst case is to split all guest large pages.
>>>> So the overhead is O(total_large_page_mem). In the long run, the
>>>> overhead will converge to 0 because it is one-directional. Note this
>>>> overhead also covers the case where the PoD feature is enabled.
>>>>
>>>> 3. CPU splintering
>>>> If the CPU does not support 1GB pages, it automatically does
>>>> splintering using smaller ones (such as 2MB). In this case, the
>>>> overhead is always there. But 1) this only happens on a small number
>>>> of old chips; 2) I believe that it is still faster than 4K pages.
>>>> CPUID (1gb feature and 1gb TLB entries) can be used to detect and
>>>> stop this problem, if we don't really like it.
>>>>
>>>> I agree with your concerns. Customers should have the right to make
>>>> their own decision. But that requires the new feature to be enabled
>>>> in the first place. For a lot of benchmarks, splintering overhead can
>>>> be offset by the benefits of huge pages. SPECJBB is a good example of
>>>> using large pages (see Ben Serebrin's presentation in Xen Summit).
>>>> With that said, I agree with the idea of adding a new option in the
>>>> guest configuration file.
>>>>
>>>> -Wei
>>>>
>>>> Dan Magenheimer wrote:
>>>>> I'd like to reiterate my argument raised in a previous
>>>>> discussion of hugepages: Just because this CAN be made
>>>>> to work, doesn't imply that it SHOULD be made to work.
>>>>> Real users use larger pages in their OS for the sole
>>>>> reason that they expect a performance improvement.
>>>>> If it magically works, but works slowly (and possibly
>>>>> slower than if the OS had just used small pages to
>>>>> start with), this is likely to lead to unsatisfied
>>>>> customers, and perhaps allegations such as "Xen sucks
>>>>> when running databases".
>>>>>
>>>>> So, please, let's think this through before implementing
>>>>> it just because we can. At a minimum, an administrator
>>>>> should be somehow warned if large pages are getting splintered.
>>>>>
>>>>> And if it's going in over my objection, please tie it to
>>>>> a boot option that defaults off so administrator action
>>>>> is required to allow silent splintering.
>>>>>
>>>>> My two cents...
>>>>> Dan
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Huang2, Wei [mailto:Wei.Huang2@amd.com]
>>>>>> Sent: Thursday, March 19, 2009 2:07 AM
>>>>>> To: George Dunlap
>>>>>> Cc: xen-devel@lists.xensource.com; keir.fraser@eu.citrix.com;
>>>>>> Tim Deegan
>>>>>> Subject: RE: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
>>>>>>
>>>>>> Here are patches using the middle approach. It handles 1GB
>>>>>> pages in PoD by remapping 1GB with 2MB pages & retry. I also
>>>>>> added code for 1GB detection. Please comment.
>>>>>>
>>>>>> Thanks a lot,
>>>>>>
>>>>>> -Wei
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of
>>>>>> George Dunlap
>>>>>> Sent: Wednesday, March 18, 2009 12:20 PM
>>>>>> To: Huang2, Wei
>>>>>> Cc: xen-devel@lists.xensource.com; keir.fraser@eu.citrix.com;
>>>>>> Tim Deegan
>>>>>> Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
>>>>>>
>>>>>> Thanks for doing this work, Wei -- especially all the extra
>>>>>> effort for the PoD integration.
>>>>>>
>>>>>> One question: How well would you say you've tested the PoD
>>>>>> functionality? Or to put it the other way, how much do I need to
>>>>>> prioritize testing this before the 3.4 release?
>>>>>>
>>>>>> It wouldn't be a bad idea to do as you suggested, and break things
>>>>>> into 2 meg pages for the PoD case. In order to take the best
>>>>>> advantage of this in a PoD scenario, you'd need to have a balloon
>>>>>> driver that could allocate 1G of contiguous *guest* p2m space,
>>>>>> which seems a bit optimistic at this point...
>>>>>>
>>>>>> -George
>>>>>>
>>>>>> 2009/3/18 Huang2, Wei:
>>>>>>> Current Xen supports 2MB super pages for NPT/EPT.
>>>>>>> The attached patches extend this feature to support 1GB pages.
>>>>>>> The PoD (populate-on-demand) introduced by George Dunlap made P2M
>>>>>>> modification harder. I tried to preserve the existing PoD design
>>>>>>> by introducing a 1GB PoD cache list.
>>>>>>>
>>>>>>> Note that 1GB PoD can be dropped if we don't care about 1GB when
>>>>>>> PoD is enabled. In this case, we can just split the 1GB PDPE into
>>>>>>> 512x2MB PDE entries and grab pages from the PoD super list. That
>>>>>>> can pretty much make 1gb_p2m_pod.patch go away.
>>>>>>>
>>>>>>> Any comment/suggestion on the design idea will be appreciated.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> -Wei
>>>>>>>
>>>>>>> The following is the description:
>>>>>>>
>>>>>>> === 1gb_tools.patch ===
>>>>>>>
>>>>>>> Extend the existing setup_guest() function. Basically, it tries to
>>>>>>> allocate 1GB pages whenever available. If this request fails, it
>>>>>>> falls back to 2MB. If both fail, then 4KB pages will be used.
>>>>>>>
>>>>>>> === 1gb_p2m.patch ===
>>>>>>>
>>>>>>> * p2m_next_level()
>>>>>>>
>>>>>>> Check the PSE bit of the L3 page table entry. If a 1GB page is
>>>>>>> found (PSE=1), we split the 1GB page into 512 2MB pages.
>>>>>>>
>>>>>>> * p2m_set_entry()
>>>>>>>
>>>>>>> Configure the PSE bit of the L3 P2M table if page order == 18 (1GB).
>>>>>>>
>>>>>>> * p2m_gfn_to_mfn()
>>>>>>>
>>>>>>> Add support for the 1GB case when doing gfn to mfn translation.
>>>>>>> When the L3 entry is marked as POPULATE_ON_DEMAND, we call
>>>>>>> p2m_pod_demand_populate(). Otherwise, we do the regular address
>>>>>>> translation (gfn ==> mfn).
>>>>>>>
>>>>>>> * p2m_gfn_to_mfn_current()
>>>>>>>
>>>>>>> This is similar to p2m_gfn_to_mfn(). When the L3 entry is marked
>>>>>>> as POPULATE_ON_DEMAND, it demands a populate using
>>>>>>> p2m_pod_demand_populate(). Otherwise, it does a normal translation.
>>>>>>> 1GB pages are taken into consideration.
>>>>>>>
>>>>>>> * set_p2m_entry()
>>>>>>>
>>>>>>> Request 1GB page
>>>>>>>
>>>>>>> * audit_p2m()
>>>>>>>
>>>>>>> Support 1GB while auditing the p2m table.
>>>>>>>
>>>>>>> * p2m_change_type_global()
>>>>>>>
>>>>>>> Deal with 1GB pages when changing the global page type.
>>>>>>>
>>>>>>> === 1gb_p2m_pod.patch ===
>>>>>>>
>>>>>>> * xen/include/asm-x86/p2m.h
>>>>>>>
>>>>>>> Minor change to deal with PoD. It separates the super page cache
>>>>>>> list into 2MB and 1GB lists. Similarly, we record the last gpfn of
>>>>>>> sweeping for both 2MB and 1GB.
>>>>>>>
>>>>>>> * p2m_pod_cache_add()
>>>>>>>
>>>>>>> Check page order and add a 1GB super page to the PoD 1GB cache list.
>>>>>>>
>>>>>>> * p2m_pod_cache_get()
>>>>>>>
>>>>>>> Grab a page from the cache list. It tries to break a 1GB page into
>>>>>>> 512 2MB pages if the 2MB PoD list is empty. Similarly, 4KB can be
>>>>>>> requested from super pages. The breaking order is 2MB then 1GB.
>>>>>>>
>>>>>>> * p2m_pod_cache_target()
>>>>>>>
>>>>>>> This function is used to set the PoD cache size. To increase the
>>>>>>> PoD target, we try to allocate 1GB from the xen domheap. If this
>>>>>>> fails, we try 2MB. If both fail, we try 4KB, which is guaranteed
>>>>>>> to work.
>>>>>>>
>>>>>>> To decrease the target, we use a similar approach. We first try to
>>>>>>> free 1GB pages from the 1GB PoD cache list.
>>>>>>> If such a request fails, we try the 2MB PoD cache list. If both
>>>>>>> fail, we try the 4KB list.
>>>>>>>
>>>>>>> * p2m_pod_zero_check_superpage_1gb()
>>>>>>>
>>>>>>> This adds a new function to check for a 1GB page. This function is
>>>>>>> similar to p2m_pod_zero_check_superpage_2mb().
>>>>>>>
>>>>>>> * p2m_pod_zero_check_superpage_1gb()
>>>>>>>
>>>>>>> We add a new function to sweep 1GB pages from guest memory. This is
>>>>>>> the same as p2m_pod_zero_check_superpage_2mb().
>>>>>>>
>>>>>>> * p2m_pod_demand_populate()
>>>>>>>
>>>>>>> The trick of this function is to do remap_and_retry if
>>>>>>> p2m_pod_cache_get() fails. When p2m_pod_cache_get() fails, this
>>>>>>> function will split the p2m table entry into smaller ones (e.g.
>>>>>>> 1GB ==> 2MB or 2MB ==> 4KB). That can guarantee populate demands
>>>>>>> always work.
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Xen-devel mailing list
>>>>>>> Xen-devel@lists.xensource.com
>>>>>>> http://lists.xensource.com/xen-devel
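
For reference, the read counts George lists for scenarios A) to D) earlier
in this thread can be reproduced with a few lines of Python. This is only a
sketch of his simplified model (reads per TLB miss = guest walk levels times
p2m walk levels, 64-bit guest); the names WALK_LEVELS and tlb_miss_reads are
made up for illustration and are not Xen code. As George and Wei both note,
real hardware behaviour also depends on TLB design and flush frequency,
which this ignores.

```python
# Page-table levels walked per translation under the thread's simplified
# model: 4-level paging, where a superpage mapping ends the walk early.
# (Assumption: these level counts follow George's 4/3/2 multipliers.)
WALK_LEVELS = {"4K": 4, "2M": 3, "1G": 2}


def tlb_miss_reads(guest_page: str, p2m_page: str) -> int:
    """Reads per TLB miss = guest walk levels * p2m walk levels."""
    return WALK_LEVELS[guest_page] * WALK_LEVELS[p2m_page]


if __name__ == "__main__":
    scenarios = [
        ("A", "2M", "2M"),
        ("B", "2M", "4K"),
        ("C", "4K", "2M"),
        ("D", "4K", "4K"),
    ]
    for name, guest, p2m in scenarios:
        reads = tlb_miss_reads(guest, p2m)
        print(f"{name}: guest={guest} p2m={p2m} reads={reads}")
    # Effect of 1G p2m entries on each guest page size.
    for guest in ("2M", "4K"):
        print(f"guest={guest} p2m=1G reads={tlb_miss_reads(guest, '1G')}")
```

B and C come out equal in this model, which is consistent with Wei's point
that the difference between them is subtle and hardware dependent.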