From mboxrd@z Thu Jan  1 00:00:00 1970
From: Wei Huang <wei.huang2@amd.com>
Subject: Re: [RFC][Patches] Xen 1GB Page Table Support
Date: Fri, 20 Mar 2009 08:56:25 -0500
Message-ID: <49C3A089.3070609@amd.com>
References: <e2ee45f3-17c7-405a-a076-9b6d9d70a13e@default>
Mime-Version: 1.0
Content-Type: text/plain; charset="windows-1252"; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <e2ee45f3-17c7-405a-a076-9b6d9d70a13e@default>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: George Dunlap <george.dunlap@eu.citrix.com>, Tim Deegan <Tim.Deegan@eu.citrix.com>, xen-devel@lists.xensource.com, Keir Fraser <Keir.Fraser@eu.citrix.com>
List-Id: xen-devel@lists.xenproject.org

Dan,

I agree on the order: A > C >= B > D. Generally, superpages should
perform better than small pages.

In reality, the difference between B and C is subtle. It depends on how
the TLB is designed and on how frequently TLB flushes happen.
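
For reference, here is a minimal sketch of the read counts George gives
in the quoted message below. It only reproduces his simplified
guest-levels times p2m-levels model; the names LEVELS and
reads_per_tlb_miss are made up for illustration, and the model ignores
page-walk caches and TLB behaviour, which is exactly where B and C can
differ in practice.

```python
# Levels walked per table under the simplified model from the thread:
# with 4-level (64-bit) tables, 4KB pages need 4 reads, 2MB pages 3, 1GB pages 2.
LEVELS = {"4K": 4, "2M": 3, "1G": 2}


def reads_per_tlb_miss(guest_page: str, p2m_page: str) -> int:
    """Memory reads per TLB miss, assuming guest_levels * p2m_levels."""
    return LEVELS[guest_page] * LEVELS[p2m_page]


# Scenarios A-D as (guest page size, p2m page size).
scenarios = {
    "A": ("2M", "2M"),
    "B": ("2M", "4K"),
    "C": ("4K", "2M"),
    "D": ("4K", "4K"),
}

for name, (guest, p2m) in scenarios.items():
    reads = reads_per_tlb_miss(guest, p2m)
    print(f"{name}: guest={guest} p2m={p2m} -> {reads} reads")
```

Under this model B and C tie, so any real difference between them has
to come from the TLB itself rather than from the walk length.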

-Wei

Dan Magenheimer wrote:
> Interesting.  And non-intuitive.  I think you are saying
> that, at least theoretically (and using your ABCD, not
> my ABC below), A is always faster than
> (B | C), and (B | C) is always faster than D.  Taking into
> account the fact that the TLB size is fixed (I think),
> C will always be faster than B and never slower than D.
> 
> So if the theory proves true, that does seem to eliminate
> my objection.
> 
> Thanks,
> Dan
> 
>> -----Original Message-----
>> From: George Dunlap [mailto:george.dunlap@eu.citrix.com]
>> Sent: Friday, March 20, 2009 3:46 AM
>> To: Dan Magenheimer
>> Cc: Wei Huang; xen-devel@lists.xensource.com; Keir Fraser; Tim Deegan
>> Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
>>
>>
>> Dan,
>>
>> Don't forget that this is about the p2m table, which is (if I understand
>> correctly) orthogonal to what the guest pagetables are doing.  So the
>> scenario, if HAP is used, would be:
>>
>> A) DB code uses 2MB pages in guest PTs, OS assumes 2MB pages, guest PTs
>>    use 2MB pages, P2M uses 2MB pages
>>  - A tlb miss requires 3 * 3 = 9 reads (Assuming 64-bit guest)
>> B) DB code uses 2MB pages, OS uses 2MB pages, p2m uses 4K pages
>>  - A tlb miss requires 3 * 4 = 12 reads
>> C) DB code uses 4k pages, OS uses 4k pages, p2m uses 2MB pages
>>  - A tlb miss requires 4 * 3 = 12 reads
>> D) DB code uses 4k pages, OS uses 4k pages, p2m uses 4k pages
>>  - A tlb miss requires 4 * 4 = 16 reads
>>
>> And adding the 1G p2m entries will change the multiplier from 3 to 2
>> (i.e., 3*2 = 6 reads for superpages, 4*2 = 8 reads for 4k guest pages).
>>
>> (Those who are more familiar with the hardware, please correct me if 
>> I've made some mistakes or oversimplified things.)
>>
>> So adding 1G pages to the p2m table shouldn't change expectations of the
>> guest OS in any case.  Using it will benefit the guest to the same
>> degree whether the guest is using 4k, 2MB, or 1G pages. (If I understand
>> correctly.)
>>
>>  -George
>>
>> Dan Magenheimer wrote:
>>> Hi Wei --
>>>
>>> I'm not worried about the overhead of the splintering, I'm
>>> worried about the "hidden overhead" every time a "silent
>>> splinter" is used.
>>>
>>> Let's assume three scenarios (and for now use 2MB pages though
>>> the same concerns can be extended to 1GB and/or mixed 2MB/1GB):
>>>
>>> A) DB code assumes 2MB pages, OS assumes 2MB pages, Xen provides
>>>    only 2MB pages (no splintering occurs)
>>> B) DB code assumes 2MB pages, OS assumes 2MB pages, Xen provides
>>>    only 4KB pages (because of fragmentation, all 2MB pages have
>>>    been splintered)
>>> C) DB code assumes 4KB pages, OS assumes 4KB pages, Xen provides
>>>    4KB pages
>>>
>>> Now run some benchmarks.  Clearly one would assume that A is
>>> faster than both B and C.  The question is: Is B faster or slower
>>> than C?
>>>
>>> If B is always faster than C, then I have less objection to
>>> "silent splintering".  But if B is sometimes (or maybe always?)
>>> slower than C, that's a big issue because a user has gone through
>>> the effort of choosing a better-performing system configuration
>>> for their software (2MB DB on 2MB OS), but it actually performs
>>> worse than if they had chosen the "lower performing" configuration.
>>> And, worse, it will likely degrade across time so performance
>>> might be fine when the 2MB-DB-on-2MB-OS guest is launched
>>> but get much worse when it is paused, save/restored, migrated,
>>> or hot-failed.  So even if B is only slightly faster than C,
>>> if B is much slower than A, this is a problem.
>>>
>>> Does that make sense?
>>>
>>> Some suggestions:
>>> 1) If it is possible for an administrator to determine how many
>>>    large pages (both 2MB and 1GB) were requested by each domain
>>>    and how many are currently whole-vs-splintered, that would help.
>>> 2) We may need some form of memory defragmenter
>>>
>>>   
>>>> -----Original Message-----
>>>> From: Wei Huang [mailto:wei.huang2@amd.com]
>>>> Sent: Thursday, March 19, 2009 12:52 PM
>>>> To: Dan Magenheimer
>>>> Cc: George Dunlap; xen-devel@lists.xensource.com;
>>>> keir.fraser@eu.citrix.com; Tim Deegan
>>>> Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
>>>>
>>>>
>>>> Dan,
>>>>
>>>> Thanks for your comments. I am not sure about which splintering overhead
>>>> you are referring to. I can think of three areas:
>>>>
>>>> 1. splintering in page allocation
>>>> In this case, Xen fails to allocate the requested page order, so it falls
>>>> back to smaller pages to set up the p2m table. The overhead is
>>>> O(guest_mem_size), which is a one-time deal.
>>>>
>>>> 2. P2M splits large page into smaller pages
>>>> This is one-directional because we don't merge smaller pages into large
>>>> ones. The worst case is to split all guest large pages, so the overhead is
>>>> O(total_large_page_mem). In the long run, the overhead will converge to 0
>>>> because it is one-directional. Note this overhead also covers the case
>>>> when the PoD feature is enabled.
>>>>
>>>> 3. CPU splintering
>>>> If the CPU does not support 1GB pages, it automatically splinters them
>>>> into smaller ones (such as 2MB). In this case, the overhead is always
>>>> there. But 1) this only happens on a small number of old chips; 2) I
>>>> believe that it is still faster than 4K pages. CPUID (1GB feature and
>>>> 1GB TLB entries) can be used to detect and stop this problem, if we
>>>> don't really like it.
>>>>
>>>> I agree with your concerns. Customers should have the right to make their
>>>> own decision. But that requires the new feature to be enabled in the first
>>>> place. For a lot of benchmarks, splintering overhead can be offset by the
>>>> benefits of huge pages. SPECJBB is a good example of using large pages
>>>> (see Ben Serebrin's presentation at Xen Summit). With that said, I agree
>>>> with the idea of adding a new option to the guest configuration file.
>>>>
>>>> -Wei
>>>>
>>>>
>>>> Dan Magenheimer wrote:
>>>>     
>>>>> I'd like to reiterate my argument raised in a previous
>>>>> discussion of hugepages:  Just because this CAN be made
>>>>> to work, doesn't imply that it SHOULD be made to work.
>>>>> Real users use larger pages in their OS for the sole
>>>>> reason that they expect a performance improvement.
>>>>> If it magically works, but works slow (and possibly
>>>>> slower than if the OS had just used small pages to
>>>>> start with), this is likely to lead to unsatisfied
>>>>> customers, and perhaps allegations such as "Xen sucks
>>>>> when running databases".
>>>>>
>>>>> So, please, let's think this through before implementing
>>>>> it just because we can.  At a minimum, an administrator
>>>>> should be somehow warned if large pages are getting splintered.
>>>>>
>>>>> And if it's going in over my objection, please tie it to
>>>>> a boot option that defaults off so administrator action
>>>>> is required to allow silent splintering.
>>>>>
>>>>> My two cents...
>>>>> Dan
>>>>>
>>>>>       
>>>>>> -----Original Message-----
>>>>>> From: Huang2, Wei [mailto:Wei.Huang2@amd.com]
>>>>>> Sent: Thursday, March 19, 2009 2:07 AM
>>>>>> To: George Dunlap
>>>>>> Cc: xen-devel@lists.xensource.com; keir.fraser@eu.citrix.com; 
>>>>>> Tim Deegan
>>>>>> Subject: RE: [Xen-devel] [RFC][Patches] Xen 1GB Page 
>> Table Support
>>>>>>
>>>>>> Here are patches using the middle approach. It handles 1GB pages in PoD
>>>>>> by remapping 1GB with 2MB pages & retry. I also added code for 1GB
>>>>>> detection. Please comment.
>>>>>>
>>>>>> Thanks a lot,
>>>>>>
>>>>>> -Wei
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On 
>>>>>>         
>>>> Behalf Of George
>>>>     
>>>>>> Dunlap
>>>>>> Sent: Wednesday, March 18, 2009 12:20 PM
>>>>>> To: Huang2, Wei
>>>>>> Cc: xen-devel@lists.xensource.com; keir.fraser@eu.citrix.com; 
>>>>>> Tim Deegan
>>>>>> Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page 
>> Table Support
>>>>>> Thanks for doing this work, Wei -- especially all the extra effort for
>>>>>> the PoD integration.
>>>>>>
>>>>>> One question: How well would you say you've tested the PoD
>>>>>> functionality?  Or to put it the other way, how much do I need to
>>>>>> prioritize testing this before the 3.4 release?
>>>>>>
>>>>>> It wouldn't be a bad idea to do as you suggested, and break things
>>>>>> into 2 meg pages for the PoD case.  In order to take the best
>>>>>> advantage of this in a PoD scenario, you'd need to have a balloon
>>>>>> driver that could allocate 1G of contiguous *guest* p2m space, which
>>>>>> seems a bit optimistic at this point...
>>>>>>
>>>>>>  -George
>>>>>>
>>>>>> 2009/3/18 Huang2, Wei <Wei.Huang2@amd.com>:
>>>>>>         
>>>>>>> Current Xen supports 2MB super pages for NPT/EPT. The attached patches
>>>>>>> extend this feature to support 1GB pages. The PoD (populate-on-demand)
>>>>>>> introduced by George Dunlap made P2M modification harder. I tried to
>>>>>>> preserve existing PoD design by introducing a 1GB PoD cache list.
>>>>>>>
>>>>>>>
>>>>>>> Note that 1GB PoD can be dropped if we don't care about 1GB when PoD is
>>>>>>> enabled. In this case, we can just split 1GB PDPE into 512x2MB PDE entries
>>>>>>> and grab pages from PoD super list. That can pretty much make
>>>>>>> 1gb_p2m_pod.patch go away.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Any comment/suggestion on design idea will be appreciated.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -Wei
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The following is the description:
>>>>>>>
>>>>>>> === 1gb_tools.patch ===
>>>>>>>
>>>>>>> Extend existing setup_guest() function. Basically, it tries to allocate 1GB
>>>>>>> pages whenever available. If this request fails, it falls back to 2MB. If
>>>>>>> both fail, then 4KB pages will be used.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> === 1gb_p2m.patch ===
>>>>>>>
>>>>>>> * p2m_next_level()
>>>>>>>
>>>>>>> Check PSE bit of L3 page table entry. If 1GB is found (PSE=1), we split 1GB
>>>>>>> into 512 2MB pages.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> * p2m_set_entry()
>>>>>>>
>>>>>>> Configure the PSE bit of L3 P2M table if page order == 18 (1GB).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> * p2m_gfn_to_mfn()
>>>>>>>
>>>>>>> Add support for 1GB case when doing gfn to mfn translation. When L3 entry is
>>>>>>> marked as POPULATE_ON_DEMAND, we call p2m_pod_demand_populate(). Otherwise,
>>>>>>> we do the regular address translation (gfn ==> mfn).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> * p2m_gfn_to_mfn_current()
>>>>>>>
>>>>>>> This is similar to p2m_gfn_to_mfn(). When L3 entry is marked as
>>>>>>> POPULATE_ON_DEMAND, it demands a populate using p2m_pod_demand_populate().
>>>>>>> Otherwise, it does a normal translation. 1GB page is taken into
>>>>>>> consideration.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> * set_p2m_entry()
>>>>>>>
>>>>>>> Request 1GB page
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> * audit_p2m()
>>>>>>>
>>>>>>> Support 1GB while auditing p2m table.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> * p2m_change_type_global()
>>>>>>>
>>>>>>> Deal with 1GB page when changing global page type.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> === 1gb_p2m_pod.patch ===
>>>>>>>
>>>>>>> * xen/include/asm-x86/p2m.h
>>>>>>>
>>>>>>> Minor change to deal with PoD. It separates super page cache list into 2MB
>>>>>>> and 1GB lists. Similarly, we record last gpfn of sweeping for both 2MB and
>>>>>>> 1GB.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> * p2m_pod_cache_add()
>>>>>>>
>>>>>>> Check page order and add 1GB super page into PoD 1GB cache list.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> * p2m_pod_cache_get()
>>>>>>>
>>>>>>> Grab a page from cache list. It tries to break 1GB page into 512 2MB pages
>>>>>>> if 2MB PoD list is empty. Similarly, 4KB can be requested from super pages.
>>>>>>> The breaking order is 2MB then 1GB.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> * p2m_pod_cache_target()
>>>>>>>
>>>>>>> This function is used to set PoD cache size. To increase PoD target, we try
>>>>>>> to allocate 1GB from xen domheap. If this fails, we try 2MB. If both fail,
>>>>>>> we try 4KB which is guaranteed to work.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> To decrease the target, we use a similar approach. We first try to free 1GB
>>>>>>> pages from 1GB PoD cache list. If such request fails, we try 2MB PoD cache
>>>>>>> list. If both fail, we try 4KB list.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> * p2m_pod_zero_check_superpage_1gb()
>>>>>>>
>>>>>>> This adds a new function to check for 1GB page. This function is similar to
>>>>>>> p2m_pod_zero_check_superpage_2mb().
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> * p2m_pod_zero_check_superpage_1gb()
>>>>>>>
>>>>>>> We add a new function to sweep 1GB page from guest memory. This is the same
>>>>>>> as p2m_pod_zero_check_superpage_2mb().
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> * p2m_pod_demand_populate()
>>>>>>>
>>>>>>> The trick of this function is to do remap_and_retry if p2m_pod_cache_get()
>>>>>>> fails. When p2m_pod_get() fails, this function splits the p2m table entry
>>>>>>> into smaller ones (e.g. 1GB ==> 2MB or 2MB ==> 4KB). That can guarantee
>>>>>>> populate demands always work.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Xen-devel mailing list
>>>>>>> Xen-devel@lists.xensource.com
>>>>>>> http://lists.xensource.com/xen-devel
>>>>>>>
>>>>>>>
>>>>>>>           
>>>>     
>>
>