[RFC][PATCH]Large Page Support for HAP


* [RFC][PATCH]Large Page Support for HAP
@ 2007-11-15 16:26 Huang2, Wei
  2007-11-15 16:36 ` Keir Fraser
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Huang2, Wei @ 2007-11-15 16:26 UTC (permalink / raw)
  To: xen-devel; +Cc: Tim Deegan


[-- Attachment #1.1: Type: text/plain, Size: 2408 bytes --]

I have implemented a preliminary version of large page support for HAP.
My testing shows that 32-bit PAE and 64-bit both work well, and I saw a
decent performance improvement on certain benchmarks.
 
Before going too far, I am sending this patch to the community for
review and comments. The patch applies to xen-unstable changeset 16281.
I will redo it after collecting feedback.
 
Thanks,
 
-Wei
 
============
DESIGN IDEAS:
1. Large page requests
- xc_hvm_build.c requests large pages (2MB for now) when starting a
guest.
- memory.c handles large page requests. If it cannot satisfy a request,
it falls back to 4KB pages.
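
A minimal sketch of the request layout that the xc_hvm_build.c hunk in the
attached patch issues, assuming 4KB base pages and ignoring the shift of the
MMIO hole below 4GB. The struct and the function name plan_guest_memory() are
hypothetical and only summarise the three kinds of populate_physmap calls; the
counts mirror the arithmetic in the patch.

```c
#include <assert.h>

#define PAGE_SHIFT 12   /* 4KB base pages */
#define ORDER_2MB  9    /* one 2MB page = 512 x 4KB pages */

/* Hypothetical summary of the requests setup_guest() makes. */
struct alloc_plan {
    unsigned long low_4k_pages;   /* 4KB pages in the first 2MB, VGA hole skipped */
    unsigned long extents_2mb;    /* order-9 extents above the first 2MB */
    unsigned long tail_4k_pages;  /* 4KB pages left when memsize (MB) is odd */
};

static struct alloc_plan plan_guest_memory(unsigned long memsize_mb)
{
    struct alloc_plan p;
    unsigned long nr_pages = memsize_mb << (20 - PAGE_SHIFT);
    unsigned long nr_pages_2mb = memsize_mb >> 1;

    /* The sketch assumes the guest has at least one full 2MB region. */
    assert(memsize_mb >= 2);

    /* 0x00-0xa0 and 0xc0-0x200, i.e. everything below 2MB except the hole. */
    p.low_4k_pages = 0xa0 + (0x200 - 0xc0);
    /* The first 2MB region is already covered by the 4KB requests. */
    p.extents_2mb = nr_pages_2mb - 1;
    /* Whatever the 2MB extents cannot cover is requested as 4KB pages. */
    p.tail_4k_pages = nr_pages - (nr_pages_2mb << ORDER_2MB);
    return p;
}
```

For a 513MB guest this yields 480 low 4KB pages, 255 extents of 2MB, and a
256-page (1MB) tail.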
 
2. P2M table
- The P2M code takes the page size order as a parameter and builds the
P2M table (setting the PSE bit, etc.) according to the page size.
- Other related functions (such as audit_p2m()) handle the table based
on page size too.
- Page split/merge
** A large page is split into 4KB pages in the P2M table when needed.
For instance, if set_p2m_entry() is handling a 4KB page but finds the
PSE and PRESENT bits set, it splits the large page into 4KB pages.
** There is NO merging of 4KB pages back into a large page. Since large
pages are only set up at the very beginning, in guest_physmap_add(),
this is OK for now.
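
The split step can be sketched as follows. This is a toy model, not the
hypervisor code: entries are plain 64-bit integers with the frame number above
bit 12 and the flags below, and make_entry(), entry_pfn(), entry_flags() and
split_large_entry() are hypothetical helpers. Note one deliberate difference:
the patch writes fixed flags into the new entries, while this sketch carries
over the flags of the large entry (minus PSE).

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT        12
#define ENTRIES_PER_TABLE 512          /* 4KB entries covered by one 2MB page */
#define FLAG_PRESENT      (1ULL << 0)  /* x86 _PAGE_PRESENT */
#define FLAG_PSE          (1ULL << 7)  /* x86 _PAGE_PSE */
#define FLAGS_MASK        0xfffULL

/* Toy entry layout: frame number above bit 12, flags below. */
static uint64_t make_entry(uint64_t pfn, uint64_t flags)
{
    return (pfn << PAGE_SHIFT) | (flags & FLAGS_MASK);
}

static uint64_t entry_pfn(uint64_t e)   { return e >> PAGE_SHIFT; }
static uint64_t entry_flags(uint64_t e) { return e & FLAGS_MASK; }

/*
 * Fill l1[] with 512 4KB entries covering the same frames as the large
 * entry, then return the replacement L2 entry that points at the new L1
 * table (l1_table_pfn is the frame that holds l1[]).
 */
static uint64_t split_large_entry(uint64_t large, uint64_t *l1,
                                  uint64_t l1_table_pfn)
{
    uint64_t flags = entry_flags(large);
    uint64_t base = entry_pfn(large);
    int i;

    assert((flags & FLAG_PRESENT) && (flags & FLAG_PSE));
    assert((base & (ENTRIES_PER_TABLE - 1)) == 0);  /* 2MB aligned */

    flags &= ~FLAG_PSE;  /* the new entries map 4KB frames */
    for ( i = 0; i < ENTRIES_PER_TABLE; i++ )
        l1[i] = make_entry(base + i, flags);

    return make_entry(l1_table_pfn, flags);
}
```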
 
3. HAP
- To access the PSE bit, the L2 pages of the P2M table are installed in
a linear mapping at SH_LINEAR_PT_VIRT_START. We borrow this address
space because it is otherwise unused.
 
4. gfn_to_mfn translation (P2M)
- gfn_to_mfn_foreign() traverses the P2M table and translates the
address according to the PSE bit.
- gfn_to_mfn_current() accesses SH_LINEAR_PT_VIRT_START to check the PSE
bit. If it is set, the translation is done using the large page.
Otherwise, it falls back to the normal RO_MPT_VIRT_START address space
to access the P2M L1 pages.
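
The PSE-aware lookup can be illustrated with a toy two-level table. The
struct toy_l2e and toy_gfn_to_mfn() below are hypothetical and stand in for
the real page-table walk; the point is only that a PSE entry yields
base frame + (gfn & 511), while a non-PSE entry needs one more lookup in the
L1 table.

```c
#include <assert.h>

#define L1_ENTRIES  512          /* 4KB frames per 2MB region */
#define INVALID_MFN (~0UL)

/* Toy L2 entry: either a 2MB mapping (pse) or a pointer to 512 L1 slots. */
struct toy_l2e {
    int present;
    int pse;
    unsigned long pfn;           /* 2MB-aligned base frame when pse is set */
    const unsigned long *l1;     /* per-4KB frame numbers when pse is clear */
};

static unsigned long toy_gfn_to_mfn(const struct toy_l2e *l2,
                                    unsigned long nr_l2, unsigned long gfn)
{
    unsigned long l2_idx = gfn >> 9;
    unsigned long l1_idx = gfn & (L1_ENTRIES - 1);
    const struct toy_l2e *e;

    if ( l2_idx >= nr_l2 )
        return INVALID_MFN;

    e = &l2[l2_idx];
    if ( !e->present )
        return INVALID_MFN;

    if ( e->pse )
    {
        /* Large page: the low 9 bits of the gfn index into the 2MB frame. */
        assert((e->pfn & (L1_ENTRIES - 1)) == 0);
        return e->pfn + l1_idx;
    }

    /* 4KB mapping: one more level of lookup. */
    return e->l1[l1_idx];
}
```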
 
5. M2P translation
- As before, M2P translation still happens at 4KB granularity.
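
A minimal sketch of what 4KB-granularity M2P means for a large page: every
4KB frame inside the extent gets its own M2P slot, as the loops around
set_gpfn_from_mfn() in the patch do. The flat array and the function
m2p_set_range() are hypothetical stand-ins for the real M2P table.

```c
#include <assert.h>
#include <stddef.h>

/* Record gfn+i for every 4KB frame mfn+i of an extent of the given order. */
static void m2p_set_range(unsigned long *m2p, size_t m2p_len,
                          unsigned long mfn, unsigned long gfn,
                          unsigned int order)
{
    unsigned long i, n = 1UL << order;

    /* Both addresses must be aligned to the extent size. */
    assert(!(gfn & (n - 1)) && !(mfn & (n - 1)));
    assert(mfn + n <= m2p_len);

    for ( i = 0; i < n; i++ )
        m2p[mfn + i] = gfn + i;
}
```

With order 9 this writes 512 slots for one 2MB page; with order 0 it
degenerates to the single-slot update used for 4KB pages.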
 
AREAS NEEDING COMMENTS:
1. Large pages for 32-bit mode
- 32-bit (non-PAE) mode uses 4MB large pages. This is very awkward for
xc_hvm_build.c; I don't want to create another 4MB page_array for it.
- Because of this, this area has not been tested very well. I expect
changes soon.
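
For reference, the arithmetic behind the two large page sizes, assuming 4KB
base pages: a 2MB page is order 9 and a 4MB page is order 10. The helpers
large_page_order() and nr_large_extents() are hypothetical; they only show how
the extent count could be derived from a single order parameter rather than
being hard-coded for 2MB.

```c
#include <assert.h>

#define PAGE_SHIFT 12   /* 4KB base pages */

/* Order of a large page given log2 of its size (21 for 2MB, 22 for 4MB). */
static unsigned int large_page_order(unsigned int large_page_shift)
{
    assert(large_page_shift >= PAGE_SHIFT);
    return large_page_shift - PAGE_SHIFT;
}

/* Number of whole large extents that fit in memsize_mb of guest memory. */
static unsigned long nr_large_extents(unsigned long memsize_mb,
                                      unsigned int order)
{
    unsigned long nr_pages = memsize_mb << (20 - PAGE_SHIFT);
    return nr_pages >> order;
}
```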
 
2. Shadow paging
- This implementation will affect shadow mode, especially in
xc_hvm_build.c and memory.c.
- Where and how should we avoid affecting shadow mode?
 
3. Turning it on/off
- Do we want to be able to turn this feature on/off through an option
(a kernel option or anything else)?
 
4. Other missing areas?
===========

[-- Attachment #1.2: Type: text/html, Size: 6123 bytes --]

[-- Attachment #2: hap_large_page_patch.txt --]
[-- Type: text/plain, Size: 84194 bytes --]

diff -r 3191627e5ad6 tools/libxc/xc_hvm_build.c
--- a/tools/libxc/xc_hvm_build.c	Wed Oct 31 16:21:18 2007 +0000
+++ b/tools/libxc/xc_hvm_build.c	Wed Nov 07 07:16:57 2007 -0600
@@ -149,7 +149,9 @@ static int setup_guest(int xc_handle,
                        char *image, unsigned long image_size)
 {
     xen_pfn_t *page_array = NULL;
+    xen_pfn_t *page_array_2MB = NULL;
     unsigned long i, nr_pages = (unsigned long)memsize << (20 - PAGE_SHIFT);
+    unsigned long nr_pages_2MB = (unsigned long)memsize >> 1;
     unsigned long shared_page_nr, entry_eip;
     struct xen_add_to_physmap xatp;
     struct shared_info *shared_info;
@@ -189,7 +191,8 @@ static int setup_guest(int xc_handle,
             v_start, v_end,
             elf_uval(&elf, elf.ehdr, e_entry));
 
-    if ( (page_array = malloc(nr_pages * sizeof(xen_pfn_t))) == NULL )
+    if ( (page_array = malloc(nr_pages * sizeof(xen_pfn_t))) == NULL ||
+	 (page_array_2MB = malloc(nr_pages_2MB * sizeof(xen_pfn_t))) == NULL )
     {
         PERROR("Could not allocate memory.\n");
         goto error_out;
@@ -197,15 +200,33 @@ static int setup_guest(int xc_handle,
 
     for ( i = 0; i < nr_pages; i++ )
         page_array[i] = i;
+    for ( i = 0; i < nr_pages_2MB; i++ )
+	page_array_2MB[i] = i << 9;
     for ( i = HVM_BELOW_4G_RAM_END >> PAGE_SHIFT; i < nr_pages; i++ )
         page_array[i] += HVM_BELOW_4G_MMIO_LENGTH >> PAGE_SHIFT;
-
-    /* Allocate memory for HVM guest, skipping VGA hole 0xA0000-0xC0000. */
+    for ( i = HVM_BELOW_4G_RAM_END >> (PAGE_SHIFT + 9); i < nr_pages_2MB; i++ )
+	page_array_2MB[i] += HVM_BELOW_4G_MMIO_LENGTH >> PAGE_SHIFT;
+
+    /* Note: We try to request 2MB page allocations at this point. The
+     * hypervisor falls back to 4KB allocations if it cannot satisfy them.
+     *
+     * Allocate memory for HVM guest from 0 - 2MB space using 4KB pages, 
+     * skipping VGA hole 0xA0000-0xC0000. 
+     */
     rc = xc_domain_memory_populate_physmap(
         xc_handle, dom, 0xa0, 0, 0, &page_array[0x00]);
     if ( rc == 0 )
         rc = xc_domain_memory_populate_physmap(
-            xc_handle, dom, nr_pages - 0xc0, 0, 0, &page_array[0xc0]);
+	    xc_handle, dom, 0x200 - 0xc0, 0, 0, &page_array[0xc0]);
+    /* Allocate memory for HVM guest beyond 2MB space using 2MB pages */
+    if ( rc == 0 )
+	rc = xc_domain_memory_populate_physmap(
+	    xc_handle, dom, nr_pages_2MB - 1, 9, 0, &page_array_2MB[1]);
+    /* Handle the case of odd number physical memory size */
+    if ( rc == 0 )
+	rc = xc_domain_memory_populate_physmap(
+	    xc_handle, dom, nr_pages - (nr_pages_2MB << 9), 0, 0,
+	    &page_array[nr_pages_2MB << 9]);
     if ( rc != 0 )
     {
         PERROR("Could not allocate memory for HVM guest.\n");
@@ -268,10 +289,12 @@ static int setup_guest(int xc_handle,
     }
 
     free(page_array);
+    free(page_array_2MB);
     return 0;
 
  error_out:
     free(page_array);
+    free(page_array_2MB);
     return -1;
 }
 
diff -r 3191627e5ad6 xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c	Wed Oct 31 16:21:18 2007 +0000
+++ b/xen/arch/x86/hvm/vmx/vmx.c	Wed Nov 07 11:14:53 2007 -0600
@@ -2413,7 +2413,8 @@ static int vmx_alloc_vlapic_mapping(stru
         return -ENOMEM;
     share_xen_page_with_guest(virt_to_page(apic_va), d, XENSHARE_writable);
     guest_physmap_add_page(
-        d, paddr_to_pfn(APIC_DEFAULT_PHYS_BASE), virt_to_mfn(apic_va));
+        d, paddr_to_pfn(APIC_DEFAULT_PHYS_BASE), virt_to_mfn(apic_va),
+        PAGE_SIZE_ORDER_4K);
     d->arch.hvm_domain.vmx_apic_access_mfn = virt_to_mfn(apic_va);
 
     return 0;
diff -r 3191627e5ad6 xen/arch/x86/mm.c
--- a/xen/arch/x86/mm.c	Wed Oct 31 16:21:18 2007 +0000
+++ b/xen/arch/x86/mm.c	Mon Nov 12 07:51:32 2007 -0600
@@ -3155,7 +3155,8 @@ long arch_memory_op(int op, XEN_GUEST_HA
         {
             if ( is_xen_heap_frame(mfn_to_page(prev_mfn)) )
                 /* Xen heap frames are simply unhooked from this phys slot. */
-                guest_physmap_remove_page(d, xatp.gpfn, prev_mfn);
+                guest_physmap_remove_page(d, xatp.gpfn, prev_mfn, 
+                                          PAGE_SIZE_ORDER_4K);
             else
                 /* Normal domain memory is freed, to avoid leaking memory. */
                 guest_remove_page(d, xatp.gpfn);
@@ -3164,10 +3165,10 @@ long arch_memory_op(int op, XEN_GUEST_HA
         /* Unmap from old location, if any. */
         gpfn = get_gpfn_from_mfn(mfn);
         if ( gpfn != INVALID_M2P_ENTRY )
-            guest_physmap_remove_page(d, gpfn, mfn);
+            guest_physmap_remove_page(d, gpfn, mfn, PAGE_SIZE_ORDER_4K);
 
         /* Map at new location. */
-        guest_physmap_add_page(d, xatp.gpfn, mfn);
+        guest_physmap_add_page(d, xatp.gpfn, mfn, PAGE_SIZE_ORDER_4K);
 
         UNLOCK_BIGLOCK(d);
 
diff -r 3191627e5ad6 xen/arch/x86/mm/hap/hap.c
--- a/xen/arch/x86/mm/hap/hap.c	Wed Oct 31 16:21:18 2007 +0000
+++ b/xen/arch/x86/mm/hap/hap.c	Thu Nov 15 03:54:42 2007 -0600
@@ -238,9 +238,14 @@ static void hap_install_xen_entries_in_l
 {
     struct domain *d = v->domain;
     l4_pgentry_t *l4e;
+    l3_pgentry_t *l3e;
+    struct page_info *pg;
 
     l4e = hap_map_domain_page(l4mfn);
     ASSERT(l4e != NULL);
+
+    if ( (pg = hap_alloc(d)) == NULL )
+        goto oom;
 
     /* Copy the common Xen mappings from the idle domain */
     memcpy(&l4e[ROOT_PAGETABLE_FIRST_XEN_SLOT],
@@ -261,7 +266,24 @@ static void hap_install_xen_entries_in_l
         l4e_from_pfn(mfn_x(pagetable_get_mfn(d->arch.phys_table)),
                      __PAGE_HYPERVISOR);
 
+    /* Install the L2 pages of P2M table in linear mapping. We borrow
+     * SH_LINEAR_PT_VIRT_START to store such information.   
+     */
+    l4e[l4_table_offset(SH_LINEAR_PT_VIRT_START)] =
+        l4e_from_pfn(mfn_x(page_to_mfn(pg)), __PAGE_HYPERVISOR);
+    l3e = hap_map_domain_page(page_to_mfn(pg));
+    l3e[l3_table_offset(SH_LINEAR_PT_VIRT_START)] = 
+        l3e_from_pfn(mfn_x(pagetable_get_mfn(d->arch.phys_table)),
+                     __PAGE_HYPERVISOR);
+    hap_unmap_domain_page(l3e);
+
     hap_unmap_domain_page(l4e);
+   
+    return;
+ oom:
+    HAP_ERROR("out of memory building monitor pagetable\n");
+    domain_crash(d);
+    return;
 }
 #endif /* CONFIG_PAGING_LEVELS == 4 */
 
@@ -269,12 +291,17 @@ static void hap_install_xen_entries_in_l
 static void hap_install_xen_entries_in_l2h(struct vcpu *v, mfn_t l2hmfn)
 {
     struct domain *d = v->domain;
+    l1_pgentry_t *l1e;
     l2_pgentry_t *l2e;
     l3_pgentry_t *p2m;
+    struct page_info *pg;
     int i;
 
     l2e = hap_map_domain_page(l2hmfn);
     ASSERT(l2e != NULL);
+
+    if ( (pg = hap_alloc(d)) == NULL )
+        goto oom;
 
     /* Copy the common Xen mappings from the idle domain */
     memcpy(&l2e[L2_PAGETABLE_FIRST_XEN_SLOT & (L2_PAGETABLE_ENTRIES-1)],
@@ -304,8 +331,27 @@ static void hap_install_xen_entries_in_l
                            __PAGE_HYPERVISOR)
             : l2e_empty();
     }
+
+    /* Install the L2 pages of p2m table in linear mapping. We borrow 
+     * SH_LINEAR_PT_VIRT_START to store such information.
+     */
+    l2e[l2_table_offset(SH_LINEAR_PT_VIRT_START)] = 
+        l2e_from_pfn(mfn_x(page_to_mfn(pg)), __PAGE_HYPERVISOR);
+    l1e = hap_map_domain_page(page_to_mfn(pg));
+    for ( i = 0; i < L3_PAGETABLE_ENTRIES; i++ )
+        l1e[l1_table_offset(SH_LINEAR_PT_VIRT_START) + i] =
+                (l3e_get_flags(p2m[i]) & _PAGE_PRESENT)
+                ? l1e_from_pfn(l3e_get_pfn(p2m[i]), __PAGE_HYPERVISOR)
+                : l1e_empty();
+
+    hap_unmap_domain_page(l1e);
     hap_unmap_domain_page(p2m);
     hap_unmap_domain_page(l2e);
+    return;
+ oom:
+    HAP_ERROR("out of memory building monitor pagetable\n");
+    domain_crash(d);
+    return;
 }
 #endif
 
@@ -337,6 +383,11 @@ static void hap_install_xen_entries_in_l
 
     /* Install the domain-specific P2M table */
     l2e[l2_table_offset(RO_MPT_VIRT_START)] =
+        l2e_from_pfn(mfn_x(pagetable_get_mfn(d->arch.phys_table)),
+                            __PAGE_HYPERVISOR);
+
+    /* Also install the P2M table in the SH_LINEAR_PT_VIRT_START slot */
+    l2e[l2_table_offset(SH_LINEAR_PT_VIRT_START)] =
         l2e_from_pfn(mfn_x(pagetable_get_mfn(d->arch.phys_table)),
                             __PAGE_HYPERVISOR);
 
@@ -414,12 +465,33 @@ static void hap_destroy_monitor_table(st
 {
     struct domain *d = v->domain;
 
-#if CONFIG_PAGING_LEVELS == 3
+#if CONFIG_PAGING_LEVELS == 4
+    {
+	l4_pgentry_t *l4e = NULL;
+	unsigned int shl_l4e_offset = l4_table_offset(SH_LINEAR_PT_VIRT_START);
+	
+	l4e = hap_map_domain_page(_mfn(mmfn));
+	ASSERT(l4e_get_flags(l4e[shl_l4e_offset]) & _PAGE_PRESENT);
+	hap_free(d, _mfn(l4e_get_pfn(l4e[shl_l4e_offset])));
+	hap_unmap_domain_page(l4e);
+    }
+#elif CONFIG_PAGING_LEVELS == 3
     /* Need to destroy the l2 monitor page in slot 4 too */
     {
-        l3_pgentry_t *l3e = hap_map_domain_page(mmfn);
+        l3_pgentry_t *l3e = NULL;
+        l2_pgentry_t *l2e = NULL;
+        unsigned int l2e_offset = l2_table_offset(SH_LINEAR_PT_VIRT_START);
+
+        l3e = hap_map_domain_page(mmfn);
         ASSERT(l3e_get_flags(l3e[3]) & _PAGE_PRESENT);
-        hap_free(d, _mfn(l3e_get_pfn(l3e[3])));
+
+        /* destroy the l1 monitor page created for mapping level 2 p2m pages */
+        l2e = hap_map_domain_page(_mfn(l3e_get_pfn(l3e[3])));        
+        ASSERT(l2e_get_flags(l2e[l2e_offset]) & _PAGE_PRESENT);
+        hap_free(d, _mfn(l2e_get_pfn(l2e[l2e_offset])));
+        hap_unmap_domain_page(l2e);
+
+        hap_free(d, _mfn(l3e_get_pfn(l3e[3])));        
         hap_unmap_domain_page(l3e);
     }
 #endif
@@ -644,6 +716,8 @@ static void p2m_install_entry_in_monitor
  * in the monitor table.  This function makes fresh copies when a p2m
  * l3e changes. */
 {
+    l1_pgentry_t *sh_ml1e;
+    unsigned int sh_l2_index;
     l2_pgentry_t *ml2e;
     struct vcpu *v;
     unsigned int index;
@@ -658,20 +732,31 @@ static void p2m_install_entry_in_monitor
 
         ASSERT(paging_mode_external(v->domain));
 
-        if ( v == current ) /* OK to use linear map of monitor_table */
+        if ( v == current ) { /* OK to use linear map of monitor_table */
             ml2e = __linear_l2_table + l2_linear_offset(RO_MPT_VIRT_START);
+	    sh_ml1e = __linear_l1_table + l1_linear_offset(SH_LINEAR_PT_VIRT_START);
+	}
         else {
             l3_pgentry_t *ml3e;
             ml3e = hap_map_domain_page(
                 pagetable_get_mfn(v->arch.monitor_table));
             ASSERT(l3e_get_flags(ml3e[3]) & _PAGE_PRESENT);
             ml2e = hap_map_domain_page(_mfn(l3e_get_pfn(ml3e[3])));
+
+	    sh_l2_index = l2_table_offset(SH_LINEAR_PT_VIRT_START);
+	    ASSERT(l2e_get_flags(ml2e[sh_l2_index]) & _PAGE_PRESENT);
+	    sh_ml1e = hap_map_domain_page(_mfn(l2e_get_pfn(ml2e[sh_l2_index])));
+	    sh_ml1e += l1_table_offset(SH_LINEAR_PT_VIRT_START);
+
             ml2e += l2_table_offset(RO_MPT_VIRT_START);
             hap_unmap_domain_page(ml3e);
         }
         ml2e[index] = l2e_from_pfn(l3e_get_pfn(*l3e), __PAGE_HYPERVISOR);
-        if ( v != current )
+	sh_ml1e[index] = l1e_from_pfn(l3e_get_pfn(*l3e), __PAGE_HYPERVISOR);
+        if ( v != current ) {
             hap_unmap_domain_page(ml2e);
+	    hap_unmap_domain_page(sh_ml1e);
+	}
     }
 }
 #endif
diff -r 3191627e5ad6 xen/arch/x86/mm/p2m.c
--- a/xen/arch/x86/mm/p2m.c	Wed Oct 31 16:21:18 2007 +0000
+++ b/xen/arch/x86/mm/p2m.c	Thu Nov 15 03:54:42 2007 -0600
@@ -149,9 +149,12 @@ p2m_next_level(struct domain *d, mfn_t *
                unsigned long *gfn_remainder, unsigned long gfn, u32 shift,
                u32 max, unsigned long type)
 {
+    l1_pgentry_t *l1_entry;
     l1_pgentry_t *p2m_entry;
     l1_pgentry_t new_entry;
     void *next;
+    int i;
+
     ASSERT(d->arch.p2m.alloc_page);
 
     if ( !(p2m_entry = p2m_find_entry(*table, gfn_remainder, gfn,
@@ -192,6 +195,36 @@ p2m_next_level(struct domain *d, mfn_t *
             break;
         }
     }
+
+    ASSERT(l1e_get_flags(*p2m_entry) & _PAGE_PRESENT);
+
+    /* split single large page into 4KB page in P2M table */
+    if ( type == PGT_l1_page_table && (l1e_get_flags(*p2m_entry) & _PAGE_PSE) )
+    {
+	struct page_info *pg = d->arch.p2m.alloc_page(d);
+	if ( pg == NULL )
+	    return 0;
+	list_add_tail(&pg->list, &d->arch.p2m.pages);
+	pg->u.inuse.type_info = PGT_l1_page_table | 1 | PGT_validated;
+	pg->count_info = 1;
+
+	l1_entry = map_domain_page(mfn_x(page_to_mfn(pg)));
+	for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++ )
+	{
+	    mfn_t mfn = _mfn(l1e_get_pfn(*p2m_entry) + i);
+	    new_entry = l1e_from_pfn(mfn_x(mfn),
+				     __PAGE_HYPERVISOR|_PAGE_USER);
+	    paging_write_p2m_entry(d, gfn,
+				   l1_entry+i, *table_mfn, new_entry, 1);
+	}
+	unmap_domain_page(l1_entry);
+
+	new_entry = l1e_from_pfn(mfn_x(page_to_mfn(pg)),
+				 __PAGE_HYPERVISOR|_PAGE_USER);
+	paging_write_p2m_entry(d, gfn,
+			       p2m_entry, *table_mfn, new_entry, 2);
+    }
+
     *table_mfn = _mfn(l1e_get_pfn(*p2m_entry));
     next = map_domain_page(mfn_x(*table_mfn));
     unmap_domain_page(*table);
@@ -202,14 +235,16 @@ p2m_next_level(struct domain *d, mfn_t *
 
 // Returns 0 on error (out of memory)
 static int
-set_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn, p2m_type_t p2mt)
+set_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn, 
+              unsigned int order, p2m_type_t p2mt)
 {
     // XXX -- this might be able to be faster iff current->domain == d
     mfn_t table_mfn = pagetable_get_mfn(d->arch.phys_table);
     void *table =map_domain_page(mfn_x(table_mfn));
     unsigned long gfn_remainder = gfn;
-    l1_pgentry_t *p2m_entry;
+    l1_pgentry_t *p2m_entry = NULL;
     l1_pgentry_t entry_content;
+    l2_pgentry_t l2e_content;
     int rv=0;
 
 #if CONFIG_PAGING_LEVELS >= 4
@@ -234,30 +269,59 @@ set_p2m_entry(struct domain *d, unsigned
                          PGT_l2_page_table) )
         goto out;
 #endif
-    if ( !p2m_next_level(d, &table_mfn, &table, &gfn_remainder, gfn,
-                         L2_PAGETABLE_SHIFT - PAGE_SHIFT,
-                         L2_PAGETABLE_ENTRIES, PGT_l1_page_table) )
-        goto out;
-
-    p2m_entry = p2m_find_entry(table, &gfn_remainder, gfn,
-                               0, L1_PAGETABLE_ENTRIES);
-    ASSERT(p2m_entry);
+    
+    if ( order == PAGE_SIZE_ORDER_2M )
+    {
+	p2m_entry = p2m_find_entry(table, &gfn_remainder, gfn,
+				   L2_PAGETABLE_SHIFT - PAGE_SHIFT, 
+				   L2_PAGETABLE_ENTRIES);
+	ASSERT(p2m_entry);
+
+        if ( (l1e_get_flags(*p2m_entry) & _PAGE_PRESENT) && 
+             !(l1e_get_flags(*p2m_entry) & _PAGE_PSE) )
+	{
+	    P2M_ERROR("configure P2M table 4KB L2 entry with large page\n");
+	    domain_crash(d);
+	    goto out;
+	}
+	
+	if ( mfn_valid(mfn) )
+	    l2e_content = l2e_from_pfn(mfn_x(mfn), 
+				       p2m_type_to_flags(p2mt) | _PAGE_PSE);
+	else
+	    l2e_content = l2e_empty();
+	
+	entry_content.l1 = l2e_content.l2;
+
+	paging_write_p2m_entry(d, gfn, p2m_entry, table_mfn, entry_content, 2);
+    }
+    else if ( order == PAGE_SIZE_ORDER_4K )
+    {
+	if ( !p2m_next_level(d, &table_mfn, &table, &gfn_remainder, gfn,
+			     L2_PAGETABLE_SHIFT - PAGE_SHIFT,
+			     L2_PAGETABLE_ENTRIES, PGT_l1_page_table) )
+	    goto out;
+	
+	p2m_entry = p2m_find_entry(table, &gfn_remainder, gfn,
+				   0, L1_PAGETABLE_ENTRIES);
+	ASSERT(p2m_entry);
+
+	if ( mfn_valid(mfn) || (p2mt == p2m_mmio_direct) )
+	    entry_content = l1e_from_pfn(mfn_x(mfn), p2m_type_to_flags(p2mt));
+	else
+	    entry_content = l1e_empty();
+
+	/* level 1 entry */
+	paging_write_p2m_entry(d, gfn, p2m_entry, table_mfn, entry_content, 1);
+    }
+    
+    if ( vtd_enabled && (p2mt == p2m_mmio_direct) && is_hvm_domain(d) )
+        iommu_flush(d, gfn, (u64*)p2m_entry);
 
     /* Track the highest gfn for which we have ever had a valid mapping */
     if ( mfn_valid(mfn) && (gfn > d->arch.p2m.max_mapped_pfn) )
         d->arch.p2m.max_mapped_pfn = gfn;
 
-    if ( mfn_valid(mfn) || (p2mt == p2m_mmio_direct) )
-        entry_content = l1e_from_pfn(mfn_x(mfn), p2m_type_to_flags(p2mt));
-    else
-        entry_content = l1e_empty();
-
-    /* level 1 entry */
-    paging_write_p2m_entry(d, gfn, p2m_entry, table_mfn, entry_content, 1);
-
-    if ( vtd_enabled && (p2mt == p2m_mmio_direct) && is_hvm_domain(d) )
-        iommu_flush(d, gfn, (u64*)p2m_entry);
-
     /* Success */
     rv = 1;
 
@@ -267,14 +331,11 @@ set_p2m_entry(struct domain *d, unsigned
 }
 
 
-/* Init the datastructures for later use by the p2m code */
-void p2m_init(struct domain *d)
-{
-    p2m_lock_init(d);
-    INIT_LIST_HEAD(&d->arch.p2m.pages);
-}
-
-
+
+/************************************************/
+/*           Completed Functions                */
+/************************************************/
+/***********  SUPPORTIVE FUNCTIONS   ************/
 // Allocate a new p2m table for a domain.
 //
 // The structure of the p2m table is that of a pagetable for xen (i.e. it is
@@ -290,11 +351,11 @@ int p2m_alloc_table(struct domain *d,
                     void (*free_page)(struct domain *d, struct page_info *pg))
 
 {
-    mfn_t mfn;
+    mfn_t mfn = 0;
     struct list_head *entry;
     struct page_info *page, *p2m_top;
     unsigned int page_count = 0;
-    unsigned long gfn;
+    unsigned long gfn = 0;
 
     p2m_lock(d);
 
@@ -334,7 +395,8 @@ int p2m_alloc_table(struct domain *d,
     P2M_PRINTK("populating p2m table\n");
 
     /* Initialise physmap tables for slot zero. Other code assumes this. */
-    if ( !set_p2m_entry(d, 0, _mfn(INVALID_MFN), p2m_invalid) )
+    if ( !set_p2m_entry(d, 0, _mfn(INVALID_MFN), PAGE_SIZE_ORDER_4K,
+                        p2m_invalid) )
         goto error;
 
     /* Copy all existing mappings from the page list and m2p */
@@ -353,7 +415,7 @@ int p2m_alloc_table(struct domain *d,
             (gfn != 0x55555555L)
 #endif
              && gfn != INVALID_M2P_ENTRY
-            && !set_p2m_entry(d, gfn, mfn, p2m_ram_rw) )
+            && !set_p2m_entry(d, gfn, mfn, PAGE_SIZE_ORDER_4K, p2m_ram_rw) )
             goto error;
     }
 
@@ -371,109 +433,6 @@ int p2m_alloc_table(struct domain *d,
                PRI_mfn "\n", gfn, mfn_x(mfn));
     p2m_unlock(d);
     return -ENOMEM;
-}
-
-void p2m_teardown(struct domain *d)
-/* Return all the p2m pages to Xen.
- * We know we don't have any extra mappings to these pages */
-{
-    struct list_head *entry, *n;
-    struct page_info *pg;
-
-    p2m_lock(d);
-    d->arch.phys_table = pagetable_null();
-
-    list_for_each_safe(entry, n, &d->arch.p2m.pages)
-    {
-        pg = list_entry(entry, struct page_info, list);
-        list_del(entry);
-        d->arch.p2m.free_page(d, pg);
-    }
-    p2m_unlock(d);
-}
-
-mfn_t
-gfn_to_mfn_foreign(struct domain *d, unsigned long gfn, p2m_type_t *t)
-/* Read another domain's p2m entries */
-{
-    mfn_t mfn;
-    paddr_t addr = ((paddr_t)gfn) << PAGE_SHIFT;
-    l2_pgentry_t *l2e;
-    l1_pgentry_t *l1e;
-
-    ASSERT(paging_mode_translate(d));
-
-    /* XXX This is for compatibility with the old model, where anything not 
-     * XXX marked as RAM was considered to be emulated MMIO space.
-     * XXX Once we start explicitly registering MMIO regions in the p2m 
-     * XXX we will return p2m_invalid for unmapped gfns */
-    *t = p2m_mmio_dm;
-
-    mfn = pagetable_get_mfn(d->arch.phys_table);
-
-    if ( gfn > d->arch.p2m.max_mapped_pfn )
-        /* This pfn is higher than the highest the p2m map currently holds */
-        return _mfn(INVALID_MFN);
-
-#if CONFIG_PAGING_LEVELS >= 4
-    {
-        l4_pgentry_t *l4e = map_domain_page(mfn_x(mfn));
-        l4e += l4_table_offset(addr);
-        if ( (l4e_get_flags(*l4e) & _PAGE_PRESENT) == 0 )
-        {
-            unmap_domain_page(l4e);
-            return _mfn(INVALID_MFN);
-        }
-        mfn = _mfn(l4e_get_pfn(*l4e));
-        unmap_domain_page(l4e);
-    }
-#endif
-#if CONFIG_PAGING_LEVELS >= 3
-    {
-        l3_pgentry_t *l3e = map_domain_page(mfn_x(mfn));
-#if CONFIG_PAGING_LEVELS == 3
-        /* On PAE hosts the p2m has eight l3 entries, not four (see
-         * shadow_set_p2m_entry()) so we can't use l3_table_offset.
-         * Instead, just count the number of l3es from zero.  It's safe
-         * to do this because we already checked that the gfn is within
-         * the bounds of the p2m. */
-        l3e += (addr >> L3_PAGETABLE_SHIFT);
-#else
-        l3e += l3_table_offset(addr);
-#endif
-        if ( (l3e_get_flags(*l3e) & _PAGE_PRESENT) == 0 )
-        {
-            unmap_domain_page(l3e);
-            return _mfn(INVALID_MFN);
-        }
-        mfn = _mfn(l3e_get_pfn(*l3e));
-        unmap_domain_page(l3e);
-    }
-#endif
-
-    l2e = map_domain_page(mfn_x(mfn));
-    l2e += l2_table_offset(addr);
-    if ( (l2e_get_flags(*l2e) & _PAGE_PRESENT) == 0 )
-    {
-        unmap_domain_page(l2e);
-        return _mfn(INVALID_MFN);
-    }
-    mfn = _mfn(l2e_get_pfn(*l2e));
-    unmap_domain_page(l2e);
-
-    l1e = map_domain_page(mfn_x(mfn));
-    l1e += l1_table_offset(addr);
-    if ( (l1e_get_flags(*l1e) & _PAGE_PRESENT) == 0 )
-    {
-        unmap_domain_page(l1e);
-        return _mfn(INVALID_MFN);
-    }
-    mfn = _mfn(l1e_get_pfn(*l1e));
-    *t = p2m_flags_to_type(l1e_get_flags(*l1e));
-    unmap_domain_page(l1e);
-
-    ASSERT(mfn_valid(mfn) || !p2m_is_ram(*t));
-    return (p2m_is_valid(*t)) ? mfn : _mfn(INVALID_MFN);
 }
 
 #if P2M_AUDIT
@@ -614,6 +573,29 @@ static void audit_p2m(struct domain *d)
                         gfn += 1 << (L2_PAGETABLE_SHIFT - PAGE_SHIFT);
                         continue;
                     }
+
+                    /* check for large page */
+                    if ( l2e_get_flags(l2e[i2]) & _PAGE_PSE )
+                    {
+                        mfn = l2e_get_pfn(l2e[i2]);
+                        ASSERT(mfn_valid(_mfn(mfn)));
+                        for ( i1 = 0; i1 < L1_PAGETABLE_ENTRIES; i1++)
+                        {
+                            m2pfn = get_gpfn_from_mfn(mfn+i1);
+                            if ( m2pfn != (gfn + i1) )
+                            {
+                                pmbad++;
+                                P2M_PRINTK("mismatch: gfn %#lx -> mfn %#lx"
+                                           " -> gfn %#lx\n", gfn+i1, mfn+i1,
+                                           m2pfn);
+                                BUG();
+                            }
+                        }
+
+                        gfn += 1 << (L2_PAGETABLE_SHIFT - PAGE_SHIFT);
+                        continue;
+                    }
+
                     l1e = map_domain_page(mfn_x(_mfn(l2e_get_pfn(l2e[i2]))));
 
                     for ( i1 = 0; i1 < L1_PAGETABLE_ENTRIES; i1++, gfn++ )
@@ -664,38 +646,32 @@ static void audit_p2m(struct domain *d)
 #define audit_p2m(_d) do { (void)(_d); } while(0)
 #endif /* P2M_AUDIT */
 
-
-
 static void
-p2m_remove_page(struct domain *d, unsigned long gfn, unsigned long mfn)
-{
+p2m_remove_page(struct domain *d, unsigned long gfn, unsigned long mfn,
+                unsigned int order)
+{
+    int i;
     if ( !paging_mode_translate(d) )
         return;
-    P2M_DEBUG("removing gfn=%#lx mfn=%#lx\n", gfn, mfn);
-
-    set_p2m_entry(d, gfn, _mfn(INVALID_MFN), p2m_invalid);
-    set_gpfn_from_mfn(mfn, INVALID_M2P_ENTRY);
-}
-
-void
-guest_physmap_remove_page(struct domain *d, unsigned long gfn,
-                          unsigned long mfn)
-{
-    p2m_lock(d);
-    audit_p2m(d);
-    p2m_remove_page(d, gfn, mfn);
-    audit_p2m(d);
-    p2m_unlock(d);
+    P2M_DEBUG("removing gfn=%#lx mfn=%#lx, order=%d\n", gfn, mfn, order);
+
+    set_p2m_entry(d, gfn, _mfn(INVALID_MFN), order, p2m_invalid);
+    for (i = 0; i < (1UL << order); i++ )
+        set_gpfn_from_mfn(mfn+i, INVALID_M2P_ENTRY);
 }
 
 int
 guest_physmap_add_entry(struct domain *d, unsigned long gfn,
-                        unsigned long mfn, p2m_type_t t)
+                        unsigned long mfn, unsigned int order, p2m_type_t t)
 {
     unsigned long ogfn;
     p2m_type_t ot;
     mfn_t omfn;
     int rc = 0;
+    int i;
+
+    /* make sure gfn and mfn are aligned at order boundary */
+    ASSERT( !(gfn & ((1UL<<order)-1)) && !(mfn & ((1UL<<order)-1)) );
 
     if ( !paging_mode_translate(d) )
         return -EINVAL;
@@ -712,13 +688,14 @@ guest_physmap_add_entry(struct domain *d
     p2m_lock(d);
     audit_p2m(d);
 
-    P2M_DEBUG("adding gfn=%#lx mfn=%#lx\n", gfn, mfn);
+    P2M_DEBUG("adding gfn=%#lx mfn=%#lx order=%d\n", gfn, mfn, order);
 
     omfn = gfn_to_mfn(d, gfn, &ot);
     if ( p2m_is_ram(ot) )
     {
         ASSERT(mfn_valid(omfn));
-        set_gpfn_from_mfn(mfn_x(omfn), INVALID_M2P_ENTRY);
+        for ( i = 0; i < (1UL << order); i++ )
+            set_gpfn_from_mfn(mfn_x(omfn) + i, INVALID_M2P_ENTRY);
     }
 
     ogfn = mfn_to_gfn(d, _mfn(mfn));
@@ -741,21 +718,22 @@ guest_physmap_add_entry(struct domain *d
             P2M_DEBUG("old gfn=%#lx -> mfn %#lx\n",
                       ogfn , mfn_x(omfn));
             if ( mfn_x(omfn) == mfn )
-                p2m_remove_page(d, ogfn, mfn);
-        }
-    }
-
-    if ( mfn_valid(_mfn(mfn)) ) 
-    {
-        if ( !set_p2m_entry(d, gfn, _mfn(mfn), t) )
+                p2m_remove_page(d, ogfn, mfn, order);
+        }
+    }
+
+    if ( mfn_valid(_mfn(mfn)) )
+    {
+        if ( !set_p2m_entry(d, gfn, _mfn(mfn), order, t) )
             rc = -EINVAL;
-        set_gpfn_from_mfn(mfn, gfn);
+        for ( i = 0; i < (1UL << order); i++)
+            set_gpfn_from_mfn(mfn+i, gfn+i);
     }
     else
     {
         gdprintk(XENLOG_WARNING, "Adding bad mfn to p2m map (%#lx -> %#lx)\n",
                  gfn, mfn);
-        if ( !set_p2m_entry(d, gfn, _mfn(INVALID_MFN), p2m_invalid) )
+        if ( !set_p2m_entry(d, gfn, _mfn(INVALID_MFN), order, p2m_invalid) )
             rc = -EINVAL;
     }
 
@@ -774,7 +752,7 @@ void p2m_change_type_global(struct domai
     l1_pgentry_t l1e_content;
     l1_pgentry_t *l1e;
     l2_pgentry_t *l2e;
-    mfn_t l1mfn;
+    mfn_t l1mfn, l2mfn;
     int i1, i2;
 #if CONFIG_PAGING_LEVELS >= 3
     l3_pgentry_t *l3e;
@@ -819,12 +797,27 @@ void p2m_change_type_global(struct domai
             {
                 continue;
             }
+            l2mfn = _mfn(l3e_get_pfn(l3e[i3]));
             l2e = map_domain_page(l3e_get_pfn(l3e[i3]));
 #endif /* all levels... */
             for ( i2 = 0; i2 < L2_PAGETABLE_ENTRIES; i2++ )
             {
                 if ( !(l2e_get_flags(l2e[i2]) & _PAGE_PRESENT) )
                 {
+                    continue;
+                }
+
+                if ( (l2e_get_flags(l2e[i2]) & _PAGE_PSE) )
+                {
+                    flags = l2e_get_flags(l2e[i2]);
+                    if ( p2m_flags_to_type(flags) != ot )
+                        continue;
+                    mfn = l2e_get_pfn(l2e[i2]);
+                    gfn = get_gpfn_from_mfn(mfn);
+                    flags = p2m_type_to_flags(nt);
+                    l1e_content = l1e_from_pfn(mfn, flags | _PAGE_PSE);
+                    paging_write_p2m_entry(d, gfn, (l1_pgentry_t *)&l2e[i2],
+                                           l2mfn, l1e_content, 2);
                     continue;
                 }
 
@@ -878,7 +871,7 @@ p2m_type_t p2m_change_type(struct domain
 
     mfn = gfn_to_mfn(d, gfn, &pt);
     if ( pt == ot )
-        set_p2m_entry(d, gfn, mfn, nt);
+        set_p2m_entry(d, gfn, mfn, PAGE_SIZE_ORDER_4K, nt);
 
     p2m_unlock(d);
 
@@ -899,10 +892,11 @@ set_mmio_p2m_entry(struct domain *d, uns
     if ( p2m_is_ram(ot) )
     {
         ASSERT(mfn_valid(omfn));
+        /* 4K page modification. So we don't need to check page order */
         set_gpfn_from_mfn(mfn_x(omfn), INVALID_M2P_ENTRY);
     }
 
-    rc = set_p2m_entry(d, gfn, mfn, p2m_mmio_direct);
+    rc = set_p2m_entry(d, gfn, mfn, PAGE_SIZE_ORDER_4K, p2m_mmio_direct);
     if ( 0 == rc )
         gdprintk(XENLOG_ERR,
             "set_mmio_p2m_entry: set_p2m_entry failed! mfn=%08lx\n",
@@ -926,10 +920,148 @@ clear_mmio_p2m_entry(struct domain *d, u
             "clear_mmio_p2m_entry: gfn_to_mfn failed! gfn=%08lx\n", gfn);
         return 0;
     }
-    rc = set_p2m_entry(d, gfn, _mfn(INVALID_MFN), 0);
+    rc = set_p2m_entry(d, gfn, _mfn(INVALID_MFN), PAGE_SIZE_ORDER_4K, 0);
 
     return rc;
 }
+
+/************************************************/
+/*         PUBLIC INTERFACE FUNCTIONS           */
+/************************************************/
+/* Init the datastructures for later use by the p2m code */
+void p2m_init(struct domain *d)
+{
+    p2m_lock_init(d);
+    INIT_LIST_HEAD(&d->arch.p2m.pages);
+}
+
+void p2m_teardown(struct domain *d)
+/* Return all the p2m pages to Xen.
+ * We know we don't have any extra mappings to these pages */
+{
+    struct list_head *entry, *n;
+    struct page_info *pg;
+
+    p2m_lock(d);
+    d->arch.phys_table = pagetable_null();
+
+    list_for_each_safe(entry, n, &d->arch.p2m.pages)
+    {
+        pg = list_entry(entry, struct page_info, list);
+        list_del(entry);
+        d->arch.p2m.free_page(d, pg);
+    }
+    p2m_unlock(d);
+}
+
+mfn_t
+gfn_to_mfn_foreign(struct domain *d, unsigned long gfn, p2m_type_t *t)
+/* Read another domain's p2m entries */
+{
+    mfn_t mfn;
+    paddr_t addr = ((paddr_t)gfn) << PAGE_SHIFT;
+    l2_pgentry_t *l2e;
+    l1_pgentry_t *l1e;
+
+    ASSERT(paging_mode_translate(d));
+
+    /* XXX This is for compatibility with the old model, where anything not 
+     * XXX marked as RAM was considered to be emulated MMIO space.
+     * XXX Once we start explicitly registering MMIO regions in the p2m 
+     * XXX we will return p2m_invalid for unmapped gfns */
+    *t = p2m_mmio_dm;
+
+    mfn = pagetable_get_mfn(d->arch.phys_table);
+
+    if ( gfn > d->arch.p2m.max_mapped_pfn )
+        /* This pfn is higher than the highest the p2m map currently holds */
+        return _mfn(INVALID_MFN);
+
+#if CONFIG_PAGING_LEVELS >= 4
+    {
+        l4_pgentry_t *l4e = map_domain_page(mfn_x(mfn));
+        l4e += l4_table_offset(addr);
+        if ( (l4e_get_flags(*l4e) & _PAGE_PRESENT) == 0 )
+        {
+            unmap_domain_page(l4e);
+            return _mfn(INVALID_MFN);
+        }
+        mfn = _mfn(l4e_get_pfn(*l4e));
+        unmap_domain_page(l4e);
+    }
+#endif
+#if CONFIG_PAGING_LEVELS >= 3
+    {
+        l3_pgentry_t *l3e = map_domain_page(mfn_x(mfn));
+#if CONFIG_PAGING_LEVELS == 3
+        /* On PAE hosts the p2m has eight l3 entries, not four (see
+         * shadow_set_p2m_entry()) so we can't use l3_table_offset.
+         * Instead, just count the number of l3es from zero.  It's safe
+         * to do this because we already checked that the gfn is within
+         * the bounds of the p2m. */
+        l3e += (addr >> L3_PAGETABLE_SHIFT);
+#else
+        l3e += l3_table_offset(addr);
+#endif
+        if ( (l3e_get_flags(*l3e) & _PAGE_PRESENT) == 0 )
+        {
+            unmap_domain_page(l3e);
+            return _mfn(INVALID_MFN);
+        }
+        mfn = _mfn(l3e_get_pfn(*l3e));
+        unmap_domain_page(l3e);
+    }
+#endif
+
+    l2e = map_domain_page(mfn_x(mfn));
+    l2e += l2_table_offset(addr);
+    if ( (l2e_get_flags(*l2e) & _PAGE_PRESENT) == 0 )
+    {
+        unmap_domain_page(l2e);
+        return _mfn(INVALID_MFN);
+    }
+    else if ( (l2e_get_flags(*l2e) & _PAGE_PSE) ) 
+    {
+        mfn = _mfn(l2e_get_pfn(*l2e) + l1_table_offset(addr));
+        *t = p2m_flags_to_type(l2e_get_flags(*l2e));
+        unmap_domain_page(l2e);
+
+        ASSERT(mfn_valid(mfn) || !p2m_is_ram(*t));
+        return (p2m_is_valid(*t)) ? mfn : _mfn(INVALID_MFN);
+    }
+
+    mfn = _mfn(l2e_get_pfn(*l2e));
+    unmap_domain_page(l2e);
+
+    l1e = map_domain_page(mfn_x(mfn));
+    l1e += l1_table_offset(addr);
+    if ( (l1e_get_flags(*l1e) & _PAGE_PRESENT) == 0 )
+    {
+        unmap_domain_page(l1e);
+        return _mfn(INVALID_MFN);
+    }
+    mfn = _mfn(l1e_get_pfn(*l1e));
+    *t = p2m_flags_to_type(l1e_get_flags(*l1e));
+    unmap_domain_page(l1e);
+
+    ASSERT(mfn_valid(mfn) || !p2m_is_ram(*t));
+    return (p2m_is_valid(*t)) ? mfn : _mfn(INVALID_MFN);
+}
+
+void
+guest_physmap_remove_page(struct domain *d, unsigned long gfn,
+                          unsigned long mfn, unsigned int order)
+{
+    p2m_lock(d);
+    audit_p2m(d);
+    p2m_remove_page(d, gfn, mfn, order);
+    audit_p2m(d);
+    p2m_unlock(d);
+}
+
+/*************************************************/
+/*            INCOMPLETE FUNCTIONS               */
+/*************************************************/
 
 /*
  * Local variables:
diff -r 3191627e5ad6 xen/common/grant_table.c
--- a/xen/common/grant_table.c	Wed Oct 31 16:21:18 2007 +0000
+++ b/xen/common/grant_table.c	Wed Nov 07 11:13:23 2007 -0600
@@ -1128,7 +1128,7 @@ gnttab_transfer(
         spin_lock(&e->grant_table->lock);
 
         sha = &shared_entry(e->grant_table, gop.ref);
-        guest_physmap_add_page(e, sha->frame, mfn);
+        guest_physmap_add_page(e, sha->frame, mfn, PAGE_SIZE_ORDER_4K);
         sha->frame = mfn;
         wmb();
         sha->flags |= GTF_transfer_completed;
diff -r 3191627e5ad6 xen/common/memory.c
--- a/xen/common/memory.c	Wed Oct 31 16:21:18 2007 +0000
+++ b/xen/common/memory.c	Tue Nov 13 06:33:56 2007 -0600
@@ -119,30 +119,55 @@ static void populate_physmap(struct memo
         page = __alloc_domheap_pages(d, cpu, a->extent_order, a->memflags);
         if ( unlikely(page == NULL) ) 
         {
-            gdprintk(XENLOG_INFO, "Could not allocate order=%d extent: "
-                     "id=%d memflags=%x (%ld of %d)\n",
-                     a->extent_order, d->domain_id, a->memflags,
-                     i, a->nr_extents);
-            goto out;
-        }
-
-        mfn = page_to_mfn(page);
-
-        if ( unlikely(paging_mode_translate(d)) )
-        {
-            for ( j = 0; j < (1 << a->extent_order); j++ )
-                if ( guest_physmap_add_page(d, gpfn + j, mfn + j) )
-                    goto out;
-        }
-        else
-        {
-            for ( j = 0; j < (1 << a->extent_order); j++ )
-                set_gpfn_from_mfn(mfn + j, gpfn + j);
-
-            /* Inform the domain of the new page's machine address. */ 
-            if ( unlikely(__copy_to_guest_offset(a->extent_list, i, &mfn, 1)) )
-                goto out;
-        }
+            /* Fail unless the domain is in translate mode */
+            if ( !paging_mode_translate(d) )
+            {
+                gdprintk(XENLOG_INFO, "Could not allocate order=%d extent:"
+                         " id=%d memflags=%x (%ld of %d)\n",
+                         a->extent_order, d->domain_id, a->memflags,
+                         i, a->nr_extents);
+                goto out;
+            }
+
+            /* Try to satisfy the extent with 4KB pages instead */
+            for ( j = 0; j < (1 << a->extent_order); j++ )
+            {
+                page = __alloc_domheap_pages(d, cpu, 0, a->memflags);
+                if ( page == NULL )
+                {
+                    gdprintk(XENLOG_INFO, "Could not allocate order=%d extent:"
+                             " id=%d memflags=%x (%ld of %d)\n",
+                             0, d->domain_id, a->memflags,
+                             i, a->nr_extents);
+                    goto out;
+                }
+
+                mfn = page_to_mfn(page);
+
+                if ( guest_physmap_add_page(d, gpfn + j, mfn, 0) )
+                    goto out;
+            }
+        }
+        else /* successfully allocated a page of extent_order */
+        {
+            mfn = page_to_mfn(page);
+
+            if ( unlikely(paging_mode_translate(d)) )
+            {
+                if ( guest_physmap_add_page(d, gpfn, mfn, a->extent_order) )
+                    goto out;
+            }
+            else
+            {
+                for ( j = 0; j < (1 << a->extent_order); j++ )
+                    set_gpfn_from_mfn(mfn + j, gpfn + j);
+
+                /* Inform the domain of the new page's machine address. */
+                if ( unlikely(__copy_to_guest_offset(a->extent_list, i,
+                                                     &mfn, 1)) )
+                    goto out;
+            }
+        }
     }
 
  out:
@@ -175,7 +200,7 @@ int guest_remove_page(struct domain *d, 
     if ( test_and_clear_bit(_PGC_allocated, &page->count_info) )
         put_page(page);
 
-    guest_physmap_remove_page(d, gmfn, mfn);
+    guest_physmap_remove_page(d, gmfn, mfn, PAGE_SIZE_ORDER_4K);
 
     put_page(page);
 
@@ -425,7 +450,8 @@ static long memory_exchange(XEN_GUEST_HA
             if ( !test_and_clear_bit(_PGC_allocated, &page->count_info) )
                 BUG();
             mfn = page_to_mfn(page);
-            guest_physmap_remove_page(d, mfn_to_gmfn(d, mfn), mfn);
+            guest_physmap_remove_page(d, mfn_to_gmfn(d, mfn), mfn, 
+                                      PAGE_SIZE_ORDER_4K);
             put_page(page);
         }
 
@@ -447,8 +473,8 @@ static long memory_exchange(XEN_GUEST_HA
             if ( unlikely(paging_mode_translate(d)) )
             {
                 /* Ignore failure here. There's nothing we can do. */
-                for ( k = 0; k < (1UL << exch.out.extent_order); k++ )
-                    (void)guest_physmap_add_page(d, gpfn + k, mfn + k);
+                (void)guest_physmap_add_page(d, gpfn, mfn, 
+                                             exch.out.extent_order);
             }
             else
             {
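
Note on the memory.c change above: the fallback policy in populate_physmap() can be illustrated outside of Xen with a small stand-alone sketch. It is not the patch's code. The names populate_extent, alloc_fn, map_fn and the toy_* helpers are hypothetical and exist only for this illustration. The sketch assumes an allocator that returns -1 on failure and a map callback that returns 0 on success.

```c
#include <assert.h>

/* Hypothetical callbacks, not Xen APIs.  alloc returns the first frame
 * number of a run of 2^order frames, or -1 on failure; map returns 0
 * on success. */
typedef long (*alloc_fn)(unsigned int order);
typedef int (*map_fn)(unsigned long gfn, unsigned long mfn,
                      unsigned int order);

/* Try one allocation of the requested order; if that fails, fall back
 * to 2^order separate order-0 allocations, each mapped individually. */
static int populate_extent(unsigned long gfn, unsigned int order,
                           alloc_fn alloc, map_fn map)
{
    unsigned long j;
    long mfn;

    assert(alloc != 0 && map != 0);

    mfn = alloc(order);
    if ( mfn >= 0 )
        return map(gfn, (unsigned long)mfn, order);

    for ( j = 0; j < (1UL << order); j++ )
    {
        mfn = alloc(0);
        if ( mfn < 0 )
            return -1;
        if ( map(gfn + j, (unsigned long)mfn, 0) != 0 )
            return -1;
    }
    return 0;
}

/* Toy backends used only to exercise the sketch. */
static long toy_next_frame = 100;
static unsigned long toy_frames_mapped;
static unsigned long toy_map_calls;

/* Pretends that no contiguous run larger than one frame is available. */
static long toy_alloc_small_only(unsigned int order)
{
    if ( order != 0 )
        return -1;
    return toy_next_frame++;
}

static int toy_map(unsigned long gfn, unsigned long mfn, unsigned int order)
{
    (void)gfn;
    (void)mfn;
    toy_frames_mapped += 1UL << order;
    toy_map_calls++;
    return 0;
}
```

With toy_alloc_small_only, an order-9 request ends up as 512 order-0 mappings, which is the behaviour the patch aims for when a 2MB extent cannot be allocated for a translated guest.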
diff -r 3191627e5ad6 xen/include/asm-x86/mm.h
--- a/xen/include/asm-x86/mm.h	Wed Oct 31 16:21:18 2007 +0000
+++ b/xen/include/asm-x86/mm.h	Wed Nov 07 11:12:56 2007 -0600
@@ -128,6 +128,11 @@ static inline u32 pickle_domptr(struct d
 #else  
 #define SHADOW_MAX_ORDER 2 /* Need up to 16k allocs for 32-bit on PAE/64 */
 #endif
+
+/* Allocation orders: log2 of the number of contiguous 4KB page frames */
+#define PAGE_SIZE_ORDER_4K    0
+#define PAGE_SIZE_ORDER_2M    9
+#define PAGE_SIZE_ORDER_4M    10
 
 #define page_get_owner(_p)    (unpickle_domptr((_p)->u.inuse._domain))
 #define page_set_owner(_p,_d) ((_p)->u.inuse._domain = pickle_domptr(_d))
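
Note on the PAGE_SIZE_ORDER_* constants above: an order is the base-2 logarithm of the number of contiguous 4KB frames, so order 9 gives 512 frames (2MB) and order 10 gives 1024 frames (4MB). A minimal stand-alone sketch of that arithmetic follows. The helper names order_to_frames, order_to_bytes and gfn_is_aligned are hypothetical, and PAGE_SHIFT is assumed to be 12 as on x86.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT            12   /* assumed: 4KB base pages */
#define PAGE_SIZE_ORDER_4K    0
#define PAGE_SIZE_ORDER_2M    9
#define PAGE_SIZE_ORDER_4M    10

/* Number of 4KB frames covered by an allocation of the given order. */
static uint64_t order_to_frames(unsigned int order)
{
    assert(order < 52);
    return UINT64_C(1) << order;
}

/* Size in bytes of an allocation of the given order. */
static uint64_t order_to_bytes(unsigned int order)
{
    return order_to_frames(order) << PAGE_SHIFT;
}

/* A frame number can start a large page only if it is a multiple of
 * the number of frames that the large page covers. */
static int gfn_is_aligned(uint64_t gfn, unsigned int order)
{
    return (gfn & (order_to_frames(order) - 1)) == 0;
}
```

For example, gfn 0x200 (512) is a valid start for a 2MB mapping, while gfn 0x201 is not.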
diff -r 3191627e5ad6 xen/include/asm-x86/p2m.h
--- a/xen/include/asm-x86/p2m.h	Wed Oct 31 16:21:18 2007 +0000
+++ b/xen/include/asm-x86/p2m.h	Tue Nov 13 05:10:11 2007 -0600
@@ -42,7 +42,7 @@
  * paging_mode_external() guests this mapping is in the monitor table.)
  */
 #define phys_to_machine_mapping ((l1_pgentry_t *)RO_MPT_VIRT_START)
-
+#define phys_to_machine_large_page_mapping ((l2_pgentry_t *)SH_LINEAR_PT_VIRT_START)
 /*
  * The upper levels of the p2m pagetable always contain full rights; all 
  * variation in the access control bits is made in the level-1 PTEs.
@@ -98,6 +98,7 @@ static inline mfn_t gfn_to_mfn_current(u
 {
     mfn_t mfn = _mfn(INVALID_MFN);
     p2m_type_t p2mt = p2m_mmio_dm;
+    paddr_t addr = ((paddr_t)gfn) << PAGE_SHIFT;
     /* XXX This is for compatibility with the old model, where anything not 
      * XXX marked as RAM was considered to be emulated MMIO space.
      * XXX Once we start explicitly registering MMIO regions in the p2m 
@@ -106,28 +107,44 @@ static inline mfn_t gfn_to_mfn_current(u
     if ( gfn <= current->domain->arch.p2m.max_mapped_pfn )
     {
         l1_pgentry_t l1e = l1e_empty();
-        int ret;
+        l2_pgentry_t l2e = l2e_empty();
+        int ret, index;
 
         ASSERT(gfn < (RO_MPT_VIRT_END - RO_MPT_VIRT_START) 
                / sizeof(l1_pgentry_t));
 
         /* Need to __copy_from_user because the p2m is sparse and this
          * part might not exist */
-        ret = __copy_from_user(&l1e,
-                               &phys_to_machine_mapping[gfn],
-                               sizeof(l1e));
-
-        if ( ret == 0 ) {
-            p2mt = p2m_flags_to_type(l1e_get_flags(l1e));
-            ASSERT(l1e_get_pfn(l1e) != INVALID_MFN || !p2m_is_ram(p2mt));
+        index = gfn >> (L2_PAGETABLE_SHIFT - L1_PAGETABLE_SHIFT);
+        ret = __copy_from_user(&l2e,
+                               &phys_to_machine_large_page_mapping[index],
+                               sizeof(l2e));
+
+        if ( (ret == 0) && (l2e_get_flags(l2e) & _PAGE_PSE) ) {
+            p2mt = p2m_flags_to_type(l2e_get_flags(l2e));
+            ASSERT(l2e_get_pfn(l2e) != INVALID_MFN || !p2m_is_ram(p2mt));
             if ( p2m_is_valid(p2mt) )
-                mfn = _mfn(l1e_get_pfn(l1e));
-            else 
-                /* XXX see above */
+                mfn = _mfn(l2e_get_pfn(l2e) + l1_table_offset(addr));
+            else
                 p2mt = p2m_mmio_dm;
         }
+        else {
+            ret = __copy_from_user(&l1e,
+                                   &phys_to_machine_mapping[gfn],
+                                   sizeof(l1e));
+
+            if ( ret == 0 ) {
+                p2mt = p2m_flags_to_type(l1e_get_flags(l1e));
+                ASSERT(l1e_get_pfn(l1e) != INVALID_MFN || !p2m_is_ram(p2mt));
+                if ( p2m_is_valid(p2mt) )
+                    mfn = _mfn(l1e_get_pfn(l1e));
+                else 
+                    /* XXX see above */
+                    p2mt = p2m_mmio_dm;
+            }
+        }
     }
-
+    
     *t = p2mt;
     return mfn;
 }
@@ -202,21 +219,22 @@ void p2m_teardown(struct domain *d);
 
 /* Add a page to a domain's p2m table */
 int guest_physmap_add_entry(struct domain *d, unsigned long gfn,
-                             unsigned long mfn, p2m_type_t t);
+                            unsigned long mfn, unsigned int order,
+                            p2m_type_t t);
 
 /* Untyped version for RAM only, for compatibility 
  *
  * Return 0 for success
  */
 static inline int guest_physmap_add_page(struct domain *d, unsigned long gfn,
-                                         unsigned long mfn)
-{
-    return guest_physmap_add_entry(d, gfn, mfn, p2m_ram_rw);
+                                         unsigned long mfn, unsigned int order)
+{
+    return guest_physmap_add_entry(d, gfn, mfn, order, p2m_ram_rw);
 }
 
 /* Remove a page from a domain's p2m table */
 void guest_physmap_remove_page(struct domain *d, unsigned long gfn,
-                               unsigned long mfn);
+                               unsigned long mfn, unsigned int order);
 
 /* Change types across all p2m entries in a domain */
 void p2m_change_type_global(struct domain *d, p2m_type_t ot, p2m_type_t nt);
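
Note on the PSE path in gfn_to_mfn_current() and gfn_to_mfn_foreign() above: when the L2 entry has PSE set, the entry's frame number is the base of the large page, and the low bits of the gfn select the 4KB frame inside it. The sketch below shows only that arithmetic. It assumes the PAE/64-bit layout (512 L1 entries per L2 entry), and the helper names l2_index_of, l1_offset_of and superpage_gfn_to_mfn are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define L1_PAGETABLE_SHIFT 12
#define L2_PAGETABLE_SHIFT 21   /* assumed: PAE or 64-bit paging */
#define L1_ENTRIES (UINT64_C(1) << (L2_PAGETABLE_SHIFT - L1_PAGETABLE_SHIFT))

/* Index of the L2 entry that covers this gfn, counted from gfn 0. */
static uint64_t l2_index_of(uint64_t gfn)
{
    return gfn >> (L2_PAGETABLE_SHIFT - L1_PAGETABLE_SHIFT);
}

/* Offset, in 4KB frames, of this gfn inside its large page. */
static uint64_t l1_offset_of(uint64_t gfn)
{
    return gfn & (L1_ENTRIES - 1);
}

/* Translate a gfn through a PSE entry whose frame number is base_mfn.
 * The base of a large page must itself be large-page aligned. */
static uint64_t superpage_gfn_to_mfn(uint64_t base_mfn, uint64_t gfn)
{
    assert((base_mfn & (L1_ENTRIES - 1)) == 0);
    return base_mfn + l1_offset_of(gfn);
}
```

For gfn 0x12345 this selects L2 entry 0x91 and offset 0x145, so a large page based at mfn 0x80000 translates it to mfn 0x80145.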
diff -r 3191627e5ad6 xen/include/xen/lib.h
--- a/xen/include/xen/lib.h	Wed Oct 31 16:21:18 2007 +0000
+++ b/xen/include/xen/lib.h	Wed Nov 07 10:03:15 2007 -0600
@@ -45,7 +45,7 @@ struct domain;
 
 void cmdline_parse(char *cmdline);
 
-/*#define DEBUG_TRACE_DUMP*/
+#define DEBUG_TRACE_DUMP
 #ifdef DEBUG_TRACE_DUMP
 extern void debugtrace_dump(void);
 extern void debugtrace_printk(const char *fmt, ...);
diff -r 3191627e5ad6 xen/arch/x86/mm/p2m-orig.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/xen/arch/x86/mm/p2m-orig.c	Wed Nov 07 09:45:39 2007 -0600
@@ -0,0 +1,941 @@
+/******************************************************************************
+ * arch/x86/mm/p2m.c
+ *
+ * physical-to-machine mappings for automatically-translated domains.
+ *
+ * Parts of this code are Copyright (c) 2007 by Advanced Micro Devices.
+ * Parts of this code are Copyright (c) 2006-2007 by XenSource Inc.
+ * Parts of this code are Copyright (c) 2006 by Michael A Fetterman
+ * Parts based on earlier work by Michael A Fetterman, Ian Pratt et al.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <asm/domain.h>
+#include <asm/page.h>
+#include <asm/paging.h>
+#include <asm/p2m.h>
+#include <asm/iommu.h>
+
+/* Debugging and auditing of the P2M code? */
+#define P2M_AUDIT     0
+#define P2M_DEBUGGING 1
+
+/*
+ * The P2M lock.  This protects all updates to the p2m table.
+ * Updates are expected to be safe against concurrent reads,
+ * which do *not* require the lock.
+ *
+ * Locking discipline: always acquire this lock before the shadow or HAP one
+ */
+
+#define p2m_lock_init(_d)                            \
+    do {                                             \
+        spin_lock_init(&(_d)->arch.p2m.lock);        \
+        (_d)->arch.p2m.locker = -1;                  \
+        (_d)->arch.p2m.locker_function = "nobody";   \
+    } while (0)
+
+#define p2m_lock(_d)                                                \
+    do {                                                            \
+        if ( unlikely((_d)->arch.p2m.locker == current->processor) )\
+        {                                                           \
+            printk("Error: p2m lock held by %s\n",                  \
+                   (_d)->arch.p2m.locker_function);                 \
+            BUG();                                                  \
+        }                                                           \
+        spin_lock(&(_d)->arch.p2m.lock);                            \
+        ASSERT((_d)->arch.p2m.locker == -1);                        \
+        (_d)->arch.p2m.locker = current->processor;                 \
+        (_d)->arch.p2m.locker_function = __func__;                  \
+    } while (0)
+
+#define p2m_unlock(_d)                                              \
+    do {                                                            \
+        ASSERT((_d)->arch.p2m.locker == current->processor); \
+        (_d)->arch.p2m.locker = -1;                          \
+        (_d)->arch.p2m.locker_function = "nobody";           \
+        spin_unlock(&(_d)->arch.p2m.lock);                   \
+    } while (0)
+
+
+
+/* Printouts */
+#define P2M_PRINTK(_f, _a...)                                \
+    debugtrace_printk("p2m: %s(): " _f, __func__, ##_a)
+#define P2M_ERROR(_f, _a...)                                 \
+    printk("pg error: %s(): " _f, __func__, ##_a)
+#if P2M_DEBUGGING
+#define P2M_DEBUG(_f, _a...)                                 \
+    debugtrace_printk("p2mdebug: %s(): " _f, __func__, ##_a)
+#else
+#define P2M_DEBUG(_f, _a...) do { (void)(_f); } while(0)
+#endif
+
+
+/* Override macros from asm/page.h to make them work with mfn_t */
+#undef mfn_to_page
+#define mfn_to_page(_m) (frame_table + mfn_x(_m))
+#undef mfn_valid
+#define mfn_valid(_mfn) (mfn_x(_mfn) < max_page)
+#undef page_to_mfn
+#define page_to_mfn(_pg) (_mfn((_pg) - frame_table))
+
+
+/* PTE flags for the various types of p2m entry */
+#define P2M_BASE_FLAGS \
+        (_PAGE_PRESENT | _PAGE_USER | _PAGE_DIRTY | _PAGE_ACCESSED)
+
+static unsigned long p2m_type_to_flags(p2m_type_t t) 
+{
+    unsigned long flags = (t & 0x7UL) << 9;
+    switch(t)
+    {
+    case p2m_invalid:
+    default:
+        return flags;
+    case p2m_ram_rw:
+        return flags | P2M_BASE_FLAGS | _PAGE_RW;
+    case p2m_ram_logdirty:
+        return flags | P2M_BASE_FLAGS;
+    case p2m_ram_ro:
+        return flags | P2M_BASE_FLAGS;
+    case p2m_mmio_dm:
+        return flags;
+    case p2m_mmio_direct:
+        return flags | P2M_BASE_FLAGS | _PAGE_RW | _PAGE_PCD;
+    }
+}
+
+
+// Find the next level's P2M entry, checking for out-of-range gfn's...
+// Returns NULL on error.
+//
+static l1_pgentry_t *
+p2m_find_entry(void *table, unsigned long *gfn_remainder,
+                   unsigned long gfn, u32 shift, u32 max)
+{
+    u32 index;
+
+    index = *gfn_remainder >> shift;
+    if ( index >= max )
+    {
+        P2M_DEBUG("gfn=0x%lx out of range "
+                  "(gfn_remainder=0x%lx shift=%d index=0x%x max=0x%x)\n",
+                  gfn, *gfn_remainder, shift, index, max);
+        return NULL;
+    }
+    *gfn_remainder &= (1 << shift) - 1;
+    return (l1_pgentry_t *)table + index;
+}
+
+// Walk one level of the P2M table, allocating a new table if required.
+// Returns 0 on error.
+//
+static int
+p2m_next_level(struct domain *d, mfn_t *table_mfn, void **table,
+               unsigned long *gfn_remainder, unsigned long gfn, u32 shift,
+               u32 max, unsigned long type)
+{
+    l1_pgentry_t *p2m_entry;
+    l1_pgentry_t new_entry;
+    void *next;
+    ASSERT(d->arch.p2m.alloc_page);
+
+    if ( !(p2m_entry = p2m_find_entry(*table, gfn_remainder, gfn,
+                                      shift, max)) )
+        return 0;
+
+    if ( !(l1e_get_flags(*p2m_entry) & _PAGE_PRESENT) )
+    {
+        struct page_info *pg = d->arch.p2m.alloc_page(d);
+        if ( pg == NULL )
+            return 0;
+        list_add_tail(&pg->list, &d->arch.p2m.pages);
+        pg->u.inuse.type_info = type | 1 | PGT_validated;
+        pg->count_info = 1;
+
+        new_entry = l1e_from_pfn(mfn_x(page_to_mfn(pg)),
+                                 __PAGE_HYPERVISOR|_PAGE_USER);
+
+        switch ( type ) {
+        case PGT_l3_page_table:
+            paging_write_p2m_entry(d, gfn,
+                                   p2m_entry, *table_mfn, new_entry, 4);
+            break;
+        case PGT_l2_page_table:
+#if CONFIG_PAGING_LEVELS == 3
+            /* for PAE mode, PDPE only has PCD/PWT/P bits available */
+            new_entry = l1e_from_pfn(mfn_x(page_to_mfn(pg)), _PAGE_PRESENT);
+#endif
+            paging_write_p2m_entry(d, gfn,
+                                   p2m_entry, *table_mfn, new_entry, 3);
+            break;
+        case PGT_l1_page_table:
+            paging_write_p2m_entry(d, gfn,
+                                   p2m_entry, *table_mfn, new_entry, 2);
+            break;
+        default:
+            BUG();
+            break;
+        }
+    }
+    *table_mfn = _mfn(l1e_get_pfn(*p2m_entry));
+    next = map_domain_page(mfn_x(*table_mfn));
+    unmap_domain_page(*table);
+    *table = next;
+
+    return 1;
+}
+
+// Returns 0 on error (out of memory)
+static int
+set_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn, p2m_type_t p2mt)
+{
+    // XXX -- this might be able to be faster iff current->domain == d
+    mfn_t table_mfn = pagetable_get_mfn(d->arch.phys_table);
+    void *table =map_domain_page(mfn_x(table_mfn));
+    unsigned long gfn_remainder = gfn;
+    l1_pgentry_t *p2m_entry;
+    l1_pgentry_t entry_content;
+    int rv=0;
+
+#if CONFIG_PAGING_LEVELS >= 4
+    if ( !p2m_next_level(d, &table_mfn, &table, &gfn_remainder, gfn,
+                         L4_PAGETABLE_SHIFT - PAGE_SHIFT,
+                         L4_PAGETABLE_ENTRIES, PGT_l3_page_table) )
+        goto out;
+#endif
+#if CONFIG_PAGING_LEVELS >= 3
+    /*
+     * When using PAE Xen, we only allow 33 bits of pseudo-physical
+     * address in translated guests (i.e. 8 GBytes).  This restriction
+     * comes from wanting to map the P2M table into the 16MB RO_MPT hole
+     * in Xen's address space for translated PV guests.
+     * When using AMD's NPT on PAE Xen, we are restricted to 4GB.
+     */
+    if ( !p2m_next_level(d, &table_mfn, &table, &gfn_remainder, gfn,
+                         L3_PAGETABLE_SHIFT - PAGE_SHIFT,
+                         ((CONFIG_PAGING_LEVELS == 3)
+                          ? (hvm_funcs.hap_supported ? 4 : 8)
+                          : L3_PAGETABLE_ENTRIES),
+                         PGT_l2_page_table) )
+        goto out;
+#endif
+    if ( !p2m_next_level(d, &table_mfn, &table, &gfn_remainder, gfn,
+                         L2_PAGETABLE_SHIFT - PAGE_SHIFT,
+                         L2_PAGETABLE_ENTRIES, PGT_l1_page_table) )
+        goto out;
+
+    p2m_entry = p2m_find_entry(table, &gfn_remainder, gfn,
+                               0, L1_PAGETABLE_ENTRIES);
+    ASSERT(p2m_entry);
+
+    /* Track the highest gfn for which we have ever had a valid mapping */
+    if ( mfn_valid(mfn) && (gfn > d->arch.p2m.max_mapped_pfn) )
+        d->arch.p2m.max_mapped_pfn = gfn;
+
+    if ( mfn_valid(mfn) || (p2mt == p2m_mmio_direct) )
+        entry_content = l1e_from_pfn(mfn_x(mfn), p2m_type_to_flags(p2mt));
+    else
+        entry_content = l1e_empty();
+
+    /* level 1 entry */
+    paging_write_p2m_entry(d, gfn, p2m_entry, table_mfn, entry_content, 1);
+
+    if ( vtd_enabled && (p2mt == p2m_mmio_direct) && is_hvm_domain(d) )
+        iommu_flush(d, gfn, (u64*)p2m_entry);
+
+    /* Success */
+    rv = 1;
+
+ out:
+    unmap_domain_page(table);
+    return rv;
+}
+
+
+/* Init the datastructures for later use by the p2m code */
+void p2m_init(struct domain *d)
+{
+    p2m_lock_init(d);
+    INIT_LIST_HEAD(&d->arch.p2m.pages);
+}
+
+
+// Allocate a new p2m table for a domain.
+//
+// The structure of the p2m table is that of a pagetable for xen (i.e. it is
+// controlled by CONFIG_PAGING_LEVELS).
+//
+// The alloc_page and free_page functions will be used to get memory to
+// build the p2m, and to release it again at the end of day.
+//
+// Returns 0 for success or -errno.
+//
+int p2m_alloc_table(struct domain *d,
+                    struct page_info * (*alloc_page)(struct domain *d),
+                    void (*free_page)(struct domain *d, struct page_info *pg))
+
+{
+    mfn_t mfn;
+    struct list_head *entry;
+    struct page_info *page, *p2m_top;
+    unsigned int page_count = 0;
+    unsigned long gfn;
+
+    p2m_lock(d);
+
+    if ( pagetable_get_pfn(d->arch.phys_table) != 0 )
+    {
+        P2M_ERROR("p2m already allocated for this domain\n");
+        p2m_unlock(d);
+        return -EINVAL;
+    }
+
+    P2M_PRINTK("allocating p2m table\n");
+
+    d->arch.p2m.alloc_page = alloc_page;
+    d->arch.p2m.free_page = free_page;
+
+    p2m_top = d->arch.p2m.alloc_page(d);
+    if ( p2m_top == NULL )
+    {
+        p2m_unlock(d);
+        return -ENOMEM;
+    }
+    list_add_tail(&p2m_top->list, &d->arch.p2m.pages);
+
+    p2m_top->count_info = 1;
+    p2m_top->u.inuse.type_info =
+#if CONFIG_PAGING_LEVELS == 4
+        PGT_l4_page_table
+#elif CONFIG_PAGING_LEVELS == 3
+        PGT_l3_page_table
+#elif CONFIG_PAGING_LEVELS == 2
+        PGT_l2_page_table
+#endif
+        | 1 | PGT_validated;
+
+    d->arch.phys_table = pagetable_from_mfn(page_to_mfn(p2m_top));
+
+    P2M_PRINTK("populating p2m table\n");
+
+    /* Initialise physmap tables for slot zero. Other code assumes this. */
+    if ( !set_p2m_entry(d, 0, _mfn(INVALID_MFN), p2m_invalid) )
+        goto error;
+
+    /* Copy all existing mappings from the page list and m2p */
+    for ( entry = d->page_list.next;
+          entry != &d->page_list;
+          entry = entry->next )
+    {
+        page = list_entry(entry, struct page_info, list);
+        mfn = page_to_mfn(page);
+        gfn = get_gpfn_from_mfn(mfn_x(mfn));
+        page_count++;
+        if (
+#ifdef __x86_64__
+            (gfn != 0x5555555555555555L)
+#else
+            (gfn != 0x55555555L)
+#endif
+             && gfn != INVALID_M2P_ENTRY
+            && !set_p2m_entry(d, gfn, mfn, p2m_ram_rw) )
+            goto error;
+    }
+
+#if CONFIG_PAGING_LEVELS >= 3
+    if (vtd_enabled && is_hvm_domain(d))
+        iommu_set_pgd(d);
+#endif
+
+    P2M_PRINTK("p2m table initialised (%u pages)\n", page_count);
+    p2m_unlock(d);
+    return 0;
+
+ error:
+    P2M_PRINTK("failed to initialize p2m table, gfn=%05lx, mfn=%"
+               PRI_mfn "\n", gfn, mfn_x(mfn));
+    p2m_unlock(d);
+    return -ENOMEM;
+}
+
+void p2m_teardown(struct domain *d)
+/* Return all the p2m pages to Xen.
+ * We know we don't have any extra mappings to these pages */
+{
+    struct list_head *entry, *n;
+    struct page_info *pg;
+
+    p2m_lock(d);
+    d->arch.phys_table = pagetable_null();
+
+    list_for_each_safe(entry, n, &d->arch.p2m.pages)
+    {
+        pg = list_entry(entry, struct page_info, list);
+        list_del(entry);
+        d->arch.p2m.free_page(d, pg);
+    }
+    p2m_unlock(d);
+}
+
+mfn_t
+gfn_to_mfn_foreign(struct domain *d, unsigned long gfn, p2m_type_t *t)
+/* Read another domain's p2m entries */
+{
+    mfn_t mfn;
+    paddr_t addr = ((paddr_t)gfn) << PAGE_SHIFT;
+    l2_pgentry_t *l2e;
+    l1_pgentry_t *l1e;
+
+    ASSERT(paging_mode_translate(d));
+
+    /* XXX This is for compatibility with the old model, where anything not 
+     * XXX marked as RAM was considered to be emulated MMIO space.
+     * XXX Once we start explicitly registering MMIO regions in the p2m 
+     * XXX we will return p2m_invalid for unmapped gfns */
+    *t = p2m_mmio_dm;
+
+    mfn = pagetable_get_mfn(d->arch.phys_table);
+
+    if ( gfn > d->arch.p2m.max_mapped_pfn )
+        /* This pfn is higher than the highest the p2m map currently holds */
+        return _mfn(INVALID_MFN);
+
+#if CONFIG_PAGING_LEVELS >= 4
+    {
+        l4_pgentry_t *l4e = map_domain_page(mfn_x(mfn));
+        l4e += l4_table_offset(addr);
+        if ( (l4e_get_flags(*l4e) & _PAGE_PRESENT) == 0 )
+        {
+            unmap_domain_page(l4e);
+            return _mfn(INVALID_MFN);
+        }
+        mfn = _mfn(l4e_get_pfn(*l4e));
+        unmap_domain_page(l4e);
+    }
+#endif
+#if CONFIG_PAGING_LEVELS >= 3
+    {
+        l3_pgentry_t *l3e = map_domain_page(mfn_x(mfn));
+#if CONFIG_PAGING_LEVELS == 3
+        /* On PAE hosts the p2m has eight l3 entries, not four (see
+         * shadow_set_p2m_entry()) so we can't use l3_table_offset.
+         * Instead, just count the number of l3es from zero.  It's safe
+         * to do this because we already checked that the gfn is within
+         * the bounds of the p2m. */
+        l3e += (addr >> L3_PAGETABLE_SHIFT);
+#else
+        l3e += l3_table_offset(addr);
+#endif
+        if ( (l3e_get_flags(*l3e) & _PAGE_PRESENT) == 0 )
+        {
+            unmap_domain_page(l3e);
+            return _mfn(INVALID_MFN);
+        }
+        mfn = _mfn(l3e_get_pfn(*l3e));
+        unmap_domain_page(l3e);
+    }
+#endif
+
+    l2e = map_domain_page(mfn_x(mfn));
+    l2e += l2_table_offset(addr);
+    if ( (l2e_get_flags(*l2e) & _PAGE_PRESENT) == 0 )
+    {
+        unmap_domain_page(l2e);
+        return _mfn(INVALID_MFN);
+    }
+    mfn = _mfn(l2e_get_pfn(*l2e));
+    unmap_domain_page(l2e);
+
+    l1e = map_domain_page(mfn_x(mfn));
+    l1e += l1_table_offset(addr);
+    if ( (l1e_get_flags(*l1e) & _PAGE_PRESENT) == 0 )
+    {
+        unmap_domain_page(l1e);
+        return _mfn(INVALID_MFN);
+    }
+    mfn = _mfn(l1e_get_pfn(*l1e));
+    *t = p2m_flags_to_type(l1e_get_flags(*l1e));
+    unmap_domain_page(l1e);
+
+    ASSERT(mfn_valid(mfn) || !p2m_is_ram(*t));
+    return (p2m_is_valid(*t)) ? mfn : _mfn(INVALID_MFN);
+}
+
+#if P2M_AUDIT
+static void audit_p2m(struct domain *d)
+{
+    struct list_head *entry;
+    struct page_info *page;
+    struct domain *od;
+    unsigned long mfn, gfn, m2pfn, lp2mfn = 0;
+    mfn_t p2mfn;
+    unsigned long orphans_d = 0, orphans_i = 0, mpbad = 0, pmbad = 0;
+    int test_linear;
+
+    if ( !paging_mode_translate(d) )
+        return;
+
+    //P2M_PRINTK("p2m audit starts\n");
+
+    test_linear = ( (d == current->domain)
+                    && !pagetable_is_null(current->arch.monitor_table) );
+    if ( test_linear )
+        flush_tlb_local();
+
+    /* Audit part one: walk the domain's page allocation list, checking
+     * the m2p entries. */
+    for ( entry = d->page_list.next;
+          entry != &d->page_list;
+          entry = entry->next )
+    {
+        page = list_entry(entry, struct page_info, list);
+        mfn = mfn_x(page_to_mfn(page));
+
+        // P2M_PRINTK("auditing guest page, mfn=%#lx\n", mfn);
+
+        od = page_get_owner(page);
+
+        if ( od != d )
+        {
+            P2M_PRINTK("wrong owner %#lx -> %p(%u) != %p(%u)\n",
+                       mfn, od, (od?od->domain_id:-1), d, d->domain_id);
+            continue;
+        }
+
+        gfn = get_gpfn_from_mfn(mfn);
+        if ( gfn == INVALID_M2P_ENTRY )
+        {
+            orphans_i++;
+            //P2M_PRINTK("orphaned guest page: mfn=%#lx has invalid gfn\n",
+            //               mfn);
+            continue;
+        }
+
+        if ( gfn == 0x55555555 )
+        {
+            orphans_d++;
+            //P2M_PRINTK("orphaned guest page: mfn=%#lx has debug gfn\n",
+            //               mfn);
+            continue;
+        }
+
+        p2mfn = gfn_to_mfn_foreign(d, gfn);
+        if ( mfn_x(p2mfn) != mfn )
+        {
+            mpbad++;
+            P2M_PRINTK("map mismatch mfn %#lx -> gfn %#lx -> mfn %#lx"
+                       " (-> gfn %#lx)\n",
+                       mfn, gfn, mfn_x(p2mfn),
+                       (mfn_valid(p2mfn)
+                        ? get_gpfn_from_mfn(mfn_x(p2mfn))
+                        : -1u));
+            /* This m2p entry is stale: the domain has another frame in
+             * this physical slot.  No great disaster, but for neatness,
+             * blow away the m2p entry. */
+            set_gpfn_from_mfn(mfn, INVALID_M2P_ENTRY, __PAGE_HYPERVISOR|_PAGE_USER);
+        }
+
+        if ( test_linear && (gfn <= d->arch.p2m.max_mapped_pfn) )
+        {
+            lp2mfn = mfn_x(gfn_to_mfn_current(gfn));
+            if ( lp2mfn != mfn_x(p2mfn) )
+            {
+                P2M_PRINTK("linear mismatch gfn %#lx -> mfn %#lx "
+                           "(!= mfn %#lx)\n", gfn, lp2mfn, mfn_x(p2mfn));
+            }
+        }
+
+        // P2M_PRINTK("OK: mfn=%#lx, gfn=%#lx, p2mfn=%#lx, lp2mfn=%#lx\n",
+        //                mfn, gfn, p2mfn, lp2mfn);
+    }
+
+    /* Audit part two: walk the domain's p2m table, checking the entries. */
+    if ( pagetable_get_pfn(d->arch.phys_table) != 0 )
+    {
+        l2_pgentry_t *l2e;
+        l1_pgentry_t *l1e;
+        int i1, i2;
+
+#if CONFIG_PAGING_LEVELS == 4
+        l4_pgentry_t *l4e;
+        l3_pgentry_t *l3e;
+        int i3, i4;
+        l4e = map_domain_page(mfn_x(pagetable_get_mfn(d->arch.phys_table)));
+#elif CONFIG_PAGING_LEVELS == 3
+        l3_pgentry_t *l3e;
+        int i3;
+        l3e = map_domain_page(mfn_x(pagetable_get_mfn(d->arch.phys_table)));
+#else /* CONFIG_PAGING_LEVELS == 2 */
+        l2e = map_domain_page(mfn_x(pagetable_get_mfn(d->arch.phys_table)));
+#endif
+
+        gfn = 0;
+#if CONFIG_PAGING_LEVELS >= 3
+#if CONFIG_PAGING_LEVELS >= 4
+        for ( i4 = 0; i4 < L4_PAGETABLE_ENTRIES; i4++ )
+        {
+            if ( !(l4e_get_flags(l4e[i4]) & _PAGE_PRESENT) )
+            {
+                gfn += 1 << (L4_PAGETABLE_SHIFT - PAGE_SHIFT);
+                continue;
+            }
+            l3e = map_domain_page(mfn_x(_mfn(l4e_get_pfn(l4e[i4]))));
+#endif /* now at levels 3 or 4... */
+            for ( i3 = 0;
+                  i3 < ((CONFIG_PAGING_LEVELS==4) ? L3_PAGETABLE_ENTRIES : 8);
+                  i3++ )
+            {
+                if ( !(l3e_get_flags(l3e[i3]) & _PAGE_PRESENT) )
+                {
+                    gfn += 1 << (L3_PAGETABLE_SHIFT - PAGE_SHIFT);
+                    continue;
+                }
+                l2e = map_domain_page(mfn_x(_mfn(l3e_get_pfn(l3e[i3]))));
+#endif /* all levels... */
+                for ( i2 = 0; i2 < L2_PAGETABLE_ENTRIES; i2++ )
+                {
+                    if ( !(l2e_get_flags(l2e[i2]) & _PAGE_PRESENT) )
+                    {
+                        gfn += 1 << (L2_PAGETABLE_SHIFT - PAGE_SHIFT);
+                        continue;
+                    }
+                    l1e = map_domain_page(mfn_x(_mfn(l2e_get_pfn(l2e[i2]))));
+
+                    for ( i1 = 0; i1 < L1_PAGETABLE_ENTRIES; i1++, gfn++ )
+                    {
+                        if ( !(l1e_get_flags(l1e[i1]) & _PAGE_PRESENT) )
+                            continue;
+                        mfn = l1e_get_pfn(l1e[i1]);
+                        ASSERT(mfn_valid(_mfn(mfn)));
+                        m2pfn = get_gpfn_from_mfn(mfn);
+                        if ( m2pfn != gfn )
+                        {
+                            pmbad++;
+                            P2M_PRINTK("mismatch: gfn %#lx -> mfn %#lx"
+                                       " -> gfn %#lx\n", gfn, mfn, m2pfn);
+                            BUG();
+                        }
+                    }
+                    unmap_domain_page(l1e);
+                }
+#if CONFIG_PAGING_LEVELS >= 3
+                unmap_domain_page(l2e);
+            }
+#if CONFIG_PAGING_LEVELS >= 4
+            unmap_domain_page(l3e);
+        }
+#endif
+#endif
+
+#if CONFIG_PAGING_LEVELS == 4
+        unmap_domain_page(l4e);
+#elif CONFIG_PAGING_LEVELS == 3
+        unmap_domain_page(l3e);
+#else /* CONFIG_PAGING_LEVELS == 2 */
+        unmap_domain_page(l2e);
+#endif
+
+    }
+
+    //P2M_PRINTK("p2m audit complete\n");
+    //if ( orphans_i | orphans_d | mpbad | pmbad )
+    //    P2M_PRINTK("p2m audit found %lu orphans (%lu inval %lu debug)\n",
+    //                   orphans_i + orphans_d, orphans_i, orphans_d,
+    if ( mpbad | pmbad )
+        P2M_PRINTK("p2m audit found %lu odd p2m, %lu bad m2p entries\n",
+                   pmbad, mpbad);
+}
+#else
+#define audit_p2m(_d) do { (void)(_d); } while(0)
+#endif /* P2M_AUDIT */
+
+
+
+static void
+p2m_remove_page(struct domain *d, unsigned long gfn, unsigned long mfn)
+{
+    if ( !paging_mode_translate(d) )
+        return;
+    P2M_DEBUG("removing gfn=%#lx mfn=%#lx\n", gfn, mfn);
+
+    set_p2m_entry(d, gfn, _mfn(INVALID_MFN), p2m_invalid);
+    set_gpfn_from_mfn(mfn, INVALID_M2P_ENTRY);
+}
+
+void
+guest_physmap_remove_page(struct domain *d, unsigned long gfn,
+                          unsigned long mfn)
+{
+    p2m_lock(d);
+    audit_p2m(d);
+    p2m_remove_page(d, gfn, mfn);
+    audit_p2m(d);
+    p2m_unlock(d);
+}
+
+int
+guest_physmap_add_entry(struct domain *d, unsigned long gfn,
+                        unsigned long mfn, p2m_type_t t)
+{
+    unsigned long ogfn;
+    p2m_type_t ot;
+    mfn_t omfn;
+    int rc = 0;
+
+    if ( !paging_mode_translate(d) )
+        return -EINVAL;
+
+#if CONFIG_PAGING_LEVELS == 3
+    /* 32-bit PAE nested paging does not support guests larger than 4GB
+     * because of a hardware translation limit. We enforce this by rejecting
+     * any gfn above 0xfffffUL (the last 4KB frame below 4GB).
+     */
+    if ( paging_mode_hap(d) && (gfn > 0xfffffUL) )
+        return -EINVAL;
+#endif
+
+    p2m_lock(d);
+    audit_p2m(d);
+
+    P2M_DEBUG("adding gfn=%#lx mfn=%#lx\n", gfn, mfn);
+
+    omfn = gfn_to_mfn(d, gfn, &ot);
+    if ( p2m_is_ram(ot) )
+    {
+        ASSERT(mfn_valid(omfn));
+        set_gpfn_from_mfn(mfn_x(omfn), INVALID_M2P_ENTRY);
+    }
+
+    ogfn = mfn_to_gfn(d, _mfn(mfn));
+    if (
+#ifdef __x86_64__
+        (ogfn != 0x5555555555555555L)
+#else
+        (ogfn != 0x55555555L)
+#endif
+        && (ogfn != INVALID_M2P_ENTRY)
+        && (ogfn != gfn) )
+    {
+        /* This machine frame is already mapped at another physical address */
+        P2M_DEBUG("aliased! mfn=%#lx, old gfn=%#lx, new gfn=%#lx\n",
+                  mfn, ogfn, gfn);
+        omfn = gfn_to_mfn(d, ogfn, &ot);
+        if ( p2m_is_ram(ot) )
+        {
+            ASSERT(mfn_valid(omfn));
+            P2M_DEBUG("old gfn=%#lx -> mfn %#lx\n",
+                      ogfn , mfn_x(omfn));
+            if ( mfn_x(omfn) == mfn )
+                p2m_remove_page(d, ogfn, mfn);
+        }
+    }
+
+    if ( mfn_valid(_mfn(mfn)) ) 
+    {
+        if ( !set_p2m_entry(d, gfn, _mfn(mfn), t) )
+            rc = -EINVAL;
+        set_gpfn_from_mfn(mfn, gfn);
+    }
+    else
+    {
+        gdprintk(XENLOG_WARNING, "Adding bad mfn to p2m map (%#lx -> %#lx)\n",
+                 gfn, mfn);
+        if ( !set_p2m_entry(d, gfn, _mfn(INVALID_MFN), p2m_invalid) )
+            rc = -EINVAL;
+    }
+
+    audit_p2m(d);
+    p2m_unlock(d);
+
+    return rc;
+}
+
+/* Walk the whole p2m table, changing any entries of the old type
+ * to the new type.  This is used in hardware-assisted paging to 
+ * quickly enable or disable log-dirty tracking */
+void p2m_change_type_global(struct domain *d, p2m_type_t ot, p2m_type_t nt)
+{
+    unsigned long mfn, gfn, flags;
+    l1_pgentry_t l1e_content;
+    l1_pgentry_t *l1e;
+    l2_pgentry_t *l2e;
+    mfn_t l1mfn;
+    int i1, i2;
+#if CONFIG_PAGING_LEVELS >= 3
+    l3_pgentry_t *l3e;
+    int i3;
+#if CONFIG_PAGING_LEVELS == 4
+    l4_pgentry_t *l4e;
+    int i4;
+#endif /* CONFIG_PAGING_LEVELS == 4 */
+#endif /* CONFIG_PAGING_LEVELS >= 3 */
+
+    if ( !paging_mode_translate(d) )
+        return;
+
+    if ( pagetable_get_pfn(d->arch.phys_table) == 0 )
+        return;
+
+    p2m_lock(d);
+
+#if CONFIG_PAGING_LEVELS == 4
+    l4e = map_domain_page(mfn_x(pagetable_get_mfn(d->arch.phys_table)));
+#elif CONFIG_PAGING_LEVELS == 3
+    l3e = map_domain_page(mfn_x(pagetable_get_mfn(d->arch.phys_table)));
+#else /* CONFIG_PAGING_LEVELS == 2 */
+    l2e = map_domain_page(mfn_x(pagetable_get_mfn(d->arch.phys_table)));
+#endif
+
+#if CONFIG_PAGING_LEVELS >= 3
+#if CONFIG_PAGING_LEVELS >= 4
+    for ( i4 = 0; i4 < L4_PAGETABLE_ENTRIES; i4++ )
+    {
+        if ( !(l4e_get_flags(l4e[i4]) & _PAGE_PRESENT) )
+        {
+            continue;
+        }
+        l3e = map_domain_page(l4e_get_pfn(l4e[i4]));
+#endif /* now at levels 3 or 4... */
+        for ( i3 = 0;
+              i3 < ((CONFIG_PAGING_LEVELS==4) ? L3_PAGETABLE_ENTRIES : 8);
+              i3++ )
+        {
+            if ( !(l3e_get_flags(l3e[i3]) & _PAGE_PRESENT) )
+            {
+                continue;
+            }
+            l2e = map_domain_page(l3e_get_pfn(l3e[i3]));
+#endif /* all levels... */
+            for ( i2 = 0; i2 < L2_PAGETABLE_ENTRIES; i2++ )
+            {
+                if ( !(l2e_get_flags(l2e[i2]) & _PAGE_PRESENT) )
+                {
+                    continue;
+                }
+
+                l1mfn = _mfn(l2e_get_pfn(l2e[i2]));
+                l1e = map_domain_page(mfn_x(l1mfn));
+
+                for ( i1 = 0; i1 < L1_PAGETABLE_ENTRIES; i1++, gfn++ )
+                {
+                    flags = l1e_get_flags(l1e[i1]);
+                    if ( p2m_flags_to_type(flags) != ot )
+                        continue;
+                    mfn = l1e_get_pfn(l1e[i1]);
+                    gfn = get_gpfn_from_mfn(mfn);
+                    /* create a new l1e entry with the new type */
+                    flags = p2m_type_to_flags(nt);
+                    l1e_content = l1e_from_pfn(mfn, flags);
+                    paging_write_p2m_entry(d, gfn, &l1e[i1],
+                                           l1mfn, l1e_content, 1);
+                }
+                unmap_domain_page(l1e);
+            }
+#if CONFIG_PAGING_LEVELS >= 3
+            unmap_domain_page(l2e);
+        }
+#if CONFIG_PAGING_LEVELS >= 4
+        unmap_domain_page(l3e);
+    }
+#endif
+#endif
+
+#if CONFIG_PAGING_LEVELS == 4
+    unmap_domain_page(l4e);
+#elif CONFIG_PAGING_LEVELS == 3
+    unmap_domain_page(l3e);
+#else /* CONFIG_PAGING_LEVELS == 2 */
+    unmap_domain_page(l2e);
+#endif
+
+    p2m_unlock(d);
+}
+
+/* Modify the p2m type of a single gfn from ot to nt, returning the 
+ * entry's previous type */
+p2m_type_t p2m_change_type(struct domain *d, unsigned long gfn, 
+                           p2m_type_t ot, p2m_type_t nt)
+{
+    p2m_type_t pt;
+    mfn_t mfn;
+
+    p2m_lock(d);
+
+    mfn = gfn_to_mfn(d, gfn, &pt);
+    if ( pt == ot )
+        set_p2m_entry(d, gfn, mfn, nt);
+
+    p2m_unlock(d);
+
+    return pt;
+}
+
+int
+set_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn)
+{
+    int rc = 0;
+    p2m_type_t ot;
+    mfn_t omfn;
+
+    if ( !paging_mode_translate(d) )
+        return 0;
+
+    omfn = gfn_to_mfn(d, gfn, &ot);
+    if ( p2m_is_ram(ot) )
+    {
+        ASSERT(mfn_valid(omfn));
+        set_gpfn_from_mfn(mfn_x(omfn), INVALID_M2P_ENTRY);
+    }
+
+    rc = set_p2m_entry(d, gfn, mfn, p2m_mmio_direct);
+    if ( 0 == rc )
+        gdprintk(XENLOG_ERR,
+            "set_mmio_p2m_entry: set_p2m_entry failed! mfn=%08lx\n",
+            gmfn_to_mfn(d, gfn));
+    return rc;
+}
+
+int
+clear_mmio_p2m_entry(struct domain *d, unsigned long gfn)
+{
+    int rc = 0;
+    unsigned long mfn;
+
+    if ( !paging_mode_translate(d) )
+        return 0;
+
+    mfn = gmfn_to_mfn(d, gfn);
+    if ( INVALID_MFN == mfn )
+    {
+        gdprintk(XENLOG_ERR,
+            "clear_mmio_p2m_entry: gfn_to_mfn failed! gfn=%08lx\n", gfn);
+        return 0;
+    }
+    rc = set_p2m_entry(d, gfn, _mfn(INVALID_MFN), 0);
+
+    return rc;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-set-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff -r 3191627e5ad6 xen/include/asm-x86/p2m-orig.h
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/xen/include/asm-x86/p2m-orig.h	Wed Nov 07 09:46:02 2007 -0600
@@ -0,0 +1,241 @@
+/******************************************************************************
+ * include/asm-x86/paging.h
+ *
+ * physical-to-machine mappings for automatically-translated domains.
+ *
+ * Copyright (c) 2007 Advanced Micro Devices (Wei Huang)
+ * Parts of this code are Copyright (c) 2006-2007 by XenSource Inc.
+ * Parts of this code are Copyright (c) 2006 by Michael A Fetterman
+ * Parts based on earlier work by Michael A Fetterman, Ian Pratt et al.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef _XEN_P2M_H
+#define _XEN_P2M_H
+
+
+/*
+ * The phys_to_machine_mapping maps guest physical frame numbers 
+ * to machine frame numbers.  It only exists for paging_mode_translate 
+ * guests. It is organised in page-table format, which:
+ *
+ * (1) allows us to use it directly as the second pagetable in hardware-
+ *     assisted paging and (hopefully) iommu support; and 
+ * (2) lets us map it directly into the guest vcpus' virtual address space 
+ *     as a linear pagetable, so we can read and write it easily.
+ *
+ * For (2) we steal the address space that would have normally been used
+ * by the read-only MPT map in a non-translated guest.  (For 
+ * paging_mode_external() guests this mapping is in the monitor table.)
+ */
+#define phys_to_machine_mapping ((l1_pgentry_t *)RO_MPT_VIRT_START)
+
+/*
+ * The upper levels of the p2m pagetable always contain full rights; all 
+ * variation in the access control bits is made in the level-1 PTEs.
+ * 
+ * In addition to the phys-to-machine translation, each p2m PTE contains
+ * *type* information about the gfn it translates, helping Xen to decide
+ * on the correct course of action when handling a page-fault to that
+ * guest frame.  We store the type in the "available" bits of the PTEs
+ * in the table, which gives us 8 possible types on 32-bit systems.
+ * Further expansions of the type system will only be supported on
+ * 64-bit Xen.
+ */
+typedef enum {
+    p2m_invalid = 0,            /* Nothing mapped here */
+    p2m_ram_rw = 1,             /* Normal read/write guest RAM */
+    p2m_ram_logdirty = 2,       /* Temporarily read-only for log-dirty */
+    p2m_ram_ro = 3,             /* Read-only; writes go to the device model */
+    p2m_mmio_dm = 4,            /* Reads and writes go to the device model */
+    p2m_mmio_direct = 5,        /* Read/write mapping of genuine MMIO area */
+} p2m_type_t;
+
+/* We use bitmaps and masks to handle groups of types */
+#define p2m_to_mask(_t) (1UL << (_t))
+
+/* RAM types, which map to real machine frames */
+#define P2M_RAM_TYPES (p2m_to_mask(p2m_ram_rw)          \
+                       | p2m_to_mask(p2m_ram_logdirty)  \
+                       | p2m_to_mask(p2m_ram_ro))
+
+/* MMIO types, which don't have to map to anything in the frametable */
+#define P2M_MMIO_TYPES (p2m_to_mask(p2m_mmio_dm)        \
+                        | p2m_to_mask(p2m_mmio_direct))
+
+/* Read-only types, which must have the _PAGE_RW bit clear in their PTEs */
+#define P2M_RO_TYPES (p2m_to_mask(p2m_ram_logdirty)     \
+                      | p2m_to_mask(p2m_ram_ro))
+
+/* Useful predicates */
+#define p2m_is_ram(_t) (p2m_to_mask(_t) & P2M_RAM_TYPES)
+#define p2m_is_mmio(_t) (p2m_to_mask(_t) & P2M_MMIO_TYPES)
+#define p2m_is_readonly(_t) (p2m_to_mask(_t) & P2M_RO_TYPES)
+#define p2m_is_valid(_t) (p2m_to_mask(_t) & (P2M_RAM_TYPES | P2M_MMIO_TYPES))
+
+/* Extract the type from the PTE flags that store it */
+static inline p2m_type_t p2m_flags_to_type(unsigned long flags)
+{
+    /* Type is stored in the "available" bits, 9, 10 and 11 */
+    return (flags >> 9) & 0x7;
+}
+ 
+/* Read the current domain's p2m table (through the linear mapping). */
+static inline mfn_t gfn_to_mfn_current(unsigned long gfn, p2m_type_t *t)
+{
+    mfn_t mfn = _mfn(INVALID_MFN);
+    p2m_type_t p2mt = p2m_mmio_dm;
+    /* XXX This is for compatibility with the old model, where anything not 
+     * XXX marked as RAM was considered to be emulated MMIO space.
+     * XXX Once we start explicitly registering MMIO regions in the p2m 
+     * XXX we will return p2m_invalid for unmapped gfns */
+
+    if ( gfn <= current->domain->arch.p2m.max_mapped_pfn )
+    {
+        l1_pgentry_t l1e = l1e_empty();
+        int ret;
+
+        ASSERT(gfn < (RO_MPT_VIRT_END - RO_MPT_VIRT_START) 
+               / sizeof(l1_pgentry_t));
+
+        /* Need to __copy_from_user because the p2m is sparse and this
+         * part might not exist */
+        ret = __copy_from_user(&l1e,
+                               &phys_to_machine_mapping[gfn],
+                               sizeof(l1e));
+
+        if ( ret == 0 ) {
+            p2mt = p2m_flags_to_type(l1e_get_flags(l1e));
+            ASSERT(l1e_get_pfn(l1e) != INVALID_MFN || !p2m_is_ram(p2mt));
+            if ( p2m_is_valid(p2mt) )
+                mfn = _mfn(l1e_get_pfn(l1e));
+            else 
+                /* XXX see above */
+                p2mt = p2m_mmio_dm;
+        }
+    }
+
+    *t = p2mt;
+    return mfn;
+}
+
+/* Read another domain's P2M table, mapping pages as we go */
+mfn_t gfn_to_mfn_foreign(struct domain *d, unsigned long gfn, p2m_type_t *t);
+
+/* General conversion function from gfn to mfn */
+#define gfn_to_mfn(d, g, t) _gfn_to_mfn((d), (g), (t))
+static inline mfn_t _gfn_to_mfn(struct domain *d,
+                                unsigned long gfn, p2m_type_t *t)
+{
+    if ( !paging_mode_translate(d) )
+    {
+        /* Not necessarily true, but for non-translated guests, we claim
+         * it's the most generic kind of memory */
+        *t = p2m_ram_rw;
+        return _mfn(gfn);
+    }
+    if ( likely(current->domain == d) )
+        return gfn_to_mfn_current(gfn, t);
+    else 
+        return gfn_to_mfn_foreign(d, gfn, t);
+}
+
+/* Compatibility function exporting the old untyped interface */
+static inline unsigned long gmfn_to_mfn(struct domain *d, unsigned long gpfn)
+{
+    mfn_t mfn;
+    p2m_type_t t;
+    mfn = gfn_to_mfn(d, gpfn, &t);
+    if ( p2m_is_valid(t) )
+        return mfn_x(mfn);
+    return INVALID_MFN;
+}
+
+/* General conversion function from mfn to gfn */
+static inline unsigned long mfn_to_gfn(struct domain *d, mfn_t mfn)
+{
+    if ( paging_mode_translate(d) )
+        return get_gpfn_from_mfn(mfn_x(mfn));
+    else
+        return mfn_x(mfn);
+}
+
+/* Translate the frame number held in an l1e from guest to machine */
+static inline l1_pgentry_t
+gl1e_to_ml1e(struct domain *d, l1_pgentry_t l1e)
+{
+    if ( unlikely(paging_mode_translate(d)) )
+        l1e = l1e_from_pfn(gmfn_to_mfn(d, l1e_get_pfn(l1e)),
+                           l1e_get_flags(l1e));
+    return l1e;
+}
+
+
+/* Init the datastructures for later use by the p2m code */
+void p2m_init(struct domain *d);
+
+/* Allocate a new p2m table for a domain. 
+ *
+ * The alloc_page and free_page functions will be used to get memory to
+ * build the p2m, and to release it again at the end of day. 
+ *
+ * Returns 0 for success or -errno. */
+int p2m_alloc_table(struct domain *d,
+                    struct page_info * (*alloc_page)(struct domain *d),
+                    void (*free_page)(struct domain *d, struct page_info *pg));
+
+/* Return all the p2m resources to Xen. */
+void p2m_teardown(struct domain *d);
+
+/* Add a page to a domain's p2m table */
+int guest_physmap_add_entry(struct domain *d, unsigned long gfn,
+                             unsigned long mfn, p2m_type_t t);
+
+/* Untyped version for RAM only, for compatibility 
+ *
+ * Return 0 for success
+ */
+static inline int guest_physmap_add_page(struct domain *d, unsigned long gfn,
+                                         unsigned long mfn)
+{
+    return guest_physmap_add_entry(d, gfn, mfn, p2m_ram_rw);
+}
+
+/* Remove a page from a domain's p2m table */
+void guest_physmap_remove_page(struct domain *d, unsigned long gfn,
+                               unsigned long mfn);
+
+/* Change types across all p2m entries in a domain */
+void p2m_change_type_global(struct domain *d, p2m_type_t ot, p2m_type_t nt);
+
+/* Compare-exchange the type of a single p2m entry */
+p2m_type_t p2m_change_type(struct domain *d, unsigned long gfn,
+                           p2m_type_t ot, p2m_type_t nt);
+
+/* Set mmio addresses in the p2m table (for pass-through) */
+int set_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn);
+int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn);
+
+#endif /* _XEN_P2M_H */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-set-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH]Large Page Support for HAP
  2007-11-15 16:26 [RFC][PATCH]Large Page Support for HAP Huang2, Wei
@ 2007-11-15 16:36 ` Keir Fraser
  2007-11-15 17:36   ` Huang2, Wei
       [not found] ` <20071115185929.GG26000@york.uk.xensource.com>
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: Keir Fraser @ 2007-11-15 16:36 UTC (permalink / raw)
  To: Huang2, Wei, xen-devel; +Cc: Tim Deegan


[-- Attachment #1.1: Type: text/plain, Size: 871 bytes --]

On 15/11/07 16:26, "Huang2, Wei" <Wei.Huang2@amd.com> wrote:

> I implemented a preliminary version of HAP large page support. My testings
> showed that 32bit PAE and 64bit worked well. Also I saw decent performance
> improvement for certain benchmarks.
>  
> So before I go too far, I send this patch to community for reviews/comments.
> This patch goes with xen-unstable changeset 16281. I will redo it after
> collecting all ideas.

Looks pretty good to me.

To get round the 2M/4M distinction I'd write code in terms of 'normal page'
and 'super page', where the former is order 0 and the latter is order
L2_SUPERPAGE_ORDER (or somesuch name). I would try to avoid referencing 2M
or 4M explicitly as much as possible.
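
As a rough illustration of that naming scheme, here is a minimal sketch.
Apart from L2_SUPERPAGE_ORDER, every name below (NORMAL_PAGE_ORDER,
PAGING_LEVELS, the helper functions) is hypothetical and chosen only for the
example. It assumes 2MB superpages for PAE and 64-bit builds and 4MB for
2-level 32-bit builds:

```c
#include <stdint.h>

#define PAGE_SHIFT          12
#define NORMAL_PAGE_ORDER   0    /* hypothetical name for a 4KB page */

/* Assumption: the build supplies PAGING_LEVELS; default to 4 for the demo. */
#ifndef PAGING_LEVELS
#define PAGING_LEVELS 4
#endif

#if PAGING_LEVELS == 2
#define L2_SUPERPAGE_ORDER  10   /* 4MB = 1024 x 4KB (non-PAE 32-bit) */
#else
#define L2_SUPERPAGE_ORDER  9    /* 2MB = 512 x 4KB (PAE and 64-bit) */
#endif

/* Number of 4KB frames covered by a mapping of the given order. */
static inline unsigned long pages_in_order(unsigned int order)
{
    return 1UL << order;
}

/* Size in bytes of a mapping of the given order. */
static inline uint64_t bytes_in_order(unsigned int order)
{
    return (uint64_t)1 << (PAGE_SHIFT + order);
}

/* A gfn can start a superpage only if it is superpage-aligned. */
static inline int gfn_is_superpage_aligned(unsigned long gfn)
{
    return (gfn & (pages_in_order(L2_SUPERPAGE_ORDER) - 1)) == 0;
}
```

With helpers like these, callers pass an order around and never need to
mention 2M or 4M directly.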

Having to shatter the 0-2MB region for the VGA RAM hole is a shame, but I
suppose there's no way round that really.

 -- Keir


[-- Attachment #1.2: Type: text/html, Size: 1379 bytes --]


^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: Re: [RFC][PATCH]Large Page Support for HAP
  2007-11-15 16:36 ` Keir Fraser
@ 2007-11-15 17:36   ` Huang2, Wei
  2007-11-15 17:42     ` Keir Fraser
  0 siblings, 1 reply; 19+ messages in thread
From: Huang2, Wei @ 2007-11-15 17:36 UTC (permalink / raw)
  To: Keir Fraser, xen-devel; +Cc: Tim Deegan

[-- Attachment #1.1: Type: text/plain, Size: 648 bytes --]

> To get round the 2M/4M distinction I'd write code in terms of 'normal
> page' and 'super page', where the former is order 0 and the latter is
> order L2_SUPERPAGE_ORDER (or somesuch name). I would try to avoid
> referencing 2M or 4M explicitly as much as possible.

Will do.

> Having to shatter the 0-2MB region for the VGA RAM hole is a shame, but
> I suppose there's no way round that really.

Since the tools control the page_array (size, order, etc.), this is the only
way to do it. My major concern is about handling 4MB pages in 32-bit mode
and the impact on shadow paging. Any thoughts?

Thanks,

-Wei

[-- Attachment #1.2: Type: text/html, Size: 4019 bytes --]


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Re: [RFC][PATCH]Large Page Support for HAP
  2007-11-15 17:36   ` Huang2, Wei
@ 2007-11-15 17:42     ` Keir Fraser
  0 siblings, 0 replies; 19+ messages in thread
From: Keir Fraser @ 2007-11-15 17:42 UTC (permalink / raw)
  To: Huang2, Wei, xen-devel; +Cc: Tim Deegan


[-- Attachment #1.1: Type: text/plain, Size: 586 bytes --]


As I said, you should avoid explicitly referencing the actual superpage size
to make supporting 4MB superpage easier. For the impacts to shadow paging,
that's Tim's area. :-)

 -- Keir

On 15/11/07 17:36, "Huang2, Wei" <Wei.Huang2@amd.com> wrote:

> Having to shatter the 0-2MB region for the VGA RAM hole is a shame, but I
> suppose there's no way round that really.
>  
> Since tools control the page_array (size, order, etc.), this is the only way
> to do it. My major concern is about handing 4MB page for 32bit mode and
> impacts to shadow paging. Any thought?



[-- Attachment #1.2: Type: text/html, Size: 1218 bytes --]


^ permalink raw reply	[flat|nested] 19+ messages in thread

[parent not found: <20071115185929.GG26000@york.uk.xensource.com>]

* RE: [RFC][PATCH]Large Page Support for HAP
       [not found] ` <20071115185929.GG26000@york.uk.xensource.com>
@ 2007-11-15 19:33   ` Huang2, Wei
  0 siblings, 0 replies; 19+ messages in thread
From: Huang2, Wei @ 2007-11-15 19:33 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel

Tim Deegan wrote:
> Hi,
> 
> At 10:26 -0600 on 15 Nov (1195122365), Huang2, Wei wrote:
>> 2. Shadow paging
>> - This implementation will affect shadow mode, especially at
>> xc_hvm_build.c and memory.c. 
>> - Where and how to avoid affecting shadow?
> 
> Shadow already uses SH_LINEAR_PT_VIRT_START, so we can't put a
> mapping there.  

Given that we don't use SH_LINEAR_PT_VIRT_START in current HAP mode, I
think it is OK to borrow this address space for HAP. You are right that
shadow is using it, so it is a bit dangerous. If we can prevent large
page support in shadow paging, is using SH_LINEAR_PT still acceptable
to you?

> Can you just use the normal linear mapping plus the
> RO_MPT mapping of the p2m instead?  
> 
> Otherwise, the only thing I can see that shadow will need is for the
> callback from the p2m code that writes the entries to be made aware
> of the superpage level-2 entries.  It'll need to treat a superpage
> entry the same way as 512/1024 level-1 entries.   

Could you elaborate on this idea? RO_MPT is currently being used. I did
not see any spare linear space I can borrow except SH_LINEAR_PT. Do you
mean I can still borrow it, but have to handle it correctly in shadow
code if it is a super page?
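
For reference, one possible reading of the second point (treating a
superpage entry the same way as 512/1024 level-1 entries) is sketched below.
This is only an illustration: the callback type and function names are
hypothetical and are not existing Xen interfaces.

```c
/* Hypothetical per-4KB callback; not an existing Xen interface. */
typedef void (*p2m_l1_cb)(unsigned long gfn, unsigned long mfn, void *ctx);

/* Present one superpage mapping to the callback as 2^order consecutive
 * 4KB mappings, so the callee never has to understand the PSE bit. */
static void visit_superpage_as_l1(unsigned long gfn_base,
                                  unsigned long mfn_base,
                                  unsigned int order,
                                  p2m_l1_cb cb, void *ctx)
{
    unsigned long i, nr = 1UL << order;

    for ( i = 0; i < nr; i++ )
        cb(gfn_base + i, mfn_base + i, ctx);
}

/* Example callee: records how many 4KB entries it saw and the last one. */
struct visit_log {
    unsigned long count;
    unsigned long last_gfn;
    unsigned long last_mfn;
};

static void log_visit(unsigned long gfn, unsigned long mfn, void *ctx)
{
    struct visit_log *log = ctx;

    log->count++;
    log->last_gfn = gfn;
    log->last_mfn = mfn;
}
```

An order of 9 would cover the 2MB case and an order of 10 the 4MB case.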

Thanks,

-Wei

> 
> Cheers,
> 
> Tim.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [RFC][PATCH]Large Page Support for HAP
  2007-11-15 16:26 [RFC][PATCH]Large Page Support for HAP Huang2, Wei
  2007-11-15 16:36 ` Keir Fraser
       [not found] ` <20071115185929.GG26000@york.uk.xensource.com>
@ 2007-11-16 17:40 ` Byrne, John (HP Labs)
  2007-11-16 17:53   ` Huang2, Wei
  2007-11-19 20:27 ` Stephen C. Tweedie
  3 siblings, 1 reply; 19+ messages in thread
From: Byrne, John (HP Labs) @ 2007-11-16 17:40 UTC (permalink / raw)
  To: Huang2, Wei, xen-devel; +Cc: Tim Deegan


[-- Attachment #1.1: Type: text/plain, Size: 3099 bytes --]

Wei,
 
I have been hacking at this, too,  since I am interested in trying 1GB
pages to see what they can do. After I dug myself into a hole, I
restarted from the beginning and am trying a different approach than
modifying xc_hvm_build.c: modify populate_physmap() to opportunistically
allocate large pages, if possible. I just thought I'd mention it.
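
To make the idea concrete, here is a small standalone sketch of
"opportunistic" allocation: try a superpage whenever the gfn is aligned and
enough pages remain, otherwise fall back to a 4KB page. It is not the real
populate_physmap(); the allocator hook, the toy heap and all names are
assumptions made up for the illustration.

```c
#define SUPERPAGE_ORDER 9            /* assumes 2MB superpages */
#define ALLOC_FAILED    (~0UL)

/* Hypothetical allocator hook: returns the first mfn of a 2^order extent. */
typedef unsigned long (*alloc_fn)(unsigned int order, void *ctx);

struct populate_stats {
    unsigned long superpages;
    unsigned long small_pages;
};

/* Populate nr_pages frames starting at gfn. Returns 0 on success, -1 if
 * even a 4KB allocation fails. A real version would also record each
 * returned mfn in the p2m; this sketch only counts what was allocated. */
static int populate_range(unsigned long gfn, unsigned long nr_pages,
                          alloc_fn alloc, void *ctx,
                          struct populate_stats *st)
{
    const unsigned long sp = 1UL << SUPERPAGE_ORDER;

    while ( nr_pages > 0 )
    {
        if ( (gfn & (sp - 1)) == 0 && nr_pages >= sp &&
             alloc(SUPERPAGE_ORDER, ctx) != ALLOC_FAILED )
        {
            st->superpages++;
            gfn += sp;
            nr_pages -= sp;
            continue;
        }

        /* Unaligned, too few pages left, or no superpage available. */
        if ( alloc(0, ctx) == ALLOC_FAILED )
            return -1;
        st->small_pages++;
        gfn++;
        nr_pages--;
    }
    return 0;
}

/* Toy heap for the example: a limited number of superpages, unlimited
 * 4KB pages. */
struct toy_heap {
    unsigned long superpages_left;
    unsigned long next_mfn;
};

static unsigned long toy_alloc(unsigned int order, void *ctx)
{
    struct toy_heap *h = ctx;
    unsigned long nr = 1UL << order, mfn;

    if ( order > 0 )
    {
        if ( h->superpages_left == 0 )
            return ALLOC_FAILED;
        h->superpages_left--;
    }
    mfn = (h->next_mfn + nr - 1) & ~(nr - 1);   /* align to extent size */
    h->next_mfn = mfn + nr;
    return mfn;
}
```

The fallback path is what lets the tools stay unchanged: callers keep asking
for a range of frames and never need to know which ones ended up in
superpages.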
 
John Byrne
 

________________________________

From: xen-devel-bounces@lists.xensource.com
[mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Huang2, Wei
Sent: Thursday, November 15, 2007 8:26 AM
To: xen-devel@lists.xensource.com
Cc: Tim Deegan
Subject: [Xen-devel] [RFC][PATCH]Large Page Support for HAP


I implemented a preliminary version of HAP large page support. My
testings showed that 32bit PAE and 64bit worked well. Also I saw decent
performance improvement for certain benchmarks.
 
So before I go too far, I send this patch to community for
reviews/comments. This patch goes with xen-unstable changeset 16281. I
will redo it after collecting all ideas.
 
Thanks,
 
-Wei
 
============
DESIGN IDEAS:
1. Large page requests
- xc_hvm_build.c requests large page (2MB for now) while starting guests
- memory.c handles large page requests. If it can not handle it, falls
back to 4KB pages.
 
2. P2M table
- P2M table takes page size order as a parameter; It builds P2M table
(setting PSE bit, etc.) according to page size.
- Other related functions (such as p2m_audit()) handles the table based
on page size too.
- Page split/merge
** Large page will be split into 4KB page in P2M table if needed. For
instance, if set_p2m_entry() handles 4KB page but finds PSE/PRESENT bits
are on, it will further split large page to 4KB pages.
** There is NO merge from 4KB pages to large page. Since large page is
only used at the very beginning, guest_physmap_add(), this is OK for
now.
 
3. HAP
- To access the PSE bit, L2 pages of P2M table is installed in linear
mapping on SH_LINEAR_PT_VIRT_START. We borrow this address space since
it was not used.
 
4. gfn_to_mfn translation (P2M)
- gfn_to_mfn_foreign() traverses P2M table and handles address
translation correctly based on PSE bit.
- gfn_to_mfn_current() accesses SH_LINEAR_PT_VIRT_START to check PSE
bit. If is on, we handle translation using large page. Otherwise, it
falls back to normal RO_MPT_VIRT_START address space to access P2M L1
pages.
 
5. M2P translation
- Same as before, M2P translation still happens on 4KB level.
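
The split in item 2 and the PSE-aware lookup in item 4 can be sketched with
a toy two-level table. This is a simplified model under stated assumptions,
not the patch's code: the structures and the toy_* functions are
hypothetical, a 512-entry L1 (PAE/64-bit layout) is assumed, and only the
PRESENT (0x001) and PSE (0x080) flag values follow the x86 page-table format.

```c
#include <stdlib.h>

#define L1_ENTRIES   512             /* PAE / 64-bit layout assumed */
#define FLAG_PRESENT 0x001u
#define FLAG_PSE     0x080u
#define INVALID_MFN  (~0UL)

struct l1_table {
    unsigned long mfn[L1_ENTRIES];
    unsigned int  flags[L1_ENTRIES];
};

struct l2_entry {
    unsigned int     flags;
    unsigned long    mfn;    /* base mfn of the superpage when PSE is set */
    struct l1_table *l1;     /* L1 table when PSE is clear */
};

/* Translate gfn to mfn. The caller must ensure gfn / L1_ENTRIES is a
 * valid index into l2. */
static unsigned long toy_gfn_to_mfn(const struct l2_entry *l2,
                                    unsigned long gfn)
{
    const struct l2_entry *e = &l2[gfn / L1_ENTRIES];
    unsigned long idx = gfn % L1_ENTRIES;

    if ( !(e->flags & FLAG_PRESENT) )
        return INVALID_MFN;
    if ( e->flags & FLAG_PSE )
        return e->mfn + idx;             /* superpage: offset from base */
    if ( !(e->l1->flags[idx] & FLAG_PRESENT) )
        return INVALID_MFN;
    return e->l1->mfn[idx];              /* normal 4KB entry */
}

/* Split a superpage entry into 512 4KB entries mapping the same frames.
 * Returns 0 on success (or if there is nothing to split), -1 on ENOMEM. */
static int toy_split_superpage(struct l2_entry *e)
{
    const unsigned int both = FLAG_PRESENT | FLAG_PSE;
    struct l1_table *l1;
    unsigned long i;

    if ( (e->flags & both) != both )
        return 0;

    l1 = malloc(sizeof(*l1));
    if ( l1 == NULL )
        return -1;

    for ( i = 0; i < L1_ENTRIES; i++ )
    {
        l1->mfn[i] = e->mfn + i;
        l1->flags[i] = FLAG_PRESENT;
    }
    e->l1 = l1;
    e->flags &= ~FLAG_PSE;
    return 0;
}
```

The point of the model is that a lookup returns the same mfn before and
after the split, which is why splitting on demand is safe even without a
merge path.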
 
AREAS NEEDS COMMENTS:
1. Large page for 32bit mode
- 32bit use 4MB for large page. This is very annoying for
xc_hvm_build.c. I don't want to create another 4MB page_array for it.
- Because of this, this area has not been tested very well. I expect
changes soon.
 
2. Shadow paging
- This implementation will affect shadow mode, especially at
xc_hvm_build.c and memory.c.
- Where and how to avoid affecting shadow?
 
3. Turn it on/off
- Do we want to turn this feature on/off through option (kernel option
or anything else)?
 
4. Other missing areas?
===========

[-- Attachment #1.2: Type: text/html, Size: 7705 bytes --]


^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [RFC][PATCH]Large Page Support for HAP
  2007-11-16 17:40 ` Byrne, John (HP Labs)
@ 2007-11-16 17:53   ` Huang2, Wei
  2007-11-16 18:03     ` Keir Fraser
  2007-11-29 18:48     ` Byrne, John (HP Labs)
  0 siblings, 2 replies; 19+ messages in thread
From: Huang2, Wei @ 2007-11-16 17:53 UTC (permalink / raw)
  To: Byrne, John (HP Labs), xen-devel; +Cc: Tim Deegan


[-- Attachment #1.1: Type: text/plain, Size: 3809 bytes --]

John,
 
If you have a better design, please share it with us and I will be happy to
work with you. :-) I agree that xc_hvm_build.c does not have to be modified
if memory.c is smart enough to scan all of the page_array information. One
concern, though, is that the Xen tools sometimes really do want to create a
mapping on a 4KB boundary instead of using a large page. That requires
extra information to be passed from the tools (e.g., xc_hvm_build.c) to
memory.c.
 
-Wei

________________________________

From: Byrne, John (HP Labs) [mailto:john.l.byrne@hp.com] 
Sent: Friday, November 16, 2007 11:41 AM
To: Huang2, Wei; xen-devel@lists.xensource.com
Cc: Tim Deegan
Subject: RE: [Xen-devel] [RFC][PATCH]Large Page Support for HAP


Wei,
 
I have been hacking at this, too,  since I am interested in trying 1GB
pages to see what they can do. After I dug myself into a hole, I
restarted from the beginning and am trying a different approach than
modifying xc_hvm_build.c: modify populate_physmap() to opportunistically
allocate large pages, if possible. I just thought I'd mention it.
 
John Byrne
 


________________________________

From: xen-devel-bounces@lists.xensource.com
[mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Huang2, Wei
Sent: Thursday, November 15, 2007 8:26 AM
To: xen-devel@lists.xensource.com
Cc: Tim Deegan
Subject: [Xen-devel] [RFC][PATCH]Large Page Support for HAP


I implemented a preliminary version of HAP large page support. My
testings showed that 32bit PAE and 64bit worked well. Also I saw decent
performance improvement for certain benchmarks.
 
So before I go too far, I send this patch to community for
reviews/comments. This patch goes with xen-unstable changeset 16281. I
will redo it after collecting all ideas.
 
Thanks,
 
-Wei
 
============
DESIGN IDEAS:
1. Large page requests
- xc_hvm_build.c requests large page (2MB for now) while starting guests
- memory.c handles large page requests. If it can not handle it, falls
back to 4KB pages.
 
2. P2M table
- P2M table takes page size order as a parameter; It builds P2M table
(setting PSE bit, etc.) according to page size.
- Other related functions (such as p2m_audit()) handles the table based
on page size too.
- Page split/merge
** Large page will be split into 4KB page in P2M table if needed. For
instance, if set_p2m_entry() handles 4KB page but finds PSE/PRESENT bits
are on, it will further split large page to 4KB pages.
** There is NO merge from 4KB pages to large page. Since large page is
only used at the very beginning, guest_physmap_add(), this is OK for
now.
 
3. HAP
- To access the PSE bit, L2 pages of P2M table is installed in linear
mapping on SH_LINEAR_PT_VIRT_START. We borrow this address space since
it was not used.
 
4. gfn_to_mfn translation (P2M)
- gfn_to_mfn_foreign() traverses P2M table and handles address
translation correctly based on PSE bit.
- gfn_to_mfn_current() accesses SH_LINEAR_PT_VIRT_START to check PSE
bit. If is on, we handle translation using large page. Otherwise, it
falls back to normal RO_MPT_VIRT_START address space to access P2M L1
pages.
 
5. M2P translation
- Same as before, M2P translation still happens on 4KB level.
 
AREAS NEEDS COMMENTS:
1. Large page for 32bit mode
- 32bit use 4MB for large page. This is very annoying for
xc_hvm_build.c. I don't want to create another 4MB page_array for it.
- Because of this, this area has not been tested very well. I expect
changes soon.
 
2. Shadow paging
- This implementation will affect shadow mode, especially at
xc_hvm_build.c and memory.c.
- Where and how to avoid affecting shadow?
 
3. Turn it on/off
- Do we want to turn this feature on/off through option (kernel option
or anything else)?
 
4. Other missing areas?
===========

[-- Attachment #1.2: Type: text/html, Size: 9179 bytes --]


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH]Large Page Support for HAP
  2007-11-16 17:53   ` Huang2, Wei
@ 2007-11-16 18:03     ` Keir Fraser
  2007-12-07  1:43       ` John Byrne
  2007-11-29 18:48     ` Byrne, John (HP Labs)
  1 sibling, 1 reply; 19+ messages in thread
From: Keir Fraser @ 2007-11-16 18:03 UTC (permalink / raw)
  To: Huang2, Wei, Byrne, John (HP Labs), xen-devel; +Cc: Tim Deegan


[-- Attachment #1.1: Type: text/plain, Size: 4325 bytes --]

To my mind populate_physmap() should do what it is told w.r.t. extent sizes.
I don't mind some modification of xc_hvm_build to support this feature.

 -- Keir

On 16/11/07 17:53, "Huang2, Wei" <Wei.Huang2@amd.com> wrote:

> John,
>  
> If you have a better design, share with us and I will be happy to work with
> you. :-) I agree that xc_hvm_build.c does not have to be modified, if memory.c
> is smart enough to scan all page_array information. But one concern is that
> sometimes Xen tools really want to create mapping at 4KB boundary instead of
> using large page. That requires extra information passed from tools (e.g.,
> xc_hvm_build.c) to memory.c
>  
> -Wei
> 
> 
> From: Byrne, John (HP Labs) [mailto:john.l.byrne@hp.com]
> Sent: Friday, November 16, 2007 11:41 AM
> To: Huang2, Wei; xen-devel@lists.xensource.com
> Cc: Tim Deegan
> Subject: RE: [Xen-devel] [RFC][PATCH]Large Page Support for HAP
> 
> Wei,
>  
> I have been hacking at this, too,  since I am interested in trying 1GB pages
> to see what they can do. After I dug myself into a hole, I restarted from the
> beginning and am trying a different approach than modifying xc_hvm_build.c:
> modify populate_physmap() to opportunistically allocate large pages, if
> possible. I just thought I'd mention it.
>  
> John Byrne
>  
> 
> 
> From: xen-devel-bounces@lists.xensource.com
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Huang2, Wei
> Sent: Thursday, November 15, 2007 8:26 AM
> To: xen-devel@lists.xensource.com
> Cc: Tim Deegan
> Subject: [Xen-devel] [RFC][PATCH]Large Page Support for HAP
> 
> I implemented a preliminary version of HAP large page support. My testings
> showed that 32bit PAE and 64bit worked well. Also I saw decent performance
> improvement for certain benchmarks.
>  
> So before I go too far, I send this patch to community for reviews/comments.
> This patch goes with xen-unstable changeset 16281. I will redo it after
> collecting all ideas.
>  
> Thanks,
>  
> -Wei
>  
> ============
> DESIGN IDEAS:
> 1. Large page requests
> - xc_hvm_build.c requests large page (2MB for now) while starting guests
> - memory.c handles large page requests. If it can not handle it, falls back to
> 4KB pages.
>  
> 2. P2M table
> - P2M table takes page size order as a parameter; It builds P2M table (setting
> PSE bit, etc.) according to page size.
> - Other related functions (such as p2m_audit()) handles the table based on
> page size too.
> - Page split/merge
> ** Large page will be split into 4KB page in P2M table if needed. For
> instance, if set_p2m_entry() handles 4KB page but finds PSE/PRESENT bits are
> on, it will further split large page to 4KB pages.
> ** There is NO merge from 4KB pages to large page. Since large page is only
> used at the very beginning, guest_physmap_add(), this is OK for now.
>  
> 3. HAP
> - To access the PSE bit, L2 pages of P2M table is installed in linear mapping
> on SH_LINEAR_PT_VIRT_START. We borrow this address space since it was not
> used.
>  
> 4. gfn_to_mfn translation (P2M)
> - gfn_to_mfn_foreign() traverses P2M table and handles address translation
> correctly based on PSE bit.
> - gfn_to_mfn_current() accesses SH_LINEAR_PT_VIRT_START to check PSE bit. If
> is on, we handle translation using large page. Otherwise, it falls back to
> normal RO_MPT_VIRT_START address space to access P2M L1 pages.
>  
> 5. M2P translation
> - Same as before, M2P translation still happens on 4KB level.
>  
> AREAS NEEDS COMMENTS:
> 1. Large page for 32bit mode
> - 32bit use 4MB for large page. This is very annoying for xc_hvm_build.c. I
> don't want to create another 4MB page_array for it.
> - Because of this, this area has not been tested very well. I expect changes
> soon.
>  
> 2. Shadow paging
> - This implementation will affect shadow mode, especially at xc_hvm_build.c
> and memory.c.
> - Where and how to avoid affecting shadow?
>  
> 3. Turn it on/off
> - Do we want to turn this feature on/off through option (kernel option or
> anything else)?
>  
> 4. Other missing areas?
> ===========
> 
> 



[-- Attachment #1.2: Type: text/html, Size: 6941 bytes --]


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH]Large Page Support for HAP
  2007-11-16 18:03     ` Keir Fraser
@ 2007-12-07  1:43       ` John Byrne
  0 siblings, 0 replies; 19+ messages in thread
From: John Byrne @ 2007-12-07  1:43 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Tim Deegan, Huang2, Wei, xen-devel@lists.xensource.com

[-- Attachment #1: Type: text/plain, Size: 5897 bytes --]


Keir,

I'm very late replying to this. I wanted to make sure I had something that
worked first before continuing the discussion and things took longer
than I'd hoped. Wei has asked me to send along my patch (against 16256)
for discussion. (Maybe just to make his look good.) Mine is less 
complete --- it doesn't handle page shattering when pages are removed 
--- but it works well enough to start Linux HAP guests with 1G 
super-pages, which was my primary interest.

My original thought for modifying just populate_physmap() to
opportunistically use super-pages was that my try_larger_extents()
function in memory.c could be made mode-specific and that the hypervisor
was the easiest place to have this kind of policy. (Will IOMMU DMA
support for PV guests benefit from super-page allocations?)
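
As a rough illustration of the opportunistic check described above, here
is a stand-alone sketch. It is not the code from the attached patch: a
plain array stands in for the guest's extent list, and the names
run_fits_order() and best_order() are hypothetical. The orders 18, 9 and 0
(1G, 2M and 4K with 4KB base pages) are the ones the patch uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the check: returns 1 if the (1 << order) frame
 * numbers starting at list[off] are consecutive and the first one is
 * aligned to the superpage size, 0 otherwise. */
static int run_fits_order(const uint64_t *list, size_t off, size_t len,
                          unsigned int order)
{
    size_t n = (size_t)1 << order;
    size_t i;

    assert(list != NULL);
    if (off >= len || n > len - off)     /* not enough entries left */
        return 0;
    if ((list[off] & (n - 1)) != 0)      /* first gpfn must be aligned */
        return 0;
    for (i = 1; i < n; i++)              /* remaining gpfns must follow on */
        if (list[off + i] != list[off] + i)
            return 0;
    return 1;
}

/* Pick the largest order (18 = 1G, 9 = 2M, 0 = 4K) usable at list[off]. */
static unsigned int best_order(const uint64_t *list, size_t off, size_t len)
{
    if (run_fits_order(list, off, len, 18))
        return 18;
    if (run_fits_order(list, off, len, 9))
        return 9;
    return 0;
}
```

A caller would then try to allocate at the returned order and drop to the
next smaller order if the allocation fails, which is the fallback
behaviour the hypervisor-side policy is meant to provide.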

I did end up modifying xc_hvm_build, because I wanted to optimize the
guest to use 1G pages by using as little memory under 1G as possible.
So, the memsize_low variable I define is meant to become a parameter to
allow the domain config to specify a low memory size (I'm using 32MB for
now), with the rest of the memory allocated starting at the 1G boundary.
Perhaps some general method of specifying the guest memory layout could
be developed.
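
The layout arithmetic can be sketched as follows. This is a simplified
model under stated assumptions, not the patch itself: it ignores the three
special pages (ioreq, bufioreq, xenstore) that the patch adds to the low
chunk, it assumes 4KB base pages, and struct guest_layout and
split_layout() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumption: 4KB base pages */

struct guest_layout {
    uint64_t nr_pages_low;   /* pages placed below the 1G boundary */
    uint64_t nr_pages_high;  /* pages placed from the 1G boundary upward */
    uint64_t end_addr;       /* first guest-physical address past RAM */
};

/* Split memsize_mb of guest RAM into a small low chunk of low_mb and a
 * high chunk starting at 1G. If the guest is too small for the split,
 * everything stays in one contiguous low chunk. */
static struct guest_layout split_layout(uint64_t memsize_mb, uint64_t low_mb)
{
    struct guest_layout l;
    uint64_t total = memsize_mb << (20 - PAGE_SHIFT);

    assert(memsize_mb >= 2);   /* HVM guests need at least 2MB */

    l.nr_pages_low = total;
    l.nr_pages_high = 0;
    l.end_addr = total << PAGE_SHIFT;

    if (low_mb != 0 && low_mb < 1024 && memsize_mb >= 1024 + low_mb) {
        l.nr_pages_low = low_mb << (20 - PAGE_SHIFT);
        l.nr_pages_high = total - l.nr_pages_low;
        l.end_addr = (UINT64_C(1024) << 20) +
                     (l.nr_pages_high << PAGE_SHIFT);
    }
    return l;
}
```

For a 2048MB guest with a 32MB low chunk, this puts 8192 pages below 1G
and the remaining 516096 pages at 0x40000000 and above, so guest RAM ends
at 0xBE000000.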

For p2m, I assumed that gfn_to_mfn_current() was an infrequent operation 
under HAP and it was not worth doing any direct mapping of the L2/L3
page tables to support this. So gfn_to_mfn_current() in HAP mode just
calls gfn_to_mfn_foreign() (modified to note PSE pages) and walks the
HAP pagetable.
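
When the walk stops at a PSE entry, the final mfn is the base frame stored
in that entry plus the low bits of the gfn that the entry covers. A
minimal stand-alone sketch of that arithmetic, assuming the usual x86-64
shifts (21 for L2, 30 for L3) and using the hypothetical helper name
superpage_mfn():

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT          12
#define L2_PAGETABLE_SHIFT  21   /* one L2 entry covers 2M */
#define L3_PAGETABLE_SHIFT  30   /* one L3 entry covers 1G */

/* base_mfn is the frame number taken from the PSE entry; level_shift is
 * the shift of the level at which the walk stopped. */
static uint64_t superpage_mfn(uint64_t base_mfn, uint64_t gfn,
                              unsigned int level_shift)
{
    uint64_t mask;

    assert(level_shift > PAGE_SHIFT && level_shift < 64);
    mask = (UINT64_C(1) << (level_shift - PAGE_SHIFT)) - 1;
    return base_mfn + (gfn & mask);
}
```

For example, gfn 0x40123 under a 1G entry whose base mfn is 0x80000 keeps
its low 18 bits (0x123) and translates to mfn 0x80123.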

Perhaps there is a useful idea in this that could be used with Wei's 
changes.

John Byrne


Keir Fraser wrote:
> To my mind populate_physmap() should do what it is told w.r.t. extent sizes. I don't mind some modification of xc_hvm_build to support this feature.
> 
>  -- Keir
> 
> On 16/11/07 17:53, "Huang2, Wei" <Wei.Huang2@amd.com> wrote:
> 
> John,
> 
> If you have a better design, share with us and I will be happy to work with you. :-) I agree that xc_hvm_build.c does not have to be modified, if memory.c is smart enough to scan all page_array information. But one concern is that sometimes Xen tools really want to create mapping at 4KB boundary instead of using large page. That requires extra information passed from tools (e.g., xc_hvm_build.c) to memory.c
> 
> -Wei
> 
> ________________________________
> From: Byrne, John (HP Labs) [mailto:john.l.byrne@hp.com]
> Sent: Friday, November 16, 2007 11:41 AM
> To: Huang2, Wei; xen-devel@lists.xensource.com
> Cc: Tim Deegan
> Subject: RE: [Xen-devel] [RFC][PATCH]Large Page Support for HAP
> 
> Wei,
> 
> I have been hacking at this, too,  since I am interested in trying 1GB pages to see what they can do. After I dug myself into a hole, I restarted from the beginning and am trying a different approach than modifying xc_hvm_build.c: modify populate_physmap() to opportunistically allocate large pages, if possible. I just thought I'd mention it.
> 
> John Byrne
> 
> 
> ________________________________
> From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Huang2, Wei
> Sent: Thursday, November 15, 2007 8:26 AM
> To: xen-devel@lists.xensource.com
> Cc: Tim Deegan
> Subject: [Xen-devel] [RFC][PATCH]Large Page Support for HAP
> 
> I implemented a preliminary version of HAP large page support. My testings showed that 32bit PAE and 64bit worked well. Also I saw decent performance improvement for certain benchmarks.
> 
> So before I go too far, I send this patch to community for reviews/comments. This patch goes with xen-unstable changeset 16281. I will redo it after collecting all ideas.
> 
> Thanks,
> 
> -Wei
> 
> ============
> DESIGN IDEAS:
> 1. Large page requests
> - xc_hvm_build.c requests large page (2MB for now) while starting guests
> - memory.c handles large page requests. If it can not handle it, falls back to 4KB pages.
> 
> 2. P2M table
> - P2M table takes page size order as a parameter; It builds P2M table (setting PSE bit, etc.) according to page size.
> - Other related functions (such as p2m_audit()) handles the table based on page size too.
> - Page split/merge
> ** Large page will be split into 4KB page in P2M table if needed. For instance, if set_p2m_entry() handles 4KB page but finds PSE/PRESENT bits are on, it will further split large page to 4KB pages.
> ** There is NO merge from 4KB pages to large page. Since large page is only used at the very beginning, guest_physmap_add(), this is OK for now.
> 
> 3. HAP
> - To access the PSE bit, L2 pages of P2M table is installed in linear mapping on SH_LINEAR_PT_VIRT_START. We borrow this address space since it was not used.
> 
> 4. gfn_to_mfn translation (P2M)
> - gfn_to_mfn_foreign() traverses P2M table and handles address translation correctly based on PSE bit.
> - gfn_to_mfn_current() accesses SH_LINEAR_PT_VIRT_START to check PSE bit. If is on, we handle translation using large page. Otherwise, it falls back to normal RO_MPT_VIRT_START address space to access P2M L1 pages.
> 
> 5. M2P translation
> - Same as before, M2P translation still happens on 4KB level.
> 
> AREAS NEEDS COMMENTS:
> 1. Large page for 32bit mode
> - 32bit use 4MB for large page. This is very annoying for xc_hvm_build.c. I don't want to create another 4MB page_array for it.
> - Because of this, this area has not been tested very well. I expect changes soon.
> 
> 2. Shadow paging
> - This implementation will affect shadow mode, especially at xc_hvm_build.c and memory.c.
> - Where and how to avoid affecting shadow?
> 
> 3. Turn it on/off
> - Do we want to turn this feature on/off through option (kernel option or anything else)?
> 
> 4. Other missing areas?
> ===========
> 
> 
> 



[-- Attachment #2: 1gpages.patch --]
[-- Type: text/x-patch, Size: 17850 bytes --]

diff -r 1b863ae2bf1e tools/libxc/xc_hvm_build.c
--- a/tools/libxc/xc_hvm_build.c	Wed Dec 05 09:59:23 2007 +0000
+++ b/tools/libxc/xc_hvm_build.c	Thu Dec 06 19:27:24 2007 -0600
@@ -21,7 +21,8 @@
 
 #define SCRATCH_PFN 0xFFFFF
 
-static void build_e820map(void *e820_page, unsigned long long mem_size)
+static void build_e820map(void *e820_page, unsigned long long mem_size,
+                          unsigned long long mem_size_low)
 {
     struct e820entry *e820entry =
         (struct e820entry *)(((unsigned char *)e820_page) + HVM_E820_OFFSET);
@@ -77,17 +78,25 @@ static void build_e820map(void *e820_pag
     e820entry[nr_map].type = E820_RESERVED;
     nr_map++;
 
-    /* Low RAM goes here. Remove 3 pages for ioreq, bufioreq, and xenstore. */
+    /* Low RAM goes here. */
     e820entry[nr_map].addr = 0x100000;
-    e820entry[nr_map].size = mem_size - 0x100000 - PAGE_SIZE * 3;
+    e820entry[nr_map].size = mem_size_low - 0x100000 - PAGE_SIZE * 3;
     e820entry[nr_map].type = E820_RAM;
     nr_map++;
 
     /* Explicitly reserve space for special pages (ioreq and xenstore). */
-    e820entry[nr_map].addr = mem_size - PAGE_SIZE * 3;
+    e820entry[nr_map].addr = mem_size_low;
     e820entry[nr_map].size = PAGE_SIZE * 3;
     e820entry[nr_map].type = E820_RESERVED;
     nr_map++;
+
+    if (mem_size > mem_size_low)
+    {
+        e820entry[nr_map].addr = 0x40000000;
+        e820entry[nr_map].size = mem_size - 0x40000000;
+        e820entry[nr_map].type = E820_RAM;
+        nr_map++;
+    }
 
     if ( extra_mem_size )
     {
@@ -158,16 +167,30 @@ static int setup_guest(int xc_handle,
     uint64_t v_start, v_end;
     int rc;
     xen_capabilities_info_t caps;
+    int memsize_low = 32;
+    unsigned long nr_pages_low, nr_pages_1g;
 
     /* An HVM guest must be initialised with at least 2MB memory. */
     if ( memsize < 2 )
         goto error_out;
 
+    /* Align memory on 1G pages, if possible. */
+    nr_pages += 3;
+    nr_pages_low = nr_pages;
+    nr_pages_1g = 0;
+    v_start = 0;
+    v_end = nr_pages << PAGE_SHIFT;
+    if ( memsize_low && memsize_low < 1024 && memsize >= 1024 + memsize_low )
+    {
+        nr_pages_low = (unsigned long)memsize_low << (20 - PAGE_SHIFT);
+        nr_pages_low += 3;
+        nr_pages_1g = nr_pages - nr_pages_low;
+        v_end = (1024UL << 20) + (nr_pages_1g << PAGE_SHIFT);
+    }
+    nr_pages = v_end >> PAGE_SHIFT;
     if ( elf_init(&elf, image, image_size) != 0 )
         goto error_out;
     elf_parse_binary(&elf);
-    v_start = 0;
-    v_end = (unsigned long long)memsize << 20;
 
     if ( xc_version(xc_handle, XENVER_capabilities, &caps) != 0 )
     {
@@ -203,9 +226,15 @@ static int setup_guest(int xc_handle,
     /* Allocate memory for HVM guest, skipping VGA hole 0xA0000-0xC0000. */
     rc = xc_domain_memory_populate_physmap(
         xc_handle, dom, 0xa0, 0, 0, &page_array[0x00]);
+    /* Allocate first chunk. ("low memory") */
     if ( rc == 0 )
         rc = xc_domain_memory_populate_physmap(
-            xc_handle, dom, nr_pages - 0xc0, 0, 0, &page_array[0xc0]);
+            xc_handle, dom, nr_pages_low - 0xc0, 0, 0, &page_array[0xc0]);
+    /* Allocate second chunk. ("high memory") */
+    if ( rc == 0 && nr_pages_1g )
+        rc = xc_domain_memory_populate_physmap(
+            xc_handle, dom, nr_pages_1g, 0, 0,
+            &page_array[0x40000]);
     if ( rc != 0 )
     {
         PERROR("Could not allocate memory for HVM guest.\n");
@@ -220,7 +249,8 @@ static int setup_guest(int xc_handle,
               HVM_E820_PAGE >> PAGE_SHIFT)) == NULL )
         goto error_out;
     memset(e820_page, 0, PAGE_SIZE);
-    build_e820map(e820_page, v_end);
+    build_e820map(e820_page, v_end,
+                  nr_pages_1g ? (nr_pages_low << PAGE_SHIFT) : v_end);
     munmap(e820_page, PAGE_SIZE);
 
     /* Map and initialise shared_info page. */
@@ -239,7 +269,9 @@ static int setup_guest(int xc_handle,
            sizeof(shared_info->evtchn_mask));
     munmap(shared_info, PAGE_SIZE);
 
-    if ( v_end > HVM_BELOW_4G_RAM_END )
+    if ( nr_pages_1g )
+        shared_page_nr = nr_pages_low - 1;
+    else if ( v_end > HVM_BELOW_4G_RAM_END )
         shared_page_nr = (HVM_BELOW_4G_RAM_END >> PAGE_SHIFT) - 1;
     else
         shared_page_nr = (v_end >> PAGE_SHIFT) - 1;
diff -r 1b863ae2bf1e xen/arch/x86/mm/p2m.c
--- a/xen/arch/x86/mm/p2m.c	Wed Dec 05 09:59:23 2007 +0000
+++ b/xen/arch/x86/mm/p2m.c	Thu Dec 06 19:27:24 2007 -0600
@@ -202,7 +202,8 @@ p2m_next_level(struct domain *d, mfn_t *
 
 // Returns 0 on error (out of memory)
 static int
-set_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn, p2m_type_t p2mt)
+__set_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
+                p2m_type_t p2mt, unsigned int order)
 {
     // XXX -- this might be able to be faster iff current->domain == d
     mfn_t table_mfn = pagetable_get_mfn(d->arch.phys_table);
@@ -217,6 +218,18 @@ set_p2m_entry(struct domain *d, unsigned
                          L4_PAGETABLE_SHIFT - PAGE_SHIFT,
                          L4_PAGETABLE_ENTRIES, PGT_l3_page_table) )
         goto out;
+    if (order == 18) {
+        if ((gfn & ((1 << order) - 1)) != 0 || (mfn & ((1 << order) - 1)) != 0)
+            goto out;
+        p2m_entry = p2m_find_entry(table, &gfn_remainder, gfn,
+                                   L3_PAGETABLE_SHIFT - PAGE_SHIFT,
+                                   L3_PAGETABLE_ENTRIES);
+        entry_content = l1e_from_pfn(mfn_x(mfn),
+                                     p2m_type_to_flags(p2mt) | _PAGE_PSE);
+        paging_write_p2m_entry(d, gfn, p2m_entry, table_mfn, entry_content, 3);
+        rv = 1;
+        goto out;
+    }
 #endif
 #if CONFIG_PAGING_LEVELS >= 3
     /*
@@ -233,6 +246,18 @@ set_p2m_entry(struct domain *d, unsigned
                           : L3_PAGETABLE_ENTRIES),
                          PGT_l2_page_table) )
         goto out;
+    if (order == 9) {
+        if ((gfn & ((1 << order) - 1)) != 0 || (mfn & ((1 << order) - 1)) != 0)
+            goto out;
+        p2m_entry = p2m_find_entry(table, &gfn_remainder, gfn,
+                                   L2_PAGETABLE_SHIFT - PAGE_SHIFT,
+                                   L2_PAGETABLE_ENTRIES);
+        entry_content = l1e_from_pfn(mfn_x(mfn),
+                                     p2m_type_to_flags(p2mt) | _PAGE_PSE);
+        paging_write_p2m_entry(d, gfn, p2m_entry, table_mfn, entry_content, 2);
+        rv = 1;
+        goto out;
+    }
 #endif
     if ( !p2m_next_level(d, &table_mfn, &table, &gfn_remainder, gfn,
                          L2_PAGETABLE_SHIFT - PAGE_SHIFT,
@@ -266,6 +291,11 @@ set_p2m_entry(struct domain *d, unsigned
     return rv;
 }
 
+static inline int
+set_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn, p2m_type_t p2mt)
+{
+    return __set_p2m_entry(d, gfn, mfn, p2mt, 0);
+}
 
 /* Init the datastructures for later use by the p2m code */
 void p2m_init(struct domain *d)
@@ -400,6 +430,7 @@ gfn_to_mfn_foreign(struct domain *d, uns
     paddr_t addr = ((paddr_t)gfn) << PAGE_SHIFT;
     l2_pgentry_t *l2e;
     l1_pgentry_t *l1e;
+    unsigned long flags;
 
     ASSERT(paging_mode_translate(d));
 
@@ -441,25 +472,39 @@ gfn_to_mfn_foreign(struct domain *d, uns
 #else
         l3e += l3_table_offset(addr);
 #endif
-        if ( (l3e_get_flags(*l3e) & _PAGE_PRESENT) == 0 )
+        flags = l3e_get_flags(*l3e);
+        if ( (flags & _PAGE_PRESENT) == 0 )
         {
             unmap_domain_page(l3e);
             return _mfn(INVALID_MFN);
         }
         mfn = _mfn(l3e_get_pfn(*l3e));
         unmap_domain_page(l3e);
+        if ( (flags & _PAGE_PSE) != 0 )
+        {
+            mfn += (gfn & ((1UL << (L3_PAGETABLE_SHIFT - PAGE_SHIFT)) - 1));
+            *t = p2m_flags_to_type(flags);
+            return mfn;
+        }
     }
 #endif
 
     l2e = map_domain_page(mfn_x(mfn));
     l2e += l2_table_offset(addr);
-    if ( (l2e_get_flags(*l2e) & _PAGE_PRESENT) == 0 )
+    flags = l2e_get_flags(*l2e);
+    if ( (flags & _PAGE_PRESENT) == 0 )
     {
         unmap_domain_page(l2e);
         return _mfn(INVALID_MFN);
     }
     mfn = _mfn(l2e_get_pfn(*l2e));
     unmap_domain_page(l2e);
+    if ( (flags & _PAGE_PSE) != 0 )
+    {
+        mfn += (gfn & ((1UL << (L2_PAGETABLE_SHIFT - PAGE_SHIFT)) - 1));
+        *t = p2m_flags_to_type(flags);
+        return mfn;
+    }
 
     l1e = map_domain_page(mfn_x(mfn));
     l1e += l1_table_offset(addr);
@@ -689,8 +734,8 @@ guest_physmap_remove_page(struct domain 
 }
 
 int
-guest_physmap_add_entry(struct domain *d, unsigned long gfn,
-                        unsigned long mfn, p2m_type_t t)
+__guest_physmap_add_entry(struct domain *d, unsigned long gfn,
+                          unsigned long mfn, p2m_type_t t, unsigned int order)
 {
     unsigned long ogfn;
     p2m_type_t ot;
@@ -747,9 +792,13 @@ guest_physmap_add_entry(struct domain *d
 
     if ( mfn_valid(_mfn(mfn)) ) 
     {
-        if ( !set_p2m_entry(d, gfn, _mfn(mfn), t) )
+        if ( !__set_p2m_entry(d, gfn, _mfn(mfn), t, order) )
             rc = -EINVAL;
-        set_gpfn_from_mfn(mfn, gfn);
+        {
+            unsigned int i;
+            for (i = 0; i < (1 << order); i++)
+                set_gpfn_from_mfn(mfn + i, gfn + i);
+        }
     }
     else
     {
diff -r 1b863ae2bf1e xen/common/memory.c
--- a/xen/common/memory.c	Wed Dec 05 09:59:23 2007 +0000
+++ b/xen/common/memory.c	Thu Dec 06 19:27:24 2007 -0600
@@ -90,6 +90,81 @@ static void increase_reservation(struct 
     a->nr_done = i;
 }
 
+static int try_extent(xen_pfn_t gpfn, XEN_GUEST_HANDLE(xen_pfn_t) list,
+                      unsigned int off, unsigned int maxoff,
+                      unsigned int order)
+{
+    unsigned int i;
+    unsigned int o = 1 << order;
+    xen_pfn_t cgpfn;
+
+    if ( (gpfn & (o - 1)) != 0 )
+        return 0;
+    if ( off + o > maxoff)
+        return 0;
+
+    for (i = off + 1; i < off + o; i++)
+    {
+        if ( unlikely(__copy_from_guest_offset(&cgpfn, list, i, 1)) )
+            return -1;
+        if ( gpfn + i - off != cgpfn)
+            return 0;
+    }
+
+    return 1;
+}
+
+static unsigned int try_larger_extents(struct domain *d, xen_pfn_t gpfn,
+                                       XEN_GUEST_HANDLE(xen_pfn_t) list,
+                                       unsigned int off, unsigned int maxoff,
+                                       unsigned int order)
+{
+    unsigned int ret_order;
+    int ret;
+
+    if (!paging_mode_hap(d))
+        return 0;
+    switch (order)
+    {
+    case 18:
+        ret_order = 9;
+        break;
+
+    case 9:
+        ret_order = 0;
+        break;
+
+    case 0:
+        ret = 0;
+        ret_order = 18;
+        ret = try_extent(gpfn, list, off, maxoff, ret_order);
+        if (ret > 0)
+            break;
+        if (ret < 0)
+        {
+            ret_order = ~0;
+            break;
+        }
+        ret_order = 9;
+        ret = try_extent(gpfn, list, off, maxoff, ret_order);
+        if (ret > 0)
+            break;
+        if (ret < 0)
+        {
+            ret_order = ~0;
+            break;
+        }
+        ret_order = 0;
+        break;
+
+    default:
+        ret_order = 0;
+        BUG();
+    }
+
+    return ret_order;
+}
+
 static void populate_physmap(struct memop_args *a)
 {
     struct page_info *page;
@@ -97,6 +172,9 @@ static void populate_physmap(struct memo
     xen_pfn_t gpfn, mfn;
     struct domain *d = a->domain;
     unsigned int cpu = select_local_cpu(d);
+    unsigned int extent_order;
+    unsigned long incr;
+    unsigned long o;
 
     if ( !guest_handle_okay(a->extent_list, a->nr_extents) )
         return;
@@ -105,7 +183,7 @@ static void populate_physmap(struct memo
          !multipage_allocation_permitted(current->domain) )
         return;
 
-    for ( i = a->nr_done; i < a->nr_extents; i++ )
+    for ( i = a->nr_done; i < a->nr_extents; i += incr )
     {
         if ( hypercall_preempt_check() )
         {
@@ -115,33 +193,67 @@ static void populate_physmap(struct memo
 
         if ( unlikely(__copy_from_guest_offset(&gpfn, a->extent_list, i, 1)) )
             goto out;
-
-        page = __alloc_domheap_pages(d, cpu, a->extent_order, a->memflags);
+        extent_order = a->extent_order;
+        if (!extent_order)
+        {
+            extent_order = try_larger_extents(d, gpfn, a->extent_list,
+                                              i, a->nr_extents, 0);
+            if (extent_order == ~0)
+                goto out;
+        }
+
+        page = __alloc_domheap_pages(d, cpu, extent_order, a->memflags);
         if ( unlikely(page == NULL) ) 
         {
-            gdprintk(XENLOG_INFO, "Could not allocate order=%d extent: "
-                     "id=%d memflags=%x (%ld of %d)\n",
-                     a->extent_order, d->domain_id, a->memflags,
-                     i, a->nr_extents);
-            goto out;
-        }
+            if (extent_order != a->extent_order)
+            {
+                do
+                {
+                    extent_order = try_larger_extents(d, gpfn, a->extent_list,
+                                                      i, a->nr_extents,
+                                                      extent_order);
+                    page = __alloc_domheap_pages(d, cpu, extent_order,
+                                                 a->memflags);
+                    if (page)
+                        break;
+                }
+                while (extent_order);
+            }
+            if ( unlikely(page == NULL) )
+            {
+                gdprintk(XENLOG_INFO, "Could not allocate order=%d extent: "
+                         "id=%d memflags=%x (%ld of %d)\n",
+                         a->extent_order, d->domain_id, a->memflags,
+                         i, a->nr_extents);
+                goto out;
+            }
+        }
+        o = 1 << extent_order;
+        if (extent_order == a->extent_order)
+            incr = 1;
+        else
+            incr = o;
 
         mfn = page_to_mfn(page);
 
         if ( unlikely(paging_mode_translate(d)) )
         {
-            for ( j = 0; j < (1 << a->extent_order); j++ )
-                if ( guest_physmap_add_page(d, gpfn + j, mfn + j) )
+            if ( guest_physmap_add_page_order(d, gpfn, mfn, extent_order) )
+                goto out;
+        }
+        else
+        {
+            for ( j = 0; j < o; j++ )
+                set_gpfn_from_mfn(mfn + j, gpfn + j);
+
+            /* Inform the domain of the new page's machine address. */
+            for ( j = 0; j < incr; j++ )
+            {
+                if ( unlikely(__copy_to_guest_offset(a->extent_list, i + j,
+                                                     &mfn, 1)) )
                     goto out;
-        }
-        else
-        {
-            for ( j = 0; j < (1 << a->extent_order); j++ )
-                set_gpfn_from_mfn(mfn + j, gpfn + j);
-
-            /* Inform the domain of the new page's machine address. */ 
-            if ( unlikely(__copy_to_guest_offset(a->extent_list, i, &mfn, 1)) )
-                goto out;
+                mfn++;
+            }
         }
     }
 
diff -r 1b863ae2bf1e xen/include/asm-x86/p2m.h
--- a/xen/include/asm-x86/p2m.h	Wed Dec 05 09:59:23 2007 +0000
+++ b/xen/include/asm-x86/p2m.h	Thu Dec 06 19:27:24 2007 -0600
@@ -93,6 +93,9 @@ static inline p2m_type_t p2m_flags_to_ty
     return (flags >> 9) & 0x7;
 }
  
+/* Read another domain's P2M table, mapping pages as we go */
+mfn_t gfn_to_mfn_foreign(struct domain *d, unsigned long gfn, p2m_type_t *t);
+
 /* Read the current domain's p2m table (through the linear mapping). */
 static inline mfn_t gfn_to_mfn_current(unsigned long gfn, p2m_type_t *t)
 {
@@ -102,6 +105,9 @@ static inline mfn_t gfn_to_mfn_current(u
      * XXX marked as RAM was considered to be emulated MMIO space.
      * XXX Once we start explicitly registering MMIO regions in the p2m 
      * XXX we will return p2m_invalid for unmapped gfns */
+
+    if ( paging_mode_hap(current->domain) )
+        return gfn_to_mfn_foreign(current->domain, gfn, t);
 
     if ( gfn <= current->domain->arch.p2m.max_mapped_pfn )
     {
@@ -131,9 +137,6 @@ static inline mfn_t gfn_to_mfn_current(u
     *t = p2mt;
     return mfn;
 }
-
-/* Read another domain's P2M table, mapping pages as we go */
-mfn_t gfn_to_mfn_foreign(struct domain *d, unsigned long gfn, p2m_type_t *t);
 
 /* General conversion function from gfn to mfn */
 #define gfn_to_mfn(d, g, t) _gfn_to_mfn((d), (g), (t))
@@ -201,8 +204,16 @@ void p2m_teardown(struct domain *d);
 void p2m_teardown(struct domain *d);
 
 /* Add a page to a domain's p2m table */
-int guest_physmap_add_entry(struct domain *d, unsigned long gfn,
-                             unsigned long mfn, p2m_type_t t);
+int __guest_physmap_add_entry(struct domain *d, unsigned long gfn,
+                              unsigned long mfn, p2m_type_t t,
+                              unsigned int order);
+
+static inline int
+guest_physmap_add_entry(struct domain *d, unsigned long gfn,
+                        unsigned long mfn, p2m_type_t t)
+{
+    return __guest_physmap_add_entry(d, gfn, mfn, t, 0);
+}
 
 /* Untyped version for RAM only, for compatibility 
  *
@@ -212,6 +223,14 @@ static inline int guest_physmap_add_page
                                          unsigned long mfn)
 {
     return guest_physmap_add_entry(d, gfn, mfn, p2m_ram_rw);
+}
+
+static inline int guest_physmap_add_page_order(struct domain *d,
+                                               unsigned long gfn,
+                                               unsigned long mfn,
+                                               unsigned int order)
+{
+    return __guest_physmap_add_entry(d, gfn, mfn, p2m_ram_rw, order);
 }
 
 /* Remove a page from a domain's p2m table */

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel


* RE: [RFC][PATCH]Large Page Support for HAP
  2007-11-16 17:53   ` Huang2, Wei
  2007-11-16 18:03     ` Keir Fraser
@ 2007-11-29 18:48     ` Byrne, John (HP Labs)
  1 sibling, 0 replies; 19+ messages in thread
From: Byrne, John (HP Labs) @ 2007-11-29 18:48 UTC (permalink / raw)
  To: Huang2, Wei, xen-devel@lists.xensource.com; +Cc: Tim Deegan

[-- Attachment #1.1: Type: text/plain, Size: 4881 bytes --]

Wei,

Sorry for being sluggish getting back to you, but my code was not working and I lost a week due to networking issues. (I probably could have debugged my code faster if I'd read your changes more carefully.) I have nothing so grand as a design; it is a hack to test 2M and 1G super-page performance on a random page-fault/TLB-miss benchmark. What I was hoping for was to have your code transparently support 1G pages on the assumption that their performance would be far better than 2M pages in this extreme case. Unfortunately for me, on the B1 rev CPU I have, I cannot see any difference between 2M and 1G pages. I saw something in one document about page splintering when the guest uses smaller pages than the NPT. Is this the issue? Do NPT super-pages not make any performance difference if they are larger than the guest pages?

Thanks,

John Byrne

________________________________
From: Huang2, Wei [mailto:Wei.Huang2@amd.com]
Sent: Friday, November 16, 2007 9:54 AM
To: Byrne, John (HP Labs); xen-devel@lists.xensource.com
Cc: Tim Deegan
Subject: RE: [Xen-devel] [RFC][PATCH]Large Page Support for HAP

John,

If you have a better design, please share it with us and I will be happy to work with you. :-) I agree that xc_hvm_build.c does not have to be modified if memory.c is smart enough to scan all of the page_array information. One concern, though, is that the Xen tools sometimes really do want to create mappings at 4KB granularity instead of using large pages. That requires extra information to be passed from the tools (e.g., xc_hvm_build.c) to memory.c.

-Wei

________________________________
From: Byrne, John (HP Labs) [mailto:john.l.byrne@hp.com]
Sent: Friday, November 16, 2007 11:41 AM
To: Huang2, Wei; xen-devel@lists.xensource.com
Cc: Tim Deegan
Subject: RE: [Xen-devel] [RFC][PATCH]Large Page Support for HAP

Wei,

I have been hacking at this, too, since I am interested in trying 1GB pages to see what they can do. After I dug myself into a hole, I restarted from the beginning and am trying a different approach from modifying xc_hvm_build.c: modifying populate_physmap() to opportunistically allocate large pages where possible. I just thought I'd mention it.

John Byrne

________________________________
From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Huang2, Wei
Sent: Thursday, November 15, 2007 8:26 AM
To: xen-devel@lists.xensource.com
Cc: Tim Deegan
Subject: [Xen-devel] [RFC][PATCH]Large Page Support for HAP

I implemented a preliminary version of HAP large page support. My testing showed that 32-bit PAE and 64-bit both worked well. I also saw a decent performance improvement for certain benchmarks.

So before I go too far, I am sending this patch to the community for review and comments. The patch applies to xen-unstable changeset 16281. I will redo it after collecting all ideas.

Thanks,

-Wei

============
DESIGN IDEAS:
1. Large page requests
- xc_hvm_build.c requests large pages (2MB for now) while starting guests.
- memory.c handles large page requests. If it cannot satisfy a request, it falls back to 4KB pages.
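
The fallback in item 1 can be sketched roughly as below. This is an illustration only, not code from the patch: populate_range(), alloc_extent_fn and the two demo allocators are hypothetical names, and order 9 (512 x 4KB = 2MB) is assumed for the large page size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ORDER_4K 0u   /* 2^0 pages = 4KB */
#define ORDER_2M 9u   /* 2^9 pages = 2MB */

/* Hypothetical allocator hook: returns true if a contiguous extent of
 * 2^order pages could be allocated and mapped at gfn. */
typedef bool (*alloc_extent_fn)(unsigned long gfn, unsigned int order);

/* Demo allocators (assumptions for illustration, not Xen functions). */
bool alloc_always(unsigned long gfn, unsigned int order)
{
    (void)gfn; (void)order;
    return true;
}

bool alloc_small_only(unsigned long gfn, unsigned int order)
{
    (void)gfn;
    return order == ORDER_4K;   /* pretend no 2MB extents are free */
}

/* Populate [gfn, gfn + nr_pages), preferring 2MB extents and falling
 * back to 4KB pages. Returns the number of 2MB extents used, or -1 if
 * even a 4KB allocation failed. */
long populate_range(unsigned long gfn, unsigned long nr_pages,
                    alloc_extent_fn alloc)
{
    const unsigned long pages_2m = 1ul << ORDER_2M;
    const unsigned long end = gfn + nr_pages;
    long large = 0;

    assert(alloc != NULL);

    while (gfn < end) {
        bool aligned = (gfn & (pages_2m - 1)) == 0;

        /* Try a large page only when gfn is 2MB-aligned and a whole
         * 2MB extent still fits in the requested range. */
        if (aligned && end - gfn >= pages_2m && alloc(gfn, ORDER_2M)) {
            large++;
            gfn += pages_2m;
            continue;
        }

        /* Fallback: a single 4KB page. */
        if (!alloc(gfn, ORDER_4K))
            return -1;
        gfn++;
    }
    return large;
}
```

With alloc_always, a 1024-page range starting at gfn 0 is covered by two 2MB extents; with alloc_small_only the same range is covered entirely by 4KB pages.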

2. P2M table
- The P2M code takes the page size order as a parameter; it builds the P2M table (setting the PSE bit, etc.) according to the page size.
- Other related functions (such as p2m_audit()) handle the table based on page size too.
- Page split/merge
** A large page will be split into 4KB pages in the P2M table if needed. For instance, if set_p2m_entry() handles a 4KB page but finds that the PSE/PRESENT bits are set, it will split the large page into 4KB pages.
** There is NO merge from 4KB pages back to a large page. Since large pages are only created at the very beginning, in guest_physmap_add(), this is OK for now.
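
As a rough sketch of the split step, assuming a simplified entry layout (flags in the low 12 bits, frame number above them, no NX or PAT handling). split_large_entry() and the macro names are hypothetical and are not taken from the patch:

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PRESENT  (1ull << 0)
#define PTE_PSE      (1ull << 7)   /* page-size bit in an L2 entry */
#define PAGE_SHIFT   12
#define L1_ENTRIES   512           /* 4KB entries per 2MB (PAE / 64-bit) */
#define FLAG_MASK    0xfffull      /* simplified: low 12 bits are flags */

/* Split a present 2MB L2 entry into 512 4KB L1 entries.
 * l1 is the new L1 table to fill; l1_mfn is the frame that holds it.
 * Returns the replacement L2 entry, which points at the L1 table and
 * no longer has PSE set. */
uint64_t split_large_entry(uint64_t l2e, uint64_t l1[L1_ENTRIES],
                           uint64_t l1_mfn)
{
    assert((l2e & PTE_PRESENT) && (l2e & PTE_PSE));

    /* Carry the flags over, minus PSE (bit 7 means something else in
     * an L1 entry). */
    uint64_t flags = (l2e & FLAG_MASK) & ~PTE_PSE;
    uint64_t base_mfn = (l2e & ~FLAG_MASK) >> PAGE_SHIFT;

    /* Each 4KB entry maps the next frame of the former 2MB page. */
    for (unsigned int i = 0; i < L1_ENTRIES; i++)
        l1[i] = ((base_mfn + i) << PAGE_SHIFT) | flags;

    return (l1_mfn << PAGE_SHIFT) | flags;
}
```

Because the new entries map exactly the same frames, the split changes only the granularity of the mapping, not the translation itself.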

3. HAP
- To access the PSE bit, the L2 pages of the P2M table are installed in the linear mapping at SH_LINEAR_PT_VIRT_START. We borrow this address space since it was not used.

4. gfn_to_mfn translation (P2M)
- gfn_to_mfn_foreign() traverses the P2M table and handles address translation correctly based on the PSE bit.
- gfn_to_mfn_current() accesses SH_LINEAR_PT_VIRT_START to check the PSE bit. If it is set, we handle the translation using the large page. Otherwise, it falls back to the normal RO_MPT_VIRT_START address space to access the P2M L1 pages.
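
The PSE-dependent lookup in item 4 could look roughly like this. It is a sketch under the same simplified entry layout (flags in the low 12 bits, frame number above them); lookup_mfn() and INVALID_MFN_SKETCH are hypothetical names, and the real functions also return a p2m type, which is omitted here:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_PRESENT        (1ull << 0)
#define PTE_PSE            (1ull << 7)
#define PAGE_SHIFT         12
#define L1_INDEX_MASK      0x1fful      /* low 9 bits of the gfn */
#define INVALID_MFN_SKETCH (~0ull)

/* Translate gfn given the L2 entry that covers it and, for the
 * non-PSE case, the L1 table that entry points to. */
uint64_t lookup_mfn(uint64_t l2e, const uint64_t *l1, unsigned long gfn)
{
    if (!(l2e & PTE_PRESENT))
        return INVALID_MFN_SKETCH;

    if (l2e & PTE_PSE) {
        /* Large page: the entry holds the 2MB-aligned base frame and
         * the low 9 bits of the gfn select the 4KB frame inside it. */
        uint64_t base = (l2e >> PAGE_SHIFT) & ~(uint64_t)L1_INDEX_MASK;
        return base + (gfn & L1_INDEX_MASK);
    }

    /* 4KB page: index the L1 table as before. */
    assert(l1 != NULL);
    uint64_t l1e = l1[gfn & L1_INDEX_MASK];
    if (!(l1e & PTE_PRESENT))
        return INVALID_MFN_SKETCH;
    return l1e >> PAGE_SHIFT;
}
```

The large-page branch never touches an L1 table, which is why the PSE bit has to be visible to the lookup before it decides which path to take.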

5. M2P translation
- Same as before: M2P translation still happens at the 4KB level.

AREAS NEEDING COMMENTS:
1. Large page for 32bit mode
- 32-bit mode uses 4MB large pages. This is very annoying for xc_hvm_build.c. I don't want to create another 4MB page_array for it.
- Because of this, this area has not been tested very well. I expect changes soon.

2. Shadow paging
- This implementation will affect shadow mode, especially at xc_hvm_build.c and memory.c.
- Where and how to avoid affecting shadow?

3. Turn it on/off
- Do we want to be able to turn this feature on/off through an option (a kernel option or anything else)?

4. Other missing areas?
===========



* Re: [RFC][PATCH]Large Page Support for HAP
  2007-11-15 16:26 [RFC][PATCH]Large Page Support for HAP Huang2, Wei
                   ` (2 preceding siblings ...)
  2007-11-16 17:40 ` Byrne, John (HP Labs)
@ 2007-11-19 20:27 ` Stephen C. Tweedie
  2007-11-20 10:27   ` Keir Fraser
  3 siblings, 1 reply; 19+ messages in thread
From: Stephen C. Tweedie @ 2007-11-19 20:27 UTC (permalink / raw)
  To: Huang2, Wei; +Cc: Stephen Tweedie, Tim Deegan, xen-devel@lists.xensource.com

Hi,

On Thu, 2007-11-15 at 10:26 -0600, Huang2, Wei wrote:

> DESIGN IDEAS:
> 1. Large page requests
> - xc_hvm_build.c requests large page (2MB for now) while starting
> guests
> - memory.c handles large page requests. If it can not handle it, falls
> back to 4KB pages.

It makes me uncomfortable if the guest can't be sure that a PSE request
is actually being honoured by the hardware.

A guest OS has to go to a lot of trouble to use large pages.  Such pages
upset the normal page recycling of the guest, they are hard to
recycle... but the guest expects that the compromises are worth it
because large pages are more efficient at the hardware level.

So if the HV is only going to supply them on a best-effort basis --- if
a guest cannot actually rely on a large-page request being honoured ---
then it's not clear whether this is a net benefit or a net cost to the
guest.

--Stephen


* Re: [RFC][PATCH]Large Page Support for HAP
  2007-11-19 20:27 ` Stephen C. Tweedie
@ 2007-11-20 10:27   ` Keir Fraser
  2007-11-20 11:56     ` Stephen C. Tweedie
  0 siblings, 1 reply; 19+ messages in thread
From: Keir Fraser @ 2007-11-20 10:27 UTC (permalink / raw)
  To: Stephen C. Tweedie, Huang2, Wei; +Cc: Tim Deegan, xen-devel@lists.xensource.com

On 19/11/07 20:27, "Stephen C. Tweedie" <sct@redhat.com> wrote:

> It makes me uncomfortable if the guest can't be sure that a PSE request
> is actually being honoured by the hardware.
> 
> A guest OS has to go to a lot of trouble to use large pages.  Such pages
> upset the normal page recycling of the guest, they are hard to
> recycle... but the guest expects that the compromises are worth it
> because large pages are more efficient at the hardware level.
> 
> So if the HV is only going to supply them on a best-effort basis --- if
> a guest cannot actually rely on a large-page request being honoured ---
> then it's not clear whether this is a net benefit or a net cost to the
> guest.

An HVM guest always thinks it has big contiguous chunks of RAM. The
superpage mappings get shattered invisibly by the HV in the shadow page
tables only if 2M/4M allocations were not actually possible. This shattering
happens unconditionally right now, so what's being proposed is a net benefit
to HVM guests.

 -- Keir


* Re: [RFC][PATCH]Large Page Support for HAP
  2007-11-20 10:27   ` Keir Fraser
@ 2007-11-20 11:56     ` Stephen C. Tweedie
  2007-11-20 12:31       ` Ian Pratt
  0 siblings, 1 reply; 19+ messages in thread
From: Stephen C. Tweedie @ 2007-11-20 11:56 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Stephen Tweedie, Tim Deegan, Huang2, Wei,
	xen-devel@lists.xensource.com

Hi,

On Tue, 2007-11-20 at 10:27 +0000, Keir Fraser wrote:

> An HVM guest always thinks it has big contiguous chunks of RAM. The
> superpage mappings get shattered invisibly by the HV in the shadow page
> tables only if 2M/4M allocations were not actually possible. This shattering
> happens unconditionally right now, so what's being proposed is a net benefit
> to HVM guests.

If an HVM guest asks for a bigpage allocation and silently fails to get
it, then that is a net loss for the guest --- the guest takes all of the
pain for none of the benefits of bigpage.

So, you may be better off not offering bigpages at all than offering
them on a best-effort basis; at least that way the guest knows for sure
what resources it has available.

I'm not against supporting bigpages.  But if there's no way for a guest
to know for sure if it has actually _got_ big pages, then I'm not sure
how much use it is.

Note that this probably works fine for controlled benchmark scenarios
where you're running a guest on a single carefully-configured host with
matched bigpage reservations.  But in general, you need bigpages to
continue to work predictably over save/restore, migrate, balloon etc.
else they become a net cost, not a net gain, to the guest.

--Stephen


* RE: [RFC][PATCH]Large Page Support for HAP
  2007-11-20 11:56     ` Stephen C. Tweedie
@ 2007-11-20 12:31       ` Ian Pratt
  2007-11-20 17:19         ` Stephen C. Tweedie
  0 siblings, 1 reply; 19+ messages in thread
From: Ian Pratt @ 2007-11-20 12:31 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Tim Deegan, sct, Ian Pratt, Huang2, Wei, xen-devel

> On Tue, 2007-11-20 at 10:27 +0000, Keir Fraser wrote:
> 
> > An HVM guest always thinks it has big contiguous chunks of RAM. The
> > superpage mappings get shattered invisibly by the HV in the shadow page
> > tables only if 2M/4M allocations were not actually possible. This shattering
> > happens unconditionally right now, so what's being proposed is a net benefit
> > to HVM guests.
> 
> If an HVM guest asks for a bigpage allocation and silently fails to get
> it, then that is a net loss for the guest --- the guest takes all of the
> pain for none of the benefits of bigpage.
> 
> So, you may be better off not offering bigpages at all than offering
> them on a best-effort basis; at least that way the guest knows for sure
> what resources it has available.

Unfortunately, a number of guests assume big pages without actually
checking for the feature bit explicitly. For example x86_64 Linux
running HVM will assume it has big pages. We're able to hack this
assumption out of it in PV mode.  IIRC Windows makes the same big page
assumption.

Ian


* RE: [RFC][PATCH]Large Page Support for HAP
  2007-11-20 12:31       ` Ian Pratt
@ 2007-11-20 17:19         ` Stephen C. Tweedie
  2007-11-20 17:22           ` Keir Fraser
  0 siblings, 1 reply; 19+ messages in thread
From: Stephen C. Tweedie @ 2007-11-20 17:19 UTC (permalink / raw)
  To: Ian Pratt
  Cc: Tim Deegan, Stephen Tweedie, Huang2, Wei,
	xen-devel@lists.xensource.com

Hi,

On Tue, 2007-11-20 at 12:31 +0000, Ian Pratt wrote:

> Unfortunately, a number of guests assume big pages without actually
> checking for the feature bit explicitly. For example x86_64 Linux
> running HVM will assume it has big pages.

Yes, but there's a _big_ difference between the opportunistic uses Linux
makes of PSE (bigpage mappings for large static areas like the kernel
text), and places where it is an explicit part of the ABI made to
applications, as in the case of hugetlbfs.

It's the latter case which concerns me, as hugetlbfs is basically an
explicit contract between the guest OS and an application running on it.
Providing faked PSE at that level is something that would be best
avoided.

I don't have any objection to doing opportunistic PSE for the former
case.  But telling the two apart is rather hard.

--Stephen


* Re: [RFC][PATCH]Large Page Support for HAP
  2007-11-20 17:19         ` Stephen C. Tweedie
@ 2007-11-20 17:22           ` Keir Fraser
  2007-11-20 17:40             ` Stephen C. Tweedie
  0 siblings, 1 reply; 19+ messages in thread
From: Keir Fraser @ 2007-11-20 17:22 UTC (permalink / raw)
  To: Stephen C. Tweedie, Ian Pratt
  Cc: Tim Deegan, Huang2, Wei, xen-devel@lists.xensource.com

On 20/11/07 17:19, "Stephen C. Tweedie" <sct@redhat.com> wrote:

> Yes, but there's a _big_ difference between the opportunistic uses Linux
> makes of PSE (bigpage mappings for large static areas like the kernel
> text), and places where it is an explicit part of the ABI made to
> applications, as in the case of hugetlbfs.
> 
> It's the latter case which concerns me, as hugetlbfs is basically an
> explicit contract between the guest OS and an application running on it.
> Providing faked PSE at that level is something that would be best
> avoided.
> 
> I don't have any objection to doing opportunistic PSE for the former
> case.  But telling the two apart is rather hard.

Support for PV guests would be explicit. Hugetlbfs would know whether it had
superpages or not, and can fail in the latter case.

 -- Keir


* Re: [RFC][PATCH]Large Page Support for HAP
  2007-11-20 17:22           ` Keir Fraser
@ 2007-11-20 17:40             ` Stephen C. Tweedie
  2007-11-20 17:44               ` Keir Fraser
  0 siblings, 1 reply; 19+ messages in thread
From: Stephen C. Tweedie @ 2007-11-20 17:40 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Tim Deegan, Stephen Tweedie, Ian Pratt, Huang2, Wei,
	xen-devel@lists.xensource.com

Hi,

On Tue, 2007-11-20 at 17:22 +0000, Keir Fraser wrote:

> > I don't have any objection to doing opportunistic PSE for the former
> > case.  But telling the two apart is rather hard.
> 
> Support for PV guests would be explicit. Hugetlbfs would know whether it had
> superpages or not, and can fail in the latter case.

Yep, I'd be assuming that.  It's the FV case where it's harder to see
how we communicate this information back to the guest.

--Stephen


* Re: [RFC][PATCH]Large Page Support for HAP
  2007-11-20 17:40             ` Stephen C. Tweedie
@ 2007-11-20 17:44               ` Keir Fraser
  2007-11-26 17:26                 ` Stephen C. Tweedie
  0 siblings, 1 reply; 19+ messages in thread
From: Keir Fraser @ 2007-11-20 17:44 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Tim Deegan, Ian Pratt, Huang2, Wei, xen-devel@lists.xensource.com

On 20/11/07 17:40, "Stephen C. Tweedie" <sct@redhat.com> wrote:

>>> I don't have any objection to doing opportunistic PSE for the former
>>> case.  But telling the two apart is rather hard.
>> 
>> Support for PV guests would be explicit. Hugetlbfs would know whether it had
>> superpages or not, and can fail in the latter case.
> 
> Yep, I'd be assuming that.  It's the FV case where it's harder to see
> how we communicate this information back to the guest.

Without PV'ing up the guest I don't see how it is possible. If we advertise
support for long mode, then that must imply support for PSE.

 -- Keir


* Re: [RFC][PATCH]Large Page Support for HAP
  2007-11-20 17:44               ` Keir Fraser
@ 2007-11-26 17:26                 ` Stephen C. Tweedie
  0 siblings, 0 replies; 19+ messages in thread
From: Stephen C. Tweedie @ 2007-11-26 17:26 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Tim Deegan, Stephen Tweedie, Ian Pratt, Huang2, Wei,
	xen-devel@lists.xensource.com

Hi,

On Tue, 2007-11-20 at 17:44 +0000, Keir Fraser wrote:

> > Yep, I'd be assuming that.  It's the FV case where it's harder to see
> > how we communicate this information back to the guest.
> 
> Without PV'ing up the guest I don't see how it is possible. If we advertise
> support for long mode, then that must imply support for PSE.

Thinking about this --- do we have any way to let the admin easily tell
how many PSE pages a guest has successfully created vs. how many have
been faked?  If an admin can't rely on bigpages, at least they will want
to be able to find out when they are working and when they are not.

--Stephen


end of thread, other threads:[~2007-12-07  1:43 UTC | newest]

Thread overview: 19+ messages
2007-11-15 16:26 [RFC][PATCH]Large Page Support for HAP Huang2, Wei
2007-11-15 16:36 ` Keir Fraser
2007-11-15 17:36   ` Huang2, Wei
2007-11-15 17:42     ` Keir Fraser
     [not found] ` <20071115185929.GG26000@york.uk.xensource.com>
2007-11-15 19:33   ` Huang2, Wei
2007-11-16 17:40 ` Byrne, John (HP Labs)
2007-11-16 17:53   ` Huang2, Wei
2007-11-16 18:03     ` Keir Fraser
2007-12-07  1:43       ` John Byrne
2007-11-29 18:48     ` Byrne, John (HP Labs)
2007-11-19 20:27 ` Stephen C. Tweedie
2007-11-20 10:27   ` Keir Fraser
2007-11-20 11:56     ` Stephen C. Tweedie
2007-11-20 12:31       ` Ian Pratt
2007-11-20 17:19         ` Stephen C. Tweedie
2007-11-20 17:22           ` Keir Fraser
2007-11-20 17:40             ` Stephen C. Tweedie
2007-11-20 17:44               ` Keir Fraser
2007-11-26 17:26                 ` Stephen C. Tweedie
