* [PATCH] Fix performance problems with mprotect()
@ 2008-01-05 4:49 Bruce Rogers
2008-01-05 13:35 ` John Levon
` (3 more replies)
0 siblings, 4 replies; 12+ messages in thread
From: Bruce Rogers @ 2008-01-05 4:49 UTC (permalink / raw)
To: xen-devel
[-- Attachment #1: Type: text/plain, Size: 1921 bytes --]
While working on a database scaling problem using a SAP database test suite we discovered that these enterprise level large shared memory databases are very heavy users of mprotect(), to the extent that the performance overhead in current Xenolinux impacts scaling beyond a few cpus quite badly. A single cpu run under Xen has a nominal impact, but scaling out to 8 cpus results in a performance of less than 10% of native throughput. Ouch!
The problem was isolated to the relatively high overhead of how mprotect() updates ptes, and the lack of efficient batching opportunities as presently implemented. By adding a hypercall which batches pte updates such that hardware flags updates (accessed or dirty bits) are not lost, and redoing the performance critical portion of mprotect() to take advantage of that, performance has improved to where it is essentially equivalent to native performance with this SAP database test suite. I haven't tested other database implementations, but I would imagine they would have similar results.
This is obviously quite an improvement and should allow Xen to be deployed for more enterprise level workloads.
There are actually 2 new hypercalls included in this patch - one that batches flags-only changes to pte's within a single page table, and another which allows the full pte to be changed - both "atomically" as described above.
I imagine there are other places in Linux that could benefit from these hypercalls, but these patches focus on the mprotect performance problem only.
The hypervisor patch was mostly written and tested with the 3.0.4 era hypervisor that ships with SLES 10, and then ported and tested on current xen-unstable.
The Linux patch was written and tested on 2.6.16 and made apply to the 2.6.18 tree without further testing.
Feedback and comments are appreciated.
Signed-off-by: Bruce Rogers <brogers@novell.com>
- Bruce Rogers
[-- Attachment #2: xen_mprotect_perf.patch --]
[-- Type: text/plain, Size: 15661 bytes --]
diff -r 2491691e3e69 xen/include/public/xen.h
--- a/xen/include/public/xen.h Sat Dec 29 17:57:47 2007 +0000
+++ b/xen/include/public/xen.h Fri Jan 04 20:29:31 2008 -0700
@@ -168,9 +168,40 @@
* ptr[:2] -- Machine address within the frame whose mapping to modify.
* The frame must belong to the FD, if one is specified.
* val -- Value to write into the mapping entry.
+ *
+ * ptr[1:0] == MMU_ATOMIC_PT_UPDATE:
+ * Updates a range of entries in an L1 page table atomically such that any
+ * bits which the hardware mmu could have modified are not lost. The source of
+ * the updates is a range of values specified by val. If the new table entry is
+ * valid/present, the mapped frame must belong to the FD, if an FD has been
+ * specified. If attempting to map an I/O page then the caller assumes the
+ * privilege of the FD.
+ * FD == DOMID_IO: Permit /only/ I/O mappings, at the priv level of the caller.
+ * FD == DOMID_XEN: Map restricted areas of Xen's heap space.
+ * ptr[:2] -- Machine address of the page-table entry to modify.
+ * val -- Value to write.
+ *
+ * ptr[1:0] == MMU_FLAG_RANGE_UPDATE:
+ * Updates a range of entries in an L1 page table atomically such that any
+ * bits which the hardware mmu could have modified are not lost. The update
+ * value created by updating certain buts in the the current page table entry,
+ * as specified by val. If the new table entry is valid/present, the mapped
+ * frame must belong to the FD, if an FD has been specified. If attempting to
+ * map an I/O page then the caller assumes the privilege of the FD.
+ * FD == DOMID_IO: Permit /only/ I/O mappings, at the priv level of the caller.
+ * FD == DOMID_XEN: Map restricted areas of Xen's heap space.
+ * ptr[:2] -- Machine address of the first page-table entry to modify.
+ * val[11:0] = mask pte bits 11:0 (1 means update; 0 means don't change)
+ * val[23:12] = mask pte bits 63:52
+ * val[35:24] = new pte bits 11:0 (where corresp. Mask bit == 1)
+ * val[47:36] = new pte bits 63:52
+ * val[57:48] = number of ptes in range
+ * val[63:58] = MBZ
*/
#define MMU_NORMAL_PT_UPDATE 0 /* checked '*ptr = val'. ptr is MA. */
#define MMU_MACHPHYS_UPDATE 1 /* ptr = MA of frame to modify entry for */
+#define MMU_ATOMIC_PT_UPDATE 2 /* checked atomic '*ptr = val'. ptr is MA.*/
+#define MMU_FLAG_RANGE_UPDATE 3 /* range atomic checked '*ptr flags = val'*/
/*
* MMU EXTENDED OPERATIONS
diff -r 2491691e3e69 xen/arch/x86/mm.c
--- a/xen/arch/x86/mm.c Sat Dec 29 17:57:47 2007 +0000
+++ b/xen/arch/x86/mm.c Fri Jan 04 20:35:08 2008 -0700
@@ -1338,6 +1338,97 @@ static void free_l4_table(struct page_in
/* How to write an entry to the guest pagetables.
* Returns 0 for failure (pointer not valid), 1 for success. */
+static inline int update_intpte_sync(intpte_t *p,
+ intpte_t old,
+ intpte_t new,
+ unsigned long mfn,
+ struct vcpu *v)
+{
+ int rv;
+ intpte_t t = old;
+ for ( ; ; )
+ {
+ rv = paging_cmpxchg_guest_entry(v, p, &t, new, _mfn(mfn));
+ if ( unlikely(rv == 0) )
+ {
+ MEM_LOG("Failed to update %" PRIpte " -> %" PRIpte
+ ": saw %" PRIpte, old, new, t);
+ break;
+ }
+
+ if ( t == old )
+ break;
+
+ /* Allowed to change in Accessed/Dirty flags only. */
+ BUG_ON((t ^ old) & ~(intpte_t)(_PAGE_ACCESSED|_PAGE_DIRTY));
+
+ new |= t ^ old;
+ old = t;
+ }
+ return rv;
+}
+
+/* Macro that wraps the appropriate type-changes around update_intpte().
+ * Arguments are: type, ptr, old, new, mfn, vcpu */
+#define UPDATE_ENTRY_SYNC(_t,_p,_o,_n,_m,_v) \
+ update_intpte_sync(&_t ## e_get_intpte(*(_p)), \
+ _t ## e_get_intpte(_o), _t ## e_get_intpte(_n), \
+ (_m), (_v))
+
+/* Update the L1 entry at pl1e to new value nl1e. */
+static int mod_l1_entry_sync(l1_pgentry_t *pl1e, l1_pgentry_t ol1e,
+ l1_pgentry_t nl1e, unsigned long gl1mfn)
+{
+ struct vcpu *curr = current;
+ struct domain *d = curr->domain;
+ unsigned long mfn;
+
+ if ( unlikely(paging_mode_refcounts(d)) )
+ return UPDATE_ENTRY_SYNC(l1, pl1e, ol1e, nl1e, gl1mfn, curr);
+
+ if ( l1e_get_flags(nl1e) & _PAGE_PRESENT )
+ {
+ /* Translate foreign guest addresses. */
+ mfn = gmfn_to_mfn(FOREIGNDOM, l1e_get_pfn(nl1e));
+ if ( unlikely(mfn == INVALID_MFN) )
+ return 0;
+ ASSERT((mfn & ~(PADDR_MASK >> PAGE_SHIFT)) == 0);
+ nl1e = l1e_from_pfn(mfn, l1e_get_flags(nl1e));
+
+ if ( unlikely(l1e_get_flags(nl1e) & l1_disallow_mask(d)) )
+ {
+ MEM_LOG("Bad L1 flags %x",
+ l1e_get_flags(nl1e) & l1_disallow_mask(d));
+ return 0;
+ }
+
+ adjust_guest_l1e(nl1e, d);
+
+ /* Fast path for identical mapping, r/w and presence. */
+ if ( !l1e_has_changed(ol1e, nl1e, _PAGE_RW | _PAGE_PRESENT) )
+ return UPDATE_ENTRY_SYNC(l1, pl1e, ol1e, nl1e, gl1mfn, curr);
+
+ if ( unlikely(!get_page_from_l1e(nl1e, FOREIGNDOM)) )
+ return 0;
+
+ if ( unlikely(!UPDATE_ENTRY_SYNC(l1, pl1e, ol1e, nl1e, gl1mfn, curr)) )
+ {
+ put_page_from_l1e(nl1e, d);
+ return 0;
+ }
+ }
+ else
+ {
+ if ( unlikely(!UPDATE_ENTRY_SYNC(l1, pl1e, ol1e, nl1e, gl1mfn, curr)) )
+ return 0;
+ }
+
+ put_page_from_l1e(ol1e, d);
+ return 1;
+}
+
+/* How to write an entry to the guest pagetables.
+ * Returns 0 for failure (pointer not valid), 1 for success. */
static inline int update_intpte(intpte_t *p,
intpte_t old,
intpte_t new,
@@ -2403,11 +2494,13 @@ int do_mmu_update(
unsigned long gpfn, gmfn, mfn;
struct page_info *page;
int rc = 0, okay = 1, i = 0;
- unsigned int cmd, done = 0;
+ unsigned int cmd, done = 0, pte_cnt;
struct vcpu *v = current;
struct domain *d = v->domain;
unsigned long type_info;
struct domain_mmap_cache mapcache;
+ uint64_t mask_flags, add_flags;
+ l1_pgentry_t ol1e;
if ( unlikely(count & MMU_UPDATE_PREEMPTED) )
{
@@ -2577,6 +2670,217 @@ int do_mmu_update(
paging_mark_dirty(FOREIGNDOM, mfn);
put_page(mfn_to_page(mfn));
+ break;
+
+ case MMU_ATOMIC_PT_UPDATE:
+
+ rc = xsm_mmu_atomic_update(d, req.val);
+ if ( rc )
+ break;
+
+ req.ptr -= cmd;
+
+ gmfn = req.ptr >> PAGE_SHIFT;
+ mfn = gmfn_to_mfn(d, gmfn);
+
+ if ( unlikely(!get_page_from_pagenr(mfn, d)) )
+ {
+ MEM_LOG("Could not get page for normal update");
+ break;
+ }
+
+ va = map_domain_page_with_cache(mfn, &mapcache);
+ va = (void *)((unsigned long)va +
+ (unsigned long)(req.ptr & ~PAGE_MASK));
+ page = mfn_to_page(mfn);
+
+ switch ( (type_info = page->u.inuse.type_info) & PGT_type_mask )
+ {
+ case PGT_l1_page_table:
+ {
+ if ( paging_mode_refcounts(d) )
+ {
+ MEM_LOG("mmu update on auto-refcounted domain!");
+ break;
+ }
+
+ if ( unlikely(!get_page_type(
+ page, type_info & (PGT_type_mask|PGT_pae_xen_l2))) )
+ goto atomic_not_a_pt;
+
+ switch ( type_info & PGT_type_mask )
+ {
+ case PGT_l1_page_table:
+ {
+ l1_pgentry_t l1e = l1e_from_intpte(req.val);
+ if ( unlikely(__copy_from_user(&ol1e, (l1_pgentry_t *)va, sizeof(ol1e)) != 0) )
+ okay = 0;
+ else
+ {
+ okay = mod_l1_entry_sync(va, ol1e, l1e, mfn);
+ }
+ }
+ break;
+ case PGT_l2_page_table:
+ case PGT_l3_page_table:
+ case PGT_l4_page_table:
+ MEM_LOG("Not L1 page table");
+ break;
+ }
+
+ put_page_type(page);
+ }
+ break;
+
+ case PGT_l2_page_table:
+ case PGT_l3_page_table:
+ case PGT_l4_page_table:
+ MEM_LOG("Not L1 page table");
+ break;
+
+ default:
+ atomic_not_a_pt:
+ {
+ if ( unlikely(!get_page_type(page, PGT_writable_page)) )
+ break;
+
+ okay = paging_write_guest_entry(v, va, req.val, _mfn(mfn));
+
+ put_page_type(page);
+ }
+ break;
+ }
+
+ unmap_domain_page_with_cache(va, &mapcache);
+
+ put_page(page);
+ break;
+
+ case MMU_FLAG_RANGE_UPDATE:
+
+ rc = xsm_mmu_flag_range_update(d, req.val);
+ if ( rc )
+ break;
+
+ req.ptr -= cmd;
+
+ pte_cnt = (req.val >> 48) & 0x3ff;
+
+ if (pte_cnt == 0)
+ {
+ okay = 1;
+ break;
+ }
+ if ((req.ptr & ~PAGE_MASK) + (pte_cnt * sizeof(l1_pgentry_t)) > PAGE_SIZE)
+ {
+ // count extends beyond end of page table
+ MEM_LOG("Count value is too large");
+ break;
+ }
+
+ gmfn = req.ptr >> PAGE_SHIFT;
+ mfn = gmfn_to_mfn(d, gmfn);
+
+ page = mfn_to_page(mfn);
+
+ if ( unlikely(!get_page_from_pagenr(mfn, d)) )
+ {
+ MEM_LOG("Could not get page for normal update");
+ break;
+ }
+
+ va = map_domain_page_with_cache(mfn, &mapcache);
+
+ mask_flags = ((req.val & 0xfff000ULL) << 40) |
+ (req.val & 0xfffULL);
+ add_flags = ((req.val & 0xfff000000000ULL) << 16) |
+ ((req.val >> 24) & 0xfffULL);
+
+ switch ( (type_info = page->u.inuse.type_info) & PGT_type_mask )
+ {
+ case PGT_l1_page_table:
+ {
+ if ( paging_mode_refcounts(d) )
+ {
+ MEM_LOG("mmu update on auto-refcounted domain!");
+ break;
+ }
+
+ if ( unlikely(!get_page_type(
+ page, type_info & (PGT_type_mask|PGT_pae_xen_l2))) )
+ goto range_not_a_pt;
+
+ switch ( type_info & PGT_type_mask )
+ {
+ case PGT_l1_page_table:
+ {
+ va = (void *)((unsigned long)va +
+ (unsigned long)(req.ptr & ~PAGE_MASK) -
+ sizeof(l1_pgentry_t));
+ do
+ {
+ l1_pgentry_t ol1e;
+ l1_pgentry_t nl1e;
+
+ va += sizeof(l1_pgentry_t);
+ if ( unlikely(__copy_from_user(&ol1e, va, sizeof(ol1e)) != 0) )
+ {
+ okay = 0;
+ break;
+ }
+ nl1e = ol1e;
+ nl1e.l1 &= ~mask_flags;
+ nl1e.l1 |= add_flags;
+ okay = mod_l1_entry_sync(va, ol1e, nl1e, mfn);
+ if (!okay)
+ break;
+ } while (--pte_cnt);
+ }
+ break;
+ case PGT_l2_page_table:
+ case PGT_l3_page_table:
+ case PGT_l4_page_table:
+ MEM_LOG("Operation permitted on l1 page table only");
+ }
+
+ put_page_type(page);
+ }
+ break;
+
+ case PGT_l2_page_table:
+ case PGT_l3_page_table:
+ case PGT_l4_page_table:
+ MEM_LOG("Operation permitted on l1 page table only");
+ break;
+
+
+ default:
+ range_not_a_pt:
+ {
+ if ( unlikely(!get_page_type(page, PGT_writable_page)) )
+ break;
+ va = (void *)((unsigned long)va +
+ (unsigned long)(req.ptr & ~PAGE_MASK) -
+ sizeof(l1_pgentry_t));
+ do
+ {
+ l1_pgentry_t l1e;
+ va += sizeof(l1_pgentry_t);
+ l1e = *(l1_pgentry_t *)va;
+ l1e.l1 &= ~mask_flags;
+ l1e.l1 |= add_flags;
+ okay = paging_write_guest_entry(v, va, l1e.l1, _mfn(mfn));
+ if (!okay)
+ break;
+ } while (--pte_cnt);
+ put_page_type(page);
+ }
+ break;
+ }
+
+ unmap_domain_page_with_cache(va, &mapcache);
+
+ put_page(page);
break;
default:
diff -r 2491691e3e69 xen/include/xsm/xsm.h
--- a/xen/include/xsm/xsm.h Sat Dec 29 17:57:47 2007 +0000
+++ b/xen/include/xsm/xsm.h Fri Jan 04 20:36:12 2008 -0700
@@ -136,6 +136,8 @@ struct xsm_operations {
int (*mmu_machphys_update) (struct domain *d, unsigned long mfn);
int (*update_va_mapping) (struct domain *d, l1_pgentry_t pte);
int (*add_to_physmap) (struct domain *d1, struct domain *d2);
+ int (*mmu_atomic_update) (struct domain *d, intpte_t fpte);
+ int (*mmu_flag_range_update) (struct domain *d, intpte_t fpte);
#endif
};
@@ -532,6 +534,16 @@ static inline int xsm_add_to_physmap(str
{
return xsm_call(add_to_physmap(d1, d2));
}
+
+static inline int xsm_mmu_atomic_update (struct domain *d, intpte_t fpte)
+{
+ return xsm_call(mmu_atomic_update(d, fpte));
+}
+
+static inline int xsm_mmu_flag_range_update (struct domain *d, intpte_t fpte)
+{
+ return xsm_call(mmu_flag_range_update(d, fpte));
+}
#endif /* CONFIG_X86 */
#endif /* __XSM_H */
diff -r 2491691e3e69 xen/xsm/dummy.c
--- a/xen/xsm/dummy.c Sat Dec 29 17:57:47 2007 +0000
+++ b/xen/xsm/dummy.c Fri Jan 04 20:36:49 2008 -0700
@@ -385,6 +385,16 @@ static int dummy_add_to_physmap (struct
{
return 0;
}
+
+static int dummy_mmu_atomic_update (struct domain *d, intpte_t fpte)
+{
+ return 0;
+}
+
+static int dummy_mmu_flag_range_update (struct domain *d, intpte_t fpte)
+{
+ return 0;
+}
#endif
struct xsm_operations dummy_xsm_ops;
@@ -484,5 +494,7 @@ void xsm_fixup_ops (struct xsm_operation
set_to_dummy_if_null(ops, mmu_machphys_update);
set_to_dummy_if_null(ops, update_va_mapping);
set_to_dummy_if_null(ops, add_to_physmap);
+ set_to_dummy_if_null(ops, mmu_atomic_update);
+ set_to_dummy_if_null(ops, mmu_flag_range_update);
#endif
}
[-- Attachment #3: linux_mprotect_perf.patch --]
[-- Type: text/plain, Size: 6202 bytes --]
diff -r 61c96456a3e1 include/xen/interface/xen.h
--- a/include/xen/interface/xen.h Thu Dec 20 16:58:14 2007 +0000
+++ b/include/xen/interface/xen.h Fri Jan 04 20:41:56 2008 -0700
@@ -168,9 +168,40 @@
* ptr[:2] -- Machine address within the frame whose mapping to modify.
* The frame must belong to the FD, if one is specified.
* val -- Value to write into the mapping entry.
+ *
+ * ptr[1:0] == MMU_ATOMIC_PT_UPDATE:
+ * Updates a range of entries in an L1 page table atomically such that any
+ * bits which the hardware mmu could have modified are not lost. The source of
+ * the updates is a range of values specified by val. If the new table entry is
+ * valid/present, the mapped frame must belong to the FD, if an FD has been
+ * specified. If attempting to map an I/O page then the caller assumes the
+ * privilege of the FD.
+ * FD == DOMID_IO: Permit /only/ I/O mappings, at the priv level of the caller.
+ * FD == DOMID_XEN: Map restricted areas of Xen's heap space.
+ * ptr[:2] -- Machine address of the page-table entry to modify.
+ * val -- Value to write.
+ *
+ * ptr[1:0] == MMU_FLAG_RANGE_UPDATE:
+ * Updates a range of entries in an L1 page table atomically such that any
+ * bits which the hardware mmu could have modified are not lost. The update
+ * value created by updating certain buts in the the current page table entry,
+ * as specified by val. If the new table entry is valid/present, the mapped
+ * frame must belong to the FD, if an FD has been specified. If attempting to
+ * map an I/O page then the caller assumes the privilege of the FD.
+ * FD == DOMID_IO: Permit /only/ I/O mappings, at the priv level of the caller.
+ * FD == DOMID_XEN: Map restricted areas of Xen's heap space.
+ * ptr[:2] -- Machine address of the first page-table entry to modify.
+ * val[11:0] = mask pte bits 11:0 (1 means update; 0 means don't change)
+ * val[23:12] = mask pte bits 63:52
+ * val[35:24] = new pte bits 11:0 (where corresp. Mask bit == 1)
+ * val[47:36] = new pte bits 63:52
+ * val[57:48] = number of ptes in range
+ * val[63:58] = MBZ
*/
#define MMU_NORMAL_PT_UPDATE 0 /* checked '*ptr = val'. ptr is MA. */
#define MMU_MACHPHYS_UPDATE 1 /* ptr = MA of frame to modify entry for */
+#define MMU_ATOMIC_PT_UPDATE 2 /* checked atomic '*ptr = val'. ptr is MA.*/
+#define MMU_FLAG_RANGE_UPDATE 3 /* range atomic checked '*ptr flags = val'*/
/*
* MMU EXTENDED OPERATIONS
diff -r 61c96456a3e1 include/asm-i386/mach-xen/asm/hypervisor.h
--- a/include/asm-i386/mach-xen/asm/hypervisor.h Thu Dec 20 16:58:14 2007 +0000
+++ b/include/asm-i386/mach-xen/asm/hypervisor.h Fri Jan 04 20:43:00 2008 -0700
@@ -120,6 +120,11 @@ int xen_create_contiguous_region(
unsigned long vstart, unsigned int order, unsigned int address_bits);
void xen_destroy_contiguous_region(
unsigned long vstart, unsigned int order);
+
+struct mm_struct;
+
+int xen_change_pte_range(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, unsigned long end, pgprot_t newprot);
struct page;
diff -r 61c96456a3e1 arch/i386/mm/hypervisor.c
--- a/arch/i386/mm/hypervisor.c Thu Dec 20 16:58:14 2007 +0000
+++ b/arch/i386/mm/hypervisor.c Fri Jan 04 20:45:26 2008 -0700
@@ -548,3 +548,77 @@ int write_ldt_entry(void *ldt, int entry
mach_lp, (u64)entry_a | ((u64)entry_b<<32));
}
#endif
+
+#define MAX_BATCHED_FULL_PTES 32
+
+#define PTE_FLAGS 0xfff0000000000fffULL
+
+#define map_mask_bits(X) (((X & 0xfffULL) << 0) | \
+ ((X & PTE_FLAGS) >> 40))
+#define map_flag_bits(X) (((X & PTE_FLAGS) << 24) | \
+ ((X & PTE_FLAGS) >> 16))
+
+int xen_change_pte_range(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, unsigned long end, pgprot_t newprot)
+{
+ int rc = 0;
+ u64 nr_full_ptes = 0, nr_flag_ptes = 0;
+ mmu_update_t u[MAX_BATCHED_FULL_PTES], v;
+ u64 newval, flags, mask, mask_bits;
+ pte_t *pte;
+ spinlock_t *ptl;
+
+ mask_bits = ~(PTE_MASK | _PAGE_DIRTY);
+#if defined(CONFIG_X86_PAE) || defined(CONFIG_X86_64)
+ mask_bits |= ~__supported_pte_mask;
+#endif
+ mask = map_mask_bits(mask_bits);
+
+ pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+
+ for (;;) {
+ if (addr != end && pte_present(*pte)) {
+ newval = __pte_val(pte_modify(*pte, newprot));
+ if (((__pte_val(*pte) ^ newval) & ~PTE_FLAGS) ||
+ (nr_flag_ptes && (newval & PTE_FLAGS) != flags)) {
+ if (nr_flag_ptes) {
+ v.val |= nr_flag_ptes << 48;
+ if ((rc = HYPERVISOR_mmu_update(
+ &v, 1, NULL, DOMID_SELF)) != 0)
+ break;
+ nr_flag_ptes = 0;
+ }
+ u[nr_full_ptes].ptr = virt_to_machine(pte) | MMU_ATOMIC_PT_UPDATE;
+ u[nr_full_ptes].val = newval;
+ if (++nr_full_ptes == MAX_BATCHED_FULL_PTES) {
+ if ((rc = HYPERVISOR_mmu_update(
+ &u[0], nr_full_ptes, NULL, DOMID_SELF)) != 0)
+ break;
+ nr_full_ptes = 0;
+ }
+ } else if (nr_flag_ptes++ == 0) {
+ flags = newval & PTE_FLAGS;
+ v.ptr = virt_to_machine(pte) | MMU_FLAG_RANGE_UPDATE;
+ v.val = mask | map_flag_bits(flags);
+ }
+ } else {
+ if (nr_flag_ptes) {
+ v.val |= nr_flag_ptes << 48;
+ if ((rc = HYPERVISOR_mmu_update(
+ &v, 1, NULL, DOMID_SELF)) != 0)
+ break;
+ nr_flag_ptes = 0;
+ }
+ if (addr == end) {
+ if (nr_full_ptes)
+ rc = HYPERVISOR_mmu_update(
+ &u[0], nr_full_ptes, NULL, DOMID_SELF);
+ break;
+ }
+ }
+ pte++;
+ addr += PAGE_SIZE;
+ }
+ pte_unmap_unlock(pte - 1, ptl);
+ return rc;
+}
diff -r 61c96456a3e1 mm/mprotect.c
--- a/mm/mprotect.c Thu Dec 20 16:58:14 2007 +0000
+++ b/mm/mprotect.c Fri Jan 04 20:46:24 2008 -0700
@@ -76,7 +76,20 @@ static inline void change_pmd_range(stru
next = pmd_addr_end(addr, end);
if (pmd_none_or_clear_bad(pmd))
continue;
+#if CONFIG_XEN
+ {
+ int rc;
+ rc = xen_change_pte_range(mm, pmd, addr, next, newprot);
+ if (rc)
+ {
+ if (rc != -ENOSYS)
+ BUG();
+ change_pte_range(mm, pmd, addr, next, newprot);
+ }
+ }
+#else
change_pte_range(mm, pmd, addr, next, newprot);
+#endif
} while (pmd++, addr = next, addr != end);
}
[-- Attachment #4: Type: text/plain, Size: 138 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] Fix performance problems with mprotect()
2008-01-05 4:49 [PATCH] Fix performance problems with mprotect() Bruce Rogers
@ 2008-01-05 13:35 ` John Levon
2008-01-06 23:22 ` Keir Fraser
` (2 subsequent siblings)
3 siblings, 0 replies; 12+ messages in thread
From: John Levon @ 2008-01-05 13:35 UTC (permalink / raw)
To: Bruce Rogers; +Cc: xen-devel
On Fri, Jan 04, 2008 at 09:49:41PM -0700, Bruce Rogers wrote:
> While working on a database scaling problem
The changelog entry for this patch should have a suitable entry for
http://wiki.xensource.com/xenwiki/APIChangelog, as suggested by Ian
Jackson.
(Regarding the changes, is there a sane way to detect whether the new
API is present or not? IE a sensible unique errno return or something
else.)
regards
john
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] Fix performance problems with mprotect()
@ 2008-01-05 15:05 Bruce Rogers
0 siblings, 0 replies; 12+ messages in thread
From: Bruce Rogers @ 2008-01-05 15:05 UTC (permalink / raw)
To: levon; +Cc: xen-devel
I believe -ENOSYS, in this case at least, correctly identifies when the new code is not there. I didn't try to bump any revision, or add an explicit feature flag somewhere.
I would have preferred to do a one time detection by identity remapping some page and see if it succeeds and then just test a global flag to determine whether to use the new method or not, but it appeared to me at least to not be the Linux way (I am still somewhat new to Linux) and was unsure how issues such as migrating the domain to hypervisors which might be missing this feature is handled, so I went with this approach of always being able to fall back to the previously existing method.
Ugly, I agree, but it works.
I'll look into the APIChangelog entry after more feedback rolls in.
- Bruce
>>> John Levon <levon@movementarian.org> 01/05/08 6:35 AM >>>
On Fri, Jan 04, 2008 at 09:49:41PM -0700, Bruce Rogers wrote:
> While working on a database scaling problem
The changelog entry for this patch should have a suitable entry for
http://wiki.xensource.com/xenwiki/APIChangelog, as suggested by Ian
Jackson.
(Regarding the changes, is there a sane way to detect whether the new
API is present or not? IE a sensible unique errno return or something
else.)
regards
john
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] Fix performance problems with mprotect()
2008-01-05 4:49 [PATCH] Fix performance problems with mprotect() Bruce Rogers
2008-01-05 13:35 ` John Levon
@ 2008-01-06 23:22 ` Keir Fraser
2008-01-07 15:03 ` Jan Beulich
2008-01-07 15:18 ` Andi Kleen
3 siblings, 0 replies; 12+ messages in thread
From: Keir Fraser @ 2008-01-06 23:22 UTC (permalink / raw)
To: Bruce Rogers, xen-devel
Thanks, I'll need a little while to look into this patch. It's fairly
obviously not a candidate for 3.2.0 at this point anyway (merge window is
long gone by).
-- Keir
On 5/1/08 04:49, "Bruce Rogers" <brogers@novell.com> wrote:
> While working on a database scaling problem using a SAP database test suite we
> discovered that these enterprise level large shared memory databases are very
> heavy users of mprotect(), to the extent that the performance overhead in
> current Xenolinux impacts scaling beyond a few cpus quite badly. A single
> cpu run under Xen has a nominal impact, but scaling out to 8 cpus results in a
> performance of less than 10% of native throughput. Ouch!
>
> The problem was isolated to the relatively high overhead of how mprotect()
> updates ptes, and the lack of efficient batching opportunities as presently
> implemented. By adding a hypercall which batches pte updates such that
> hardware flags updates (accessed or dirty bits) are not lost, and redoing the
> performance critical portion of mprotect() to take advantage of that,
> performance has improved to where it is essentially equivalent to native
> performance with this SAP database test suite. I haven't tested other
> database implementations, but I would imagine they would have similar results.
>
> This is obviously quite an improvement and should allow Xen to be deployed for
> more enterprise level workloads.
>
> There are actually 2 new hypercalls included in this patch - one that batches
> flags-only changes to pte's within a single page table, and another which
> allows the full pte to be changed - both "atomically" as described above.
>
> I imagine there are other places in Linux that could benefit from these
> hypercalls, but these patches focus on the mprotect performance problem only.
>
> The hypervisor patch was mostly written and tested with the 3.0.4 era
> hypervisor that ships with SLES 10, and then ported and tested on current
> xen-unstable.
> The Linux patch was written and tested on 2.6.16 and made apply to the 2.6.18
> tree without further testing.
>
> Feedback and comments are appreciated.
>
> Signed-off-by: Bruce Rogers <brogers@novell.com>
>
> - Bruce Rogers
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] Fix performance problems with mprotect()
2008-01-05 4:49 [PATCH] Fix performance problems with mprotect() Bruce Rogers
2008-01-05 13:35 ` John Levon
2008-01-06 23:22 ` Keir Fraser
@ 2008-01-07 15:03 ` Jan Beulich
2008-01-07 17:05 ` Bruce Rogers
2008-01-07 15:18 ` Andi Kleen
3 siblings, 1 reply; 12+ messages in thread
From: Jan Beulich @ 2008-01-07 15:03 UTC (permalink / raw)
To: Bruce Rogers; +Cc: xen-devel
On the Xen patch:
Both added case blocks in do_mmu_update() have a nested switch
statement that seems redundant (i.e. the outer switch already only
handle PGT_l1_page_table pages, but the inner switch check this same
value again.
The same two case block access addresses obtained from
map_domain_page_with_cache() through __copy_from_user(). Is there
any particular reason to not use direct accesses here? (Note that
mod_l1_entry() has to use __copy_from_user() as it may be called from
do_update_va_mapping()). Likewise I would think that there's no strict
need for update_intpte_sync() to use paging_cmpxchg_guest_entry(),
but here I would agree that it's easier to re-use the existing function
than to create and use a new one.
An additional piece of concern regarding the bit assignments of
MMU_FLAG_RANGE_UPDATE's val parameter (Keir, maybe you need to
comment on this one): The whole mmu_update interface, being
defined in public/xen.h, is supposed to be sufficiently architecture
neutral, which it won't be anymore in the way it currently is being
modified. But maybe I'm mistaken and the interface's declaration is
just badly placed (would apply to the mmuext interface then, too)?
In the Linux patch, I'd just like to see the abstraction to be less Xen-
specific, i.e. something like
(perhaps in include/asm-generic/pgtable.h)
#ifndef arch_change_pte_range
# define arch_change_pte_range(mm, pmd, addr, end, newprot) 0
#endif
and then in change_pmd_range():
...
if (pmd_none_or_clear_bad(pmd))
continue;
if (arch_change_pte_range(mm, pmd, addr, next, newprot))
continue;
change_pte_range(mm, pmd, addr, next, newprot);
} while (pmd++, addr = next, addr != end);
...
The BUG() (which really can be BUG_ON() here) would go into the actual
function then.
Jan
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] Fix performance problems with mprotect()
2008-01-05 4:49 [PATCH] Fix performance problems with mprotect() Bruce Rogers
` (2 preceding siblings ...)
2008-01-07 15:03 ` Jan Beulich
@ 2008-01-07 15:18 ` Andi Kleen
2008-01-07 16:16 ` Bruce Rogers
2008-01-19 0:03 ` Bruce Rogers
3 siblings, 2 replies; 12+ messages in thread
From: Andi Kleen @ 2008-01-07 15:18 UTC (permalink / raw)
To: Bruce Rogers; +Cc: xen-devel
"Bruce Rogers" <brogers@novell.com> writes:
> +
> + /* Allowed to change in Accessed/Dirty flags only. */
> + BUG_ON((t ^ old) & ~(intpte_t)(_PAGE_ACCESSED|_PAGE_DIRTY));
Are you sure that BUG_ON can't be triggered from the hypercall?
It should error out in this case I think.
-Andi
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] Fix performance problems with mprotect()
2008-01-07 15:18 ` Andi Kleen
@ 2008-01-07 16:16 ` Bruce Rogers
2008-01-07 17:35 ` Keir Fraser
2008-01-19 0:03 ` Bruce Rogers
1 sibling, 1 reply; 12+ messages in thread
From: Bruce Rogers @ 2008-01-07 16:16 UTC (permalink / raw)
To: Andi Kleen; +Cc: xen-devel
I'm not sure how this BUG_ON would get triggered with current x86 paging semantics, and probably isn't needed I think.
Am I missing something?
The code is almost exactly the same as the existing update_intpte, which has the same BUG_ON, and so I just left it in.
(BTW:I considered merging the new functions into the non "sync" versions but opted not to.)
- Bruce
>>> On 1/7/2008 at 8:18 AM, in message <p73d4sdsly4.fsf@bingen.suse.de>, Andi Kleen
<andi@firstfloor.org> wrote:
> "Bruce Rogers" <brogers@novell.com> writes:
>> +
>> + /* Allowed to change in Accessed/Dirty flags only. */
>> + BUG_ON((t ^ old) & ~(intpte_t)(_PAGE_ACCESSED|_PAGE_DIRTY));
>
>
> Are you sure that BUG_ON can't be triggered from the hypercall?
> It should error out in this case I think.
>
> -Andi
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] Fix performance problems with mprotect()
2008-01-07 15:03 ` Jan Beulich
@ 2008-01-07 17:05 ` Bruce Rogers
2008-01-07 17:45 ` Keir Fraser
0 siblings, 1 reply; 12+ messages in thread
From: Bruce Rogers @ 2008-01-07 17:05 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel
>>> On 1/7/2008 at 8:03 AM, in message <47824D56.76E4.0078.0@novell.com>, "Jan
Beulich" <jbeulich@novell.com> wrote:
> On the Xen patch:
>
> Both added case blocks in do_mmu_update() have a nested switch
> statement that seems redundant (i.e. the outer switch already only
> handle PGT_l1_page_table pages, but the inner switch check this same
> value again.
Right - a silly oversight - will fix.
>
> The same two case block access addresses obtained from
> map_domain_page_with_cache() through __copy_from_user(). Is there
> any particular reason to not use direct accesses here? (Note that
> mod_l1_entry() has to use __copy_from_user() as it may be called from
> do_update_va_mapping()). Likewise I would think that there's no strict
> need for update_intpte_sync() to use paging_cmpxchg_guest_entry(),
> but here I would agree that it's easier to re-use the existing function
> than to create and use a new one.
I'll change to use direct access.
>
> An additional piece of concern regarding the bit assignments of
> MMU_FLAG_RANGE_UPDATE's val parameter (Keir, maybe you need to
> comment on this one): The whole mmu_update interface, being
> defined in public/xen.h, is supposed to be sufficiently architecture
> neutral, which it won't be anymore in the way it currently is being
> modified. But maybe I'm mistaken and the interface's declaration is
> just badly placed (would apply to the mmuext interface then, too)?
I should point out that the MMU_FLAG_RANGE_UPDATE hypercall didn't end up being needed for improving the SAP test, but does get invoked by other uses of mprotect() and feel that it is useful. As it stands I've only implemented x86 version of this, other architecture owners would need to do the same.
Perhaps as far as what is presented in public/xen.h, we should just have a more generic description (not pointing out the format) and leave it up to each architecture as to what specific format for the val member makes sense.
>
> In the Linux patch, I'd just like to see the abstraction to be less Xen-
> specific, i.e. something like
>
> (perhaps in include/asm-generic/pgtable.h)
>
> #ifndef arch_change_pte_range
> # define arch_change_pte_range(mm, pmd, addr, end, newprot) 0
> #endif
>
> and then in change_pmd_range():
>
> ...
> if (pmd_none_or_clear_bad(pmd))
> continue;
> if (arch_change_pte_range(mm, pmd, addr, next, newprot))
> continue;
> change_pte_range(mm, pmd, addr, next, newprot);
> } while (pmd++, addr = next, addr != end);
> ...
>
> The BUG() (which really can be BUG_ON() here) would go into the actual
> function then.
Yes, that approach does look better. I'll respin the patch.
>
> Jan
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Re: [PATCH] Fix performance problems with mprotect()
2008-01-07 16:16 ` Bruce Rogers
@ 2008-01-07 17:35 ` Keir Fraser
0 siblings, 0 replies; 12+ messages in thread
From: Keir Fraser @ 2008-01-07 17:35 UTC (permalink / raw)
To: Bruce Rogers, Andi Kleen; +Cc: xen-devel
The BUG_ON() is valid until the per-domain biglock is removed.
-- Keir
On 7/1/08 16:16, "Bruce Rogers" <BROGERS@novell.com> wrote:
> I'm not sure how this BUG_ON would get triggered with current x86 paging
> semantics, and probably isn't needed I think.
> Am I missing something?
> The code is almost exactly the same as the existing update_intpte, which has
> the same BUG_ON, and so I just left it in.
> (BTW:I considered merging the new functions into the non "sync" versions but
> opted not to.)
>
> - Bruce
>
>>>> On 1/7/2008 at 8:18 AM, in message <p73d4sdsly4.fsf@bingen.suse.de>, Andi
>>>> Kleen
> <andi@firstfloor.org> wrote:
>> "Bruce Rogers" <brogers@novell.com> writes:
>>> +
>>> + /* Allowed to change in Accessed/Dirty flags only. */
>>> + BUG_ON((t ^ old) & ~(intpte_t)(_PAGE_ACCESSED|_PAGE_DIRTY));
>>
>>
>> Are you sure that BUG_ON can't be triggered from the hypercall?
>> It should error out in this case I think.
>>
>> -Andi
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] Fix performance problems with mprotect()
2008-01-07 17:05 ` Bruce Rogers
@ 2008-01-07 17:45 ` Keir Fraser
2008-01-07 18:03 ` Bruce Rogers
0 siblings, 1 reply; 12+ messages in thread
From: Keir Fraser @ 2008-01-07 17:45 UTC (permalink / raw)
To: Bruce Rogers, Jan Beulich; +Cc: xen-devel
On 7/1/08 17:05, "Bruce Rogers" <BROGERS@novell.com> wrote:
>> An additional piece of concern regarding the bit assignments of
>> MMU_FLAG_RANGE_UPDATE's val parameter (Keir, maybe you need to
>> comment on this one): The whole mmu_update interface, being
>> defined in public/xen.h, is supposed to be sufficiently architecture
>> neutral, which it won't be anymore in the way it currently is being
>> modified. But maybe I'm mistaken and the interface's declaration is
>> just badly placed (would apply to the mmuext interface then, too)?
> I should point out that the MMU_FLAG_RANGE_UPDATE hypercall didn't end up
> being needed for improving the SAP test, but does get invoked by other uses of
> mprotect() and feel that it is useful. As it stands I've only implemented x86
> version of this, other architecture owners would need to do the same.
> Perhaps as far as what is presented in public/xen.h, we should just have a
> more generic description (not pointing out the format) and leave it up to each
> architecture as to what specific format for the val member makes sense.
No other architecture uses do_mmu_update(). The public definitions are
simply in the wrong header file.
Is MMU_FLAG_RANGE_UPDATE really useful if you have MMU_ATOMIC_PT_UPDATE? If
it is it's presumably a performance issue, and the question should be: why
is a sequence MMU_ATOMIC_PT_UPDATE slower than MMU_FLAG_RANGE_UPDATE, and
can we make it fast enough that MMU_FLAG_RANGE_UPDATE is redundant?
Also, the documentation for MMU_ATOMIC_PT_UDATE in xen.h is possibly
incorrect. It doesn't make sense to me -- are your meanings for ptr[2:] and
val actually correct?
-- Keir
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] Fix performance problems with mprotect()
2008-01-07 17:45 ` Keir Fraser
@ 2008-01-07 18:03 ` Bruce Rogers
0 siblings, 0 replies; 12+ messages in thread
From: Bruce Rogers @ 2008-01-07 18:03 UTC (permalink / raw)
To: Keir Fraser, Jan Beulich; +Cc: xen-devel
>>> On 1/7/2008 at 10:45 AM, in message <C3A815C0.11D69%Keir.Fraser@cl.cam.ac.uk>,
Keir Fraser <Keir.Fraser@cl.cam.ac.uk> wrote:
> On 7/1/08 17:05, "Bruce Rogers" <BROGERS@novell.com> wrote:
>
>>> An additional piece of concern regarding the bit assignments of
>>> MMU_FLAG_RANGE_UPDATE's val parameter (Keir, maybe you need to
>>> comment on this one): The whole mmu_update interface, being
>>> defined in public/xen.h, is supposed to be sufficiently architecture
>>> neutral, which it won't be anymore in the way it currently is being
>>> modified. But maybe I'm mistaken and the interface's declaration is
>>> just badly placed (would apply to the mmuext interface then, too)?
>> I should point out that the MMU_FLAG_RANGE_UPDATE hypercall didn't end up
>> being needed for improving the SAP test, but does get invoked by other uses
> of
>> mprotect() and feel that it is useful. As it stands I've only implemented
> x86
>> version of this, other architecture owners would need to do the same.
>> Perhaps as far as what is presented in public/xen.h, we should just have a
>> more generic description (not pointing out the format) and leave it up to
> each
>> architecture as to what specific format for the val member makes sense.
>
> No other architecture uses do_mmu_update(). The public definitions are
> simply in the wrong header file.
>
> Is MMU_FLAG_RANGE_UPDATE really useful if you have MMU_ATOMIC_PT_UPDATE? If
> it is it's presumably a performance issue, and the question should be: why
> is a sequence MMU_ATOMIC_PT_UPDATE slower than MMU_FLAG_RANGE_UPDATE, and
> can we make it fast enough that MMU_FLAG_RANGE_UPDATE is redundant?
I did find a quite measurable performance difference between using the two methods, but I consider it useful mainly because of its compactness in terms of being able to update all ptes in a page table without needing a large array to initialize from. That said, I am not beholden to this flag range hypercall, since the other one is more generic and we should be able to make it fast enough. If you prefer, we can just go with the MMU_ATOMIC_PT_UPDATE hypercall only.
>
> Also, the documentation for MMU_ATOMIC_PT_UDATE in xen.h is possibly
> incorrect. It doesn't make sense to me -- are your meanings for ptr[2:] and
> val actually correct?
Oh,you are correct. I had originally coded MMU_ATOMIC_PT_UDATE as it is described, since doing it that way was giving me about 14% better performance than doing the batching one mmu_update_t structure at a time, but I again opted for backing off to the more general case and forgot to update that description for that hypercall.
I'll get that fixed.
>
> -- Keir
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] Fix performance problems with mprotect()
2008-01-07 15:18 ` Andi Kleen
2008-01-07 16:16 ` Bruce Rogers
@ 2008-01-19 0:03 ` Bruce Rogers
1 sibling, 0 replies; 12+ messages in thread
From: Bruce Rogers @ 2008-01-19 0:03 UTC (permalink / raw)
To: Andi Kleen; +Cc: xen-devel
I don't see how. If you are seeing a problem, please elaborate.
- Bruce
>>> On 1/7/2008 at 8:18 AM, in message <p73d4sdsly4.fsf@bingen.suse.de>, Andi Kleen
<andi@firstfloor.org> wrote:
> "Bruce Rogers" <brogers@novell.com> writes:
>> +
>> + /* Allowed to change in Accessed/Dirty flags only. */
>> + BUG_ON((t ^ old) & ~(intpte_t)(_PAGE_ACCESSED|_PAGE_DIRTY));
>
>
> Are you sure that BUG_ON can't be triggered from the hypercall?
> It should error out in this case I think.
>
> -Andi
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2008-01-19 0:03 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-01-05 4:49 [PATCH] Fix performance problems with mprotect() Bruce Rogers
2008-01-05 13:35 ` John Levon
2008-01-06 23:22 ` Keir Fraser
2008-01-07 15:03 ` Jan Beulich
2008-01-07 17:05 ` Bruce Rogers
2008-01-07 17:45 ` Keir Fraser
2008-01-07 18:03 ` Bruce Rogers
2008-01-07 15:18 ` Andi Kleen
2008-01-07 16:16 ` Bruce Rogers
2008-01-07 17:35 ` Keir Fraser
2008-01-19 0:03 ` Bruce Rogers
-- strict thread matches above, loose matches on Subject: below --
2008-01-05 15:05 Bruce Rogers
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.