virtualization.lists.linux-foundation.org archive mirror
* [patch 00/20] paravirt_ops updates
@ 2007-04-04 19:11 Jeremy Fitzhardinge
  2007-04-04 19:11 ` [patch 01/20] update MAINTAINERS Jeremy Fitzhardinge
                   ` (19 more replies)
  0 siblings, 20 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-04 19:11 UTC (permalink / raw)
  To: Andi Kleen; +Cc: virtualization, Andrew Morton, lkml

Hi Andi,

Here's a repost of the paravirt_ops update series I posted the other day.
Since then, I found a few potential bugs with patching clobbering,
cleaned up and documented paravirt.h and the patching machinery.

Overview:

add-MAINTAINERS.patch
	obvious

remove-CONFIG_DEBUG_PARAVIRT.patch
	No longer meaningful or needed.

paravirt-nop.patch
	Clean up nop paravirt_ops functions, mainly to allow the patching
	machinery to easily identify them.

paravirt-pte-accessors.patch
	Accessors to allow pv_ops to control the content of pagetable entries.

paravirt-memory-init.patch
	Hook into initial pagetable creation.

paravirt-fixmap.patch
	Create a fixmap for early paravirt_ops mappings.

shared-kernel-pmd.patch
	Make the choice of whether the kernel pmd is shared between
	processes or not a runtime selectable flag.

mm-lifetime-hooks.patch
	Hooks to allow the creation, use and destruction of an mm_struct
	to be followed.

paravirt-patch-rename-paravirt_patch.patch
	Rename a structure to make its use a bit more clear.

paravirt-use-offset-site-ids.patch
	Use the offset of each function pointer in paravirt_ops as the
	basis of its patching identifier.

paravirt-fix-clobbers.patch
	Fix up various register use/clobber problems.  This may be 2.6.21
	material, but I don't think it will materially affect VMI.

paravirt-patchable-call-wrappers.patch
	Wrap each paravirt_ops call to allow the callsites to be runtime
	patched.

paravirt-document-paravirt_ops.patch
	Document the paravirt_ops structure itself, the patching
	mechanism, and other cleanups.

paravirt-patch-machinery.patch
	General patch machinery for use by pv_ops backends to implement
	patching.

paravirt-flush_tlb_others.patch
	Add a hook for cross-cpu tlb flushing.

revert-map_pt_hook.patch
	Back out the map_pt_hook change.

paravirt-kmap_atomic_pte.patch
	Replace map_pt_hook with kmap_atomic_pte.

cleanup-tsc-sched-clock.patch
	Clean up the tsc-based sched_clock.  (I think you already
	have this.)

paravirt-sched-clock.patch
	Add a hook for sched_clock, so that paravirt_ops backends can
	report unstolen time for use as the scheduler clock.

apply-to-page-range.patch
	Apply a function to a range of pagetable entries.

Thanks,
	J

-- 


* [patch 01/20] update MAINTAINERS
  2007-04-04 19:11 [patch 00/20] paravirt_ops updates Jeremy Fitzhardinge
@ 2007-04-04 19:11 ` Jeremy Fitzhardinge
  2007-04-04 19:11 ` [patch 02/20] Remove CONFIG_DEBUG_PARAVIRT Jeremy Fitzhardinge
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-04 19:11 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Chris Wright, virtualization, Andrew Morton, lkml

[-- Attachment #1: add-MAINTAINERS.patch --]
[-- Type: text/plain, Size: 1273 bytes --]

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
---
 MAINTAINERS |   22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

===================================================================
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2583,6 +2583,19 @@ T:	cvs cvs.parisc-linux.org:/var/cvs/lin
 T:	cvs cvs.parisc-linux.org:/var/cvs/linux-2.6
 S:	Maintained
 
+PARAVIRT_OPS INTERFACE
+P:	Jeremy Fitzhardinge
+M:	jeremy@xensource.com
+P:	Chris Wright
+M:	chrisw@sous-sol.org
+P:	Zachary Amsden
+M:	zach@vmware.com
+P:	Rusty Russell
+M:	rusty@rustcorp.com.au
+L:	virtualization@lists.osdl.org
+L:	linux-kernel@vger.kernel.org
+S:	Supported
+
 PC87360 HARDWARE MONITORING DRIVER
 P:	Jim Cromie
 M:	jim.cromie@gmail.com
@@ -3780,6 +3793,15 @@ L:	linux-x25@vger.kernel.org
 L:	linux-x25@vger.kernel.org
 S:	Maintained
 
+XEN HYPERVISOR INTERFACE
+P:	Jeremy Fitzhardinge
+M:	jeremy@xensource.com
+P:	Chris Wright
+M:	chrisw@sous-sol.org
+L:	virtualization@lists.osdl.org
+L:	xen-devel@lists.xensource.com
+S:	Supported
+
 XFS FILESYSTEM
 P:	Silicon Graphics Inc
 P:	Tim Shimmin, David Chatterton

-- 


* [patch 02/20] Remove CONFIG_DEBUG_PARAVIRT
  2007-04-04 19:11 [patch 00/20] paravirt_ops updates Jeremy Fitzhardinge
  2007-04-04 19:11 ` [patch 01/20] update MAINTAINERS Jeremy Fitzhardinge
@ 2007-04-04 19:11 ` Jeremy Fitzhardinge
  2007-04-04 19:11 ` [patch 03/20] use paravirt_nop to consistently mark no-op operations Jeremy Fitzhardinge
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-04 19:11 UTC (permalink / raw)
  To: Andi Kleen; +Cc: virtualization, Andrew Morton, lkml

[-- Attachment #1: remove-CONFIG_DEBUG_PARAVIRT.patch --]
[-- Type: text/plain, Size: 2062 bytes --]

Remove CONFIG_DEBUG_PARAVIRT.  When inlining code, this option
deliberately trashes the registers listed in the patch site's
"clobber" field, on the grounds that doing so should flush out bugs
caused by incorrect clobber declarations.  Unfortunately, the clobber
field really means "registers modified by this patch site", which
includes return values.
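
To illustrate (a hedged sketch, not code from this patch): consider a
call site whose clobber mask includes %eax, the register that also
carries the call's return value.

	/* Sketch only: what a patched save_fl-style call site amounts to. */
	unsigned long flags = paravirt_ops.save_fl();	/* result in %eax */
	/* DEBUG_PARAVIRT then appends "not %eax", since %eax is in the
	 * clobber mask -- legal by the mask's definition, but it has just
	 * destroyed the return value, so 'flags' is garbage. */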

Because of this, the option has outlived its usefulness, so remove
it.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>

---
 arch/i386/Kconfig.debug        |   10 ----------
 arch/i386/kernel/alternative.c |   14 +-------------
 2 files changed, 1 insertion(+), 23 deletions(-)

===================================================================
--- a/arch/i386/Kconfig.debug
+++ b/arch/i386/Kconfig.debug
@@ -85,14 +85,4 @@ config DOUBLEFAULT
           option saves about 4k and might cause you much additional grey
           hair.
 
-config DEBUG_PARAVIRT
-	bool "Enable some paravirtualization debugging"
-	default n
-	depends on PARAVIRT && DEBUG_KERNEL
-	help
-	  Currently deliberately clobbers regs which are allowed to be
-	  clobbered in inlined paravirt hooks, even in native mode.
-	  If turning this off solves a problem, then DISABLE_INTERRUPTS() or
-	  ENABLE_INTERRUPTS() is lying about what registers can be clobbered.
-
 endmenu
===================================================================
--- a/arch/i386/kernel/alternative.c
+++ b/arch/i386/kernel/alternative.c
@@ -359,19 +359,7 @@ void apply_paravirt(struct paravirt_patc
 
 		used = paravirt_ops.patch(p->instrtype, p->clobbers, p->instr,
 					  p->len);
-#ifdef CONFIG_DEBUG_PARAVIRT
-		{
-		int i;
-		/* Deliberately clobber regs using "not %reg" to find bugs. */
-		for (i = 0; i < 3; i++) {
-			if (p->len - used >= 2 && (p->clobbers & (1 << i))) {
-				memcpy(p->instr + used, "\xf7\xd0", 2);
-				p->instr[used+1] |= i;
-				used += 2;
-			}
-		}
-		}
-#endif
+
 		/* Pad the rest with nops */
 		nop_out(p->instr + used, p->len - used);
 	}

-- 


* [patch 03/20] use paravirt_nop to consistently mark no-op operations
  2007-04-04 19:11 [patch 00/20] paravirt_ops updates Jeremy Fitzhardinge
  2007-04-04 19:11 ` [patch 01/20] update MAINTAINERS Jeremy Fitzhardinge
  2007-04-04 19:11 ` [patch 02/20] Remove CONFIG_DEBUG_PARAVIRT Jeremy Fitzhardinge
@ 2007-04-04 19:11 ` Jeremy Fitzhardinge
  2007-04-04 19:11 ` [patch 04/20] Add pagetable accessors to pack and unpack pagetable entries Jeremy Fitzhardinge
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-04 19:11 UTC (permalink / raw)
  To: Andi Kleen; +Cc: virtualization, Andrew Morton, Ingo Molnar, lkml

[-- Attachment #1: paravirt-nop.patch --]
[-- Type: text/plain, Size: 3381 bytes --]

Add a _paravirt_nop function for use as a stub for no-op operations,
and a paravirt_nop #define which is a void *-cast version of it, to
make it easier to use (since all its uses are as a void *).

This is useful to allow the patcher to automatically identify noop
operations so it can simply nop out the callsite.
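
A hedged sketch of how the patching machinery can exploit this
(illustrative only; the real check arrives later in this series):

	/* With every no-op hook pointing at the same stub, a simple
	 * pointer comparison identifies it, and the call site can be
	 * wiped entirely. */
	if (paravirt_ops.pte_update == paravirt_nop)
		nop_out(p->instr, p->len);	/* p is the patch-site record */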


Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
[mingo] but only as a cleanup of the current open-coded (void *) casts.
My problem with this is that it loses the types. Not that there is much
to check for, but still, this adds some assumptions about what function
calls look like.

---
 arch/i386/kernel/paravirt.c |   26 +++++++++++++-------------
 include/asm-i386/paravirt.h |    3 +++
 2 files changed, 16 insertions(+), 13 deletions(-)

===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -35,7 +35,7 @@
 #include <asm/timer.h>
 
 /* nop stub */
-static void native_nop(void)
+void _paravirt_nop(void)
 {
 }
 
@@ -490,7 +490,7 @@ struct paravirt_ops paravirt_ops = {
 
  	.patch = native_patch,
 	.banner = default_banner,
-	.arch_setup = native_nop,
+	.arch_setup = paravirt_nop,
 	.memory_setup = machine_specific_memory_setup,
 	.get_wallclock = native_get_wallclock,
 	.set_wallclock = native_set_wallclock,
@@ -546,25 +546,25 @@ struct paravirt_ops paravirt_ops = {
 	.setup_boot_clock = setup_boot_APIC_clock,
 	.setup_secondary_clock = setup_secondary_APIC_clock,
 #endif
-	.set_lazy_mode = (void *)native_nop,
+	.set_lazy_mode = paravirt_nop,
 
 	.flush_tlb_user = native_flush_tlb,
 	.flush_tlb_kernel = native_flush_tlb_global,
 	.flush_tlb_single = native_flush_tlb_single,
 
-	.map_pt_hook = (void *)native_nop,
-
-	.alloc_pt = (void *)native_nop,
-	.alloc_pd = (void *)native_nop,
-	.alloc_pd_clone = (void *)native_nop,
-	.release_pt = (void *)native_nop,
-	.release_pd = (void *)native_nop,
+	.map_pt_hook = paravirt_nop,
+
+	.alloc_pt = paravirt_nop,
+	.alloc_pd = paravirt_nop,
+	.alloc_pd_clone = paravirt_nop,
+	.release_pt = paravirt_nop,
+	.release_pd = paravirt_nop,
 
 	.set_pte = native_set_pte,
 	.set_pte_at = native_set_pte_at,
 	.set_pmd = native_set_pmd,
-	.pte_update = (void *)native_nop,
-	.pte_update_defer = (void *)native_nop,
+	.pte_update = paravirt_nop,
+	.pte_update_defer = paravirt_nop,
 #ifdef CONFIG_X86_PAE
 	.set_pte_atomic = native_set_pte_atomic,
 	.set_pte_present = native_set_pte_present,
@@ -576,7 +576,7 @@ struct paravirt_ops paravirt_ops = {
 	.irq_enable_sysexit = native_irq_enable_sysexit,
 	.iret = native_iret,
 
-	.startup_ipi_hook = (void *)native_nop,
+	.startup_ipi_hook = paravirt_nop,
 };
 
 /*
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -430,6 +430,9 @@ static inline void pmd_clear(pmd_t *pmdp
 #define arch_enter_lazy_mmu_mode() paravirt_ops.set_lazy_mode(PARAVIRT_LAZY_MMU)
 #define arch_leave_lazy_mmu_mode() paravirt_ops.set_lazy_mode(PARAVIRT_LAZY_NONE)
 
+void _paravirt_nop(void);
+#define paravirt_nop	((void *)_paravirt_nop)
+
 /* These all sit in the .parainstructions section to tell us what to patch. */
 struct paravirt_patch {
 	u8 *instr; 		/* original instructions */

-- 


* [patch 04/20] Add pagetable accessors to pack and unpack pagetable entries
  2007-04-04 19:11 [patch 00/20] paravirt_ops updates Jeremy Fitzhardinge
                   ` (2 preceding siblings ...)
  2007-04-04 19:11 ` [patch 03/20] use paravirt_nop to consistently mark no-op operations Jeremy Fitzhardinge
@ 2007-04-04 19:11 ` Jeremy Fitzhardinge
  2007-04-04 19:11 ` [patch 05/20] Hooks to set up initial pagetable Jeremy Fitzhardinge
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-04 19:11 UTC (permalink / raw)
  To: Andi Kleen; +Cc: virtualization, Andrew Morton, Ingo Molnar, lkml

[-- Attachment #1: paravirt-pte-accessors.patch --]
[-- Type: text/plain, Size: 19486 bytes --]

Add a set of accessors to pack, unpack and modify page table entries
(at all levels).  This allows a paravirt implementation to control the
contents of pgd/pmd/pte entries.  For example, Xen uses this to
convert the (pseudo-)physical address into a machine address when
populating a pagetable entry, and to convert it back to a
pseudo-physical address when an entry is read.
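
As a rough sketch of the kind of backend this enables (hypothetical
code: pfn_to_mfn()/mfn_to_pfn() stand in for Xen's frame-translation
helpers, the non-PAE case is shown, and flag handling is kept minimal):

static unsigned long xen_pte_val(pte_t pte)
{
	unsigned long val = native_pte_val(pte);

	if (val & _PAGE_PRESENT)	/* machine frame -> pseudo-phys frame */
		val = (mfn_to_pfn(val >> PAGE_SHIFT) << PAGE_SHIFT)
			| (val & ~PAGE_MASK);
	return val;
}

static pte_t xen_make_pte(unsigned long val)
{
	if (val & _PAGE_PRESENT)	/* pseudo-phys frame -> machine frame */
		val = (pfn_to_mfn(val >> PAGE_SHIFT) << PAGE_SHIFT)
			| (val & ~PAGE_MASK);
	return native_make_pte(val);
}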

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Acked-by: Ingo Molnar <mingo@elte.hu>

---
 arch/i386/kernel/paravirt.c       |   84 +++++--------------------------------
 arch/i386/kernel/vmi.c            |    6 +-
 include/asm-i386/page.h           |   79 +++++++++++++++++++++++++++++-----
 include/asm-i386/paravirt.h       |   52 +++++++++++++++++-----
 include/asm-i386/pgtable-2level.h |   28 +++++++++---
 include/asm-i386/pgtable-3level.h |   65 +++++++++++++++++-----------
 include/asm-i386/pgtable.h        |    2 
 7 files changed, 186 insertions(+), 130 deletions(-)

===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -399,78 +399,6 @@ static void native_flush_tlb_single(u32 
 {
 	__native_flush_tlb_single(addr);
 }
-
-#ifndef CONFIG_X86_PAE
-static void native_set_pte(pte_t *ptep, pte_t pteval)
-{
-	*ptep = pteval;
-}
-
-static void native_set_pte_at(struct mm_struct *mm, u32 addr, pte_t *ptep, pte_t pteval)
-{
-	*ptep = pteval;
-}
-
-static void native_set_pmd(pmd_t *pmdp, pmd_t pmdval)
-{
-	*pmdp = pmdval;
-}
-
-#else /* CONFIG_X86_PAE */
-
-static void native_set_pte(pte_t *ptep, pte_t pte)
-{
-	ptep->pte_high = pte.pte_high;
-	smp_wmb();
-	ptep->pte_low = pte.pte_low;
-}
-
-static void native_set_pte_at(struct mm_struct *mm, u32 addr, pte_t *ptep, pte_t pte)
-{
-	ptep->pte_high = pte.pte_high;
-	smp_wmb();
-	ptep->pte_low = pte.pte_low;
-}
-
-static void native_set_pte_present(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte)
-{
-	ptep->pte_low = 0;
-	smp_wmb();
-	ptep->pte_high = pte.pte_high;
-	smp_wmb();
-	ptep->pte_low = pte.pte_low;
-}
-
-static void native_set_pte_atomic(pte_t *ptep, pte_t pteval)
-{
-	set_64bit((unsigned long long *)ptep,pte_val(pteval));
-}
-
-static void native_set_pmd(pmd_t *pmdp, pmd_t pmdval)
-{
-	set_64bit((unsigned long long *)pmdp,pmd_val(pmdval));
-}
-
-static void native_set_pud(pud_t *pudp, pud_t pudval)
-{
-	*pudp = pudval;
-}
-
-static void native_pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
-{
-	ptep->pte_low = 0;
-	smp_wmb();
-	ptep->pte_high = 0;
-}
-
-static void native_pmd_clear(pmd_t *pmd)
-{
-	u32 *tmp = (u32 *)pmd;
-	*tmp = 0;
-	smp_wmb();
-	*(tmp + 1) = 0;
-}
-#endif /* CONFIG_X86_PAE */
 
 /* These are in entry.S */
 extern void native_iret(void);
@@ -565,13 +493,25 @@ struct paravirt_ops paravirt_ops = {
 	.set_pmd = native_set_pmd,
 	.pte_update = paravirt_nop,
 	.pte_update_defer = paravirt_nop,
+
+	.ptep_get_and_clear = native_ptep_get_and_clear,
+
 #ifdef CONFIG_X86_PAE
 	.set_pte_atomic = native_set_pte_atomic,
 	.set_pte_present = native_set_pte_present,
 	.set_pud = native_set_pud,
 	.pte_clear = native_pte_clear,
 	.pmd_clear = native_pmd_clear,
+
+	.pmd_val = native_pmd_val,
+	.make_pmd = native_make_pmd,
 #endif
+
+	.pte_val = native_pte_val,
+	.pgd_val = native_pgd_val,
+
+	.make_pte = native_make_pte,
+	.make_pgd = native_make_pgd,
 
 	.irq_enable_sysexit = native_irq_enable_sysexit,
 	.iret = native_iret,
===================================================================
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -444,13 +444,13 @@ static void vmi_release_pd(u32 pfn)
         ((level) | (is_current_as(mm, user) ?                           \
                 (VMI_PAGE_DEFER | VMI_PAGE_CURRENT_AS | ((addr) & VMI_PAGE_VA_MASK)) : 0))
 
-static void vmi_update_pte(struct mm_struct *mm, u32 addr, pte_t *ptep)
+static void vmi_update_pte(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
 	vmi_check_page_type(__pa(ptep) >> PAGE_SHIFT, VMI_PAGE_PTE);
 	vmi_ops.update_pte(ptep, vmi_flags_addr(mm, addr, VMI_PAGE_PT, 0));
 }
 
-static void vmi_update_pte_defer(struct mm_struct *mm, u32 addr, pte_t *ptep)
+static void vmi_update_pte_defer(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
 	vmi_check_page_type(__pa(ptep) >> PAGE_SHIFT, VMI_PAGE_PTE);
 	vmi_ops.update_pte(ptep, vmi_flags_addr_defer(mm, addr, VMI_PAGE_PT, 0));
@@ -463,7 +463,7 @@ static void vmi_set_pte(pte_t *ptep, pte
 	vmi_ops.set_pte(pte, ptep, VMI_PAGE_PT);
 }
 
-static void vmi_set_pte_at(struct mm_struct *mm, u32 addr, pte_t *ptep, pte_t pte)
+static void vmi_set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte)
 {
 	vmi_check_page_type(__pa(ptep) >> PAGE_SHIFT, VMI_PAGE_PTE);
 	vmi_ops.set_pte(pte, ptep, vmi_flags_addr(mm, addr, VMI_PAGE_PT, 0));
===================================================================
--- a/include/asm-i386/page.h
+++ b/include/asm-i386/page.h
@@ -12,7 +12,6 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__
 
-
 #ifdef CONFIG_X86_USE_3DNOW
 
 #include <asm/mmx.h>
@@ -42,26 +41,81 @@
  * These are used to make use of C type-checking..
  */
 extern int nx_enabled;
+
 #ifdef CONFIG_X86_PAE
 extern unsigned long long __supported_pte_mask;
 typedef struct { unsigned long pte_low, pte_high; } pte_t;
 typedef struct { unsigned long long pmd; } pmd_t;
 typedef struct { unsigned long long pgd; } pgd_t;
 typedef struct { unsigned long long pgprot; } pgprot_t;
-#define pmd_val(x)	((x).pmd)
-#define pte_val(x)	((x).pte_low | ((unsigned long long)(x).pte_high << 32))
-#define __pmd(x) ((pmd_t) { (x) } )
+
+static inline unsigned long long native_pgd_val(pgd_t pgd)
+{
+	return pgd.pgd;
+}
+
+static inline unsigned long long native_pmd_val(pmd_t pmd)
+{
+	return pmd.pmd;
+}
+
+static inline unsigned long long native_pte_val(pte_t pte)
+{
+	return pte.pte_low | ((unsigned long long)pte.pte_high << 32);
+}
+
+static inline pgd_t native_make_pgd(unsigned long long val)
+{
+	return (pgd_t) { val };
+}
+
+static inline pmd_t native_make_pmd(unsigned long long val)
+{
+	return (pmd_t) { val };
+}
+
+static inline pte_t native_make_pte(unsigned long long val)
+{
+	return (pte_t) { .pte_low = val, .pte_high = (val >> 32) } ;
+}
+
+#ifndef CONFIG_PARAVIRT
+#define pmd_val(x)	native_pmd_val(x)
+#define __pmd(x)	native_make_pmd(x)
+#endif
+
 #define HPAGE_SHIFT	21
 #include <asm-generic/pgtable-nopud.h>
-#else
+#else  /* !CONFIG_X86_PAE */
 typedef struct { unsigned long pte_low; } pte_t;
 typedef struct { unsigned long pgd; } pgd_t;
 typedef struct { unsigned long pgprot; } pgprot_t;
 #define boot_pte_t pte_t /* or would you rather have a typedef */
-#define pte_val(x)	((x).pte_low)
+
+static inline unsigned long native_pgd_val(pgd_t pgd)
+{
+	return pgd.pgd;
+}
+
+static inline unsigned long native_pte_val(pte_t pte)
+{
+	return pte.pte_low;
+}
+
+static inline pgd_t native_make_pgd(unsigned long val)
+{
+	return (pgd_t) { val };
+}
+
+static inline pte_t native_make_pte(unsigned long val)
+{
+	return (pte_t) { .pte_low = val };
+}
+
 #define HPAGE_SHIFT	22
 #include <asm-generic/pgtable-nopmd.h>
-#endif
+#endif	/* CONFIG_X86_PAE */
+
 #define PTE_MASK	PAGE_MASK
 
 #ifdef CONFIG_HUGETLB_PAGE
@@ -71,12 +125,15 @@ typedef struct { unsigned long pgprot; }
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #endif
 
-#define pgd_val(x)	((x).pgd)
 #define pgprot_val(x)	((x).pgprot)
-
-#define __pte(x) ((pte_t) { (x) } )
-#define __pgd(x) ((pgd_t) { (x) } )
 #define __pgprot(x)	((pgprot_t) { (x) } )
+
+#ifndef CONFIG_PARAVIRT
+#define pgd_val(x)	native_pgd_val(x)
+#define __pgd(x)	native_make_pgd(x)
+#define pte_val(x)	native_pte_val(x)
+#define __pte(x)	native_make_pte(x)
+#endif
 
 #endif /* !__ASSEMBLY__ */
 
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -2,7 +2,6 @@
 #define __ASM_PARAVIRT_H
 /* Various instructions on x86 need to be replaced for
  * para-virtualization: those hooks are defined here. */
-#include <linux/linkage.h>
 #include <linux/stringify.h>
 #include <asm/page.h>
 
@@ -25,6 +24,8 @@
 #define CLBR_ANY 0x7
 
 #ifndef __ASSEMBLY__
+#include <linux/types.h>
+
 struct thread_struct;
 struct Xgt_desc_struct;
 struct tss_struct;
@@ -53,11 +54,6 @@ struct paravirt_ops
 	unsigned long (*get_wallclock)(void);
 	int (*set_wallclock)(unsigned long);
 	void (*time_init)(void);
-
-	/* All the function pointers here are declared as "fastcall"
-	   so that we get a specific register-based calling
-	   convention.  This makes it easier to implement inline
-	   assembler replacements. */
 
 	void (*cpuid)(unsigned int *eax, unsigned int *ebx,
 		      unsigned int *ecx, unsigned int *edx);
@@ -139,16 +135,33 @@ struct paravirt_ops
 	void (*release_pd)(u32 pfn);
 
 	void (*set_pte)(pte_t *ptep, pte_t pteval);
-	void (*set_pte_at)(struct mm_struct *mm, u32 addr, pte_t *ptep, pte_t pteval);
+	void (*set_pte_at)(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pteval);
 	void (*set_pmd)(pmd_t *pmdp, pmd_t pmdval);
-	void (*pte_update)(struct mm_struct *mm, u32 addr, pte_t *ptep);
-	void (*pte_update_defer)(struct mm_struct *mm, u32 addr, pte_t *ptep);
+	void (*pte_update)(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
+	void (*pte_update_defer)(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
+
+ 	pte_t (*ptep_get_and_clear)(pte_t *ptep);
+
 #ifdef CONFIG_X86_PAE
 	void (*set_pte_atomic)(pte_t *ptep, pte_t pteval);
-	void (*set_pte_present)(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte);
+ 	void (*set_pte_present)(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte);
 	void (*set_pud)(pud_t *pudp, pud_t pudval);
-	void (*pte_clear)(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
+ 	void (*pte_clear)(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
 	void (*pmd_clear)(pmd_t *pmdp);
+
+	unsigned long long (*pte_val)(pte_t);
+	unsigned long long (*pmd_val)(pmd_t);
+	unsigned long long (*pgd_val)(pgd_t);
+
+	pte_t (*make_pte)(unsigned long long pte);
+	pmd_t (*make_pmd)(unsigned long long pmd);
+	pgd_t (*make_pgd)(unsigned long long pgd);
+#else
+	unsigned long (*pte_val)(pte_t);
+	unsigned long (*pgd_val)(pgd_t);
+
+	pte_t (*make_pte)(unsigned long pte);
+	pgd_t (*make_pgd)(unsigned long pgd);
 #endif
 
 	void (*set_lazy_mode)(int mode);
@@ -218,6 +231,8 @@ static inline void __cpuid(unsigned int 
 #define read_cr4() paravirt_ops.read_cr4()
 #define read_cr4_safe(x) paravirt_ops.read_cr4_safe()
 #define write_cr4(x) paravirt_ops.write_cr4(x)
+
+#define raw_ptep_get_and_clear(xp)	(paravirt_ops.ptep_get_and_clear(xp))
 
 static inline void raw_safe_halt(void)
 {
@@ -303,6 +318,17 @@ static inline void halt(void)
 	(paravirt_ops.write_idt_entry((dt), (entry), (low), (high)))
 #define set_iopl_mask(mask) (paravirt_ops.set_iopl_mask(mask))
 
+#define __pte(x)	paravirt_ops.make_pte(x)
+#define __pgd(x)	paravirt_ops.make_pgd(x)
+
+#define pte_val(x)	paravirt_ops.pte_val(x)
+#define pgd_val(x)	paravirt_ops.pgd_val(x)
+
+#ifdef CONFIG_X86_PAE
+#define __pmd(x)	paravirt_ops.make_pmd(x)
+#define pmd_val(x)	paravirt_ops.pmd_val(x)
+#endif
+
 /* The paravirtualized I/O functions */
 static inline void slow_down_io(void) {
 	paravirt_ops.io_delay();
@@ -343,6 +369,7 @@ static inline void setup_secondary_clock
 }
 #endif
 
+
 #ifdef CONFIG_SMP
 static inline void startup_ipi_hook(int phys_apicid, unsigned long start_eip,
 				    unsigned long start_esp)
@@ -370,7 +397,8 @@ static inline void set_pte(pte_t *ptep, 
 	paravirt_ops.set_pte(ptep, pteval);
 }
 
-static inline void set_pte_at(struct mm_struct *mm, u32 addr, pte_t *ptep, pte_t pteval)
+static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
+			      pte_t *ptep, pte_t pteval)
 {
 	paravirt_ops.set_pte_at(mm, addr, ptep, pteval);
 }
===================================================================
--- a/include/asm-i386/pgtable-2level.h
+++ b/include/asm-i386/pgtable-2level.h
@@ -11,11 +11,24 @@
  * within a page table are directly modified.  Thus, the following
  * hook is made available.
  */
+static inline void native_set_pte(pte_t *ptep , pte_t pte)
+{
+	*ptep = pte;
+}
+static inline void native_set_pte_at(struct mm_struct *mm, unsigned long addr,
+				     pte_t *ptep , pte_t pte)
+{
+	native_set_pte(ptep, pte);
+}
+static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
+{
+	*pmdp = pmd;
+}
 #ifndef CONFIG_PARAVIRT
-#define set_pte(pteptr, pteval) (*(pteptr) = pteval)
-#define set_pte_at(mm,addr,ptep,pteval) set_pte(ptep,pteval)
-#define set_pmd(pmdptr, pmdval) (*(pmdptr) = (pmdval))
-#endif
+#define set_pte(pteptr, pteval)		native_set_pte(pteptr, pteval)
+#define set_pte_at(mm,addr,ptep,pteval) native_set_pte_at(mm, addr, ptep, pteval)
+#define set_pmd(pmdptr, pmdval)		native_set_pmd(pmdptr, pmdval)
+#endif
 
 #define set_pte_atomic(pteptr, pteval) set_pte(pteptr,pteval)
 #define set_pte_present(mm,addr,ptep,pteval) set_pte_at(mm,addr,ptep,pteval)
@@ -23,11 +36,14 @@
 #define pte_clear(mm,addr,xp)	do { set_pte_at(mm, addr, xp, __pte(0)); } while (0)
 #define pmd_clear(xp)	do { set_pmd(xp, __pmd(0)); } while (0)
 
-#define raw_ptep_get_and_clear(xp)	__pte(xchg(&(xp)->pte_low, 0))
+static inline pte_t native_ptep_get_and_clear(pte_t *xp)
+{
+	return __pte(xchg(&xp->pte_low, 0));
+}
 
 #define pte_page(x)		pfn_to_page(pte_pfn(x))
 #define pte_none(x)		(!(x).pte_low)
-#define pte_pfn(x)		((unsigned long)(((x).pte_low >> PAGE_SHIFT)))
+#define pte_pfn(x)		(pte_val(x) >> PAGE_SHIFT)
 #define pfn_pte(pfn, prot)	__pte(((pfn) << PAGE_SHIFT) | pgprot_val(prot))
 #define pfn_pmd(pfn, prot)	__pmd(((pfn) << PAGE_SHIFT) | pgprot_val(prot))
 
===================================================================
--- a/include/asm-i386/pgtable-3level.h
+++ b/include/asm-i386/pgtable-3level.h
@@ -42,20 +42,23 @@ static inline int pte_exec_kernel(pte_t 
 	return pte_x(pte);
 }
 
-#ifndef CONFIG_PARAVIRT
 /* Rules for using set_pte: the pte being assigned *must* be
  * either not present or in a state where the hardware will
  * not attempt to update the pte.  In places where this is
  * not possible, use pte_get_and_clear to obtain the old pte
  * value and then use set_pte to update it.  -ben
  */
-static inline void set_pte(pte_t *ptep, pte_t pte)
+static inline void native_set_pte(pte_t *ptep, pte_t pte)
 {
 	ptep->pte_high = pte.pte_high;
 	smp_wmb();
 	ptep->pte_low = pte.pte_low;
 }
-#define set_pte_at(mm,addr,ptep,pteval) set_pte(ptep,pteval)
+static inline void native_set_pte_at(struct mm_struct *mm, unsigned long addr,
+				     pte_t *ptep , pte_t pte)
+{
+	native_set_pte(ptep, pte);
+}
 
 /*
  * Since this is only called on user PTEs, and the page fault handler
@@ -63,7 +66,8 @@ static inline void set_pte(pte_t *ptep, 
  * we are justified in merely clearing the PTE present bit, followed
  * by a set.  The ordering here is important.
  */
-static inline void set_pte_present(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte)
+static inline void native_set_pte_present(struct mm_struct *mm, unsigned long addr,
+					  pte_t *ptep, pte_t pte)
 {
 	ptep->pte_low = 0;
 	smp_wmb();
@@ -72,33 +76,49 @@ static inline void set_pte_present(struc
 	ptep->pte_low = pte.pte_low;
 }
 
-#define set_pte_atomic(pteptr,pteval) \
-		set_64bit((unsigned long long *)(pteptr),pte_val(pteval))
-#define set_pmd(pmdptr,pmdval) \
-		set_64bit((unsigned long long *)(pmdptr),pmd_val(pmdval))
-#define set_pud(pudptr,pudval) \
-		(*(pudptr) = (pudval))
+static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
+{
+	set_64bit((unsigned long long *)(ptep),native_pte_val(pte));
+}
+static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
+{
+	set_64bit((unsigned long long *)(pmdp),native_pmd_val(pmd));
+}
+static inline void native_set_pud(pud_t *pudp, pud_t pud)
+{
+	*pudp = pud;
+}
 
 /*
  * For PTEs and PDEs, we must clear the P-bit first when clearing a page table
  * entry, so clear the bottom half first and enforce ordering with a compiler
  * barrier.
  */
-static inline void pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+static inline void native_pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
 	ptep->pte_low = 0;
 	smp_wmb();
 	ptep->pte_high = 0;
 }
 
-static inline void pmd_clear(pmd_t *pmd)
+static inline void native_pmd_clear(pmd_t *pmd)
 {
 	u32 *tmp = (u32 *)pmd;
 	*tmp = 0;
 	smp_wmb();
 	*(tmp + 1) = 0;
 }
-#endif
+
+#ifndef CONFIG_PARAVIRT
+#define set_pte(ptep, pte)			native_set_pte(ptep, pte)
+#define set_pte_at(mm, addr, ptep, pte)		native_set_pte_at(mm, addr, ptep, pte)
+#define set_pte_present(mm, addr, ptep, pte)	native_set_pte_present(mm, addr, ptep, pte)
+#define set_pte_atomic(ptep, pte)		native_set_pte_atomic(ptep, pte)
+#define set_pmd(pmdp, pmd)			native_set_pmd(pmdp, pmd)
+#define set_pud(pudp, pud)			native_set_pud(pudp, pud)
+#define pte_clear(mm, addr, ptep)		native_pte_clear(mm, addr, ptep)
+#define pmd_clear(pmd)				native_pmd_clear(pmd)
+#endif
 
 /*
  * Pentium-II erratum A13: in PAE mode we explicitly have to flush
@@ -119,7 +139,7 @@ static inline void pud_clear (pud_t * pu
 #define pmd_offset(pud, address) ((pmd_t *) pud_page(*(pud)) + \
 			pmd_index(address))
 
-static inline pte_t raw_ptep_get_and_clear(pte_t *ptep)
+static inline pte_t native_ptep_get_and_clear(pte_t *ptep)
 {
 	pte_t res;
 
@@ -146,28 +166,21 @@ static inline int pte_none(pte_t pte)
 
 static inline unsigned long pte_pfn(pte_t pte)
 {
-	return (pte.pte_low >> PAGE_SHIFT) |
-		(pte.pte_high << (32 - PAGE_SHIFT));
+	return pte_val(pte) >> PAGE_SHIFT;
 }
 
 extern unsigned long long __supported_pte_mask;
 
 static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
 {
-	pte_t pte;
-
-	pte.pte_high = (page_nr >> (32 - PAGE_SHIFT)) | \
-					(pgprot_val(pgprot) >> 32);
-	pte.pte_high &= (__supported_pte_mask >> 32);
-	pte.pte_low = ((page_nr << PAGE_SHIFT) | pgprot_val(pgprot)) & \
-							__supported_pte_mask;
-	return pte;
+	return __pte((((unsigned long long)page_nr << PAGE_SHIFT) |
+		      pgprot_val(pgprot)) & __supported_pte_mask);
 }
 
 static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot)
 {
-	return __pmd((((unsigned long long)page_nr << PAGE_SHIFT) | \
-			pgprot_val(pgprot)) & __supported_pte_mask);
+	return __pmd((((unsigned long long)page_nr << PAGE_SHIFT) |
+		      pgprot_val(pgprot)) & __supported_pte_mask);
 }
 
 /*
===================================================================
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -264,6 +264,8 @@ static inline pte_t pte_mkhuge(pte_t pte
 #define pte_update(mm, addr, ptep)		do { } while (0)
 #define pte_update_defer(mm, addr, ptep)	do { } while (0)
 #define paravirt_map_pt_hook(slot, va, pfn)	do { } while (0)
+
+#define raw_ptep_get_and_clear(xp)     native_ptep_get_and_clear(xp)
 #endif
 
 /*

-- 


* [patch 05/20] Hooks to set up initial pagetable
  2007-04-04 19:11 [patch 00/20] paravirt_ops updates Jeremy Fitzhardinge
                   ` (3 preceding siblings ...)
  2007-04-04 19:11 ` [patch 04/20] Add pagetable accessors to pack and unpack pagetable entries Jeremy Fitzhardinge
@ 2007-04-04 19:11 ` Jeremy Fitzhardinge
  2007-04-04 19:11 ` [patch 06/20] Allocate a fixmap slot Jeremy Fitzhardinge
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-04 19:11 UTC (permalink / raw)
  To: Andi Kleen
  Cc: virtualization, Andrew Morton, Ingo Molnar, lkml, William Irwin

[-- Attachment #1: paravirt-memory-init.patch --]
[-- Type: text/plain, Size: 11984 bytes --]

This patch introduces paravirt_ops hooks to control how the kernel's
initial pagetable is set up.

In the case of a native boot, the very early bootstrap code creates a
simple non-PAE pagetable to map the kernel and physical memory.  When
the VM subsystem is initialized, it creates a proper pagetable which
respects the PAE mode, large pages, etc.

When booting under a hypervisor, there are many possibilities for what
paging environment the hypervisor establishes for the guest kernel, so
the construction of the kernel's pagetable depends on the hypervisor.

In the case of Xen, the hypervisor boots the kernel with a fully
constructed pagetable, which is already using PAE if necessary.  Also,
Xen requires particular care when constructing pagetables to make sure
all pagetables are always mapped read-only.

In order to make this easier, the kernel's initial pagetable construction
has been changed to only allocate and initialize a pagetable page if
there's no page already present in the pagetable.  This allows the Xen
paravirt backend to make a copy of the hypervisor-provided pagetable,
allowing the kernel to establish any more mappings it needs while
keeping the existing ones.
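
The idiom at the heart of that change (shown here simplified from
kernel_physical_mapping_init() in the patch below) is to allocate a
new pagetable page only when the boot environment hasn't already
provided one:

	if (!(pgd_val(*pgd) & _PAGE_PRESENT))
		pmd = one_md_table_init(pgd);	/* nothing there: allocate */
	else
		pmd = pmd_offset(pud_offset(pgd, PAGE_OFFSET), PAGE_OFFSET);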

A slightly subtle point which is worth highlighting here is that Xen
requires all kernel mappings to share the same pte_t pages between all
pagetables, so that updating a kernel page's mapping in one pagetable
is reflected in all other pagetables.  This makes it possible to
allocate a page and attach it to a pagetable without having to
explicitly enumerate that page's mapping in all pagetables.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Acked-by: William Irwin <bill.irwin@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>

---
 arch/i386/kernel/paravirt.c |    3 
 arch/i386/mm/init.c         |  158 +++++++++++++++++++++++++++++--------------
 include/asm-i386/paravirt.h |   17 ++++
 include/asm-i386/pgtable.h  |   16 ++++
 4 files changed, 142 insertions(+), 52 deletions(-)

===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -476,6 +476,9 @@ struct paravirt_ops paravirt_ops = {
 #endif
 	.set_lazy_mode = paravirt_nop,
 
+	.pagetable_setup_start = native_pagetable_setup_start,
+	.pagetable_setup_done = native_pagetable_setup_done,
+
 	.flush_tlb_user = native_flush_tlb,
 	.flush_tlb_kernel = native_flush_tlb_global,
 	.flush_tlb_single = native_flush_tlb_single,
===================================================================
--- a/arch/i386/mm/init.c
+++ b/arch/i386/mm/init.c
@@ -42,6 +42,7 @@
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
 #include <asm/sections.h>
+#include <asm/paravirt.h>
 
 unsigned int __VMALLOC_RESERVE = 128 << 20;
 
@@ -62,6 +63,7 @@ static pmd_t * __init one_md_table_init(
 		
 #ifdef CONFIG_X86_PAE
 	pmd_table = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
+
 	paravirt_alloc_pd(__pa(pmd_table) >> PAGE_SHIFT);
 	set_pgd(pgd, __pgd(__pa(pmd_table) | _PAGE_PRESENT));
 	pud = pud_offset(pgd, 0);
@@ -83,12 +85,10 @@ static pte_t * __init one_page_table_ini
 {
 	if (pmd_none(*pmd)) {
 		pte_t *page_table = (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE);
+
 		paravirt_alloc_pt(__pa(page_table) >> PAGE_SHIFT);
 		set_pmd(pmd, __pmd(__pa(page_table) | _PAGE_TABLE));
-		if (page_table != pte_offset_kernel(pmd, 0))
-			BUG();	
-
-		return page_table;
+		BUG_ON(page_table != pte_offset_kernel(pmd, 0));
 	}
 	
 	return pte_offset_kernel(pmd, 0);
@@ -119,7 +119,7 @@ static void __init page_table_range_init
 	pgd = pgd_base + pgd_idx;
 
 	for ( ; (pgd_idx < PTRS_PER_PGD) && (vaddr != end); pgd++, pgd_idx++) {
-		if (pgd_none(*pgd)) 
+		if (!(pgd_val(*pgd) & _PAGE_PRESENT))
 			one_md_table_init(pgd);
 		pud = pud_offset(pgd, vaddr);
 		pmd = pmd_offset(pud, vaddr);
@@ -158,7 +158,11 @@ static void __init kernel_physical_mappi
 	pfn = 0;
 
 	for (; pgd_idx < PTRS_PER_PGD; pgd++, pgd_idx++) {
-		pmd = one_md_table_init(pgd);
+		if (!(pgd_val(*pgd) & _PAGE_PRESENT))
+			pmd = one_md_table_init(pgd);
+		else
+			pmd = pmd_offset(pud_offset(pgd, PAGE_OFFSET), PAGE_OFFSET);
+
 		if (pfn >= max_low_pfn)
 			continue;
 		for (pmd_idx = 0; pmd_idx < PTRS_PER_PMD && pfn < max_low_pfn; pmd++, pmd_idx++) {
@@ -167,20 +171,26 @@ static void __init kernel_physical_mappi
 			/* Map with big pages if possible, otherwise create normal page tables. */
 			if (cpu_has_pse) {
 				unsigned int address2 = (pfn + PTRS_PER_PTE - 1) * PAGE_SIZE + PAGE_OFFSET + PAGE_SIZE-1;
-
-				if (is_kernel_text(address) || is_kernel_text(address2))
-					set_pmd(pmd, pfn_pmd(pfn, PAGE_KERNEL_LARGE_EXEC));
-				else
-					set_pmd(pmd, pfn_pmd(pfn, PAGE_KERNEL_LARGE));
+				if (!pmd_present(*pmd)) {
+					if (is_kernel_text(address) || is_kernel_text(address2))
+						set_pmd(pmd, pfn_pmd(pfn, PAGE_KERNEL_LARGE_EXEC));
+					else
+						set_pmd(pmd, pfn_pmd(pfn, PAGE_KERNEL_LARGE));
+				}
 				pfn += PTRS_PER_PTE;
 			} else {
 				pte = one_page_table_init(pmd);
 
-				for (pte_ofs = 0; pte_ofs < PTRS_PER_PTE && pfn < max_low_pfn; pte++, pfn++, pte_ofs++) {
-						if (is_kernel_text(address))
-							set_pte(pte, pfn_pte(pfn, PAGE_KERNEL_EXEC));
-						else
-							set_pte(pte, pfn_pte(pfn, PAGE_KERNEL));
+				for (pte_ofs = 0;
+				     pte_ofs < PTRS_PER_PTE && pfn < max_low_pfn;
+				     pte++, pfn++, pte_ofs++, address += PAGE_SIZE) {
+					if (pte_present(*pte))
+						continue;
+
+					if (is_kernel_text(address))
+						set_pte(pte, pfn_pte(pfn, PAGE_KERNEL_EXEC));
+					else
+						set_pte(pte, pfn_pte(pfn, PAGE_KERNEL));
 				}
 			}
 		}
@@ -337,44 +347,36 @@ extern void __init remap_numa_kva(void);
 #define remap_numa_kva() do {} while (0)
 #endif
 
-static void __init pagetable_init (void)
-{
-	unsigned long vaddr;
-	pgd_t *pgd_base = swapper_pg_dir;
-
+void __init native_pagetable_setup_start(pgd_t *base)
+{
 #ifdef CONFIG_X86_PAE
 	int i;
-	/* Init entries of the first-level page table to the zero page */
-	for (i = 0; i < PTRS_PER_PGD; i++)
-		set_pgd(pgd_base + i, __pgd(__pa(empty_zero_page) | _PAGE_PRESENT));
+
+	/*
+	 * Init entries of the first-level page table to the
+	 * zero page, if they haven't already been set up.
+	 *
+	 * In a normal native boot, we'll be running on a
+	 * pagetable rooted in swapper_pg_dir, but not in PAE
+	 * mode, so this will end up clobbering the mappings
+	 * for the lower 24Mbytes of the address space,
+	 * without affecting the kernel address space.
+	 */
+	for (i = 0; i < USER_PTRS_PER_PGD; i++)
+		set_pgd(&base[i],
+			__pgd(__pa(empty_zero_page) | _PAGE_PRESENT));
+
+	/* Make sure kernel address space is empty so that a pagetable
+	   will be allocated for it. */
+	memset(&base[USER_PTRS_PER_PGD], 0,
+	       KERNEL_PGD_PTRS * sizeof(pgd_t));
 #else
 	paravirt_alloc_pd(__pa(swapper_pg_dir) >> PAGE_SHIFT);
 #endif
-
-	/* Enable PSE if available */
-	if (cpu_has_pse) {
-		set_in_cr4(X86_CR4_PSE);
-	}
-
-	/* Enable PGE if available */
-	if (cpu_has_pge) {
-		set_in_cr4(X86_CR4_PGE);
-		__PAGE_KERNEL |= _PAGE_GLOBAL;
-		__PAGE_KERNEL_EXEC |= _PAGE_GLOBAL;
-	}
-
-	kernel_physical_mapping_init(pgd_base);
-	remap_numa_kva();
-
-	/*
-	 * Fixed mappings, only the page table structure has to be
-	 * created - mappings will be set by set_fixmap():
-	 */
-	vaddr = __fix_to_virt(__end_of_fixed_addresses - 1) & PMD_MASK;
-	page_table_range_init(vaddr, 0, pgd_base);
-
-	permanent_kmaps_init(pgd_base);
-
+}
+
+void __init native_pagetable_setup_done(pgd_t *base)
+{
 #ifdef CONFIG_X86_PAE
 	/*
 	 * Add low memory identity-mappings - SMP needs it when
@@ -383,8 +385,62 @@ static void __init pagetable_init (void)
 	 * All user-space mappings are explicitly cleared after
 	 * SMP startup.
 	 */
-	set_pgd(&pgd_base[0], pgd_base[USER_PTRS_PER_PGD]);
-#endif
+	set_pgd(&base[0], base[USER_PTRS_PER_PGD]);
+#endif
+}
+
+/*
+ * Build a proper pagetable for the kernel mappings.  Up until this
+ * point, we've been running on some set of pagetables constructed by
+ * the boot process.
+ *
+ * If we're booting on native hardware, this will be a pagetable
+ * constructed in arch/i386/kernel/head.S, and not running in PAE mode
+ * (even if we'll end up running in PAE).  The root of the pagetable
+ * will be swapper_pg_dir.
+ *
+ * If we're booting paravirtualized under a hypervisor, then there are
+ * more options: we may already be running PAE, and the pagetable may
+ * or may not be based in swapper_pg_dir.  In any case,
+ * paravirt_pagetable_setup_start() will set up swapper_pg_dir
+ * appropriately for the rest of the initialization to work.
+ *
+ * In general, pagetable_init() assumes that the pagetable may already
+ * be partially populated, and so it avoids stomping on any existing
+ * mappings.
+ */
+static void __init pagetable_init (void)
+{
+	unsigned long vaddr, end;
+	pgd_t *pgd_base = swapper_pg_dir;
+
+	paravirt_pagetable_setup_start(pgd_base);
+
+	/* Enable PSE if available */
+	if (cpu_has_pse)
+		set_in_cr4(X86_CR4_PSE);
+
+	/* Enable PGE if available */
+	if (cpu_has_pge) {
+		set_in_cr4(X86_CR4_PGE);
+		__PAGE_KERNEL |= _PAGE_GLOBAL;
+		__PAGE_KERNEL_EXEC |= _PAGE_GLOBAL;
+	}
+
+	kernel_physical_mapping_init(pgd_base);
+	remap_numa_kva();
+
+	/*
+	 * Fixed mappings, only the page table structure has to be
+	 * created - mappings will be set by set_fixmap():
+	 */
+	vaddr = __fix_to_virt(__end_of_fixed_addresses - 1) & PMD_MASK;
+	end = (FIXADDR_TOP + PMD_SIZE - 1) & PMD_MASK;
+	page_table_range_init(vaddr, end, pgd_base);
+
+	permanent_kmaps_init(pgd_base);
+
+	paravirt_pagetable_setup_done(pgd_base);
 }
 
 #if defined(CONFIG_SOFTWARE_SUSPEND) || defined(CONFIG_ACPI_SLEEP)
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -2,10 +2,11 @@
 #define __ASM_PARAVIRT_H
 /* Various instructions on x86 need to be replaced for
  * para-virtualization: those hooks are defined here. */
+
+#ifdef CONFIG_PARAVIRT
 #include <linux/stringify.h>
 #include <asm/page.h>
 
-#ifdef CONFIG_PARAVIRT
 /* These are the most performance critical ops, so we want to be able to patch
  * callers */
 #define PARAVIRT_IRQ_DISABLE 0
@@ -48,6 +49,9 @@ struct paravirt_ops
 	void (*arch_setup)(void);
 	char *(*memory_setup)(void);
 	void (*init_IRQ)(void);
+
+	void (*pagetable_setup_start)(pgd_t *pgd_base);
+	void (*pagetable_setup_done)(pgd_t *pgd_base);
 
 	void (*banner)(void);
 
@@ -369,6 +373,17 @@ static inline void setup_secondary_clock
 }
 #endif
 
+static inline void paravirt_pagetable_setup_start(pgd_t *base)
+{
+	if (paravirt_ops.pagetable_setup_start)
+		(*paravirt_ops.pagetable_setup_start)(base);
+}
+
+static inline void paravirt_pagetable_setup_done(pgd_t *base)
+{
+	if (paravirt_ops.pagetable_setup_done)
+		(*paravirt_ops.pagetable_setup_done)(base);
+}
 
 #ifdef CONFIG_SMP
 static inline void startup_ipi_hook(int phys_apicid, unsigned long start_eip,
===================================================================
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -512,6 +512,22 @@ do {									\
  * tables contain all the necessary information.
  */
 #define update_mmu_cache(vma,address,pte) do { } while (0)
+
+void native_pagetable_setup_start(pgd_t *base);
+void native_pagetable_setup_done(pgd_t *base);
+
+#ifndef CONFIG_PARAVIRT
+static inline void paravirt_pagetable_setup_start(pgd_t *base)
+{
+	native_pagetable_setup_start(base);
+}
+
+static inline void paravirt_pagetable_setup_done(pgd_t *base)
+{
+	native_pagetable_setup_done(base);
+}
+#endif	/* !CONFIG_PARAVIRT */
+
 #endif /* !__ASSEMBLY__ */
 
 #ifdef CONFIG_FLATMEM

-- 


* [patch 06/20] Allocate a fixmap slot
  2007-04-04 19:11 [patch 00/20] paravirt_ops updates Jeremy Fitzhardinge
                   ` (4 preceding siblings ...)
  2007-04-04 19:11 ` [patch 05/20] Hooks to set up initial pagetable Jeremy Fitzhardinge
@ 2007-04-04 19:11 ` Jeremy Fitzhardinge
  2007-04-04 19:11 ` [patch 07/20] Allow paravirt backend to choose kernel PMD sharing Jeremy Fitzhardinge
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-04 19:11 UTC (permalink / raw)
  To: Andi Kleen; +Cc: virtualization, Andrew Morton, Ingo Molnar, lkml

[-- Attachment #1: paravirt-fixmap.patch --]
[-- Type: text/plain, Size: 1102 bytes --]

Allocate a fixmap slot for use by a paravirt_ops implementation.  This
is intended for early-boot bootstrap mappings.  Once the zones and
allocator have been set up, it would be better to use get_vm_area() to
allocate some virtual space.

Xen uses this to map the hypervisor's shared info page, which doesn't
have a pseudo-physical page number, and therefore can't be mapped
ordinarily.  It is needed early because it contains the vcpu state,
including the interrupt mask.
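
A hedged sketch of the intended use (xen_start_info and
HYPERVISOR_shared_info are illustrative names from the Xen backend,
not something this patch defines):

	/* Early in boot, before get_vm_area() is usable: map the
	 * hypervisor's shared info page at the reserved fixmap slot. */
	set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_start_info->shared_info);
	HYPERVISOR_shared_info =
		(struct shared_info *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);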

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Acked-by: Ingo Molnar <mingo@elte.hu>

---
 include/asm-i386/fixmap.h |    3 +++
 1 file changed, 3 insertions(+)

===================================================================
--- a/include/asm-i386/fixmap.h
+++ b/include/asm-i386/fixmap.h
@@ -86,6 +86,9 @@ enum fixed_addresses {
 #ifdef CONFIG_PCI_MMCONFIG
 	FIX_PCIE_MCFG,
 #endif
+#ifdef CONFIG_PARAVIRT
+	FIX_PARAVIRT_BOOTMAP,
+#endif
 	__end_of_permanent_fixed_addresses,
 	/* temporary boot-time mappings, used before ioremap() is functional */
 #define NR_FIX_BTMAPS	16

-- 


* [patch 07/20] Allow paravirt backend to choose kernel PMD sharing
  2007-04-04 19:11 [patch 00/20] paravirt_ops updates Jeremy Fitzhardinge
                   ` (5 preceding siblings ...)
  2007-04-04 19:11 ` [patch 06/20] Allocate a fixmap slot Jeremy Fitzhardinge
@ 2007-04-04 19:11 ` Jeremy Fitzhardinge
  2007-04-05  0:30   ` Christoph Lameter
  2007-04-06 23:41   ` Andrew Morton
  2007-04-04 19:11 ` [patch 08/20] add hooks to intercept mm creation and destruction Jeremy Fitzhardinge
                   ` (12 subsequent siblings)
  19 siblings, 2 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-04 19:11 UTC (permalink / raw)
  To: Andi Kleen
  Cc: virtualization, Andrew Morton, Ingo Molnar, lkml,
	William Lee Irwin III, Christoph Lameter

[-- Attachment #1: shared-kernel-pmd.patch --]
[-- Type: text/plain, Size: 12027 bytes --]

Normally when running in PAE mode, the 4th PMD maps the kernel address
space, which can be shared among all processes (since they all need
the same kernel mappings).

Xen, however, does not allow guests to have the kernel pmd shared
between page tables, so parameterize pgtable.c to allow both modes of
operation.

There are several side-effects of this.  One is that vmalloc will
update the kernel address space mappings, and those updates need to be
propagated into all processes if the kernel mappings are not
intrinsically shared.  In the non-PAE case, this is done by
maintaining a pgd_list of all processes; this list is used when all
process pagetables must be updated.  pgd_list is threaded via
otherwise unused entries in the page structure for the pgd, which
means that the pgd must be page-sized for this to work.
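
For reference, the threading works roughly like this (a sketch of the
existing pgd_list_add() in arch/i386/mm/pgtable.c, not something this
patch changes); the pgd's struct page doubles as the list node:

	struct page *page = virt_to_page(pgd);	/* needs a page-sized pgd */

	page->index = (unsigned long)pgd_list;			/* next */
	if (pgd_list)
		set_page_private(pgd_list, (unsigned long)&page->index);
	pgd_list = page;
	set_page_private(page, (unsigned long)&pgd_list);	/* pprev */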

Normally the PAE pgd is only four 64-bit entries (32 bytes) in size,
but Xen requires the PAE pgd to be page-aligned anyway, so this patch
forces the pgd to be page aligned+sized when the kernel pmd is
unshared, to accommodate both these requirements.

Also, since there may be several distinct kernel pmds (if the
user/kernel split is below 3G), there's no point in allocating them
from a slab cache; they're just allocated with get_free_page and
initialized appropriately.  (Of course they could be cached if there is
just a single kernel pmd - which is the default with a 3G user/kernel
split - but it doesn't seem worthwhile to add yet another case into
this code).

[ Many thanks to wli for review comments. ]

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: William Lee Irwin III <wli@holomorphy.com>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Christoph Lameter <clameter@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
---
 arch/i386/kernel/paravirt.c            |    1 
 arch/i386/mm/fault.c                   |    6 +-
 arch/i386/mm/init.c                    |   18 +++++-
 arch/i386/mm/pageattr.c                |    2 
 arch/i386/mm/pgtable.c                 |   84 ++++++++++++++++++++++++++------
 include/asm-i386/paravirt.h            |    1 
 include/asm-i386/pgtable-2level-defs.h |    2 
 include/asm-i386/pgtable-2level.h      |    2 
 include/asm-i386/pgtable-3level-defs.h |    6 ++
 include/asm-i386/pgtable-3level.h      |    2 
 include/asm-i386/pgtable.h             |    7 ++
 11 files changed, 105 insertions(+), 26 deletions(-)

===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -604,6 +604,7 @@ struct paravirt_ops paravirt_ops = {
 	.name = "bare hardware",
 	.paravirt_enabled = 0,
 	.kernel_rpl = 0,
+	.shared_kernel_pmd = 1,	/* Only used when CONFIG_X86_PAE is set */
 
  	.patch = native_patch,
 	.banner = default_banner,
===================================================================
--- a/arch/i386/mm/fault.c
+++ b/arch/i386/mm/fault.c
@@ -588,8 +588,7 @@ do_sigbus:
 	force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
 }
 
-#ifndef CONFIG_X86_PAE
-void vmalloc_sync_all(void)
+void _vmalloc_sync_all(void)
 {
 	/*
 	 * Note that races in the updates of insync and start aren't
@@ -600,6 +599,8 @@ void vmalloc_sync_all(void)
 	static DECLARE_BITMAP(insync, PTRS_PER_PGD);
 	static unsigned long start = TASK_SIZE;
 	unsigned long address;
+
+	BUG_ON(SHARED_KERNEL_PMD);
 
 	BUILD_BUG_ON(TASK_SIZE & ~PGDIR_MASK);
 	for (address = start; address >= TASK_SIZE; address += PGDIR_SIZE) {
@@ -623,4 +624,3 @@ void vmalloc_sync_all(void)
 			start = address + PGDIR_SIZE;
 	}
 }
-#endif
===================================================================
--- a/arch/i386/mm/init.c
+++ b/arch/i386/mm/init.c
@@ -715,6 +715,8 @@ struct kmem_cache *pmd_cache;
 
 void __init pgtable_cache_init(void)
 {
+	size_t pgd_size = PTRS_PER_PGD*sizeof(pgd_t);
+
 	if (PTRS_PER_PMD > 1) {
 		pmd_cache = kmem_cache_create("pmd",
 					PTRS_PER_PMD*sizeof(pmd_t),
@@ -724,13 +726,23 @@ void __init pgtable_cache_init(void)
 					NULL);
 		if (!pmd_cache)
 			panic("pgtable_cache_init(): cannot create pmd cache");
+
+		if (!SHARED_KERNEL_PMD) {
+			/* If we're in PAE mode and have a non-shared
+			   kernel pmd, then the pgd size must be a
+			   page size.  This is because the pgd_list
+			   links through the page structure, so there
+			   can only be one pgd per page for this to
+			   work. */
+			pgd_size = PAGE_SIZE;
+		}
 	}
 	pgd_cache = kmem_cache_create("pgd",
-				PTRS_PER_PGD*sizeof(pgd_t),
-				PTRS_PER_PGD*sizeof(pgd_t),
+				pgd_size,
+				pgd_size,
 				0,
 				pgd_ctor,
-				PTRS_PER_PMD == 1 ? pgd_dtor : NULL);
+				(!SHARED_KERNEL_PMD) ? pgd_dtor : NULL);
 	if (!pgd_cache)
 		panic("pgtable_cache_init(): Cannot create pgd cache");
 }
===================================================================
--- a/arch/i386/mm/pageattr.c
+++ b/arch/i386/mm/pageattr.c
@@ -91,7 +91,7 @@ static void set_pmd_pte(pte_t *kpte, uns
 	unsigned long flags;
 
 	set_pte_atomic(kpte, pte); 	/* change init_mm */
-	if (PTRS_PER_PMD > 1)
+	if (SHARED_KERNEL_PMD)
 		return;
 
 	spin_lock_irqsave(&pgd_lock, flags);
===================================================================
--- a/arch/i386/mm/pgtable.c
+++ b/arch/i386/mm/pgtable.c
@@ -238,35 +238,56 @@ static inline void pgd_list_del(pgd_t *p
 		set_page_private(next, (unsigned long)pprev);
 }
 
+#if (PTRS_PER_PMD == 1)
+/* Non-PAE pgd constructor */
 void pgd_ctor(void *pgd, struct kmem_cache *cache, unsigned long unused)
 {
 	unsigned long flags;
 
-	if (PTRS_PER_PMD == 1) {
-		memset(pgd, 0, USER_PTRS_PER_PGD*sizeof(pgd_t));
-		spin_lock_irqsave(&pgd_lock, flags);
-	}
+	/* !PAE, no pagetable sharing */
+	memset(pgd, 0, USER_PTRS_PER_PGD*sizeof(pgd_t));
 
 	clone_pgd_range((pgd_t *)pgd + USER_PTRS_PER_PGD,
 			swapper_pg_dir + USER_PTRS_PER_PGD,
 			KERNEL_PGD_PTRS);
 
-	if (PTRS_PER_PMD > 1)
-		return;
+	spin_lock_irqsave(&pgd_lock, flags);
 
 	/* must happen under lock */
 	paravirt_alloc_pd_clone(__pa(pgd) >> PAGE_SHIFT,
-			__pa(swapper_pg_dir) >> PAGE_SHIFT,
-			USER_PTRS_PER_PGD, PTRS_PER_PGD - USER_PTRS_PER_PGD);
+				__pa(swapper_pg_dir) >> PAGE_SHIFT,
+				USER_PTRS_PER_PGD,
+				KERNEL_PGD_PTRS);
 
 	pgd_list_add(pgd);
 	spin_unlock_irqrestore(&pgd_lock, flags);
 }
-
-/* never called when PTRS_PER_PMD > 1 */
+#else  /* PTRS_PER_PMD > 1 */
+/* PAE pgd constructor */
+void pgd_ctor(void *pgd, struct kmem_cache *cache, unsigned long unused)
+{
+	/* PAE, kernel PMD may be shared */
+
+	if (SHARED_KERNEL_PMD) {
+		clone_pgd_range((pgd_t *)pgd + USER_PTRS_PER_PGD,
+				swapper_pg_dir + USER_PTRS_PER_PGD,
+				KERNEL_PGD_PTRS);
+	} else {
+		unsigned long flags;
+
+		memset(pgd, 0, USER_PTRS_PER_PGD*sizeof(pgd_t));
+		spin_lock_irqsave(&pgd_lock, flags);
+		pgd_list_add(pgd);
+		spin_unlock_irqrestore(&pgd_lock, flags);
+	}
+}
+#endif	/* PTRS_PER_PMD */
+
 void pgd_dtor(void *pgd, struct kmem_cache *cache, unsigned long unused)
 {
 	unsigned long flags; /* can be called from interrupt context */
+
+	BUG_ON(SHARED_KERNEL_PMD);
 
 	paravirt_release_pd(__pa(pgd) >> PAGE_SHIFT);
 	spin_lock_irqsave(&pgd_lock, flags);
@@ -274,6 +295,37 @@ void pgd_dtor(void *pgd, struct kmem_cac
 	spin_unlock_irqrestore(&pgd_lock, flags);
 }
 
+#define UNSHARED_PTRS_PER_PGD				\
+	(SHARED_KERNEL_PMD ? USER_PTRS_PER_PGD : PTRS_PER_PGD)
+
+/* If we allocate a pmd for part of the kernel address space, then
+   make sure its initialized with the appropriate kernel mappings.
+   Otherwise use a cached zeroed pmd.  */
+static pmd_t *pmd_cache_alloc(int idx)
+{
+	pmd_t *pmd;
+
+	if (idx >= USER_PTRS_PER_PGD) {
+		pmd = (pmd_t *)__get_free_page(GFP_KERNEL);
+
+		if (pmd)
+			memcpy(pmd,
+			       (void *)pgd_page_vaddr(swapper_pg_dir[idx]),
+			       sizeof(pmd_t) * PTRS_PER_PMD);
+	} else
+		pmd = kmem_cache_alloc(pmd_cache, GFP_KERNEL);
+
+	return pmd;
+}
+
+static void pmd_cache_free(pmd_t *pmd, int idx)
+{
+	if (idx >= USER_PTRS_PER_PGD)
+		free_page((unsigned long)pmd);
+	else
+		kmem_cache_free(pmd_cache, pmd);
+}
+
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
 	int i;
@@ -282,10 +334,12 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 	if (PTRS_PER_PMD == 1 || !pgd)
 		return pgd;
 
-	for (i = 0; i < USER_PTRS_PER_PGD; ++i) {
-		pmd_t *pmd = kmem_cache_alloc(pmd_cache, GFP_KERNEL);
+ 	for (i = 0; i < UNSHARED_PTRS_PER_PGD; ++i) {
+		pmd_t *pmd = pmd_cache_alloc(i);
+
 		if (!pmd)
 			goto out_oom;
+
 		paravirt_alloc_pd(__pa(pmd) >> PAGE_SHIFT);
 		set_pgd(&pgd[i], __pgd(1 + __pa(pmd)));
 	}
@@ -296,7 +350,7 @@ out_oom:
 		pgd_t pgdent = pgd[i];
 		void* pmd = (void *)__va(pgd_val(pgdent)-1);
 		paravirt_release_pd(__pa(pmd) >> PAGE_SHIFT);
-		kmem_cache_free(pmd_cache, pmd);
+		pmd_cache_free(pmd, i);
 	}
 	kmem_cache_free(pgd_cache, pgd);
 	return NULL;
@@ -308,11 +362,11 @@ void pgd_free(pgd_t *pgd)
 
 	/* in the PAE case user pgd entries are overwritten before usage */
 	if (PTRS_PER_PMD > 1)
-		for (i = 0; i < USER_PTRS_PER_PGD; ++i) {
+		for (i = 0; i < UNSHARED_PTRS_PER_PGD; ++i) {
 			pgd_t pgdent = pgd[i];
 			void* pmd = (void *)__va(pgd_val(pgdent)-1);
 			paravirt_release_pd(__pa(pmd) >> PAGE_SHIFT);
-			kmem_cache_free(pmd_cache, pmd);
+			pmd_cache_free(pmd, i);
 		}
 	/* in the non-PAE case, free_pgtables() clears user pgd entries */
 	kmem_cache_free(pgd_cache, pgd);
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -34,6 +34,7 @@ struct paravirt_ops
 struct paravirt_ops
 {
 	unsigned int kernel_rpl;
+	int shared_kernel_pmd;
  	int paravirt_enabled;
 	const char *name;
 
===================================================================
--- a/include/asm-i386/pgtable-2level-defs.h
+++ b/include/asm-i386/pgtable-2level-defs.h
@@ -1,5 +1,7 @@
 #ifndef _I386_PGTABLE_2LEVEL_DEFS_H
 #define _I386_PGTABLE_2LEVEL_DEFS_H
+
+#define SHARED_KERNEL_PMD	0
 
 /*
  * traditional i386 two-level paging structure:
===================================================================
--- a/include/asm-i386/pgtable-2level.h
+++ b/include/asm-i386/pgtable-2level.h
@@ -65,6 +65,4 @@ static inline int pte_exec_kernel(pte_t 
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { (pte).pte_low })
 #define __swp_entry_to_pte(x)		((pte_t) { (x).val })
 
-void vmalloc_sync_all(void);
-
 #endif /* _I386_PGTABLE_2LEVEL_H */
===================================================================
--- a/include/asm-i386/pgtable-3level-defs.h
+++ b/include/asm-i386/pgtable-3level-defs.h
@@ -1,5 +1,11 @@
 #ifndef _I386_PGTABLE_3LEVEL_DEFS_H
 #define _I386_PGTABLE_3LEVEL_DEFS_H
+
+#ifdef CONFIG_PARAVIRT
+#define SHARED_KERNEL_PMD	(paravirt_ops.shared_kernel_pmd)
+#else
+#define SHARED_KERNEL_PMD	1
+#endif
 
 /*
  * PGDIR_SHIFT determines what a top-level page table entry can map
===================================================================
--- a/include/asm-i386/pgtable-3level.h
+++ b/include/asm-i386/pgtable-3level.h
@@ -180,6 +180,4 @@ static inline pmd_t pfn_pmd(unsigned lon
 
 #define __pmd_free_tlb(tlb, x)		do { } while (0)
 
-#define vmalloc_sync_all() ((void)0)
-
 #endif /* _I386_PGTABLE_3LEVEL_H */
===================================================================
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -244,6 +244,13 @@ static inline pte_t pte_mkwrite(pte_t pt
 static inline pte_t pte_mkwrite(pte_t pte)	{ (pte).pte_low |= _PAGE_RW; return pte; }
 static inline pte_t pte_mkhuge(pte_t pte)	{ (pte).pte_low |= _PAGE_PSE; return pte; }
 
+extern void _vmalloc_sync_all(void);
+static inline void vmalloc_sync_all(void)
+{
+	if (!SHARED_KERNEL_PMD)
+		_vmalloc_sync_all();
+}
+
 #ifdef CONFIG_X86_PAE
 # include <asm/pgtable-3level.h>
 #else

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [patch 08/20] add hooks to intercept mm creation and destruction
  2007-04-04 19:11 [patch 00/20] paravirt_ops updates Jeremy Fitzhardinge
                   ` (6 preceding siblings ...)
  2007-04-04 19:11 ` [patch 07/20] Allow paravirt backend to choose kernel PMD sharing Jeremy Fitzhardinge
@ 2007-04-04 19:11 ` Jeremy Fitzhardinge
  2007-04-04 19:12 ` [patch 09/20] rename struct paravirt_patch to paravirt_patch_site for clarity Jeremy Fitzhardinge
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-04 19:11 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-arch, virtualization, James Bottomley, Andrew Morton,
	Ingo Molnar, lkml

[-- Attachment #1: mm-lifetime-hooks.patch --]
[-- Type: text/plain, Size: 14740 bytes --]

Add hooks to allow a paravirt implementation to track the lifetime of
an mm.  Paravirtualization requires three hooks, but only two are
needed in common code.  They are:

arch_dup_mmap, which is called when a new mmap is created at fork.

arch_exit_mmap, which is called when the last process reference to an
  mm is dropped; this typically happens on exit and exec.

The third hook is activate_mm, which is called from the arch-specific
activate_mm() macro/function, and so doesn't need stub versions for
other architectures.  It's called when an mm is first used.
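
As a rough sketch of how a backend would consume these hooks (the
example_* names below are made up for illustration; only the three
function-pointer slots themselves come from this patch):

static void example_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
{
	/* fork has just populated 'mm'; start tracking it here */
}

static void example_exit_mmap(struct mm_struct *mm)
{
	/* last user of 'mm' is gone; drop any per-mm backend state */
}

static void example_activate_mm(struct mm_struct *prev, struct mm_struct *next)
{
	/* 'next' is being switched to for the first time */
}

/* ... and in the backend's paravirt_ops initializer: */
	.dup_mmap = example_dup_mmap,
	.exit_mmap = example_exit_mmap,
	.activate_mm = example_activate_mm,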

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: linux-arch@vger.kernel.org
Cc: James Bottomley <James.Bottomley@SteelEye.com>
Acked-by: Ingo Molnar <mingo@elte.hu>

---
 arch/i386/kernel/paravirt.c         |    4 ++++
 include/asm-alpha/mmu_context.h     |    1 +
 include/asm-arm/mmu_context.h       |    1 +
 include/asm-arm26/mmu_context.h     |    2 ++
 include/asm-avr32/mmu_context.h     |    1 +
 include/asm-cris/mmu_context.h      |    2 ++
 include/asm-frv/mmu_context.h       |    1 +
 include/asm-generic/mm_hooks.h      |   18 ++++++++++++++++++
 include/asm-h8300/mmu_context.h     |    1 +
 include/asm-i386/mmu_context.h      |   17 +++++++++++++++--
 include/asm-i386/paravirt.h         |   23 +++++++++++++++++++++++
 include/asm-ia64/mmu_context.h      |    1 +
 include/asm-m32r/mmu_context.h      |    1 +
 include/asm-m68k/mmu_context.h      |    1 +
 include/asm-m68knommu/mmu_context.h |    1 +
 include/asm-mips/mmu_context.h      |    1 +
 include/asm-parisc/mmu_context.h    |    1 +
 include/asm-powerpc/mmu_context.h   |    1 +
 include/asm-ppc/mmu_context.h       |    1 +
 include/asm-s390/mmu_context.h      |    2 ++
 include/asm-sh/mmu_context.h        |    1 +
 include/asm-sh64/mmu_context.h      |    2 +-
 include/asm-sparc/mmu_context.h     |    2 ++
 include/asm-sparc64/mmu_context.h   |    1 +
 include/asm-um/mmu_context.h        |    2 ++
 include/asm-v850/mmu_context.h      |    2 ++
 include/asm-x86_64/mmu_context.h    |    1 +
 include/asm-xtensa/mmu_context.h    |    1 +
 kernel/fork.c                       |    2 ++
 mm/mmap.c                           |    4 ++++
 30 files changed, 96 insertions(+), 3 deletions(-)

===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -520,6 +520,10 @@ struct paravirt_ops paravirt_ops = {
 	.irq_enable_sysexit = native_irq_enable_sysexit,
 	.iret = native_iret,
 
+	.dup_mmap = paravirt_nop,
+	.exit_mmap = paravirt_nop,
+	.activate_mm = paravirt_nop,
+
 	.startup_ipi_hook = paravirt_nop,
 };
 
===================================================================
--- a/include/asm-alpha/mmu_context.h
+++ b/include/asm-alpha/mmu_context.h
@@ -10,6 +10,7 @@
 #include <asm/system.h>
 #include <asm/machvec.h>
 #include <asm/compiler.h>
+#include <asm-generic/mm_hooks.h>
 
 /*
  * Force a context reload. This is needed when we change the page
===================================================================
--- a/include/asm-arm/mmu_context.h
+++ b/include/asm-arm/mmu_context.h
@@ -16,6 +16,7 @@
 #include <linux/compiler.h>
 #include <asm/cacheflush.h>
 #include <asm/proc-fns.h>
+#include <asm-generic/mm_hooks.h>
 
 void __check_kvm_seq(struct mm_struct *mm);
 
===================================================================
--- a/include/asm-arm26/mmu_context.h
+++ b/include/asm-arm26/mmu_context.h
@@ -12,6 +12,8 @@
  */
 #ifndef __ASM_ARM_MMU_CONTEXT_H
 #define __ASM_ARM_MMU_CONTEXT_H
+
+#include <asm-generic/mm_hooks.h>
 
 #define init_new_context(tsk,mm)	0
 #define destroy_context(mm)		do { } while(0)
===================================================================
--- a/include/asm-avr32/mmu_context.h
+++ b/include/asm-avr32/mmu_context.h
@@ -15,6 +15,7 @@
 #include <asm/tlbflush.h>
 #include <asm/pgalloc.h>
 #include <asm/sysreg.h>
+#include <asm-generic/mm_hooks.h>
 
 /*
  * The MMU "context" consists of two things:
===================================================================
--- a/include/asm-cris/mmu_context.h
+++ b/include/asm-cris/mmu_context.h
@@ -1,5 +1,7 @@
 #ifndef __CRIS_MMU_CONTEXT_H
 #define __CRIS_MMU_CONTEXT_H
+
+#include <asm-generic/mm_hooks.h>
 
 extern int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
 extern void get_mmu_context(struct mm_struct *mm);
===================================================================
--- a/include/asm-frv/mmu_context.h
+++ b/include/asm-frv/mmu_context.h
@@ -15,6 +15,7 @@
 #include <asm/setup.h>
 #include <asm/page.h>
 #include <asm/pgalloc.h>
+#include <asm-generic/mm_hooks.h>
 
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
===================================================================
--- /dev/null
+++ b/include/asm-generic/mm_hooks.h
@@ -0,0 +1,18 @@
+/*
+ * Define generic no-op hooks for arch_dup_mmap and arch_exit_mmap, to
+ * be included in asm-FOO/mmu_context.h for any arch FOO which doesn't
+ * need to hook these.
+ */
+#ifndef _ASM_GENERIC_MM_HOOKS_H
+#define _ASM_GENERIC_MM_HOOKS_H
+
+static inline void arch_dup_mmap(struct mm_struct *oldmm,
+				 struct mm_struct *mm)
+{
+}
+
+static inline void arch_exit_mmap(struct mm_struct *mm)
+{
+}
+
+#endif	/* _ASM_GENERIC_MM_HOOKS_H */
===================================================================
--- a/include/asm-h8300/mmu_context.h
+++ b/include/asm-h8300/mmu_context.h
@@ -4,6 +4,7 @@
 #include <asm/setup.h>
 #include <asm/page.h>
 #include <asm/pgalloc.h>
+#include <asm-generic/mm_hooks.h>
 
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
===================================================================
--- a/include/asm-i386/mmu_context.h
+++ b/include/asm-i386/mmu_context.h
@@ -5,6 +5,16 @@
 #include <asm/atomic.h>
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
+#include <asm/paravirt.h>
+#ifndef CONFIG_PARAVIRT
+#include <asm-generic/mm_hooks.h>
+
+static inline void paravirt_activate_mm(struct mm_struct *prev,
+					struct mm_struct *next)
+{
+}
+#endif	/* !CONFIG_PARAVIRT */
+
 
 /*
  * Used for LDT copy/destruction.
@@ -65,7 +75,10 @@ static inline void switch_mm(struct mm_s
 #define deactivate_mm(tsk, mm)			\
 	asm("movl %0,%%gs": :"r" (0));
 
-#define activate_mm(prev, next) \
-	switch_mm((prev),(next),NULL)
+#define activate_mm(prev, next)				\
+	do {						\
+		paravirt_activate_mm(prev, next);	\
+		switch_mm((prev),(next),NULL);		\
+	} while(0)
 
 #endif
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -119,6 +119,12 @@ struct paravirt_ops
 
 	void (*io_delay)(void);
 
+	void (*activate_mm)(struct mm_struct *prev,
+			    struct mm_struct *next);
+	void (*dup_mmap)(struct mm_struct *oldmm,
+			 struct mm_struct *mm);
+	void (*exit_mmap)(struct mm_struct *mm);
+
 #ifdef CONFIG_X86_LOCAL_APIC
 	void (*apic_write)(unsigned long reg, unsigned long v);
 	void (*apic_write_atomic)(unsigned long reg, unsigned long v);
@@ -394,6 +400,23 @@ static inline void startup_ipi_hook(int 
 }
 #endif
 
+static inline void paravirt_activate_mm(struct mm_struct *prev,
+					struct mm_struct *next)
+{
+	paravirt_ops.activate_mm(prev, next);
+}
+
+static inline void arch_dup_mmap(struct mm_struct *oldmm,
+				 struct mm_struct *mm)
+{
+	paravirt_ops.dup_mmap(oldmm, mm);
+}
+
+static inline void arch_exit_mmap(struct mm_struct *mm)
+{
+	paravirt_ops.exit_mmap(mm);
+}
+
 #define __flush_tlb() paravirt_ops.flush_tlb_user()
 #define __flush_tlb_global() paravirt_ops.flush_tlb_kernel()
 #define __flush_tlb_single(addr) paravirt_ops.flush_tlb_single(addr)
===================================================================
--- a/include/asm-ia64/mmu_context.h
+++ b/include/asm-ia64/mmu_context.h
@@ -29,6 +29,7 @@
 #include <linux/spinlock.h>
 
 #include <asm/processor.h>
+#include <asm-generic/mm_hooks.h>
 
 struct ia64_ctx {
 	spinlock_t lock;
===================================================================
--- a/include/asm-m32r/mmu_context.h
+++ b/include/asm-m32r/mmu_context.h
@@ -15,6 +15,7 @@
 #include <asm/pgalloc.h>
 #include <asm/mmu.h>
 #include <asm/tlbflush.h>
+#include <asm-generic/mm_hooks.h>
 
 /*
  * Cache of MMU context last used.
===================================================================
--- a/include/asm-m68k/mmu_context.h
+++ b/include/asm-m68k/mmu_context.h
@@ -1,6 +1,7 @@
 #ifndef __M68K_MMU_CONTEXT_H
 #define __M68K_MMU_CONTEXT_H
 
+#include <asm-generic/mm_hooks.h>
 
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
===================================================================
--- a/include/asm-m68knommu/mmu_context.h
+++ b/include/asm-m68knommu/mmu_context.h
@@ -4,6 +4,7 @@
 #include <asm/setup.h>
 #include <asm/page.h>
 #include <asm/pgalloc.h>
+#include <asm-generic/mm_hooks.h>
 
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
===================================================================
--- a/include/asm-mips/mmu_context.h
+++ b/include/asm-mips/mmu_context.h
@@ -20,6 +20,7 @@
 #include <asm/mipsmtregs.h>
 #include <asm/smtc.h>
 #endif /* SMTC */
+#include <asm-generic/mm_hooks.h>
 
 /*
  * For the fast tlb miss handlers, we keep a per cpu array of pointers
===================================================================
--- a/include/asm-parisc/mmu_context.h
+++ b/include/asm-parisc/mmu_context.h
@@ -5,6 +5,7 @@
 #include <asm/atomic.h>
 #include <asm/pgalloc.h>
 #include <asm/pgtable.h>
+#include <asm-generic/mm_hooks.h>
 
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
===================================================================
--- a/include/asm-powerpc/mmu_context.h
+++ b/include/asm-powerpc/mmu_context.h
@@ -10,6 +10,7 @@
 #include <linux/mm.h>	
 #include <asm/mmu.h>	
 #include <asm/cputable.h>
+#include <asm-generic/mm_hooks.h>
 
 /*
  * Copyright (C) 2001 PPC 64 Team, IBM Corp
===================================================================
--- a/include/asm-ppc/mmu_context.h
+++ b/include/asm-ppc/mmu_context.h
@@ -6,6 +6,7 @@
 #include <asm/bitops.h>
 #include <asm/mmu.h>
 #include <asm/cputable.h>
+#include <asm-generic/mm_hooks.h>
 
 /*
  * On 32-bit PowerPC 6xx/7xx/7xxx CPUs, we use a set of 16 VSIDs
===================================================================
--- a/include/asm-s390/mmu_context.h
+++ b/include/asm-s390/mmu_context.h
@@ -10,6 +10,8 @@
 #define __S390_MMU_CONTEXT_H
 
 #include <asm/pgalloc.h>
+#include <asm-generic/mm_hooks.h>
+
 /*
  * get a new mmu context.. S390 don't know about contexts.
  */
===================================================================
--- a/include/asm-sh/mmu_context.h
+++ b/include/asm-sh/mmu_context.h
@@ -12,6 +12,7 @@
 #include <asm/tlbflush.h>
 #include <asm/uaccess.h>
 #include <asm/io.h>
+#include <asm-generic/mm_hooks.h>
 
 /*
  * The MMU "context" consists of two things:
===================================================================
--- a/include/asm-sh64/mmu_context.h
+++ b/include/asm-sh64/mmu_context.h
@@ -27,7 +27,7 @@ extern unsigned long mmu_context_cache;
 extern unsigned long mmu_context_cache;
 
 #include <asm/page.h>
-
+#include <asm-generic/mm_hooks.h>
 
 /* Current mm's pgd */
 extern pgd_t *mmu_pdtp_cache;
===================================================================
--- a/include/asm-sparc/mmu_context.h
+++ b/include/asm-sparc/mmu_context.h
@@ -4,6 +4,8 @@
 #include <asm/btfixup.h>
 
 #ifndef __ASSEMBLY__
+
+#include <asm-generic/mm_hooks.h>
 
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
===================================================================
--- a/include/asm-sparc64/mmu_context.h
+++ b/include/asm-sparc64/mmu_context.h
@@ -9,6 +9,7 @@
 #include <linux/spinlock.h>
 #include <asm/system.h>
 #include <asm/spitfire.h>
+#include <asm-generic/mm_hooks.h>
 
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
===================================================================
--- a/include/asm-um/mmu_context.h
+++ b/include/asm-um/mmu_context.h
@@ -5,6 +5,8 @@
 
 #ifndef __UM_MMU_CONTEXT_H
 #define __UM_MMU_CONTEXT_H
+
+#include <asm-generic/mm_hooks.h>
 
 #include "linux/sched.h"
 #include "choose-mode.h"
===================================================================
--- a/include/asm-v850/mmu_context.h
+++ b/include/asm-v850/mmu_context.h
@@ -1,5 +1,7 @@
 #ifndef __V850_MMU_CONTEXT_H__
 #define __V850_MMU_CONTEXT_H__
+
+#include <asm-generic/mm_hooks.h>
 
 #define destroy_context(mm)		((void)0)
 #define init_new_context(tsk,mm)	0
===================================================================
--- a/include/asm-x86_64/mmu_context.h
+++ b/include/asm-x86_64/mmu_context.h
@@ -7,6 +7,7 @@
 #include <asm/pda.h>
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
+#include <asm-generic/mm_hooks.h>
 
 /*
  * possibly do the LDT unload here?
===================================================================
--- a/include/asm-xtensa/mmu_context.h
+++ b/include/asm-xtensa/mmu_context.h
@@ -18,6 +18,7 @@
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
+#include <asm-generic/mm_hooks.h>
 
 #define XCHAL_MMU_ASID_BITS	8
 
===================================================================
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -286,6 +286,8 @@ static inline int dup_mmap(struct mm_str
 		if (retval)
 			goto out;
 	}
+	/* a new mm has just been created */
+	arch_dup_mmap(oldmm, mm);
 	retval = 0;
 out:
 	up_write(&mm->mmap_sem);
===================================================================
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -29,6 +29,7 @@
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
 #include <asm/tlb.h>
+#include <asm/mmu_context.h>
 
 #ifndef arch_mmap_check
 #define arch_mmap_check(addr, len, flags)	(0)
@@ -1976,6 +1977,9 @@ void exit_mmap(struct mm_struct *mm)
 	struct vm_area_struct *vma = mm->mmap;
 	unsigned long nr_accounted = 0;
 	unsigned long end;
+
+	/* mm's last user has gone, and it's about to be pulled down */
+	arch_exit_mmap(mm);
 
 	lru_add_drain();
 	flush_cache_mm(mm);

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [patch 09/20] rename struct paravirt_patch to paravirt_patch_site for clarity
  2007-04-04 19:11 [patch 00/20] paravirt_ops updates Jeremy Fitzhardinge
                   ` (7 preceding siblings ...)
  2007-04-04 19:11 ` [patch 08/20] add hooks to intercept mm creation and destruction Jeremy Fitzhardinge
@ 2007-04-04 19:12 ` Jeremy Fitzhardinge
  2007-04-06 23:18   ` Andrew Morton
  2007-04-04 19:12 ` [patch 10/20] Use patch site IDs computed from offset in paravirt_ops structure Jeremy Fitzhardinge
                   ` (10 subsequent siblings)
  19 siblings, 1 reply; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-04 19:12 UTC (permalink / raw)
  To: Andi Kleen; +Cc: virtualization, Andrew Morton, lkml

[-- Attachment #1: paravirt-patch-rename-paravirt_patch.patch --]
[-- Type: text/plain, Size: 3362 bytes --]

Rename struct paravirt_patch to paravirt_patch_site, so that it
clearly refers to a callsite, and not the patch which may be applied
to that callsite.
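
A paravirt_patch_site is just a record of one callsite; apply_paravirt()
walks the array of them, roughly like this sketch (simplified, and
backend_patch is a made-up name standing in for whatever the backend
uses to rewrite a site):

	struct paravirt_patch_site *p;

	for (p = start; p < end; p++)
		/* p->instr is the callsite, p->instrtype says which op it
		   calls, p->len and p->clobbers bound what may be written
		   over it */
		backend_patch(p->instrtype, p->clobbers, p->instr, p->len);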

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>

---
 arch/i386/kernel/alternative.c |    9 ++++-----
 arch/i386/kernel/vmi.c         |    4 ----
 include/asm-i386/alternative.h |    8 +++++---
 include/asm-i386/paravirt.h    |    5 ++++-
 4 files changed, 13 insertions(+), 13 deletions(-)

===================================================================
--- a/arch/i386/kernel/alternative.c
+++ b/arch/i386/kernel/alternative.c
@@ -335,9 +335,10 @@ void alternatives_smp_switch(int smp)
 #endif
 
 #ifdef CONFIG_PARAVIRT
-void apply_paravirt(struct paravirt_patch *start, struct paravirt_patch *end)
-{
-	struct paravirt_patch *p;
+void apply_paravirt(struct paravirt_patch_site *start,
+		    struct paravirt_patch_site *end)
+{
+	struct paravirt_patch_site *p;
 
 	if (noreplace_paravirt)
 		return;
@@ -355,8 +356,6 @@ void apply_paravirt(struct paravirt_patc
 	/* Sync to be conservative, in case we patched following instructions */
 	sync_core();
 }
-extern struct paravirt_patch __parainstructions[],
-	__parainstructions_end[];
 #endif	/* CONFIG_PARAVIRT */
 
 void __init alternative_instructions(void)
===================================================================
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -70,10 +70,6 @@ static struct {
 	void (*set_initial_ap_state)(int, int);
 	void (*halt)(void);
 } vmi_ops;
-
-/* XXX move this to alternative.h */
-extern struct paravirt_patch __parainstructions[],
-	__parainstructions_end[];
 
 /*
  * VMI patching routines.
===================================================================
--- a/include/asm-i386/alternative.h
+++ b/include/asm-i386/alternative.h
@@ -114,12 +114,14 @@ static inline void alternatives_smp_swit
 #define LOCK_PREFIX ""
 #endif
 
-struct paravirt_patch;
+struct paravirt_patch_site;
 #ifdef CONFIG_PARAVIRT
-void apply_paravirt(struct paravirt_patch *start, struct paravirt_patch *end);
+void apply_paravirt(struct paravirt_patch_site *start,
+		    struct paravirt_patch_site *end);
 #else
 static inline void
-apply_paravirt(struct paravirt_patch *start, struct paravirt_patch *end)
+apply_paravirt(struct paravirt_patch_site *start,
+	       struct paravirt_patch_site *end)
 {}
 #define __parainstructions	NULL
 #define __parainstructions_end	NULL
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -502,12 +502,15 @@ void _paravirt_nop(void);
 #define paravirt_nop	((void *)_paravirt_nop)
 
 /* These all sit in the .parainstructions section to tell us what to patch. */
-struct paravirt_patch {
+struct paravirt_patch_site {
 	u8 *instr; 		/* original instructions */
 	u8 instrtype;		/* type of this instruction */
 	u8 len;			/* length of original instruction */
 	u16 clobbers;		/* what registers you may clobber */
 };
+
+extern struct paravirt_patch_site __parainstructions[],
+	__parainstructions_end[];
 
 #define paravirt_alt(insn_string, typenum, clobber)	\
 	"771:\n\t" insn_string "\n" "772:\n"		\

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [patch 10/20] Use patch site IDs computed from offset in paravirt_ops structure
  2007-04-04 19:11 [patch 00/20] paravirt_ops updates Jeremy Fitzhardinge
                   ` (8 preceding siblings ...)
  2007-04-04 19:12 ` [patch 09/20] rename struct paravirt_patch to paravirt_patch_site for clarity Jeremy Fitzhardinge
@ 2007-04-04 19:12 ` Jeremy Fitzhardinge
  2007-04-04 19:12 ` [patch 11/20] Fix patch site clobbers to include return register Jeremy Fitzhardinge
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-04 19:12 UTC (permalink / raw)
  To: Andi Kleen; +Cc: virtualization, Andrew Morton, lkml

[-- Attachment #1: paravirt-use-offset-site-ids.patch --]
[-- Type: text/plain, Size: 12926 bytes --]

Use patch type identifiers derived from the offset of the operation in
the paravirt_ops structure.  This avoids having to maintain a separate
enum for patch site types.

Also, since the identifier is derived from the offset into
paravirt_ops, the offset can be derived from the identifier.  This is
used to remove replicated information in the various callsite macros,
which has been a source of bugs in the past.

This patch also drops the fused save_fl+cli operation, which doesn't
really add much and makes things more complex - specifically because
it breaks the 1:1 relationship between identifiers and offsets.  If
this operation turns out to be particularly beneficial, then the right
answer is to define a new entrypoint for it.
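
Concretely, the identifier is just the slot index of the function
pointer, and a backend patcher switches on the same expression, so the
two can never drift apart (the switch is lifted from the vmi_patch()
hunk below):

	#define PARAVIRT_PATCH(x)					\
		(offsetof(struct paravirt_ops, x) / sizeof(void *))

	switch (type) {
	case PARAVIRT_PATCH(irq_disable):
		return patch_internal(VMI_CALL_DisableInterrupts, len, insns);
	case PARAVIRT_PATCH(irq_enable):
		return patch_internal(VMI_CALL_EnableInterrupts, len, insns);
	}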

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>

---
 arch/i386/kernel/paravirt.c |   14 +--
 arch/i386/kernel/vmi.c      |   39 +--------
 include/asm-i386/paravirt.h |  179 ++++++++++++++++++++++---------------------
 3 files changed, 105 insertions(+), 127 deletions(-)

===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -58,7 +58,6 @@ DEF_NATIVE(sti, "sti");
 DEF_NATIVE(sti, "sti");
 DEF_NATIVE(popf, "push %eax; popf");
 DEF_NATIVE(pushf, "pushf; pop %eax");
-DEF_NATIVE(pushf_cli, "pushf; pop %eax; cli");
 DEF_NATIVE(iret, "iret");
 DEF_NATIVE(sti_sysexit, "sti; sysexit");
 
@@ -66,13 +65,12 @@ static const struct native_insns
 {
 	const char *start, *end;
 } native_insns[] = {
-	[PARAVIRT_IRQ_DISABLE] = { start_cli, end_cli },
-	[PARAVIRT_IRQ_ENABLE] = { start_sti, end_sti },
-	[PARAVIRT_RESTORE_FLAGS] = { start_popf, end_popf },
-	[PARAVIRT_SAVE_FLAGS] = { start_pushf, end_pushf },
-	[PARAVIRT_SAVE_FLAGS_IRQ_DISABLE] = { start_pushf_cli, end_pushf_cli },
-	[PARAVIRT_INTERRUPT_RETURN] = { start_iret, end_iret },
-	[PARAVIRT_STI_SYSEXIT] = { start_sti_sysexit, end_sti_sysexit },
+	[PARAVIRT_PATCH(irq_disable)] = { start_cli, end_cli },
+	[PARAVIRT_PATCH(irq_enable)] = { start_sti, end_sti },
+	[PARAVIRT_PATCH(restore_fl)] = { start_popf, end_popf },
+	[PARAVIRT_PATCH(save_fl)] = { start_pushf, end_pushf },
+	[PARAVIRT_PATCH(iret)] = { start_iret, end_iret },
+	[PARAVIRT_PATCH(irq_enable_sysexit)] = { start_sti_sysexit, end_sti_sysexit },
 };
 
 static unsigned native_patch(u8 type, u16 clobbers, void *insns, unsigned len)
===================================================================
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -78,11 +78,6 @@ static struct {
 #define MNEM_JMP  0xe9
 #define MNEM_RET  0xc3
 
-static char irq_save_disable_callout[] = {
-	MNEM_CALL, 0, 0, 0, 0,
-	MNEM_CALL, 0, 0, 0, 0,
-	MNEM_RET
-};
 #define IRQ_PATCH_INT_MASK 0
 #define IRQ_PATCH_DISABLE  5
 
@@ -130,33 +125,17 @@ static unsigned vmi_patch(u8 type, u16 c
 static unsigned vmi_patch(u8 type, u16 clobbers, void *insns, unsigned len)
 {
 	switch (type) {
-		case PARAVIRT_IRQ_DISABLE:
+		case PARAVIRT_PATCH(irq_disable):
 			return patch_internal(VMI_CALL_DisableInterrupts, len, insns);
-		case PARAVIRT_IRQ_ENABLE:
+		case PARAVIRT_PATCH(irq_enable):
 			return patch_internal(VMI_CALL_EnableInterrupts, len, insns);
-		case PARAVIRT_RESTORE_FLAGS:
+		case PARAVIRT_PATCH(restore_fl):
 			return patch_internal(VMI_CALL_SetInterruptMask, len, insns);
-		case PARAVIRT_SAVE_FLAGS:
+		case PARAVIRT_PATCH(save_fl):
 			return patch_internal(VMI_CALL_GetInterruptMask, len, insns);
-        	case PARAVIRT_SAVE_FLAGS_IRQ_DISABLE:
-			if (len >= 10) {
-				patch_internal(VMI_CALL_GetInterruptMask, len, insns);
-				patch_internal(VMI_CALL_DisableInterrupts, len-5, insns+5);
-				return 10;
-			} else {
-				/*
-				 * You bastards didn't leave enough room to
-				 * patch save_flags_irq_disable inline.  Patch
-				 * to a helper
-				 */
-				BUG_ON(len < 5);
-				*(char *)insns = MNEM_CALL;
-				patch_offset(insns, irq_save_disable_callout);
-				return 5;
-			}
-		case PARAVIRT_INTERRUPT_RETURN:
+		case PARAVIRT_PATCH(iret):
 			return patch_internal(VMI_CALL_IRET, len, insns);
-		case PARAVIRT_STI_SYSEXIT:
+		case PARAVIRT_PATCH(irq_enable_sysexit):
 			return patch_internal(VMI_CALL_SYSEXIT, len, insns);
 		default:
 			break;
@@ -763,12 +742,6 @@ static inline int __init activate_vmi(vo
 	para_fill(restore_fl, SetInterruptMask);
 	para_fill(irq_disable, DisableInterrupts);
 	para_fill(irq_enable, EnableInterrupts);
-
-	/* irq_save_disable !!! sheer pain */
-	patch_offset(&irq_save_disable_callout[IRQ_PATCH_INT_MASK],
-		     (char *)paravirt_ops.save_fl);
-	patch_offset(&irq_save_disable_callout[IRQ_PATCH_DISABLE],
-		     (char *)paravirt_ops.irq_disable);
 
 	para_fill(wbinvd, WBINVD);
 	para_fill(read_tsc, RDTSC);
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -4,18 +4,7 @@
  * para-virtualization: those hooks are defined here. */
 
 #ifdef CONFIG_PARAVIRT
-#include <linux/stringify.h>
 #include <asm/page.h>
-
-/* These are the most performance critical ops, so we want to be able to patch
- * callers */
-#define PARAVIRT_IRQ_DISABLE 0
-#define PARAVIRT_IRQ_ENABLE 1
-#define PARAVIRT_RESTORE_FLAGS 2
-#define PARAVIRT_SAVE_FLAGS 3
-#define PARAVIRT_SAVE_FLAGS_IRQ_DISABLE 4
-#define PARAVIRT_INTERRUPT_RETURN 5
-#define PARAVIRT_STI_SYSEXIT 6
 
 /* Bitmask of what can be clobbered: usually at least eax. */
 #define CLBR_NONE 0x0
@@ -191,6 +180,28 @@ struct paravirt_ops
 
 extern struct paravirt_ops paravirt_ops;
 
+#define PARAVIRT_PATCH(x)					\
+	(offsetof(struct paravirt_ops, x) / sizeof(void *))
+
+#define paravirt_type(type)					\
+	[paravirt_typenum] "i" (PARAVIRT_PATCH(type))
+#define paravirt_clobber(clobber)		\
+	[paravirt_clobber] "i" (clobber)
+
+#define PARAVIRT_CALL	"call *paravirt_ops+%c[paravirt_typenum]*4;"
+
+#define _paravirt_alt(insn_string, type, clobber)	\
+	"771:\n\t" insn_string "\n" "772:\n"		\
+	".pushsection .parainstructions,\"a\"\n"	\
+	"  .long 771b\n"				\
+	"  .byte " type "\n"				\
+	"  .byte 772b-771b\n"				\
+	"  .short " clobber "\n"			\
+	".popsection\n"
+
+#define paravirt_alt(insn_string)				\
+	_paravirt_alt(insn_string, "%c[paravirt_typenum]", "%c[paravirt_clobber]")
+
 #define paravirt_enabled() (paravirt_ops.paravirt_enabled)
 
 static inline void load_esp0(struct tss_struct *tss,
@@ -512,93 +523,89 @@ extern struct paravirt_patch_site __star
 extern struct paravirt_patch_site __start_parainstructions[],
 	__stop_parainstructions[];
 
-#define paravirt_alt(insn_string, typenum, clobber)	\
-	"771:\n\t" insn_string "\n" "772:\n"		\
-	".pushsection .parainstructions,\"a\"\n"	\
-	"  .long 771b\n"				\
-	"  .byte " __stringify(typenum) "\n"		\
-	"  .byte 772b-771b\n"				\
-	"  .short " __stringify(clobber) "\n"		\
-	".popsection"
-
 static inline unsigned long __raw_local_save_flags(void)
 {
 	unsigned long f;
 
-	__asm__ __volatile__(paravirt_alt( "pushl %%ecx; pushl %%edx;"
-					   "call *%1;"
-					   "popl %%edx; popl %%ecx",
-					  PARAVIRT_SAVE_FLAGS, CLBR_NONE)
-			     : "=a"(f): "m"(paravirt_ops.save_fl)
-			     : "memory", "cc");
+	asm volatile(paravirt_alt("pushl %%ecx; pushl %%edx;"
+				  PARAVIRT_CALL
+				  "popl %%edx; popl %%ecx")
+		     : "=a"(f)
+		     : paravirt_type(save_fl),
+		       paravirt_clobber(CLBR_NONE)
+		     : "memory", "cc");
 	return f;
 }
 
 static inline void raw_local_irq_restore(unsigned long f)
 {
-	__asm__ __volatile__(paravirt_alt( "pushl %%ecx; pushl %%edx;"
-					   "call *%1;"
-					   "popl %%edx; popl %%ecx",
-					  PARAVIRT_RESTORE_FLAGS, CLBR_EAX)
-			     : "=a"(f) : "m" (paravirt_ops.restore_fl), "0"(f)
-			     : "memory", "cc");
+	asm volatile(paravirt_alt("pushl %%ecx; pushl %%edx;"
+				  PARAVIRT_CALL
+				  "popl %%edx; popl %%ecx")
+		     : "=a"(f)
+		     : "0"(f),
+		       paravirt_type(restore_fl),
+		       paravirt_clobber(CLBR_EAX)
+		     : "memory", "cc");
 }
 
 static inline void raw_local_irq_disable(void)
 {
-	__asm__ __volatile__(paravirt_alt( "pushl %%ecx; pushl %%edx;"
-					   "call *%0;"
-					   "popl %%edx; popl %%ecx",
-					  PARAVIRT_IRQ_DISABLE, CLBR_EAX)
-			     : : "m" (paravirt_ops.irq_disable)
-			     : "memory", "eax", "cc");
+	asm volatile(paravirt_alt("pushl %%ecx; pushl %%edx;"
+				  PARAVIRT_CALL
+				  "popl %%edx; popl %%ecx")
+		     :
+		     : paravirt_type(irq_disable),
+		       paravirt_clobber(CLBR_EAX)
+		     : "memory", "eax", "cc");
 }
 
 static inline void raw_local_irq_enable(void)
 {
-	__asm__ __volatile__(paravirt_alt( "pushl %%ecx; pushl %%edx;"
-					   "call *%0;"
-					   "popl %%edx; popl %%ecx",
-					  PARAVIRT_IRQ_ENABLE, CLBR_EAX)
-			     : : "m" (paravirt_ops.irq_enable)
-			     : "memory", "eax", "cc");
+	asm volatile(paravirt_alt("pushl %%ecx; pushl %%edx;"
+				  PARAVIRT_CALL
+				  "popl %%edx; popl %%ecx")
+		     :
+		     : paravirt_type(irq_enable),
+		       paravirt_clobber(CLBR_EAX)
+		     : "memory", "eax", "cc");
 }
 
 static inline unsigned long __raw_local_irq_save(void)
 {
 	unsigned long f;
 
-	__asm__ __volatile__(paravirt_alt( "pushl %%ecx; pushl %%edx;"
-					   "call *%1; pushl %%eax;"
-					   "call *%2; popl %%eax;"
-					   "popl %%edx; popl %%ecx",
-					  PARAVIRT_SAVE_FLAGS_IRQ_DISABLE,
-					  CLBR_NONE)
-			     : "=a"(f)
-			     : "m" (paravirt_ops.save_fl),
-			       "m" (paravirt_ops.irq_disable)
-			     : "memory", "cc");
+	f = __raw_local_save_flags();
+	raw_local_irq_disable();
 	return f;
 }
 
-#define CLI_STRING paravirt_alt("pushl %%ecx; pushl %%edx;"		\
-		     "call *paravirt_ops+%c[irq_disable];"		\
-		     "popl %%edx; popl %%ecx",				\
-		     PARAVIRT_IRQ_DISABLE, CLBR_EAX)
-
-#define STI_STRING paravirt_alt("pushl %%ecx; pushl %%edx;"		\
-		     "call *paravirt_ops+%c[irq_enable];"		\
-		     "popl %%edx; popl %%ecx",				\
-		     PARAVIRT_IRQ_ENABLE, CLBR_EAX)
+#define CLI_STRING							\
+	_paravirt_alt("pushl %%ecx; pushl %%edx;"			\
+		      "call *paravirt_ops+%c[paravirt_cli_type]*4;"	\
+		      "popl %%edx; popl %%ecx",				\
+		      "%c[paravirt_cli_type]", "%c[paravirt_clobber]")
+
+#define STI_STRING							\
+	_paravirt_alt("pushl %%ecx; pushl %%edx;"			\
+		      "call *paravirt_ops+%c[paravirt_sti_type]*4;"	\
+		      "popl %%edx; popl %%ecx",				\
+		      "%c[paravirt_sti_type]", "%c[paravirt_clobber]")
+
 #define CLI_STI_CLOBBERS , "%eax"
-#define CLI_STI_INPUT_ARGS \
+#define CLI_STI_INPUT_ARGS						\
 	,								\
-	[irq_disable] "i" (offsetof(struct paravirt_ops, irq_disable)),	\
-	[irq_enable] "i" (offsetof(struct paravirt_ops, irq_enable))
+	[paravirt_cli_type] "i" (PARAVIRT_PATCH(irq_disable)),		\
+	[paravirt_sti_type] "i" (PARAVIRT_PATCH(irq_enable)),		\
+	paravirt_clobber(CLBR_EAX)
+
+#undef PARAVIRT_CALL
 
 #else  /* __ASSEMBLY__ */
 
-#define PARA_PATCH(ptype, clobbers, ops)	\
+#define PARA_PATCH(off)	((off) / 4)
+
+#define PARA_SITE(ptype, clobbers, ops)		\
 771:;						\
 	ops;					\
 772:;						\
@@ -609,25 +616,25 @@ 772:;						\
 	 .short clobbers;			\
 	.popsection
 
-#define INTERRUPT_RETURN				\
-	PARA_PATCH(PARAVIRT_INTERRUPT_RETURN, CLBR_ANY,	\
-	jmp *%cs:paravirt_ops+PARAVIRT_iret)
-
-#define DISABLE_INTERRUPTS(clobbers)			\
-	PARA_PATCH(PARAVIRT_IRQ_DISABLE, clobbers,	\
-	pushl %ecx; pushl %edx;				\
-	call *paravirt_ops+PARAVIRT_irq_disable;	\
-	popl %edx; popl %ecx)				\
-
-#define ENABLE_INTERRUPTS(clobbers)			\
-	PARA_PATCH(PARAVIRT_IRQ_ENABLE, clobbers,	\
-	pushl %ecx; pushl %edx;				\
-	call *%cs:paravirt_ops+PARAVIRT_irq_enable;	\
-	popl %edx; popl %ecx)
-
-#define ENABLE_INTERRUPTS_SYSEXIT			\
-	PARA_PATCH(PARAVIRT_STI_SYSEXIT, CLBR_ANY,	\
-	jmp *%cs:paravirt_ops+PARAVIRT_irq_enable_sysexit)
+#define INTERRUPT_RETURN					\
+	PARA_SITE(PARA_PATCH(PARAVIRT_iret), CLBR_ANY,		\
+		  jmp *%cs:paravirt_ops+PARAVIRT_iret)
+
+#define DISABLE_INTERRUPTS(clobbers)					\
+	PARA_SITE(PARA_PATCH(PARAVIRT_irq_disable), clobbers,		\
+		  pushl %ecx; pushl %edx;				\
+		  call *%cs:paravirt_ops+PARAVIRT_irq_disable;		\
+		  popl %edx; popl %ecx)					\
+
+#define ENABLE_INTERRUPTS(clobbers)					\
+	PARA_SITE(PARA_PATCH(PARAVIRT_irq_enable), clobbers,		\
+		  pushl %ecx; pushl %edx;				\
+		  call *%cs:paravirt_ops+PARAVIRT_irq_enable;		\
+		  popl %edx; popl %ecx)
+
+#define ENABLE_INTERRUPTS_SYSEXIT					\
+	PARA_SITE(PARA_PATCH(PARAVIRT_irq_enable_sysexit), CLBR_ANY,	\
+		  jmp *%cs:paravirt_ops+PARAVIRT_irq_enable_sysexit)
 
 #define GET_CR0_INTO_EAX			\
 	call *paravirt_ops+PARAVIRT_read_cr0

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [patch 11/20] Fix patch site clobbers to include return register
  2007-04-04 19:11 [patch 00/20] paravirt_ops updates Jeremy Fitzhardinge
                   ` (9 preceding siblings ...)
  2007-04-04 19:12 ` [patch 10/20] Use patch site IDs computed from offset in paravirt_ops structure Jeremy Fitzhardinge
@ 2007-04-04 19:12 ` Jeremy Fitzhardinge
  2007-04-04 19:12 ` [patch 12/20] Consistently wrap paravirt ops callsites to make them patchable Jeremy Fitzhardinge
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-04 19:12 UTC (permalink / raw)
  To: Andi Kleen; +Cc: virtualization, Andrew Morton, lkml

[-- Attachment #1: paravirt-fix-clobbers.patch --]
[-- Type: text/plain, Size: 2800 bytes --]

Fix a few clobbers to include the return register.  The clobber set
is the set of all registers modified (or that may be modified) by the
code snippet, regardless of whether the clobbering is deliberate or
accidental.

Also, make sure that callsites used in contexts which don't allow
clobbers actually save and restore all clobberable registers.
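
save_fl is the clearest example: the call hands its result back in
%eax, so %eax is clobbered whether or not the backend meant to touch
it, and the recorded clobber set has to say so.  After this patch the
function reads (comments added here for illustration):

	static inline unsigned long __raw_local_save_flags(void)
	{
		unsigned long f;

		asm volatile(paravirt_alt("pushl %%ecx; pushl %%edx;"
					  PARAVIRT_CALL
					  "popl %%edx; popl %%ecx")
			     : "=a"(f)	/* the result lands in %eax ... */
			     : paravirt_type(save_fl),
			       paravirt_clobber(CLBR_EAX) /* ... so it can't be CLBR_NONE */
			     : "memory", "cc");
		return f;
	}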

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>

---
 arch/i386/kernel/entry.S    |    2 +-
 include/asm-i386/paravirt.h |   18 ++++++++++--------
 2 files changed, 11 insertions(+), 9 deletions(-)

===================================================================
--- a/arch/i386/kernel/entry.S
+++ b/arch/i386/kernel/entry.S
@@ -342,7 +342,7 @@ 1:	movl (%ebp),%ebp
 	jae syscall_badsys
 	call *sys_call_table(,%eax,4)
 	movl %eax,PT_EAX(%esp)
-	DISABLE_INTERRUPTS(CLBR_ECX|CLBR_EDX)
+	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_OFF
 	movl TI_flags(%ebp), %ecx
 	testw $_TIF_ALLWORK_MASK, %cx
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -532,7 +532,7 @@ static inline unsigned long __raw_local_
 				  "popl %%edx; popl %%ecx")
 		     : "=a"(f)
 		     : paravirt_type(save_fl),
-		       paravirt_clobber(CLBR_NONE)
+		       paravirt_clobber(CLBR_EAX)
 		     : "memory", "cc");
 	return f;
 }
@@ -617,27 +617,29 @@ 772:;						\
 	.popsection
 
 #define INTERRUPT_RETURN					\
-	PARA_SITE(PARA_PATCH(PARAVIRT_iret), CLBR_ANY,		\
+	PARA_SITE(PARA_PATCH(PARAVIRT_iret), CLBR_NONE,		\
 		  jmp *%cs:paravirt_ops+PARAVIRT_iret)
 
 #define DISABLE_INTERRUPTS(clobbers)					\
 	PARA_SITE(PARA_PATCH(PARAVIRT_irq_disable), clobbers,		\
-		  pushl %ecx; pushl %edx;				\
+		  pushl %eax; pushl %ecx; pushl %edx;			\
 		  call *%cs:paravirt_ops+PARAVIRT_irq_disable;		\
-		  popl %edx; popl %ecx)					\
+		  popl %edx; popl %ecx; popl %eax)			\
 
 #define ENABLE_INTERRUPTS(clobbers)					\
 	PARA_SITE(PARA_PATCH(PARAVIRT_irq_enable), clobbers,		\
-		  pushl %ecx; pushl %edx;				\
+		  pushl %eax; pushl %ecx; pushl %edx;			\
 		  call *%cs:paravirt_ops+PARAVIRT_irq_enable;		\
-		  popl %edx; popl %ecx)
+		  popl %edx; popl %ecx; popl %eax)
 
 #define ENABLE_INTERRUPTS_SYSEXIT					\
-	PARA_SITE(PARA_PATCH(PARAVIRT_irq_enable_sysexit), CLBR_ANY,	\
+	PARA_SITE(PARA_PATCH(PARAVIRT_irq_enable_sysexit), CLBR_NONE,	\
 		  jmp *%cs:paravirt_ops+PARAVIRT_irq_enable_sysexit)
 
 #define GET_CR0_INTO_EAX			\
-	call *paravirt_ops+PARAVIRT_read_cr0
+	push %ecx; push %edx;			\
+	call *paravirt_ops+PARAVIRT_read_cr0;	\
+	pop %edx; pop %ecx
 
 #endif /* __ASSEMBLY__ */
 #endif /* CONFIG_PARAVIRT */

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [patch 12/20] Consistently wrap paravirt ops callsites to make them patchable
  2007-04-04 19:11 [patch 00/20] paravirt_ops updates Jeremy Fitzhardinge
                   ` (10 preceding siblings ...)
  2007-04-04 19:12 ` [patch 11/20] Fix patch site clobbers to include return register Jeremy Fitzhardinge
@ 2007-04-04 19:12 ` Jeremy Fitzhardinge
  2007-04-04 19:12 ` [patch 13/20] Document asm-i386/paravirt.h Jeremy Fitzhardinge
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-04 19:12 UTC (permalink / raw)
  To: Andi Kleen; +Cc: virtualization, Anthony Liguori, Andrew Morton, lkml

[-- Attachment #1: paravirt-patchable-call-wrappers.patch --]
[-- Type: text/plain, Size: 27356 bytes --]

Wrap a set of interesting paravirt_ops calls in a wrapper which makes
the callsites available for patching.  Unfortunately this is pretty
ugly, because there's no way to get gcc to generate a function call
while also wrapping just the callsite itself with the necessary labels.

This patch supports functions with 0-4 arguments, which either return
void or return a value.  64-bit arguments must be split into a pair of
32-bit arguments (lower word first).  Small structures are returned in
registers.
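
The callsites then read like ordinary inline functions; two examples
lifted from the hunks below (comments added here), the second showing
the lower-word-first split for a 64-bit PAE value:

	static inline void write_cr3(unsigned long x)
	{
		PVOP_VCALL1(write_cr3, x);	/* void, one argument */
	}

	static inline unsigned long long pte_val(pte_t x)
	{
		/* 64-bit value passed as two 32-bit args, low word first */
		return PVOP_CALL2(unsigned long long, pte_val,
				  x.pte_low, x.pte_high);
	}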

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Anthony Liguori <anthony@codemonkey.ws>

---
 include/asm-i386/paravirt.h |  715 ++++++++++++++++++++++++++++++++++---------
 1 file changed, 569 insertions(+), 146 deletions(-)

===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -124,7 +124,7 @@ struct paravirt_ops
 
 	void (*flush_tlb_user)(void);
 	void (*flush_tlb_kernel)(void);
-	void (*flush_tlb_single)(u32 addr);
+	void (*flush_tlb_single)(unsigned long addr);
 
 	void (*map_pt_hook)(int type, pte_t *va, u32 pfn);
 
@@ -188,7 +188,7 @@ extern struct paravirt_ops paravirt_ops;
 #define paravirt_clobber(clobber)		\
 	[paravirt_clobber] "i" (clobber)
 
-#define PARAVIRT_CALL	"call *paravirt_ops+%c[paravirt_typenum]*4;"
+#define PARAVIRT_CALL	"call *(paravirt_ops+%c[paravirt_typenum]*4);"
 
 #define _paravirt_alt(insn_string, type, clobber)	\
 	"771:\n\t" insn_string "\n" "772:\n"		\
@@ -199,26 +199,234 @@ extern struct paravirt_ops paravirt_ops;
 	"  .short " clobber "\n"			\
 	".popsection\n"
 
-#define paravirt_alt(insn_string)				\
+#define paravirt_alt(insn_string)					\
 	_paravirt_alt(insn_string, "%c[paravirt_typenum]", "%c[paravirt_clobber]")
 
-#define paravirt_enabled() (paravirt_ops.paravirt_enabled)
+#define PVOP_CALL0(__rettype, __op)					\
+	({								\
+		__rettype __ret;					\
+		if (sizeof(__rettype) > sizeof(unsigned long)) {	\
+			unsigned long long __tmp;			\
+			unsigned long __ecx;				\
+			asm volatile(paravirt_alt(PARAVIRT_CALL)	\
+				     : "=A" (__tmp), "=c" (__ecx)	\
+				     : paravirt_type(__op),		\
+				       paravirt_clobber(CLBR_ANY)	\
+				     : "memory", "cc");			\
+			__ret = (__rettype)__tmp;			\
+		} else {						\
+			unsigned long __tmp, __edx, __ecx;		\
+			asm volatile(paravirt_alt(PARAVIRT_CALL)	\
+				     : "=a" (__tmp), "=d" (__edx),	\
+				       "=c" (__ecx)			\
+				     : paravirt_type(__op),		\
+				       paravirt_clobber(CLBR_ANY)	\
+				     : "memory", "cc");			\
+			__ret = (__rettype)__tmp;			\
+		}							\
+		__ret;							\
+	})
+#define PVOP_VCALL0(__op)						\
+	({								\
+		unsigned long __eax, __edx, __ecx;			\
+		asm volatile(paravirt_alt(PARAVIRT_CALL)		\
+			     : "=a" (__eax), "=d" (__edx), "=c" (__ecx) \
+			     : paravirt_type(__op),			\
+			       paravirt_clobber(CLBR_ANY)		\
+			     : "memory", "cc");				\
+	})
+
+#define PVOP_CALL1(__rettype, __op, arg1)				\
+	({								\
+		__rettype __ret;					\
+		if (sizeof(__rettype) > sizeof(unsigned long)) {	\
+			unsigned long long __tmp;			\
+			unsigned long __ecx;				\
+			asm volatile(paravirt_alt(PARAVIRT_CALL)	\
+				     : "=A" (__tmp), "=c" (__ecx)	\
+				     : "a" ((u32)(arg1)),		\
+				       paravirt_type(__op),		\
+				       paravirt_clobber(CLBR_ANY)	\
+				     : "memory", "cc");			\
+			__ret = (__rettype)__tmp;			\
+		} else {						\
+			unsigned long __tmp, __edx, __ecx;		\
+			asm volatile(paravirt_alt(PARAVIRT_CALL)	\
+				     : "=a" (__tmp), "=d" (__edx),	\
+				       "=c" (__ecx)			\
+				     : "0" ((u32)(arg1)),		\
+				       paravirt_type(__op),		\
+				       paravirt_clobber(CLBR_ANY)	\
+				     : "memory", "cc");			\
+			__ret = (__rettype)__tmp;			\
+		}							\
+		__ret;							\
+	})
+#define PVOP_VCALL1(__op, arg1)						\
+	({								\
+		unsigned long __eax, __edx, __ecx;			\
+		asm volatile(paravirt_alt(PARAVIRT_CALL)		\
+			     : "=a" (__eax), "=d" (__edx), "=c" (__ecx) \
+			     : "0" ((u32)(arg1)),			\
+			       paravirt_type(__op),			\
+			       paravirt_clobber(CLBR_ANY)		\
+			     : "memory", "cc");				\
+	})
+
+#define PVOP_CALL2(__rettype, __op, arg1, arg2)				\
+	({								\
+		__rettype __ret;					\
+		if (sizeof(__rettype) > sizeof(unsigned long)) {	\
+			unsigned long long __tmp;			\
+			unsigned long __ecx;				\
+			asm volatile(paravirt_alt(PARAVIRT_CALL)	\
+				     : "=A" (__tmp), "=c" (__ecx)	\
+				     : "a" ((u32)(arg1)),		\
+				       "d" ((u32)(arg2)),		\
+				       paravirt_type(__op),		\
+				       paravirt_clobber(CLBR_ANY)	\
+				     : "memory", "cc");			\
+			__ret = (__rettype)__tmp;			\
+		} else {						\
+			unsigned long __tmp, __edx, __ecx;		\
+			asm volatile(paravirt_alt(PARAVIRT_CALL)	\
+				     : "=a" (__tmp), "=d" (__edx),	\
+				       "=c" (__ecx)			\
+				     : "0" ((u32)(arg1)),		\
+				       "1" ((u32)(arg2)),		\
+				       paravirt_type(__op),		\
+				       paravirt_clobber(CLBR_ANY)	\
+				     : "memory", "cc");			\
+			__ret = (__rettype)__tmp;			\
+		}							\
+		__ret;							\
+	})
+#define PVOP_VCALL2(__op, arg1, arg2)					\
+	({								\
+		unsigned long __eax, __edx, __ecx;			\
+		asm volatile(paravirt_alt(PARAVIRT_CALL)		\
+			     : "=a" (__eax), "=d" (__edx), "=c" (__ecx) \
+			     : "0" ((u32)(arg1)),			\
+			       "1" ((u32)(arg2)),			\
+			       paravirt_type(__op),			\
+			       paravirt_clobber(CLBR_ANY)		\
+			     : "memory", "cc");				\
+	})
+
+#define PVOP_CALL3(__rettype, __op, arg1, arg2, arg3)			\
+	({								\
+		__rettype __ret;					\
+		if (sizeof(__rettype) > sizeof(unsigned long)) {	\
+			unsigned long long __tmp;			\
+			unsigned long __ecx;				\
+			asm volatile(paravirt_alt(PARAVIRT_CALL)	\
+				     : "=A" (__tmp), "=c" (__ecx)	\
+				     : "a" ((u32)(arg1)),		\
+				       "d" ((u32)(arg2)),		\
+				       "1" ((u32)(arg3)),		\
+				       paravirt_type(__op),		\
+				       paravirt_clobber(CLBR_ANY)	\
+				     : "memory", "cc");			\
+			__ret = (__rettype)__tmp;			\
+		} else {						\
+			unsigned long __tmp, __edx, __ecx;	\
+			asm volatile(paravirt_alt(PARAVIRT_CALL)	\
+				     : "=a" (__tmp), "=d" (__edx),	\
+				       "=c" (__ecx)			\
+				     : "0" ((u32)(arg1)),		\
+				       "1" ((u32)(arg2)),		\
+				       "2" ((u32)(arg3)),		\
+				       paravirt_type(__op),		\
+				       paravirt_clobber(CLBR_ANY)	\
+				     : "memory", "cc");			\
+			__ret = (__rettype)__tmp;			\
+		}							\
+		__ret;							\
+	})
+#define PVOP_VCALL3(__op, arg1, arg2, arg3)				\
+	({								\
+		unsigned long __eax, __edx, __ecx;			\
+		asm volatile(paravirt_alt(PARAVIRT_CALL)		\
+			     : "=a" (__eax), "=d" (__edx), "=c" (__ecx) \
+			     : "0" ((u32)(arg1)),			\
+			       "1" ((u32)(arg2)),			\
+			       "2" ((u32)(arg3)),			\
+			       paravirt_type(__op),			\
+			       paravirt_clobber(CLBR_ANY)		\
+			     : "memory", "cc");				\
+	})
+
+#define PVOP_CALL4(__rettype, __op, arg1, arg2, arg3, arg4)		\
+	({								\
+		__rettype __ret;					\
+		if (sizeof(__rettype) > sizeof(unsigned long)) {	\
+			unsigned long long __tmp;			\
+			unsigned long __ecx;				\
+			asm volatile("push %[_arg4]; "			\
+				     paravirt_alt(PARAVIRT_CALL)	\
+				     "lea 4(%%esp),%%esp"		\
+				     : "=A" (__tmp), "=c" (__ecx)	\
+				     : "a" ((u32)(arg1)),		\
+				       "d" ((u32)(arg2)),		\
+				       "1" ((u32)(arg3)),		\
+				       [_arg4] "mr" ((u32)(arg4)),	\
+				       paravirt_type(__op),		\
+				       paravirt_clobber(CLBR_ANY)	\
+				     : "memory", "cc");		\
+			__ret = (__rettype)__tmp;			\
+		} else {						\
+			unsigned long __tmp, __edx, __ecx;		\
+			asm volatile("push %[_arg4]; "			\
+				     paravirt_alt(PARAVIRT_CALL)	\
+				     "lea 4(%%esp),%%esp"		\
+				     : "=a" (__tmp), "=d" (__edx), "=c" (__ecx) \
+				     : "0" ((u32)(arg1)),		\
+				       "1" ((u32)(arg2)),		\
+				       "2" ((u32)(arg3)),		\
+				       [_arg4]"mr" ((u32)(arg4)),	\
+				       paravirt_type(__op),		\
+				       paravirt_clobber(CLBR_ANY)	\
+				     : "memory", "cc");			\
+			__ret = (__rettype)__tmp;			\
+		}							\
+		__ret;							\
+	})
+#define PVOP_VCALL4(__op, arg1, arg2, arg3, arg4)			\
+	({								\
+		unsigned long __eax, __edx, __ecx;			\
+		asm volatile("push %[_arg4]; "				\
+			     paravirt_alt(PARAVIRT_CALL)		\
+			     "lea 4(%%esp),%%esp"			\
+			     : "=a" (__eax), "=d" (__edx), "=c" (__ecx) \
+			     : "0" ((u32)(arg1)),			\
+			       "1" ((u32)(arg2)),			\
+			       "2" ((u32)(arg3)),			\
+			       [_arg4]"mr" ((u32)(arg4)),		\
+			       paravirt_type(__op),			\
+			       paravirt_clobber(CLBR_ANY)		\
+			     : "memory", "cc");				\
+	})
+
+static inline int paravirt_enabled(void)
+{
+	return paravirt_ops.paravirt_enabled;
+}
 
 static inline void load_esp0(struct tss_struct *tss,
 			     struct thread_struct *thread)
 {
-	paravirt_ops.load_esp0(tss, thread);
+	PVOP_VCALL2(load_esp0, tss, thread);
 }
 
 #define ARCH_SETUP			paravirt_ops.arch_setup();
 static inline unsigned long get_wallclock(void)
 {
-	return paravirt_ops.get_wallclock();
+	return PVOP_CALL0(unsigned long, get_wallclock);
 }
 
 static inline int set_wallclock(unsigned long nowtime)
 {
-	return paravirt_ops.set_wallclock(nowtime);
+	return PVOP_CALL1(int, set_wallclock, nowtime);
 }
 
 static inline void (*choose_time_init(void))(void)
@@ -230,127 +438,208 @@ static inline void __cpuid(unsigned int 
 static inline void __cpuid(unsigned int *eax, unsigned int *ebx,
 			   unsigned int *ecx, unsigned int *edx)
 {
-	paravirt_ops.cpuid(eax, ebx, ecx, edx);
+	PVOP_VCALL4(cpuid, eax, ebx, ecx, edx);
 }
 
 /*
  * These special macros can be used to get or set a debugging register
  */
-#define get_debugreg(var, reg) var = paravirt_ops.get_debugreg(reg)
-#define set_debugreg(val, reg) paravirt_ops.set_debugreg(reg, val)
-
-#define clts() paravirt_ops.clts()
-
-#define read_cr0() paravirt_ops.read_cr0()
-#define write_cr0(x) paravirt_ops.write_cr0(x)
-
-#define read_cr2() paravirt_ops.read_cr2()
-#define write_cr2(x) paravirt_ops.write_cr2(x)
-
-#define read_cr3() paravirt_ops.read_cr3()
-#define write_cr3(x) paravirt_ops.write_cr3(x)
-
-#define read_cr4() paravirt_ops.read_cr4()
-#define read_cr4_safe(x) paravirt_ops.read_cr4_safe()
-#define write_cr4(x) paravirt_ops.write_cr4(x)
-
-#define raw_ptep_get_and_clear(xp)	(paravirt_ops.ptep_get_and_clear(xp))
+static inline unsigned long paravirt_get_debugreg(int reg)
+{
+	return PVOP_CALL1(unsigned long, get_debugreg, reg);
+}
+#define get_debugreg(var, reg) var = paravirt_get_debugreg(reg)
+static inline void set_debugreg(unsigned long val, int reg)
+{
+	PVOP_VCALL2(set_debugreg, reg, val);
+}
+
+static inline void clts(void)
+{
+	PVOP_VCALL0(clts);
+}
+
+static inline unsigned long read_cr0(void)
+{
+	return PVOP_CALL0(unsigned long, read_cr0);
+}
+
+static inline void write_cr0(unsigned long x)
+{
+	PVOP_VCALL1(write_cr0, x);
+}
+
+static inline unsigned long read_cr2(void)
+{
+	return PVOP_CALL0(unsigned long, read_cr2);
+}
+
+static inline void write_cr2(unsigned long x)
+{
+	PVOP_VCALL1(write_cr2, x);
+}
+
+static inline unsigned long read_cr3(void)
+{
+	return PVOP_CALL0(unsigned long, read_cr3);
+}
+
+static inline void write_cr3(unsigned long x)
+{
+	PVOP_VCALL1(write_cr3, x);
+}
+
+static inline unsigned long read_cr4(void)
+{
+	return PVOP_CALL0(unsigned long, read_cr4);
+}
+static inline unsigned long read_cr4_safe(void)
+{
+	return PVOP_CALL0(unsigned long, read_cr4_safe);
+}
+
+static inline void write_cr4(unsigned long x)
+{
+	PVOP_VCALL1(write_cr4, x);
+}
 
 static inline void raw_safe_halt(void)
 {
-	paravirt_ops.safe_halt();
+	PVOP_VCALL0(safe_halt);
 }
 
 static inline void halt(void)
 {
-	paravirt_ops.safe_halt();
-}
-#define wbinvd() paravirt_ops.wbinvd()
+	PVOP_VCALL0(safe_halt);
+}
+
+static inline void wbinvd(void)
+{
+	PVOP_VCALL0(wbinvd);
+}
 
 #define get_kernel_rpl()  (paravirt_ops.kernel_rpl)
 
+static inline u64 paravirt_read_msr(unsigned msr, int *err)
+{
+	return PVOP_CALL2(u64, read_msr, msr, err);
+}
+static inline int paravirt_write_msr(unsigned msr, unsigned low, unsigned high)
+{
+	return PVOP_CALL3(int, write_msr, msr, low, high);
+}
+
 /* These should all do BUG_ON(_err), but our headers are too tangled. */
-#define rdmsr(msr,val1,val2) do {				\
-	int _err;						\
-	u64 _l = paravirt_ops.read_msr(msr,&_err);		\
-	val1 = (u32)_l;						\
-	val2 = _l >> 32;					\
+#define rdmsr(msr,val1,val2) do {		\
+	int _err;				\
+	u64 _l = paravirt_read_msr(msr, &_err);	\
+	val1 = (u32)_l;				\
+	val2 = _l >> 32;			\
 } while(0)
 
-#define wrmsr(msr,val1,val2) do {				\
-	u64 _l = ((u64)(val2) << 32) | (val1);			\
-	paravirt_ops.write_msr((msr), _l);			\
+#define wrmsr(msr,val1,val2) do {		\
+	paravirt_write_msr(msr, val1, val2);	\
 } while(0)
 
-#define rdmsrl(msr,val) do {					\
-	int _err;						\
-	val = paravirt_ops.read_msr((msr),&_err);		\
+#define rdmsrl(msr,val) do {			\
+	int _err;				\
+	val = paravirt_read_msr(msr, &_err);	\
 } while(0)
 
-#define wrmsrl(msr,val) (paravirt_ops.write_msr((msr),(val)))
-#define wrmsr_safe(msr,a,b) ({					\
-	u64 _l = ((u64)(b) << 32) | (a);			\
-	paravirt_ops.write_msr((msr),_l);			\
-})
+#define wrmsrl(msr,val)		((void)paravirt_write_msr(msr, val, 0))
+#define wrmsr_safe(msr,a,b)	paravirt_write_msr(msr, a, b)
 
 /* rdmsr with exception handling */
-#define rdmsr_safe(msr,a,b) ({					\
-	int _err;						\
-	u64 _l = paravirt_ops.read_msr(msr,&_err);		\
-	(*a) = (u32)_l;						\
-	(*b) = _l >> 32;					\
+#define rdmsr_safe(msr,a,b) ({			\
+	int _err;				\
+	u64 _l = paravirt_read_msr(msr, &_err);	\
+	(*a) = (u32)_l;				\
+	(*b) = _l >> 32;			\
 	_err; })
 
-#define rdtsc(low,high) do {					\
-	u64 _l = paravirt_ops.read_tsc();			\
-	low = (u32)_l;						\
-	high = _l >> 32;					\
+
+static inline u64 paravirt_read_tsc(void)
+{
+	return PVOP_CALL0(u64, read_tsc);
+}
+#define rdtsc(low,high) do {			\
+	u64 _l = paravirt_read_tsc();		\
+	low = (u32)_l;				\
+	high = _l >> 32;			\
 } while(0)
 
-#define rdtscl(low) do {					\
-	u64 _l = paravirt_ops.read_tsc();			\
-	low = (int)_l;						\
+#define rdtscl(low) do {			\
+	u64 _l = paravirt_read_tsc();		\
+	low = (int)_l;				\
 } while(0)
 
-#define rdtscll(val) (val = paravirt_ops.read_tsc())
+#define rdtscll(val) (val = paravirt_read_tsc())
 
 #define get_scheduled_cycles(val) (val = paravirt_ops.get_scheduled_cycles())
 #define calculate_cpu_khz() (paravirt_ops.get_cpu_khz())
 
 #define write_tsc(val1,val2) wrmsr(0x10, val1, val2)
 
-#define rdpmc(counter,low,high) do {				\
-	u64 _l = paravirt_ops.read_pmc();			\
-	low = (u32)_l;						\
-	high = _l >> 32;					\
+static inline unsigned long long paravirt_read_pmc(int counter)
+{
+	return PVOP_CALL1(u64, read_pmc, counter);
+}
+
+#define rdpmc(counter,low,high) do {		\
+	u64 _l = paravirt_read_pmc(counter);	\
+	low = (u32)_l;				\
+	high = _l >> 32;			\
 } while(0)
 
-#define load_TR_desc() (paravirt_ops.load_tr_desc())
-#define load_gdt(dtr) (paravirt_ops.load_gdt(dtr))
-#define load_idt(dtr) (paravirt_ops.load_idt(dtr))
-#define set_ldt(addr, entries) (paravirt_ops.set_ldt((addr), (entries)))
-#define store_gdt(dtr) (paravirt_ops.store_gdt(dtr))
-#define store_idt(dtr) (paravirt_ops.store_idt(dtr))
-#define store_tr(tr) ((tr) = paravirt_ops.store_tr())
-#define load_TLS(t,cpu) (paravirt_ops.load_tls((t),(cpu)))
-#define write_ldt_entry(dt, entry, low, high)				\
-	(paravirt_ops.write_ldt_entry((dt), (entry), (low), (high)))
-#define write_gdt_entry(dt, entry, low, high)				\
-	(paravirt_ops.write_gdt_entry((dt), (entry), (low), (high)))
-#define write_idt_entry(dt, entry, low, high)				\
-	(paravirt_ops.write_idt_entry((dt), (entry), (low), (high)))
-#define set_iopl_mask(mask) (paravirt_ops.set_iopl_mask(mask))
-
-#define __pte(x)	paravirt_ops.make_pte(x)
-#define __pgd(x)	paravirt_ops.make_pgd(x)
-
-#define pte_val(x)	paravirt_ops.pte_val(x)
-#define pgd_val(x)	paravirt_ops.pgd_val(x)
-
-#ifdef CONFIG_X86_PAE
-#define __pmd(x)	paravirt_ops.make_pmd(x)
-#define pmd_val(x)	paravirt_ops.pmd_val(x)
-#endif
+static inline void load_TR_desc(void)
+{
+	PVOP_VCALL0(load_tr_desc);
+}
+static inline void load_gdt(const struct Xgt_desc_struct *dtr)
+{
+	PVOP_VCALL1(load_gdt, dtr);
+}
+static inline void load_idt(const struct Xgt_desc_struct *dtr)
+{
+	PVOP_VCALL1(load_idt, dtr);
+}
+static inline void set_ldt(const void *addr, unsigned entries)
+{
+	PVOP_VCALL2(set_ldt, addr, entries);
+}
+static inline void store_gdt(struct Xgt_desc_struct *dtr)
+{
+	PVOP_VCALL1(store_gdt, dtr);
+}
+static inline void store_idt(struct Xgt_desc_struct *dtr)
+{
+	PVOP_VCALL1(store_idt, dtr);
+}
+static inline unsigned long paravirt_store_tr(void)
+{
+	return PVOP_CALL0(unsigned long, store_tr);
+}
+#define store_tr(tr)	((tr) = paravirt_store_tr())
+static inline void load_TLS(struct thread_struct *t, unsigned cpu)
+{
+	PVOP_VCALL2(load_tls, t, cpu);
+}
+static inline void write_ldt_entry(void *dt, int entry, u32 low, u32 high)
+{
+	PVOP_VCALL4(write_ldt_entry, dt, entry, low, high);
+}
+static inline void write_gdt_entry(void *dt, int entry, u32 low, u32 high)
+{
+	PVOP_VCALL4(write_gdt_entry, dt, entry, low, high);
+}
+static inline void write_idt_entry(void *dt, int entry, u32 low, u32 high)
+{
+	PVOP_VCALL4(write_idt_entry, dt, entry, low, high);
+}
+static inline void set_iopl_mask(unsigned mask)
+{
+	PVOP_VCALL1(set_iopl_mask, mask);
+}
 
 /* The paravirtualized I/O functions */
 static inline void slow_down_io(void) {
@@ -368,27 +657,27 @@ static inline void slow_down_io(void) {
  */
 static inline void apic_write(unsigned long reg, unsigned long v)
 {
-	paravirt_ops.apic_write(reg,v);
+	PVOP_VCALL2(apic_write, reg, v);
 }
 
 static inline void apic_write_atomic(unsigned long reg, unsigned long v)
 {
-	paravirt_ops.apic_write_atomic(reg,v);
+	PVOP_VCALL2(apic_write_atomic, reg, v);
 }
 
 static inline unsigned long apic_read(unsigned long reg)
 {
-	return paravirt_ops.apic_read(reg);
+	return PVOP_CALL1(unsigned long, apic_read, reg);
 }
 
 static inline void setup_boot_clock(void)
 {
-	paravirt_ops.setup_boot_clock();
+	PVOP_VCALL0(setup_boot_clock);
 }
 
 static inline void setup_secondary_clock(void)
 {
-	paravirt_ops.setup_secondary_clock();
+	PVOP_VCALL0(setup_secondary_clock);
 }
 #endif
 
@@ -408,93 +697,205 @@ static inline void startup_ipi_hook(int 
 static inline void startup_ipi_hook(int phys_apicid, unsigned long start_eip,
 				    unsigned long start_esp)
 {
-	return paravirt_ops.startup_ipi_hook(phys_apicid, start_eip, start_esp);
+	PVOP_VCALL3(startup_ipi_hook, phys_apicid, start_eip, start_esp);
 }
 #endif
 
 static inline void paravirt_activate_mm(struct mm_struct *prev,
 					struct mm_struct *next)
 {
-	paravirt_ops.activate_mm(prev, next);
+	PVOP_VCALL2(activate_mm, prev, next);
 }
 
 static inline void arch_dup_mmap(struct mm_struct *oldmm,
 				 struct mm_struct *mm)
 {
-	paravirt_ops.dup_mmap(oldmm, mm);
+	PVOP_VCALL2(dup_mmap, oldmm, mm);
 }
 
 static inline void arch_exit_mmap(struct mm_struct *mm)
 {
-	paravirt_ops.exit_mmap(mm);
-}
-
-#define __flush_tlb() paravirt_ops.flush_tlb_user()
-#define __flush_tlb_global() paravirt_ops.flush_tlb_kernel()
-#define __flush_tlb_single(addr) paravirt_ops.flush_tlb_single(addr)
-
-#define paravirt_map_pt_hook(type, va, pfn) paravirt_ops.map_pt_hook(type, va, pfn)
-
-#define paravirt_alloc_pt(pfn) paravirt_ops.alloc_pt(pfn)
-#define paravirt_release_pt(pfn) paravirt_ops.release_pt(pfn)
-
-#define paravirt_alloc_pd(pfn) paravirt_ops.alloc_pd(pfn)
-#define paravirt_alloc_pd_clone(pfn, clonepfn, start, count) \
-	paravirt_ops.alloc_pd_clone(pfn, clonepfn, start, count)
-#define paravirt_release_pd(pfn) paravirt_ops.release_pd(pfn)
+	PVOP_VCALL1(exit_mmap, mm);
+}
+
+static inline void __flush_tlb(void)
+{
+	PVOP_VCALL0(flush_tlb_user);
+}
+static inline void __flush_tlb_global(void)
+{
+	PVOP_VCALL0(flush_tlb_kernel);
+}
+static inline void __flush_tlb_single(unsigned long addr)
+{
+	PVOP_VCALL1(flush_tlb_single, addr);
+}
+
+static inline void paravirt_map_pt_hook(int type, pte_t *va, u32 pfn)
+{
+	PVOP_VCALL3(map_pt_hook, type, va, pfn);
+}
+
+static inline void paravirt_alloc_pt(unsigned pfn)
+{
+	PVOP_VCALL1(alloc_pt, pfn);
+}
+static inline void paravirt_release_pt(unsigned pfn)
+{
+	PVOP_VCALL1(release_pt, pfn);
+}
+
+static inline void paravirt_alloc_pd(unsigned pfn)
+{
+	PVOP_VCALL1(alloc_pd, pfn);
+}
+
+static inline void paravirt_alloc_pd_clone(unsigned pfn, unsigned clonepfn,
+					   unsigned start, unsigned count)
+{
+	PVOP_VCALL4(alloc_pd_clone, pfn, clonepfn, start, count);
+}
+static inline void paravirt_release_pd(unsigned pfn)
+{
+	PVOP_VCALL1(release_pd, pfn);
+}
+
+static inline void pte_update(struct mm_struct *mm, unsigned long addr,
+			      pte_t *ptep)
+{
+	PVOP_VCALL3(pte_update, mm, addr, ptep);
+}
+
+static inline void pte_update_defer(struct mm_struct *mm, unsigned long addr,
+				    pte_t *ptep)
+{
+	PVOP_VCALL3(pte_update_defer, mm, addr, ptep);
+}
+
+#ifdef CONFIG_X86_PAE
+static inline pte_t __pte(unsigned long long val)
+{
+	unsigned long long ret = PVOP_CALL2(unsigned long long, make_pte,
+					    val, val >> 32);
+	return (pte_t) { ret, ret >> 32 };
+}
+
+static inline pmd_t __pmd(unsigned long long val)
+{
+	return (pmd_t) { PVOP_CALL2(unsigned long long, make_pmd, val, val >> 32) };
+}
+
+static inline pgd_t __pgd(unsigned long long val)
+{
+	return (pgd_t) { PVOP_CALL2(unsigned long long, make_pgd, val, val >> 32) };
+}
+
+static inline unsigned long long pte_val(pte_t x)
+{
+	return PVOP_CALL2(unsigned long long, pte_val, x.pte_low, x.pte_high);
+}
+
+static inline unsigned long long pmd_val(pmd_t x)
+{
+	return PVOP_CALL2(unsigned long long, pmd_val, x.pmd, x.pmd >> 32);
+}
+
+static inline unsigned long long pgd_val(pgd_t x)
+{
+	return PVOP_CALL2(unsigned long long, pgd_val, x.pgd, x.pgd >> 32);
+}
 
 static inline void set_pte(pte_t *ptep, pte_t pteval)
 {
-	paravirt_ops.set_pte(ptep, pteval);
+	PVOP_VCALL3(set_pte, ptep, pteval.pte_low, pteval.pte_high);
 }
 
 static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 			      pte_t *ptep, pte_t pteval)
 {
+	/* 5 arg words */
 	paravirt_ops.set_pte_at(mm, addr, ptep, pteval);
 }
 
+static inline void set_pte_atomic(pte_t *ptep, pte_t pteval)
+{
+	PVOP_VCALL3(set_pte_atomic, ptep, pteval.pte_low, pteval.pte_high);
+}
+
+static inline void set_pte_present(struct mm_struct *mm, unsigned long addr,
+				   pte_t *ptep, pte_t pte)
+{
+	/* 5 arg words */
+	paravirt_ops.set_pte_present(mm, addr, ptep, pte);
+}
+
 static inline void set_pmd(pmd_t *pmdp, pmd_t pmdval)
 {
-	paravirt_ops.set_pmd(pmdp, pmdval);
-}
-
-static inline void pte_update(struct mm_struct *mm, u32 addr, pte_t *ptep)
-{
-	paravirt_ops.pte_update(mm, addr, ptep);
-}
-
-static inline void pte_update_defer(struct mm_struct *mm, u32 addr, pte_t *ptep)
-{
-	paravirt_ops.pte_update_defer(mm, addr, ptep);
-}
-
-#ifdef CONFIG_X86_PAE
-static inline void set_pte_atomic(pte_t *ptep, pte_t pteval)
-{
-	paravirt_ops.set_pte_atomic(ptep, pteval);
-}
-
-static inline void set_pte_present(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte)
-{
-	paravirt_ops.set_pte_present(mm, addr, ptep, pte);
+	PVOP_VCALL3(set_pmd, pmdp, pmdval.pmd, pmdval.pmd >> 32);
 }
 
 static inline void set_pud(pud_t *pudp, pud_t pudval)
 {
-	paravirt_ops.set_pud(pudp, pudval);
+	PVOP_VCALL3(set_pud, pudp, pudval.pgd.pgd, pudval.pgd.pgd >> 32);
 }
 
 static inline void pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
-	paravirt_ops.pte_clear(mm, addr, ptep);
+	PVOP_VCALL3(pte_clear, mm, addr, ptep);
 }
 
 static inline void pmd_clear(pmd_t *pmdp)
 {
-	paravirt_ops.pmd_clear(pmdp);
-}
-#endif
+	PVOP_VCALL1(pmd_clear, pmdp);
+}
+
+static inline pte_t raw_ptep_get_and_clear(pte_t *p)
+{
+	unsigned long long val = PVOP_CALL1(unsigned long long, ptep_get_and_clear, p);
+	return (pte_t) { val, val >> 32 };
+}
+#else  /* !CONFIG_X86_PAE */
+static inline pte_t __pte(unsigned long val)
+{
+	return (pte_t) { PVOP_CALL1(unsigned long, make_pte, val) };
+}
+
+static inline pgd_t __pgd(unsigned long val)
+{
+	return (pgd_t) { PVOP_CALL1(unsigned long, make_pgd, val) };
+}
+
+static inline unsigned long pte_val(pte_t x)
+{
+	return PVOP_CALL1(unsigned long, pte_val, x.pte_low);
+}
+
+static inline unsigned long pgd_val(pgd_t x)
+{
+	return PVOP_CALL1(unsigned long, pgd_val, x.pgd);
+}
+
+static inline void set_pte(pte_t *ptep, pte_t pteval)
+{
+	PVOP_VCALL2(set_pte, ptep, pteval.pte_low);
+}
+
+static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
+			      pte_t *ptep, pte_t pteval)
+{
+	PVOP_VCALL4(set_pte_at, mm, addr, ptep, pteval.pte_low);
+}
+
+static inline void set_pmd(pmd_t *pmdp, pmd_t pmdval)
+{
+	PVOP_VCALL2(set_pmd, pmdp, pmdval.pud.pgd.pgd);
+}
+
+static inline pte_t raw_ptep_get_and_clear(pte_t *p)
+{
+	return (pte_t) { PVOP_CALL1(unsigned long, ptep_get_and_clear, p) };
+}
+#endif	/* CONFIG_X86_PAE */
 
 /* Lazy mode for batching updates / context switch */
 #define PARAVIRT_LAZY_NONE 0
@@ -502,12 +903,24 @@ static inline void pmd_clear(pmd_t *pmdp
 #define PARAVIRT_LAZY_CPU  2
 
 #define  __HAVE_ARCH_ENTER_LAZY_CPU_MODE
-#define arch_enter_lazy_cpu_mode() paravirt_ops.set_lazy_mode(PARAVIRT_LAZY_CPU)
-#define arch_leave_lazy_cpu_mode() paravirt_ops.set_lazy_mode(PARAVIRT_LAZY_NONE)
+static inline void arch_enter_lazy_cpu_mode(void)
+{
+	PVOP_VCALL1(set_lazy_mode, PARAVIRT_LAZY_CPU);
+}
+static inline void arch_leave_lazy_cpu_mode(void)
+{
+	PVOP_VCALL1(set_lazy_mode, PARAVIRT_LAZY_NONE);
+}
 
 #define  __HAVE_ARCH_ENTER_LAZY_MMU_MODE
-#define arch_enter_lazy_mmu_mode() paravirt_ops.set_lazy_mode(PARAVIRT_LAZY_MMU)
-#define arch_leave_lazy_mmu_mode() paravirt_ops.set_lazy_mode(PARAVIRT_LAZY_NONE)
+static inline void arch_enter_lazy_mmu_mode(void)
+{
+	PVOP_VCALL1(set_lazy_mode, PARAVIRT_LAZY_MMU);
+}
+static inline void arch_leave_lazy_mmu_mode(void)
+{
+	PVOP_VCALL1(set_lazy_mode, PARAVIRT_LAZY_NONE);
+}
 
 void _paravirt_nop(void);
 #define paravirt_nop	((void *)_paravirt_nop)
@@ -600,6 +1013,16 @@ static inline unsigned long __raw_local_
 	paravirt_clobber(CLBR_EAX)
 
 #undef PARAVIRT_CALL
+#undef PVOP_VCALL0
+#undef PVOP_CALL0
+#undef PVOP_VCALL1
+#undef PVOP_CALL1
+#undef PVOP_VCALL2
+#undef PVOP_CALL2
+#undef PVOP_VCALL3
+#undef PVOP_CALL3
+#undef PVOP_VCALL4
+#undef PVOP_CALL4
 
 #else  /* __ASSEMBLY__ */
 

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [patch 13/20] Document asm-i386/paravirt.h
  2007-04-04 19:11 [patch 00/20] paravirt_ops updates Jeremy Fitzhardinge
                   ` (11 preceding siblings ...)
  2007-04-04 19:12 ` [patch 12/20] Consistently wrap paravirt ops callsites to make them patchable Jeremy Fitzhardinge
@ 2007-04-04 19:12 ` Jeremy Fitzhardinge
  2007-04-04 19:12 ` [patch 14/20] add common patching machinery Jeremy Fitzhardinge
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-04 19:12 UTC (permalink / raw)
  To: Andi Kleen; +Cc: virtualization, Andrew Morton, lkml

[-- Attachment #1: paravirt-document-paravirt_ops.patch --]
[-- Type: text/plain, Size: 10321 bytes --]

Clean things up, and broadly document:
 - the paravirt_ops functions themselves
 - the patching mechanism

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
 
---
 include/asm-i386/paravirt.h |  140 +++++++++++++++++++++++++++++++++++++------
 1 file changed, 123 insertions(+), 17 deletions(-)

===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -21,6 +21,14 @@ struct tss_struct;
 struct tss_struct;
 struct mm_struct;
 struct desc_struct;
+
+/* Lazy mode for batching updates / context switch */
+enum paravirt_lazy_mode {
+	PARAVIRT_LAZY_NONE = 0,
+	PARAVIRT_LAZY_MMU = 1,
+	PARAVIRT_LAZY_CPU = 2,
+};
+
 struct paravirt_ops
 {
 	unsigned int kernel_rpl;
@@ -37,22 +45,33 @@ struct paravirt_ops
 	 */
 	unsigned (*patch)(u8 type, u16 clobber, void *firstinsn, unsigned len);
 
+	/* Basic arch-specific setup */
 	void (*arch_setup)(void);
 	char *(*memory_setup)(void);
 	void (*init_IRQ)(void);
-
+	void (*time_init)(void);
+
+	/*
+	 * Called before/after init_mm pagetable setup. setup_start
+	 * may reset %cr3, and may pre-install parts of the pagetable;
+	 * pagetable setup is expected to preserve any existing
+	 * mapping.
+	 */
 	void (*pagetable_setup_start)(pgd_t *pgd_base);
 	void (*pagetable_setup_done)(pgd_t *pgd_base);
 
+	/* Print a banner to identify the environment */
 	void (*banner)(void);
 
+	/* Get and set time of day */
 	unsigned long (*get_wallclock)(void);
 	int (*set_wallclock)(unsigned long);
-	void (*time_init)(void);
-
+
+	/* cpuid emulation, mostly so that caps bits can be disabled */
 	void (*cpuid)(unsigned int *eax, unsigned int *ebx,
 		      unsigned int *ecx, unsigned int *edx);
 
+	/* hooks for various privileged instructions */
 	unsigned long (*get_debugreg)(int regno);
 	void (*set_debugreg)(int regno, unsigned long value);
 
@@ -71,15 +90,23 @@ struct paravirt_ops
 	unsigned long (*read_cr4)(void);
 	void (*write_cr4)(unsigned long);
 
+	/*
+	 * Get/set interrupt state.  save_fl and restore_fl are only
+	 * expected to use X86_EFLAGS_IF; all other bits
+	 * returned from save_fl are undefined, and may be ignored by
+	 * restore_fl.
+	 */
 	unsigned long (*save_fl)(void);
 	void (*restore_fl)(unsigned long);
 	void (*irq_disable)(void);
 	void (*irq_enable)(void);
 	void (*safe_halt)(void);
 	void (*halt)(void);
+
 	void (*wbinvd)(void);
 
-	/* err = 0/-EFAULT.  wrmsr returns 0/-EFAULT. */
+	/* MSR, PMC and TSC operations.
+	   err = 0/-EFAULT.  wrmsr returns 0/-EFAULT. */
 	u64 (*read_msr)(unsigned int msr, int *err);
 	int (*write_msr)(unsigned int msr, u64 val);
 
@@ -88,6 +115,7 @@ struct paravirt_ops
  	u64 (*get_scheduled_cycles)(void);
 	unsigned long (*get_cpu_khz)(void);
 
+	/* Segment descriptor handling */
 	void (*load_tr_desc)(void);
 	void (*load_gdt)(const struct Xgt_desc_struct *);
 	void (*load_idt)(const struct Xgt_desc_struct *);
@@ -105,9 +133,12 @@ struct paravirt_ops
 	void (*load_esp0)(struct tss_struct *tss, struct thread_struct *t);
 
 	void (*set_iopl_mask)(unsigned mask);
-
 	void (*io_delay)(void);
 
+	/*
+	 * Hooks for intercepting the creation/use/destruction of an
+	 * mm_struct.
+	 */
 	void (*activate_mm)(struct mm_struct *prev,
 			    struct mm_struct *next);
 	void (*dup_mmap)(struct mm_struct *oldmm,
@@ -115,30 +146,43 @@ struct paravirt_ops
 	void (*exit_mmap)(struct mm_struct *mm);
 
 #ifdef CONFIG_X86_LOCAL_APIC
+	/*
+	 * Direct APIC operations, principally for VMI.  Ideally
+	 * these shouldn't be in this interface.
+	 */
 	void (*apic_write)(unsigned long reg, unsigned long v);
 	void (*apic_write_atomic)(unsigned long reg, unsigned long v);
 	unsigned long (*apic_read)(unsigned long reg);
 	void (*setup_boot_clock)(void);
 	void (*setup_secondary_clock)(void);
+
+	void (*startup_ipi_hook)(int phys_apicid,
+				 unsigned long start_eip,
+				 unsigned long start_esp);
 #endif
 
+	/* TLB operations */
 	void (*flush_tlb_user)(void);
 	void (*flush_tlb_kernel)(void);
 	void (*flush_tlb_single)(unsigned long addr);
 
 	void (*map_pt_hook)(int type, pte_t *va, u32 pfn);
 
+	/* Hooks for allocating/releasing pagetable pages */
 	void (*alloc_pt)(u32 pfn);
 	void (*alloc_pd)(u32 pfn);
 	void (*alloc_pd_clone)(u32 pfn, u32 clonepfn, u32 start, u32 count);
 	void (*release_pt)(u32 pfn);
 	void (*release_pd)(u32 pfn);
 
+	/* Pagetable manipulation functions */
 	void (*set_pte)(pte_t *ptep, pte_t pteval);
-	void (*set_pte_at)(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pteval);
+	void (*set_pte_at)(struct mm_struct *mm, unsigned long addr,
+			   pte_t *ptep, pte_t pteval);
 	void (*set_pmd)(pmd_t *pmdp, pmd_t pmdval);
 	void (*pte_update)(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
-	void (*pte_update_defer)(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
+	void (*pte_update_defer)(struct mm_struct *mm,
+				 unsigned long addr, pte_t *ptep);
 
  	pte_t (*ptep_get_and_clear)(pte_t *ptep);
 
@@ -164,13 +208,12 @@ struct paravirt_ops
 	pgd_t (*make_pgd)(unsigned long pgd);
 #endif
 
-	void (*set_lazy_mode)(int mode);
+	/* Set deferred update mode, used for batching operations. */
+	void (*set_lazy_mode)(enum paravirt_lazy_mode mode);
 
 	/* These two are jmp to, not actually called. */
 	void (*irq_enable_sysexit)(void);
 	void (*iret)(void);
-
-	void (*startup_ipi_hook)(int phys_apicid, unsigned long start_eip, unsigned long start_esp);
 };
 
 /* Mark a paravirt probe function. */
@@ -188,8 +231,10 @@ extern struct paravirt_ops paravirt_ops;
 #define paravirt_clobber(clobber)		\
 	[paravirt_clobber] "i" (clobber)
 
-#define PARAVIRT_CALL	"call *(paravirt_ops+%c[paravirt_typenum]*4);"
-
+/*
+ * Generate some code, and mark it as patchable by the
+ * apply_paravirt() alternate instruction patcher.
+ */
 #define _paravirt_alt(insn_string, type, clobber)	\
 	"771:\n\t" insn_string "\n" "772:\n"		\
 	".pushsection .parainstructions,\"a\"\n"	\
@@ -199,9 +244,74 @@ extern struct paravirt_ops paravirt_ops;
 	"  .short " clobber "\n"			\
 	".popsection\n"
 
+/* Generate patchable code, with the default asm parameters. */
 #define paravirt_alt(insn_string)					\
 	_paravirt_alt(insn_string, "%c[paravirt_typenum]", "%c[paravirt_clobber]")
 
+/*
+ * This generates an indirect call based on the operation type number.
+ * The type number, computed in PARAVIRT_PATCH, is derived from the
+ * offset into the paravirt_ops structure, and can therefore be freely
+ * converted back into a structure offset.
+ */
+#define PARAVIRT_CALL	"call *(paravirt_ops+%c[paravirt_typenum]*4);"
+
+/*
+ * These macros are intended to wrap calls into a paravirt_ops
+ * operation, so that they can be later identified and patched at
+ * runtime.
+ *
+ * Normally, a call to a pv_op function is a simple indirect call:
+ * (paravirt_ops.operations)(args...).
+ *
+ * Unfortunately, this is a relatively slow operation for modern CPUs,
+ * because it cannot necessarily determine what the destination
+ * address is.  In this case, the address is a runtime constant, so at
+ * the very least we can patch the call to be a simple direct call, or
+ * ideally, patch an inline implementation into the callsite.  (Direct
+ * calls are essentially free, because the call and return addresses
+ * are completely predictable.)
+ *
+ * These macros rely on the standard gcc "regparm(3)" calling
+ * convention, in which the first three arguments are placed in %eax,
+ * %edx, %ecx (in that order), and the remaining arguments are placed
+ * on the stack.  All caller-save registers (eax,edx,ecx) are expected
+ * to be modified (either clobbered or used for return values).
+ *
+ * The call instruction itself is marked by placing its start address
+ * and size into the .parainstructions section, so that
+ * apply_paravirt() in arch/i386/kernel/alternative.c can do the
+ * appropriate patching under the control of the backend paravirt_ops
+ * implementation.
+ *
+ * Unfortunately there's no way to get gcc to generate the args setup
+ * for the call, and then allow the call itself to be generated by an
+ * inline asm.  Because of this, we must do the complete arg setup and
+ * return value handling from within these macros.  This is fairly
+ * cumbersome.
+ *
+ * There are 5 sets of PVOP_* macros for dealing with 0-4 arguments.
+ * It could be extended to more arguments, but there would be little
+ * to be gained from that.  For each number of arguments, there are
+ * the two VCALL and CALL variants for void and non-void functions.
+ *
+ * When there is a return value, the invoker of the macro must specify
+ * the return type.  The macro then uses sizeof() on that type to
+ * determine whether it's a 32 or 64 bit value, and places the return
+ * in the right register(s) (just %eax for 32-bit, and %edx:%eax for
+ * 64-bit).
+ *
+ * 64-bit arguments are passed as a pair of adjacent 32-bit arguments
+ * in low,high order.
+ *
+ * Small structures are passed and returned in registers.  The macro
+ * calling convention can't directly deal with this, so the wrapper
+ * functions must do this.
+ *
+ * These PVOP_* macros are only defined within this header.  This
+ * means that all uses must be wrapped in inline functions.  This also
+ * makes sure the incoming and outgoing types are always correct.
+ */
 #define PVOP_CALL0(__rettype, __op)					\
 	({								\
 		__rettype __ret;					\
@@ -897,11 +1007,6 @@ static inline pte_t raw_ptep_get_and_cle
 }
 #endif	/* CONFIG_X86_PAE */
 
-/* Lazy mode for batching updates / context switch */
-#define PARAVIRT_LAZY_NONE 0
-#define PARAVIRT_LAZY_MMU  1
-#define PARAVIRT_LAZY_CPU  2
-
 #define  __HAVE_ARCH_ENTER_LAZY_CPU_MODE
 static inline void arch_enter_lazy_cpu_mode(void)
 {
@@ -1012,6 +1117,7 @@ static inline unsigned long __raw_local_
 	[paravirt_sti_type] "i" (PARAVIRT_PATCH(irq_enable)),		\
 	paravirt_clobber(CLBR_EAX)
 
+/* Make sure as little as possible of this mess escapes. */
 #undef PARAVIRT_CALL
 #undef PVOP_VCALL0
 #undef PVOP_CALL0

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [patch 14/20] add common patching machinery
  2007-04-04 19:11 [patch 00/20] paravirt_ops updates Jeremy Fitzhardinge
                   ` (12 preceding siblings ...)
  2007-04-04 19:12 ` [patch 13/20] Document asm-i386/paravirt.h Jeremy Fitzhardinge
@ 2007-04-04 19:12 ` Jeremy Fitzhardinge
  2007-04-04 19:12 ` [patch 15/20] add flush_tlb_others paravirt_op Jeremy Fitzhardinge
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-04 19:12 UTC (permalink / raw)
  To: Andi Kleen
  Cc: virtualization, Ingo Molnar, Anthony Liguori, Andrew Morton, lkml

[-- Attachment #1: paravirt-patch-machinery.patch --]
[-- Type: text/plain, Size: 8425 bytes --]

Implement the actual patching machinery.  paravirt_patch_default()
contains the logic to automatically patch a callsite based on a few
simple rules:

 - if the paravirt_op function is paravirt_nop, then patch nops
 - if the paravirt_op function is a jmp target, then jmp to it
 - if the paravirt_op function is callable and doesn't clobber too much
    for the callsite, call it directly

paravirt_patch_default is suitable as a default implementation of
paravirt_ops.patch, and will remove most of the expensive indirect
calls in favour of either a direct call or a pile of nops.

Backends may implement their own patcher, however.  There are several
helper functions to help with this (a sketch of how a backend might use
them follows below):

paravirt_patch_nop	nop out a callsite
paravirt_patch_ignore	leave the callsite as-is
paravirt_patch_call	patch a call if the caller and callee
			have compatible clobbers
paravirt_patch_jmp	patch in a jmp
paravirt_patch_insns	patch some literal instructions over
			the callsite, if they fit

This patch also implements more direct patches for the native case, so
that when running on native hardware many common operations are
implemented inline.
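
For illustration only, here is a minimal sketch of how a backend might
combine these helpers in its own patch routine.  The myhv_* names and
the DEF_MYHV macro are invented for this example (the macro just
mirrors the DEF_NATIVE trick used in paravirt.c below); only the helper
functions themselves come from this patch:

	#include <asm/paravirt.h>

	/* Capture a literal instruction sequence between two labels. */
	#define DEF_MYHV(name, code)					\
		extern const char start_##name[], end_##name[];	\
		asm("start_" #name ": " code "; end_" #name ":")

	DEF_MYHV(myhv_cli, "cli");	/* pretend cli is fine for this backend */

	static unsigned myhv_patch(u8 type, u16 clobbers,
				   void *insns, unsigned len)
	{
		switch (type) {
		case PARAVIRT_PATCH(irq_disable):
			/* Inline a literal replacement if it fits. */
			return paravirt_patch_insns(insns, len,
						    start_myhv_cli,
						    end_myhv_cli);

		case PARAVIRT_PATCH(iret):
			/* Leave this callsite exactly as it is. */
			return paravirt_patch_ignore(len);

		default:
			/* Fall back to the nop/jmp/direct-call rules. */
			return paravirt_patch_default(type, clobbers,
						      insns, len);
		}
	}

The backend would then point its paravirt_ops.patch at myhv_patch.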

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Anthony Liguori <anthony@codemonkey.ws>
Acked-by: Ingo Molnar <mingo@elte.hu>
---
 arch/i386/kernel/alternative.c |    5 -
 arch/i386/kernel/paravirt.c    |  164 ++++++++++++++++++++++++++++++++--------
 include/asm-i386/paravirt.h    |   12 ++
 3 files changed, 149 insertions(+), 32 deletions(-)

===================================================================
--- a/arch/i386/kernel/alternative.c
+++ b/arch/i386/kernel/alternative.c
@@ -349,11 +349,14 @@ void apply_paravirt(struct paravirt_patc
 		used = paravirt_ops.patch(p->instrtype, p->clobbers, p->instr,
 					  p->len);
 
+		BUG_ON(used > p->len);
+
 		/* Pad the rest with nops */
 		nop_out(p->instr + used, p->len - used);
 	}
 
-	/* Sync to be conservative, in case we patched following instructions */
+	/* Sync to be conservative, in case we patched following
+	   instructions */
 	sync_core();
 }
 #endif	/* CONFIG_PARAVIRT */
===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -54,40 +54,142 @@ char *memory_setup(void)
 #define DEF_NATIVE(name, code)					\
 	extern const char start_##name[], end_##name[];		\
 	asm("start_" #name ": " code "; end_" #name ":")
-DEF_NATIVE(cli, "cli");
-DEF_NATIVE(sti, "sti");
-DEF_NATIVE(popf, "push %eax; popf");
-DEF_NATIVE(pushf, "pushf; pop %eax");
+
+DEF_NATIVE(irq_disable, "cli");
+DEF_NATIVE(irq_enable, "sti");
+DEF_NATIVE(restore_fl, "push %eax; popf");
+DEF_NATIVE(save_fl, "pushf; pop %eax");
 DEF_NATIVE(iret, "iret");
-DEF_NATIVE(sti_sysexit, "sti; sysexit");
-
-static const struct native_insns
-{
-	const char *start, *end;
-} native_insns[] = {
-	[PARAVIRT_PATCH(irq_disable)] = { start_cli, end_cli },
-	[PARAVIRT_PATCH(irq_enable)] = { start_sti, end_sti },
-	[PARAVIRT_PATCH(restore_fl)] = { start_popf, end_popf },
-	[PARAVIRT_PATCH(save_fl)] = { start_pushf, end_pushf },
-	[PARAVIRT_PATCH(iret)] = { start_iret, end_iret },
-	[PARAVIRT_PATCH(irq_enable_sysexit)] = { start_sti_sysexit, end_sti_sysexit },
-};
+DEF_NATIVE(irq_enable_sysexit, "sti; sysexit");
+DEF_NATIVE(read_cr2, "mov %cr2, %eax");
+DEF_NATIVE(write_cr3, "mov %eax, %cr3");
+DEF_NATIVE(read_cr3, "mov %cr3, %eax");
+DEF_NATIVE(clts, "clts");
+DEF_NATIVE(read_tsc, "rdtsc");
+
+DEF_NATIVE(ud2a, "ud2a");
 
 static unsigned native_patch(u8 type, u16 clobbers, void *insns, unsigned len)
 {
-	unsigned int insn_len;
-
-	/* Don't touch it if we don't have a replacement */
-	if (type >= ARRAY_SIZE(native_insns) || !native_insns[type].start)
-		return len;
-
-	insn_len = native_insns[type].end - native_insns[type].start;
-
-	/* Similarly if we can't fit replacement. */
-	if (len < insn_len)
-		return len;
-
-	memcpy(insns, native_insns[type].start, insn_len);
+	const unsigned char *start, *end;
+	unsigned ret;
+
+	switch(type) {
+#define SITE(x)	case PARAVIRT_PATCH(x):	start = start_##x; end = end_##x; goto patch_site
+		SITE(irq_disable);
+		SITE(irq_enable);
+		SITE(restore_fl);
+		SITE(save_fl);
+		SITE(iret);
+		SITE(irq_enable_sysexit);
+		SITE(read_cr2);
+		SITE(read_cr3);
+		SITE(write_cr3);
+		SITE(clts);
+		SITE(read_tsc);
+#undef SITE
+
+	patch_site:
+		ret = paravirt_patch_insns(insns, len, start, end);
+		break;
+
+	case PARAVIRT_PATCH(make_pgd):
+	case PARAVIRT_PATCH(make_pte):
+	case PARAVIRT_PATCH(pgd_val):
+	case PARAVIRT_PATCH(pte_val):
+#ifdef CONFIG_X86_PAE
+	case PARAVIRT_PATCH(make_pmd):
+	case PARAVIRT_PATCH(pmd_val):
+#endif
+		/* These functions end up returning exactly what
+		   they're passed, in the same registers. */
+		ret = paravirt_patch_nop();
+		break;
+
+	default:
+		ret = paravirt_patch_default(type, clobbers, insns, len);
+		break;
+	}
+
+	return ret;
+}
+
+unsigned paravirt_patch_nop(void)
+{
+	return 0;
+}
+
+unsigned paravirt_patch_ignore(unsigned len)
+{
+	return len;
+}
+
+unsigned paravirt_patch_call(void *target, u16 tgt_clobbers,
+			     void *site, u16 site_clobbers,
+			     unsigned len)
+{
+	unsigned char *call = site;
+	unsigned long delta = (unsigned long)target - (unsigned long)(call+5);
+
+	if (tgt_clobbers & ~site_clobbers)
+		return len;	/* target would clobber too much for this site */
+	if (len < 5)
+		return len;	/* call too long for patch site */
+
+	*call++ = 0xe8;		/* call */
+	*(unsigned long *)call = delta;
+
+	return 5;
+}
+
+unsigned paravirt_patch_jmp(void *target, void *site, unsigned len)
+{
+	unsigned char *jmp = site;
+	unsigned long delta = (unsigned long)target - (unsigned long)(jmp+5);
+
+	if (len < 5)
+		return len;	/* call too long for patch site */
+
+	*jmp++ = 0xe9;		/* jmp */
+	*(unsigned long *)jmp = delta;
+
+	return 5;
+}
+
+unsigned paravirt_patch_default(u8 type, u16 clobbers, void *site, unsigned len)
+{
+	void *opfunc = *((void **)&paravirt_ops + type);
+	unsigned ret;
+
+	if (opfunc == NULL)
+		/* If there's no function, patch it with a ud2a (BUG) */
+		ret = paravirt_patch_insns(site, len, start_ud2a, end_ud2a);
+	else if (opfunc == paravirt_nop)
+		/* If the operation is a nop, then nop the callsite */
+		ret = paravirt_patch_nop();
+	else if (type == PARAVIRT_PATCH(iret) ||
+		 type == PARAVIRT_PATCH(irq_enable_sysexit))
+		/* If operation requires a jmp, then jmp */
+		ret = paravirt_patch_jmp(opfunc, site, len);
+	else
+		/* Otherwise call the function; assume target could
+		   clobber any caller-save reg */
+		ret = paravirt_patch_call(opfunc, CLBR_ANY,
+					  site, clobbers, len);
+
+	return ret;
+}
+
+unsigned paravirt_patch_insns(void *site, unsigned len,
+			      const char *start, const char *end)
+{
+	unsigned insn_len = end - start;
+
+	if (insn_len > len || start == NULL)
+		insn_len = len;
+	else
+		memcpy(site, start, insn_len);
+
 	return insn_len;
 }
 
@@ -110,7 +212,7 @@ static void native_flush_tlb_global(void
 	__native_flush_tlb_global();
 }
 
-static void native_flush_tlb_single(u32 addr)
+static void native_flush_tlb_single(unsigned long addr)
 {
 	__native_flush_tlb_single(addr);
 }
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -247,6 +247,18 @@ extern struct paravirt_ops paravirt_ops;
 /* Generate patchable code, with the default asm parameters. */
 #define paravirt_alt(insn_string)					\
 	_paravirt_alt(insn_string, "%c[paravirt_typenum]", "%c[paravirt_clobber]")
+
+unsigned paravirt_patch_nop(void);
+unsigned paravirt_patch_ignore(unsigned len);
+unsigned paravirt_patch_call(void *target, u16 tgt_clobbers,
+			     void *site, u16 site_clobbers,
+			     unsigned len);
+unsigned paravirt_patch_jmp(void *target, void *site, unsigned len);
+unsigned paravirt_patch_default(u8 type, u16 clobbers, void *site, unsigned len);
+
+unsigned paravirt_patch_insns(void *site, unsigned len,
+			      const char *start, const char *end);
+
 
 /*
  * This generates an indirect call based on the operation type number.

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [patch 15/20] add flush_tlb_others paravirt_op
  2007-04-04 19:11 [patch 00/20] paravirt_ops updates Jeremy Fitzhardinge
                   ` (13 preceding siblings ...)
  2007-04-04 19:12 ` [patch 14/20] add common patching machinery Jeremy Fitzhardinge
@ 2007-04-04 19:12 ` Jeremy Fitzhardinge
  2007-04-04 19:12 ` [patch 16/20] revert map_pt_hook Jeremy Fitzhardinge
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-04 19:12 UTC (permalink / raw)
  To: Andi Kleen; +Cc: virtualization, Andrew Morton, lkml

[-- Attachment #1: paravirt-flush_tlb_others.patch --]
[-- Type: text/plain, Size: 5276 bytes --]

This patch adds a pv_op for flush_tlb_others.  Linux running on native
hardware uses cross-CPU IPIs to flush the TLB on any CPU which may
have a particular mm's pagetable entries cached in its TLB.  This is
inefficient in a paravirtualized environment, since the hypervisor
knows which real CPUs actually contain cached mappings, which may be a
small subset of a guest's VCPUs.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>

---
 arch/i386/kernel/paravirt.c |    1 +
 arch/i386/kernel/smp.c      |   15 ++++++++-------
 include/asm-i386/paravirt.h |    9 +++++++++
 include/asm-i386/tlbflush.h |   19 +++++++++++++++++--
 4 files changed, 35 insertions(+), 9 deletions(-)

===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -301,6 +301,7 @@ struct paravirt_ops paravirt_ops = {
 	.flush_tlb_user = native_flush_tlb,
 	.flush_tlb_kernel = native_flush_tlb_global,
 	.flush_tlb_single = native_flush_tlb_single,
+	.flush_tlb_others = native_flush_tlb_others,
 
 	.map_pt_hook = paravirt_nop,
 
===================================================================
--- a/arch/i386/kernel/smp.c
+++ b/arch/i386/kernel/smp.c
@@ -256,7 +256,6 @@ static struct mm_struct * flush_mm;
 static struct mm_struct * flush_mm;
 static unsigned long flush_va;
 static DEFINE_SPINLOCK(tlbstate_lock);
-#define FLUSH_ALL	0xffffffff
 
 /*
  * We cannot call mmdrop() because we are in interrupt context, 
@@ -338,7 +337,7 @@ fastcall void smp_invalidate_interrupt(s
 		 
 	if (flush_mm == per_cpu(cpu_tlbstate, cpu).active_mm) {
 		if (per_cpu(cpu_tlbstate, cpu).state == TLBSTATE_OK) {
-			if (flush_va == FLUSH_ALL)
+			if (flush_va == TLB_FLUSH_ALL)
 				local_flush_tlb();
 			else
 				__flush_tlb_one(flush_va);
@@ -353,9 +352,11 @@ out:
 	put_cpu_no_resched();
 }
 
-static void flush_tlb_others(cpumask_t cpumask, struct mm_struct *mm,
-						unsigned long va)
-{
+void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
+			     unsigned long va)
+{
+	cpumask_t cpumask = *cpumaskp;
+
 	/*
 	 * A couple of (to be removed) sanity checks:
 	 *
@@ -417,7 +418,7 @@ void flush_tlb_current_task(void)
 
 	local_flush_tlb();
 	if (!cpus_empty(cpu_mask))
-		flush_tlb_others(cpu_mask, mm, FLUSH_ALL);
+		flush_tlb_others(cpu_mask, mm, TLB_FLUSH_ALL);
 	preempt_enable();
 }
 
@@ -436,7 +437,7 @@ void flush_tlb_mm (struct mm_struct * mm
 			leave_mm(smp_processor_id());
 	}
 	if (!cpus_empty(cpu_mask))
-		flush_tlb_others(cpu_mask, mm, FLUSH_ALL);
+		flush_tlb_others(cpu_mask, mm, TLB_FLUSH_ALL);
 
 	preempt_enable();
 }
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -15,6 +15,7 @@
 
 #ifndef __ASSEMBLY__
 #include <linux/types.h>
+#include <linux/cpumask.h>
 
 struct thread_struct;
 struct Xgt_desc_struct;
@@ -125,6 +126,8 @@ struct paravirt_ops
 	void (*flush_tlb_user)(void);
 	void (*flush_tlb_kernel)(void);
 	void (*flush_tlb_single)(unsigned long addr);
+	void (*flush_tlb_others)(const cpumask_t *cpus, struct mm_struct *mm,
+				 unsigned long va);
 
 	void (*map_pt_hook)(int type, pte_t *va, u32 pfn);
 
@@ -731,6 +734,12 @@ static inline void __flush_tlb_single(un
 	PVOP_VCALL1(flush_tlb_single, addr);
 }
 
+static inline void flush_tlb_others(cpumask_t cpumask, struct mm_struct *mm,
+				    unsigned long va)
+{
+	PVOP_VCALL3(flush_tlb_others, &cpumask, mm, va);
+}
+
 static inline void paravirt_map_pt_hook(int type, pte_t *va, u32 pfn)
 {
 	PVOP_VCALL3(map_pt_hook, type, va, pfn);
===================================================================
--- a/include/asm-i386/tlbflush.h
+++ b/include/asm-i386/tlbflush.h
@@ -79,10 +79,14 @@
  *  - flush_tlb_range(vma, start, end) flushes a range of pages
  *  - flush_tlb_kernel_range(start, end) flushes a range of kernel pages
  *  - flush_tlb_pgtables(mm, start, end) flushes a range of page tables
+ *  - flush_tlb_others(cpumask, mm, va) flushes TLBs on other cpus
  *
  * ..but the i386 has somewhat limited tlb flushing capabilities,
  * and page-granular flushes are available only on i486 and up.
  */
+
+#define TLB_FLUSH_ALL	0xffffffff
+
 
 #ifndef CONFIG_SMP
 
@@ -110,7 +114,12 @@ static inline void flush_tlb_range(struc
 		__flush_tlb();
 }
 
-#else
+static inline void native_flush_tlb_others(const cpumask_t *cpumask,
+					   struct mm_struct *mm, unsigned long va)
+{
+}
+
+#else  /* SMP */
 
 #include <asm/smp.h>
 
@@ -129,6 +138,9 @@ static inline void flush_tlb_range(struc
 	flush_tlb_mm(vma->vm_mm);
 }
 
+void native_flush_tlb_others(const cpumask_t *cpumask, struct mm_struct *mm,
+			     unsigned long va);
+
 #define TLBSTATE_OK	1
 #define TLBSTATE_LAZY	2
 
@@ -139,8 +151,11 @@ struct tlb_state
 	char __cacheline_padding[L1_CACHE_BYTES-8];
 };
 DECLARE_PER_CPU(struct tlb_state, cpu_tlbstate);
+#endif	/* SMP */
 
-
+#ifndef CONFIG_PARAVIRT
+#define flush_tlb_others(mask, mm, va)		\
+	native_flush_tlb_others(&mask, mm, va)
 #endif
 
 #define flush_tlb_kernel_range(start, end) flush_tlb_all()

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [patch 16/20] revert map_pt_hook.
  2007-04-04 19:11 [patch 00/20] paravirt_ops updates Jeremy Fitzhardinge
                   ` (14 preceding siblings ...)
  2007-04-04 19:12 ` [patch 15/20] add flush_tlb_others paravirt_op Jeremy Fitzhardinge
@ 2007-04-04 19:12 ` Jeremy Fitzhardinge
  2007-04-04 19:12 ` [patch 17/20] add kmap_atomic_pte for mapping highpte pages Jeremy Fitzhardinge
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-04 19:12 UTC (permalink / raw)
  To: Andi Kleen; +Cc: virtualization, Andrew Morton, lkml

[-- Attachment #1: revert-map_pt_hook.patch --]
[-- Type: text/plain, Size: 3661 bytes --]

Back out the map_pt_hook to clear the way for kmap_atomic_pte.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Zachary Amsden <zach@vmware.com>

---
 arch/i386/kernel/paravirt.c |    2 --
 arch/i386/kernel/vmi.c      |    2 ++
 include/asm-i386/paravirt.h |    7 -------
 include/asm-i386/pgtable.h  |   23 ++++-------------------
 4 files changed, 6 insertions(+), 28 deletions(-)

===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -303,8 +303,6 @@ struct paravirt_ops paravirt_ops = {
 	.flush_tlb_single = native_flush_tlb_single,
 	.flush_tlb_others = native_flush_tlb_others,
 
-	.map_pt_hook = paravirt_nop,
-
 	.alloc_pt = paravirt_nop,
 	.alloc_pd = paravirt_nop,
 	.alloc_pd_clone = paravirt_nop,
===================================================================
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -819,8 +819,10 @@ static inline int __init activate_vmi(vo
 		paravirt_ops.release_pt = vmi_release_pt;
 		paravirt_ops.release_pd = vmi_release_pd;
 	}
+#if 0
 	para_wrap(map_pt_hook, vmi_map_pt_hook, set_linear_mapping,
 		  SetLinearMapping);
+#endif
 
 	/*
 	 * These MUST always be patched.  Don't support indirect jumps
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -128,8 +128,6 @@ struct paravirt_ops
 	void (*flush_tlb_single)(unsigned long addr);
 	void (*flush_tlb_others)(const cpumask_t *cpus, struct mm_struct *mm,
 				 unsigned long va);
-
-	void (*map_pt_hook)(int type, pte_t *va, u32 pfn);
 
 	void (*alloc_pt)(u32 pfn);
 	void (*alloc_pd)(u32 pfn);
@@ -740,11 +738,6 @@ static inline void flush_tlb_others(cpum
 	PVOP_VCALL3(flush_tlb_others, &cpumask, mm, va);
 }
 
-static inline void paravirt_map_pt_hook(int type, pte_t *va, u32 pfn)
-{
-	PVOP_VCALL3(map_pt_hook, type, va, pfn);
-}
-
 static inline void paravirt_alloc_pt(unsigned pfn)
 {
 	PVOP_VCALL1(alloc_pt, pfn);
===================================================================
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -272,7 +272,6 @@ static inline void vmalloc_sync_all(void
  */
 #define pte_update(mm, addr, ptep)		do { } while (0)
 #define pte_update_defer(mm, addr, ptep)	do { } while (0)
-#define paravirt_map_pt_hook(slot, va, pfn)	do { } while (0)
 
 #define raw_ptep_get_and_clear(xp)     native_ptep_get_and_clear(xp)
 #endif
@@ -481,24 +480,10 @@ extern pte_t *lookup_address(unsigned lo
 #endif
 
 #if defined(CONFIG_HIGHPTE)
-#define pte_offset_map(dir, address)				\
-({								\
-	pte_t *__ptep;						\
-	unsigned pfn = pmd_val(*(dir)) >> PAGE_SHIFT;	   	\
-	__ptep = (pte_t *)kmap_atomic(pfn_to_page(pfn),KM_PTE0);\
-	paravirt_map_pt_hook(KM_PTE0,__ptep, pfn);		\
-	__ptep = __ptep + pte_index(address);			\
-	__ptep;							\
-})
-#define pte_offset_map_nested(dir, address)			\
-({								\
-	pte_t *__ptep;						\
-	unsigned pfn = pmd_val(*(dir)) >> PAGE_SHIFT;	   	\
-	__ptep = (pte_t *)kmap_atomic(pfn_to_page(pfn),KM_PTE1);\
-	paravirt_map_pt_hook(KM_PTE1,__ptep, pfn);		\
-	__ptep = __ptep + pte_index(address);			\
-	__ptep;							\
-})
+#define pte_offset_map(dir, address) \
+	((pte_t *)kmap_atomic(pmd_page(*(dir)),KM_PTE0) + pte_index(address))
+#define pte_offset_map_nested(dir, address) \
+	((pte_t *)kmap_atomic(pmd_page(*(dir)),KM_PTE1) + pte_index(address))
 #define pte_unmap(pte) kunmap_atomic(pte, KM_PTE0)
 #define pte_unmap_nested(pte) kunmap_atomic(pte, KM_PTE1)
 #else

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [patch 17/20] add kmap_atomic_pte for mapping highpte pages
  2007-04-04 19:11 [patch 00/20] paravirt_ops updates Jeremy Fitzhardinge
                   ` (15 preceding siblings ...)
  2007-04-04 19:12 ` [patch 16/20] revert map_pt_hook Jeremy Fitzhardinge
@ 2007-04-04 19:12 ` Jeremy Fitzhardinge
  2007-04-04 19:12 ` [patch 18/20] clean up tsc-based sched_clock Jeremy Fitzhardinge
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-04 19:12 UTC (permalink / raw)
  To: Andi Kleen; +Cc: virtualization, Andrew Morton, lkml

[-- Attachment #1: paravirt-kmap_atomic_pte.patch --]
[-- Type: text/plain, Size: 7249 bytes --]

Xen and VMI both have special requirements when mapping a highmem pte
page into the kernel address space.  These can be dealt with by adding
a new kmap_atomic_pte() function for mapping highptes, and hooking it
into the paravirt_ops infrastructure.

Xen specifically wants to map the pte page RO, so this patch exposes a
helper function, kmap_atomic_prot, which maps the page with the
specified page protections.

This also adds a kmap_flush_unused() function to clear out the cached
kmap mappings.  Xen needs this to clear out any potential stray RW
mappings of pages which will become part of a pagetable.
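
Purely as an illustration of what a backend hook might look like (the
myhv_ function is hypothetical; kmap_atomic_pte, kmap_atomic_prot and
the paravirt_ops hook are what this patch adds), a backend that wants
pte pages mapped read-only could do roughly:

	#include <linux/highmem.h>

	/* Hypothetical hook: map highmem pte pages RO instead of with
	   the default kmap_prot. */
	static void *myhv_kmap_atomic_pte(struct page *page, enum km_type type)
	{
		return kmap_atomic_prot(page, type, PAGE_KERNEL_RO);
	}

with the backend's setup code assigning this to
paravirt_ops.kmap_atomic_pte, in the same way the native default is
installed in the paravirt.c hunk below.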

[ Zach - vmi.c will need some attention after this patch.  It wasn't
  immediately obvious to me what needs to be done. ]

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Zachary Amsden <zach@vmware.com>

---
 arch/i386/kernel/paravirt.c |    7 +++++++
 arch/i386/mm/highmem.c      |    9 +++++++--
 include/asm-i386/highmem.h  |   11 +++++++++++
 include/asm-i386/paravirt.h |   13 ++++++++++++-
 include/asm-i386/pgtable.h  |    4 ++--
 include/linux/highmem.h     |    6 ++++++
 mm/highmem.c                |    9 +++++++++
 7 files changed, 54 insertions(+), 5 deletions(-)

===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -20,6 +20,7 @@
 #include <linux/efi.h>
 #include <linux/bcd.h>
 #include <linux/start_kernel.h>
+#include <linux/highmem.h>
 
 #include <asm/bug.h>
 #include <asm/paravirt.h>
@@ -318,6 +319,12 @@ struct paravirt_ops paravirt_ops = {
 
 	.ptep_get_and_clear = native_ptep_get_and_clear,
 
+#ifdef CONFIG_HIGHPTE
+	.kmap_atomic_pte = native_kmap_atomic_pte,
+#else
+	.kmap_atomic_pte = paravirt_nop,
+#endif
+
 #ifdef CONFIG_X86_PAE
 	.set_pte_atomic = native_set_pte_atomic,
 	.set_pte_present = native_set_pte_present,
===================================================================
--- a/arch/i386/mm/highmem.c
+++ b/arch/i386/mm/highmem.c
@@ -26,7 +26,7 @@ void kunmap(struct page *page)
  * However when holding an atomic kmap is is not legal to sleep, so atomic
  * kmaps are appropriate for short, tight code paths only.
  */
-void *kmap_atomic(struct page *page, enum km_type type)
+void *kmap_atomic_prot(struct page *page, enum km_type type, pgprot_t prot)
 {
 	enum fixed_addresses idx;
 	unsigned long vaddr;
@@ -41,9 +41,14 @@ void *kmap_atomic(struct page *page, enu
 		return page_address(page);
 
 	vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-	set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
+	set_pte(kmap_pte-idx, mk_pte(page, prot));
 
 	return (void*) vaddr;
+}
+
+void *kmap_atomic(struct page *page, enum km_type type)
+{
+	return kmap_atomic_prot(page, type, kmap_prot);
 }
 
 void kunmap_atomic(void *kvaddr, enum km_type type)
===================================================================
--- a/include/asm-i386/highmem.h
+++ b/include/asm-i386/highmem.h
@@ -24,6 +24,7 @@
 #include <linux/threads.h>
 #include <asm/kmap_types.h>
 #include <asm/tlbflush.h>
+#include <asm/paravirt.h>
 
 /* declarations for highmem.c */
 extern unsigned long highstart_pfn, highend_pfn;
@@ -67,10 +68,20 @@ extern void FASTCALL(kunmap_high(struct 
 
 void *kmap(struct page *page);
 void kunmap(struct page *page);
+void *kmap_atomic_prot(struct page *page, enum km_type type, pgprot_t prot);
 void *kmap_atomic(struct page *page, enum km_type type);
 void kunmap_atomic(void *kvaddr, enum km_type type);
 void *kmap_atomic_pfn(unsigned long pfn, enum km_type type);
 struct page *kmap_atomic_to_page(void *ptr);
+
+static inline void *native_kmap_atomic_pte(struct page *page, enum km_type type)
+{
+	return kmap_atomic(page, type);
+}
+
+#ifndef CONFIG_PARAVIRT
+#define kmap_atomic_pte(page, type)	kmap_atomic(page, type)
+#endif
 
 #define flush_cache_kmaps()	do { } while (0)
 
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -16,7 +16,9 @@
 #ifndef __ASSEMBLY__
 #include <linux/types.h>
 #include <linux/cpumask.h>
-
+#include <asm/kmap_types.h>
+
+struct page;
 struct thread_struct;
 struct Xgt_desc_struct;
 struct tss_struct;
@@ -143,6 +145,8 @@ struct paravirt_ops
 	void (*pte_update_defer)(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
 
  	pte_t (*ptep_get_and_clear)(pte_t *ptep);
+
+	void *(*kmap_atomic_pte)(struct page *page, enum km_type type);
 
 #ifdef CONFIG_X86_PAE
 	void (*set_pte_atomic)(pte_t *ptep, pte_t pteval);
@@ -768,6 +772,13 @@ static inline void paravirt_release_pd(u
 	PVOP_VCALL1(release_pd, pfn);
 }
 
+static inline void *kmap_atomic_pte(struct page *page, enum km_type type)
+{
+	unsigned long ret;
+	ret = PVOP_CALL2(unsigned long, kmap_atomic_pte, page, type);
+	return (void *)ret;
+}
+
 static inline void pte_update(struct mm_struct *mm, unsigned long addr,
 			      pte_t *ptep)
 {
===================================================================
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -481,9 +481,9 @@ extern pte_t *lookup_address(unsigned lo
 
 #if defined(CONFIG_HIGHPTE)
 #define pte_offset_map(dir, address) \
-	((pte_t *)kmap_atomic(pmd_page(*(dir)),KM_PTE0) + pte_index(address))
+	((pte_t *)kmap_atomic_pte(pmd_page(*(dir)),KM_PTE0) + pte_index(address))
 #define pte_offset_map_nested(dir, address) \
-	((pte_t *)kmap_atomic(pmd_page(*(dir)),KM_PTE1) + pte_index(address))
+	((pte_t *)kmap_atomic_pte(pmd_page(*(dir)),KM_PTE1) + pte_index(address))
 #define pte_unmap(pte) kunmap_atomic(pte, KM_PTE0)
 #define pte_unmap_nested(pte) kunmap_atomic(pte, KM_PTE1)
 #else
===================================================================
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -27,6 +27,8 @@ unsigned int nr_free_highpages(void);
 unsigned int nr_free_highpages(void);
 extern unsigned long totalhigh_pages;
 
+void kmap_flush_unused(void);
+
 #else /* CONFIG_HIGHMEM */
 
 static inline unsigned int nr_free_highpages(void) { return 0; }
@@ -44,9 +46,13 @@ static inline void *kmap(struct page *pa
 
 #define kmap_atomic(page, idx) \
 	({ pagefault_disable(); page_address(page); })
+#define kmap_atomic_prot(page, idx, prot)	kmap_atomic(page, idx)
+
 #define kunmap_atomic(addr, idx)	do { pagefault_enable(); } while (0)
 #define kmap_atomic_pfn(pfn, idx)	kmap_atomic(pfn_to_page(pfn), (idx))
 #define kmap_atomic_to_page(ptr)	virt_to_page(ptr)
+
+#define kmap_flush_unused()	do {} while(0)
 #endif
 
 #endif /* CONFIG_HIGHMEM */
===================================================================
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -97,6 +97,15 @@ static void flush_all_zero_pkmaps(void)
 		set_page_address(page, NULL);
 	}
 	flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP));
+}
+
+/* Flush all unused kmap mappings in order to remove stray
+   mappings. */
+void kmap_flush_unused(void)
+{
+	spin_lock(&kmap_lock);
+	flush_all_zero_pkmaps();
+	spin_unlock(&kmap_lock);
 }
 
 static inline unsigned long map_new_virtual(struct page *page)

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [patch 18/20] clean up tsc-based sched_clock
  2007-04-04 19:11 [patch 00/20] paravirt_ops updates Jeremy Fitzhardinge
                   ` (16 preceding siblings ...)
  2007-04-04 19:12 ` [patch 17/20] add kmap_atomic_pte for mapping highpte pages Jeremy Fitzhardinge
@ 2007-04-04 19:12 ` Jeremy Fitzhardinge
  2007-04-06 23:22   ` Andrew Morton
  2007-04-04 19:12 ` [patch 19/20] Add a sched_clock paravirt_op Jeremy Fitzhardinge
  2007-04-04 19:12 ` [patch 20/20] Add apply_to_page_range() which applies a function to a pte range Jeremy Fitzhardinge
  19 siblings, 1 reply; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-04 19:12 UTC (permalink / raw)
  To: Andi Kleen; +Cc: virtualization, Andrew Morton, lkml

[-- Attachment #1: cleanup-tsc-sched-clock.patch --]
[-- Type: text/plain, Size: 3298 bytes --]

Three cleanups:
 - change "instable" -> "unstable"
 - it's better to use get_cpu_var for getting this CPU's variables
 - change cycles_2_ns to do the full computation rather than just the
   tsc->ns scaling.  It's a simpler interface, and it makes the function
   more generally useful; a quick worked example follows.
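
With made-up numbers, for a hypothetical 1GHz part:

	cpu_khz        = 1000000
	cyc2ns_scale   = (1000000 << CYC2NS_SCALE_FACTOR) / cpu_khz = 1024
	cyc - last_tsc = 2000000 cycles since the last resync
	ns             = ((2000000 * 1024) >> CYC2NS_SCALE_FACTOR) + ns_base
	               = 2000000ns + ns_base

i.e. at 1GHz one cycle scales to exactly one nanosecond, added on top
of the ns_base recorded at the last frequency resync.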

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>

---
 arch/i386/kernel/sched-clock.c |   35 +++++++++++++++++++++--------------
 1 file changed, 21 insertions(+), 14 deletions(-)

===================================================================
--- a/arch/i386/kernel/sched-clock.c
+++ b/arch/i386/kernel/sched-clock.c
@@ -39,17 +39,23 @@
 
 struct sc_data {
 	unsigned int cyc2ns_scale;
-	unsigned char instable;
+	unsigned char unstable;
 	unsigned long long last_tsc;
 	unsigned long long ns_base;
 };
 
 static DEFINE_PER_CPU(struct sc_data, sc_data);
 
-static inline unsigned long long cycles_2_ns(int cpu, unsigned long long cyc)
+static inline unsigned long long cycles_2_ns(unsigned long long cyc)
 {
-	struct sc_data *sc = &per_cpu(sc_data, cpu);
-	return (cyc * sc->cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
+	const struct sc_data *sc = &__get_cpu_var(sc_data);
+	unsigned long long ns;
+
+	cyc -= sc->last_tsc;
+	ns = (cyc * sc->cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
+	ns += sc->ns_base;
+
+	return ns;
 }
 
 /*
@@ -62,18 +68,19 @@ static inline unsigned long long cycles_
  */
 unsigned long long sched_clock(void)
 {
-	int cpu = get_cpu();
-	struct sc_data *sc = &per_cpu(sc_data, cpu);
 	unsigned long long r;
+	const struct sc_data *sc = &get_cpu_var(sc_data);
 
-	if (sc->instable) {
+	if (sc->unstable) {
 		/* TBD find a cheaper fallback timer than this */
 		r = ktime_to_ns(ktime_get());
 	} else {
 		get_scheduled_cycles(r);
-		r = ((u64)sc->ns_base) + cycles_2_ns(cpu, r - sc->last_tsc);
+		r = cycles_2_ns(r);
 	}
-	put_cpu();
+
+	put_cpu_var(sc_data);
+
 	return r;
 }
 
@@ -81,7 +88,7 @@ static void resync_sc_freq(struct sc_dat
 static void resync_sc_freq(struct sc_data *sc, unsigned int newfreq)
 {
 	if (!cpu_has_tsc) {
-		sc->instable = 1;
+		sc->unstable = 1;
 		return;
 	}
 	/* RED-PEN protect with seqlock? I hope that's not needed
@@ -90,7 +97,7 @@ static void resync_sc_freq(struct sc_dat
 	sc->ns_base = ktime_to_ns(ktime_get());
 	get_scheduled_cycles(sc->last_tsc);
 	sc->cyc2ns_scale = (1000000 << CYC2NS_SCALE_FACTOR) / newfreq;
-	sc->instable = 0;
+	sc->unstable = 0;
 }
 
 static void call_r_s_f(void *arg)
@@ -119,9 +126,9 @@ static int sc_freq_event(struct notifier
 	switch (event) {
 	case CPUFREQ_RESUMECHANGE:  /* needed? */
 	case CPUFREQ_PRECHANGE:
-		/* Mark TSC as instable until cpu frequency change is done
+		/* Mark TSC as unstable until cpu frequency change is done
 		   because we don't know when exactly it will change */
-		sc->instable = 1;
+		sc->unstable = 1;
 		break;
 	case CPUFREQ_SUSPENDCHANGE:
 	case CPUFREQ_POSTCHANGE:
@@ -163,7 +170,7 @@ static __init int init_sched_clock(void)
 	int i;
 	struct cpufreq_freqs f = { .cpu = get_cpu(), .new = 0 };
 	for_each_possible_cpu (i)
-		per_cpu(sc_data, i).instable = 1;
+		per_cpu(sc_data, i).unstable = 1;
 	WARN_ON(num_online_cpus() > 1);
 	call_r_s_f(&f);
 	put_cpu();

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [patch 19/20] Add a sched_clock paravirt_op
  2007-04-04 19:11 [patch 00/20] paravirt_ops updates Jeremy Fitzhardinge
                   ` (17 preceding siblings ...)
  2007-04-04 19:12 ` [patch 18/20] clean up tsc-based sched_clock Jeremy Fitzhardinge
@ 2007-04-04 19:12 ` Jeremy Fitzhardinge
  2007-04-04 19:12 ` [patch 20/20] Add apply_to_page_range() which applies a function to a pte range Jeremy Fitzhardinge
  19 siblings, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-04 19:12 UTC (permalink / raw)
  To: Andi Kleen; +Cc: virtualization, john stultz, Andrew Morton, lkml

[-- Attachment #1: paravirt-sched-clock.patch --]
[-- Type: text/plain, Size: 8128 bytes --]

The tsc-based get_scheduled_cycles interface is not a good match for
Xen's runstate accounting, which reports everything in nanoseconds.

This patch replaces this interface with a sched_clock interface, which
matches both Xen and VMI's requirements.

In order to do this, we:
   1. replace get_scheduled_cycles with sched_clock
   2. hoist cycles_2_ns into a common header
   3. update vmi accordingly

One thing to note: because sched_clock is implemented as a weak
function in kernel/sched.c, we must define a real function in order to
override this weak binding.  This means the usual paravirt_ops
technique of using an inline function won't work in this case.
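
As a purely illustrative sketch (the myhv_* names are invented; only
the sched_clock op itself comes from this patch), a backend whose
hypervisor already accounts run time in nanoseconds would hook in much
like the vmi changes below:

	/* Hypothetical backend: the hypervisor reports per-VCPU run
	   time directly in nanoseconds. */
	static unsigned long long myhv_sched_clock(void)
	{
		return myhv_read_runstate_ns();	/* hypothetical query */
	}

	static void __init myhv_time_init(void)
	{
		paravirt_ops.sched_clock = myhv_sched_clock;
	}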


Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Dan Hecht <dhecht@vmware.com>
Cc: john stultz <johnstul@us.ibm.com>

---
 arch/i386/kernel/paravirt.c    |    2 -
 arch/i386/kernel/sched-clock.c |   39 ++++++++++++++--------------------
 arch/i386/kernel/vmi.c         |    2 -
 arch/i386/kernel/vmitime.c     |    6 ++---
 include/asm-i386/paravirt.h    |    7 ++++--
 include/asm-i386/timer.h       |   45 +++++++++++++++++++++++++++++++++++++++-
 include/asm-i386/vmi_time.h    |    2 -
 7 files changed, 71 insertions(+), 32 deletions(-)

===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -269,7 +269,7 @@ struct paravirt_ops paravirt_ops = {
 	.write_msr = native_write_msr_safe,
 	.read_tsc = native_read_tsc,
 	.read_pmc = native_read_pmc,
-	.get_scheduled_cycles = native_read_tsc,
+	.sched_clock = native_sched_clock,
 	.get_cpu_khz = native_calculate_cpu_khz,
 	.load_tr_desc = native_load_tr_desc,
 	.set_ldt = native_set_ldt,
===================================================================
--- a/arch/i386/kernel/sched-clock.c
+++ b/arch/i386/kernel/sched-clock.c
@@ -37,26 +37,7 @@
 
 #define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
 
-struct sc_data {
-	unsigned int cyc2ns_scale;
-	unsigned char unstable;
-	unsigned long long last_tsc;
-	unsigned long long ns_base;
-};
-
-static DEFINE_PER_CPU(struct sc_data, sc_data);
-
-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
-{
-	const struct sc_data *sc = &__get_cpu_var(sc_data);
-	unsigned long long ns;
-
-	cyc -= sc->last_tsc;
-	ns = (cyc * sc->cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
-	ns += sc->ns_base;
-
-	return ns;
-}
+DEFINE_PER_CPU(struct sc_data, sc_data);
 
 /*
  * Scheduler clock - returns current time in nanosec units.
@@ -66,7 +47,7 @@ static inline unsigned long long cycles_
  * [1] no attempt to stop CPU instruction reordering, which can hit
  * in a 100 instruction window or so.
  */
-unsigned long long sched_clock(void)
+unsigned long long native_sched_clock(void)
 {
 	unsigned long long r;
 	const struct sc_data *sc = &get_cpu_var(sc_data);
@@ -75,7 +56,7 @@ unsigned long long sched_clock(void)
 		/* TBD find a cheaper fallback timer than this */
 		r = ktime_to_ns(ktime_get());
 	} else {
-		get_scheduled_cycles(r);
+		rdtscll(r);
 		r = cycles_2_ns(r);
 	}
 
@@ -83,6 +64,18 @@ unsigned long long sched_clock(void)
 
 	return r;
 }
+
+/* We need to define a real function for sched_clock, to override the
+   weak default version */
+#ifdef CONFIG_PARAVIRT
+unsigned long long sched_clock(void)
+{
+	return paravirt_sched_clock();
+}
+#else
+unsigned long long sched_clock(void)
+	__attribute__((alias("native_sched_clock")));
+#endif
 
 /* Resync with new CPU frequency */
 static void resync_sc_freq(struct sc_data *sc, unsigned int newfreq)
@@ -95,7 +88,7 @@ static void resync_sc_freq(struct sc_dat
 	   because sched_clock callers should be able to tolerate small
 	   errors. */
 	sc->ns_base = ktime_to_ns(ktime_get());
-	get_scheduled_cycles(sc->last_tsc);
+	rdtscll(sc->last_tsc);
 	sc->cyc2ns_scale = (1000000 << CYC2NS_SCALE_FACTOR) / newfreq;
 	sc->unstable = 0;
 }
===================================================================
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -866,7 +866,7 @@ static inline int __init activate_vmi(vo
 		paravirt_ops.setup_boot_clock = vmi_timer_setup_boot_alarm;
 		paravirt_ops.setup_secondary_clock = vmi_timer_setup_secondary_alarm;
 #endif
-		paravirt_ops.get_scheduled_cycles = vmi_get_sched_cycles;
+		paravirt_ops.sched_clock = vmi_sched_clock;
  		paravirt_ops.get_cpu_khz = vmi_cpu_khz;
 
 		/* We have true wallclock functions; disable CMOS clock sync */
===================================================================
--- a/arch/i386/kernel/vmitime.c
+++ b/arch/i386/kernel/vmitime.c
@@ -163,9 +163,9 @@ int vmi_set_wallclock(unsigned long now)
 	return -1;
 }
 
-unsigned long long vmi_get_sched_cycles(void)
-{
-	return read_available_cycles();
+unsigned long long vmi_sched_clock(void)
+{
+	return cycles_2_ns(read_available_cycles());
 }
 
 unsigned long vmi_cpu_khz(void)
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -88,7 +88,7 @@ struct paravirt_ops
 
 	u64 (*read_tsc)(void);
 	u64 (*read_pmc)(void);
- 	u64 (*get_scheduled_cycles)(void);
+ 	unsigned long long (*sched_clock)(void);
 	unsigned long (*get_cpu_khz)(void);
 
 	void (*load_tr_desc)(void);
@@ -529,7 +529,10 @@ static inline void halt(void)
 
 #define rdtscll(val) (val = PVOP_CALL0(u64, read_tsc))
 
-#define get_scheduled_cycles(val) (val = paravirt_ops.get_scheduled_cycles())
+static inline unsigned long long paravirt_sched_clock(void)
+{
+	return PVOP_CALL0(unsigned long long, sched_clock);
+}
 #define calculate_cpu_khz() (paravirt_ops.get_cpu_khz())
 
 #define write_tsc(val1,val2) wrmsr(0x10, val1, val2)
===================================================================
--- a/include/asm-i386/timer.h
+++ b/include/asm-i386/timer.h
@@ -15,8 +15,51 @@ extern int recalibrate_cpu_khz(void);
 extern int recalibrate_cpu_khz(void);
 
 #ifndef CONFIG_PARAVIRT
-#define get_scheduled_cycles(val) rdtscll(val)
 #define calculate_cpu_khz() native_calculate_cpu_khz()
 #endif
 
+/* Accelerators for sched_clock()
+ * convert from cycles(64bits) => nanoseconds (64bits)
+ *  basic equation:
+ *		ns = cycles / (freq / ns_per_sec)
+ *		ns = cycles * (ns_per_sec / freq)
+ *		ns = cycles * (10^9 / (cpu_khz * 10^3))
+ *		ns = cycles * (10^6 / cpu_khz)
+ *
+ *	Then we use scaling math (suggested by george@mvista.com) to get:
+ *		ns = cycles * (10^6 * SC / cpu_khz) / SC
+ *		ns = cycles * cyc2ns_scale / SC
+ *
+ *	And since SC is a constant power of two, we can convert the div
+ *  into a shift.
+ *
+ *  We can use a khz divisor instead of mhz to keep better precision, since
+ *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
+ *  (mathieu.desnoyers@polymtl.ca)
+ *
+ *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
+ */
+struct sc_data {
+	unsigned int cyc2ns_scale;
+	unsigned char unstable;
+	unsigned long long last_tsc;
+	unsigned long long ns_base;
+};
+
+DECLARE_PER_CPU(struct sc_data, sc_data);
+
+#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
+
+static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+{
+	const struct sc_data *sc = &__get_cpu_var(sc_data);
+	unsigned long long ns;
+
+	cyc -= sc->last_tsc;
+	ns = (cyc * sc->cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
+	ns += sc->ns_base;
+
+	return ns;
+}
+
 #endif
===================================================================
--- a/include/asm-i386/vmi_time.h
+++ b/include/asm-i386/vmi_time.h
@@ -49,7 +49,7 @@ extern void __init vmi_time_init(void);
 extern void __init vmi_time_init(void);
 extern unsigned long vmi_get_wallclock(void);
 extern int vmi_set_wallclock(unsigned long now);
-extern unsigned long long vmi_get_sched_cycles(void);
+extern unsigned long long vmi_sched_clock(void);
 extern unsigned long vmi_cpu_khz(void);
 
 #ifdef CONFIG_X86_LOCAL_APIC
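
[ As a quick sanity check of the cycles_2_ns() scaling comment above:
  assuming a hypothetical 2GHz CPU, this userspace snippet (illustration
  only, not part of the patch) reproduces the arithmetic:

	#include <stdio.h>

	int main(void)
	{
		unsigned long long cpu_khz = 2000000;		/* 2GHz */
		unsigned int scale = (1000000 << 10) / cpu_khz;	/* = 512 */
		unsigned long long cyc = 3000;			/* cycles since last_tsc */
		unsigned long long ns = (cyc * scale) >> 10;	/* = 1500 */

		/* 3000 cycles at 2GHz is 1.5us, i.e. 0.5ns per cycle */
		printf("scale=%u ns=%llu\n", scale, ns);
		return 0;
	}
]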

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [patch 20/20] Add apply_to_page_range() which applies a function to a pte range.
  2007-04-04 19:11 [patch 00/20] paravirt_ops updates Jeremy Fitzhardinge
                   ` (18 preceding siblings ...)
  2007-04-04 19:12 ` [patch 19/20] Add a sched_clock paravirt_op Jeremy Fitzhardinge
@ 2007-04-04 19:12 ` Jeremy Fitzhardinge
  2007-04-05  4:41   ` Matt Mackall
  19 siblings, 1 reply; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-04 19:12 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Chris Wright, virtualization, Matt Mackall, Ingo Molnar,
	Ian Pratt, Andrew Morton, Christoph Lameter, lkml

[-- Attachment #1: apply-to-page-range.patch --]
[-- Type: text/plain, Size: 4396 bytes --]

Add a new mm function apply_to_page_range() which applies a given
function to every pte in a given virtual address range in a given mm
structure. This is a generic alternative to cut-and-pasting the Linux
idiomatic pagetable walking code in every place that a sequence of
PTEs must be accessed.

Although this interface is intended to be useful in a wide range of
situations, it is currently used specifically by several Xen
subsystems, for example: to ensure that pagetables have been allocated
for a virtual address range, and to construct batched special
pagetable update requests to map I/O memory (in ioremap()).
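
As a purely illustrative example (not part of this patch), a caller
counting the ptes currently present in a kernel virtual range might
look roughly like this:

	#include <linux/mm.h>

	/* pte_fn_t callback: called once for each pte in the range */
	static int count_present(pte_t *pte, struct page *pmd_page,
				 unsigned long addr, void *data)
	{
		unsigned long *count = data;

		if (pte_present(*pte))
			(*count)++;
		return 0;	/* a non-zero return aborts the walk */
	}

	/* 'vaddr' and 'len' are assumed to describe a kernel address range */
	unsigned long count = 0;
	int err = apply_to_page_range(&init_mm, vaddr, len,
				      count_present, &count);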

Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Matt Mackall <mpm@waste.org>
Acked-by: Ingo Molnar <mingo@elte.hu> 

---
 include/linux/mm.h |    5 ++
 mm/memory.c        |   94 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 99 insertions(+)

===================================================================
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1135,6 +1135,11 @@ struct page *follow_page(struct vm_area_
 #define FOLL_GET	0x04	/* do get_page on page */
 #define FOLL_ANON	0x08	/* give ZERO_PAGE if no pgtable */
 
+typedef int (*pte_fn_t)(pte_t *pte, struct page *pmd_page, unsigned long addr,
+			void *data);
+extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
+			       unsigned long size, pte_fn_t fn, void *data);
+
 #ifdef CONFIG_PROC_FS
 void vm_stat_account(struct mm_struct *, unsigned long, struct file *, long);
 #else
===================================================================
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1448,6 +1448,100 @@ int remap_pfn_range(struct vm_area_struc
 }
 EXPORT_SYMBOL(remap_pfn_range);
 
+static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
+				     unsigned long addr, unsigned long end,
+				     pte_fn_t fn, void *data)
+{
+	pte_t *pte;
+	int err;
+	struct page *pmd_page;
+	spinlock_t *ptl;
+
+	pte = (mm == &init_mm) ?
+		pte_alloc_kernel(pmd, addr) :
+		pte_alloc_map_lock(mm, pmd, addr, &ptl);
+	if (!pte)
+		return -ENOMEM;
+
+	BUG_ON(pmd_huge(*pmd));
+
+	pmd_page = pmd_page(*pmd);
+
+	do {
+		err = fn(pte, pmd_page, addr, data);
+		if (err)
+			break;
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+
+	if (mm != &init_mm)
+		pte_unmap_unlock(pte-1, ptl);
+	return err;
+}
+
+static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
+				     unsigned long addr, unsigned long end,
+				     pte_fn_t fn, void *data)
+{
+	pmd_t *pmd;
+	unsigned long next;
+	int err;
+
+	pmd = pmd_alloc(mm, pud, addr);
+	if (!pmd)
+		return -ENOMEM;
+	do {
+		next = pmd_addr_end(addr, end);
+		err = apply_to_pte_range(mm, pmd, addr, next, fn, data);
+		if (err)
+			break;
+	} while (pmd++, addr = next, addr != end);
+	return err;
+}
+
+static int apply_to_pud_range(struct mm_struct *mm, pgd_t *pgd,
+				     unsigned long addr, unsigned long end,
+				     pte_fn_t fn, void *data)
+{
+	pud_t *pud;
+	unsigned long next;
+	int err;
+
+	pud = pud_alloc(mm, pgd, addr);
+	if (!pud)
+		return -ENOMEM;
+	do {
+		next = pud_addr_end(addr, end);
+		err = apply_to_pmd_range(mm, pud, addr, next, fn, data);
+		if (err)
+			break;
+	} while (pud++, addr = next, addr != end);
+	return err;
+}
+
+/*
+ * Scan a region of virtual memory, filling in page tables as necessary
+ * and calling a provided function on each leaf page table.
+ */
+int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
+			unsigned long size, pte_fn_t fn, void *data)
+{
+	pgd_t *pgd;
+	unsigned long next;
+	unsigned long end = addr + size;
+	int err;
+
+	BUG_ON(addr >= end);
+	pgd = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		err = apply_to_pud_range(mm, pgd, addr, next, fn, data);
+		if (err)
+			break;
+	} while (pgd++, addr = next, addr != end);
+	return err;
+}
+EXPORT_SYMBOL_GPL(apply_to_page_range);
+
 /*
  * handle_pte_fault chooses page fault handler according to an entry
  * which was read non-atomically.  Before making any commitment, on

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch 07/20] Allow paravirt backend to choose kernel PMD sharing
  2007-04-04 19:11 ` [patch 07/20] Allow paravirt backend to choose kernel PMD sharing Jeremy Fitzhardinge
@ 2007-04-05  0:30   ` Christoph Lameter
  2007-04-05  0:43     ` Jeremy Fitzhardinge
  2007-04-05  1:29     ` Chris Wright
  2007-04-06 23:41   ` Andrew Morton
  1 sibling, 2 replies; 45+ messages in thread
From: Christoph Lameter @ 2007-04-05  0:30 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, virtualization, lkml,
	William Lee Irwin III, Zachary Amsden, Ingo Molnar

Acked-by: Christoph Lameter <clameter@sgi.com>

for all that's worth since I am not an i386 specialist.

How much of the issues with page struct sharing between slab and arch code 
does this address?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch 07/20] Allow paravirt backend to choose kernel PMD sharing
  2007-04-05  0:30   ` Christoph Lameter
@ 2007-04-05  0:43     ` Jeremy Fitzhardinge
  2007-04-05  1:29     ` Chris Wright
  1 sibling, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-05  0:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, Andrew Morton, virtualization, lkml,
	William Lee Irwin III, Zachary Amsden, Ingo Molnar

Christoph Lameter wrote:
> Acked-by: Christoph Lameter <clameter@sgi.com>
>
> for all thats worth since I am not a i386 specialist.
>
> How much of the issues with page struct sharing between slab and arch code 
> does this address?
>   

I haven't been following that thread as closely as I should be, so I
don't have an answer.  I guess the interesting thing in this patch is
that it only uses the pmd cache for usermode pmds (which are
pre-zeroed), and normal page allocations for kernel pmds.  Also, if the
kernel pmds are unshared, the pgds are page-sized, so it's not really
making good use of the pgd cache.

    J

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch 07/20] Allow paravirt backend to choose kernel PMD sharing
  2007-04-05  0:30   ` Christoph Lameter
  2007-04-05  0:43     ` Jeremy Fitzhardinge
@ 2007-04-05  1:29     ` Chris Wright
  1 sibling, 0 replies; 45+ messages in thread
From: Chris Wright @ 2007-04-05  1:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Jeremy Fitzhardinge, virtualization, Andrew Morton, Ingo Molnar,
	William Lee Irwin III, lkml

* Christoph Lameter (clameter@sgi.com) wrote:
> Acked-by: Christoph Lameter <clameter@sgi.com>
> 
> for all thats worth since I am not a i386 specialist.
> 
> How much of the issues with page struct sharing between slab and arch code 
> does this address?

I think the answer is 'none yet.'  It uses page sized slab and still
needs pgd_list, for example.  But the mm_list chaining should work too,
so it shouldn't make things any worse.

thanks,
-chris

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch 20/20] Add apply_to_page_range() which applies a function to a pte range.
  2007-04-04 19:12 ` [patch 20/20] Add apply_to_page_range() which applies a function to a pte range Jeremy Fitzhardinge
@ 2007-04-05  4:41   ` Matt Mackall
  2007-04-05  6:52     ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 45+ messages in thread
From: Matt Mackall @ 2007-04-05  4:41 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Chris Wright, virtualization, Matt Mackall, Ingo Molnar,
	Ian Pratt, lkml, Andrew Morton, Christoph Lameter

On Wed, Apr 04, 2007 at 12:12:11PM -0700, Jeremy Fitzhardinge wrote:
> Add a new mm function apply_to_page_range() which applies a given
> function to every pte in a given virtual address range in a given mm
> structure. This is a generic alternative to cut-and-pasting the Linux
> idiomatic pagetable walking code in every place that a sequence of
> PTEs must be accessed.

As we discussed before, this obviously has a lot in common with my
walk_page_range code.

The major difference, and one important detail your above description
seems to be missing, is why it's doing this:

> +	pte_alloc_kernel(pmd, addr) :
> +	pmd = pmd_alloc(mm, pud, addr);
> +	pud = pud_alloc(mm, pgd, addr);

..which is mentioned here:

> +/*
> + * Scan a region of virtual memory, filling in page tables as necessary
> + * and calling a provided function on each leaf page table.
> + */

But I'm not sure what the use case is that wants filling in the page
table..? If both modes really make sense, perhaps a flag could unify
these differences.

> +typedef int (*pte_fn_t)(pte_t *pte, struct page *pmd_page, unsigned long addr,
> +			void *data);

I'd gotten the impression that these sorts of typedefs were out of
fashion.

> +static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> +				     unsigned long addr, unsigned long end,
> +				     pte_fn_t fn, void *data)
> +{
> +	pte_t *pte;
> +	int err;
> +	struct page *pmd_page;
> +	spinlock_t *ptl;
> +
> +	pte = (mm == &init_mm) ?
> +		pte_alloc_kernel(pmd, addr) :
> +		pte_alloc_map_lock(mm, pmd, addr, &ptl);
> +	if (!pte)
> +		return -ENOMEM;

Seems a bit awkward to pass mm all the way down the tree just for this
quirk. It also means that whether or not a lock is held in the callback
is context dependent.

smaps, clear_ref, and my pagemap code all use the callback at the
pmd_range level, which a) localizes the pte-level locking concerns
with the user b) amortizes the indirection overhead and c)
(unfortunately) makes the user a bit more complex.

We should try to measure whether (b) actually makes a difference.

> +	do {
> +		err = fn(pte, pmd_page, addr, data);
> +		if (err)
> +			break;
> +	} while (pte++, addr += PAGE_SIZE, addr != end);

I was about to say this do/while format seems a bit non-idiomatic for
page table walkers, but then I looked at the code in mm/memory.c and
realized the stuff I've been hacking on is the odd one out.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch 20/20] Add apply_to_page_range() which applies a function to a pte range.
  2007-04-05  4:41   ` Matt Mackall
@ 2007-04-05  6:52     ` Jeremy Fitzhardinge
  2007-04-17 20:56       ` Matt Mackall
  0 siblings, 1 reply; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-05  6:52 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Chris Wright, virtualization, Matt Mackall, Ingo Molnar,
	Ian Pratt, lkml, Andrew Morton, Christoph Lameter

Matt Mackall wrote:
>> +/*
>> + * Scan a region of virtual memory, filling in page tables as necessary
>> + * and calling a provided function on each leaf page table.
>> + */
>>     
>
> But I'm not sure what the use case is that wants filling in the page
> table..? If both modes really make sense, perhaps a flag could unify
> these differences.
>   

Well, two reasons:

One is the general one that if you're traversing ptes then they need to
exist to traverse them (for example, if you're creating new mappings). 
Obviously if you want to just visit existing mappings, then
instantiating new pagetable is not the right thing to do (and I could
make use of this too).

The other is that there are various places in the Xen hypervisor API
where you pass in a reference to pte entry for the hypervisor to put
mappings into, and the rest of the pagetable needs to exist.  The Xen
code uses the side-effect of apply_to_page_range() to create pagetable
for these calls.

>> +typedef int (*pte_fn_t)(pte_t *pte, struct page *pmd_page, unsigned long addr,
>> +			void *data);
>>     
>
> I'd gotten the impression that these sorts of typedefs were out of
> fashion.
>   

In general yes, but for function pointers the syntax is so clumsy that I
think typedefs are OK.
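
(For comparison, without the typedef the declaration in mm.h would have
to read something like

	extern int apply_to_page_range(struct mm_struct *mm,
				       unsigned long address,
				       unsigned long size,
				       int (*fn)(pte_t *, struct page *,
						 unsigned long, void *),
				       void *data);

which is the clumsiness I mean.)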

>> +static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>> +				     unsigned long addr, unsigned long end,
>> +				     pte_fn_t fn, void *data)
>> +{
>> +	pte_t *pte;
>> +	int err;
>> +	struct page *pmd_page;
>> +	spinlock_t *ptl;
>> +
>> +	pte = (mm == &init_mm) ?
>> +		pte_alloc_kernel(pmd, addr) :
>> +		pte_alloc_map_lock(mm, pmd, addr, &ptl);
>> +	if (!pte)
>> +		return -ENOMEM;
>>     
>
> Seems a bit awkward to pass mm all the way down the tree just for this
> quirk. It also means that whether or not a lock is held in the callback
> is context dependent.
>   

Well, it would need mm for just pte_alloc_map_lock() anyway.

> smaps, clear_ref, and my pagemap code all use the callback at the
> pmd_range level, which a) localizes the pte-level locking concerns
> with the user b) amortizes the indirection overhead and c)
> (unfortunately) makes the user a bit more complex.
>
> We should try to measure whether (b) actually makes a difference.
>   

I'll need to look closely at your code again.  It would be nice to have
one pagewalker.

    J

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch 09/20] rename struct paravirt_patch to paravirt_patch_site for clarity
  2007-04-04 19:12 ` [patch 09/20] rename struct paravirt_patch to paravirt_patch_site for clarity Jeremy Fitzhardinge
@ 2007-04-06 23:18   ` Andrew Morton
  2007-04-06 23:24     ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 45+ messages in thread
From: Andrew Morton @ 2007-04-06 23:18 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: virtualization, lkml

On Wed, 04 Apr 2007 12:12:00 -0700 Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Rename struct paravirt_patch to paravirt_patch_site, so that it
> clearly refers to a callsite, and not the patch which may be applied
> to that callsite.

Needed a bit of updating for Andi's recently-merged
x86_64-mm-vmi-backend-for-paravirt-ops.patch but I think I got it
right.  We'll see if it compiles.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch 18/20] clean up tsc-based sched_clock
  2007-04-04 19:12 ` [patch 18/20] clean up tsc-based sched_clock Jeremy Fitzhardinge
@ 2007-04-06 23:22   ` Andrew Morton
  2007-04-06 23:27     ` Jeremy Fitzhardinge
  2007-04-06 23:40     ` Jeremy Fitzhardinge
  0 siblings, 2 replies; 45+ messages in thread
From: Andrew Morton @ 2007-04-06 23:22 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Andi Kleen, virtualization, lkml

On Wed, 04 Apr 2007 12:12:09 -0700 Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Three cleanups:
>  - change "instable" -> "unstable"
>  - its better to use get_cpu_var for getting this cpu's variables
>  - change cycles_2_ns to do the full computation rather than just the
>    tsc->ns scaling.  Its a simpler interface, and it makes the function
>    more generally useful.
> 
> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
> 
> ---
>  arch/i386/kernel/sched-clock.c |   35 +++++++++++++++++++++--------------

I'm dropping the relevant patch from Andi's tree due to it causing
mysterious hangs when initscripts start ondemand.  So I'll need to drop
this patch and "[patch 19/20] Add a sched_clock paravirt_op".


I still need to work out why that hang is happening - it is very
mysterious.  I got as far as working out that it was hanging on
write_seqlock_irqsave(xtime_lock), then remembered that it's with
CONFIG_SMP=n so I stomped off to bed in disgust.  Later.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch 09/20] rename struct paravirt_patch to paravirt_patch_site for clarity
  2007-04-06 23:18   ` Andrew Morton
@ 2007-04-06 23:24     ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-06 23:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, virtualization, lkml, Rusty Russell, Zachary Amsden

Andrew Morton wrote:
> Needed a bit of updating for Andi's recently-merged
> x86_64-mm-vmi-backend-for-paravirt-ops.patch but I think I got it
> right.  We'll see if it compiles.
>   

Yeah, it's just a simple name change, so it will either compile or not.

    J

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch 18/20] clean up tsc-based sched_clock
  2007-04-06 23:22   ` Andrew Morton
@ 2007-04-06 23:27     ` Jeremy Fitzhardinge
  2007-04-06 23:45       ` Andrew Morton
  2007-04-06 23:40     ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-06 23:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, virtualization, lkml

Andrew Morton wrote:
> On Wed, 04 Apr 2007 12:12:09 -0700 Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>   
>> Three cleanups:
>>  - change "instable" -> "unstable"
>>  - its better to use get_cpu_var for getting this cpu's variables
>>  - change cycles_2_ns to do the full computation rather than just the
>>    tsc->ns scaling.  Its a simpler interface, and it makes the function
>>    more generally useful.
>>
>> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
>>
>> ---
>>  arch/i386/kernel/sched-clock.c |   35 +++++++++++++++++++++--------------
>>     
>
> I'm dropping the relevant patch from Andi's tree due to it causing
> mysterious hangs when initscripts start ondemand.  So I'll need to drop
> this patch and "[patch 19/20] Add a sched_clock paravirt_op".
>   

OK.  That just means we need to go back to the original "Add a
sched_clock paravirt_op" follow-on patch.

> I still need to work out why that hang is happening - it is very
> mysterious.  I got as far as working out that it was hanging on
> write_seqlock_irqsave(xtime_lock), then remembered that it's with
> CONFIG_SMP=n so I stomped off to bed in disgust.  Later.
>   

Hangs always, or just sometimes?  I haven't seen any problems with it.

    J

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch 18/20] clean up tsc-based sched_clock
  2007-04-06 23:22   ` Andrew Morton
  2007-04-06 23:27     ` Jeremy Fitzhardinge
@ 2007-04-06 23:40     ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-06 23:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, virtualization, lkml

Andrew Morton wrote:
> On Wed, 04 Apr 2007 12:12:09 -0700 Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>   
>> Three cleanups:
>>  - change "instable" -> "unstable"
>>  - its better to use get_cpu_var for getting this cpu's variables
>>  - change cycles_2_ns to do the full computation rather than just the
>>    tsc->ns scaling.  Its a simpler interface, and it makes the function
>>    more generally useful.
>>
>> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
>>
>> ---
>>  arch/i386/kernel/sched-clock.c |   35 +++++++++++++++++++++--------------
>>     
>
> I'm dropping the relevant patch from Andi's tree due to it causing
> mysterious hangs when initscripts start ondemand.  So I'll need to drop
> this patch and "[patch 19/20] Add a sched_clock paravirt_op".
>
>
> I still need to work out why that hang is happening - it is very
> mysterious.  I got as far as working out that it was hanging on
> write_seqlock_irqsave(xtime_lock), then remembered that it's with
> CONFIG_SMP=n so I stomped off to bed in disgust.  Later.
>   

Though I do get:

BUG: at arch/i386/kernel/sched-clock.c:167 init_sched_clock()
 [<c0109012>] show_trace_log_lvl+0x1a/0x30
 [<c01095c7>] show_trace+0x12/0x14
 [<c010965b>] dump_stack+0x16/0x18
 [<c0465f51>] init_sched_clock+0x82/0xa8
 [<c045e547>] init+0x14b/0x241
 [<c0108bf7>] kernel_thread_helper+0x7/0x10
 =======================

which suggests that it's being initialized too late.

    J

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch 07/20] Allow paravirt backend to choose kernel PMD sharing
  2007-04-04 19:11 ` [patch 07/20] Allow paravirt backend to choose kernel PMD sharing Jeremy Fitzhardinge
  2007-04-05  0:30   ` Christoph Lameter
@ 2007-04-06 23:41   ` Andrew Morton
  2007-04-07  0:02     ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 45+ messages in thread
From: Andrew Morton @ 2007-04-06 23:41 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: virtualization, lkml, Ingo Molnar, William Lee Irwin III,
	Christoph Lameter

On Wed, 04 Apr 2007 12:11:58 -0700 Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Normally when running in PAE mode, the 4th PMD maps the kernel address
> space, which can be shared among all processes (since they all need
> the same kernel mappings).
> 
> Xen, however, does not allow guests to have the kernel pmd shared
> between page tables, so parameterize pgtable.c to allow both modes of
> operation.
> 
> There are several side-effects of this.  One is that vmalloc will
> update the kernel address space mappings, and those updates need to be
> propagated into all processes if the kernel mappings are not
> intrinsically shared.  In the non-PAE case, this is done by
> maintaining a pgd_list of all processes; this list is used when all
> process pagetables must be updated.  pgd_list is threaded via
> otherwise unused entries in the page structure for the pgd, which
> means that the pgd must be page-sized for this to work.
> 
> Normally the PAE pgd is only 4x64 byte entries large, but Xen requires
> the PAE pgd to page aligned anyway, so this patch forces the pgd to be
> page aligned+sized when the kernel pmd is unshared, to accomodate both
> these requirements.
> 
> Also, since there may be several distinct kernel pmds (if the
> user/kernel split is below 3G), there's no point in allocating them
> from a slab cache; they're just allocated with get_free_page and
> initialized appropriately.  (Of course the could be cached if there is
> just a single kernel pmd - which is the default with a 3G user/kernel
> split - but it doesn't seem worthwhile to add yet another case into
> this code).

All this paravirt stuff isn't making the kernel any prettier, is it?

> ...
>  
> -#ifndef CONFIG_X86_PAE
> -void vmalloc_sync_all(void)
> +void _vmalloc_sync_all(void)
>  {
>  	/*
>  	 * Note that races in the updates of insync and start aren't
> @@ -600,6 +599,8 @@ void vmalloc_sync_all(void)
>  	static DECLARE_BITMAP(insync, PTRS_PER_PGD);
>  	static unsigned long start = TASK_SIZE;
>  	unsigned long address;
> +
> +	BUG_ON(SHARED_KERNEL_PMD);
>  
>  	BUILD_BUG_ON(TASK_SIZE & ~PGDIR_MASK);
>  	for (address = start; address >= TASK_SIZE; address += PGDIR_SIZE) {
> @@ -623,4 +624,3 @@ void vmalloc_sync_all(void)
>  			start = address + PGDIR_SIZE;
>  	}
>  }

This is a functional change for non-paravirt kernels.  Non-PAE kernels now
get a vmalloc_sync_all().  How come?

We normally use double-underscore for things like this.

Your change clashes pretty fundamentally with
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc5/2.6.21-rc5-mm4/broken-out/move-die-notifier-handling-to-common-code-fix-vmalloc_sync_all.patch,
and
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc5/2.6.21-rc5-mm4/broken-out/move-die-notifier-handling-to-common-code.patch
_does_ make the kernel prettier.

But I'm a bit reluctant to rework
move-die-notifier-handling-to-common-code-fix-vmalloc_sync_all.patch
(somehow) until I understand why your patch is a) futzing with non-PAE,
non-paravirt code and b) overengineered.

Why didn't you just stick a

	if (SHARED_KERNEL_PMD)
		return;

into vmalloc_sync_all()?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch 18/20] clean up tsc-based sched_clock
  2007-04-06 23:27     ` Jeremy Fitzhardinge
@ 2007-04-06 23:45       ` Andrew Morton
  0 siblings, 0 replies; 45+ messages in thread
From: Andrew Morton @ 2007-04-06 23:45 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Andi Kleen, virtualization, lkml

On Fri, 06 Apr 2007 16:27:20 -0700 Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> > I still need to work out why that hang is happening - it is very
> > mysterious.  I got as far as working out that it was hanging on
> > write_seqlock_irqsave(xtime_lock), then remembered that it's with
> > CONFIG_SMP=n so I stomped off to bed in disgust.  Later.
> >   
> 
> Hangs always, or just sometimes?  I haven't seen any problems with it.

100%.  On the Vaio, of course.  The kernel's canary.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch 07/20] Allow paravirt backend to choose kernel PMD sharing
  2007-04-06 23:41   ` Andrew Morton
@ 2007-04-07  0:02     ` Jeremy Fitzhardinge
  2007-04-07  0:28       ` Andrew Morton
  0 siblings, 1 reply; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-07  0:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: virtualization, lkml, Ingo Molnar, William Lee Irwin III,
	Christoph Lameter

Andrew Morton wrote:
> All this paravirt stuff isn't making the kernel any prettier, is it?
>   

You're too kind.  wli's comment on the first version of this patch was
something along the lines of "this patch causes a surprising amount of
damage for what little it achieves".

>> ...
>>  
>> -#ifndef CONFIG_X86_PAE
>> -void vmalloc_sync_all(void)
>> +void _vmalloc_sync_all(void)
>>  {
>>  	/*
>>  	 * Note that races in the updates of insync and start aren't
>> @@ -600,6 +599,8 @@ void vmalloc_sync_all(void)
>>  	static DECLARE_BITMAP(insync, PTRS_PER_PGD);
>>  	static unsigned long start = TASK_SIZE;
>>  	unsigned long address;
>> +
>> +	BUG_ON(SHARED_KERNEL_PMD);
>>  
>>  	BUILD_BUG_ON(TASK_SIZE & ~PGDIR_MASK);
>>  	for (address = start; address >= TASK_SIZE; address += PGDIR_SIZE) {
>> @@ -623,4 +624,3 @@ void vmalloc_sync_all(void)
>>  			start = address + PGDIR_SIZE;
>>  	}
>>  }
>>     
>
> This is a functional change for non-paravirt kernels.  Non-PAE kernels now
> get a vmalloc_sync_all().  How come?
>   

You mean PAE kernels get a vmalloc_sync_all?

When we're in PAE mode, but SHARED_KERNEL_PMD is false (which is true
for Xen, but not for native execution), then the kernel mappings are not
implicitly shared between processes.  This means that the vmalloc
mappings are not shared, and so need to be explicitly synchronized
between pagetables, like in the !PAE case.

> We normally use double-underscore for things like this.
>   

OK.

> Your change clashes pretty fundamantally with
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc5/2.6.21-rc5-mm4/broken-out/move-die-notifier-handling-to-common-code-fix-vmalloc_sync_all.patch,
> and
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc5/2.6.21-rc5-mm4/broken-out/move-die-notifier-handling-to-common-code.patch
> _does_ make the kernel prettier.
>   

Hm, it doesn't look like a deep clash.  Dropping the inline function and
just putting the "if (SHARED_KERNEL_PMD) return;" at the start of the
real vmalloc_sync_all() would work fine.

And I like vmalloc_sync_all() being a non-arch-specific interface; it
cleans up another of the xen patches.

> But I'm a bit reluctant to rework
> move-die-notifier-handling-to-common-code-fix-vmalloc_sync_all.patch
> (somehow) until I understand why your patch is a) futzing with non-PAE,
> non-paravirt code

There should be no functional difference for non-paravirt code, PAE or
non-PAE.

>  and b) overengineered.
>   

Overall, or just this bit?

> Why didn't you just stick a
>
> 	if (SHARED_KERNEL_PMD)
> 		return;
>
> into vmalloc_sync_all()?
>   

That would work, but when building !PARAVIRT && PAE, SHARED_KERNEL_PMD
is just constant 1, so it would end up making a pointless function
call.  With the wrapper, the call disappears entirely.  It probably
doesn't matter, but I didn't want anyone to complain about making the
!PARAVIRT generated code worse (hi, Ingo!).

However, if you're making vmalloc_sync_all a weak function anyway, then
there's no difference with the paravirt patches in place.  The

	if (SHARED_KERNEL_PMD)
		return;

will evaluate to

	if (1)
		return;

    J

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch 07/20] Allow paravirt backend to choose kernel PMD sharing
  2007-04-07  0:02     ` Jeremy Fitzhardinge
@ 2007-04-07  0:28       ` Andrew Morton
  2007-04-07  0:40         ` Jeremy Fitzhardinge
  2007-04-09  2:36         ` William Lee Irwin III
  0 siblings, 2 replies; 45+ messages in thread
From: Andrew Morton @ 2007-04-07  0:28 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, virtualization, lkml, William Lee Irwin III,
	Zachary Amsden, Christoph Lameter, Ingo Molnar

On Fri, 06 Apr 2007 17:02:58 -0700 Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Andrew Morton wrote:
> > All this paravirt stuff isn't making the kernel any prettier, is it?
> >   
> 
> You're too kind.  wli's comment on the first version of this patch was
> something along the lines of "this patch causes a surprising amount of
> damage for what little it achieves".

Damn, I wish I'd said that.

> >> ...
> >>  
> >> -#ifndef CONFIG_X86_PAE
> >> -void vmalloc_sync_all(void)
> >> +void _vmalloc_sync_all(void)
> >>  {
> >>  	/*
> >>  	 * Note that races in the updates of insync and start aren't
> >> @@ -600,6 +599,8 @@ void vmalloc_sync_all(void)
> >>  	static DECLARE_BITMAP(insync, PTRS_PER_PGD);
> >>  	static unsigned long start = TASK_SIZE;
> >>  	unsigned long address;
> >> +
> >> +	BUG_ON(SHARED_KERNEL_PMD);
> >>  
> >>  	BUILD_BUG_ON(TASK_SIZE & ~PGDIR_MASK);
> >>  	for (address = start; address >= TASK_SIZE; address += PGDIR_SIZE) {
> >> @@ -623,4 +624,3 @@ void vmalloc_sync_all(void)
> >>  			start = address + PGDIR_SIZE;
> >>  	}
> >>  }
> >>     
> >
> > This is a functional change for non-paravirt kernels.  Non-PAE kernels now
> > get a vmalloc_sync_all().  How come?
> >   
> 
> You mean PAE kernels get a vmalloc_sync_all?

err, yes.

> When we're in PAE mode, but SHARED_KERNEL_PMD is false (which is true
> for Xen, but not for native execution), then the kernel mappings are not
> implicitly shared between processes.  This means that the vmalloc
> mappings are not shared, and so need to be explicitly synchronized
> between pagetables, like in the !PAE case.

head spins.

> > Your change clashes pretty fundamantally with
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc5/2.6.21-rc5-mm4/broken-out/move-die-notifier-handling-to-common-code-fix-vmalloc_sync_all.patch,
> > and
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc5/2.6.21-rc5-mm4/broken-out/move-die-notifier-handling-to-common-code.patch
> > _does_ make the kernel prettier.
> >   
> 
> Hm, it doesn't look like a deep clash.  Dropping the inline function and
> just putting the "if (SHARED_KERNEL_PMD) return;" at the start of the
> real vmalloc_sync_all() would work fine.

Something like that.  I don't want to redo my patch if we're going to change
your patch ;)

> And I like vmalloc_sync_all() being a non-arch-specific interface; it
> cleans up another of the xen patches.

OK.

> > But I'm a bit reluctant to rework
> > move-die-notifier-handling-to-common-code-fix-vmalloc_sync_all.patch
> > (somehow) until I understand why your patch is a) futzing with non-PAE,
> > non-paravirt code
> 
> There should be no functional difference for non-paravirt code, PAE or
> non-PAE.
> 
> >  and b) overengineered.
> >   
> 
> Overall, or just this bit?

this bit.

> > Why didn't you just stick a
> >
> > 	if (SHARED_KERNEL_PMD)
> > 		return;
> >
> > into vmalloc_sync_all()?
> >   
> 
> That would work, but when building !PARAVIRT && PAE, SHARED_KERNEL_PMD
> is just constant 1, so it would end up making a pointless function
> call.  With the wrapper, the call disappears entirely.  It probably
> doesn't matter, but I didn't want anyone to complain about making the
> !PARAVIRT generated code worse (hi, Ingo!).

vmalloc_sync_all() is a) tremendously slow and b) only called by
register_die_notifier().  We can afford to add a few cycles to it.

> However, if you're making vmalloc_sync_all a weak function anyway, then
> there's no difference with the paravirt patches in place.  The
> 
> 	if (SHARED_KERNEL_PMD)
> 		return;
> 
> will evaluate to
> 
> 	if (1)
> 		return;
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch 07/20] Allow paravirt backend to choose kernel PMD sharing
  2007-04-07  0:28       ` Andrew Morton
@ 2007-04-07  0:40         ` Jeremy Fitzhardinge
  2007-04-07  1:21           ` Andrew Morton
  2007-04-09  2:36         ` William Lee Irwin III
  1 sibling, 1 reply; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-07  0:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: virtualization, lkml, Ingo Molnar, William Lee Irwin III,
	Christoph Lameter

Andrew Morton wrote:
> Something like that.  I don't want to redo my patch if we're going to change
> your patch ;)
>   

OK.  I won't specifically redo it on top of your patches, but I'll
rework it to remove the inline function and add the if() statement.  Do
you want an incremental update or a complete replacement?

    J

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch 07/20] Allow paravirt backend to choose kernel PMD sharing
  2007-04-07  0:40         ` Jeremy Fitzhardinge
@ 2007-04-07  1:21           ` Andrew Morton
  2007-04-07  5:47             ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 45+ messages in thread
From: Andrew Morton @ 2007-04-07  1:21 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, virtualization, lkml, William Lee Irwin III,
	Zachary Amsden, Christoph Lameter, Ingo Molnar

On Fri, 06 Apr 2007 17:40:13 -0700 Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Andrew Morton wrote:
> > Something like that.  I don't want to redo my patch if we're going to change
> > your patch ;)
> >   
> 
> OK.  I won't specifically redo it on top of your patches, but I'll
> rework it to remove the inline function and add the if() statement.  Do
> you want an incremental update or a complete replacement?
> 

Thanks.  A replacement would suit.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch 07/20] Allow paravirt backend to choose kernel PMD sharing
  2007-04-07  1:21           ` Andrew Morton
@ 2007-04-07  5:47             ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-07  5:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, virtualization, lkml, William Lee Irwin III,
	Zachary Amsden, Christoph Lameter, Ingo Molnar

Andrew Morton wrote:
> Thanks.  A replacement would suit.
>   
Subject: Allow paravirt backend to choose kernel PMD sharing

Normally when running in PAE mode, the 4th PMD maps the kernel address
space, which can be shared among all processes (since they all need
the same kernel mappings).

Xen, however, does not allow guests to have the kernel pmd shared
between page tables, so parameterize pgtable.c to allow both modes of
operation.

There are several side-effects of this.  One is that vmalloc will
update the kernel address space mappings, and those updates need to be
propagated into all processes if the kernel mappings are not
intrinsically shared.  In the non-PAE case, this is done by
maintaining a pgd_list of all processes; this list is used when all
process pagetables must be updated.  pgd_list is threaded via
otherwise unused entries in the page structure for the pgd, which
means that the pgd must be page-sized for this to work.

Normally the PAE pgd is only 4x64 byte entries large, but Xen requires
the PAE pgd to be page aligned anyway, so this patch forces the pgd to be
page aligned+sized when the kernel pmd is unshared, to accommodate both
these requirements.

Also, since there may be several distinct kernel pmds (if the
user/kernel split is below 3G), there's no point in allocating them
from a slab cache; they're just allocated with get_free_page and
initialized appropriately.  (Of course they could be cached if there is
just a single kernel pmd - which is the default with a 3G user/kernel
split - but it doesn't seem worthwhile to add yet another case into
this code).

[ Many thanks to wli for review comments. ]

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: William Lee Irwin III <wli@holomorphy.com>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Christoph Lameter <clameter@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
---
 arch/i386/kernel/paravirt.c            |    1 
 arch/i386/mm/fault.c                   |    5 +
 arch/i386/mm/init.c                    |   18 +++++-
 arch/i386/mm/pageattr.c                |    2 
 arch/i386/mm/pgtable.c                 |   84 ++++++++++++++++++++++++++------
 include/asm-i386/paravirt.h            |    1 
 include/asm-i386/pgtable-2level-defs.h |    2 
 include/asm-i386/pgtable-2level.h      |    2 
 include/asm-i386/pgtable-3level-defs.h |    6 ++
 include/asm-i386/pgtable-3level.h      |    2 
 include/asm-i386/pgtable.h             |    2 
 11 files changed, 100 insertions(+), 25 deletions(-)

===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -132,6 +132,7 @@ struct paravirt_ops paravirt_ops = {
 	.name = "bare hardware",
 	.paravirt_enabled = 0,
 	.kernel_rpl = 0,
+	.shared_kernel_pmd = 1,	/* Only used when CONFIG_X86_PAE is set */
 
  	.patch = native_patch,
 	.banner = default_banner,
===================================================================
--- a/arch/i386/mm/fault.c
+++ b/arch/i386/mm/fault.c
@@ -603,7 +603,6 @@ do_sigbus:
 	force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
 }
 
-#ifndef CONFIG_X86_PAE
 void vmalloc_sync_all(void)
 {
 	/*
@@ -615,6 +614,9 @@ void vmalloc_sync_all(void)
 	static DECLARE_BITMAP(insync, PTRS_PER_PGD);
 	static unsigned long start = TASK_SIZE;
 	unsigned long address;
+
+	if (SHARED_KERNEL_PMD)
+		return;
 
 	BUILD_BUG_ON(TASK_SIZE & ~PGDIR_MASK);
 	for (address = start; address >= TASK_SIZE; address += PGDIR_SIZE) {
@@ -638,4 +640,3 @@ void vmalloc_sync_all(void)
 			start = address + PGDIR_SIZE;
 	}
 }
-#endif
===================================================================
--- a/arch/i386/mm/init.c
+++ b/arch/i386/mm/init.c
@@ -757,6 +757,8 @@ struct kmem_cache *pmd_cache;
 
 void __init pgtable_cache_init(void)
 {
+	size_t pgd_size = PTRS_PER_PGD*sizeof(pgd_t);
+
 	if (PTRS_PER_PMD > 1) {
 		pmd_cache = kmem_cache_create("pmd",
 					PTRS_PER_PMD*sizeof(pmd_t),
@@ -766,13 +768,23 @@ void __init pgtable_cache_init(void)
 					NULL);
 		if (!pmd_cache)
 			panic("pgtable_cache_init(): cannot create pmd cache");
+
+		if (!SHARED_KERNEL_PMD) {
+			/* If we're in PAE mode and have a non-shared
+			   kernel pmd, then the pgd size must be a
+			   page size.  This is because the pgd_list
+			   links through the page structure, so there
+			   can only be one pgd per page for this to
+			   work. */
+			pgd_size = PAGE_SIZE;
+		}
 	}
 	pgd_cache = kmem_cache_create("pgd",
-				PTRS_PER_PGD*sizeof(pgd_t),
-				PTRS_PER_PGD*sizeof(pgd_t),
+				pgd_size,
+				pgd_size,
 				0,
 				pgd_ctor,
-				PTRS_PER_PMD == 1 ? pgd_dtor : NULL);
+				(!SHARED_KERNEL_PMD) ? pgd_dtor : NULL);
 	if (!pgd_cache)
 		panic("pgtable_cache_init(): Cannot create pgd cache");
 }
===================================================================
--- a/arch/i386/mm/pageattr.c
+++ b/arch/i386/mm/pageattr.c
@@ -91,7 +91,7 @@ static void set_pmd_pte(pte_t *kpte, uns
 	unsigned long flags;
 
 	set_pte_atomic(kpte, pte); 	/* change init_mm */
-	if (PTRS_PER_PMD > 1)
+	if (SHARED_KERNEL_PMD)
 		return;
 
 	spin_lock_irqsave(&pgd_lock, flags);
===================================================================
--- a/arch/i386/mm/pgtable.c
+++ b/arch/i386/mm/pgtable.c
@@ -232,35 +232,56 @@ static inline void pgd_list_del(pgd_t *p
 		set_page_private(next, (unsigned long)pprev);
 }
 
+#if (PTRS_PER_PMD == 1)
+/* Non-PAE pgd constructor */
 void pgd_ctor(void *pgd, struct kmem_cache *cache, unsigned long unused)
 {
 	unsigned long flags;
 
-	if (PTRS_PER_PMD == 1) {
-		memset(pgd, 0, USER_PTRS_PER_PGD*sizeof(pgd_t));
-		spin_lock_irqsave(&pgd_lock, flags);
-	}
+	/* !PAE, no pagetable sharing */
+	memset(pgd, 0, USER_PTRS_PER_PGD*sizeof(pgd_t));
 
 	clone_pgd_range((pgd_t *)pgd + USER_PTRS_PER_PGD,
 			swapper_pg_dir + USER_PTRS_PER_PGD,
 			KERNEL_PGD_PTRS);
 
-	if (PTRS_PER_PMD > 1)
-		return;
+	spin_lock_irqsave(&pgd_lock, flags);
 
 	/* must happen under lock */
 	paravirt_alloc_pd_clone(__pa(pgd) >> PAGE_SHIFT,
-			__pa(swapper_pg_dir) >> PAGE_SHIFT,
-			USER_PTRS_PER_PGD, PTRS_PER_PGD - USER_PTRS_PER_PGD);
+				__pa(swapper_pg_dir) >> PAGE_SHIFT,
+				USER_PTRS_PER_PGD,
+				KERNEL_PGD_PTRS);
 
 	pgd_list_add(pgd);
 	spin_unlock_irqrestore(&pgd_lock, flags);
 }
-
-/* never called when PTRS_PER_PMD > 1 */
+#else  /* PTRS_PER_PMD > 1 */
+/* PAE pgd constructor */
+void pgd_ctor(void *pgd, struct kmem_cache *cache, unsigned long unused)
+{
+	/* PAE, kernel PMD may be shared */
+
+	if (SHARED_KERNEL_PMD) {
+		clone_pgd_range((pgd_t *)pgd + USER_PTRS_PER_PGD,
+				swapper_pg_dir + USER_PTRS_PER_PGD,
+				KERNEL_PGD_PTRS);
+	} else {
+		unsigned long flags;
+
+		memset(pgd, 0, USER_PTRS_PER_PGD*sizeof(pgd_t));
+		spin_lock_irqsave(&pgd_lock, flags);
+		pgd_list_add(pgd);
+		spin_unlock_irqrestore(&pgd_lock, flags);
+	}
+}
+#endif	/* PTRS_PER_PMD */
+
 void pgd_dtor(void *pgd, struct kmem_cache *cache, unsigned long unused)
 {
 	unsigned long flags; /* can be called from interrupt context */
+
+	BUG_ON(SHARED_KERNEL_PMD);
 
 	paravirt_release_pd(__pa(pgd) >> PAGE_SHIFT);
 	spin_lock_irqsave(&pgd_lock, flags);
@@ -268,6 +289,37 @@ void pgd_dtor(void *pgd, struct kmem_cac
 	spin_unlock_irqrestore(&pgd_lock, flags);
 }
 
+#define UNSHARED_PTRS_PER_PGD				\
+	(SHARED_KERNEL_PMD ? USER_PTRS_PER_PGD : PTRS_PER_PGD)
+
+/* If we allocate a pmd for part of the kernel address space, then
+   make sure its initialized with the appropriate kernel mappings.
+   Otherwise use a cached zeroed pmd.  */
+static pmd_t *pmd_cache_alloc(int idx)
+{
+	pmd_t *pmd;
+
+	if (idx >= USER_PTRS_PER_PGD) {
+		pmd = (pmd_t *)__get_free_page(GFP_KERNEL);
+
+		if (pmd)
+			memcpy(pmd,
+			       (void *)pgd_page_vaddr(swapper_pg_dir[idx]),
+			       sizeof(pmd_t) * PTRS_PER_PMD);
+	} else
+		pmd = kmem_cache_alloc(pmd_cache, GFP_KERNEL);
+
+	return pmd;
+}
+
+static void pmd_cache_free(pmd_t *pmd, int idx)
+{
+	if (idx >= USER_PTRS_PER_PGD)
+		free_page((unsigned long)pmd);
+	else
+		kmem_cache_free(pmd_cache, pmd);
+}
+
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
 	int i;
@@ -276,10 +328,12 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 	if (PTRS_PER_PMD == 1 || !pgd)
 		return pgd;
 
-	for (i = 0; i < USER_PTRS_PER_PGD; ++i) {
-		pmd_t *pmd = kmem_cache_alloc(pmd_cache, GFP_KERNEL);
+ 	for (i = 0; i < UNSHARED_PTRS_PER_PGD; ++i) {
+		pmd_t *pmd = pmd_cache_alloc(i);
+
 		if (!pmd)
 			goto out_oom;
+
 		paravirt_alloc_pd(__pa(pmd) >> PAGE_SHIFT);
 		set_pgd(&pgd[i], __pgd(1 + __pa(pmd)));
 	}
@@ -290,7 +344,7 @@ out_oom:
 		pgd_t pgdent = pgd[i];
 		void* pmd = (void *)__va(pgd_val(pgdent)-1);
 		paravirt_release_pd(__pa(pmd) >> PAGE_SHIFT);
-		kmem_cache_free(pmd_cache, pmd);
+		pmd_cache_free(pmd, i);
 	}
 	kmem_cache_free(pgd_cache, pgd);
 	return NULL;
@@ -302,11 +356,11 @@ void pgd_free(pgd_t *pgd)
 
 	/* in the PAE case user pgd entries are overwritten before usage */
 	if (PTRS_PER_PMD > 1)
-		for (i = 0; i < USER_PTRS_PER_PGD; ++i) {
+		for (i = 0; i < UNSHARED_PTRS_PER_PGD; ++i) {
 			pgd_t pgdent = pgd[i];
 			void* pmd = (void *)__va(pgd_val(pgdent)-1);
 			paravirt_release_pd(__pa(pmd) >> PAGE_SHIFT);
-			kmem_cache_free(pmd_cache, pmd);
+			pmd_cache_free(pmd, i);
 		}
 	/* in the non-PAE case, free_pgtables() clears user pgd entries */
 	kmem_cache_free(pgd_cache, pgd);
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -35,6 +35,7 @@ struct paravirt_ops
 struct paravirt_ops
 {
 	unsigned int kernel_rpl;
+	int shared_kernel_pmd;
  	int paravirt_enabled;
 	const char *name;
 
===================================================================
--- a/include/asm-i386/pgtable-2level-defs.h
+++ b/include/asm-i386/pgtable-2level-defs.h
@@ -1,5 +1,7 @@
 #ifndef _I386_PGTABLE_2LEVEL_DEFS_H
 #define _I386_PGTABLE_2LEVEL_DEFS_H
+
+#define SHARED_KERNEL_PMD	0
 
 /*
  * traditional i386 two-level paging structure:
===================================================================
--- a/include/asm-i386/pgtable-2level.h
+++ b/include/asm-i386/pgtable-2level.h
@@ -82,6 +82,4 @@ static inline int pte_exec_kernel(pte_t 
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { (pte).pte_low })
 #define __swp_entry_to_pte(x)		((pte_t) { (x).val })
 
-void vmalloc_sync_all(void);
-
 #endif /* _I386_PGTABLE_2LEVEL_H */
===================================================================
--- a/include/asm-i386/pgtable-3level-defs.h
+++ b/include/asm-i386/pgtable-3level-defs.h
@@ -1,5 +1,11 @@
 #ifndef _I386_PGTABLE_3LEVEL_DEFS_H
 #define _I386_PGTABLE_3LEVEL_DEFS_H
+
+#ifdef CONFIG_PARAVIRT
+#define SHARED_KERNEL_PMD	(paravirt_ops.shared_kernel_pmd)
+#else
+#define SHARED_KERNEL_PMD	1
+#endif
 
 /*
  * PGDIR_SHIFT determines what a top-level page table entry can map
===================================================================
--- a/include/asm-i386/pgtable-3level.h
+++ b/include/asm-i386/pgtable-3level.h
@@ -200,6 +200,4 @@ static inline pmd_t pfn_pmd(unsigned lon
 
 #define __pmd_free_tlb(tlb, x)		do { } while (0)
 
-#define vmalloc_sync_all() ((void)0)
-
 #endif /* _I386_PGTABLE_3LEVEL_H */
===================================================================
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -242,6 +242,8 @@ static inline pte_t pte_mkyoung(pte_t pt
 static inline pte_t pte_mkyoung(pte_t pte)	{ (pte).pte_low |= _PAGE_ACCESSED; return pte; }
 static inline pte_t pte_mkwrite(pte_t pte)	{ (pte).pte_low |= _PAGE_RW; return pte; }
 static inline pte_t pte_mkhuge(pte_t pte)	{ (pte).pte_low |= _PAGE_PSE; return pte; }
+
+extern void vmalloc_sync_all(void);
 
 #ifdef CONFIG_X86_PAE
 # include <asm/pgtable-3level.h>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch 07/20] Allow paravirt backend to choose kernel PMD sharing
  2007-04-07  0:28       ` Andrew Morton
  2007-04-07  0:40         ` Jeremy Fitzhardinge
@ 2007-04-09  2:36         ` William Lee Irwin III
  1 sibling, 0 replies; 45+ messages in thread
From: William Lee Irwin III @ 2007-04-09  2:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeremy Fitzhardinge, Andi Kleen, virtualization, lkml,
	Zachary Amsden, Christoph Lameter, Ingo Molnar

On Fri, 06 Apr 2007 17:02:58 -0700 Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>> You're too kind.  wli's comment on the first version of this patch was
>> something along the lines of "this patch causes a surprising amount of
>> damage for what little it achieves".

On Fri, Apr 06, 2007 at 05:28:44PM -0700, Andrew Morton wrote:
> Damn, I wish I'd said that.

ISTR it went:

On Fri, Feb 16, 2007 at 02:21:07PM -0800, William Lee Irwin III wrote:
> The amount of violence this patch manages to commit is phenomenal for
> what little it actually does. There are also a number of oddities

Cheers.


-- wli

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch 20/20] Add apply_to_page_range() which applies a function to a pte range.
  2007-04-05  6:52     ` Jeremy Fitzhardinge
@ 2007-04-17 20:56       ` Matt Mackall
  2007-04-19 19:44         ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 45+ messages in thread
From: Matt Mackall @ 2007-04-17 20:56 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Chris Wright, virtualization, Matt Mackall, Ingo Molnar,
	Ian Pratt, lkml, Andrew Morton, Christoph Lameter

On Wed, Apr 04, 2007 at 11:52:57PM -0700, Jeremy Fitzhardinge wrote:
> Matt Mackall wrote:
> >> +/*
> >> + * Scan a region of virtual memory, filling in page tables as necessary
> >> + * and calling a provided function on each leaf page table.
> >> + */
> >>     
> >
> > But I'm not sure what the use case is that wants filling in the page
> > table..? If both modes really make sense, perhaps a flag could unify
> > these differences.
> >   
> 
> Well, two reasons:
> 
> One is the general one that if you're traversing ptes then they need to
> exist to traverse them (for example, if you're creating new mappings). 
> Obviously if you want to just visit existing mappings, then
> instantiating new pagetable is not the right thing to do (and I could
> make use of this too).
> 
> The other is that there are various places in the Xen hypervisor API
> where you pass in a reference to pte entry for the hypervisor to put
> mappings into, and the rest of the pagetable needs to exist.  The Xen
> code uses the side-effect of apply_to_page_range() to create pagetable
> for these calls.

I think adding a flags field and an allocate flag to my callback
struct would be sufficient here.
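
Roughly something like this, say (the struct and flag names here are
just illustrative, not the actual code):

	struct pte_walk {
		unsigned int flags;
	#define PTE_WALK_ALLOC	0x1	/* fill in missing pagetables */
		int (*pte_entry)(pte_t *pte, unsigned long addr, void *data);
		void *private;
	};

with the Xen-style users setting PTE_WALK_ALLOC and read-only walkers
like smaps/pagemap leaving it clear.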

> > I'd gotten the impression that these sorts of typedefs were out of
> > fashion.
> >   
> 
> In general yes, but for function pointers the syntax is so clumsy that I
> think typedefs are OK.

The syntax is horrible, but I don't think we end up using the
resultant type enough to justify the namespace pollution.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch 20/20] Add apply_to_page_range() which applies a function to a pte range.
  2007-04-17 20:56       ` Matt Mackall
@ 2007-04-19 19:44         ` Jeremy Fitzhardinge
  2007-04-19 19:59           ` Matt Mackall
  0 siblings, 1 reply; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-19 19:44 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Andi Kleen, Andrew Morton, virtualization, lkml, Ian Pratt,
	Christian Limpach, Chris Wright, Christoph Lameter, Matt Mackall,
	Ingo Molnar

Matt Mackall wrote:
> I think adding a flags field and an allocate flag to my callback
> struct would be sufficient here.
>   

Yes, probably.

What about something that wants to shatter superpages?

> The syntax is horrible, but I don't think we end up using the
> resultant type enough to justify the namespace pollution.
>   

Don't know that that's a huge concern.  The typedef namespace (*_t) is
pretty empty, and it's not like we're talking about something that's
visible outside the kernel.  And the syntax is *really* ugly.

    J

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch 20/20] Add apply_to_page_range() which applies a function to a pte range.
  2007-04-19 19:44         ` Jeremy Fitzhardinge
@ 2007-04-19 19:59           ` Matt Mackall
  2007-04-19 21:37             ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 45+ messages in thread
From: Matt Mackall @ 2007-04-19 19:59 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Matt Mackall, Ian Pratt, lkml, Chris Wright, virtualization,
	Andrew Morton, Christoph Lameter, Ingo Molnar

On Thu, Apr 19, 2007 at 12:44:57PM -0700, Jeremy Fitzhardinge wrote:
> Matt Mackall wrote:
> > I think adding a flags field and an allocate flag to my callback
> > struct would be sufficient here.
> >   
> 
> Yes, probably.
> 
> What about something that wants to shatter superpages?

Haven't thought a huge amount about that. Perhaps it's best done with
the level 3 callback?

> Don't know that that's a huge concern.  The typedef namespace (*_t) is
> pretty empty, and its not like we're talking about something that's
> visible outside the kernel.  And the syntax is *really* ugly.

Don't really care about this much.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch 20/20] Add apply_to_page_range() which applies a function to a pte range.
  2007-04-19 21:37             ` Jeremy Fitzhardinge
@ 2007-04-19 21:30               ` Matt Mackall
  2007-04-19 22:30                 ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 45+ messages in thread
From: Matt Mackall @ 2007-04-19 21:30 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Matt Mackall, Ian Pratt, lkml, Chris Wright, virtualization,
	Andrew Morton, Christoph Lameter, Ingo Molnar

On Thu, Apr 19, 2007 at 02:37:53PM -0700, Jeremy Fitzhardinge wrote:
> Matt Mackall wrote:
> > Haven't thought a huge amount about that. Perhaps it's best done with
> > the level 3 callback?
> >   
> 
> Level 2, I think, assuming you count the pte pages as level 1.  I think
> it can be dealt with, so long as it correctly skips level 1 callbacks
> for superpages, and does the test after the level 2 callback has returned.

I was counting from the top down.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch 20/20] Add apply_to_page_range() which applies a function to a pte range.
  2007-04-19 19:59           ` Matt Mackall
@ 2007-04-19 21:37             ` Jeremy Fitzhardinge
  2007-04-19 21:30               ` Matt Mackall
  0 siblings, 1 reply; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-19 21:37 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Matt Mackall, Ian Pratt, lkml, Chris Wright, virtualization,
	Andrew Morton, Christoph Lameter, Ingo Molnar

Matt Mackall wrote:
> Haven't thought a huge amount about that. Perhaps it's best done with
> the level 3 callback?
>   

Level 2, I think, assuming you count the pte pages as level 1.  I think
it can be dealt with, so long as it correctly skips level 1 callbacks
for superpages, and does the test after the level 2 callback has returned.

    J

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch 20/20] Add apply_to_page_range() which applies a function to a pte range.
  2007-04-19 21:30               ` Matt Mackall
@ 2007-04-19 22:30                 ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-19 22:30 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Andi Kleen, Andrew Morton, virtualization, lkml, Ian Pratt,
	Christian Limpach, Chris Wright, Christoph Lameter, Matt Mackall,
	Ingo Molnar

Matt Mackall wrote:
> I was counting from the top down.
>   

Bottom-up is better; that way the levels don't change for 2-, 3- or 4-level
pagetables.

    J

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2007-04-19 22:30 UTC | newest]

Thread overview: 45+ messages
2007-04-04 19:11 [patch 00/20] paravirt_ops updates Jeremy Fitzhardinge
2007-04-04 19:11 ` [patch 01/20] update MAINTAINERS Jeremy Fitzhardinge
2007-04-04 19:11 ` [patch 02/20] Remove CONFIG_DEBUG_PARAVIRT Jeremy Fitzhardinge
2007-04-04 19:11 ` [patch 03/20] use paravirt_nop to consistently mark no-op operations Jeremy Fitzhardinge
2007-04-04 19:11 ` [patch 04/20] Add pagetable accessors to pack and unpack pagetable entries Jeremy Fitzhardinge
2007-04-04 19:11 ` [patch 05/20] Hooks to set up initial pagetable Jeremy Fitzhardinge
2007-04-04 19:11 ` [patch 06/20] Allocate a fixmap slot Jeremy Fitzhardinge
2007-04-04 19:11 ` [patch 07/20] Allow paravirt backend to choose kernel PMD sharing Jeremy Fitzhardinge
2007-04-05  0:30   ` Christoph Lameter
2007-04-05  0:43     ` Jeremy Fitzhardinge
2007-04-05  1:29     ` Chris Wright
2007-04-06 23:41   ` Andrew Morton
2007-04-07  0:02     ` Jeremy Fitzhardinge
2007-04-07  0:28       ` Andrew Morton
2007-04-07  0:40         ` Jeremy Fitzhardinge
2007-04-07  1:21           ` Andrew Morton
2007-04-07  5:47             ` Jeremy Fitzhardinge
2007-04-09  2:36         ` William Lee Irwin III
2007-04-04 19:11 ` [patch 08/20] add hooks to intercept mm creation and destruction Jeremy Fitzhardinge
2007-04-04 19:12 ` [patch 09/20] rename struct paravirt_patch to paravirt_patch_site for clarity Jeremy Fitzhardinge
2007-04-06 23:18   ` Andrew Morton
2007-04-06 23:24     ` Jeremy Fitzhardinge
2007-04-04 19:12 ` [patch 10/20] Use patch site IDs computed from offset in paravirt_ops structure Jeremy Fitzhardinge
2007-04-04 19:12 ` [patch 11/20] Fix patch site clobbers to include return register Jeremy Fitzhardinge
2007-04-04 19:12 ` [patch 12/20] Consistently wrap paravirt ops callsites to make them patchable Jeremy Fitzhardinge
2007-04-04 19:12 ` [patch 13/20] Document asm-i386/paravirt.h Jeremy Fitzhardinge
2007-04-04 19:12 ` [patch 14/20] add common patching machinery Jeremy Fitzhardinge
2007-04-04 19:12 ` [patch 15/20] add flush_tlb_others paravirt_op Jeremy Fitzhardinge
2007-04-04 19:12 ` [patch 16/20] revert map_pt_hook Jeremy Fitzhardinge
2007-04-04 19:12 ` [patch 17/20] add kmap_atomic_pte for mapping highpte pages Jeremy Fitzhardinge
2007-04-04 19:12 ` [patch 18/20] clean up tsc-based sched_clock Jeremy Fitzhardinge
2007-04-06 23:22   ` Andrew Morton
2007-04-06 23:27     ` Jeremy Fitzhardinge
2007-04-06 23:45       ` Andrew Morton
2007-04-06 23:40     ` Jeremy Fitzhardinge
2007-04-04 19:12 ` [patch 19/20] Add a sched_clock paravirt_op Jeremy Fitzhardinge
2007-04-04 19:12 ` [patch 20/20] Add apply_to_page_range() which applies a function to a pte range Jeremy Fitzhardinge
2007-04-05  4:41   ` Matt Mackall
2007-04-05  6:52     ` Jeremy Fitzhardinge
2007-04-17 20:56       ` Matt Mackall
2007-04-19 19:44         ` Jeremy Fitzhardinge
2007-04-19 19:59           ` Matt Mackall
2007-04-19 21:37             ` Jeremy Fitzhardinge
2007-04-19 21:30               ` Matt Mackall
2007-04-19 22:30                 ` Jeremy Fitzhardinge
