public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/7] (Re-)introducing pvops for x86_64 - Real pvops work part
@ 2007-10-31 19:14 Glauber de Oliveira Costa
  2007-10-31 19:14 ` [PATCH 1/16] Wipe out traditional opt from x86_64 Makefile Glauber de Oliveira Costa
  0 siblings, 1 reply; 28+ messages in thread
From: Glauber de Oliveira Costa @ 2007-10-31 19:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, rusty, ak, mingo, chrisw, jeremy, avi, anthony,
	virtualization, lguest, kvm-devel, zach, tglx, jun.nakajima,
	glommer


Hey folks,

This is the part-of-pvops-implementation-that-is-not-exactly-a-merge. Neat,
uh? This is the majority of the work.

The first patch in the series does not really belong here. It was already
sent to lkml separetedly before, but I'm including it again, for a very
simple reason: Try to test the paravirt patches without it, and you'll fail
miserably ;-) (and it was not yet included).

Other than that, I thank you all in advance for the review.

Have fun!


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/16] Wipe out traditional opt from x86_64 Makefile
  2007-10-31 19:14 [PATCH 0/7] (Re-)introducing pvops for x86_64 - Real pvops work part Glauber de Oliveira Costa
@ 2007-10-31 19:14 ` Glauber de Oliveira Costa
  2007-10-31 19:14   ` [PATCH 2/16] paravirt hooks at entry functions Glauber de Oliveira Costa
  0 siblings, 1 reply; 28+ messages in thread
From: Glauber de Oliveira Costa @ 2007-10-31 19:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, rusty, ak, mingo, chrisw, jeremy, avi, anthony,
	virtualization, lguest, kvm-devel, zach, tglx, jun.nakajima,
	glommer, Glauber de Oliveira Costa

Among other things, using -traditional as a gcc option stops us from
using macro token pasting, which is a feature we heavily rely on.

There was still a use of -traditional in arch/x86/kernel/Makefile_64,
which this patch removes.

I don't see any problems building kernels in my x86_64 box without
-traditional.

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
---
 arch/x86/kernel/Makefile_64 |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/Makefile_64 b/arch/x86/kernel/Makefile_64
index 24671c3..74d3727 100644
--- a/arch/x86/kernel/Makefile_64
+++ b/arch/x86/kernel/Makefile_64
@@ -3,7 +3,6 @@
 #
 
 extra-y 	:= head_64.o head64.o init_task.o vmlinux.lds
-EXTRA_AFLAGS	:= -traditional
 obj-y	:= process_64.o signal_64.o entry_64.o traps_64.o irq_64.o \
 		ptrace_64.o time_64.o ioport_64.o ldt_64.o setup_64.o i8259_64.o sys_x86_64.o \
 		x8664_ksyms_64.o i387_64.o syscall_64.o vsyscall_64.o \
-- 
1.4.4.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 2/16] paravirt hooks at entry functions.
  2007-10-31 19:14 ` [PATCH 1/16] Wipe out traditional opt from x86_64 Makefile Glauber de Oliveira Costa
@ 2007-10-31 19:14   ` Glauber de Oliveira Costa
  2007-10-31 19:14     ` [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt Glauber de Oliveira Costa
  0 siblings, 1 reply; 28+ messages in thread
From: Glauber de Oliveira Costa @ 2007-10-31 19:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, rusty, ak, mingo, chrisw, jeremy, avi, anthony,
	virtualization, lguest, kvm-devel, zach, tglx, jun.nakajima,
	glommer, Glauber de Oliveira Costa, Steven Rostedt

Those are the hooks needed for paravirt at entry_64.S
In general, they follow the way of i386.

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com>
---
 arch/x86/kernel/entry_64.S |  108 +++++++++++++++++++++++++++-----------------
 1 files changed, 66 insertions(+), 42 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 3a058bb..b6d7008 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -50,6 +50,7 @@
 #include <asm/hw_irq.h>
 #include <asm/page.h>
 #include <asm/irqflags.h>
+#include <asm/paravirt.h>
 
 	.code64
 
@@ -57,6 +58,20 @@
 #define retint_kernel retint_restore_args
 #endif	
 
+#ifdef CONFIG_PARAVIRT
+ENTRY(native_irq_enable_syscall_ret)
+	movq	%gs:pda_oldrsp,%rsp
+	swapgs
+	sysretq
+/* 
+ * This could well be defined as a C function, but as it is only used here,
+ * let it be locally defined
+ */
+ENTRY(native_swapgs)
+	swapgs
+	retq
+#endif /* CONFIG_PARAVIRT */
+
 
 .macro TRACE_IRQS_IRETQ offset=ARGOFFSET
 #ifdef CONFIG_TRACE_IRQFLAGS
@@ -216,14 +231,21 @@ ENTRY(system_call)
 	CFI_DEF_CFA	rsp,PDA_STACKOFFSET
 	CFI_REGISTER	rip,rcx
 	/*CFI_REGISTER	rflags,r11*/
-	swapgs
+	SWAPGS_UNSAFE_STACK
+	/*
+	 * A hypervisor implementation might want to use a label
+	 * after the swapgs, so that it can do the swapgs
+	 * for the guest and jump here on syscall.
+	 */
+ENTRY(system_call_after_swapgs)
+
 	movq	%rsp,%gs:pda_oldrsp 
 	movq	%gs:pda_kernelstack,%rsp
 	/*
 	 * No need to follow this irqs off/on section - it's straight
 	 * and short:
 	 */
-	sti					
+	ENABLE_INTERRUPTS(CLBR_NONE)
 	SAVE_ARGS 8,1
 	movq  %rax,ORIG_RAX-ARGOFFSET(%rsp) 
 	movq  %rcx,RIP-ARGOFFSET(%rsp)
@@ -246,7 +268,7 @@ ret_from_sys_call:
 sysret_check:		
 	LOCKDEP_SYS_EXIT
 	GET_THREAD_INFO(%rcx)
-	cli
+	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
 	movl threadinfo_flags(%rcx),%edx
 	andl %edi,%edx
@@ -260,9 +282,7 @@ sysret_check:
 	CFI_REGISTER	rip,rcx
 	RESTORE_ARGS 0,-ARG_SKIP,1
 	/*CFI_REGISTER	rflags,r11*/
-	movq	%gs:pda_oldrsp,%rsp
-	swapgs
-	sysretq
+	ENABLE_INTERRUPTS_SYSCALL_RET
 
 	CFI_RESTORE_STATE
 	/* Handle reschedules */
@@ -271,7 +291,7 @@ sysret_careful:
 	bt $TIF_NEED_RESCHED,%edx
 	jnc sysret_signal
 	TRACE_IRQS_ON
-	sti
+	ENABLE_INTERRUPTS(CLBR_NONE)
 	pushq %rdi
 	CFI_ADJUST_CFA_OFFSET 8
 	call schedule
@@ -282,7 +302,7 @@ sysret_careful:
 	/* Handle a signal */ 
 sysret_signal:
 	TRACE_IRQS_ON
-	sti
+	ENABLE_INTERRUPTS(CLBR_NONE)
 	testl $(_TIF_SIGPENDING|_TIF_SINGLESTEP|_TIF_MCE_NOTIFY),%edx
 	jz    1f
 
@@ -295,7 +315,7 @@ sysret_signal:
 1:	movl $_TIF_NEED_RESCHED,%edi
 	/* Use IRET because user could have changed frame. This
 	   works because ptregscall_common has called FIXUP_TOP_OF_STACK. */
-	cli
+	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
 	jmp int_with_check
 	
@@ -327,7 +347,7 @@ tracesys:
  */
 	.globl int_ret_from_sys_call
 int_ret_from_sys_call:
-	cli
+	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
 	testl $3,CS-ARGOFFSET(%rsp)
 	je retint_restore_args
@@ -349,20 +369,20 @@ int_careful:
 	bt $TIF_NEED_RESCHED,%edx
 	jnc  int_very_careful
 	TRACE_IRQS_ON
-	sti
+	ENABLE_INTERRUPTS(CLBR_NONE)
 	pushq %rdi
 	CFI_ADJUST_CFA_OFFSET 8
 	call schedule
 	popq %rdi
 	CFI_ADJUST_CFA_OFFSET -8
-	cli
+	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
 	jmp int_with_check
 
 	/* handle signals and tracing -- both require a full stack frame */
 int_very_careful:
 	TRACE_IRQS_ON
-	sti
+	ENABLE_INTERRUPTS(CLBR_NONE)
 	SAVE_REST
 	/* Check for syscall exit trace */	
 	testl $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SINGLESTEP),%edx
@@ -385,7 +405,7 @@ int_signal:
 1:	movl $_TIF_NEED_RESCHED,%edi	
 int_restore_rest:
 	RESTORE_REST
-	cli
+	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
 	jmp int_with_check
 	CFI_ENDPROC
@@ -506,7 +526,7 @@ END(stub_rt_sigreturn)
 	CFI_DEF_CFA_REGISTER	rbp
 	testl $3,CS(%rdi)
 	je 1f
-	swapgs	
+	SWAPGS
 	/* irqcount is used to check if a CPU is already on an interrupt
 	   stack or not. While this is essentially redundant with preempt_count
 	   it is a little cheaper to use a separate counter in the PDA
@@ -527,7 +547,7 @@ ENTRY(common_interrupt)
 	interrupt do_IRQ
 	/* 0(%rsp): oldrsp-ARGOFFSET */
 ret_from_intr:
-	cli	
+	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
 	decl %gs:pda_irqcount
 	leaveq
@@ -556,13 +576,13 @@ retint_swapgs:		/* return to user-space */
 	/*
 	 * The iretq could re-enable interrupts:
 	 */
-	cli
+	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_IRETQ
-	swapgs 
+	SWAPGS
 	jmp restore_args
 
 retint_restore_args:	/* return to kernel space */
-	cli
+	DISABLE_INTERRUPTS(CLBR_ANY)
 	/*
 	 * The iretq could re-enable interrupts:
 	 */
@@ -570,10 +590,14 @@ retint_restore_args:	/* return to kernel space */
 restore_args:
 	RESTORE_ARGS 0,8,0						
 iret_label:	
+#ifdef CONFIG_PARAVIRT
+	INTERRUPT_RETURN
+#endif
+ENTRY(native_iret)
 	iretq
 
 	.section __ex_table,"a"
-	.quad iret_label,bad_iret	
+	.quad native_iret, bad_iret
 	.previous
 	.section .fixup,"ax"
 	/* force a signal here? this matches i386 behaviour */
@@ -581,24 +605,24 @@ iret_label:
 bad_iret:
 	movq $11,%rdi	/* SIGSEGV */
 	TRACE_IRQS_ON
-	sti
-	jmp do_exit			
-	.previous	
-	
+	ENABLE_INTERRUPTS(CLBR_ANY | ~(CLBR_RDI))
+	jmp do_exit
+	.previous
+
 	/* edi: workmask, edx: work */
 retint_careful:
 	CFI_RESTORE_STATE
 	bt    $TIF_NEED_RESCHED,%edx
 	jnc   retint_signal
 	TRACE_IRQS_ON
-	sti
+	ENABLE_INTERRUPTS(CLBR_NONE)
 	pushq %rdi
 	CFI_ADJUST_CFA_OFFSET	8
 	call  schedule
 	popq %rdi		
 	CFI_ADJUST_CFA_OFFSET	-8
 	GET_THREAD_INFO(%rcx)
-	cli
+	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
 	jmp retint_check
 	
@@ -606,14 +630,14 @@ retint_signal:
 	testl $(_TIF_SIGPENDING|_TIF_SINGLESTEP|_TIF_MCE_NOTIFY),%edx
 	jz    retint_swapgs
 	TRACE_IRQS_ON
-	sti
+	ENABLE_INTERRUPTS(CLBR_NONE)
 	SAVE_REST
 	movq $-1,ORIG_RAX(%rsp) 			
 	xorl %esi,%esi		# oldset
 	movq %rsp,%rdi		# &pt_regs
 	call do_notify_resume
 	RESTORE_REST
-	cli
+	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
 	movl $_TIF_NEED_RESCHED,%edi
 	GET_THREAD_INFO(%rcx)
@@ -731,7 +755,7 @@ END(spurious_interrupt)
 	rdmsr
 	testl %edx,%edx
 	js    1f
-	swapgs
+	SWAPGS
 	xorl  %ebx,%ebx
 1:
 	.if \ist
@@ -747,7 +771,7 @@ END(spurious_interrupt)
 	.if \ist
 	addq	$EXCEPTION_STKSZ, per_cpu__init_tss + TSS_ist + (\ist - 1) * 8(%rbp)
 	.endif
-	cli
+	DISABLE_INTERRUPTS(CLBR_NONE)
 	.if \irqtrace
 	TRACE_IRQS_OFF
 	.endif
@@ -776,10 +800,10 @@ paranoid_swapgs\trace:
 	.if \trace
 	TRACE_IRQS_IRETQ 0
 	.endif
-	swapgs
+	SWAPGS_UNSAFE_STACK
 paranoid_restore\trace:
 	RESTORE_ALL 8
-	iretq
+	INTERRUPT_RETURN
 paranoid_userspace\trace:
 	GET_THREAD_INFO(%rcx)
 	movl threadinfo_flags(%rcx),%ebx
@@ -794,11 +818,11 @@ paranoid_userspace\trace:
 	.if \trace
 	TRACE_IRQS_ON
 	.endif
-	sti
+	ENABLE_INTERRUPTS(CLBR_NONE)
 	xorl %esi,%esi 			/* arg2: oldset */
 	movq %rsp,%rdi 			/* arg1: &pt_regs */
 	call do_notify_resume
-	cli
+	DISABLE_INTERRUPTS(CLBR_NONE)
 	.if \trace
 	TRACE_IRQS_OFF
 	.endif
@@ -807,9 +831,9 @@ paranoid_schedule\trace:
 	.if \trace
 	TRACE_IRQS_ON
 	.endif
-	sti
+	ENABLE_INTERRUPTS(CLBR_ANY)
 	call schedule
-	cli
+	DISABLE_INTERRUPTS(CLBR_ANY)
 	.if \trace
 	TRACE_IRQS_OFF
 	.endif
@@ -862,7 +886,7 @@ KPROBE_ENTRY(error_entry)
 	testl $3,CS(%rsp)
 	je  error_kernelspace
 error_swapgs:	
-	swapgs
+	SWAPGS
 error_sti:	
 	movq %rdi,RDI(%rsp) 	
 	CFI_REL_OFFSET	rdi,RDI
@@ -874,7 +898,7 @@ error_sti:
 error_exit:
 	movl %ebx,%eax
 	RESTORE_REST
-	cli
+	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
 	GET_THREAD_INFO(%rcx)	
 	testl %eax,%eax
@@ -911,12 +935,12 @@ ENTRY(load_gs_index)
 	CFI_STARTPROC
 	pushf
 	CFI_ADJUST_CFA_OFFSET 8
-	cli
-        swapgs
+	DISABLE_INTERRUPTS(CLBR_ANY | ~(CLBR_RDI))
+        SWAPGS
 gs_change:     
         movl %edi,%gs   
 2:	mfence		/* workaround */
-	swapgs
+	SWAPGS
         popf
 	CFI_ADJUST_CFA_OFFSET -8
         ret
@@ -930,7 +954,7 @@ ENDPROC(load_gs_index)
         .section .fixup,"ax"
 	/* running with kernelgs */
 bad_gs: 
-	swapgs			/* switch back to user gs */
+	SWAPGS			/* switch back to user gs */
 	xorl %eax,%eax
         movl %eax,%gs
         jmp  2b
-- 
1.4.4.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt
  2007-10-31 19:14   ` [PATCH 2/16] paravirt hooks at entry functions Glauber de Oliveira Costa
@ 2007-10-31 19:14     ` Glauber de Oliveira Costa
  2007-10-31 19:14       ` [PATCH 4/16] provide native irq initialization function Glauber de Oliveira Costa
  2007-11-01  4:48       ` [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt Jeremy Fitzhardinge
  0 siblings, 2 replies; 28+ messages in thread
From: Glauber de Oliveira Costa @ 2007-10-31 19:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, rusty, ak, mingo, chrisw, jeremy, avi, anthony,
	virtualization, lguest, kvm-devel, zach, tglx, jun.nakajima,
	glommer, Glauber de Oliveira Costa, Steven Rostedt

This patch introduces, and patch callers when needed, native
versions for read/write_crX functions, clts and wbinvd.

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com>
---
 arch/x86/mm/pageattr_64.c   |    3 +-
 include/asm-x86/system_64.h |   60 ++++++++++++++++++++++++++++++------------
 2 files changed, 45 insertions(+), 18 deletions(-)

diff --git a/arch/x86/mm/pageattr_64.c b/arch/x86/mm/pageattr_64.c
index c40afba..59a52b0 100644
--- a/arch/x86/mm/pageattr_64.c
+++ b/arch/x86/mm/pageattr_64.c
@@ -12,6 +12,7 @@
 #include <asm/processor.h>
 #include <asm/tlbflush.h>
 #include <asm/io.h>
+#include <asm/paravirt.h>
 
 pte_t *lookup_address(unsigned long address)
 { 
@@ -77,7 +78,7 @@ static void flush_kernel_map(void *arg)
 	   much cheaper than WBINVD. */
 	/* clflush is still broken. Disable for now. */
 	if (1 || !cpu_has_clflush)
-		asm volatile("wbinvd" ::: "memory");
+		wbinvd();
 	else list_for_each_entry(pg, l, lru) {
 		void *adr = page_address(pg);
 		clflush_cache_range(adr, PAGE_SIZE);
diff --git a/include/asm-x86/system_64.h b/include/asm-x86/system_64.h
index 4cb2384..b558cb2 100644
--- a/include/asm-x86/system_64.h
+++ b/include/asm-x86/system_64.h
@@ -65,53 +65,62 @@ extern void load_gs_index(unsigned);
 /*
  * Clear and set 'TS' bit respectively
  */
-#define clts() __asm__ __volatile__ ("clts")
+static inline void native_clts(void)
+{
+	asm volatile ("clts");
+}
 
-static inline unsigned long read_cr0(void)
-{ 
+static inline unsigned long native_read_cr0(void)
+{
 	unsigned long cr0;
 	asm volatile("movq %%cr0,%0" : "=r" (cr0));
 	return cr0;
 }
 
-static inline void write_cr0(unsigned long val) 
-{ 
+static inline void native_write_cr0(unsigned long val)
+{
 	asm volatile("movq %0,%%cr0" :: "r" (val));
 }
 
-static inline unsigned long read_cr2(void)
+static inline unsigned long native_read_cr2(void)
 {
 	unsigned long cr2;
 	asm volatile("movq %%cr2,%0" : "=r" (cr2));
 	return cr2;
 }
 
-static inline void write_cr2(unsigned long val)
+static inline void native_write_cr2(unsigned long val)
 {
 	asm volatile("movq %0,%%cr2" :: "r" (val));
 }
 
-static inline unsigned long read_cr3(void)
-{ 
+static inline unsigned long native_read_cr3(void)
+{
 	unsigned long cr3;
 	asm volatile("movq %%cr3,%0" : "=r" (cr3));
 	return cr3;
 }
 
-static inline void write_cr3(unsigned long val)
+static inline void native_write_cr3(unsigned long val)
 {
 	asm volatile("movq %0,%%cr3" :: "r" (val) : "memory");
 }
 
-static inline unsigned long read_cr4(void)
-{ 
+static inline unsigned long native_read_cr4(void)
+{
 	unsigned long cr4;
 	asm volatile("movq %%cr4,%0" : "=r" (cr4));
 	return cr4;
 }
 
-static inline void write_cr4(unsigned long val)
-{ 
+static inline unsigned long native_read_cr4_safe(void)
+{
+	/* CR4 always exist */
+	return native_read_cr4();
+}
+
+static inline void native_write_cr4(unsigned long val)
+{
 	asm volatile("movq %0,%%cr4" :: "r" (val) : "memory");
 }
 
@@ -127,10 +136,27 @@ static inline void write_cr8(unsigned long val)
 	asm volatile("movq %0,%%cr8" :: "r" (val) : "memory");
 }
 
-#define stts() write_cr0(8 | read_cr0())
+static inline void native_wbinvd(void)
+{
+	asm volatile("wbinvd" ::: "memory");
+}
 
-#define wbinvd() \
-	__asm__ __volatile__ ("wbinvd": : :"memory")
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
+#define clts		native_clts
+#define wbinvd		native_wbinvd
+#define read_cr0 	native_read_cr0
+#define read_cr2 	native_read_cr2
+#define read_cr3 	native_read_cr3
+#define read_cr4 	native_read_cr4
+#define write_cr0	native_write_cr0
+#define write_cr2	native_write_cr2
+#define write_cr3	native_write_cr3
+#define write_cr4	native_write_cr4
+#endif
+
+#define stts() write_cr0(8 | read_cr0())
 
 #endif	/* __KERNEL__ */
 
-- 
1.4.4.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 4/16] provide native irq initialization function
  2007-10-31 19:14     ` [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt Glauber de Oliveira Costa
@ 2007-10-31 19:14       ` Glauber de Oliveira Costa
  2007-10-31 19:14         ` [PATCH 5/16] report ring kernel is running without paravirt Glauber de Oliveira Costa
  2007-11-01  4:48       ` [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt Jeremy Fitzhardinge
  1 sibling, 1 reply; 28+ messages in thread
From: Glauber de Oliveira Costa @ 2007-10-31 19:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, rusty, ak, mingo, chrisw, jeremy, avi, anthony,
	virtualization, lguest, kvm-devel, zach, tglx, jun.nakajima,
	glommer, Glauber de Oliveira Costa, Steven Rostedt

The interrupt initialization routine becomes native_init_IRQ and will
be overriden later in case paravirt is on. The interrupt array is made visible
for guests such lguest, that will need to have their own initialization
mechanism (though using most of the same irq lines) later on.

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com>
---
 arch/x86/kernel/i8259_64.c |    7 +++++--
 include/asm-x86/irq_64.h   |    3 +++
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/i8259_64.c b/arch/x86/kernel/i8259_64.c
index 3f27ea0..892eab8 100644
--- a/arch/x86/kernel/i8259_64.c
+++ b/arch/x86/kernel/i8259_64.c
@@ -76,7 +76,7 @@ BUILD_16_IRQS(0xc) BUILD_16_IRQS(0xd) BUILD_16_IRQS(0xe) BUILD_16_IRQS(0xf)
 	IRQ(x,c), IRQ(x,d), IRQ(x,e), IRQ(x,f)
 
 /* for the irq vectors */
-static void (*interrupt[NR_VECTORS - FIRST_EXTERNAL_VECTOR])(void) = {
+void (*interrupt[NR_VECTORS - FIRST_EXTERNAL_VECTOR])(void) = {
 					  IRQLIST_16(0x2), IRQLIST_16(0x3),
 	IRQLIST_16(0x4), IRQLIST_16(0x5), IRQLIST_16(0x6), IRQLIST_16(0x7),
 	IRQLIST_16(0x8), IRQLIST_16(0x9), IRQLIST_16(0xa), IRQLIST_16(0xb),
@@ -448,7 +448,10 @@ void __init init_ISA_irqs (void)
 	}
 }
 
-void __init init_IRQ(void)
+/* Overridden in paravirt.c */
+void init_IRQ(void) __attribute__((weak, alias("native_init_IRQ")));
+
+void __init native_init_IRQ(void)
 {
 	int i;
 
diff --git a/include/asm-x86/irq_64.h b/include/asm-x86/irq_64.h
index 5006c6e..4f02446 100644
--- a/include/asm-x86/irq_64.h
+++ b/include/asm-x86/irq_64.h
@@ -46,6 +46,9 @@ static __inline__ int irq_canonicalize(int irq)
 extern void fixup_irqs(cpumask_t map);
 #endif
 
+#include <linux/init.h>
+void native_init_IRQ(void);
+
 #define __ARCH_HAS_DO_SOFTIRQ 1
 
 #endif /* _ASM_IRQ_H */
-- 
1.4.4.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 5/16] report ring kernel is running without paravirt
  2007-10-31 19:14       ` [PATCH 4/16] provide native irq initialization function Glauber de Oliveira Costa
@ 2007-10-31 19:14         ` Glauber de Oliveira Costa
  2007-10-31 19:14           ` [PATCH 6/16] export math_state_restore Glauber de Oliveira Costa
  0 siblings, 1 reply; 28+ messages in thread
From: Glauber de Oliveira Costa @ 2007-10-31 19:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, rusty, ak, mingo, chrisw, jeremy, avi, anthony,
	virtualization, lguest, kvm-devel, zach, tglx, jun.nakajima,
	glommer, Glauber de Oliveira Costa, Steven Rostedt

When paravirtualization is disabled, the kernel is always
running at ring 0. So report it in the appropriate macro

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com>
---
 include/asm-x86/segment_64.h |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/asm-x86/segment_64.h b/include/asm-x86/segment_64.h
index 04b8ab2..240c1bf 100644
--- a/include/asm-x86/segment_64.h
+++ b/include/asm-x86/segment_64.h
@@ -50,4 +50,8 @@
 #define GDT_SIZE (GDT_ENTRIES * 8)
 #define TLS_SIZE (GDT_ENTRY_TLS_ENTRIES * 8) 
 
+#ifndef CONFIG_PARAVIRT
+#define get_kernel_rpl()  0
+#endif
+
 #endif
-- 
1.4.4.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 6/16] export math_state_restore
  2007-10-31 19:14         ` [PATCH 5/16] report ring kernel is running without paravirt Glauber de Oliveira Costa
@ 2007-10-31 19:14           ` Glauber de Oliveira Costa
  2007-10-31 19:14             ` [PATCH 7/16] native versions for set pagetables Glauber de Oliveira Costa
  0 siblings, 1 reply; 28+ messages in thread
From: Glauber de Oliveira Costa @ 2007-10-31 19:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, rusty, ak, mingo, chrisw, jeremy, avi, anthony,
	virtualization, lguest, kvm-devel, zach, tglx, jun.nakajima,
	glommer, Glauber de Oliveira Costa, Steven Rostedt

Export math_state_restore symbol, so it can be used for hypervisors.
They are commonly loaded as modules (lguest being an example).

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com>
---
 arch/x86/kernel/traps_64.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/traps_64.c b/arch/x86/kernel/traps_64.c
index d0c2bc7..a533ecd 100644
--- a/arch/x86/kernel/traps_64.c
+++ b/arch/x86/kernel/traps_64.c
@@ -1077,6 +1077,7 @@ asmlinkage void math_state_restore(void)
 	task_thread_info(me)->status |= TS_USEDFPU;
 	me->fpu_counter++;
 }
+EXPORT_SYMBOL_GPL(math_state_restore);
 
 void __init trap_init(void)
 {
-- 
1.4.4.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 7/16] native versions for set pagetables
  2007-10-31 19:14           ` [PATCH 6/16] export math_state_restore Glauber de Oliveira Costa
@ 2007-10-31 19:14             ` Glauber de Oliveira Costa
  2007-10-31 19:14               ` [PATCH 8/16] add native functions for descriptors handling Glauber de Oliveira Costa
  0 siblings, 1 reply; 28+ messages in thread
From: Glauber de Oliveira Costa @ 2007-10-31 19:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, rusty, ak, mingo, chrisw, jeremy, avi, anthony,
	virtualization, lguest, kvm-devel, zach, tglx, jun.nakajima,
	glommer, Glauber de Oliveira Costa, Steven Rostedt

This patch turns the set_p{te,md,ud,gd} functions into their
native_ versions. There is no need to patch any caller.

Also, it adds pte_update() and pte_update_defer() calls whenever
we modify a page table entry. This last part was coded to match
i386 as close as possible.

Pieces of the header are moved to below the #ifdef CONFIG_PARAVIRT
site, as they are users of the newly defined set_* macros.

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com>
---
 include/asm-x86/pgtable_64.h |  192 ++++++++++++++++++++++++++++--------------
 1 files changed, 128 insertions(+), 64 deletions(-)

diff --git a/include/asm-x86/pgtable_64.h b/include/asm-x86/pgtable_64.h
index 9b0ff47..592d613 100644
--- a/include/asm-x86/pgtable_64.h
+++ b/include/asm-x86/pgtable_64.h
@@ -57,56 +57,107 @@ extern unsigned long empty_zero_page[PAGE_SIZE/sizeof(unsigned long)];
  */
 #define PTRS_PER_PTE	512
 
-#ifndef __ASSEMBLY__
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
+
+#define set_pte native_set_pte
+#define set_pte_at(mm, addr, ptep, pteval) set_pte(ptep, pteval)
+#define set_pmd native_set_pmd
+#define set_pud native_set_pud
+#define set_pgd native_set_pgd
+#define pte_clear(mm, addr, xp)				\
+do { 							\
+	set_pte_at(mm, addr, xp, __pte(0)); 		\
+} while (0)
 
-#define pte_ERROR(e) \
-	printk("%s:%d: bad pte %p(%016lx).\n", __FILE__, __LINE__, &(e), pte_val(e))
-#define pmd_ERROR(e) \
-	printk("%s:%d: bad pmd %p(%016lx).\n", __FILE__, __LINE__, &(e), pmd_val(e))
-#define pud_ERROR(e) \
-	printk("%s:%d: bad pud %p(%016lx).\n", __FILE__, __LINE__, &(e), pud_val(e))
-#define pgd_ERROR(e) \
-	printk("%s:%d: bad pgd %p(%016lx).\n", __FILE__, __LINE__, &(e), pgd_val(e))
+#define pmd_clear(xp)	do { set_pmd(xp, __pmd(0)); } while (0)
+#define pud_clear native_pud_clear
+#define pgd_clear native_pgd_clear
+#define pte_update(mm, addr, ptep)              do { } while (0)
+#define pte_update_defer(mm, addr, ptep)        do { } while (0)
 
-#define pgd_none(x)	(!pgd_val(x))
-#define pud_none(x)	(!pud_val(x))
+#endif
 
-static inline void set_pte(pte_t *dst, pte_t val)
+#ifndef __ASSEMBLY__
+
+static inline void native_set_pte(pte_t *dst, pte_t val)
 {
-	pte_val(*dst) = pte_val(val);
+	dst->pte = pte_val(val);
 } 
-#define set_pte_at(mm,addr,ptep,pteval) set_pte(ptep,pteval)
 
-static inline void set_pmd(pmd_t *dst, pmd_t val)
+static inline void native_set_pmd(pmd_t *dst, pmd_t val)
 {
-        pmd_val(*dst) = pmd_val(val); 
+	dst->pmd = pmd_val(val);
 } 
 
-static inline void set_pud(pud_t *dst, pud_t val)
+static inline void native_set_pud(pud_t *dst, pud_t val)
 {
-	pud_val(*dst) = pud_val(val);
+	dst->pud = pud_val(val);
 }
 
-static inline void pud_clear (pud_t *pud)
+static inline void native_set_pgd(pgd_t *dst, pgd_t val)
 {
-	set_pud(pud, __pud(0));
+	dst->pgd = pgd_val(val);
 }
 
-static inline void set_pgd(pgd_t *dst, pgd_t val)
+static inline void native_pud_clear(pud_t *pud)
 {
-	pgd_val(*dst) = pgd_val(val); 
-} 
+	set_pud(pud, __pud(0));
+}
 
-static inline void pgd_clear (pgd_t * pgd)
+static inline void native_pgd_clear(pgd_t *pgd)
 {
 	set_pgd(pgd, __pgd(0));
 }
 
-#define ptep_get_and_clear(mm,addr,xp)	__pte(xchg(&(xp)->pte, 0))
+static inline void native_set_pte_at(struct mm_struct *mm, unsigned long addr,
+				     pte_t *ptep, pte_t pteval)
+{
+	native_set_pte(ptep, pteval);
+}
+
+static inline void native_pte_clear(struct mm_struct *mm, unsigned long addr,
+				    pte_t *ptep)
+{
+	native_set_pte_at(mm, addr, ptep, __pte(0));
+}
+
+static inline void native_pmd_clear(pmd_t *pmd)
+{
+	native_set_pmd(pmd, __pmd(0));
+}
+
+
+#define pte_ERROR(e) 					\
+	printk("%s:%d: bad pte %p(%016llx).\n",		\
+	__FILE__, __LINE__, &(e), (u64)pte_val(e))
+#define pmd_ERROR(e) 					\
+	printk("%s:%d: bad pmd %p(%016llx).\n", 	\
+	__FILE__, __LINE__, &(e), (u64)pmd_val(e))
+#define pud_ERROR(e) 					\
+	printk("%s:%d: bad pud %p(%016llx).\n",		\
+	 __FILE__, __LINE__, &(e), (u64)pud_val(e))
+#define pgd_ERROR(e) 					\
+	printk("%s:%d: bad pgd %p(%016llx).\n", 	\
+	__FILE__, __LINE__, &(e), (u64)pgd_val(e))
+
+#define pgd_none(x)	(!pgd_val(x))
+#define pud_none(x)	(!pud_val(x))
 
 struct mm_struct;
 
-static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm, unsigned long addr, pte_t *ptep, int full)
+static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
+				       unsigned long addr, pte_t *ptep)
+{
+	pte_t pte = __pte(xchg(&ptep->pte, 0));
+	pte_update(mm, addr, ptep);
+	return pte;
+}
+
+static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
+					    unsigned long addr, pte_t *ptep,
+					    int full)
 {
 	pte_t pte;
 	if (full) {
@@ -246,7 +297,6 @@ static inline unsigned long pmd_bad(pmd_t pmd)
 
 #define pte_none(x)	(!pte_val(x))
 #define pte_present(x)	(pte_val(x) & (_PAGE_PRESENT | _PAGE_PROTNONE))
-#define pte_clear(mm,addr,xp)	do { set_pte_at(mm, addr, xp, __pte(0)); } while (0)
 
 #define pages_to_mb(x) ((x) >> (20-PAGE_SHIFT))	/* FIXME: is this
 						   right? */
@@ -255,11 +305,11 @@ static inline unsigned long pmd_bad(pmd_t pmd)
 
 static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
 {
-	pte_t pte;
-	pte_val(pte) = (page_nr << PAGE_SHIFT);
-	pte_val(pte) |= pgprot_val(pgprot);
-	pte_val(pte) &= __supported_pte_mask;
-	return pte;
+	unsigned long pte;
+	pte = (page_nr << PAGE_SHIFT);
+	pte |= pgprot_val(pgprot);
+	pte &= __supported_pte_mask;
+	return __pte(pte);
 }
 
 /*
@@ -283,30 +333,6 @@ static inline pte_t pte_mkwrite(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) |
 static inline pte_t pte_mkhuge(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) | _PAGE_PSE)); return pte; }
 static inline pte_t pte_clrhuge(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_PSE)); return pte; }
 
-struct vm_area_struct;
-
-static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
-{
-	if (!pte_young(*ptep))
-		return 0;
-	return test_and_clear_bit(_PAGE_BIT_ACCESSED, &ptep->pte);
-}
-
-static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
-{
-	clear_bit(_PAGE_BIT_RW, &ptep->pte);
-}
-
-/*
- * Macro to mark a page protection value as "uncacheable".
- */
-#define pgprot_noncached(prot)	(__pgprot(pgprot_val(prot) | _PAGE_PCD | _PAGE_PWT))
-
-static inline int pmd_large(pmd_t pte) { 
-	return (pmd_val(pte) & __LARGE_PTE) == __LARGE_PTE; 
-} 	
-
-
 /*
  * Conversion functions: convert a page and protection to a page entry,
  * and a page entry and page directory to the page they refer to.
@@ -340,7 +366,6 @@ static inline int pmd_large(pmd_t pte) {
 			pmd_index(address))
 #define pmd_none(x)	(!pmd_val(x))
 #define pmd_present(x)	(pmd_val(x) & _PAGE_PRESENT)
-#define pmd_clear(xp)	do { set_pmd(xp, __pmd(0)); } while (0)
 #define pfn_pmd(nr,prot) (__pmd(((nr) << PAGE_SHIFT) | pgprot_val(prot)))
 #define pmd_pfn(x)  ((pmd_val(x) & __PHYSICAL_MASK) >> PAGE_SHIFT)
 
@@ -352,15 +377,53 @@ static inline int pmd_large(pmd_t pte) {
 
 /* page, protection -> pte */
 #define mk_pte(page, pgprot)	pfn_pte(page_to_pfn(page), (pgprot))
-#define mk_pte_huge(entry) (pte_val(entry) |= _PAGE_PRESENT | _PAGE_PSE)
- 
+
+static inline pte_t __mk_pte_huge(pte_t entry)
+{
+	unsigned long pte;
+	pte = pte_val(entry);
+	pte |= _PAGE_PRESENT | _PAGE_PSE;
+	return  __pte(pte);
+}
+#define mk_pte_huge(entry) ((entry) = __mk_pte_huge(entry))
+
+#include <linux/mm_types.h>
+static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
+					    unsigned long addr, pte_t *ptep)
+{
+	int ret = 0;
+	if (!pte_young(*ptep))
+		return 0;
+	ret = test_and_clear_bit(_PAGE_BIT_ACCESSED, &ptep->pte);
+	pte_update(vma->vm_mm, addr, ptep);
+	return ret;
+}
+
+static inline void ptep_set_wrprotect(struct mm_struct *mm,
+				      unsigned long addr, pte_t *ptep)
+{
+	clear_bit(_PAGE_BIT_RW, &ptep->pte);
+	pte_update(mm, addr, ptep);
+}
+
+/*
+ * Macro to mark a page protection value as "uncacheable".
+ */
+#define pgprot_noncached(prot)	(__pgprot(pgprot_val(prot) | _PAGE_PCD | _PAGE_PWT))
+
+static inline int pmd_large(pmd_t pte)
+{
+	return (pmd_val(pte) & __LARGE_PTE) == __LARGE_PTE;
+}
+
 /* Change flags of a PTE */
-static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
+static inline pte_t pte_modify(pte_t pte_old, pgprot_t newprot)
 { 
-	pte_val(pte) &= _PAGE_CHG_MASK;
-	pte_val(pte) |= pgprot_val(newprot);
-	pte_val(pte) &= __supported_pte_mask;
-       return pte; 
+	unsigned long pte = pte_val(pte_old);
+	pte &= _PAGE_CHG_MASK;
+	pte |= pgprot_val(newprot);
+	pte &= __supported_pte_mask;
+	return __pte(pte);
 }
 
 #define pte_index(address) \
@@ -387,6 +450,7 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 	int __changed = !pte_same(*(__ptep), __entry);			  \
 	if (__changed && __dirty) {					  \
 		set_pte(__ptep, __entry);			  	  \
+		pte_update_defer((__vma)->vm_mm, (__address), (__ptep));  \
 		flush_tlb_page(__vma, __address);		  	  \
 	}								  \
 	__changed;							  \
-- 
1.4.4.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 8/16] add native functions for descriptors handling
  2007-10-31 19:14             ` [PATCH 7/16] native versions for set pagetables Glauber de Oliveira Costa
@ 2007-10-31 19:14               ` Glauber de Oliveira Costa
  2007-10-31 19:14                 ` [PATCH 9/16] This patch add provisions for time related functions so they Glauber de Oliveira Costa
  0 siblings, 1 reply; 28+ messages in thread
From: Glauber de Oliveira Costa @ 2007-10-31 19:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, rusty, ak, mingo, chrisw, jeremy, avi, anthony,
	virtualization, lguest, kvm-devel, zach, tglx, jun.nakajima,
	glommer, Glauber de Oliveira Costa, Steven Rostedt

This patch turns the basic descriptor handling into native_
functions. It is basically write_idt, load_idt, write_gdt,
load_gdt, set_ldt, store_tr, load_tls, and the ones
for updating a single entry.

In the process of doing that, we change the definition of
load_LDT_nolock, and caller sites have to be patched. We
also patch call sites that now needs a typecast.

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com>
---
 arch/x86/kernel/ldt_64.c         |    7 +-
 include/asm-x86/desc_64.h        |  172 +++++++++++++++++++++++++-------------
 include/asm-x86/mmu_context_64.h |   23 ++++--
 3 files changed, 132 insertions(+), 70 deletions(-)

diff --git a/arch/x86/kernel/ldt_64.c b/arch/x86/kernel/ldt_64.c
index 60e57ab..ab1ff6e 100644
--- a/arch/x86/kernel/ldt_64.c
+++ b/arch/x86/kernel/ldt_64.c
@@ -171,7 +171,7 @@ static int write_ldt(void __user * ptr, unsigned long bytecount, int oldmode)
 {
 	struct task_struct *me = current;
 	struct mm_struct * mm = me->mm;
-	__u32 entry_1, entry_2, *lp;
+	__u32 entry_1, entry_2;
 	int error;
 	struct user_desc ldt_info;
 
@@ -200,7 +200,6 @@ static int write_ldt(void __user * ptr, unsigned long bytecount, int oldmode)
 			goto out_unlock;
 	}
 
-	lp = (__u32 *) ((ldt_info.entry_number << 3) + (char *) mm->context.ldt);
 
    	/* Allow LDTs to be cleared by the user. */
    	if (ldt_info.base_addr == 0 && ldt_info.limit == 0) {
@@ -218,8 +217,8 @@ static int write_ldt(void __user * ptr, unsigned long bytecount, int oldmode)
 
 	/* Install the new entry ...  */
 install:
-	*lp	= entry_1;
-	*(lp+1)	= entry_2;
+	write_ldt_entry(mm->context.ldt, ldt_info.entry_number,
+			entry_1, entry_2);
 	error = 0;
 
 out_unlock:
diff --git a/include/asm-x86/desc_64.h b/include/asm-x86/desc_64.h
index 7d9c938..f7008bb 100644
--- a/include/asm-x86/desc_64.h
+++ b/include/asm-x86/desc_64.h
@@ -16,20 +16,18 @@
 
 extern struct desc_struct cpu_gdt_table[GDT_ENTRIES];
 
-#define load_TR_desc() asm volatile("ltr %w0"::"r" (GDT_ENTRY_TSS*8))
-#define load_LDT_desc() asm volatile("lldt %w0"::"r" (GDT_ENTRY_LDT*8))
-#define clear_LDT()  asm volatile("lldt %w0"::"r" (0))
+static inline void native_load_tr_desc(void)
+{
+	asm volatile("ltr %w0"::"r" (GDT_ENTRY_TSS*8));
+}
 
-static inline unsigned long __store_tr(void)
+static inline unsigned long native_store_tr(void)
 {
        unsigned long tr;
-
        asm volatile ("str %w0":"=r" (tr));
        return tr;
 }
 
-#define store_tr(tr) (tr) = __store_tr()
-
 /*
  * This is the ldt that every process will get unless we need
  * something other than this.
@@ -41,62 +39,51 @@ extern struct desc_ptr cpu_gdt_descr[];
 /* the cpu gdt accessor */
 #define cpu_gdt(_cpu) ((struct desc_struct *)cpu_gdt_descr[_cpu].address)
 
-static inline void load_gdt(const struct desc_ptr *ptr)
+static inline void native_load_gdt(const struct desc_ptr *ptr)
 {
 	asm volatile("lgdt %w0"::"m" (*ptr));
 }
 
-static inline void store_gdt(struct desc_ptr *ptr)
+static inline void native_store_gdt(struct desc_ptr *ptr)
 {
-       asm("sgdt %w0":"=m" (*ptr));
+       asm volatile ("sgdt %w0":"=m" (*ptr));
 }
 
-static inline void _set_gate(void *adr, unsigned type, unsigned long func, unsigned dpl, unsigned ist)  
+static inline void native_write_idt_entry(void *adr, struct gate_struct *s)
 {
-	struct gate_struct s; 	
-	s.offset_low = PTR_LOW(func); 
-	s.segment = __KERNEL_CS;
-	s.ist = ist; 
-	s.p = 1;
-	s.dpl = dpl; 
-	s.zero0 = 0;
-	s.zero1 = 0; 
-	s.type = type; 
-	s.offset_middle = PTR_MIDDLE(func); 
-	s.offset_high = PTR_HIGH(func); 
-	/* does not need to be atomic because it is only done once at setup time */ 
-	memcpy(adr, &s, 16); 
-} 
-
-static inline void set_intr_gate(int nr, void *func) 
-{ 
-	BUG_ON((unsigned)nr > 0xFF);
-	_set_gate(&idt_table[nr], GATE_INTERRUPT, (unsigned long) func, 0, 0); 
-} 
+	/* does not need to be atomic because
+	 * it is only done once at setup time */
+	memcpy(adr, s, 16);
+}
 
-static inline void set_intr_gate_ist(int nr, void *func, unsigned ist) 
-{ 
-	BUG_ON((unsigned)nr > 0xFF);
-	_set_gate(&idt_table[nr], GATE_INTERRUPT, (unsigned long) func, 0, ist); 
-} 
+static inline void native_write_gdt_entry(void *ptr, void *entry,
+						unsigned type, unsigned size)
+{
+	memcpy(ptr, entry, size);
+}
 
-static inline void set_system_gate(int nr, void *func) 
-{ 
-	BUG_ON((unsigned)nr > 0xFF);
-	_set_gate(&idt_table[nr], GATE_INTERRUPT, (unsigned long) func, 3, 0); 
-} 
+/*
+ * This one unfortunately can't go with the others, in below, because it has
+ * an user anxious for its definition: set_tssldt_descriptor
+ */
+#ifndef CONFIG_PARAVIRT
+#define write_gdt_entry(_ptr, _e, _type, _size) \
+		native_write_gdt_entry((_ptr), (_e), (_type), (_size))
+#endif
 
-static inline void set_system_gate_ist(int nr, void *func, unsigned ist)
+static inline void write_dt_entry(struct desc_struct *dt,
+				  int entry, u32 entry_low, u32 entry_high)
 {
-	_set_gate(&idt_table[nr], GATE_INTERRUPT, (unsigned long) func, 3, ist);
+	((struct n_desc_struct *)dt)[entry].a = entry_low;
+	((struct n_desc_struct *)dt)[entry].b = entry_high;
 }
 
-static inline void load_idt(const struct desc_ptr *ptr)
+static inline void native_load_idt(const struct desc_ptr *ptr)
 {
 	asm volatile("lidt %w0"::"m" (*ptr));
 }
 
-static inline void store_idt(struct desc_ptr *dtr)
+static inline void native_store_idt(struct desc_ptr *dtr)
 {
        asm("sidt %w0":"=m" (*dtr));
 }
@@ -114,7 +101,7 @@ static inline void set_tssldt_descriptor(void *ptr, unsigned long tss, unsigned
 	d.limit1 = (size >> 16) & 0xF;
 	d.base2 = (PTR_MIDDLE(tss) >> 8) & 0xFF; 
 	d.base3 = PTR_HIGH(tss); 
-	memcpy(ptr, &d, 16); 
+	write_gdt_entry(ptr, &d, type, 16);
 }
 
 static inline void set_tss_desc(unsigned cpu, void *addr)
@@ -165,7 +152,7 @@ static inline void set_ldt_desc(unsigned cpu, void *addr, int size)
 	(info)->useable		== 0	&& \
 	(info)->lm		== 0)
 
-static inline void load_TLS(struct thread_struct *t, unsigned int cpu)
+static inline void native_load_tls(struct thread_struct *t, unsigned int cpu)
 {
 	unsigned int i;
 	u64 *gdt = (u64 *)(cpu_gdt(cpu) + GDT_ENTRY_TLS_MIN);
@@ -173,28 +160,93 @@ static inline void load_TLS(struct thread_struct *t, unsigned int cpu)
 	for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++)
 		gdt[i] = t->tls_array[i];
 } 
+static inline void native_set_ldt(const void *addr,
+				  unsigned int entries)
+{
+	if (likely(entries == 0))
+		__asm__ __volatile__ ("lldt %w0" :: "r" (0));
+	else {
+		unsigned cpu = smp_processor_id();
+
+		set_tssldt_descriptor(&cpu_gdt(cpu)[GDT_ENTRY_LDT],
+					(unsigned long)addr, DESC_LDT,
+					entries * 8 - 1);
+		__asm__ __volatile__ ("lldt %w0"::"r" (GDT_ENTRY_LDT*8));
+	}
+}
+
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
+#define load_TR_desc()	native_load_tr_desc()
+#define load_gdt(ptr)	native_load_gdt(ptr)
+#define load_idt(ptr)	native_load_idt(ptr)
+#define load_TLS(t, cpu) native_load_tls(t, cpu)
+#define set_ldt(addr, entries)	native_set_ldt(addr, entries)
+#define store_tr(tr) (tr) = native_store_tr()
+#define store_gdt(ptr)	native_store_gdt(ptr)
+#define store_idt(ptr)	native_store_idt(ptr)
+
+#define write_idt_entry(_adr, _s) native_write_idt_entry((_adr), (_s))
+#define write_ldt_entry(_ldt, _number, _entry1, _entry2) \
+		write_dt_entry((_ldt), (_number), (_entry1), (_entry2))
+#endif
+
+static inline void _set_gate(void *adr, unsigned type, unsigned long func,
+			     unsigned dpl, unsigned ist)
+{
+	struct gate_struct s;
+	s.offset_low = PTR_LOW(func);
+	s.segment = __KERNEL_CS;
+	s.ist = ist;
+	s.p = 1;
+	s.dpl = dpl;
+	s.zero0 = 0;
+	s.zero1 = 0;
+	s.type = type;
+	s.offset_middle = PTR_MIDDLE(func);
+	s.offset_high = PTR_HIGH(func);
+	write_idt_entry(adr, &s);
+}
+
+static inline void set_intr_gate(int nr, void *func)
+{
+	BUG_ON((unsigned)nr > 0xFF);
+	_set_gate(&idt_table[nr], GATE_INTERRUPT, (unsigned long) func, 0, 0);
+}
+
+static inline void set_intr_gate_ist(int nr, void *func, unsigned ist)
+{
+	BUG_ON((unsigned)nr > 0xFF);
+	_set_gate(&idt_table[nr], GATE_INTERRUPT, (unsigned long) func, 0, ist);
+}
+
+static inline void set_system_gate(int nr, void *func)
+{
+	BUG_ON((unsigned)nr > 0xFF);
+	_set_gate(&idt_table[nr], GATE_INTERRUPT, (unsigned long) func, 3, 0);
+}
+
+static inline void set_system_gate_ist(int nr, void *func, unsigned ist)
+{
+	_set_gate(&idt_table[nr], GATE_INTERRUPT, (unsigned long) func, 3, ist);
+}
+
+#define clear_LDT()  set_ldt(NULL, 0)
 
 /*
  * load one particular LDT into the current CPU
  */
-static inline void load_LDT_nolock (mm_context_t *pc, int cpu)
+static inline void load_LDT_nolock(mm_context_t *pc)
 {
-	int count = pc->size;
-
-	if (likely(!count)) {
-		clear_LDT();
-		return;
-	}
-		
-	set_ldt_desc(cpu, pc->ldt, count);
-	load_LDT_desc();
+	set_ldt(pc->ldt, pc->size);
 }
 
 static inline void load_LDT(mm_context_t *pc)
 {
-	int cpu = get_cpu();
-	load_LDT_nolock(pc, cpu);
-	put_cpu();
+	preempt_disable();
+	load_LDT_nolock(pc);
+	preempt_enable();
 }
 
 extern struct desc_ptr idt_descr;
diff --git a/include/asm-x86/mmu_context_64.h b/include/asm-x86/mmu_context_64.h
index 0cce83a..841ee33 100644
--- a/include/asm-x86/mmu_context_64.h
+++ b/include/asm-x86/mmu_context_64.h
@@ -7,7 +7,16 @@
 #include <asm/pda.h>
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
+
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
 #include <asm-generic/mm_hooks.h>
+static inline void paravirt_activate_mm(struct mm_struct *prev,
+					struct mm_struct *next)
+{
+}
+#endif /* CONFIG_PARAVIRT */
 
 /*
  * possibly do the LDT unload here?
@@ -25,7 +34,7 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 
 static inline void load_cr3(pgd_t *pgd)
 {
-	asm volatile("movq %0,%%cr3" :: "r" (__pa(pgd)) : "memory");
+	write_cr3(__pa(pgd));
 }
 
 static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, 
@@ -43,7 +52,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 		load_cr3(next->pgd);
 
 		if (unlikely(next->context.ldt != prev->context.ldt)) 
-			load_LDT_nolock(&next->context, cpu);
+			load_LDT_nolock(&next->context);
 	}
 #ifdef CONFIG_SMP
 	else {
@@ -56,7 +65,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 			 * to make sure to use no freed page tables.
 			 */
 			load_cr3(next->pgd);
-			load_LDT_nolock(&next->context, cpu);
+			load_LDT_nolock(&next->context);
 		}
 	}
 #endif
@@ -67,8 +76,10 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	asm volatile("movl %0,%%fs"::"r"(0));  \
 } while(0)
 
-#define activate_mm(prev, next) \
-	switch_mm((prev),(next),NULL)
-
+#define activate_mm(prev, next) 		\
+do {						\
+	paravirt_activate_mm(prev, next);	\
+	switch_mm((prev), (next), NULL);	\
+} while (0)
 
 #endif
-- 
1.4.4.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 9/16] This patch add provisions for time related functions so they
  2007-10-31 19:14               ` [PATCH 8/16] add native functions for descriptors handling Glauber de Oliveira Costa
@ 2007-10-31 19:14                 ` Glauber de Oliveira Costa
  2007-10-31 19:14                   ` [PATCH 10/16] export cpu_gdt_descr Glauber de Oliveira Costa
  0 siblings, 1 reply; 28+ messages in thread
From: Glauber de Oliveira Costa @ 2007-10-31 19:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, rusty, ak, mingo, chrisw, jeremy, avi, anthony,
	virtualization, lguest, kvm-devel, zach, tglx, jun.nakajima,
	glommer, Glauber de Oliveira Costa, Steven Rostedt

can be later replaced by paravirt versions.

it basically encloses {g,s}et_wallclock inside the
already existent functions update_persistent_clock and
read_persistent_clock, and defines {s,g}et_wallclock
to the core of such functions.

it also allow for a later-on-game time initialization, as done
by i386. Paravirt guests can set a function to do their own
initialization this way.

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com>
---
 arch/x86/kernel/time_64.c |   42 ++++++++++++++++++++++++++++--------------
 include/asm-x86/time.h    |   26 ++++++++++++++++++++++----
 2 files changed, 50 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/time_64.c b/arch/x86/kernel/time_64.c
index c821edc..85ec401 100644
--- a/arch/x86/kernel/time_64.c
+++ b/arch/x86/kernel/time_64.c
@@ -33,9 +33,12 @@
 #include <acpi/acpi_bus.h>
 #endif
 #include <asm/i8253.h>
+#include <asm/arch_hooks.h>
 #include <asm/pgtable.h>
 #include <asm/vsyscall.h>
+#include <asm/time.h>
 #include <asm/timex.h>
+#include <asm/timer.h>
 #include <asm/proto.h>
 #include <asm/hpet.h>
 #include <asm/sections.h>
@@ -77,18 +80,12 @@ EXPORT_SYMBOL(profile_pc);
  * sheet for details.
  */
 
-static int set_rtc_mmss(unsigned long nowtime)
+int do_set_rtc_mmss(unsigned long nowtime)
 {
 	int retval = 0;
 	int real_seconds, real_minutes, cmos_minutes;
 	unsigned char control, freq_select;
 
-/*
- * IRQs are disabled when we're called from the timer interrupt,
- * no need for spin_lock_irqsave()
- */
-
-	spin_lock(&rtc_lock);
 
 /*
  * Tell the clock it's being set and stop it.
@@ -138,14 +135,22 @@ static int set_rtc_mmss(unsigned long nowtime)
 	CMOS_WRITE(control, RTC_CONTROL);
 	CMOS_WRITE(freq_select, RTC_FREQ_SELECT);
 
-	spin_unlock(&rtc_lock);
-
 	return retval;
 }
 
 int update_persistent_clock(struct timespec now)
 {
-	return set_rtc_mmss(now.tv_sec);
+	int retval;
+
+/*
+ * IRQs are disabled when we're called from the timer interrupt,
+ * no need for spin_lock_irqsave()
+ */
+	spin_lock(&rtc_lock);
+	retval = set_wallclock(now.tv_sec);
+	spin_unlock(&rtc_lock);
+
+	return retval;
 }
 
 static irqreturn_t timer_event_interrupt(int irq, void *dev_id)
@@ -157,7 +162,7 @@ static irqreturn_t timer_event_interrupt(int irq, void *dev_id)
 	return IRQ_HANDLED;
 }
 
-unsigned long read_persistent_clock(void)
+unsigned long do_get_cmos_time(void)
 {
 	unsigned int year, mon, day, hour, min, sec;
 	unsigned long flags;
@@ -208,10 +213,15 @@ unsigned long read_persistent_clock(void)
 	return mktime(year, mon, day, hour, min, sec);
 }
 
+unsigned long read_persistent_clock(void)
+{
+	return get_wallclock();
+}
+
 /* calibrate_cpu is used on systems with fixed rate TSCs to determine
  * processor frequency */
 #define TICK_COUNT 100000000
-static unsigned int __init tsc_calibrate_cpu_khz(void)
+unsigned long __init native_calculate_cpu_khz(void)
 {
 	int tsc_start, tsc_now;
 	int i, no_ctr_free;
@@ -261,20 +271,23 @@ static struct irqaction irq0 = {
 	.name		= "timer"
 };
 
-void __init time_init(void)
+void __init hpet_time_init(void)
 {
 	if (!hpet_enable())
 		setup_pit_timer();
 
 	setup_irq(0, &irq0);
+}
 
+void __init time_init(void)
+{
 	tsc_calibrate();
 
 	cpu_khz = tsc_khz;
 	if (cpu_has(&boot_cpu_data, X86_FEATURE_CONSTANT_TSC) &&
 		boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
 		boot_cpu_data.x86 == 16)
-		cpu_khz = tsc_calibrate_cpu_khz();
+		cpu_khz = calculate_cpu_khz();
 
 	if (unsynchronized_tsc())
 		mark_tsc_unstable("TSCs unsynchronized");
@@ -287,4 +300,5 @@ void __init time_init(void)
 	printk(KERN_INFO "time.c: Detected %d.%03d MHz processor.\n",
 		cpu_khz / 1000, cpu_khz % 1000);
 	init_tsc_clocksource();
+	late_time_init = choose_time_init();
 }
diff --git a/include/asm-x86/time.h b/include/asm-x86/time.h
index eac0113..0d18c78 100644
--- a/include/asm-x86/time.h
+++ b/include/asm-x86/time.h
@@ -1,6 +1,10 @@
-#ifndef _ASMi386_TIME_H
-#define _ASMi386_TIME_H
+#ifndef _ASMX86_TIME_H
+#define _ASMX86_TIME_H
 
+extern void (*late_time_init)(void);
+extern void hpet_time_init(void);
+
+#ifndef CONFIG_X86_64
 #include <linux/efi.h>
 #include "mach_time.h"
 
@@ -28,8 +32,22 @@ static inline int native_set_wallclock(unsigned long nowtime)
 	return retval;
 }
 
-extern void (*late_time_init)(void);
-extern void hpet_time_init(void);
+#else
+extern unsigned long do_get_cmos_time(void);
+extern int do_set_rtc_mmss(unsigned long nowtime);
+extern void native_time_init_hook(void);
+
+static inline unsigned long native_get_wallclock(void)
+{
+	return do_get_cmos_time();
+}
+
+static inline int native_set_wallclock(unsigned long nowtime)
+{
+	return do_set_rtc_mmss(nowtime);
+}
+
+#endif
 
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
-- 
1.4.4.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 10/16] export cpu_gdt_descr
  2007-10-31 19:14                 ` [PATCH 9/16] This patch add provisions for time related functions so they Glauber de Oliveira Costa
@ 2007-10-31 19:14                   ` Glauber de Oliveira Costa
  2007-10-31 19:14                     ` [PATCH 11/16] turn priviled operation into a macro in head_64.S Glauber de Oliveira Costa
  0 siblings, 1 reply; 28+ messages in thread
From: Glauber de Oliveira Costa @ 2007-10-31 19:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, rusty, ak, mingo, chrisw, jeremy, avi, anthony,
	virtualization, lguest, kvm-devel, zach, tglx, jun.nakajima,
	glommer, Glauber de Oliveira Costa, Steven Rostedt

With paravirualization, hypervisors needs to handle the gdt,
that was right to this point only used at very early
inialization code. Hypervisors (lguest being the current case)
are commonly modules, so make it an export

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com>
---
 arch/x86/kernel/x8664_ksyms_64.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/x8664_ksyms_64.c b/arch/x86/kernel/x8664_ksyms_64.c
index 77c25b3..a005d67 100644
--- a/arch/x86/kernel/x8664_ksyms_64.c
+++ b/arch/x86/kernel/x8664_ksyms_64.c
@@ -8,6 +8,7 @@
 #include <asm/processor.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
+#include <asm/desc.h>
 
 EXPORT_SYMBOL(kernel_thread);
 
@@ -60,3 +61,8 @@ EXPORT_SYMBOL(init_level4_pgt);
 EXPORT_SYMBOL(load_gs_index);
 
 EXPORT_SYMBOL(_proxy_pda);
+
+#ifdef CONFIG_PARAVIRT
+/* Virtualized guests may want to use it */
+EXPORT_SYMBOL_GPL(cpu_gdt_descr);
+#endif
-- 
1.4.4.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 11/16] turn priviled operation into a macro in head_64.S
  2007-10-31 19:14                   ` [PATCH 10/16] export cpu_gdt_descr Glauber de Oliveira Costa
@ 2007-10-31 19:14                     ` Glauber de Oliveira Costa
  2007-10-31 19:14                       ` [PATCH 12/16] tweak io_64.h for paravirt Glauber de Oliveira Costa
  2007-11-01  4:50                       ` [PATCH 11/16] turn priviled operation into a macro in head_64.S Jeremy Fitzhardinge
  0 siblings, 2 replies; 28+ messages in thread
From: Glauber de Oliveira Costa @ 2007-10-31 19:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, rusty, ak, mingo, chrisw, jeremy, avi, anthony,
	virtualization, lguest, kvm-devel, zach, tglx, jun.nakajima,
	glommer, Glauber de Oliveira Costa, Steven Rostedt

under paravirt, read cr2 cannot be issued directly anymore.
So wrap it in a macro, defined to the operation itself in case
paravirt is off, but to something else if we have paravirt
in the game

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com>
---
 arch/x86/kernel/head_64.S |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index b6167fe..c31b1c9 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -19,6 +19,13 @@
 #include <asm/msr.h>
 #include <asm/cache.h>
 
+#ifdef CONFIG_PARAVIRT
+#include <asm/asm-offsets.h>
+#include <asm/paravirt.h>
+#else
+#define GET_CR2_INTO_RCX movq %cr2, %rcx
+#endif
+
 /* we are not able to switch in one step to the final KERNEL ADRESS SPACE
  * because we need identity-mapped pages.
  *
@@ -267,7 +274,7 @@ ENTRY(early_idt_handler)
 	xorl %eax,%eax
 	movq 8(%rsp),%rsi	# get rip
 	movq (%rsp),%rdx
-	movq %cr2,%rcx
+	GET_CR2_INTO_RCX
 	leaq early_idt_msg(%rip),%rdi
 	call early_printk
 	cmpl $2,early_recursion_flag(%rip)
-- 
1.4.4.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 12/16] tweak io_64.h for paravirt.
  2007-10-31 19:14                     ` [PATCH 11/16] turn priviled operation into a macro in head_64.S Glauber de Oliveira Costa
@ 2007-10-31 19:14                       ` Glauber de Oliveira Costa
  2007-10-31 19:14                         ` [PATCH 13/16] native versions for page table entries values Glauber de Oliveira Costa
  2007-11-01  4:50                       ` [PATCH 11/16] turn priviled operation into a macro in head_64.S Jeremy Fitzhardinge
  1 sibling, 1 reply; 28+ messages in thread
From: Glauber de Oliveira Costa @ 2007-10-31 19:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, rusty, ak, mingo, chrisw, jeremy, avi, anthony,
	virtualization, lguest, kvm-devel, zach, tglx, jun.nakajima,
	glommer, Glauber de Oliveira Costa, Steven Rostedt

We need something here because we can't call in and out instructions
directly. However, we have to be careful, because no indirections are
allowed in misc_64.c , and paravirt_ops is a kind of one. So just
call it directly there

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com>
---
 arch/x86/boot/compressed/misc_64.c |    6 +++++
 include/asm-x86/io_64.h            |   37 +++++++++++++++++++++++++++++------
 2 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/arch/x86/boot/compressed/misc_64.c b/arch/x86/boot/compressed/misc_64.c
index 6ea015a..6640a17 100644
--- a/arch/x86/boot/compressed/misc_64.c
+++ b/arch/x86/boot/compressed/misc_64.c
@@ -9,6 +9,12 @@
  * High loaded stuff by Hans Lermen & Werner Almesberger, Feb. 1996
  */
 
+/*
+ * we have to be careful, because no indirections are allowed here, and
+ * paravirt_ops is a kind of one. As it will only run in baremetal anyway,
+ * we just keep it from happening
+ */
+#undef CONFIG_PARAVIRT
 #define _LINUX_STRING_H_ 1
 #define __LINUX_BITMAP_H 1
 
diff --git a/include/asm-x86/io_64.h b/include/asm-x86/io_64.h
index a037b07..57fcdd9 100644
--- a/include/asm-x86/io_64.h
+++ b/include/asm-x86/io_64.h
@@ -35,12 +35,24 @@
   *  - Arnaldo Carvalho de Melo <acme@conectiva.com.br>
   */
 
-#define __SLOW_DOWN_IO "\noutb %%al,$0x80"
+static inline void native_io_delay(void)
+{
+	asm volatile("outb %%al,$0x80" : : : "memory");
+}
 
-#ifdef REALLY_SLOW_IO
-#define __FULL_SLOW_DOWN_IO __SLOW_DOWN_IO __SLOW_DOWN_IO __SLOW_DOWN_IO __SLOW_DOWN_IO
+#if defined(CONFIG_PARAVIRT)
+#include <asm/paravirt.h>
 #else
-#define __FULL_SLOW_DOWN_IO __SLOW_DOWN_IO
+
+static inline void slow_down_io(void)
+{
+	native_io_delay();
+#ifdef REALLY_SLOW_IO
+	native_io_delay();
+	native_io_delay();
+	native_io_delay();
+#endif
+}
 #endif
 
 /*
@@ -52,9 +64,15 @@ static inline void out##s(unsigned x value, unsigned short port) {
 #define __OUT2(s,s1,s2) \
 __asm__ __volatile__ ("out" #s " %" s1 "0,%" s2 "1"
 
+#ifndef REALLY_SLOW_IO
+#define REALLY_SLOW_IO
+#define UNSET_REALLY_SLOW_IO
+#endif
+
 #define __OUT(s,s1,x) \
 __OUT1(s,x) __OUT2(s,s1,"w") : : "a" (value), "Nd" (port)); } \
-__OUT1(s##_p,x) __OUT2(s,s1,"w") __FULL_SLOW_DOWN_IO : : "a" (value), "Nd" (port));} \
+__OUT1(s##_p, x) __OUT2(s, s1, "w") : : "a" (value), "Nd" (port)); \
+		slow_down_io(); }
 
 #define __IN1(s) \
 static inline RETURN_TYPE in##s(unsigned short port) { RETURN_TYPE _v;
@@ -63,8 +81,13 @@ static inline RETURN_TYPE in##s(unsigned short port) { RETURN_TYPE _v;
 __asm__ __volatile__ ("in" #s " %" s2 "1,%" s1 "0"
 
 #define __IN(s,s1,i...) \
-__IN1(s) __IN2(s,s1,"w") : "=a" (_v) : "Nd" (port) ,##i ); return _v; } \
-__IN1(s##_p) __IN2(s,s1,"w") __FULL_SLOW_DOWN_IO : "=a" (_v) : "Nd" (port) ,##i ); return _v; } \
+__IN1(s) __IN2(s, s1, "w") : "=a" (_v) : "Nd" (port), ##i); return _v; } \
+__IN1(s##_p) __IN2(s, s1, "w") : "=a" (_v) : "Nd" (port), ##i);	  \
+				slow_down_io(); return _v; }
+
+#ifdef UNSET_REALLY_SLOW_IO
+#undef REALLY_SLOW_IO
+#endif
 
 #define __INS(s) \
 static inline void ins##s(unsigned short port, void * addr, unsigned long count) \
-- 
1.4.4.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 13/16] native versions for page table entries values
  2007-10-31 19:14                       ` [PATCH 12/16] tweak io_64.h for paravirt Glauber de Oliveira Costa
@ 2007-10-31 19:14                         ` Glauber de Oliveira Costa
  2007-10-31 19:14                           ` [PATCH 14/16] prepare x86_64 architecture initialization for paravirt Glauber de Oliveira Costa
  0 siblings, 1 reply; 28+ messages in thread
From: Glauber de Oliveira Costa @ 2007-10-31 19:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, rusty, ak, mingo, chrisw, jeremy, avi, anthony,
	virtualization, lguest, kvm-devel, zach, tglx, jun.nakajima,
	glommer, Glauber de Oliveira Costa, Steven Rostedt

This patch turns the page operations (set and make a page table)
into native_ versions. The operations itself will be later
overriden by paravirt.

It uses unsigned long long for consistency with 32-bit. So we
have to fix fault_64.c to get rid of warnings.

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com>
---
 arch/x86/mm/fault_64.c    |    8 +++---
 include/asm-x86/page_64.h |   56 +++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 55 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/fault_64.c b/arch/x86/mm/fault_64.c
index 644b4f7..725c574 100644
--- a/arch/x86/mm/fault_64.c
+++ b/arch/x86/mm/fault_64.c
@@ -158,22 +158,22 @@ void dump_pagetable(unsigned long address)
 	pgd = __va((unsigned long)pgd & PHYSICAL_PAGE_MASK); 
 	pgd += pgd_index(address);
 	if (bad_address(pgd)) goto bad;
-	printk("PGD %lx ", pgd_val(*pgd));
+	printk("PGD %llx ", pgd_val(*pgd));
 	if (!pgd_present(*pgd)) goto ret; 
 
 	pud = pud_offset(pgd, address);
 	if (bad_address(pud)) goto bad;
-	printk("PUD %lx ", pud_val(*pud));
+	printk("PUD %llx ", pud_val(*pud));
 	if (!pud_present(*pud))	goto ret;
 
 	pmd = pmd_offset(pud, address);
 	if (bad_address(pmd)) goto bad;
-	printk("PMD %lx ", pmd_val(*pmd));
+	printk("PMD %llx ", pmd_val(*pmd));
 	if (!pmd_present(*pmd) || pmd_large(*pmd)) goto ret;
 
 	pte = pte_offset_kernel(pmd, address);
 	if (bad_address(pte)) goto bad;
-	printk("PTE %lx", pte_val(*pte)); 
+	printk("PTE %llx", pte_val(*pte));
 ret:
 	printk("\n");
 	return;
diff --git a/include/asm-x86/page_64.h b/include/asm-x86/page_64.h
index c3b52bc..71d5c27 100644
--- a/include/asm-x86/page_64.h
+++ b/include/asm-x86/page_64.h
@@ -64,16 +64,62 @@ typedef struct { unsigned long pgprot; } pgprot_t;
 
 extern unsigned long phys_base;
 
-#define pte_val(x)	((x).pte)
-#define pmd_val(x)	((x).pmd)
-#define pud_val(x)	((x).pud)
-#define pgd_val(x)	((x).pgd)
-#define pgprot_val(x)	((x).pgprot)
+static inline unsigned long long native_pte_val(pte_t pte)
+{
+	return pte.pte;
+}
+
+static inline unsigned long long native_pud_val(pud_t pud)
+{
+	return pud.pud;
+}
+
+
+static inline unsigned long long native_pmd_val(pmd_t pmd)
+{
+	return pmd.pmd;
+}
+
+static inline unsigned long long native_pgd_val(pgd_t pgd)
+{
+	return pgd.pgd;
+}
+
+static inline pte_t native_make_pte(unsigned long long pte)
+{
+	return (pte_t){ pte };
+}
+
+static inline pud_t native_make_pud(unsigned long long pud)
+{
+	return (pud_t){ pud };
+}
+
+static inline pmd_t native_make_pmd(unsigned long long pmd)
+{
+	return (pmd_t){ pmd };
+}
+
+static inline pgd_t native_make_pgd(unsigned long long pgd)
+{
+	return (pgd_t){ pgd };
+}
+
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
+#define pte_val(x)	native_pte_val(x)
+#define pmd_val(x)	native_pmd_val(x)
+#define pud_val(x)	native_pud_val(x)
+#define pgd_val(x)	native_pgd_val(x)
 
 #define __pte(x) ((pte_t) { (x) } )
 #define __pmd(x) ((pmd_t) { (x) } )
 #define __pud(x) ((pud_t) { (x) } )
 #define __pgd(x) ((pgd_t) { (x) } )
+#endif /* CONFIG_PARAVIRT */
+
+#define pgprot_val(x)	((x).pgprot)
 #define __pgprot(x)	((pgprot_t) { (x) } )
 
 #endif /* !__ASSEMBLY__ */
-- 
1.4.4.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 14/16] prepare x86_64 architecture initialization for paravirt
  2007-10-31 19:14                         ` [PATCH 13/16] native versions for page table entries values Glauber de Oliveira Costa
@ 2007-10-31 19:14                           ` Glauber de Oliveira Costa
  2007-10-31 19:15                             ` [PATCH 15/16] consolidation of paravirt for 32 and 64 bits Glauber de Oliveira Costa
  0 siblings, 1 reply; 28+ messages in thread
From: Glauber de Oliveira Costa @ 2007-10-31 19:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, rusty, ak, mingo, chrisw, jeremy, avi, anthony,
	virtualization, lguest, kvm-devel, zach, tglx, jun.nakajima,
	glommer, Glauber de Oliveira Costa, Steven Rostedt

This patch prepares the x86_64 architecture initialization for
paravirt. It requires a memory initialization step, which is done
by implementing 64-bit version for machine_specific_memory_setup,
and putting an ARCH_SETUP hook, for guest-dependent initialization.
This last step is done akin to i386

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com>
---
 arch/x86/kernel/e820_64.c  |    9 +++++++--
 arch/x86/kernel/setup_64.c |   28 +++++++++++++++++++++++++++-
 include/asm-x86/setup.h    |   11 ++++++++---
 3 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/e820_64.c b/arch/x86/kernel/e820_64.c
index 04698e0..c67526c 100644
--- a/arch/x86/kernel/e820_64.c
+++ b/arch/x86/kernel/e820_64.c
@@ -593,8 +593,10 @@ void early_panic(char *msg)
 	panic(msg);
 }
 
-void __init setup_memory_region(void)
+/* We're not void only for x86 32-bit compat */
+char * __init machine_specific_memory_setup(void)
 {
+	char *who = "BIOS-e820";
 	/*
 	 * Try to copy the BIOS-supplied E820-map.
 	 *
@@ -605,7 +607,10 @@ void __init setup_memory_region(void)
 	if (copy_e820_map(boot_params.e820_map, boot_params.e820_entries) < 0)
 		early_panic("Cannot find a valid memory map");
 	printk(KERN_INFO "BIOS-provided physical RAM map:\n");
-	e820_print_map("BIOS-e820");
+	e820_print_map(who);
+
+	/* In case someone cares... */
+	return who;
 }
 
 static int __init parse_memopt(char *p)
diff --git a/arch/x86/kernel/setup_64.c b/arch/x86/kernel/setup_64.c
index 238633d..44a11e3 100644
--- a/arch/x86/kernel/setup_64.c
+++ b/arch/x86/kernel/setup_64.c
@@ -39,6 +39,7 @@
 #include <linux/dmi.h>
 #include <linux/dma-mapping.h>
 #include <linux/ctype.h>
+#include <linux/uaccess.h>
 
 #include <asm/mtrr.h>
 #include <asm/uaccess.h>
@@ -60,6 +61,12 @@
 #include <asm/dmi.h>
 #include <asm/cacheflush.h>
 
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
+#define ARCH_SETUP
+#endif
+
 /*
  * Machine setup..
  */
@@ -241,6 +248,16 @@ static void discover_ebda(void)
 	 * 4K EBDA area at 0x40E
 	 */
 	ebda_addr = *(unsigned short *)__va(EBDA_ADDR_POINTER);
+	/*
+	 * There can be some situations, like paravirtualized guests,
+	 * in which there is no available ebda information. In such
+	 * case, just skip it
+	 */
+	if (!ebda_addr) {
+		ebda_size = 0;
+		return;
+	}
+
 	ebda_addr <<= 4;
 
 	ebda_size = *(unsigned short *)__va(ebda_addr);
@@ -254,6 +271,12 @@ static void discover_ebda(void)
 		ebda_size = 64*1024;
 }
 
+/* Overridden in paravirt.c if CONFIG_PARAVIRT */
+void __attribute__((weak)) memory_setup(void)
+{
+       machine_specific_memory_setup();
+}
+
 void __init setup_arch(char **cmdline_p)
 {
 	printk(KERN_INFO "Command line: %s\n", boot_command_line);
@@ -269,7 +292,10 @@ void __init setup_arch(char **cmdline_p)
 	rd_prompt = ((boot_params.hdr.ram_size & RAMDISK_PROMPT_FLAG) != 0);
 	rd_doload = ((boot_params.hdr.ram_size & RAMDISK_LOAD_FLAG) != 0);
 #endif
-	setup_memory_region();
+
+	ARCH_SETUP
+
+	memory_setup();
 	copy_edd();
 
 	if (!boot_params.hdr.root_flags)
diff --git a/include/asm-x86/setup.h b/include/asm-x86/setup.h
index 24d786e..071e054 100644
--- a/include/asm-x86/setup.h
+++ b/include/asm-x86/setup.h
@@ -3,6 +3,13 @@
 
 #define COMMAND_LINE_SIZE 2048
 
+#ifndef __ASSEMBLY__
+char *machine_specific_memory_setup(void);
+#ifndef CONFIG_PARAVIRT
+#define paravirt_post_allocator_init()	do {} while (0)
+#endif
+#endif /* __ASSEMBLY__ */
+
 #ifdef __KERNEL__
 
 #ifdef __i386__
@@ -51,9 +58,7 @@ void __init add_memory_region(unsigned long long start,
 
 extern unsigned long init_pg_tables_end;
 
-#ifndef CONFIG_PARAVIRT
-#define paravirt_post_allocator_init()	do {} while (0)
-#endif
+
 
 #endif /* __i386__ */
 #endif /* _SETUP */
-- 
1.4.4.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 15/16] consolidation of paravirt for 32 and 64 bits
  2007-10-31 19:14                           ` [PATCH 14/16] prepare x86_64 architecture initialization for paravirt Glauber de Oliveira Costa
@ 2007-10-31 19:15                             ` Glauber de Oliveira Costa
  2007-10-31 19:15                               ` [PATCH 16/16] make vsmp a paravirt client Glauber de Oliveira Costa
  0 siblings, 1 reply; 28+ messages in thread
From: Glauber de Oliveira Costa @ 2007-10-31 19:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, rusty, ak, mingo, chrisw, jeremy, avi, anthony,
	virtualization, lguest, kvm-devel, zach, tglx, jun.nakajima,
	glommer, Glauber de Oliveira Costa, Steven Rostedt

This patch unifies the paravirt ops structures for usage in
x86_64 and i386 architectures. Some new field had to be created
to accomodate the differences between the architectures.

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com>
---
 arch/x86/Kconfig.x86_64             |   11 +
 arch/x86/kernel/Makefile            |    3 +
 arch/x86/kernel/Makefile_32         |    1 -
 arch/x86/kernel/asm-offsets.c       |    8 +
 arch/x86/kernel/asm-offsets_32.c    |    8 -
 arch/x86/kernel/asm-offsets_64.c    |   22 +-
 arch/x86/kernel/paravirt.c          |  445 +++++++++++++++++++++++++++++++++
 arch/x86/kernel/paravirt_32.c       |  472 -----------------------------------
 arch/x86/kernel/paravirt_patch_32.c |   52 ++++
 arch/x86/kernel/paravirt_patch_64.c |   56 ++++
 arch/x86/kernel/vmlinux_64.lds.S    |    6 +
 include/asm-x86/paravirt.h          |  458 ++++++++++++++++++++++++++++------
 12 files changed, 971 insertions(+), 571 deletions(-)

diff --git a/arch/x86/Kconfig.x86_64 b/arch/x86/Kconfig.x86_64
index cc468ea..04734dd 100644
--- a/arch/x86/Kconfig.x86_64
+++ b/arch/x86/Kconfig.x86_64
@@ -372,6 +372,17 @@ config NODES_SHIFT
 
 # Dummy CONFIG option to select ACPI_NUMA from drivers/acpi/Kconfig.
 
+config PARAVIRT
+	bool
+	depends on EXPERIMENTAL
+	help
+	  Paravirtualization is a way of running multiple instances of
+	  Linux on the same machine, under a hypervisor.  This option
+	  changes the kernel so it can modify itself when it is run
+	  under a hypervisor, improving performance significantly.
+	  However, when run without a hypervisor the kernel is
+	  theoretically slower.  If in doubt, say N.
+
 config X86_64_ACPI_NUMA
        bool "ACPI NUMA detection"
        depends on NUMA
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 3857334..f444d0e 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -1,8 +1,11 @@
 ifeq ($(CONFIG_X86_32),y)
 include ${srctree}/arch/x86/kernel/Makefile_32
+obj-$(CONFIG_PARAVIRT)		+= paravirt_patch_32.o
 else
 include ${srctree}/arch/x86/kernel/Makefile_64
+obj-$(CONFIG_PARAVIRT)		+= paravirt_patch_64.o
 endif
+obj-$(CONFIG_PARAVIRT)		+= paravirt.o
 
 # Workaround to delete .lds files with make clean
 # The problem is that we do not enter Makefile_32 with make clean.
diff --git a/arch/x86/kernel/Makefile_32 b/arch/x86/kernel/Makefile_32
index b9d6798..f19e0d4 100644
--- a/arch/x86/kernel/Makefile_32
+++ b/arch/x86/kernel/Makefile_32
@@ -43,7 +43,6 @@ obj-$(CONFIG_K8_NB)		+= k8.o
 obj-$(CONFIG_MGEODE_LX)		+= geode_32.o mfgpt_32.o
 
 obj-$(CONFIG_VMI)		+= vmi_32.o vmiclock_32.o
-obj-$(CONFIG_PARAVIRT)		+= paravirt_32.o
 obj-y				+= pcspeaker.o
 
 obj-$(CONFIG_SCx200)		+= scx200_32.o
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index cfa82c8..25530d5 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -1,3 +1,11 @@
+#define DEFINE(sym, val) \
+        asm volatile("\n->" #sym " %0 " #val : : "i" (val))
+
+#define BLANK() asm volatile("\n->" : : )
+
+#define OFFSET(sym, str, mem) \
+	DEFINE(sym, offsetof(struct str, mem));
+
 #ifdef CONFIG_X86_32
 # include "asm-offsets_32.c"
 #else
diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index c1ccfab..f320b2d 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -25,14 +25,6 @@
 #include "../../../drivers/lguest/lg.h"
 #endif
 
-#define DEFINE(sym, val) \
-        asm volatile("\n->" #sym " %0 " #val : : "i" (val))
-
-#define BLANK() asm volatile("\n->" : : )
-
-#define OFFSET(sym, str, mem) \
-	DEFINE(sym, offsetof(struct str, mem));
-
 /* workaround for a warning with -Wmissing-prototypes */
 void foo(void);
 
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index d1b6ed9..3ef77dd 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -16,14 +16,7 @@
 #include <asm/thread_info.h>
 #include <asm/ia32.h>
 #include <asm/bootparam.h>
-
-#define DEFINE(sym, val) \
-        asm volatile("\n->" #sym " %0 " #val : : "i" (val))
-
-#define BLANK() asm volatile("\n->" : : )
-
-#define OFFSET(sym, str, mem) \
-	DEFINE(sym, offsetof(struct str, mem))
+#include <asm/paravirt.h>
 
 #define __NO_STUBS 1
 #undef __SYSCALL
@@ -76,6 +69,19 @@ int main(void)
 	       offsetof (struct rt_sigframe32, uc.uc_mcontext));
 	BLANK();
 #endif
+#ifdef CONFIG_PARAVIRT
+	OFFSET(PARAVIRT_enabled, pv_info, paravirt_enabled);
+	OFFSET(PARAVIRT_PATCH_pv_cpu_ops, paravirt_patch_template, pv_cpu_ops);
+	OFFSET(PARAVIRT_PATCH_pv_irq_ops, paravirt_patch_template, pv_irq_ops);
+	OFFSET(PARAVIRT_PATCH_pv_mmu_ops, paravirt_patch_template, pv_mmu_ops);
+	OFFSET(PV_IRQ_irq_disable, pv_irq_ops, irq_disable);
+	OFFSET(PV_IRQ_irq_enable, pv_irq_ops, irq_enable);
+	OFFSET(PV_CPU_iret, pv_cpu_ops, iret);
+	OFFSET(PV_CPU_irq_enable_syscall_ret, pv_cpu_ops, irq_enable_syscall_ret);
+	OFFSET(PV_MMU_read_cr2, pv_mmu_ops, read_cr2);
+	OFFSET(PV_CPU_swapgs, pv_cpu_ops, swapgs);
+	BLANK();
+#endif
 	DEFINE(pbe_address, offsetof(struct pbe, address));
 	DEFINE(pbe_orig_address, offsetof(struct pbe, orig_address));
 	DEFINE(pbe_next, offsetof(struct pbe, next));
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
new file mode 100644
index 0000000..c4c1e51
--- /dev/null
+++ b/arch/x86/kernel/paravirt.c
@@ -0,0 +1,445 @@
+/*  Paravirtualization interfaces
+    Copyright (C) 2006 Rusty Russell IBM Corporation
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License
+    along with this program; if not, write to the Free Software
+    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+
+    2007 - x86_64 support added by Glauber de Oliveira Costa
+*/
+#include <linux/errno.h>
+#include <linux/module.h>
+#include <linux/efi.h>
+#include <linux/bcd.h>
+#include <linux/highmem.h>
+#include <linux/pci_regs.h>
+#include <linux/pci_ids.h>
+
+#include <asm/bug.h>
+#include <asm/paravirt.h>
+#include <asm/desc.h>
+#include <asm/setup.h>
+#include <asm/arch_hooks.h>
+#include <asm/time.h>
+#include <asm/irq.h>
+#include <asm/delay.h>
+#include <asm/fixmap.h>
+#include <asm/apic.h>
+#include <asm/tlbflush.h>
+#include <asm/timer.h>
+#include <asm/io.h>
+#include <asm/pci-direct.h>
+
+
+/* nop stub */
+void _paravirt_nop(void)
+{
+}
+
+static void __init default_banner(void)
+{
+	printk(KERN_INFO "Booting paravirtualized kernel on %s\n",
+	       pv_info.name);
+}
+
+char *memory_setup(void)
+{
+	return pv_init_ops.memory_setup();
+}
+
+unsigned paravirt_patch_nop(void)
+{
+	return 0;
+}
+
+unsigned paravirt_patch_ignore(unsigned len)
+{
+	return len;
+}
+
+struct branch {
+	unsigned char opcode;
+	u32 delta;
+} __attribute__((packed));
+
+unsigned paravirt_patch_call(void *insnbuf,
+			     const void *target, u16 tgt_clobbers,
+			     unsigned long addr, u16 site_clobbers,
+			     unsigned len)
+{
+	struct branch *b = insnbuf;
+	unsigned long delta = (unsigned long)target - (addr+5);
+
+	if (tgt_clobbers & ~site_clobbers)
+		return len;	/* target would clobber too much for this site */
+	if (len < 5)
+		return len;	/* call too long for patch site */
+
+	b->opcode = 0xe8; /* call */
+	b->delta = delta;
+	BUILD_BUG_ON(sizeof(*b) != 5);
+
+	return 5;
+}
+
+unsigned paravirt_patch_jmp(void *insnbuf, const void *target,
+			    unsigned long addr, unsigned len)
+{
+	struct branch *b = insnbuf;
+	unsigned long delta = (unsigned long)target - (addr+5);
+
+	if (len < 5)
+		return len;	/* call too long for patch site */
+
+	b->opcode = 0xe9;	/* jmp */
+	b->delta = delta;
+
+	return 5;
+}
+
+/* Undefined instruction for dealing with missing ops pointers. */
+static const unsigned char ud2a[] = { 0x0f, 0x0b };
+
+/* Neat trick to map patch type back to the call within the
+ * corresponding structure. */
+static void *get_call_destination(u8 type)
+{
+	struct paravirt_patch_template tmpl = {
+		.pv_init_ops = pv_init_ops,
+		.pv_time_ops = pv_time_ops,
+		.pv_cpu_ops = pv_cpu_ops,
+		.pv_irq_ops = pv_irq_ops,
+		.pv_apic_ops = pv_apic_ops,
+		.pv_mmu_ops = pv_mmu_ops,
+	};
+	return *((void **)&tmpl + type);
+}
+
+unsigned paravirt_patch_default(u8 type, u16 clobbers, void *insnbuf,
+				unsigned long addr, unsigned len)
+{
+	void *opfunc = get_call_destination(type);
+	unsigned ret;
+
+	if (opfunc == NULL)
+		/* If there's no function, patch it with a ud2a (BUG) */
+		ret = paravirt_patch_insns(insnbuf, len, ud2a, ud2a+sizeof(ud2a));
+	else if (opfunc == paravirt_nop)
+		/* If the operation is a nop, then nop the callsite */
+		ret = paravirt_patch_nop();
+	else if (type == PARAVIRT_PATCH(pv_cpu_ops.iret) ||
+		 type == PARAVIRT_PATCH(pv_cpu_ops.irq_enable_syscall_ret))
+		/* If operation requires a jmp, then jmp */
+		ret = paravirt_patch_jmp(insnbuf, opfunc, addr, len);
+	else
+		/* Otherwise call the function; assume target could
+		   clobber any caller-save reg */
+		ret = paravirt_patch_call(insnbuf, opfunc, CLBR_ANY,
+					  addr, clobbers, len);
+
+	return ret;
+}
+
+unsigned paravirt_patch_insns(void *insnbuf, unsigned len,
+			      const char *start, const char *end)
+{
+	unsigned insn_len = end - start;
+
+	if (insn_len > len || start == NULL)
+		insn_len = len;
+	else
+		memcpy(insnbuf, start, insn_len);
+
+	return insn_len;
+}
+
+void init_IRQ(void)
+{
+	pv_irq_ops.init_IRQ();
+}
+
+static void native_flush_tlb(void)
+{
+	__native_flush_tlb();
+}
+
+/*
+ * Global pages have to be flushed a bit differently. Not a real
+ * performance problem because this does not happen often.
+ */
+static void native_flush_tlb_global(void)
+{
+	__native_flush_tlb_global();
+}
+
+static void native_flush_tlb_single(unsigned long addr)
+{
+	__native_flush_tlb_single(addr);
+}
+
+/* These are in entry.S */
+extern void native_iret(void);
+extern void native_irq_enable_syscall_ret(void);
+
+static int __init print_banner(void)
+{
+	pv_init_ops.banner();
+	return 0;
+}
+core_initcall(print_banner);
+
+static struct resource reserve_ioports = {
+	.start = 0,
+	.end = IO_SPACE_LIMIT,
+	.name = "paravirt-ioport",
+	.flags = IORESOURCE_IO | IORESOURCE_BUSY,
+};
+
+static struct resource reserve_iomem = {
+	.start = 0,
+	.end = -1,
+	.name = "paravirt-iomem",
+	.flags = IORESOURCE_MEM | IORESOURCE_BUSY,
+};
+
+/*
+ * Reserve the whole legacy IO space to prevent any legacy drivers
+ * from wasting time probing for their hardware.  This is a fairly
+ * brute-force approach to disabling all non-virtual drivers.
+ *
+ * Note that this must be called very early to have any effect.
+ */
+int paravirt_disable_iospace(void)
+{
+	int ret;
+
+	ret = request_resource(&ioport_resource, &reserve_ioports);
+	if (ret == 0) {
+		ret = request_resource(&iomem_resource, &reserve_iomem);
+		if (ret)
+			release_resource(&reserve_ioports);
+	}
+
+	return ret;
+}
+
+static DEFINE_PER_CPU(enum paravirt_lazy_mode, paravirt_lazy_mode) = PARAVIRT_LAZY_NONE;
+
+static inline void enter_lazy(enum paravirt_lazy_mode mode)
+{
+	BUG_ON(__get_cpu_var(paravirt_lazy_mode) != PARAVIRT_LAZY_NONE);
+	BUG_ON(preemptible());
+
+	__get_cpu_var(paravirt_lazy_mode) = mode;
+}
+
+void paravirt_leave_lazy(enum paravirt_lazy_mode mode)
+{
+	BUG_ON(__get_cpu_var(paravirt_lazy_mode) != mode);
+	BUG_ON(preemptible());
+
+	__get_cpu_var(paravirt_lazy_mode) = PARAVIRT_LAZY_NONE;
+}
+
+void paravirt_enter_lazy_mmu(void)
+{
+	enter_lazy(PARAVIRT_LAZY_MMU);
+}
+
+void paravirt_leave_lazy_mmu(void)
+{
+	paravirt_leave_lazy(PARAVIRT_LAZY_MMU);
+}
+
+void paravirt_enter_lazy_cpu(void)
+{
+	enter_lazy(PARAVIRT_LAZY_CPU);
+}
+
+void paravirt_leave_lazy_cpu(void)
+{
+	paravirt_leave_lazy(PARAVIRT_LAZY_CPU);
+}
+
+enum paravirt_lazy_mode paravirt_get_lazy_mode(void)
+{
+	return __get_cpu_var(paravirt_lazy_mode);
+}
+
+struct pv_info pv_info = {
+	.name = "bare hardware",
+	.paravirt_enabled = 0,
+	.kernel_rpl = 0,
+	.shared_kernel_pmd = 1,	/* Only used when CONFIG_X86_PAE is set */
+};
+
+struct pv_init_ops pv_init_ops = {
+	.patch = native_patch,
+	.banner = default_banner,
+	.arch_setup = paravirt_nop,
+	.memory_setup = machine_specific_memory_setup,
+};
+
+struct pv_time_ops pv_time_ops = {
+	.time_init = hpet_time_init,
+	.get_wallclock = native_get_wallclock,
+	.set_wallclock = native_set_wallclock,
+	.sched_clock = native_sched_clock,
+	.get_cpu_khz = native_calculate_cpu_khz,
+};
+
+struct pv_irq_ops pv_irq_ops = {
+	.init_IRQ = native_init_IRQ,
+	.save_fl = native_save_fl,
+	.restore_fl = native_restore_fl,
+	.irq_disable = native_irq_disable,
+	.irq_enable = native_irq_enable,
+	.safe_halt = native_safe_halt,
+	.halt = native_halt,
+};
+
+struct pv_cpu_ops pv_cpu_ops = {
+	.cpuid = native_cpuid,
+	.get_debugreg = native_get_debugreg,
+	.set_debugreg = native_set_debugreg,
+	.clts = native_clts,
+	.read_cr0 = native_read_cr0,
+	.write_cr0 = native_write_cr0,
+	.read_cr4 = native_read_cr4,
+	.read_cr4_safe = native_read_cr4_safe,
+	.write_cr4 = native_write_cr4,
+	.wbinvd = native_wbinvd,
+	.read_msr = native_read_msr_safe,
+	.write_msr = native_write_msr_safe,
+	.read_tsc = native_read_tsc,
+	.read_pmc = native_read_pmc,
+	.read_tscp = native_read_tscp,
+	.load_tr_desc = native_load_tr_desc,
+	.set_ldt = native_set_ldt,
+	.load_gdt = native_load_gdt,
+	.load_idt = native_load_idt,
+	.store_gdt = native_store_gdt,
+	.store_idt = native_store_idt,
+	.store_tr = native_store_tr,
+	.load_tls = native_load_tls,
+	.write_ldt_entry = write_dt_entry,
+#ifdef CONFIG_X86_32
+	.write_gdt_entry = write_dt_entry,
+	.write_idt_entry = write_dt_entry,
+#else
+	.write_gdt_entry = native_write_gdt_entry,
+	.write_idt_entry = native_write_idt_entry,
+#endif
+	.load_esp0 = native_load_esp0,
+
+	.irq_enable_syscall_ret = native_irq_enable_syscall_ret,
+	.iret = native_iret,
+	.swapgs = native_swapgs,
+
+	.set_iopl_mask = native_set_iopl_mask,
+	.io_delay = native_io_delay,
+
+	.lazy_mode = {
+		.enter = paravirt_nop,
+		.leave = paravirt_nop,
+	},
+};
+
+struct pv_apic_ops pv_apic_ops = {
+#ifdef CONFIG_X86_LOCAL_APIC
+	.apic_write = native_apic_write,
+	.apic_write_atomic = native_apic_write_atomic,
+	.apic_read = native_apic_read,
+	.setup_boot_clock = setup_boot_APIC_clock,
+	.setup_secondary_clock = setup_secondary_APIC_clock,
+	.startup_ipi_hook = paravirt_nop,
+#endif
+};
+
+struct pv_mmu_ops pv_mmu_ops = {
+#ifdef CONFIG_X86_32
+	.pagetable_setup_start = native_pagetable_setup_start,
+	.pagetable_setup_done = native_pagetable_setup_done,
+#else
+	.pagetable_setup_start = paravirt_nop,
+	.pagetable_setup_done = paravirt_nop,
+#endif
+
+	.read_cr2 = native_read_cr2,
+	.write_cr2 = native_write_cr2,
+	.read_cr3 = native_read_cr3,
+	.write_cr3 = native_write_cr3,
+
+	.flush_tlb_user = native_flush_tlb,
+	.flush_tlb_kernel = native_flush_tlb_global,
+	.flush_tlb_single = native_flush_tlb_single,
+	.flush_tlb_others = native_flush_tlb_others,
+
+	.alloc_pt = paravirt_nop,
+	.alloc_pd = paravirt_nop,
+	.alloc_pd_clone = paravirt_nop,
+	.release_pt = paravirt_nop,
+	.release_pd = paravirt_nop,
+
+	.set_pte = native_set_pte,
+	.set_pte_at = native_set_pte_at,
+	.set_pmd = native_set_pmd,
+	.pte_update = paravirt_nop,
+	.pte_update_defer = paravirt_nop,
+
+#ifdef CONFIG_HIGHPTE
+	.kmap_atomic_pte = kmap_atomic,
+#endif
+
+#ifdef CONFIG_X86_PAE
+	.set_pte_atomic = native_set_pte_atomic,
+	.set_pte_present = native_set_pte_present,
+#endif
+#if defined(CONFIG_X86_PAE) || defined(CONFIG_X86_64)
+	.set_pud = native_set_pud,
+	.pte_clear = native_pte_clear,
+	.pmd_clear = native_pmd_clear,
+	.pmd_val = native_pmd_val,
+	.make_pmd = native_make_pmd,
+#endif
+	.pte_val = native_pte_val,
+	.pgd_val = native_pgd_val,
+
+	.make_pte = native_make_pte,
+	.make_pgd = native_make_pgd,
+#ifdef CONFIG_X86_64
+	.set_pgd = native_set_pgd,
+
+	.pud_clear = native_pud_clear,
+	.pgd_clear = native_pgd_clear,
+
+	.pud_val = native_pud_val,
+
+	.make_pud = native_make_pud,
+#endif
+	.dup_mmap = paravirt_nop,
+	.exit_mmap = paravirt_nop,
+	.activate_mm = paravirt_nop,
+
+	.lazy_mode = {
+		.enter = paravirt_nop,
+		.leave = paravirt_nop,
+	},
+};
+
+EXPORT_SYMBOL_GPL(pv_time_ops);
+EXPORT_SYMBOL_GPL(pv_cpu_ops);
+EXPORT_SYMBOL_GPL(pv_mmu_ops);
+EXPORT_SYMBOL_GPL(pv_apic_ops);
+EXPORT_SYMBOL_GPL(pv_info);
+EXPORT_SYMBOL    (pv_irq_ops);
diff --git a/arch/x86/kernel/paravirt_32.c b/arch/x86/kernel/paravirt_32.c
deleted file mode 100644
index 04f51d0..0000000
--- a/arch/x86/kernel/paravirt_32.c
+++ /dev/null
@@ -1,472 +0,0 @@
-/*  Paravirtualization interfaces
-    Copyright (C) 2006 Rusty Russell IBM Corporation
-
-    This program is free software; you can redistribute it and/or modify
-    it under the terms of the GNU General Public License as published by
-    the Free Software Foundation; either version 2 of the License, or
-    (at your option) any later version.
-
-    This program is distributed in the hope that it will be useful,
-    but WITHOUT ANY WARRANTY; without even the implied warranty of
-    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-    GNU General Public License for more details.
-
-    You should have received a copy of the GNU General Public License
-    along with this program; if not, write to the Free Software
-    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
-*/
-#include <linux/errno.h>
-#include <linux/module.h>
-#include <linux/efi.h>
-#include <linux/bcd.h>
-#include <linux/highmem.h>
-
-#include <asm/bug.h>
-#include <asm/paravirt.h>
-#include <asm/desc.h>
-#include <asm/setup.h>
-#include <asm/arch_hooks.h>
-#include <asm/time.h>
-#include <asm/irq.h>
-#include <asm/delay.h>
-#include <asm/fixmap.h>
-#include <asm/apic.h>
-#include <asm/tlbflush.h>
-#include <asm/timer.h>
-
-/* nop stub */
-void _paravirt_nop(void)
-{
-}
-
-static void __init default_banner(void)
-{
-	printk(KERN_INFO "Booting paravirtualized kernel on %s\n",
-	       pv_info.name);
-}
-
-char *memory_setup(void)
-{
-	return pv_init_ops.memory_setup();
-}
-
-/* Simple instruction patching code. */
-#define DEF_NATIVE(ops, name, code)					\
-	extern const char start_##ops##_##name[], end_##ops##_##name[];	\
-	asm("start_" #ops "_" #name ": " code "; end_" #ops "_" #name ":")
-
-DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
-DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
-DEF_NATIVE(pv_irq_ops, restore_fl, "push %eax; popf");
-DEF_NATIVE(pv_irq_ops, save_fl, "pushf; pop %eax");
-DEF_NATIVE(pv_cpu_ops, iret, "iret");
-DEF_NATIVE(pv_cpu_ops, irq_enable_syscall_ret, "sti; sysexit");
-DEF_NATIVE(pv_mmu_ops, read_cr2, "mov %cr2, %eax");
-DEF_NATIVE(pv_mmu_ops, write_cr3, "mov %eax, %cr3");
-DEF_NATIVE(pv_mmu_ops, read_cr3, "mov %cr3, %eax");
-DEF_NATIVE(pv_cpu_ops, clts, "clts");
-DEF_NATIVE(pv_cpu_ops, read_tsc, "rdtsc");
-
-/* Undefined instruction for dealing with missing ops pointers. */
-static const unsigned char ud2a[] = { 0x0f, 0x0b };
-
-static unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
-			     unsigned long addr, unsigned len)
-{
-	const unsigned char *start, *end;
-	unsigned ret;
-
-	switch(type) {
-#define SITE(ops, x)						\
-	case PARAVIRT_PATCH(ops.x):				\
-		start = start_##ops##_##x;			\
-		end = end_##ops##_##x;				\
-		goto patch_site
-
-	SITE(pv_irq_ops, irq_disable);
-	SITE(pv_irq_ops, irq_enable);
-	SITE(pv_irq_ops, restore_fl);
-	SITE(pv_irq_ops, save_fl);
-	SITE(pv_cpu_ops, iret);
-	SITE(pv_cpu_ops, irq_enable_syscall_ret);
-	SITE(pv_mmu_ops, read_cr2);
-	SITE(pv_mmu_ops, read_cr3);
-	SITE(pv_mmu_ops, write_cr3);
-	SITE(pv_cpu_ops, clts);
-	SITE(pv_cpu_ops, read_tsc);
-#undef SITE
-
-	patch_site:
-		ret = paravirt_patch_insns(ibuf, len, start, end);
-		break;
-
-	default:
-		ret = paravirt_patch_default(type, clobbers, ibuf, addr, len);
-		break;
-	}
-
-	return ret;
-}
-
-unsigned paravirt_patch_nop(void)
-{
-	return 0;
-}
-
-unsigned paravirt_patch_ignore(unsigned len)
-{
-	return len;
-}
-
-struct branch {
-	unsigned char opcode;
-	u32 delta;
-} __attribute__((packed));
-
-unsigned paravirt_patch_call(void *insnbuf,
-			     const void *target, u16 tgt_clobbers,
-			     unsigned long addr, u16 site_clobbers,
-			     unsigned len)
-{
-	struct branch *b = insnbuf;
-	unsigned long delta = (unsigned long)target - (addr+5);
-
-	if (tgt_clobbers & ~site_clobbers)
-		return len;	/* target would clobber too much for this site */
-	if (len < 5)
-		return len;	/* call too long for patch site */
-
-	b->opcode = 0xe8; /* call */
-	b->delta = delta;
-	BUILD_BUG_ON(sizeof(*b) != 5);
-
-	return 5;
-}
-
-unsigned paravirt_patch_jmp(void *insnbuf, const void *target,
-			    unsigned long addr, unsigned len)
-{
-	struct branch *b = insnbuf;
-	unsigned long delta = (unsigned long)target - (addr+5);
-
-	if (len < 5)
-		return len;	/* call too long for patch site */
-
-	b->opcode = 0xe9;	/* jmp */
-	b->delta = delta;
-
-	return 5;
-}
-
-/* Neat trick to map patch type back to the call within the
- * corresponding structure. */
-static void *get_call_destination(u8 type)
-{
-	struct paravirt_patch_template tmpl = {
-		.pv_init_ops = pv_init_ops,
-		.pv_time_ops = pv_time_ops,
-		.pv_cpu_ops = pv_cpu_ops,
-		.pv_irq_ops = pv_irq_ops,
-		.pv_apic_ops = pv_apic_ops,
-		.pv_mmu_ops = pv_mmu_ops,
-	};
-	return *((void **)&tmpl + type);
-}
-
-unsigned paravirt_patch_default(u8 type, u16 clobbers, void *insnbuf,
-				unsigned long addr, unsigned len)
-{
-	void *opfunc = get_call_destination(type);
-	unsigned ret;
-
-	if (opfunc == NULL)
-		/* If there's no function, patch it with a ud2a (BUG) */
-		ret = paravirt_patch_insns(insnbuf, len, ud2a, ud2a+sizeof(ud2a));
-	else if (opfunc == paravirt_nop)
-		/* If the operation is a nop, then nop the callsite */
-		ret = paravirt_patch_nop();
-	else if (type == PARAVIRT_PATCH(pv_cpu_ops.iret) ||
-		 type == PARAVIRT_PATCH(pv_cpu_ops.irq_enable_syscall_ret))
-		/* If operation requires a jmp, then jmp */
-		ret = paravirt_patch_jmp(insnbuf, opfunc, addr, len);
-	else
-		/* Otherwise call the function; assume target could
-		   clobber any caller-save reg */
-		ret = paravirt_patch_call(insnbuf, opfunc, CLBR_ANY,
-					  addr, clobbers, len);
-
-	return ret;
-}
-
-unsigned paravirt_patch_insns(void *insnbuf, unsigned len,
-			      const char *start, const char *end)
-{
-	unsigned insn_len = end - start;
-
-	if (insn_len > len || start == NULL)
-		insn_len = len;
-	else
-		memcpy(insnbuf, start, insn_len);
-
-	return insn_len;
-}
-
-void init_IRQ(void)
-{
-	pv_irq_ops.init_IRQ();
-}
-
-static void native_flush_tlb(void)
-{
-	__native_flush_tlb();
-}
-
-/*
- * Global pages have to be flushed a bit differently. Not a real
- * performance problem because this does not happen often.
- */
-static void native_flush_tlb_global(void)
-{
-	__native_flush_tlb_global();
-}
-
-static void native_flush_tlb_single(unsigned long addr)
-{
-	__native_flush_tlb_single(addr);
-}
-
-/* These are in entry.S */
-extern void native_iret(void);
-extern void native_irq_enable_syscall_ret(void);
-
-static int __init print_banner(void)
-{
-	pv_init_ops.banner();
-	return 0;
-}
-core_initcall(print_banner);
-
-static struct resource reserve_ioports = {
-	.start = 0,
-	.end = IO_SPACE_LIMIT,
-	.name = "paravirt-ioport",
-	.flags = IORESOURCE_IO | IORESOURCE_BUSY,
-};
-
-static struct resource reserve_iomem = {
-	.start = 0,
-	.end = -1,
-	.name = "paravirt-iomem",
-	.flags = IORESOURCE_MEM | IORESOURCE_BUSY,
-};
-
-/*
- * Reserve the whole legacy IO space to prevent any legacy drivers
- * from wasting time probing for their hardware.  This is a fairly
- * brute-force approach to disabling all non-virtual drivers.
- *
- * Note that this must be called very early to have any effect.
- */
-int paravirt_disable_iospace(void)
-{
-	int ret;
-
-	ret = request_resource(&ioport_resource, &reserve_ioports);
-	if (ret == 0) {
-		ret = request_resource(&iomem_resource, &reserve_iomem);
-		if (ret)
-			release_resource(&reserve_ioports);
-	}
-
-	return ret;
-}
-
-static DEFINE_PER_CPU(enum paravirt_lazy_mode, paravirt_lazy_mode) = PARAVIRT_LAZY_NONE;
-
-static inline void enter_lazy(enum paravirt_lazy_mode mode)
-{
-	BUG_ON(x86_read_percpu(paravirt_lazy_mode) != PARAVIRT_LAZY_NONE);
-	BUG_ON(preemptible());
-
-	x86_write_percpu(paravirt_lazy_mode, mode);
-}
-
-void paravirt_leave_lazy(enum paravirt_lazy_mode mode)
-{
-	BUG_ON(x86_read_percpu(paravirt_lazy_mode) != mode);
-	BUG_ON(preemptible());
-
-	x86_write_percpu(paravirt_lazy_mode, PARAVIRT_LAZY_NONE);
-}
-
-void paravirt_enter_lazy_mmu(void)
-{
-	enter_lazy(PARAVIRT_LAZY_MMU);
-}
-
-void paravirt_leave_lazy_mmu(void)
-{
-	paravirt_leave_lazy(PARAVIRT_LAZY_MMU);
-}
-
-void paravirt_enter_lazy_cpu(void)
-{
-	enter_lazy(PARAVIRT_LAZY_CPU);
-}
-
-void paravirt_leave_lazy_cpu(void)
-{
-	paravirt_leave_lazy(PARAVIRT_LAZY_CPU);
-}
-
-enum paravirt_lazy_mode paravirt_get_lazy_mode(void)
-{
-	return x86_read_percpu(paravirt_lazy_mode);
-}
-
-struct pv_info pv_info = {
-	.name = "bare hardware",
-	.paravirt_enabled = 0,
-	.kernel_rpl = 0,
-	.shared_kernel_pmd = 1,	/* Only used when CONFIG_X86_PAE is set */
-};
-
-struct pv_init_ops pv_init_ops = {
-	.patch = native_patch,
-	.banner = default_banner,
-	.arch_setup = paravirt_nop,
-	.memory_setup = machine_specific_memory_setup,
-};
-
-struct pv_time_ops pv_time_ops = {
-	.time_init = hpet_time_init,
-	.get_wallclock = native_get_wallclock,
-	.set_wallclock = native_set_wallclock,
-	.sched_clock = native_sched_clock,
-	.get_cpu_khz = native_calculate_cpu_khz,
-};
-
-struct pv_irq_ops pv_irq_ops = {
-	.init_IRQ = native_init_IRQ,
-	.save_fl = native_save_fl,
-	.restore_fl = native_restore_fl,
-	.irq_disable = native_irq_disable,
-	.irq_enable = native_irq_enable,
-	.safe_halt = native_safe_halt,
-	.halt = native_halt,
-};
-
-struct pv_cpu_ops pv_cpu_ops = {
-	.cpuid = native_cpuid,
-	.get_debugreg = native_get_debugreg,
-	.set_debugreg = native_set_debugreg,
-	.clts = native_clts,
-	.read_cr0 = native_read_cr0,
-	.write_cr0 = native_write_cr0,
-	.read_cr4 = native_read_cr4,
-	.read_cr4_safe = native_read_cr4_safe,
-	.write_cr4 = native_write_cr4,
-	.wbinvd = native_wbinvd,
-	.read_msr = native_read_msr_safe,
-	.write_msr = native_write_msr_safe,
-	.read_tsc = native_read_tsc,
-	.read_pmc = native_read_pmc,
-	.load_tr_desc = native_load_tr_desc,
-	.set_ldt = native_set_ldt,
-	.load_gdt = native_load_gdt,
-	.load_idt = native_load_idt,
-	.store_gdt = native_store_gdt,
-	.store_idt = native_store_idt,
-	.store_tr = native_store_tr,
-	.load_tls = native_load_tls,
-	.write_ldt_entry = write_dt_entry,
-	.write_gdt_entry = write_dt_entry,
-	.write_idt_entry = write_dt_entry,
-	.load_esp0 = native_load_esp0,
-
-	.irq_enable_syscall_ret = native_irq_enable_syscall_ret,
-	.iret = native_iret,
-
-	.set_iopl_mask = native_set_iopl_mask,
-	.io_delay = native_io_delay,
-
-	.lazy_mode = {
-		.enter = paravirt_nop,
-		.leave = paravirt_nop,
-	},
-};
-
-struct pv_apic_ops pv_apic_ops = {
-#ifdef CONFIG_X86_LOCAL_APIC
-	.apic_write = native_apic_write,
-	.apic_write_atomic = native_apic_write_atomic,
-	.apic_read = native_apic_read,
-	.setup_boot_clock = setup_boot_APIC_clock,
-	.setup_secondary_clock = setup_secondary_APIC_clock,
-	.startup_ipi_hook = paravirt_nop,
-#endif
-};
-
-struct pv_mmu_ops pv_mmu_ops = {
-	.pagetable_setup_start = native_pagetable_setup_start,
-	.pagetable_setup_done = native_pagetable_setup_done,
-
-	.read_cr2 = native_read_cr2,
-	.write_cr2 = native_write_cr2,
-	.read_cr3 = native_read_cr3,
-	.write_cr3 = native_write_cr3,
-
-	.flush_tlb_user = native_flush_tlb,
-	.flush_tlb_kernel = native_flush_tlb_global,
-	.flush_tlb_single = native_flush_tlb_single,
-	.flush_tlb_others = native_flush_tlb_others,
-
-	.alloc_pt = paravirt_nop,
-	.alloc_pd = paravirt_nop,
-	.alloc_pd_clone = paravirt_nop,
-	.release_pt = paravirt_nop,
-	.release_pd = paravirt_nop,
-
-	.set_pte = native_set_pte,
-	.set_pte_at = native_set_pte_at,
-	.set_pmd = native_set_pmd,
-	.pte_update = paravirt_nop,
-	.pte_update_defer = paravirt_nop,
-
-#ifdef CONFIG_HIGHPTE
-	.kmap_atomic_pte = kmap_atomic,
-#endif
-
-#ifdef CONFIG_X86_PAE
-	.set_pte_atomic = native_set_pte_atomic,
-	.set_pte_present = native_set_pte_present,
-	.set_pud = native_set_pud,
-	.pte_clear = native_pte_clear,
-	.pmd_clear = native_pmd_clear,
-
-	.pmd_val = native_pmd_val,
-	.make_pmd = native_make_pmd,
-#endif
-
-	.pte_val = native_pte_val,
-	.pgd_val = native_pgd_val,
-
-	.make_pte = native_make_pte,
-	.make_pgd = native_make_pgd,
-
-	.dup_mmap = paravirt_nop,
-	.exit_mmap = paravirt_nop,
-	.activate_mm = paravirt_nop,
-
-	.lazy_mode = {
-		.enter = paravirt_nop,
-		.leave = paravirt_nop,
-	},
-};
-
-EXPORT_SYMBOL_GPL(pv_time_ops);
-EXPORT_SYMBOL_GPL(pv_cpu_ops);
-EXPORT_SYMBOL_GPL(pv_mmu_ops);
-EXPORT_SYMBOL_GPL(pv_apic_ops);
-EXPORT_SYMBOL_GPL(pv_info);
-EXPORT_SYMBOL    (pv_irq_ops);
diff --git a/arch/x86/kernel/paravirt_patch_32.c b/arch/x86/kernel/paravirt_patch_32.c
new file mode 100644
index 0000000..46ae585
--- /dev/null
+++ b/arch/x86/kernel/paravirt_patch_32.c
@@ -0,0 +1,52 @@
+#include <asm/paravirt.h>
+
+DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
+DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
+DEF_NATIVE(pv_irq_ops, restore_fl, "push %eax; popf");
+DEF_NATIVE(pv_irq_ops, save_fl, "pushf; pop %eax");
+DEF_NATIVE(pv_cpu_ops, iret, "iret");
+DEF_NATIVE(pv_cpu_ops, irq_enable_syscall_ret, "sti; sysexit");
+DEF_NATIVE(pv_mmu_ops, read_cr2, "mov %cr2, %eax");
+DEF_NATIVE(pv_mmu_ops, write_cr3, "mov %eax, %cr3");
+DEF_NATIVE(pv_mmu_ops, read_cr3, "mov %cr3, %eax");
+DEF_NATIVE(pv_cpu_ops, clts, "clts");
+DEF_NATIVE(pv_cpu_ops, read_tsc, "rdtsc");
+
+/* Undefined instruction for dealing with missing ops pointers. */
+static const unsigned char ud2a[] = { 0x0f, 0x0b };
+
+unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
+		      unsigned long addr, unsigned len)
+{
+	const unsigned char *start, *end;
+	unsigned ret;
+
+#define PATCH_SITE(ops, x)					\
+		case PARAVIRT_PATCH(ops.x):			\
+			start = start_##ops##_##x;		\
+			end = end_##ops##_##x;			\
+			goto patch_site
+	switch(type) {
+		PATCH_SITE(pv_irq_ops, irq_disable);
+		PATCH_SITE(pv_irq_ops, irq_enable);
+		PATCH_SITE(pv_irq_ops, restore_fl);
+		PATCH_SITE(pv_irq_ops, save_fl);
+		PATCH_SITE(pv_cpu_ops, iret);
+		PATCH_SITE(pv_cpu_ops, irq_enable_syscall_ret);
+		PATCH_SITE(pv_mmu_ops, read_cr2);
+		PATCH_SITE(pv_mmu_ops, read_cr3);
+		PATCH_SITE(pv_mmu_ops, write_cr3);
+		PATCH_SITE(pv_cpu_ops, clts);
+		PATCH_SITE(pv_cpu_ops, read_tsc);
+
+	patch_site:
+		ret = paravirt_patch_insns(ibuf, len, start, end);
+		break;
+
+	default:
+		ret = paravirt_patch_default(type, clobbers, ibuf, addr, len);
+		break;
+	}
+#undef PATCH_SITE
+	return ret;
+}
diff --git a/arch/x86/kernel/paravirt_patch_64.c b/arch/x86/kernel/paravirt_patch_64.c
new file mode 100644
index 0000000..cbfc4f3
--- /dev/null
+++ b/arch/x86/kernel/paravirt_patch_64.c
@@ -0,0 +1,56 @@
+#include <asm/paravirt.h>
+#include <asm/asm-offsets.h>
+
+DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
+DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
+DEF_NATIVE(pv_irq_ops, restore_fl, "pushq %rdi; popfq");
+DEF_NATIVE(pv_irq_ops, save_fl, "pushfq; popq %rax");
+DEF_NATIVE(pv_cpu_ops, iret, "iretq");
+DEF_NATIVE(pv_mmu_ops, read_cr2, "movq %cr2, %rax");
+DEF_NATIVE(pv_mmu_ops, read_cr3, "movq %cr3, %rax");
+DEF_NATIVE(pv_mmu_ops, write_cr3, "movq %rdi, %cr3");
+DEF_NATIVE(pv_mmu_ops, flush_tlb_single, "invlpg (%rdi)");
+DEF_NATIVE(pv_cpu_ops, clts, "clts");
+DEF_NATIVE(pv_cpu_ops, wbinvd, "wbinvd");
+
+/* the three commands give us more control to how to return from a syscall */
+DEF_NATIVE(pv_cpu_ops, irq_enable_syscall_ret, "movq %gs:" __stringify(pda_oldrsp) ", %rsp; swapgs; sysretq;");
+DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs");
+
+unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
+		      unsigned long addr, unsigned len)
+{
+	const unsigned char *start, *end;
+	unsigned ret;
+
+#define PATCH_SITE(ops, x)					\
+		case PARAVIRT_PATCH(ops.x):			\
+			start = start_##ops##_##x;		\
+			end = end_##ops##_##x;			\
+			goto patch_site
+	switch(type) {
+		PATCH_SITE(pv_irq_ops, restore_fl);
+		PATCH_SITE(pv_irq_ops, save_fl);
+		PATCH_SITE(pv_irq_ops, irq_enable);
+		PATCH_SITE(pv_irq_ops, irq_disable);
+		PATCH_SITE(pv_cpu_ops, iret);
+		PATCH_SITE(pv_cpu_ops, irq_enable_syscall_ret);
+		PATCH_SITE(pv_cpu_ops, swapgs);
+		PATCH_SITE(pv_mmu_ops, read_cr2);
+		PATCH_SITE(pv_mmu_ops, read_cr3);
+		PATCH_SITE(pv_mmu_ops, write_cr3);
+		PATCH_SITE(pv_cpu_ops, clts);
+		PATCH_SITE(pv_mmu_ops, flush_tlb_single);
+		PATCH_SITE(pv_cpu_ops, wbinvd);
+
+	patch_site:
+		ret = paravirt_patch_insns(ibuf, len, start, end);
+		break;
+
+	default:
+		ret = paravirt_patch_default(type, clobbers, ibuf, addr, len);
+		break;
+	}
+#undef PATCH_SITE
+	return ret;
+}
diff --git a/arch/x86/kernel/vmlinux_64.lds.S b/arch/x86/kernel/vmlinux_64.lds.S
index ba8ea97..c3fce85 100644
--- a/arch/x86/kernel/vmlinux_64.lds.S
+++ b/arch/x86/kernel/vmlinux_64.lds.S
@@ -185,6 +185,12 @@ SECTIONS
   .altinstr_replacement : AT(ADDR(.altinstr_replacement) - LOAD_OFFSET) {
 	*(.altinstr_replacement)
   }
+  . = ALIGN(8);
+  .parainstructions : AT(ADDR(.parainstructions) - LOAD_OFFSET) {
+  __parainstructions = .;
+	*(.parainstructions)
+  __parainstructions_end = .;
+  }
   /* .exit.text is discard at runtime, not link time, to deal with references
      from .altinstructions and .eh_frame */
   .exit.text : AT(ADDR(.exit.text) - LOAD_OFFSET) { *(.exit.text) }
diff --git a/include/asm-x86/paravirt.h b/include/asm-x86/paravirt.h
index d81a361..4ca335a 100644
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -7,23 +7,42 @@
 #include <asm/page.h>
 
 /* Bitmask of what can be clobbered: usually at least eax. */
-#define CLBR_NONE 0x0
-#define CLBR_EAX 0x1
-#define CLBR_ECX 0x2
-#define CLBR_EDX 0x4
-#define CLBR_ANY 0x7
+#define CLBR_NONE 0
+#define CLBR_EAX  (1 << 0)
+#define CLBR_ECX  (1 << 1)
+#define CLBR_EDX  (1 << 2)
+
+#ifdef CONFIG_X86_64
+#define CLBR_RSI  (1 << 3)
+#define CLBR_RDI  (1 << 4)
+#define CLBR_R8   (1 << 5)
+#define CLBR_R9   (1 << 6)
+#define CLBR_R10  (1 << 7)
+#define CLBR_R11  (1 << 8)
+#define CLBR_ANY  ((1 << 9) - 1)
+#include <asm/desc_defs.h>
+#else
+/* CLBR_ANY should match all regs platform has. For i386, that's just it */
+#define CLBR_ANY  ((1 << 3) - 1)
+#endif /* X86_64 */
 
 #ifndef __ASSEMBLY__
 #include <linux/types.h>
 #include <linux/cpumask.h>
 #include <asm/kmap_types.h>
+#include <linux/stringify.h>
 
 struct page;
 struct thread_struct;
-struct Xgt_desc_struct;
 struct tss_struct;
 struct mm_struct;
 struct desc_struct;
+/* FIXME: Ideally, the two arches would use the same data structure */
+#ifdef CONFIG_X86_64
+typedef struct desc_ptr x86_descr_ptr;
+#else
+typedef struct Xgt_desc_struct x86_descr_ptr;
+#endif
 
 /* general info */
 struct pv_info {
@@ -54,7 +73,6 @@ struct pv_init_ops {
 	void (*banner)(void);
 };
 
-
 struct pv_lazy_ops {
 	/* Set deferred update mode, used for batching operations. */
 	void (*enter)(void);
@@ -88,19 +106,26 @@ struct pv_cpu_ops {
 
 	/* Segment descriptor handling */
 	void (*load_tr_desc)(void);
-	void (*load_gdt)(const struct Xgt_desc_struct *);
-	void (*load_idt)(const struct Xgt_desc_struct *);
-	void (*store_gdt)(struct Xgt_desc_struct *);
-	void (*store_idt)(struct Xgt_desc_struct *);
+	void (*load_gdt)(const x86_descr_ptr *);
+	void (*load_idt)(const x86_descr_ptr *);
+	void (*store_gdt)(x86_descr_ptr *);
+	void (*store_idt)(x86_descr_ptr *);
 	void (*set_ldt)(const void *desc, unsigned entries);
 	unsigned long (*store_tr)(void);
 	void (*load_tls)(struct thread_struct *t, unsigned int cpu);
 	void (*write_ldt_entry)(struct desc_struct *,
 				int entrynum, u32 low, u32 high);
+#ifdef CONFIG_X86_32
 	void (*write_gdt_entry)(struct desc_struct *,
 				int entrynum, u32 low, u32 high);
 	void (*write_idt_entry)(struct desc_struct *,
 				int entrynum, u32 low, u32 high);
+#else
+	void (*write_gdt_entry)(void *ptr, void *entry, unsigned type,
+					 unsigned size);
+	void (*write_idt_entry)(void *adr, struct gate_struct *s);
+#endif
+
 	void (*load_esp0)(struct tss_struct *tss, struct thread_struct *t);
 
 	void (*set_iopl_mask)(unsigned mask);
@@ -115,15 +140,18 @@ struct pv_cpu_ops {
 	/* MSR, PMC and TSR operations.
 	   err = 0/-EFAULT.  wrmsr returns 0/-EFAULT. */
 	u64 (*read_msr)(unsigned int msr, int *err);
-	int (*write_msr)(unsigned int msr, u64 val);
+	int (*write_msr)(unsigned int msr, unsigned int low, unsigned int high);
 
 	u64 (*read_tsc)(void);
-	u64 (*read_pmc)(void);
+	u64 (*read_pmc)(int counter);
+	u64 (*read_tscp)(int *aux);
 
 	/* These two are jmp to, not actually called. */
 	void (*irq_enable_syscall_ret)(void);
 	void (*iret)(void);
 
+	void (*swapgs)(void);
+
 	struct pv_lazy_ops lazy_mode;
 };
 
@@ -150,9 +178,9 @@ struct pv_apic_ops {
 	 * Direct APIC operations, principally for VMI.  Ideally
 	 * these shouldn't be in this interface.
 	 */
-	void (*apic_write)(unsigned long reg, unsigned long v);
-	void (*apic_write_atomic)(unsigned long reg, unsigned long v);
-	unsigned long (*apic_read)(unsigned long reg);
+	void (*apic_write)(unsigned long reg, u32 v);
+	void (*apic_write_atomic)(unsigned long reg, u32 v);
+	u32 (*apic_read)(unsigned long reg);
 	void (*setup_boot_clock)(void);
 	void (*setup_secondary_clock)(void);
 
@@ -216,6 +244,8 @@ struct pv_mmu_ops {
 	void (*set_pte_atomic)(pte_t *ptep, pte_t pteval);
 	void (*set_pte_present)(struct mm_struct *mm, unsigned long addr,
 				pte_t *ptep, pte_t pte);
+#endif
+#if defined(CONFIG_X86_PAE) || defined(CONFIG_X86_64)
 	void (*set_pud)(pud_t *pudp, pud_t pudval);
 	void (*pte_clear)(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
 	void (*pmd_clear)(pmd_t *pmdp);
@@ -227,6 +257,16 @@ struct pv_mmu_ops {
 	pte_t (*make_pte)(unsigned long long pte);
 	pmd_t (*make_pmd)(unsigned long long pmd);
 	pgd_t (*make_pgd)(unsigned long long pgd);
+  #ifdef CONFIG_X86_64
+	void (*set_pgd)(pgd_t *pgdp, pgd_t pgdval);
+
+	void (*pud_clear)(pud_t *pudp);
+	void (*pgd_clear)(pgd_t *pgdp);
+
+	unsigned long long (*pud_val)(pud_t);
+
+	pud_t (*make_pud)(unsigned long long pud);
+  #endif
 #else
 	unsigned long (*pte_val)(pte_t);
 	unsigned long (*pgd_val)(pgd_t);
@@ -255,6 +295,12 @@ struct paravirt_patch_template
 	struct pv_mmu_ops pv_mmu_ops;
 };
 
+#ifdef CONFIG_X86_64
+#define WORDSIZE_STR	"  .quad"
+#else
+#define WORDSIZE_STR	"  .long"
+#endif
+
 extern struct pv_info pv_info;
 extern struct pv_init_ops pv_init_ops;
 extern struct pv_time_ops pv_time_ops;
@@ -279,7 +325,8 @@ extern struct pv_mmu_ops pv_mmu_ops;
 #define _paravirt_alt(insn_string, type, clobber)	\
 	"771:\n\t" insn_string "\n" "772:\n"		\
 	".pushsection .parainstructions,\"a\"\n"	\
-	"  .long 771b\n"				\
+	".align 8\n"					\
+	WORDSIZE_STR "  771b\n"				\
 	"  .byte " type "\n"				\
 	"  .byte 772b-771b\n"				\
 	"  .short " clobber "\n"			\
@@ -289,6 +336,11 @@ extern struct pv_mmu_ops pv_mmu_ops;
 #define paravirt_alt(insn_string)					\
 	_paravirt_alt(insn_string, "%c[paravirt_typenum]", "%c[paravirt_clobber]")
 
+/* Simple instruction patching code. */
+#define DEF_NATIVE(ops, name, code) 					\
+	extern const char start_##ops##_##name[], end_##ops##_##name[];	\
+	asm("start_" #ops "_" #name ": " code "; end_" #ops "_" #name ":")
+
 unsigned paravirt_patch_nop(void);
 unsigned paravirt_patch_ignore(unsigned len);
 unsigned paravirt_patch_call(void *insnbuf,
@@ -303,6 +355,9 @@ unsigned paravirt_patch_default(u8 type, u16 clobbers, void *insnbuf,
 unsigned paravirt_patch_insns(void *insnbuf, unsigned len,
 			      const char *start, const char *end);
 
+unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
+                      unsigned long addr, unsigned len);
+
 int paravirt_disable_iospace(void);
 
 /*
@@ -319,22 +374,29 @@ int paravirt_disable_iospace(void);
  * runtime.
  *
  * Normally, a call to a pv_op function is a simple indirect call:
- * (paravirt_ops.operations)(args...).
+ * (pv_op_struct.operations)(args...).
  *
  * Unfortunately, this is a relatively slow operation for modern CPUs,
  * because it cannot necessarily determine what the destination
- * address is.  In this case, the address is a runtime constant, so at
- * the very least we can patch the call to e a simple direct call, or
+ * address is. In this case, the address is a runtime constant, so at
+ * the very least we can patch the call to be a simple direct call, or
  * ideally, patch an inline implementation into the callsite.  (Direct
  * calls are essentially free, because the call and return addresses
  * are completely predictable.)
  *
- * These macros rely on the standard gcc "regparm(3)" calling
+ * For i386, these macros rely on the standard gcc "regparm(3)" calling
  * convention, in which the first three arguments are placed in %eax,
  * %edx, %ecx (in that order), and the remaining arguments are placed
  * on the stack.  All caller-save registers (eax,edx,ecx) are expected
  * to be modified (either clobbered or used for return values).
  *
+ * X86_64, on the other hand, already specifies a register-based calling
+ * conventions, returning at %rax, with parameteres going on %rdi, %rsi,
+ * %rdx, and %rcx. Note that for this reason, x86_64 does not need any
+ * special handling for dealing with 4 arguments, unlike i386.
+ * However, x86_64 also have to clobber all caller saved registers, which
+ * unfortunately, are quite a bit (r8 - r11)
+ *
  * The call instruction itself is marked by placing its start address
  * and size into the .parainstructions section, so that
  * apply_paravirt() in arch/i386/kernel/alternative.c can do the
@@ -356,9 +418,10 @@ int paravirt_disable_iospace(void);
  * the return type.  The macro then uses sizeof() on that type to
  * determine whether its a 32 or 64 bit value, and places the return
  * in the right register(s) (just %eax for 32-bit, and %edx:%eax for
- * 64-bit).
+ * 64-bit). For x86_64 machines, it just returns at %rax regardless of
+ * the return value size.
  *
- * 64-bit arguments are passed as a pair of adjacent 32-bit arguments
+ * i386 also passes 64-bit arguments as a pair of adjacent 32-bit arguments
  * in low,high order.
  *
  * Small structures are passed and returned in registers.  The macro
@@ -369,46 +432,67 @@ int paravirt_disable_iospace(void);
  * means that all uses must be wrapped in inline functions.  This also
  * makes sure the incoming and outgoing types are always correct.
  */
+#ifdef CONFIG_X86_32
+#define PVOP_VCALL_ARGS			unsigned long __eax,__edx,__ecx
+#define PVOP_CALL_ARGS			PVOP_VCALL_ARGS
+#define PVOP_VCALL_CLOBBERS		"=a" (__eax), "=d" (__edx),	\
+					"=c" (__ecx)
+#define PVOP_CALL_CLOBBERS		PVOP_VCALL_CLOBBERS
+#define EXTRA_CLOBBERS
+#define VEXTRA_CLOBBERS
+#else
+#define PVOP_VCALL_ARGS		unsigned long __edi,__esi,__edx,__ecx
+#define PVOP_CALL_ARGS		PVOP_VCALL_ARGS, __eax
+#define PVOP_VCALL_CLOBBERS	"=D" (__edi),				\
+				"=S" (__esi), "=d" (__edx),		\
+				"=c" (__ecx)
+
+#define PVOP_CALL_CLOBBERS	PVOP_VCALL_CLOBBERS, "=a" (__eax)
+
+#define EXTRA_CLOBBERS	 , "r8", "r9", "r10", "r11"
+#define VEXTRA_CLOBBERS	 , "rax", "r8", "r9", "r10", "r11"
+#endif
+
 #define __PVOP_CALL(rettype, op, pre, post, ...)			\
 	({								\
 		rettype __ret;						\
-		unsigned long __eax, __edx, __ecx;			\
+		PVOP_CALL_ARGS;					\
+		/* This is 32-bit specific, but is okay in 64-bit */	\
+		/* since this condition will never hold */		\
 		if (sizeof(rettype) > sizeof(unsigned long)) {		\
 			asm volatile(pre				\
 				     paravirt_alt(PARAVIRT_CALL)	\
 				     post				\
-				     : "=a" (__eax), "=d" (__edx),	\
-				       "=c" (__ecx)			\
+				     : PVOP_CALL_CLOBBERS		\
 				     : paravirt_type(op),		\
 				       paravirt_clobber(CLBR_ANY),	\
 				       ##__VA_ARGS__			\
-				     : "memory", "cc");			\
+				     : "memory", "cc" EXTRA_CLOBBERS);	\
 			__ret = (rettype)((((u64)__edx) << 32) | __eax); \
 		} else {						\
 			asm volatile(pre				\
 				     paravirt_alt(PARAVIRT_CALL)	\
 				     post				\
-				     : "=a" (__eax), "=d" (__edx),	\
-				       "=c" (__ecx)			\
+				     : PVOP_CALL_CLOBBERS		\
 				     : paravirt_type(op),		\
 				       paravirt_clobber(CLBR_ANY),	\
 				       ##__VA_ARGS__			\
-				     : "memory", "cc");			\
+				     : "memory", "cc" EXTRA_CLOBBERS);	\
 			__ret = (rettype)__eax;				\
 		}							\
 		__ret;							\
 	})
 #define __PVOP_VCALL(op, pre, post, ...)				\
 	({								\
-		unsigned long __eax, __edx, __ecx;			\
+		PVOP_VCALL_ARGS;					\
 		asm volatile(pre					\
 			     paravirt_alt(PARAVIRT_CALL)		\
 			     post					\
-			     : "=a" (__eax), "=d" (__edx), "=c" (__ecx) \
+			     : PVOP_VCALL_CLOBBERS			\
 			     : paravirt_type(op),			\
 			       paravirt_clobber(CLBR_ANY),		\
 			       ##__VA_ARGS__				\
-			     : "memory", "cc");				\
+			     : "memory", "cc" VEXTRA_CLOBBERS);		\
 	})
 
 #define PVOP_CALL0(rettype, op)						\
@@ -417,22 +501,27 @@ int paravirt_disable_iospace(void);
 	__PVOP_VCALL(op, "", "")
 
 #define PVOP_CALL1(rettype, op, arg1)					\
-	__PVOP_CALL(rettype, op, "", "", "0" ((u32)(arg1)))
+	__PVOP_CALL(rettype, op, "", "", "0" ((unsigned long)(arg1)))
 #define PVOP_VCALL1(op, arg1)						\
-	__PVOP_VCALL(op, "", "", "0" ((u32)(arg1)))
+	__PVOP_VCALL(op, "", "", "0" ((unsigned long)(arg1)))
 
 #define PVOP_CALL2(rettype, op, arg1, arg2)				\
-	__PVOP_CALL(rettype, op, "", "", "0" ((u32)(arg1)), "1" ((u32)(arg2)))
+	__PVOP_CALL(rettype, op, "", "", "0" ((unsigned long)(arg1)), 	\
+	"1" ((unsigned long)(arg2)))
+
 #define PVOP_VCALL2(op, arg1, arg2)					\
-	__PVOP_VCALL(op, "", "", "0" ((u32)(arg1)), "1" ((u32)(arg2)))
+	__PVOP_VCALL(op, "", "", "0" ((unsigned long)(arg1)), 		\
+	"1" ((unsigned long)(arg2)))
 
 #define PVOP_CALL3(rettype, op, arg1, arg2, arg3)			\
-	__PVOP_CALL(rettype, op, "", "", "0" ((u32)(arg1)),		\
-		    "1"((u32)(arg2)), "2"((u32)(arg3)))
+	__PVOP_CALL(rettype, op, "", "", "0" ((unsigned long)(arg1)),	\
+	"1"((unsigned long)(arg2)), "2"((unsigned long)(arg3)))
 #define PVOP_VCALL3(op, arg1, arg2, arg3)				\
-	__PVOP_VCALL(op, "", "", "0" ((u32)(arg1)), "1"((u32)(arg2)),	\
-		     "2"((u32)(arg3)))
+	__PVOP_VCALL(op, "", "", "0" ((unsigned long)(arg1)),		\
+	"1"((unsigned long)(arg2)), "2"((unsigned long)(arg3)))
 
+/* This is the only difference in x86_64. We can make it much simpler */
+#ifdef CONFIG_X86_32
 #define PVOP_CALL4(rettype, op, arg1, arg2, arg3, arg4)			\
 	__PVOP_CALL(rettype, op,					\
 		    "push %[_arg4];", "lea 4(%%esp),%%esp;",		\
@@ -443,6 +532,16 @@ int paravirt_disable_iospace(void);
 		    "push %[_arg4];", "lea 4(%%esp),%%esp;",		\
 		    "0" ((u32)(arg1)), "1" ((u32)(arg2)),		\
 		    "2" ((u32)(arg3)), [_arg4] "mr" ((u32)(arg4)))
+#else
+#define PVOP_CALL4(rettype, op, arg1, arg2, arg3, arg4)			\
+	__PVOP_CALL(rettype, op, "", "", "0" ((unsigned long)(arg1)),	\
+	"1"((unsigned long)(arg2)), "2"((unsigned long)(arg3)),		\
+	"3"((unsigned long)(arg4)))
+#define PVOP_VCALL4(op, arg1, arg2, arg3, arg4)				\
+	__PVOP_VCALL(op, "", "", "0" ((unsigned long)(arg1)),		\
+	"1"((unsigned long)(arg2)), "2"((unsigned long)(arg3)),		\
+	"3"((unsigned long)(arg4)))
+#endif
 
 static inline int paravirt_enabled(void)
 {
@@ -561,6 +660,7 @@ static inline u64 paravirt_read_msr(unsigned msr, int *err)
 {
 	return PVOP_CALL2(u64, pv_cpu_ops.read_msr, msr, err);
 }
+
 static inline int paravirt_write_msr(unsigned msr, unsigned low, unsigned high)
 {
 	return PVOP_CALL3(int, pv_cpu_ops.write_msr, msr, low, high);
@@ -613,8 +713,6 @@ static inline unsigned long long paravirt_sched_clock(void)
 }
 #define calculate_cpu_khz() (pv_time_ops.get_cpu_khz())
 
-#define write_tsc(val1,val2) wrmsr(0x10, val1, val2)
-
 static inline unsigned long long paravirt_read_pmc(int counter)
 {
 	return PVOP_CALL1(u64, pv_cpu_ops.read_pmc, counter);
@@ -626,15 +724,36 @@ static inline unsigned long long paravirt_read_pmc(int counter)
 	high = _l >> 32;			\
 } while(0)
 
+static inline unsigned long paravirt_readt_scp(int *aux)
+{
+	return PVOP_CALL1(u64, pv_cpu_ops.read_tscp, aux);
+}
+
+#define rdtscp(low, high, aux)				\
+do {							\
+	int __aux;					\
+	unsigned long __val = paravirt_rdtscp(&__aux);	\
+	(low) = (u32)__val;				\
+	(high) = (u32)(__val >> 32);			\
+	(aux) = __aux;					\
+} while (0)
+
+#define rdtscpll(val, aux)				\
+do {							\
+	unsigned long __aux; 				\
+	val = paravirt_rdtscp(&__aux);			\
+	(aux) = __aux;					\
+} while (0)
+
 static inline void load_TR_desc(void)
 {
 	PVOP_VCALL0(pv_cpu_ops.load_tr_desc);
 }
-static inline void load_gdt(const struct Xgt_desc_struct *dtr)
+static inline void load_gdt(const x86_descr_ptr *dtr)
 {
 	PVOP_VCALL1(pv_cpu_ops.load_gdt, dtr);
 }
-static inline void load_idt(const struct Xgt_desc_struct *dtr)
+static inline void load_idt(const x86_descr_ptr *dtr)
 {
 	PVOP_VCALL1(pv_cpu_ops.load_idt, dtr);
 }
@@ -642,11 +761,11 @@ static inline void set_ldt(const void *addr, unsigned entries)
 {
 	PVOP_VCALL2(pv_cpu_ops.set_ldt, addr, entries);
 }
-static inline void store_gdt(struct Xgt_desc_struct *dtr)
+static inline void store_gdt(x86_descr_ptr *dtr)
 {
 	PVOP_VCALL1(pv_cpu_ops.store_gdt, dtr);
 }
-static inline void store_idt(struct Xgt_desc_struct *dtr)
+static inline void store_idt(x86_descr_ptr *dtr)
 {
 	PVOP_VCALL1(pv_cpu_ops.store_idt, dtr);
 }
@@ -663,6 +782,8 @@ static inline void write_ldt_entry(void *dt, int entry, u32 low, u32 high)
 {
 	PVOP_VCALL4(pv_cpu_ops.write_ldt_entry, dt, entry, low, high);
 }
+
+#ifdef CONFIG_X86_32
 static inline void write_gdt_entry(void *dt, int entry, u32 low, u32 high)
 {
 	PVOP_VCALL4(pv_cpu_ops.write_gdt_entry, dt, entry, low, high);
@@ -671,6 +792,19 @@ static inline void write_idt_entry(void *dt, int entry, u32 low, u32 high)
 {
 	PVOP_VCALL4(pv_cpu_ops.write_idt_entry, dt, entry, low, high);
 }
+#else
+static inline void write_gdt_entry(void *ptr, void *entry,
+				   unsigned type, unsigned size)
+{
+	PVOP_VCALL4(pv_cpu_ops.write_gdt_entry, ptr, entry, type, size);
+}
+
+static inline void write_idt_entry(void *adr, struct gate_struct *s)
+{
+	PVOP_VCALL2(pv_cpu_ops.write_idt_entry, adr, s);
+}
+#endif
+
 static inline void set_iopl_mask(unsigned mask)
 {
 	PVOP_VCALL1(pv_cpu_ops.set_iopl_mask, mask);
@@ -690,19 +824,19 @@ static inline void slow_down_io(void) {
 /*
  * Basic functions accessing APICs.
  */
-static inline void apic_write(unsigned long reg, unsigned long v)
+static inline void apic_write(unsigned long reg, u32 v)
 {
 	PVOP_VCALL2(pv_apic_ops.apic_write, reg, v);
 }
 
-static inline void apic_write_atomic(unsigned long reg, unsigned long v)
+static inline void apic_write_atomic(unsigned long reg, u32 v)
 {
 	PVOP_VCALL2(pv_apic_ops.apic_write_atomic, reg, v);
 }
 
-static inline unsigned long apic_read(unsigned long reg)
+static inline u32 apic_read(unsigned long reg)
 {
-	return PVOP_CALL1(unsigned long, pv_apic_ops.apic_read, reg);
+	return PVOP_CALL1(u32, pv_apic_ops.apic_read, reg);
 }
 
 static inline void setup_boot_clock(void)
@@ -762,10 +896,12 @@ static inline void __flush_tlb(void)
 {
 	PVOP_VCALL0(pv_mmu_ops.flush_tlb_user);
 }
+
 static inline void __flush_tlb_global(void)
 {
 	PVOP_VCALL0(pv_mmu_ops.flush_tlb_kernel);
 }
+
 static inline void __flush_tlb_single(unsigned long addr)
 {
 	PVOP_VCALL1(pv_mmu_ops.flush_tlb_single, addr);
@@ -908,7 +1044,103 @@ static inline void pmd_clear(pmd_t *pmdp)
 	PVOP_VCALL1(pv_mmu_ops.pmd_clear, pmdp);
 }
 
-#else  /* !CONFIG_X86_PAE */
+#elif defined(CONFIG_X86_64)
+/* FIXME: There ought to be a way to do it that duplicate less code */
+static inline pte_t __pte(unsigned long long val)
+{
+	unsigned long long ret;
+	ret = PVOP_CALL1(unsigned long long, pv_mmu_ops.make_pte, val);
+	return (pte_t) { ret };
+}
+
+static inline pmd_t __pmd(unsigned long long val)
+{
+	unsigned long long ret;
+	ret = PVOP_CALL1(unsigned long long, pv_mmu_ops.make_pmd, val);
+	return (pmd_t) { ret };
+}
+
+static inline pud_t __pud(unsigned long long val)
+{
+	unsigned long long ret;
+	ret = PVOP_CALL1(unsigned long long, pv_mmu_ops.make_pud, val);
+	return (pud_t) { ret };
+}
+
+static inline pgd_t __pgd(unsigned long long val)
+{
+	unsigned long long ret;
+	ret = PVOP_CALL1(unsigned long long, pv_mmu_ops.make_pgd, val);
+	return (pgd_t) { ret };
+}
+
+static inline unsigned long long pte_val(pte_t x)
+{
+	return PVOP_CALL1(unsigned long long, pv_mmu_ops.pte_val, x.pte);
+}
+
+static inline unsigned long long pmd_val(pmd_t x)
+{
+	return PVOP_CALL1(unsigned long long, pv_mmu_ops.pmd_val, x.pmd);
+}
+
+static inline unsigned long long pud_val(pud_t x)
+{
+	return PVOP_CALL1(unsigned long long, pv_mmu_ops.pud_val, x.pud);
+}
+
+static inline unsigned long long pgd_val(pgd_t x)
+{
+	return PVOP_CALL1(unsigned long long, pv_mmu_ops.pgd_val, x.pgd);
+}
+
+static inline void set_pte(pte_t *ptep, pte_t pteval)
+{
+	PVOP_VCALL2(pv_mmu_ops.set_pte, ptep, pteval.pte);
+}
+
+static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
+			      pte_t *ptep, pte_t pteval)
+{
+	PVOP_VCALL4(pv_mmu_ops.set_pte_at, mm, addr, ptep, pteval.pte);
+}
+
+static inline void set_pmd(pmd_t *pmdp, pmd_t pmdval)
+{
+	PVOP_VCALL2(pv_mmu_ops.set_pmd, pmdp, pmdval.pmd);
+}
+
+static inline void set_pud(pud_t *pudp, pud_t pudval)
+{
+	PVOP_VCALL2(pv_mmu_ops.set_pud, pudp, pudval.pud);
+}
+
+static inline void set_pgd(pgd_t *pgdp, pgd_t pgdval)
+{
+	PVOP_VCALL2(pv_mmu_ops.set_pgd, pgdp, pgdval.pgd);
+}
+
+static inline void pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+{
+	PVOP_VCALL3(pv_mmu_ops.pte_clear, mm, addr, ptep);
+}
+
+static inline void pmd_clear(pmd_t *pmdp)
+{
+	PVOP_VCALL1(pv_mmu_ops.pmd_clear, pmdp);
+}
+
+static inline void pud_clear(pud_t *pudp)
+{
+	PVOP_VCALL1(pv_mmu_ops.pud_clear, pudp);
+}
+
+static inline void pgd_clear(pgd_t *pgdp)
+{
+	PVOP_VCALL1(pv_mmu_ops.pgd_clear, pgdp);
+}
+
+#else  /* !CONFIG_X86_PAE && !CONFIG_X86_64*/
 
 static inline pte_t __pte(unsigned long val)
 {
@@ -1014,52 +1246,68 @@ struct paravirt_patch_site {
 extern struct paravirt_patch_site __parainstructions[],
 	__parainstructions_end[];
 
+#ifdef CONFIG_X86_32
+#define PV_SAVE_REGS "pushl %%ecx; pushl %%edx;"
+#define PV_RESTORE_REGS "popl %%edx; popl %%ecx"
+#define PV_FLAGS_ARG "0"
+#define PV_EXTRA_CLOBBERS 
+#define PV_VEXTRA_CLOBBERS
+#else
+/* We save some registers, but all of them, that's too much. We clobber all
+ * caller saved registers but the argument parameter */
+#define PV_SAVE_REGS "pushq %%rdi;"
+#define PV_RESTORE_REGS "popq %%rdi;"
+#define PV_EXTRA_CLOBBERS EXTRA_CLOBBERS, "rcx" , "rdx"
+#define PV_VEXTRA_CLOBBERS EXTRA_CLOBBERS, "rdi", "rcx" , "rdx"
+#define PV_FLAGS_ARG "D"
+#endif
+
 static inline unsigned long __raw_local_save_flags(void)
 {
 	unsigned long f;
 
-	asm volatile(paravirt_alt("pushl %%ecx; pushl %%edx;"
+	asm volatile(paravirt_alt(PV_SAVE_REGS
 				  PARAVIRT_CALL
-				  "popl %%edx; popl %%ecx")
+				  PV_RESTORE_REGS)
 		     : "=a"(f)
 		     : paravirt_type(pv_irq_ops.save_fl),
 		       paravirt_clobber(CLBR_EAX)
-		     : "memory", "cc");
+		     : "memory", "cc" PV_VEXTRA_CLOBBERS);
 	return f;
 }
 
 static inline void raw_local_irq_restore(unsigned long f)
 {
-	asm volatile(paravirt_alt("pushl %%ecx; pushl %%edx;"
+	asm volatile(paravirt_alt(PV_SAVE_REGS
 				  PARAVIRT_CALL
-				  "popl %%edx; popl %%ecx")
+				  PV_RESTORE_REGS)
 		     : "=a"(f)
-		     : "0"(f),
+		     : PV_FLAGS_ARG (f),
 		       paravirt_type(pv_irq_ops.restore_fl),
 		       paravirt_clobber(CLBR_EAX)
-		     : "memory", "cc");
+		     : "memory", "cc" PV_EXTRA_CLOBBERS);
 }
 
 static inline void raw_local_irq_disable(void)
 {
-	asm volatile(paravirt_alt("pushl %%ecx; pushl %%edx;"
+	asm volatile(paravirt_alt(PV_SAVE_REGS
 				  PARAVIRT_CALL
-				  "popl %%edx; popl %%ecx")
+				  PV_RESTORE_REGS)
 		     :
 		     : paravirt_type(pv_irq_ops.irq_disable),
 		       paravirt_clobber(CLBR_EAX)
-		     : "memory", "eax", "cc");
+		     : "memory", "eax", "cc" PV_VEXTRA_CLOBBERS);
 }
 
 static inline void raw_local_irq_enable(void)
 {
-	asm volatile(paravirt_alt("pushl %%ecx; pushl %%edx;"
+	asm volatile(paravirt_alt(PV_SAVE_REGS
 				  PARAVIRT_CALL
-				  "popl %%edx; popl %%ecx")
+				  PV_RESTORE_REGS)
 		     :
 		     : paravirt_type(pv_irq_ops.irq_enable),
 		       paravirt_clobber(CLBR_EAX)
-		     : "memory", "eax", "cc");
+		     : "memory", "eax", "cc" PV_VEXTRA_CLOBBERS);
 }
 
 static inline unsigned long __raw_local_irq_save(void)
@@ -1071,27 +1319,41 @@ static inline unsigned long __raw_local_irq_save(void)
 	return f;
 }
 
+#ifdef CONFIG_X86_32
+#define SAVE_REGS "pushl %%ecx; pushl %%edx;"
+#define RESTORE_REGS "popl %%edx; popl %%ecx"
+#define CLI_STI_CLOBBERS , "%eax"
+#else /* !X86_32 */
+#define SAVE_REGS "pushq %%rcx; pushq %%rdx;"
+#define RESTORE_REGS "popq %%rdx; popq %%rcx"
+#define CLI_STI_CLOBBERS , "%rax", "%rdi", "%rsi", "%r8", "%r9", "%r10",\
+			"%r11", "%r12", "%r13", "%r14", "%r15"
+#endif /* X86_32 */
+
 #define CLI_STRING							\
-	_paravirt_alt("pushl %%ecx; pushl %%edx;"			\
+	_paravirt_alt(SAVE_REGS						\
 		      "call *%[paravirt_cli_opptr];"			\
-		      "popl %%edx; popl %%ecx",				\
+		      RESTORE_REGS,					\
 		      "%c[paravirt_cli_type]", "%c[paravirt_clobber]")
 
+
 #define STI_STRING							\
-	_paravirt_alt("pushl %%ecx; pushl %%edx;"			\
+	_paravirt_alt(SAVE_REGS						\
 		      "call *%[paravirt_sti_opptr];"			\
-		      "popl %%edx; popl %%ecx",				\
+		      RESTORE_REGS,					\
 		      "%c[paravirt_sti_type]", "%c[paravirt_clobber]")
 
-#define CLI_STI_CLOBBERS , "%eax"
-#define CLI_STI_INPUT_ARGS						\
-	,								\
-	[paravirt_cli_type] "i" (PARAVIRT_PATCH(pv_irq_ops.irq_disable)),		\
+
+#define CLI_STI_INPUT_ARGS						 \
+	,								 \
+	[paravirt_cli_type] "i" (PARAVIRT_PATCH(pv_irq_ops.irq_disable)),\
 	[paravirt_cli_opptr] "m" (pv_irq_ops.irq_disable),		\
-	[paravirt_sti_type] "i" (PARAVIRT_PATCH(pv_irq_ops.irq_enable)),		\
+	[paravirt_sti_type] "i" (PARAVIRT_PATCH(pv_irq_ops.irq_enable)),\
 	[paravirt_sti_opptr] "m" (pv_irq_ops.irq_enable),		\
 	paravirt_clobber(CLBR_EAX)
 
+
+
 /* Make sure as little as possible of this mess escapes. */
 #undef PARAVIRT_CALL
 #undef __PVOP_CALL
@@ -1106,48 +1368,80 @@ static inline unsigned long __raw_local_irq_save(void)
 #undef PVOP_CALL3
 #undef PVOP_VCALL4
 #undef PVOP_CALL4
+#undef PV_SAVE_REGS
+#undef PV_RESTORE_REGS
 
 #else  /* __ASSEMBLY__ */
 
-#define PARA_PATCH(struct, off)	((PARAVIRT_PATCH_##struct + (off)) / 4)
-
-#define PARA_SITE(ptype, clobbers, ops)		\
+#define _PARA_SITE(ptype, clobbers, ops, word)	\
 771:;						\
 	ops;					\
 772:;						\
 	.pushsection .parainstructions,"a";	\
-	 .long 771b;				\
+	 .align 8;				\
+	 word 771b;				\
 	 .byte ptype;				\
 	 .byte 772b-771b;			\
 	 .short clobbers;			\
 	.popsection
 
+#ifdef CONFIG_X86_64
+#define PV_SAVE_REGS	pushq %rax; pushq %rdi; pushq %rcx; pushq %rdx
+#define PV_RESTORE_REGS popq %rdx; popq %rcx; popq %rdi; popq %rax
+#define PARA_PATCH(struct, off)	((PARAVIRT_PATCH_##struct + (off)) / 8)
+#define PARA_SITE(ptype, clobbers, ops) _PARA_SITE(ptype, clobbers, ops, .quad)
+#else
+#define PV_SAVE_REGS	pushl %eax; pushl %edi; pushl %ecx; pushl %edx
+#define PV_RESTORE_REGS popl %edx; popl %ecx; popl %edi; popl %eax
+#define PARA_PATCH(struct, off)	((PARAVIRT_PATCH_##struct + (off)) / 4)
+#define PARA_SITE(ptype, clobbers, ops) _PARA_SITE(ptype, clobbers, ops, .long)
+#endif
+
 #define INTERRUPT_RETURN						\
 	PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_iret), CLBR_NONE,	\
 		  jmp *%cs:pv_cpu_ops+PV_CPU_iret)
 
 #define DISABLE_INTERRUPTS(clobbers)					\
 	PARA_SITE(PARA_PATCH(pv_irq_ops, PV_IRQ_irq_disable), clobbers, \
-		  pushl %eax; pushl %ecx; pushl %edx;			\
+		  PV_SAVE_REGS;						\
 		  call *%cs:pv_irq_ops+PV_IRQ_irq_disable;		\
-		  popl %edx; popl %ecx; popl %eax)			\
+		  PV_RESTORE_REGS;)
 
 #define ENABLE_INTERRUPTS(clobbers)					\
 	PARA_SITE(PARA_PATCH(pv_irq_ops, PV_IRQ_irq_enable), clobbers,	\
-		  pushl %eax; pushl %ecx; pushl %edx;			\
+		  PV_SAVE_REGS;						\
 		  call *%cs:pv_irq_ops+PV_IRQ_irq_enable;		\
-		  popl %edx; popl %ecx; popl %eax)
+		  PV_RESTORE_REGS;)
 
 #define ENABLE_INTERRUPTS_SYSCALL_RET					\
 	PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_irq_enable_syscall_ret),\
 		  CLBR_NONE,						\
 		  jmp *%cs:pv_cpu_ops+PV_CPU_irq_enable_syscall_ret)
 
+#ifdef CONFIG_X86_32
 #define GET_CR0_INTO_EAX			\
 	push %ecx; push %edx;			\
 	call *pv_cpu_ops+PV_CPU_read_cr0;	\
 	pop %edx; pop %ecx
 
+#else
+/* Those are x86_64 specific */
+#define SWAPGS								\
+	PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_swapgs), CLBR_NONE,	\
+		  pushq %rax; pushq %rdi; pushq %rcx; pushq %rdx;	\
+		  call *pv_cpu_ops+PV_CPU_swapgs;			\
+		  popq %rdx; popq %rcx; popq %rdi; popq %rax;		\
+		 )
+
+#define GET_CR2_INTO_RCX					\
+	call *pv_mmu_ops+PV_MMU_read_cr2;			\
+	movq %rax, %rcx;					\
+	xorq %rax, %rax;
+
+#endif
+
+#undef WSIZE
+
 #endif /* __ASSEMBLY__ */
 #endif /* CONFIG_PARAVIRT */
 #endif	/* __ASM_PARAVIRT_H */
-- 
1.4.4.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 16/16] make vsmp a paravirt client
  2007-10-31 19:15                             ` [PATCH 15/16] consolidation of paravirt for 32 and 64 bits Glauber de Oliveira Costa
@ 2007-10-31 19:15                               ` Glauber de Oliveira Costa
  2007-11-01  4:38                                 ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 28+ messages in thread
From: Glauber de Oliveira Costa @ 2007-10-31 19:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, rusty, ak, mingo, chrisw, jeremy, avi, anthony,
	virtualization, lguest, kvm-devel, zach, tglx, jun.nakajima,
	glommer, Glauber de Oliveira Costa, Steven Rostedt

This patch makes vsmp a paravirt client. It now uses the whole
infrastructure provided by pvops. When we detect we're running
a vsmp box, we change the irq-related paravirt operations (and so,
it have to happen quite early), and the patching function

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com>
---
 arch/x86/Kconfig.x86_64    |    3 +-
 arch/x86/kernel/setup_64.c |    3 ++
 arch/x86/kernel/vsmp_64.c  |   72 +++++++++++++++++++++++++++++++++++++++----
 include/asm-x86/setup.h    |    3 +-
 4 files changed, 71 insertions(+), 10 deletions(-)

diff --git a/arch/x86/Kconfig.x86_64 b/arch/x86/Kconfig.x86_64
index 04734dd..544bad5 100644
--- a/arch/x86/Kconfig.x86_64
+++ b/arch/x86/Kconfig.x86_64
@@ -148,15 +148,14 @@ config X86_PC
 	bool "PC-compatible"
 	help
 	  Choose this option if your computer is a standard PC or compatible.
-
 config X86_VSMP
 	bool "Support for ScaleMP vSMP"
 	depends on PCI
+	select PARAVIRT
 	 help
 	  Support for ScaleMP vSMP systems.  Say 'Y' here if this kernel is
 	  supposed to run on these EM64T-based machines.  Only choose this option
 	  if you have one of these machines.
-
 endchoice
 
 choice
diff --git a/arch/x86/kernel/setup_64.c b/arch/x86/kernel/setup_64.c
index 44a11e3..c522549 100644
--- a/arch/x86/kernel/setup_64.c
+++ b/arch/x86/kernel/setup_64.c
@@ -335,6 +335,9 @@ void __init setup_arch(char **cmdline_p)
 
 	init_memory_mapping(0, (end_pfn_map << PAGE_SHIFT));
 
+#ifdef CONFIG_VSMP
+	vsmp_init();
+#endif
 	dmi_scan_machine();
 
 #ifdef CONFIG_SMP
diff --git a/arch/x86/kernel/vsmp_64.c b/arch/x86/kernel/vsmp_64.c
index 414caf0..547d3b3 100644
--- a/arch/x86/kernel/vsmp_64.c
+++ b/arch/x86/kernel/vsmp_64.c
@@ -8,18 +8,70 @@
  *
  * Ravikiran Thirumalai <kiran@scalemp.com>,
  * Shai Fultheim <shai@scalemp.com>
+ * Paravirt ops integration: Glauber de Oliveira Costa <gcosta@redhat.com>
  */
-
 #include <linux/init.h>
 #include <linux/pci_ids.h>
 #include <linux/pci_regs.h>
 #include <asm/pci-direct.h>
 #include <asm/io.h>
+#include <asm/paravirt.h>
+
+/*
+ * Interrupt control for the VSMP architecture:
+ */
+
+static inline unsigned long vsmp_save_fl(void)
+{
+	unsigned long flags = native_save_fl();
+
+	if (flags & X86_EFLAGS_IF)
+		return X86_EFLAGS_IF;
+	return 0;
+}
 
-static int __init vsmp_init(void)
+static inline void vsmp_restore_fl(unsigned long flags)
+{
+	if (flags & X86_EFLAGS_IF)
+		flags &= ~X86_EFLAGS_AC;
+	if (!(flags & X86_EFLAGS_IF))
+		flags &= X86_EFLAGS_AC;
+	native_restore_fl(flags);
+}
+
+static inline void vsmp_irq_disable(void)
+{
+	unsigned long flags = native_save_fl();
+
+	vsmp_restore_fl((flags & ~X86_EFLAGS_IF));
+}
+
+static inline void vsmp_irq_enable(void)
+{
+	unsigned long flags = native_save_fl();
+
+	vsmp_restore_fl((flags | X86_EFLAGS_IF));
+}
+
+static unsigned __init vsmp_patch(u8 type, u16 clobbers, void *ibuf,
+				  unsigned long addr, unsigned len)
+{
+	switch (type) {
+	case PARAVIRT_PATCH(pv_irq_ops.irq_enable):
+	case PARAVIRT_PATCH(pv_irq_ops.irq_disable):
+	case PARAVIRT_PATCH(pv_irq_ops.save_fl):
+	case PARAVIRT_PATCH(pv_irq_ops.restore_fl):
+		return paravirt_patch_default(type, clobbers, ibuf, addr, len);
+	default:
+		return native_patch(type, clobbers, ibuf, addr, len);
+	}
+
+}
+
+int __init vsmp_init(void)
 {
 	void *address;
-	unsigned int cap, ctl;
+	unsigned int cap, ctl, cfg;
 
 	if (!early_pci_allowed())
 		return 0;
@@ -29,8 +81,16 @@ static int __init vsmp_init(void)
 	    (read_pci_config_16(0, 0x1f, 0, PCI_DEVICE_ID) != PCI_DEVICE_ID_SCALEMP_VSMP_CTL))
 		return 0;
 
+	/* If we are, use the distinguished irq functions */
+	pv_irq_ops.irq_disable = vsmp_irq_disable;
+	pv_irq_ops.irq_enable  = vsmp_irq_enable;
+	pv_irq_ops.save_fl  = vsmp_save_fl;
+	pv_irq_ops.restore_fl  = vsmp_restore_fl;
+	pv_init_ops.patch = vsmp_patch;
+
 	/* set vSMP magic bits to indicate vSMP capable kernel */
-	address = ioremap(read_pci_config(0, 0x1f, 0, PCI_BASE_ADDRESS_0), 8);
+	cfg = read_pci_config(0, 0x1f, 0, PCI_BASE_ADDRESS_0);
+	address = early_ioremap(cfg, 8);
 	cap = readl(address);
 	ctl = readl(address + 4);
 	printk("vSMP CTL: capabilities:0x%08x  control:0x%08x\n", cap, ctl);
@@ -42,8 +102,6 @@ static int __init vsmp_init(void)
 		printk("vSMP CTL: control set to:0x%08x\n", ctl);
 	}
 
-	iounmap(address);
+	early_iounmap(address, 8);
 	return 0;
 }
-
-core_initcall(vsmp_init);
diff --git a/include/asm-x86/setup.h b/include/asm-x86/setup.h
index 071e054..dd7996c 100644
--- a/include/asm-x86/setup.h
+++ b/include/asm-x86/setup.h
@@ -58,7 +58,8 @@ void __init add_memory_region(unsigned long long start,
 
 extern unsigned long init_pg_tables_end;
 
-
+/* For EM64T-based VSMP machines */
+int vsmp_init(void);
 
 #endif /* __i386__ */
 #endif /* _SETUP */
-- 
1.4.4.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 16/16] make vsmp a paravirt client
  2007-10-31 19:15                               ` [PATCH 16/16] make vsmp a paravirt client Glauber de Oliveira Costa
@ 2007-11-01  4:38                                 ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 28+ messages in thread
From: Jeremy Fitzhardinge @ 2007-11-01  4:38 UTC (permalink / raw)
  To: Glauber de Oliveira Costa
  Cc: linux-kernel, akpm, rusty, ak, mingo, chrisw, avi, anthony,
	virtualization, lguest, kvm-devel, zach, tglx, jun.nakajima,
	glommer, Steven Rostedt

Glauber de Oliveira Costa wrote:
> This patch makes vsmp a paravirt client. It now uses the whole
> infrastructure provided by pvops. When we detect we're running
> a vsmp box, we change the irq-related paravirt operations (and so,
> it have to happen quite early), and the patching function
>
> Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com>
> ---
>  arch/x86/Kconfig.x86_64    |    3 +-
>  arch/x86/kernel/setup_64.c |    3 ++
>  arch/x86/kernel/vsmp_64.c  |   72 +++++++++++++++++++++++++++++++++++++++----
>  include/asm-x86/setup.h    |    3 +-
>  4 files changed, 71 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/Kconfig.x86_64 b/arch/x86/Kconfig.x86_64
> index 04734dd..544bad5 100644
> --- a/arch/x86/Kconfig.x86_64
> +++ b/arch/x86/Kconfig.x86_64
> @@ -148,15 +148,14 @@ config X86_PC
>  	bool "PC-compatible"
>  	help
>  	  Choose this option if your computer is a standard PC or compatible.
> -
>  config X86_VSMP
>  	bool "Support for ScaleMP vSMP"
>  	depends on PCI
> +	select PARAVIRT
>  	 help
>  	  Support for ScaleMP vSMP systems.  Say 'Y' here if this kernel is
>  	  supposed to run on these EM64T-based machines.  Only choose this option
>  	  if you have one of these machines.
> -
>  endchoice
>  
>  choice
> diff --git a/arch/x86/kernel/setup_64.c b/arch/x86/kernel/setup_64.c
> index 44a11e3..c522549 100644
> --- a/arch/x86/kernel/setup_64.c
> +++ b/arch/x86/kernel/setup_64.c
> @@ -335,6 +335,9 @@ void __init setup_arch(char **cmdline_p)
>  
>  	init_memory_mapping(0, (end_pfn_map << PAGE_SHIFT));
>  
> +#ifdef CONFIG_VSMP
> +	vsmp_init();
> +#endif
>  	dmi_scan_machine();
>  
>  #ifdef CONFIG_SMP
> diff --git a/arch/x86/kernel/vsmp_64.c b/arch/x86/kernel/vsmp_64.c
> index 414caf0..547d3b3 100644
> --- a/arch/x86/kernel/vsmp_64.c
> +++ b/arch/x86/kernel/vsmp_64.c
> @@ -8,18 +8,70 @@
>   *
>   * Ravikiran Thirumalai <kiran@scalemp.com>,
>   * Shai Fultheim <shai@scalemp.com>
> + * Paravirt ops integration: Glauber de Oliveira Costa <gcosta@redhat.com>
>   */
> -
>  #include <linux/init.h>
>  #include <linux/pci_ids.h>
>  #include <linux/pci_regs.h>
>  #include <asm/pci-direct.h>
>  #include <asm/io.h>
> +#include <asm/paravirt.h>
> +
> +/*
> + * Interrupt control for the VSMP architecture:
> + */
> +
> +static inline unsigned long vsmp_save_fl(void)
>   

No point being inline.

> +{
> +	unsigned long flags = native_save_fl();
> +
> +	if (flags & X86_EFLAGS_IF)
> +		return X86_EFLAGS_IF;
>   

Is this right, or should the if be testing _AC?  Otherwise, why not just
"return flags & X86_EFLAGS_IF"?

> +	return 0;
> +}
>  
> -static int __init vsmp_init(void)
> +static inline void vsmp_restore_fl(unsigned long flags)
> +{
> +	if (flags & X86_EFLAGS_IF)
> +		flags &= ~X86_EFLAGS_AC;
> +	if (!(flags & X86_EFLAGS_IF))
> +		flags &= X86_EFLAGS_AC;
>   

Just use "else"?

> +	native_restore_fl(flags);
> +}
> +
> +static inline void vsmp_irq_disable(void)
> +{
> +	unsigned long flags = native_save_fl();
> +
> +	vsmp_restore_fl((flags & ~X86_EFLAGS_IF));
>   

((double paren?))

> +}
> +
> +static inline void vsmp_irq_enable(void)
> +{
> +	unsigned long flags = native_save_fl();
> +
> +	vsmp_restore_fl((flags | X86_EFLAGS_IF));
> +}
> +
> +static unsigned __init vsmp_patch(u8 type, u16 clobbers, void *ibuf,
> +				  unsigned long addr, unsigned len)
> +{
> +	switch (type) {
> +	case PARAVIRT_PATCH(pv_irq_ops.irq_enable):
> +	case PARAVIRT_PATCH(pv_irq_ops.irq_disable):
> +	case PARAVIRT_PATCH(pv_irq_ops.save_fl):
> +	case PARAVIRT_PATCH(pv_irq_ops.restore_fl):
> +		return paravirt_patch_default(type, clobbers, ibuf, addr, len);
> +	default:
> +		return native_patch(type, clobbers, ibuf, addr, len);
> +	}
> +
> +}
> +
> +int __init vsmp_init(void)
>  {
>  	void *address;
> -	unsigned int cap, ctl;
> +	unsigned int cap, ctl, cfg;
>  
>  	if (!early_pci_allowed())
>  		return 0;
> @@ -29,8 +81,16 @@ static int __init vsmp_init(void)
>  	    (read_pci_config_16(0, 0x1f, 0, PCI_DEVICE_ID) != PCI_DEVICE_ID_SCALEMP_VSMP_CTL))
>  		return 0;
>  
> +	/* If we are, use the distinguished irq functions */
> +	pv_irq_ops.irq_disable = vsmp_irq_disable;
> +	pv_irq_ops.irq_enable  = vsmp_irq_enable;
> +	pv_irq_ops.save_fl  = vsmp_save_fl;
> +	pv_irq_ops.restore_fl  = vsmp_restore_fl;
> +	pv_init_ops.patch = vsmp_patch;
> +
>  	/* set vSMP magic bits to indicate vSMP capable kernel */
> -	address = ioremap(read_pci_config(0, 0x1f, 0, PCI_BASE_ADDRESS_0), 8);
> +	cfg = read_pci_config(0, 0x1f, 0, PCI_BASE_ADDRESS_0);
> +	address = early_ioremap(cfg, 8);
>  	cap = readl(address);
>  	ctl = readl(address + 4);
>  	printk("vSMP CTL: capabilities:0x%08x  control:0x%08x\n", cap, ctl);
> @@ -42,8 +102,6 @@ static int __init vsmp_init(void)
>  		printk("vSMP CTL: control set to:0x%08x\n", ctl);
>  	}
>  
> -	iounmap(address);
> +	early_iounmap(address, 8);
>  	return 0;
>  }
> -
> -core_initcall(vsmp_init);
> diff --git a/include/asm-x86/setup.h b/include/asm-x86/setup.h
> index 071e054..dd7996c 100644
> --- a/include/asm-x86/setup.h
> +++ b/include/asm-x86/setup.h
> @@ -58,7 +58,8 @@ void __init add_memory_region(unsigned long long start,
>  
>  extern unsigned long init_pg_tables_end;
>  
> -
> +/* For EM64T-based VSMP machines */
> +int vsmp_init(void);
>  
>  #endif /* __i386__ */
>  #endif /* _SETUP */
>   


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt
  2007-10-31 19:14     ` [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt Glauber de Oliveira Costa
  2007-10-31 19:14       ` [PATCH 4/16] provide native irq initialization function Glauber de Oliveira Costa
@ 2007-11-01  4:48       ` Jeremy Fitzhardinge
  2007-11-01 13:48         ` Glauber de Oliveira Costa
  1 sibling, 1 reply; 28+ messages in thread
From: Jeremy Fitzhardinge @ 2007-11-01  4:48 UTC (permalink / raw)
  To: Glauber de Oliveira Costa
  Cc: linux-kernel, akpm, rusty, ak, mingo, chrisw, avi, anthony,
	virtualization, lguest, kvm-devel, zach, tglx, jun.nakajima,
	glommer, Steven Rostedt

Glauber de Oliveira Costa wrote:
> This patch introduces, and patch callers when needed, native
> versions for read/write_crX functions, clts and wbinvd.
>
> Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com>
> ---
>  arch/x86/mm/pageattr_64.c   |    3 +-
>  include/asm-x86/system_64.h |   60 ++++++++++++++++++++++++++++++------------
>  2 files changed, 45 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/mm/pageattr_64.c b/arch/x86/mm/pageattr_64.c
> index c40afba..59a52b0 100644
> --- a/arch/x86/mm/pageattr_64.c
> +++ b/arch/x86/mm/pageattr_64.c
> @@ -12,6 +12,7 @@
>  #include <asm/processor.h>
>  #include <asm/tlbflush.h>
>  #include <asm/io.h>
> +#include <asm/paravirt.h>
>  
>  pte_t *lookup_address(unsigned long address)
>  { 
> @@ -77,7 +78,7 @@ static void flush_kernel_map(void *arg)
>  	   much cheaper than WBINVD. */
>  	/* clflush is still broken. Disable for now. */
>  	if (1 || !cpu_has_clflush)
> -		asm volatile("wbinvd" ::: "memory");
> +		wbinvd();
>  	else list_for_each_entry(pg, l, lru) {
>  		void *adr = page_address(pg);
>  		clflush_cache_range(adr, PAGE_SIZE);
> diff --git a/include/asm-x86/system_64.h b/include/asm-x86/system_64.h
> index 4cb2384..b558cb2 100644
> --- a/include/asm-x86/system_64.h
> +++ b/include/asm-x86/system_64.h
> @@ -65,53 +65,62 @@ extern void load_gs_index(unsigned);
>  /*
>   * Clear and set 'TS' bit respectively
>   */
> -#define clts() __asm__ __volatile__ ("clts")
> +static inline void native_clts(void)
> +{
> +	asm volatile ("clts");
> +}
>  
> -static inline unsigned long read_cr0(void)
> -{ 
> +static inline unsigned long native_read_cr0(void)
> +{
>  	unsigned long cr0;
>  	asm volatile("movq %%cr0,%0" : "=r" (cr0));
>  	return cr0;
>  }
>   

This is a pre-existing bug, but it seems to me that these read/write crX
asms should have a constraint to stop the compiler from reordering them
with respect to each other.  The brute-force approach would be to add
"memory" clobbers, but the subtle fix would be to add a variable which
is only used to sequence:

static int __cr_seq;

static inline unsigned long native_read_cr0(void)
{
	unsigned long cr0;
	asm volatile("mov %%cr0, %0" : "=r" (cr0), "=m" (__cr_seq));
}

static inline void native_write_cr0(unsigned long val)
{
	asm volatile("mov %1, %%cr0" : "+m" (__cr_seq) : "r" (val));
}


    J

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 11/16] turn priviled operation into a macro in head_64.S
  2007-10-31 19:14                     ` [PATCH 11/16] turn priviled operation into a macro in head_64.S Glauber de Oliveira Costa
  2007-10-31 19:14                       ` [PATCH 12/16] tweak io_64.h for paravirt Glauber de Oliveira Costa
@ 2007-11-01  4:50                       ` Jeremy Fitzhardinge
  2007-11-01 13:50                         ` Glauber de Oliveira Costa
  1 sibling, 1 reply; 28+ messages in thread
From: Jeremy Fitzhardinge @ 2007-11-01  4:50 UTC (permalink / raw)
  To: Glauber de Oliveira Costa
  Cc: linux-kernel, akpm, rusty, ak, mingo, chrisw, avi, anthony,
	virtualization, lguest, kvm-devel, zach, tglx, jun.nakajima,
	glommer, Steven Rostedt

Glauber de Oliveira Costa wrote:
> under paravirt, read cr2 cannot be issued directly anymore.
> So wrap it in a macro, defined to the operation itself in case
> paravirt is off, but to something else if we have paravirt
> in the game
>   

Is this actually needed?  It's only used in the early fault handler in
head_64.S.  Will we be taking that path in the paravirt case?  If so,
should we disable the fault handler altogether, since the hypervisor can
probably provide better diagnositcs.

    J
> Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com>
> ---
>  arch/x86/kernel/head_64.S |    9 ++++++++-
>  1 files changed, 8 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index b6167fe..c31b1c9 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -19,6 +19,13 @@
>  #include <asm/msr.h>
>  #include <asm/cache.h>
>  
> +#ifdef CONFIG_PARAVIRT
> +#include <asm/asm-offsets.h>
> +#include <asm/paravirt.h>
> +#else
> +#define GET_CR2_INTO_RCX movq %cr2, %rcx
> +#endif
> +
>  /* we are not able to switch in one step to the final KERNEL ADRESS SPACE
>   * because we need identity-mapped pages.
>   *
> @@ -267,7 +274,7 @@ ENTRY(early_idt_handler)
>  	xorl %eax,%eax
>  	movq 8(%rsp),%rsi	# get rip
>  	movq (%rsp),%rdx
> -	movq %cr2,%rcx
> +	GET_CR2_INTO_RCX
>  	leaq early_idt_msg(%rip),%rdi
>  	call early_printk
>  	cmpl $2,early_recursion_flag(%rip)
>   


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt
  2007-11-01  4:48       ` [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt Jeremy Fitzhardinge
@ 2007-11-01 13:48         ` Glauber de Oliveira Costa
  2007-11-01 15:30           ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 28+ messages in thread
From: Glauber de Oliveira Costa @ 2007-11-01 13:48 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: linux-kernel, akpm, rusty, ak, mingo, chrisw, avi, anthony,
	virtualization, lguest, kvm-devel, zach, tglx, jun.nakajima,
	glommer, Steven Rostedt

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jeremy Fitzhardinge escreveu:
> Glauber de Oliveira Costa wrote:
>> This patch introduces, and patch callers when needed, native
>> versions for read/write_crX functions, clts and wbinvd.
>>
>> Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
>> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
>> Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com>
>> ---
>>  arch/x86/mm/pageattr_64.c   |    3 +-
>>  include/asm-x86/system_64.h |   60 ++++++++++++++++++++++++++++++------------
>>  2 files changed, 45 insertions(+), 18 deletions(-)
>>
>> diff --git a/arch/x86/mm/pageattr_64.c b/arch/x86/mm/pageattr_64.c
>> index c40afba..59a52b0 100644
>> --- a/arch/x86/mm/pageattr_64.c
>> +++ b/arch/x86/mm/pageattr_64.c
>> @@ -12,6 +12,7 @@
>>  #include <asm/processor.h>
>>  #include <asm/tlbflush.h>
>>  #include <asm/io.h>
>> +#include <asm/paravirt.h>
>>  
>>  pte_t *lookup_address(unsigned long address)
>>  { 
>> @@ -77,7 +78,7 @@ static void flush_kernel_map(void *arg)
>>  	   much cheaper than WBINVD. */
>>  	/* clflush is still broken. Disable for now. */
>>  	if (1 || !cpu_has_clflush)
>> -		asm volatile("wbinvd" ::: "memory");
>> +		wbinvd();
>>  	else list_for_each_entry(pg, l, lru) {
>>  		void *adr = page_address(pg);
>>  		clflush_cache_range(adr, PAGE_SIZE);
>> diff --git a/include/asm-x86/system_64.h b/include/asm-x86/system_64.h
>> index 4cb2384..b558cb2 100644
>> --- a/include/asm-x86/system_64.h
>> +++ b/include/asm-x86/system_64.h
>> @@ -65,53 +65,62 @@ extern void load_gs_index(unsigned);
>>  /*
>>   * Clear and set 'TS' bit respectively
>>   */
>> -#define clts() __asm__ __volatile__ ("clts")
>> +static inline void native_clts(void)
>> +{
>> +	asm volatile ("clts");
>> +}
>>  
>> -static inline unsigned long read_cr0(void)
>> -{ 
>> +static inline unsigned long native_read_cr0(void)
>> +{
>>  	unsigned long cr0;
>>  	asm volatile("movq %%cr0,%0" : "=r" (cr0));
>>  	return cr0;
>>  }
>>   
> 
> This is a pre-existing bug, but it seems to me that these read/write crX
> asms should have a constraint to stop the compiler from reordering them
> with respect to each other.  The brute-force approach would be to add
> "memory" clobbers, but the subtle fix would be to add a variable which
> is only used to sequence:
> 
I in fact have seen bugs with mixed reads and writes to the same cr,
(cr4), but adding the volatile
flag to the read function seemed to fix it. Yet, I agree with you that
the theorectical problem exists for the reorder, and your proposed fix
seems fine (although if we're really desperate about memory usage, we
can use a char instead a int and save 3 bytes!)

It also just ocurred to me that this part of the patch can also go into
the consolidation part. So I'll respin it.

Thanks for the comment
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Remi - http://enigmail.mozdev.org

iD8DBQFHKdkujYI8LaFUWXMRAhV2AKDPIjwGQnoLtldys/OWtIEs6biwxwCg1Jd/
o36S+qcb4sWJ6peqhrSRnos=
=YmeC
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 11/16] turn priviled operation into a macro in head_64.S
  2007-11-01  4:50                       ` [PATCH 11/16] turn priviled operation into a macro in head_64.S Jeremy Fitzhardinge
@ 2007-11-01 13:50                         ` Glauber de Oliveira Costa
  0 siblings, 0 replies; 28+ messages in thread
From: Glauber de Oliveira Costa @ 2007-11-01 13:50 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: linux-kernel, akpm, rusty, ak, mingo, chrisw, avi, anthony,
	virtualization, lguest, kvm-devel, zach, tglx, jun.nakajima,
	glommer, Steven Rostedt

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jeremy Fitzhardinge escreveu:
> Glauber de Oliveira Costa wrote:
>> under paravirt, read cr2 cannot be issued directly anymore.
>> So wrap it in a macro, defined to the operation itself in case
>> paravirt is off, but to something else if we have paravirt
>> in the game
>>   
> 
> Is this actually needed?  It's only used in the early fault handler in
> head_64.S.  Will we be taking that path in the paravirt case?  If so,
> should we disable the fault handler altogether, since the hypervisor can
> probably provide better diagnositcs.
> 

Well, as you told me earlier, xen won't use it. Neither does lguest.
None of us goes through the normal boot process anyway. But maybe some
other technology uses it?

Zach, how does it work for vmware?


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Remi - http://enigmail.mozdev.org

iD8DBQFHKdmVjYI8LaFUWXMRAgxkAKCQB+VtpGSlm+zuyRRWmi3h+k9NqQCgyfmD
cpPyQcfxo9hcSI0WaDFmpWg=
=Oa4v
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt
  2007-11-01 13:48         ` Glauber de Oliveira Costa
@ 2007-11-01 15:30           ` Jeremy Fitzhardinge
  2007-11-01 16:07             ` Keir Fraser
  0 siblings, 1 reply; 28+ messages in thread
From: Jeremy Fitzhardinge @ 2007-11-01 15:30 UTC (permalink / raw)
  To: Glauber de Oliveira Costa
  Cc: linux-kernel, akpm, rusty, ak, mingo, chrisw, avi, anthony,
	virtualization, lguest, kvm-devel, zach, tglx, jun.nakajima,
	glommer, Steven Rostedt

Glauber de Oliveira Costa wrote:
> I in fact have seen bugs with mixed reads and writes to the same cr,
> (cr4), but adding the volatile
> flag to the read function seemed to fix it.

Well, volatile will make a read be repeated rather than caching the
previous value, but it has no effect on ordering.

> Yet, I agree with you that
> the theorectical problem exists for the reorder, and your proposed fix
> seems fine (although if we're really desperate about memory usage, we
> can use a char instead a int and save 3 bytes!)

Sure.  Ideally the compiler would never even generate a reference to it,
and it could just be extern, but in practice the compiler will generate
references sometimes.

    J

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt
  2007-11-01 15:30           ` Jeremy Fitzhardinge
@ 2007-11-01 16:07             ` Keir Fraser
  2007-11-01 16:13               ` Glauber de Oliveira Costa
  2007-11-01 17:41               ` Jeremy Fitzhardinge
  0 siblings, 2 replies; 28+ messages in thread
From: Keir Fraser @ 2007-11-01 16:07 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Glauber de Oliveira Costa
  Cc: lguest, kvm-devel, linux-kernel, ak, chrisw, tglx, anthony, akpm,
	virtualization, mingo

On 1/11/07 15:30, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:

> Glauber de Oliveira Costa wrote:
>> I in fact have seen bugs with mixed reads and writes to the same cr,
>> (cr4), but adding the volatile
>> flag to the read function seemed to fix it.
> 
> Well, volatile will make a read be repeated rather than caching the
> previous value, but it has no effect on ordering.

volatile prevents the asm from being 'moved significantly', according to the
gcc manual. I take that to mean that reordering is not allowed.

 -- Keir



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt
  2007-11-01 16:07             ` Keir Fraser
@ 2007-11-01 16:13               ` Glauber de Oliveira Costa
  2007-11-01 17:41               ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 28+ messages in thread
From: Glauber de Oliveira Costa @ 2007-11-01 16:13 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Jeremy Fitzhardinge, lguest, kvm-devel, linux-kernel, ak, chrisw,
	tglx, anthony, akpm, virtualization, mingo

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Keir Fraser escreveu:
> On 1/11/07 15:30, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:
> 
>> Glauber de Oliveira Costa wrote:
>>> I in fact have seen bugs with mixed reads and writes to the same cr,
>>> (cr4), but adding the volatile
>>> flag to the read function seemed to fix it.
>> Well, volatile will make a read be repeated rather than caching the
>> previous value, but it has no effect on ordering.
> 
> volatile prevents the asm from being 'moved significantly', according to the
> gcc manual. I take that to mean that reordering is not allowed.
> 
According to a gcc developer to whom I asked this question, volatile
prevents the code
to be removed, but does not prevent it to be moved (pun indented). In
practice, it should force
a re-read, but not influence the ordering decisions from the compiler.
Besides , 'significantly'
sounds like a significantly unprecise word, whose specific meaning may
be implementation dependant.

So I agree that adding a memory location reference is probably the best
alternative.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Remi - http://enigmail.mozdev.org

iD8DBQFHKftDjYI8LaFUWXMRAiLTAKDqf/M8umNYw6u7r9ONozTEUVy8SwCgygma
jWNKQmxmLpyPxr00KbQy9Vg=
=JM4K
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Lguest] [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt
  2007-11-01 17:41               ` Jeremy Fitzhardinge
@ 2007-11-01 16:55                 ` Zachary Amsden
  2007-11-02  1:21                   ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 28+ messages in thread
From: Zachary Amsden @ 2007-11-01 16:55 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Keir Fraser, lguest, kvm-devel, mingo, ak, linux-kernel, anthony,
	tglx, Glauber de Oliveira Costa, virtualization, akpm

On Thu, 2007-11-01 at 10:41 -0700, Jeremy Fitzhardinge wrote:
> Keir Fraser wrote:
> > volatile prevents the asm from being 'moved significantly', according to the
> > gcc manual. I take that to mean that reordering is not allowed.
> >   

I understood it as reordering was permitted, but no re-ordering across
another volatile load, store, or asm was permitted.  And of course, as
long as input and output constraints are written properly, the
re-ordering should not be vulnerable to pathological movement causing
the code to malfunction.

It seems that CPU state side effects which can't be expressed in C need
special care - FPU is certainly one example.

Also, memory clobber on a volatile asm should stop invalid movement
across TLB flushes and other problems areas.  Even memory fences should
have memory clobber in order to stop movement of loads and stores across
the fence by the compiler.

Zach


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt
  2007-11-01 16:07             ` Keir Fraser
  2007-11-01 16:13               ` Glauber de Oliveira Costa
@ 2007-11-01 17:41               ` Jeremy Fitzhardinge
  2007-11-01 16:55                 ` [Lguest] " Zachary Amsden
  1 sibling, 1 reply; 28+ messages in thread
From: Jeremy Fitzhardinge @ 2007-11-01 17:41 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Glauber de Oliveira Costa, lguest, kvm-devel, linux-kernel, ak,
	chrisw, tglx, anthony, akpm, virtualization, mingo

Keir Fraser wrote:
> volatile prevents the asm from being 'moved significantly', according to the
> gcc manual. I take that to mean that reordering is not allowed.
>   

That phrase doesn't appear in the gcc manual; in fact, it specifically
says that reordering can happen:

    The `volatile' keyword indicates that the instruction has important
    side-effects.  GCC will not delete a volatile `asm' if it is reachable.
    (The instruction can still be deleted if GCC can prove that
    control-flow will never reach the location of the instruction.)  Note
    that even a volatile `asm' instruction can be moved relative to other
    code, including across jump instructions.  For example, on many targets
    there is a system register which can be set to control the rounding
    mode of floating point operations.  You might try setting it with a
    volatile `asm', like this PowerPC example:

                asm volatile("mtfsf 255,%0" : : "f" (fpenv));
                sum = x + y;

    This will not work reliably, as the compiler may move the addition back
    before the volatile `asm'.  To make it work you need to add an
    artificial dependency to the `asm' referencing a variable in the code
    you don't want moved, for example:

             asm volatile ("mtfsf 255,%1" : "=X"(sum): "f"(fpenv));
             sum = x + y;

I take from this that it is not a good idea to assume "asm volatile" has
any ordering effects at all.

    J

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Lguest] [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt
  2007-11-01 16:55                 ` [Lguest] " Zachary Amsden
@ 2007-11-02  1:21                   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 28+ messages in thread
From: Jeremy Fitzhardinge @ 2007-11-02  1:21 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Jeremy Fitzhardinge, lguest, akpm, kvm-devel, linux-kernel, ak,
	Keir Fraser, anthony, mingo, Glauber de Oliveira Costa,
	virtualization, tglx

Zachary Amsden wrote:
> I understood it as reordering was permitted, but no re-ordering across
> another volatile load, store, or asm was permitted.

It doesn't say that, so I wouldn't assume it.  Certainly we had problems
with the pda code; until I added the _proxy_pda dependency variable, the
only fix Andi could find was adding both "volatile" and a memory clobber.

>   And of course, as
> long as input and output constraints are written properly, the
> re-ordering should not be vulnerable to pathological movement causing
> the code to malfunction.
>   

Yes.  I think constraints are the only way to control ordering (even if
it's as heavy-handed as a memory clobber).  It would be nice if gcc had
a constraint which was only used for ordering, and never generated a
reference.  Then you could make up pseudo-variables in order to express
dependencies without having the risk that the compiler would generate
references.

> It seems that CPU state side effects which can't be expressed in C need
> special care - FPU is certainly one example.
>   

Not an immediate problem, fortunately.

> Also, memory clobber on a volatile asm should stop invalid movement
> across TLB flushes and other problems areas.

Yes.  Any asm which has global effects on how addresses are interpreted
(like tlbflush, reloading the pagetable base, changing modes, etc) needs
to have a memory clobber.

>   Even memory fences should
> have memory clobber in order to stop movement of loads and stores across
> the fence by the compiler.
>   

Pretty sure they do.  A normal compiler barrier is *just* a memory clobber.

    J

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2007-11-02  1:22 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-10-31 19:14 [PATCH 0/7] (Re-)introducing pvops for x86_64 - Real pvops work part Glauber de Oliveira Costa
2007-10-31 19:14 ` [PATCH 1/16] Wipe out traditional opt from x86_64 Makefile Glauber de Oliveira Costa
2007-10-31 19:14   ` [PATCH 2/16] paravirt hooks at entry functions Glauber de Oliveira Costa
2007-10-31 19:14     ` [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt Glauber de Oliveira Costa
2007-10-31 19:14       ` [PATCH 4/16] provide native irq initialization function Glauber de Oliveira Costa
2007-10-31 19:14         ` [PATCH 5/16] report ring kernel is running without paravirt Glauber de Oliveira Costa
2007-10-31 19:14           ` [PATCH 6/16] export math_state_restore Glauber de Oliveira Costa
2007-10-31 19:14             ` [PATCH 7/16] native versions for set pagetables Glauber de Oliveira Costa
2007-10-31 19:14               ` [PATCH 8/16] add native functions for descriptors handling Glauber de Oliveira Costa
2007-10-31 19:14                 ` [PATCH 9/16] This patch add provisions for time related functions so they Glauber de Oliveira Costa
2007-10-31 19:14                   ` [PATCH 10/16] export cpu_gdt_descr Glauber de Oliveira Costa
2007-10-31 19:14                     ` [PATCH 11/16] turn priviled operation into a macro in head_64.S Glauber de Oliveira Costa
2007-10-31 19:14                       ` [PATCH 12/16] tweak io_64.h for paravirt Glauber de Oliveira Costa
2007-10-31 19:14                         ` [PATCH 13/16] native versions for page table entries values Glauber de Oliveira Costa
2007-10-31 19:14                           ` [PATCH 14/16] prepare x86_64 architecture initialization for paravirt Glauber de Oliveira Costa
2007-10-31 19:15                             ` [PATCH 15/16] consolidation of paravirt for 32 and 64 bits Glauber de Oliveira Costa
2007-10-31 19:15                               ` [PATCH 16/16] make vsmp a paravirt client Glauber de Oliveira Costa
2007-11-01  4:38                                 ` Jeremy Fitzhardinge
2007-11-01  4:50                       ` [PATCH 11/16] turn priviled operation into a macro in head_64.S Jeremy Fitzhardinge
2007-11-01 13:50                         ` Glauber de Oliveira Costa
2007-11-01  4:48       ` [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt Jeremy Fitzhardinge
2007-11-01 13:48         ` Glauber de Oliveira Costa
2007-11-01 15:30           ` Jeremy Fitzhardinge
2007-11-01 16:07             ` Keir Fraser
2007-11-01 16:13               ` Glauber de Oliveira Costa
2007-11-01 17:41               ` Jeremy Fitzhardinge
2007-11-01 16:55                 ` [Lguest] " Zachary Amsden
2007-11-02  1:21                   ` Jeremy Fitzhardinge

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox