public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] Collected vdso/vsyscall fixes for 3.1
@ 2011-07-27  3:20 Andy Lutomirski
  2011-07-27  3:20 ` [PATCH 1/5] x86-64: Pad vDSO to a page boundary Andy Lutomirski
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Andy Lutomirski @ 2011-07-27  3:20 UTC (permalink / raw)
  To: x86, Konrad Rzeszutek Wilk
  Cc: Linux Kernel Mailing List, jeremy, keir.xen, xen-devel,
	virtualization, Andy Lutomirski

This fixes various problems that cropped up with the vdso patches.

 - Patch 1 fixes an information leak to userspace.
 - Patches 2 and 3 fix the kernel build on gold.
 - Patches 4 and 5 fix Xen (I hope).

Konrad, could you could test these on Xen and run 'test_vsyscall test' [1]?
I don't have a usable Xen setup.

Also, I'd appreciate a review of patches 4 and 5 from some Xen/paravirt
people.

[1] https://gitorious.org/linux-test-utils/linux-clock-tests


Andy Lutomirski (5):
  x86-64: Pad vDSO to a page boundary
  x86-64: Move the "user" vsyscall segment out of the data segment.
  x86-64: Work around gold bug 13023
  x86-64/xen: Enable the vvar mapping
  x86-64: Add user_64bit_mode paravirt op

 arch/x86/include/asm/desc.h           |    4 +-
 arch/x86/include/asm/paravirt_types.h |    6 ++++
 arch/x86/include/asm/ptrace.h         |   19 +++++++++++++
 arch/x86/kernel/paravirt.c            |    4 +++
 arch/x86/kernel/step.c                |    2 +-
 arch/x86/kernel/vmlinux.lds.S         |   46 ++++++++++++++++++---------------
 arch/x86/kernel/vsyscall_64.c         |    6 +---
 arch/x86/mm/fault.c                   |    2 +-
 arch/x86/vdso/vdso.S                  |    1 +
 arch/x86/xen/enlighten.c              |    1 +
 arch/x86/xen/mmu.c                    |    4 ++-
 11 files changed, 64 insertions(+), 31 deletions(-)

-- 
1.7.6


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/5] x86-64: Pad vDSO to a page boundary
  2011-07-27  3:20 [PATCH 0/5] Collected vdso/vsyscall fixes for 3.1 Andy Lutomirski
@ 2011-07-27  3:20 ` Andy Lutomirski
  2011-07-27  3:20 ` [PATCH 2/5] x86-64: Move the "user" vsyscall segment out of the data segment Andy Lutomirski
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Andy Lutomirski @ 2011-07-27  3:20 UTC (permalink / raw)
  To: x86, Konrad Rzeszutek Wilk
  Cc: Linux Kernel Mailing List, jeremy, keir.xen, xen-devel,
	virtualization, Andy Lutomirski

This avoids an information leak to userspace.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/vdso/vdso.S |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/vdso/vdso.S b/arch/x86/vdso/vdso.S
index 1b979c1..01f5e3b 100644
--- a/arch/x86/vdso/vdso.S
+++ b/arch/x86/vdso/vdso.S
@@ -9,6 +9,7 @@ __PAGE_ALIGNED_DATA
 vdso_start:
 	.incbin "arch/x86/vdso/vdso.so"
 vdso_end:
+	.align PAGE_SIZE /* extra data here leaks to userspace. */
 
 .previous
 
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/5] x86-64: Move the "user" vsyscall segment out of the data segment.
  2011-07-27  3:20 [PATCH 0/5] Collected vdso/vsyscall fixes for 3.1 Andy Lutomirski
  2011-07-27  3:20 ` [PATCH 1/5] x86-64: Pad vDSO to a page boundary Andy Lutomirski
@ 2011-07-27  3:20 ` Andy Lutomirski
  2011-07-27  3:20 ` [PATCH 3/5] x86-64: Work around gold bug 13023 Andy Lutomirski
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Andy Lutomirski @ 2011-07-27  3:20 UTC (permalink / raw)
  To: x86, Konrad Rzeszutek Wilk
  Cc: Linux Kernel Mailing List, jeremy, keir.xen, xen-devel,
	virtualization, Andy Lutomirski

The kernel's loader doesn't seem to care, but gold complains.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
Reported-by: Arkadiusz Miskiewicz <a.miskiewicz@gmail.com>
---
 arch/x86/kernel/vmlinux.lds.S |   36 ++++++++++++++++++------------------
 1 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 4aa9c54..e79fb39 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -154,6 +154,24 @@ SECTIONS
 
 #ifdef CONFIG_X86_64
 
+	. = ALIGN(PAGE_SIZE);
+	__vvar_page = .;
+
+	.vvar : AT(ADDR(.vvar) - LOAD_OFFSET) {
+
+	      /* Place all vvars at the offsets in asm/vvar.h. */
+#define EMIT_VVAR(name, offset) 		\
+		. = offset;		\
+		*(.vvar_ ## name)
+#define __VVAR_KERNEL_LDS
+#include <asm/vvar.h>
+#undef __VVAR_KERNEL_LDS
+#undef EMIT_VVAR
+
+	} :data
+
+       . = ALIGN(__vvar_page + PAGE_SIZE, PAGE_SIZE);
+
 #define VSYSCALL_ADDR (-10*1024*1024)
 
 #define VLOAD_OFFSET (VSYSCALL_ADDR - __vsyscall_0 + LOAD_OFFSET)
@@ -162,7 +180,6 @@ SECTIONS
 #define VVIRT_OFFSET (VSYSCALL_ADDR - __vsyscall_0)
 #define VVIRT(x) (ADDR(x) - VVIRT_OFFSET)
 
-	. = ALIGN(4096);
 	__vsyscall_0 = .;
 
 	. = VSYSCALL_ADDR;
@@ -185,23 +202,6 @@ SECTIONS
 #undef VVIRT_OFFSET
 #undef VVIRT
 
-	__vvar_page = .;
-
-	.vvar : AT(ADDR(.vvar) - LOAD_OFFSET) {
-
-	      /* Place all vvars at the offsets in asm/vvar.h. */
-#define EMIT_VVAR(name, offset) 		\
-		. = offset;		\
-		*(.vvar_ ## name)
-#define __VVAR_KERNEL_LDS
-#include <asm/vvar.h>
-#undef __VVAR_KERNEL_LDS
-#undef EMIT_VVAR
-
-	} :data
-
-       . = ALIGN(__vvar_page + PAGE_SIZE, PAGE_SIZE);
-
 #endif /* CONFIG_X86_64 */
 
 	/* Init code and data - will be freed after init */
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 3/5] x86-64: Work around gold bug 13023
  2011-07-27  3:20 [PATCH 0/5] Collected vdso/vsyscall fixes for 3.1 Andy Lutomirski
  2011-07-27  3:20 ` [PATCH 1/5] x86-64: Pad vDSO to a page boundary Andy Lutomirski
  2011-07-27  3:20 ` [PATCH 2/5] x86-64: Move the "user" vsyscall segment out of the data segment Andy Lutomirski
@ 2011-07-27  3:20 ` Andy Lutomirski
  2011-07-27  3:20 ` [PATCH 4/5] x86-64/xen: Enable the vvar mapping Andy Lutomirski
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Andy Lutomirski @ 2011-07-27  3:20 UTC (permalink / raw)
  To: x86, Konrad Rzeszutek Wilk
  Cc: Linux Kernel Mailing List, jeremy, keir.xen, xen-devel,
	virtualization, Andy Lutomirski

Gold has trouble assigning numbers to the location counter inside of
an output section description.  The bug was triggered by
9fd67b4ed0714ab718f1f9bd14c344af336a6df7, which consolidated all of
the vsyscall sections into a single section.  The workaround is IMO
still nicer than the old way of doing it.

This produces an apparently valid kernel image and passes my vdso
tests on both GNU ld version 2.21.51.0.6-2.fc15 20110118 and GNU
gold (version 2.21.51.0.6-2.fc15 20110118) 1.10 as distributed by
Fedora 15.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
Reported-by: Arkadiusz Miskiewicz <a.miskiewicz@gmail.com>
---
 arch/x86/kernel/vmlinux.lds.S |   16 ++++++++++------
 1 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index e79fb39..8f3a265 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -158,10 +158,12 @@ SECTIONS
 	__vvar_page = .;
 
 	.vvar : AT(ADDR(.vvar) - LOAD_OFFSET) {
+		/* work around gold bug 13023 */
+		__vvar_beginning_hack = .;
 
-	      /* Place all vvars at the offsets in asm/vvar.h. */
-#define EMIT_VVAR(name, offset) 		\
-		. = offset;		\
+		/* Place all vvars at the offsets in asm/vvar.h. */
+#define EMIT_VVAR(name, offset) 			\
+		. = __vvar_beginning_hack + offset;	\
 		*(.vvar_ ## name)
 #define __VVAR_KERNEL_LDS
 #include <asm/vvar.h>
@@ -184,15 +186,17 @@ SECTIONS
 
 	. = VSYSCALL_ADDR;
 	.vsyscall : AT(VLOAD(.vsyscall)) {
+		/* work around gold bug 13023 */
+		__vsyscall_beginning_hack = .;
 		*(.vsyscall_0)
 
-		. = 1024;
+		. = __vsyscall_beginning_hack + 1024;
 		*(.vsyscall_1)
 
-		. = 2048;
+		. = __vsyscall_beginning_hack + 2048;
 		*(.vsyscall_2)
 
-		. = 4096;  /* Pad the whole page. */
+		. = __vsyscall_beginning_hack + 4096;  /* Pad the whole page. */
 	} :user =0xcc
 	. = ALIGN(__vsyscall_0 + PAGE_SIZE, PAGE_SIZE);
 
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 4/5] x86-64/xen: Enable the vvar mapping
  2011-07-27  3:20 [PATCH 0/5] Collected vdso/vsyscall fixes for 3.1 Andy Lutomirski
                   ` (2 preceding siblings ...)
  2011-07-27  3:20 ` [PATCH 3/5] x86-64: Work around gold bug 13023 Andy Lutomirski
@ 2011-07-27  3:20 ` Andy Lutomirski
  2011-07-27 13:06   ` Konrad Rzeszutek Wilk
  2011-07-27  3:20 ` [PATCH 5/5] x86-64: Add user_64bit_mode paravirt op Andy Lutomirski
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Andy Lutomirski @ 2011-07-27  3:20 UTC (permalink / raw)
  To: x86, Konrad Rzeszutek Wilk
  Cc: Linux Kernel Mailing List, jeremy, keir.xen, xen-devel,
	virtualization, Andy Lutomirski

Xen needs special treatment for fixmaps.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/mmu.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index f987bde..8cce339 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1916,6 +1916,7 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
 # endif
 #else
 	case VSYSCALL_LAST_PAGE ... VSYSCALL_FIRST_PAGE:
+	case VVAR_PAGE:
 #endif
 	case FIX_TEXT_POKE0:
 	case FIX_TEXT_POKE1:
@@ -1956,7 +1957,8 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
 #ifdef CONFIG_X86_64
 	/* Replicate changes to map the vsyscall page into the user
 	   pagetable vsyscall mapping. */
-	if (idx >= VSYSCALL_LAST_PAGE && idx <= VSYSCALL_FIRST_PAGE) {
+	if ((idx >= VSYSCALL_LAST_PAGE && idx <= VSYSCALL_FIRST_PAGE) ||
+	    idx == VVAR_PAGE) {
 		unsigned long vaddr = __fix_to_virt(idx);
 		set_pte_vaddr_pud(level3_user_vsyscall, vaddr, pte);
 	}
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 5/5] x86-64: Add user_64bit_mode paravirt op
  2011-07-27  3:20 [PATCH 0/5] Collected vdso/vsyscall fixes for 3.1 Andy Lutomirski
                   ` (3 preceding siblings ...)
  2011-07-27  3:20 ` [PATCH 4/5] x86-64/xen: Enable the vvar mapping Andy Lutomirski
@ 2011-07-27  3:20 ` Andy Lutomirski
  2011-07-27 17:24   ` Jeremy Fitzhardinge
  2011-07-27 12:59 ` [PATCH 0/5] Collected vdso/vsyscall fixes for 3.1 Konrad Rzeszutek Wilk
  2011-07-27 14:57 ` Konrad Rzeszutek Wilk
  6 siblings, 1 reply; 20+ messages in thread
From: Andy Lutomirski @ 2011-07-27  3:20 UTC (permalink / raw)
  To: x86, Konrad Rzeszutek Wilk
  Cc: Linux Kernel Mailing List, jeremy, keir.xen, xen-devel,
	virtualization, Andy Lutomirski

Three places in the kernel assume that the only long mode CPL 3
selector is __USER_CS.  This is not true on Xen -- Xen's sysretq
changes cs to the magic value 0xe033.

Two of the places are corner cases, but as of "x86-64: Improve
vsyscall emulation CS and RIP handling"
(c9712944b2a12373cb6ff8059afcfb7e826a6c54), vsyscalls will segfault
if called with Xen's extra CS selector.  This causes a panic when
older init builds die.

It seems impossible to make Xen use __USER_CS reliably without
taking a performance hit on every system call, so this fixes the
tests instead with a new paravirt op.  It's a little ugly because
ptrace.h can't include paravirt.h.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/include/asm/desc.h           |    4 ++--
 arch/x86/include/asm/paravirt_types.h |    6 ++++++
 arch/x86/include/asm/ptrace.h         |   19 +++++++++++++++++++
 arch/x86/kernel/paravirt.c            |    4 ++++
 arch/x86/kernel/step.c                |    2 +-
 arch/x86/kernel/vsyscall_64.c         |    6 +-----
 arch/x86/mm/fault.c                   |    2 +-
 arch/x86/xen/enlighten.c              |    1 +
 8 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 7b439d9..41935fa 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -27,8 +27,8 @@ static inline void fill_ldt(struct desc_struct *desc, const struct user_desc *in
 
 	desc->base2		= (info->base_addr & 0xff000000) >> 24;
 	/*
-	 * Don't allow setting of the lm bit. It is useless anyway
-	 * because 64bit system calls require __USER_CS:
+	 * Don't allow setting of the lm bit. It would confuse
+	 * user_64bit_mode and would get overridden by sysret anyway.
 	 */
 	desc->l			= 0;
 }
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 2c76521..8e8b9a4 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -41,6 +41,7 @@
 
 #include <asm/desc_defs.h>
 #include <asm/kmap_types.h>
+#include <asm/pgtable_types.h>
 
 struct page;
 struct thread_struct;
@@ -63,6 +64,11 @@ struct paravirt_callee_save {
 struct pv_info {
 	unsigned int kernel_rpl;
 	int shared_kernel_pmd;
+
+#ifdef CONFIG_X86_64
+	u16 extra_user_64bit_cs;  /* __USER_CS if none */
+#endif
+
 	int paravirt_enabled;
 	const char *name;
 };
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 94e7618..3566454 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -131,6 +131,9 @@ struct pt_regs {
 #ifdef __KERNEL__
 
 #include <linux/init.h>
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt_types.h>
+#endif
 
 struct cpuinfo_x86;
 struct task_struct;
@@ -187,6 +190,22 @@ static inline int v8086_mode(struct pt_regs *regs)
 #endif
 }
 
+#ifdef CONFIG_X86_64
+static inline bool user_64bit_mode(struct pt_regs *regs)
+{
+#ifndef CONFIG_PARAVIRT
+	/*
+	 * On non-paravirt systems, this is the only long mode CPL 3
+	 * selector.  We do not allow long mode selectors in the LDT.
+	 */
+	return regs->cs == __USER_CS;
+#else
+	/* Headers are too twisted for this to go in paravirt.h. */
+	return regs->cs == __USER_CS || regs->cs == pv_info.extra_user_64bit_cs;
+#endif
+}
+#endif
+
 /*
  * X86_32 CPUs don't save ss and esp if the CPU is already in kernel mode
  * when it traps.  The previous stack will be directly underneath the saved
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 613a793..d90272e 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -307,6 +307,10 @@ struct pv_info pv_info = {
 	.paravirt_enabled = 0,
 	.kernel_rpl = 0,
 	.shared_kernel_pmd = 1,	/* Only used when CONFIG_X86_PAE is set */
+
+#ifdef CONFIG_X86_64
+	.extra_user_64bit_cs = __USER_CS,
+#endif
 };
 
 struct pv_init_ops pv_init_ops = {
diff --git a/arch/x86/kernel/step.c b/arch/x86/kernel/step.c
index 7977f0c..c346d11 100644
--- a/arch/x86/kernel/step.c
+++ b/arch/x86/kernel/step.c
@@ -74,7 +74,7 @@ static int is_setting_trap_flag(struct task_struct *child, struct pt_regs *regs)
 
 #ifdef CONFIG_X86_64
 		case 0x40 ... 0x4f:
-			if (regs->cs != __USER_CS)
+			if (!user_64bit_mode(regs))
 				/* 32-bit mode: register increment */
 				return 0;
 			/* 64-bit mode: REX prefix */
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index dda7dff..1725930 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -127,11 +127,7 @@ void dotraplinkage do_emulate_vsyscall(struct pt_regs *regs, long error_code)
 
 	local_irq_enable();
 
-	/*
-	 * Real 64-bit user mode code has cs == __USER_CS.  Anything else
-	 * is bogus.
-	 */
-	if (regs->cs != __USER_CS) {
+	if (!user_64bit_mode(regs)) {
 		/*
 		 * If we trapped from kernel mode, we might as well OOPS now
 		 * instead of returning to some random address and OOPSing
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 4d09df0..decd51a 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -105,7 +105,7 @@ check_prefetch_opcode(struct pt_regs *regs, unsigned char *instr,
 		 * but for now it's good enough to assume that long
 		 * mode only uses well known segments or kernel.
 		 */
-		return (!user_mode(regs)) || (regs->cs == __USER_CS);
+		return (!user_mode(regs) || user_64bit_mode(regs));
 #endif
 	case 0x60:
 		/* 0x64 thru 0x67 are valid prefixes in all modes. */
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 974a528..a9c710a 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -950,6 +950,7 @@ static unsigned xen_patch(u8 type, u16 clobbers, void *insnbuf,
 static const struct pv_info xen_info __initconst = {
 	.paravirt_enabled = 1,
 	.shared_kernel_pmd = 0,
+	.extra_user_64bit_cs = FLAT_USER_CS64,
 
 	.name = "Xen",
 };
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/5] Collected vdso/vsyscall fixes for 3.1
  2011-07-27  3:20 [PATCH 0/5] Collected vdso/vsyscall fixes for 3.1 Andy Lutomirski
                   ` (4 preceding siblings ...)
  2011-07-27  3:20 ` [PATCH 5/5] x86-64: Add user_64bit_mode paravirt op Andy Lutomirski
@ 2011-07-27 12:59 ` Konrad Rzeszutek Wilk
  2011-07-27 14:57 ` Konrad Rzeszutek Wilk
  6 siblings, 0 replies; 20+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-07-27 12:59 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Linux Kernel Mailing List, jeremy, keir.xen, xen-devel,
	virtualization

On Tue, Jul 26, 2011 at 11:20:34PM -0400, Andy Lutomirski wrote:
> This fixes various problems that cropped up with the vdso patches.
> 
>  - Patch 1 fixes an information leak to userspace.
>  - Patches 2 and 3 fix the kernel build on gold.
>  - Patches 4 and 5 fix Xen (I hope).
> 
> Konrad, could you could test these on Xen and run 'test_vsyscall test' [1]?
> I don't have a usable Xen setup.

Sure.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/5] x86-64/xen: Enable the vvar mapping
  2011-07-27  3:20 ` [PATCH 4/5] x86-64/xen: Enable the vvar mapping Andy Lutomirski
@ 2011-07-27 13:06   ` Konrad Rzeszutek Wilk
  2011-07-27 13:48     ` Andrew Lutomirski
  0 siblings, 1 reply; 20+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-07-27 13:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Linux Kernel Mailing List, jeremy, keir.xen, xen-devel,
	virtualization

On Tue, Jul 26, 2011 at 11:20:38PM -0400, Andy Lutomirski wrote:
> Xen needs special treatment for fixmaps.

The description needs a bit more, for example:

"Xen needs to handle the newly introduced VVAR_PAGE introduced by
git commit 9fd67b4ed0714ab718f1f9bd14c344af336a6df7
"x86-64: Give vvars their own page"

Otherwise we die during bootup with this:

<and include snippets of the boot message crash>
"

With that included it looks good to me.

> 
> Signed-off-by: Andy Lutomirski <luto@mit.edu>
> Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  arch/x86/xen/mmu.c |    4 +++-
>  1 files changed, 3 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index f987bde..8cce339 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -1916,6 +1916,7 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
>  # endif
>  #else
>  	case VSYSCALL_LAST_PAGE ... VSYSCALL_FIRST_PAGE:
> +	case VVAR_PAGE:
>  #endif
>  	case FIX_TEXT_POKE0:
>  	case FIX_TEXT_POKE1:
> @@ -1956,7 +1957,8 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
>  #ifdef CONFIG_X86_64
>  	/* Replicate changes to map the vsyscall page into the user
>  	   pagetable vsyscall mapping. */
> -	if (idx >= VSYSCALL_LAST_PAGE && idx <= VSYSCALL_FIRST_PAGE) {
> +	if ((idx >= VSYSCALL_LAST_PAGE && idx <= VSYSCALL_FIRST_PAGE) ||
> +	    idx == VVAR_PAGE) {
>  		unsigned long vaddr = __fix_to_virt(idx);
>  		set_pte_vaddr_pud(level3_user_vsyscall, vaddr, pte);
>  	}
> -- 
> 1.7.6

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/5] x86-64/xen: Enable the vvar mapping
  2011-07-27 13:06   ` Konrad Rzeszutek Wilk
@ 2011-07-27 13:48     ` Andrew Lutomirski
  0 siblings, 0 replies; 20+ messages in thread
From: Andrew Lutomirski @ 2011-07-27 13:48 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: x86, Linux Kernel Mailing List, jeremy, keir.xen, xen-devel,
	virtualization

On Wed, Jul 27, 2011 at 9:06 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Tue, Jul 26, 2011 at 11:20:38PM -0400, Andy Lutomirski wrote:
>> Xen needs special treatment for fixmaps.
>
> The description needs a bit more, for example:
>
> "Xen needs to handle the newly introduced VVAR_PAGE introduced by
> git commit 9fd67b4ed0714ab718f1f9bd14c344af336a6df7
> "x86-64: Give vvars their own page"

Will do.

--Andy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/5] Collected vdso/vsyscall fixes for 3.1
  2011-07-27  3:20 [PATCH 0/5] Collected vdso/vsyscall fixes for 3.1 Andy Lutomirski
                   ` (5 preceding siblings ...)
  2011-07-27 12:59 ` [PATCH 0/5] Collected vdso/vsyscall fixes for 3.1 Konrad Rzeszutek Wilk
@ 2011-07-27 14:57 ` Konrad Rzeszutek Wilk
  2011-07-27 15:04   ` Andrew Lutomirski
  6 siblings, 1 reply; 20+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-07-27 14:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Linux Kernel Mailing List, jeremy, keir.xen, xen-devel,
	virtualization

On Tue, Jul 26, 2011 at 11:20:34PM -0400, Andy Lutomirski wrote:
> This fixes various problems that cropped up with the vdso patches.
> 
>  - Patch 1 fixes an information leak to userspace.
>  - Patches 2 and 3 fix the kernel build on gold.
>  - Patches 4 and 5 fix Xen (I hope).
> 
> Konrad, could you could test these on Xen and run 'test_vsyscall test' [1]?

They boot 64-bit guest succesfully.

But I doesn't compile under 32-bit:

home/konrad/ssd/linux/arch/x86/xen/enlighten.c:953: error: unknown field ‘extra_user_64bit_cs’ specified in initializer
/home/konrad/ssd/linux/arch/x86/xen/enlighten.c:953: error: ‘FLAT_USER_CS64’ undeclared here (not in a function)

Looks like it needs some #ifdef CONFIG_X86_64 magic.. and after
applying that magic dust it compiles and it also boots as 32-bit
(no surprise there).

> I don't have a usable Xen setup.

It is pretty easy to setup. Google for PVops Wiki and you will find wealth
of information. FYI: I am gone next week so won't be able to test these
patches.

> 
> Also, I'd appreciate a review of patches 4 and 5 from some Xen/paravirt
> people.
> 
> [1] https://gitorious.org/linux-test-utils/linux-clock-tests

Grrrrr..

g++ -o test_vsyscall -std=gnu++0x -lrt -ldl -O2 -Wall -mavx -g test_vsyscall.cc
test_vsyscall.cc: In function ‘int bench(int, char**)’:
test_vsyscall.cc:205: error: expected primary-expression before ‘[’ token
test_vsyscall.cc:205: error: expected primary-expression before ‘]’ token
test_vsyscall.cc:206: error: expected primary-expression before ‘[’ token
test_vsyscall.cc:206: error: expected primary-expression before ‘]’ token
test_vsyscall.cc:207: error: expected primary-expression before ‘[’ token
test_vsyscall.cc:207: error: expected primary-expression before ‘]’ token
test_vsyscall.cc:211: error: expected primary-expression before ‘[’ token
test_vsyscall.cc:211: error: expected primary-expression before ‘]’ token
test_vsyscall.cc:213: error: expected primary-expression before ‘[’ token
test_vsyscall.cc:213: error: expected primary-expression before ‘]’ token
test_vsyscall.cc:214: error: expected primary-expression before ‘[’ token
test_vsyscall.cc:214: error: expected primary-expression before ‘]’ token
test_vsyscall.cc:218: error: expected primary-expression before ‘[’ token
test_vsyscall.cc:218: error: expected primary-expression before ‘]’ token
test_vsyscall.cc:219: error: expected primary-expression before ‘[’ token
test_vsyscall.cc:219: error: expected primary-expression before ‘]’ token
test_vsyscall.cc:222: error: expected primary-expression before ‘[’ token
test_vsyscall.cc:222: error: expected primary-expression before ‘]’ token
test_vsyscall.cc:203: warning: unused variable ‘tv’
test_vsyscall.cc:204: warning: unused variable ‘tz’
test_vsyscall.cc:210: warning: unused variable ‘t’
test_vsyscall.cc:217: warning: unused variable ‘cpu’
test_vsyscall.cc:217: warning: unused variable ‘node’

Is there a specific version of GCC I should be using? I seem to be
using: g++ (GCC) 4.4.4 20100503 (Red Hat 4.4.4-2)

Anyhow, removed the benchmark code and ran it on 64-bit:

sh-4.1# /test_vsyscall  test
Testing gettimeofday...
[  109.552261] test_vsyscall[2462] trap invalid opcode ip:400c8d sp:7fff84fab470 error:0 in test_vsyscall[400000+2000]
Illegal instruction
sh-4.1# /test_vsyscall  intcc
About to execute int 0xcc from RIP = 400959
[  114.137150] test_vsyscall[2463] illegal int 0xcc (exploit attempt?) ip:400959 cs:e033 sp:7fff8b328310 ax:2c si:0 di:7fff8b3280f0
Caught SIGSEGV: Segmentation fault (Signal sent by the kernel [(nil)])RIP = 400959

[This is on git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git #testing, which
has todays linus/master and your patchset]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/5] Collected vdso/vsyscall fixes for 3.1
  2011-07-27 14:57 ` Konrad Rzeszutek Wilk
@ 2011-07-27 15:04   ` Andrew Lutomirski
  2011-07-27 15:30     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Lutomirski @ 2011-07-27 15:04 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: x86, Linux Kernel Mailing List, jeremy, keir.xen, xen-devel,
	virtualization

On Wed, Jul 27, 2011 at 10:57 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Tue, Jul 26, 2011 at 11:20:34PM -0400, Andy Lutomirski wrote:
>> This fixes various problems that cropped up with the vdso patches.
>>
>>  - Patch 1 fixes an information leak to userspace.
>>  - Patches 2 and 3 fix the kernel build on gold.
>>  - Patches 4 and 5 fix Xen (I hope).
>>
>> Konrad, could you could test these on Xen and run 'test_vsyscall test' [1]?
>
> They boot 64-bit guest succesfully.
>
> But I doesn't compile under 32-bit:
>
> home/konrad/ssd/linux/arch/x86/xen/enlighten.c:953: error: unknown field ‘extra_user_64bit_cs’ specified in initializer
> /home/konrad/ssd/linux/arch/x86/xen/enlighten.c:953: error: ‘FLAT_USER_CS64’ undeclared here (not in a function)
>
> Looks like it needs some #ifdef CONFIG_X86_64 magic.. and after
> applying that magic dust it compiles and it also boots as 32-bit
> (no surprise there).

Whoops!  I thought xen/enlighten.c was 64-bit only.

>
>> I don't have a usable Xen setup.
>
> It is pretty easy to setup. Google for PVops Wiki and you will find wealth
> of information. FYI: I am gone next week so won't be able to test these
> patches.
>
>>
>> Also, I'd appreciate a review of patches 4 and 5 from some Xen/paravirt
>> people.
>>
>> [1] https://gitorious.org/linux-test-utils/linux-clock-tests
>
> Grrrrr..
>
> g++ -o test_vsyscall -std=gnu++0x -lrt -ldl -O2 -Wall -mavx -g test_vsyscall.cc
> test_vsyscall.cc: In function ‘int bench(int, char**)’:
> test_vsyscall.cc:205: error: expected primary-expression before ‘[’ token

[...]

>
> Is there a specific version of GCC I should be using? I seem to be
> using: g++ (GCC) 4.4.4 20100503 (Red Hat 4.4.4-2)

Apparently it needs g++ 4.5 for lambdas.

>
> Anyhow, removed the benchmark code and ran it on 64-bit:
>
> sh-4.1# /test_vsyscall  test
> Testing gettimeofday...
> [  109.552261] test_vsyscall[2462] trap invalid opcode ip:400c8d sp:7fff84fab470 error:0 in test_vsyscall[400000+2000]
> Illegal instruction
> sh-4.1# /test_vsyscall  intcc
> About to execute int 0xcc from RIP = 400959
> [  114.137150] test_vsyscall[2463] illegal int 0xcc (exploit attempt?) ip:400959 cs:e033 sp:7fff8b328310 ax:2c si:0 di:7fff8b3280f0
> Caught SIGSEGV: Segmentation fault (Signal sent by the kernel [(nil)])RIP = 400959
>
> [This is on git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git #testing, which
> has todays linus/master and your patchset]
>

I'll set up Xen.  Something's clearly still buggy.

--Andy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/5] Collected vdso/vsyscall fixes for 3.1
  2011-07-27 15:04   ` Andrew Lutomirski
@ 2011-07-27 15:30     ` Konrad Rzeszutek Wilk
  2011-07-27 15:34       ` Andrew Lutomirski
  0 siblings, 1 reply; 20+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-07-27 15:30 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: x86, Linux Kernel Mailing List, jeremy, keir.xen, xen-devel,
	virtualization

> > Anyhow, removed the benchmark code and ran it on 64-bit:
> >
> > sh-4.1# /test_vsyscall  test
> > Testing gettimeofday...
> > [  109.552261] test_vsyscall[2462] trap invalid opcode ip:400c8d sp:7fff84fab470 error:0 in test_vsyscall[400000+2000]
> > Illegal instruction
> > sh-4.1# /test_vsyscall  intcc
> > About to execute int 0xcc from RIP = 400959
> > [  114.137150] test_vsyscall[2463] illegal int 0xcc (exploit attempt?) ip:400959 cs:e033 sp:7fff8b328310 ax:2c si:0 di:7fff8b3280f0
> > Caught SIGSEGV: Segmentation fault (Signal sent by the kernel [(nil)])RIP = 400959
> >
> > [This is on git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git #testing, which
> > has todays linus/master and your patchset]
> >
> 
> I'll set up Xen.  Something's clearly still buggy.

You sure? This is what I get when I boot baremetal:

sh-4.1# 
sh-4.1# xen-detect 
Not running on Xen.
sh-4.1# /test_vsyscall test
Testing gettimeo[   84.442819] test_vsyscall[3175] trap invalid opcode ip:400c8d sp:7fffa8a72dc0 error:0fday...
 in test_vsyscall[400000+2000]
Illegal instruction
sh-4.1# /test_vsyscall intcc
About to execute[   87.549820] test_vsyscall[3176] illegal int 0xcc (exploit attempt?) ip:400959 cs:33 sp:7fff0ccddff0 ax:2c s^G^G^G^G^G^G^Gsh-4.1# 
sh-4.1# /test_vsyscall intcc
About to execute[   90.283817] test_vsyscall[3177] illegal int 0xcc (exploit attempt?) ip:400959 cs:33 sp:7fffae8a8b40 ax:2c son fault (Signal sent by the kernel [(nil)])RIP = 400959

Unless the whole paravirt kernel is buggy. Hadn't tried to boot non-paravirt.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/5] Collected vdso/vsyscall fixes for 3.1
  2011-07-27 15:30     ` Konrad Rzeszutek Wilk
@ 2011-07-27 15:34       ` Andrew Lutomirski
  2011-07-27 15:43         ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Lutomirski @ 2011-07-27 15:34 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: x86, Linux Kernel Mailing List, jeremy, keir.xen, xen-devel,
	virtualization

On Wed, Jul 27, 2011 at 11:30 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
>> > Anyhow, removed the benchmark code and ran it on 64-bit:
>> >
>> > sh-4.1# /test_vsyscall  test
>> > Testing gettimeofday...
>> > [  109.552261] test_vsyscall[2462] trap invalid opcode ip:400c8d sp:7fff84fab470 error:0 in test_vsyscall[400000+2000]
>> > Illegal instruction
>> > sh-4.1# /test_vsyscall  intcc
>> > About to execute int 0xcc from RIP = 400959
>> > [  114.137150] test_vsyscall[2463] illegal int 0xcc (exploit attempt?) ip:400959 cs:e033 sp:7fff8b328310 ax:2c si:0 di:7fff8b3280f0
>> > Caught SIGSEGV: Segmentation fault (Signal sent by the kernel [(nil)])RIP = 400959
>> >
>> > [This is on git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git #testing, which
>> > has todays linus/master and your patchset]
>> >
>>
>> I'll set up Xen.  Something's clearly still buggy.
>
> You sure? This is what I get when I boot baremetal:
>
> sh-4.1#
> sh-4.1# xen-detect
> Not running on Xen.
> sh-4.1# /test_vsyscall test
> Testing gettimeo[   84.442819] test_vsyscall[3175] trap invalid opcode ip:400c8d sp:7fffa8a72dc0 error:0fday...
>  in test_vsyscall[400000+2000]

$ test_vsyscall test
Testing gettimeofday...
  vDSO offset = 0.000001s
  vsyscall offset = 0.000001s

Testing time...
  vDSO offset = 0
  vsyscall offset = 0
Testing getcpu...
  ok!  cpu=6 node=0

Can you send me your test_vsyscall binary so I can disassemble it?

--Andy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/5] Collected vdso/vsyscall fixes for 3.1
  2011-07-27 15:34       ` Andrew Lutomirski
@ 2011-07-27 15:43         ` Konrad Rzeszutek Wilk
  2011-07-27 16:15           ` Andrew Lutomirski
  0 siblings, 1 reply; 20+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-07-27 15:43 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: x86, Linux Kernel Mailing List, jeremy, keir.xen, xen-devel,
	virtualization

[-- Attachment #1: Type: text/plain, Size: 1796 bytes --]

On Wed, Jul 27, 2011 at 11:34:21AM -0400, Andrew Lutomirski wrote:
> On Wed, Jul 27, 2011 at 11:30 AM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
> >> > Anyhow, removed the benchmark code and ran it on 64-bit:
> >> >
> >> > sh-4.1# /test_vsyscall  test
> >> > Testing gettimeofday...
> >> > [  109.552261] test_vsyscall[2462] trap invalid opcode ip:400c8d sp:7fff84fab470 error:0 in test_vsyscall[400000+2000]
> >> > Illegal instruction
> >> > sh-4.1# /test_vsyscall  intcc
> >> > About to execute int 0xcc from RIP = 400959
> >> > [  114.137150] test_vsyscall[2463] illegal int 0xcc (exploit attempt?) ip:400959 cs:e033 sp:7fff8b328310 ax:2c si:0 di:7fff8b3280f0
> >> > Caught SIGSEGV: Segmentation fault (Signal sent by the kernel [(nil)])RIP = 400959
> >> >
> >> > [This is on git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git #testing, which
> >> > has todays linus/master and your patchset]
> >> >
> >>
> >> I'll set up Xen.  Something's clearly still buggy.
> >
> > You sure? This is what I get when I boot baremetal:
> >
> > sh-4.1#
> > sh-4.1# xen-detect
> > Not running on Xen.
> > sh-4.1# /test_vsyscall test
> > Testing gettimeo[   84.442819] test_vsyscall[3175] trap invalid opcode ip:400c8d sp:7fffa8a72dc0 error:0fday...
> >  in test_vsyscall[400000+2000]
> 
> $ test_vsyscall test
> Testing gettimeofday...
>   vDSO offset = 0.000001s
>   vsyscall offset = 0.000001s
> 
> Testing time...
>   vDSO offset = 0
>   vsyscall offset = 0
> Testing getcpu...
>   ok!  cpu=6 node=0
> 
> Can you send me your test_vsyscall binary so I can disassemble it?

Here it is (also including source since I uncommented parts of it).

One extra thing - I've been using AMD machines for this - I hadn't
tried this on an Intel box.

[-- Attachment #2: test_vsyscall --]
[-- Type: application/octet-stream, Size: 27973 bytes --]

[-- Attachment #3: test_vsyscall.cc --]
[-- Type: text/x-c++src, Size: 9658 bytes --]

#define _POSIX_SOURCE

#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <dlfcn.h>
#include <string.h>
#include <inttypes.h>
#include <signal.h>
#include <sys/ucontext.h>
#include <asm/ldt.h>
#include <errno.h>

static inline int modify_ldt(int mode, void *ptr, unsigned long size)
{
  int ret = syscall(__NR_modify_ldt, mode, ptr, size);
  if (ret != 0)
    errno = -ret;
  return (ret == 0 ? 0 : -1);
}

/* vsyscalls and vDSO */
typedef long (*gtod_t)(struct timeval *tv, struct timezone *tz);
const gtod_t vgtod = (gtod_t)0xffffffffff600000;
gtod_t vdso_gtod;

typedef long (*time_func_t)(time_t *t);
const time_func_t vtime = (time_func_t)0xffffffffff600400;
time_func_t vdso_time;

typedef long (*getcpu_t)(unsigned *, unsigned *, struct getcpu_cache*);
const getcpu_t vgetcpu = (getcpu_t)0xffffffffff600800;
getcpu_t vdso_getcpu;

void init_vdso()
{
  void *vdso = dlopen("linux-vdso.so.1", RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
  if (!vdso) {
    printf("Warning: failed to find vDSO\n");
    return;
  }

  vdso_gtod = (gtod_t)dlsym(vdso, "gettimeofday");
  if (!vdso_gtod)
    printf("Warning: failed to find gettimeofday in vDSO\n");

  vdso_time = (time_func_t)dlsym(vdso, "time");
  if (!vdso_time)
    printf("Warning: failed to find time in vDSO\n");

  vdso_getcpu = (getcpu_t)dlsym(vdso, "getcpu");
  if (!vdso_getcpu)
    printf("Warning: failed to find getcpu in vDSO\n");
}

/* syscalls */
static inline long sys_gtod(struct timeval *tv, struct timezone *tz)
{
  return syscall(__NR_gettimeofday, tv, tz);
}

static inline long sys_time(time_t *t)
{
  return syscall(__NR_time, t);
}

/* There is no sys_getcpu. */

static void segv(int sig, siginfo_t *info, void *ctx_void)
{
  psiginfo(info, "Caught SIGSEGV");

  ucontext_t *ctx = (ucontext_t*)ctx_void;
  printf("RIP = %lx\n", ctx->uc_mcontext.gregs[REG_RIP]);

  exit(1);
}

#if 0
/* benchmark helper */
template<typename Func> void benchmark(const char *desc, Func f)
{
  struct timespec start, end;
  long loops = 0;

  printf("Benchmarking %s ... ", desc);
  fflush(stdout);

  if (clock_gettime(CLOCK_MONOTONIC, &start)) {
    perror("clock_gettime");
    exit(1);
  }

  while(true)
    {
      long loops_now = 1000;
      for(int i = 0; i < loops_now; i++)
	f();
      loops += loops_now;

      if (clock_gettime(CLOCK_MONOTONIC, &end)) {
	perror("clock_gettime");
	exit(1);
      }

      unsigned long long duration = (end.tv_nsec - start.tv_nsec) +
	1000000000ULL * (end.tv_sec - start.tv_sec);

      if (duration < 500000000ULL)
	continue;

      printf("%9ld loops in %.5fs = %7.2f nsec / loop\n",
	     loops, float(duration) * 1e-9,
	     float(duration) / loops);
      break;
    }
}
#endif
static double tv_diff(const struct timeval &a, const struct timeval &b)
{
  return double(a.tv_sec - b.tv_sec) +
    double((int)a.tv_usec - (int)b.tv_usec) * 1e-6;
}

int test(int argc, char **argv)
{
  printf("Testing gettimeofday...\n");
  struct timeval tv_sys, tv_vdso, tv_vsys;
  struct timezone tz_sys, tz_vdso, tz_vsys;
  int ret_sys = sys_gtod(&tv_sys, &tz_sys);
  int ret_vdso = -1;
  if (vdso_gtod)
    ret_vdso = vdso_gtod(&tv_vdso, &tz_vdso);
  int ret_vsys = vgtod(&tv_vsys, &tz_vsys);

  if (ret_sys) {
    printf("  syscall failed\n");
  } else {
    if (ret_vdso == 0) {
      if (tz_sys.tz_minuteswest != tz_vdso.tz_minuteswest || tz_sys.tz_dsttime != tz_vdso.tz_dsttime)
	printf("  vDSO tz mismatch\n");
      else
	printf("  vDSO offset = %.6fs\n", tv_diff(tv_vdso, tv_sys));
    } else if (vdso_gtod) {
      printf("  vDSO failed\n");
    }
    if (ret_vsys == 0) {
      if (tz_sys.tz_minuteswest != tz_vsys.tz_minuteswest || tz_sys.tz_dsttime != tz_vsys.tz_dsttime)
	printf("  vsyscall tz mismatch\n");
      else
	printf("  vsyscall offset = %.6fs\n", tv_diff(tv_vsys, tv_sys));
    }
  }

  printf("\nTesting time...\n");
  long t_sys, t_vdso = 0, t_vsys; 
  long t2_sys = -1, t2_vdso = -1, t2_vsys = -1;
  t_sys = sys_time(&t2_sys);
  if (vdso_time)
    t_vdso = vdso_time(&t2_vdso);
  t_vsys = vtime(&t2_vsys);
  if (t_sys < 0 || t_sys != t2_sys) {
    printf("  syscall failed (ret:%ld output:%ld)\n", t_sys, t2_sys);
  } else {
    if (vdso_time) {
      if (t_vdso < 0 || t_vdso != t2_vdso)
	printf("  vDSO failed (ret:%ld output:%ld)\n", t_vdso, t2_vdso);
      else
	printf("  vDSO offset = %ld\n", t_vdso - t_sys);
    }

    if (t_vsys < 0 || t_vsys != t2_vsys)
      printf("  vsyscall failed (ret:%ld output:%ld)\n", t_vsys, t2_vsys);
    else
      printf("  vsyscall offset = %ld\n", t_vsys - t_sys);
  }

  printf("Testing getcpu...\n");
  unsigned cpu_vdso, cpu_vsys, node_vdso, node_vsys;
  ret_vdso = vdso_getcpu(&cpu_vdso, &node_vdso, 0);
  ret_vsys = vgetcpu(&cpu_vsys, &node_vsys, 0);
  if (ret_vdso)
    printf("  vDSO failed (ret:%ld)\n", (unsigned long)ret_vdso);
  if (ret_vsys)
    printf("  vsyscall failed (ret:%ld)\n", (unsigned long)ret_vdso);
  if (ret_vdso == 0 && ret_vsys == 0) {
    if (cpu_vdso != cpu_vsys)
      printf("  cpu mismatch (vdso:%u vsyscall:%u)!\n", cpu_vdso, cpu_vsys);
    else if (node_vdso != node_vsys)
      printf("  node mismatch (vdso:%u vsyscall:%u)!\n", node_vdso, node_vsys);
    else
      printf("  ok!  cpu=%u node=%u\n", cpu_vdso, node_vdso);
  }

  return 0;
}

int bench(int argc, char **argv)
{
  struct timeval tv;
  struct timezone tz;
#if 0
  benchmark(" syscall gettimeofday", [&]{sys_gtod(&tv, &tz);});
  benchmark("    vdso gettimeofday", [&]{vdso_gtod(&tv, &tz);});
  benchmark("vsyscall gettimeofday", [&]{vgtod(&tv, &tz);});

  printf("\n");
  time_t t;
  benchmark(" syscall time        ", [&]{sys_time(&t);});
  if (vdso_time)
    benchmark("    vdso time        ", [&]{vdso_time(&t);});
  benchmark("vsyscall time        ", [&]{vtime(&t);});

  printf("\n");
  unsigned cpu, node;
  benchmark("    vdso getcpu      ", [&]{vdso_getcpu(&cpu, &node, 0);});
  benchmark("vsyscall getcpu      ", [&]{vgetcpu(&cpu, &node, 0);});

  printf("\n");
  benchmark("dummy syscall        ", [&]{syscall(0xffffffff);});
#endif
  return 0;
}

int call(int argc, char **argv)
{
  if (argc != 5) {
    printf("Usage: call <addr> <rax> <arg1> <arg2> <arg3>\n");
    return 1;
  }

  unsigned long addr, rax, arg1, arg2, arg3;
  char *end;
  addr = strtoull(argv[0], &end, 0);
  if (*end)
    goto bad;

  rax = strtoull(argv[1], &end, 0);
  if (*end)
    goto bad;

  arg1 = strtoull(argv[2], &end, 0);
  if (*end)
    goto bad;

  arg2 = strtoull(argv[3], &end, 0);
  if (*end)
    goto bad;

  arg3 = strtoull(argv[4], &end, 0);
  if (*end)
    goto bad;

  unsigned long ret;
  asm volatile("call *%[addr]" : "=a" (ret) : [addr] "rm" (addr), "a" (rax),
	       "D" (arg1), "S" (arg2), "d" (arg3));
  printf("Return value = %ld\n", ret);

  return 0;

 bad:
  printf("Bad arg\n");
  return 1;
}

int intcc(int argc, char **argv)
{
  if (argc != 0) {
    printf("Usage: intcc\n");
    return 1;
  }

  extern char intcc_addr;
  printf("About to execute int 0xcc from RIP = %lX\n",
	 (unsigned long)&intcc_addr);

  asm volatile ("intcc_addr: int $0xcc");
  return 0;
}

struct __attribute__((packed)) farptr {
  uint32_t offset;
  uint16_t sel;
};

static bool to_farptr(farptr *out, uint16_t sel, void *offset)
{
  out->sel = sel;
  out->offset = (uint32_t)(unsigned long)offset;
  return out->offset == (unsigned long)offset;
}

int intcc32(int argc, char **argv)
{
  if (argc != 0) {
    printf("Usage: intcc32\n");
    return 1;
  }

  // Install a 32-bit code descriptor
  struct user_desc desc;
  memset(&desc, 0, sizeof(desc));
  desc.entry_number = 0;
  desc.base_addr = 0;
  desc.limit = 0xFFFFF;
  desc.seg_32bit = 1;
  desc.contents = MODIFY_LDT_CONTENTS_CODE;
  desc.limit_in_pages = 1;

  if (modify_ldt(1, &desc, sizeof(desc)) != 0) {
    perror("modify_ldt");
    return 1;
  }

  /* Load the initial CS. */
  uint16_t initial_cs;
  asm ("mov %%cs,%[initial_cs]" : [initial_cs] "=rm" (initial_cs));
  printf("Initial CS = 0x%04X (entry %d)\n",
	 (unsigned)initial_cs, (int)(initial_cs >> 3));

  extern char landing_32, landing_64;

  /* Set up the pointers. */
  static farptr ptr32, ptr64;
  if (!to_farptr(&ptr32, 0x4, &landing_32) || !to_farptr(&ptr64, initial_cs, &landing_64)) {
    printf("Something's mapped too high\n");
    return 1;
  }

  /* Go for it! */
  asm volatile (
		"mov %%rsp,%%rsi\n"		// Save rsp (avoids truncation).
		"ljmpl *(%%eax)\n"		// Switch to 32-bit mode.

		// 32-bit mode!
		// (Well, sort of.  DS and ES are 0, so we can't use them.)
		".code32\n"
		"landing_32:\n"
		"\tint $0xcc\n"			// Try int 0xcc.
		"\tljmpl *%%cs:(%%ecx)\n"	// Switch back.

		// 64-bit mode again!
		".code64\n"
		"landing_64:\n"
		"\tmov %%rsi,%%rsp"
		:
		: "a" (&ptr32), "c" (&ptr64)
		: "rsi", "cc");

  printf("Holy cow!  We survived!\n");

  return 0;
}

int main(int argc, char **argv)
{
  struct sigaction sa_segv;
  memset(&sa_segv, 0, sizeof(sa_segv));
  sa_segv.sa_sigaction = segv;
  sa_segv.sa_flags = SA_SIGINFO;
  sigemptyset(&sa_segv.sa_mask);
  if (sigaction(SIGSEGV, &sa_segv, 0))
    perror("sigaction");

  init_vdso();
  if (argc < 2) {
    printf("Usage: test_vsyscall <command> ...\n"
	   "command := { test, bench, intcc, call }\n");
    return 1;
  }

  if (!strcmp(argv[1], "test"))
    return test(argc - 2, argv + 2);
  if (!strcmp(argv[1], "bench"))
    return bench(argc - 2, argv + 2);
  if (!strcmp(argv[1], "intcc"))
    return intcc(argc - 2, argv + 2);
  if (!strcmp(argv[1], "intcc32"))
    return intcc32(argc - 2, argv + 2);
  if (!strcmp(argv[1], "call"))
    return call(argc - 2, argv + 2);

  printf("Unknown command\n");
  return 1;
}

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/5] Collected vdso/vsyscall fixes for 3.1
  2011-07-27 15:43         ` Konrad Rzeszutek Wilk
@ 2011-07-27 16:15           ` Andrew Lutomirski
  2011-07-27 16:29             ` [Xen-devel] " Konrad Rzeszutek Wilk
  2011-07-27 16:58             ` Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 20+ messages in thread
From: Andrew Lutomirski @ 2011-07-27 16:15 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: x86, Linux Kernel Mailing List, jeremy, keir.xen, xen-devel,
	virtualization

On Wed, Jul 27, 2011 at 11:43 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Wed, Jul 27, 2011 at 11:34:21AM -0400, Andrew Lutomirski wrote:
>> On Wed, Jul 27, 2011 at 11:30 AM, Konrad Rzeszutek Wilk
>> <konrad.wilk@oracle.com> wrote:
>> >> > Anyhow, removed the benchmark code and ran it on 64-bit:
>> >> >
>> >> > sh-4.1# /test_vsyscall  test
>> >> > Testing gettimeofday...
>> >> > [  109.552261] test_vsyscall[2462] trap invalid opcode ip:400c8d sp:7fff84fab470 error:0 in test_vsyscall[400000+2000]
>> >> > Illegal instruction
>> >> > sh-4.1# /test_vsyscall  intcc
>> >> > About to execute int 0xcc from RIP = 400959
>> >> > [  114.137150] test_vsyscall[2463] illegal int 0xcc (exploit attempt?) ip:400959 cs:e033 sp:7fff8b328310 ax:2c si:0 di:7fff8b3280f0
>> >> > Caught SIGSEGV: Segmentation fault (Signal sent by the kernel [(nil)])RIP = 400959
>> >> >
>> >> > [This is on git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git #testing, which
>> >> > has todays linus/master and your patchset]
>> >> >
>> >>
>> >> I'll set up Xen.  Something's clearly still buggy.
>> >
>> > You sure? This is what I get when I boot baremetal:
>> >
>> > sh-4.1#
>> > sh-4.1# xen-detect
>> > Not running on Xen.
>> > sh-4.1# /test_vsyscall test
>> > Testing gettimeo[   84.442819] test_vsyscall[3175] trap invalid opcode ip:400c8d sp:7fffa8a72dc0 error:0fday...
>> >  in test_vsyscall[400000+2000]
>>
>> $ test_vsyscall test
>> Testing gettimeofday...
>>   vDSO offset = 0.000001s
>>   vsyscall offset = 0.000001s
>>
>> Testing time...
>>   vDSO offset = 0
>>   vsyscall offset = 0
>> Testing getcpu...
>>   ok!  cpu=6 node=0
>>
>> Can you send me your test_vsyscall binary so I can disassemble it?
>
> Here it is (also including source since I uncommented parts of it).
>
> One extra thing - I've been using AMD machines for this - I hadn't
> tried this on an Intel box.
>

Whoops!  The offending instruction is:

400c8d:       c4 e1 f3 2a c8          vcvtsi2sd %rax,%xmm1,%xmm1

which is unlikely to work on AMD unless you're the lucky owner of a
prerelease Bulldozer chip.  I

I bet if you pull a new copy or remove -mavx from Makefile it will
work.  I got a grossly hacked-up Xen domU booted and everything seems
to work.

(Testing native kernels is really fun with qemu-kvm -kernel <image>
-initrd <my silly initramfs>.  But Xen doesn't seem to support that.)

--Andy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Xen-devel] Re: [PATCH 0/5] Collected vdso/vsyscall fixes for 3.1
  2011-07-27 16:15           ` Andrew Lutomirski
@ 2011-07-27 16:29             ` Konrad Rzeszutek Wilk
  2011-07-27 16:58             ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 20+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-07-27 16:29 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: jeremy, xen-devel, x86, Linux Kernel Mailing List, virtualization,
	keir.xen

> 400c8d:       c4 e1 f3 2a c8          vcvtsi2sd %rax,%xmm1,%xmm1
> 
> which is unlikely to work on AMD unless you're the lucky owner of a
> prerelease Bulldozer chip.  I
> 
> I bet if you pull a new copy or remove -mavx from Makefile it will
> work.  I got a grossly hacked-up Xen domU booted and everything seems
> to work.

Ok.
> 
> (Testing native kernels is really fun with qemu-kvm -kernel <image>
> -initrd <my silly initramfs>.  But Xen doesn't seem to support that.)

You mean running Xen within QEMU? It does, but you need to use a multiboot
loader and QEMU does not seem to support that. You can create an
.iso image with isolinux that loads mboot.c32 for that.

If you want to use qemu like that in Xen environment you can do it too.
Just need the upstream version and use -M xenpv

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Xen-devel] Re: [PATCH 0/5] Collected vdso/vsyscall fixes for 3.1
  2011-07-27 16:15           ` Andrew Lutomirski
  2011-07-27 16:29             ` [Xen-devel] " Konrad Rzeszutek Wilk
@ 2011-07-27 16:58             ` Konrad Rzeszutek Wilk
  2011-07-27 17:05               ` Andrew Lutomirski
  1 sibling, 1 reply; 20+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-07-27 16:58 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: jeremy, xen-devel, x86, Linux Kernel Mailing List, virtualization,
	keir.xen

> >> $ test_vsyscall test
> >> Testing gettimeofday...
> >>   vDSO offset = 0.000001s
> >>   vsyscall offset = 0.000001s
> >>
> >> Testing time...
> >>   vDSO offset = 0
> >>   vsyscall offset = 0
> >> Testing getcpu...
> >>   ok!  cpu=6 node=0

> I bet if you pull a new copy or remove -mavx from Makefile it will
> work.  I got a grossly hacked-up Xen domU booted and everything seems
> to work.

It did. Both Dom0 and DomU work on AMD and Intel.

In regards to the last pv-ops patch - is there no better way? The reason I am asking
is the pv-ops hook is just a bandaid for the problem. Is the Xen syscall suppose to
be doingsomething extra with the stack perhaps?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Xen-devel] Re: [PATCH 0/5] Collected vdso/vsyscall fixes for 3.1
  2011-07-27 16:58             ` Konrad Rzeszutek Wilk
@ 2011-07-27 17:05               ` Andrew Lutomirski
  0 siblings, 0 replies; 20+ messages in thread
From: Andrew Lutomirski @ 2011-07-27 17:05 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: jeremy, xen-devel, x86, Linux Kernel Mailing List, virtualization,
	keir.xen

On Wed, Jul 27, 2011 at 12:58 PM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
>> >> $ test_vsyscall test
>> >> Testing gettimeofday...
>> >>   vDSO offset = 0.000001s
>> >>   vsyscall offset = 0.000001s
>> >>
>> >> Testing time...
>> >>   vDSO offset = 0
>> >>   vsyscall offset = 0
>> >> Testing getcpu...
>> >>   ok!  cpu=6 node=0
>
>> I bet if you pull a new copy or remove -mavx from Makefile it will
>> work.  I got a grossly hacked-up Xen domU booted and everything seems
>> to work.
>
> It did. Both Dom0 and DomU work on AMD and Intel.
>
> In regards to the last pv-ops patch - is there no better way? The reason I am asking
> is the pv-ops hook is just a bandaid for the problem. Is the Xen syscall suppose to
> be doingsomething extra with the stack perhaps?
>

The Xen code in question is:

restore_all_guest:
        ASSERT_INTERRUPTS_DISABLED
        RESTORE_ALL
        testw $TRAP_syscall,4(%rsp)
        jz    iret_exit_to_guest

        addq  $8,%rsp
        popq  %rcx                    # RIP
        popq  %r11                    # CS
        cmpw  $FLAT_USER_CS32,%r11
        popq  %r11                    # RFLAGS
        popq  %rsp                    # RSP
        je    1f
        sysretq
1:      sysretl


So with VCGF_in_syscall set, the ireq hypercall will return via
sysretq if the saved CS is __USER_CS or FLAT_USER_CS64.  This is
faster than iretq.

The hypervisor doesn't allow the guest OS to override the values in
MSR_STAR, so FLAT_USER_CS64 gets returned to userspace.  And sysretq
is probably much faster than iretq, so unsetting VCGF_in_syscall is
probably a bad idea.

The ideal solution would be to allow the kernel to change MSR_STAR,
but this would require changing the hypervisor and the kernel.

--Andy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] x86-64: Add user_64bit_mode paravirt op
  2011-07-27  3:20 ` [PATCH 5/5] x86-64: Add user_64bit_mode paravirt op Andy Lutomirski
@ 2011-07-27 17:24   ` Jeremy Fitzhardinge
  2011-07-27 17:45     ` Andrew Lutomirski
  0 siblings, 1 reply; 20+ messages in thread
From: Jeremy Fitzhardinge @ 2011-07-27 17:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Konrad Rzeszutek Wilk, Linux Kernel Mailing List, keir.xen,
	xen-devel, virtualization

On 07/26/2011 08:20 PM, Andy Lutomirski wrote:
> Three places in the kernel assume that the only long mode CPL 3
> selector is __USER_CS.  This is not true on Xen -- Xen's sysretq
> changes cs to the magic value 0xe033.
>
> Two of the places are corner cases, but as of "x86-64: Improve
> vsyscall emulation CS and RIP handling"
> (c9712944b2a12373cb6ff8059afcfb7e826a6c54), vsyscalls will segfault
> if called with Xen's extra CS selector.  This causes a panic when
> older init builds die.
>
> It seems impossible to make Xen use __USER_CS reliably without
> taking a performance hit on every system call, so this fixes the
> tests instead with a new paravirt op.  It's a little ugly because
> ptrace.h can't include paravirt.h.
>
> Signed-off-by: Andy Lutomirski <luto@mit.edu>
> Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  arch/x86/include/asm/desc.h           |    4 ++--
>  arch/x86/include/asm/paravirt_types.h |    6 ++++++
>  arch/x86/include/asm/ptrace.h         |   19 +++++++++++++++++++
>  arch/x86/kernel/paravirt.c            |    4 ++++
>  arch/x86/kernel/step.c                |    2 +-
>  arch/x86/kernel/vsyscall_64.c         |    6 +-----
>  arch/x86/mm/fault.c                   |    2 +-
>  arch/x86/xen/enlighten.c              |    1 +
>  8 files changed, 35 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
> index 7b439d9..41935fa 100644
> --- a/arch/x86/include/asm/desc.h
> +++ b/arch/x86/include/asm/desc.h
> @@ -27,8 +27,8 @@ static inline void fill_ldt(struct desc_struct *desc, const struct user_desc *in
>  
>  	desc->base2		= (info->base_addr & 0xff000000) >> 24;
>  	/*
> -	 * Don't allow setting of the lm bit. It is useless anyway
> -	 * because 64bit system calls require __USER_CS:
> +	 * Don't allow setting of the lm bit. It would confuse
> +	 * user_64bit_mode and would get overridden by sysret anyway.
>  	 */
>  	desc->l			= 0;
>  }
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> index 2c76521..8e8b9a4 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -41,6 +41,7 @@
>  
>  #include <asm/desc_defs.h>
>  #include <asm/kmap_types.h>
> +#include <asm/pgtable_types.h>
>  
>  struct page;
>  struct thread_struct;
> @@ -63,6 +64,11 @@ struct paravirt_callee_save {
>  struct pv_info {
>  	unsigned int kernel_rpl;
>  	int shared_kernel_pmd;
> +
> +#ifdef CONFIG_X86_64
> +	u16 extra_user_64bit_cs;  /* __USER_CS if none */
> +#endif
> +
>  	int paravirt_enabled;
>  	const char *name;
>  };
> diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
> index 94e7618..3566454 100644
> --- a/arch/x86/include/asm/ptrace.h
> +++ b/arch/x86/include/asm/ptrace.h
> @@ -131,6 +131,9 @@ struct pt_regs {
>  #ifdef __KERNEL__
>  
>  #include <linux/init.h>
> +#ifdef CONFIG_PARAVIRT
> +#include <asm/paravirt_types.h>
> +#endif
>  
>  struct cpuinfo_x86;
>  struct task_struct;
> @@ -187,6 +190,22 @@ static inline int v8086_mode(struct pt_regs *regs)
>  #endif
>  }
>  
> +#ifdef CONFIG_X86_64
> +static inline bool user_64bit_mode(struct pt_regs *regs)
> +{
> +#ifndef CONFIG_PARAVIRT
> +	/*
> +	 * On non-paravirt systems, this is the only long mode CPL 3
> +	 * selector.  We do not allow long mode selectors in the LDT.
> +	 */
> +	return regs->cs == __USER_CS;
> +#else
> +	/* Headers are too twisted for this to go in paravirt.h. */
> +	return regs->cs == __USER_CS || regs->cs == pv_info.extra_user_64bit_cs;

Is this necessary because usermode may sometimes be on __USER_CS or
sometimes on Xen's?  Could we just commit to one or the other and make
it a simple comparison?

What if __USER_CS were a variable?

    J
> +#endif
> +}
> +#endif
> +
>  /*
>   * X86_32 CPUs don't save ss and esp if the CPU is already in kernel mode
>   * when it traps.  The previous stack will be directly underneath the saved
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index 613a793..d90272e 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -307,6 +307,10 @@ struct pv_info pv_info = {
>  	.paravirt_enabled = 0,
>  	.kernel_rpl = 0,
>  	.shared_kernel_pmd = 1,	/* Only used when CONFIG_X86_PAE is set */
> +
> +#ifdef CONFIG_X86_64
> +	.extra_user_64bit_cs = __USER_CS,
> +#endif
>  };
>  
>  struct pv_init_ops pv_init_ops = {
> diff --git a/arch/x86/kernel/step.c b/arch/x86/kernel/step.c
> index 7977f0c..c346d11 100644
> --- a/arch/x86/kernel/step.c
> +++ b/arch/x86/kernel/step.c
> @@ -74,7 +74,7 @@ static int is_setting_trap_flag(struct task_struct *child, struct pt_regs *regs)
>  
>  #ifdef CONFIG_X86_64
>  		case 0x40 ... 0x4f:
> -			if (regs->cs != __USER_CS)
> +			if (!user_64bit_mode(regs))
>  				/* 32-bit mode: register increment */
>  				return 0;
>  			/* 64-bit mode: REX prefix */
> diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
> index dda7dff..1725930 100644
> --- a/arch/x86/kernel/vsyscall_64.c
> +++ b/arch/x86/kernel/vsyscall_64.c
> @@ -127,11 +127,7 @@ void dotraplinkage do_emulate_vsyscall(struct pt_regs *regs, long error_code)
>  
>  	local_irq_enable();
>  
> -	/*
> -	 * Real 64-bit user mode code has cs == __USER_CS.  Anything else
> -	 * is bogus.
> -	 */
> -	if (regs->cs != __USER_CS) {
> +	if (!user_64bit_mode(regs)) {
>  		/*
>  		 * If we trapped from kernel mode, we might as well OOPS now
>  		 * instead of returning to some random address and OOPSing
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 4d09df0..decd51a 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -105,7 +105,7 @@ check_prefetch_opcode(struct pt_regs *regs, unsigned char *instr,
>  		 * but for now it's good enough to assume that long
>  		 * mode only uses well known segments or kernel.
>  		 */
> -		return (!user_mode(regs)) || (regs->cs == __USER_CS);
> +		return (!user_mode(regs) || user_64bit_mode(regs));
>  #endif
>  	case 0x60:
>  		/* 0x64 thru 0x67 are valid prefixes in all modes. */
> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> index 974a528..a9c710a 100644
> --- a/arch/x86/xen/enlighten.c
> +++ b/arch/x86/xen/enlighten.c
> @@ -950,6 +950,7 @@ static unsigned xen_patch(u8 type, u16 clobbers, void *insnbuf,
>  static const struct pv_info xen_info __initconst = {
>  	.paravirt_enabled = 1,
>  	.shared_kernel_pmd = 0,
> +	.extra_user_64bit_cs = FLAT_USER_CS64,
>  
>  	.name = "Xen",
>  };


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] x86-64: Add user_64bit_mode paravirt op
  2011-07-27 17:24   ` Jeremy Fitzhardinge
@ 2011-07-27 17:45     ` Andrew Lutomirski
  0 siblings, 0 replies; 20+ messages in thread
From: Andrew Lutomirski @ 2011-07-27 17:45 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: x86, Konrad Rzeszutek Wilk, Linux Kernel Mailing List, keir.xen,
	xen-devel, virtualization

On Wed, Jul 27, 2011 at 1:24 PM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> On 07/26/2011 08:20 PM, Andy Lutomirski wrote:
>> Three places in the kernel assume that the only long mode CPL 3
>> selector is __USER_CS.  This is not true on Xen -- Xen's sysretq
>> changes cs to the magic value 0xe033.
>>
>> Two of the places are corner cases, but as of "x86-64: Improve
>> vsyscall emulation CS and RIP handling"
>> (c9712944b2a12373cb6ff8059afcfb7e826a6c54), vsyscalls will segfault
>> if called with Xen's extra CS selector.  This causes a panic when
>> older init builds die.
>>
>> It seems impossible to make Xen use __USER_CS reliably without
>> taking a performance hit on every system call, so this fixes the
>> tests instead with a new paravirt op.  It's a little ugly because
>> ptrace.h can't include paravirt.h.
>>
>> Signed-off-by: Andy Lutomirski <luto@mit.edu>
>> Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> ---
>>  arch/x86/include/asm/desc.h           |    4 ++--
>>  arch/x86/include/asm/paravirt_types.h |    6 ++++++
>>  arch/x86/include/asm/ptrace.h         |   19 +++++++++++++++++++
>>  arch/x86/kernel/paravirt.c            |    4 ++++
>>  arch/x86/kernel/step.c                |    2 +-
>>  arch/x86/kernel/vsyscall_64.c         |    6 +-----
>>  arch/x86/mm/fault.c                   |    2 +-
>>  arch/x86/xen/enlighten.c              |    1 +
>>  8 files changed, 35 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
>> index 7b439d9..41935fa 100644
>> --- a/arch/x86/include/asm/desc.h
>> +++ b/arch/x86/include/asm/desc.h
>> @@ -27,8 +27,8 @@ static inline void fill_ldt(struct desc_struct *desc, const struct user_desc *in
>>
>>       desc->base2             = (info->base_addr & 0xff000000) >> 24;
>>       /*
>> -      * Don't allow setting of the lm bit. It is useless anyway
>> -      * because 64bit system calls require __USER_CS:
>> +      * Don't allow setting of the lm bit. It would confuse
>> +      * user_64bit_mode and would get overridden by sysret anyway.
>>        */
>>       desc->l                 = 0;
>>  }
>> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
>> index 2c76521..8e8b9a4 100644
>> --- a/arch/x86/include/asm/paravirt_types.h
>> +++ b/arch/x86/include/asm/paravirt_types.h
>> @@ -41,6 +41,7 @@
>>
>>  #include <asm/desc_defs.h>
>>  #include <asm/kmap_types.h>
>> +#include <asm/pgtable_types.h>
>>
>>  struct page;
>>  struct thread_struct;
>> @@ -63,6 +64,11 @@ struct paravirt_callee_save {
>>  struct pv_info {
>>       unsigned int kernel_rpl;
>>       int shared_kernel_pmd;
>> +
>> +#ifdef CONFIG_X86_64
>> +     u16 extra_user_64bit_cs;  /* __USER_CS if none */
>> +#endif
>> +
>>       int paravirt_enabled;
>>       const char *name;
>>  };
>> diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
>> index 94e7618..3566454 100644
>> --- a/arch/x86/include/asm/ptrace.h
>> +++ b/arch/x86/include/asm/ptrace.h
>> @@ -131,6 +131,9 @@ struct pt_regs {
>>  #ifdef __KERNEL__
>>
>>  #include <linux/init.h>
>> +#ifdef CONFIG_PARAVIRT
>> +#include <asm/paravirt_types.h>
>> +#endif
>>
>>  struct cpuinfo_x86;
>>  struct task_struct;
>> @@ -187,6 +190,22 @@ static inline int v8086_mode(struct pt_regs *regs)
>>  #endif
>>  }
>>
>> +#ifdef CONFIG_X86_64
>> +static inline bool user_64bit_mode(struct pt_regs *regs)
>> +{
>> +#ifndef CONFIG_PARAVIRT
>> +     /*
>> +      * On non-paravirt systems, this is the only long mode CPL 3
>> +      * selector.  We do not allow long mode selectors in the LDT.
>> +      */
>> +     return regs->cs == __USER_CS;
>> +#else
>> +     /* Headers are too twisted for this to go in paravirt.h. */
>> +     return regs->cs == __USER_CS || regs->cs == pv_info.extra_user_64bit_cs;
>
> Is this necessary because usermode may sometimes be on __USER_CS or
> sometimes on Xen's?  Could we just commit to one or the other and make
> it a simple comparison?

Currently (from memory), brand new threads start out on __USER_CS.
Also, there might be software out there that thunks back and forth
between 32-bit and 64-bit mode and hardcodes CS=51 as the 32->64 bit
jump target.

It is said that 32-64 bit thunking is impossible, but this is
empirically untrue -- I've done it in the intcc32 vsyscall test I
wrote, and if you remove the actual intcc instruction, it will survive
the thunk in both directions.  My code didn't hardcode the assumption.

>
> What if __USER_CS were a variable?

That sounds a little evil :)  It will also make the FIXUP_TOP_OF_STACK
macro a little uglier than it is.

But maybe it's not so bad.  We could even remove the legacy 64-bit
selector, and presumably everything would still work.

What do you think?  I'll benchmark removing VCGF_in_syscall later tonight.

--Andy

>
>    J
>> +#endif
>> +}
>> +#endif
>> +
>>  /*
>>   * X86_32 CPUs don't save ss and esp if the CPU is already in kernel mode
>>   * when it traps.  The previous stack will be directly underneath the saved
>> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
>> index 613a793..d90272e 100644
>> --- a/arch/x86/kernel/paravirt.c
>> +++ b/arch/x86/kernel/paravirt.c
>> @@ -307,6 +307,10 @@ struct pv_info pv_info = {
>>       .paravirt_enabled = 0,
>>       .kernel_rpl = 0,
>>       .shared_kernel_pmd = 1, /* Only used when CONFIG_X86_PAE is set */
>> +
>> +#ifdef CONFIG_X86_64
>> +     .extra_user_64bit_cs = __USER_CS,
>> +#endif
>>  };
>>
>>  struct pv_init_ops pv_init_ops = {
>> diff --git a/arch/x86/kernel/step.c b/arch/x86/kernel/step.c
>> index 7977f0c..c346d11 100644
>> --- a/arch/x86/kernel/step.c
>> +++ b/arch/x86/kernel/step.c
>> @@ -74,7 +74,7 @@ static int is_setting_trap_flag(struct task_struct *child, struct pt_regs *regs)
>>
>>  #ifdef CONFIG_X86_64
>>               case 0x40 ... 0x4f:
>> -                     if (regs->cs != __USER_CS)
>> +                     if (!user_64bit_mode(regs))
>>                               /* 32-bit mode: register increment */
>>                               return 0;
>>                       /* 64-bit mode: REX prefix */
>> diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
>> index dda7dff..1725930 100644
>> --- a/arch/x86/kernel/vsyscall_64.c
>> +++ b/arch/x86/kernel/vsyscall_64.c
>> @@ -127,11 +127,7 @@ void dotraplinkage do_emulate_vsyscall(struct pt_regs *regs, long error_code)
>>
>>       local_irq_enable();
>>
>> -     /*
>> -      * Real 64-bit user mode code has cs == __USER_CS.  Anything else
>> -      * is bogus.
>> -      */
>> -     if (regs->cs != __USER_CS) {
>> +     if (!user_64bit_mode(regs)) {
>>               /*
>>                * If we trapped from kernel mode, we might as well OOPS now
>>                * instead of returning to some random address and OOPSing
>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>> index 4d09df0..decd51a 100644
>> --- a/arch/x86/mm/fault.c
>> +++ b/arch/x86/mm/fault.c
>> @@ -105,7 +105,7 @@ check_prefetch_opcode(struct pt_regs *regs, unsigned char *instr,
>>                * but for now it's good enough to assume that long
>>                * mode only uses well known segments or kernel.
>>                */
>> -             return (!user_mode(regs)) || (regs->cs == __USER_CS);
>> +             return (!user_mode(regs) || user_64bit_mode(regs));
>>  #endif
>>       case 0x60:
>>               /* 0x64 thru 0x67 are valid prefixes in all modes. */
>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>> index 974a528..a9c710a 100644
>> --- a/arch/x86/xen/enlighten.c
>> +++ b/arch/x86/xen/enlighten.c
>> @@ -950,6 +950,7 @@ static unsigned xen_patch(u8 type, u16 clobbers, void *insnbuf,
>>  static const struct pv_info xen_info __initconst = {
>>       .paravirt_enabled = 1,
>>       .shared_kernel_pmd = 0,
>> +     .extra_user_64bit_cs = FLAT_USER_CS64,
>>
>>       .name = "Xen",
>>  };
>
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2011-07-27 17:45 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-07-27  3:20 [PATCH 0/5] Collected vdso/vsyscall fixes for 3.1 Andy Lutomirski
2011-07-27  3:20 ` [PATCH 1/5] x86-64: Pad vDSO to a page boundary Andy Lutomirski
2011-07-27  3:20 ` [PATCH 2/5] x86-64: Move the "user" vsyscall segment out of the data segment Andy Lutomirski
2011-07-27  3:20 ` [PATCH 3/5] x86-64: Work around gold bug 13023 Andy Lutomirski
2011-07-27  3:20 ` [PATCH 4/5] x86-64/xen: Enable the vvar mapping Andy Lutomirski
2011-07-27 13:06   ` Konrad Rzeszutek Wilk
2011-07-27 13:48     ` Andrew Lutomirski
2011-07-27  3:20 ` [PATCH 5/5] x86-64: Add user_64bit_mode paravirt op Andy Lutomirski
2011-07-27 17:24   ` Jeremy Fitzhardinge
2011-07-27 17:45     ` Andrew Lutomirski
2011-07-27 12:59 ` [PATCH 0/5] Collected vdso/vsyscall fixes for 3.1 Konrad Rzeszutek Wilk
2011-07-27 14:57 ` Konrad Rzeszutek Wilk
2011-07-27 15:04   ` Andrew Lutomirski
2011-07-27 15:30     ` Konrad Rzeszutek Wilk
2011-07-27 15:34       ` Andrew Lutomirski
2011-07-27 15:43         ` Konrad Rzeszutek Wilk
2011-07-27 16:15           ` Andrew Lutomirski
2011-07-27 16:29             ` [Xen-devel] " Konrad Rzeszutek Wilk
2011-07-27 16:58             ` Konrad Rzeszutek Wilk
2011-07-27 17:05               ` Andrew Lutomirski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox