public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 19/21] i386 Kprobes semaphore fix
@ 2005-11-08  4:39 Zachary Amsden
  2005-11-08 13:12 ` Andi Kleen
  0 siblings, 1 reply; 15+ messages in thread
From: Zachary Amsden @ 2005-11-08  4:39 UTC (permalink / raw)
  To: Andrew Morton, Chris Wright, Linus Torvalds,
	Linux Kernel Mailing List, Virtualization Mailing List,
	H. Peter Anvin, Zwane Mwaikambo, Martin Bligh,
	Pratap Subrahmanyam, Christopher Li, Eric W. Biederman,
	Ingo Molnar, Zachary Amsden, Zachary Amsden

IA-32 linear address translation is loads of fun.

While cleaning up the LDT code, I noticed that kprobes code was very bogus
with respect to segment handling.  Many, many bugs are fixed here.  I chose
to combine the three separate functions that try to do linear address
conversion into one, nice and working functions.  All of the versions had
bugs.

1) Taking an int3 from v8086 mode could cause the kprobes code to read a
   non-existent LDT.

2) The CS value was not truncated to 16 bit, which could cause an access
   beyond the bounds of the LDT.

3) The LDT was being read without taking the mm->context semaphore, which
   means bogus and or non-existent vmalloc()ed pages could be read.

4) 16-bit code segments do not truncate EIP to 16-bit, it is perfectly
   valid to issue an instruction at 0xffff, and there is no wraparound
   of EIP in protected mode.

5) V8086 mode does truncate EIP to 16-bit.

6) Taking the mm->context semaphore requires interrupts to be enabled.

7) Do not assume the GDT TLS descriptors are flat.

8) Raceful testing of segment access rights without LDT semaphore

9) Segment limit for V8086 code is USER limit.

Kprobes was still broken; it would try to read userspace directly;
since I'm already here, might as well fix that too.

Signed-off-by: Zachary Amsden <zach@vmware.com>
Index: linux-2.6.14-zach-work/arch/i386/kernel/kprobes.c
===================================================================
--- linux-2.6.14-zach-work.orig/arch/i386/kernel/kprobes.c	2005-11-04 19:25:27.000000000 -0800
+++ linux-2.6.14-zach-work/arch/i386/kernel/kprobes.c	2005-11-04 19:26:37.000000000 -0800
@@ -156,23 +156,25 @@ static int __kprobes kprobe_handler(stru
 	struct kprobe *p;
 	int ret = 0;
 	kprobe_opcode_t *addr = NULL;
-	unsigned long *lp;
+	unsigned long limit;
 
-	/* We're in an interrupt, but this is clear and BUG()-safe. */
-	preempt_disable();
-	/* Check if the application is using LDT entry for its code segment and
-	 * calculate the address by reading the base address from the LDT entry.
+	/*
+	 * Getting the address may require getting the LDT semaphore, so
+	 * wait to disable preempt since we may take interrupts here.
 	 */
-	if ((regs->xcs & 4) && (current->mm)) {
-		lp = (unsigned long *) ((unsigned long)((regs->xcs >> 3) * 8)
-					+ (char *) current->mm->context.ldt);
-		addr = (kprobe_opcode_t *) (get_desc_base(lp) + regs->eip -
-						sizeof(kprobe_opcode_t));
-	} else {
-		addr = (kprobe_opcode_t *)(regs->eip - sizeof(kprobe_opcode_t));
-	}
-	/* Check we're not actually recursing */
-	if (kprobe_running()) {
+	addr = (kprobe_opcode_t *)convert_eip_to_linear(regs,
+					regs->eip - sizeof(kprobe_opcode_t),
+					&current->mm->context, &limit);
+
+	/* Don't let userspace races re-address into kernel space */
+	if ((unsigned long)addr > limit)
+		return 0;
+
+ 	/* We're in an interrupt, but this is clear and BUG()-safe. */
+ 	preempt_disable();
+
+  	/* Check we're not actually recursing */
+  	if (kprobe_running()) {
 		/* We *are* holding lock here, so this is safe.
 		   Disarm the probe we just hit, and ignore it. */
 		p = get_kprobe(addr);
@@ -209,13 +211,20 @@ static int __kprobes kprobe_handler(stru
 	lock_kprobes();
 	p = get_kprobe(addr);
 	if (!p) {
+		unsigned char instr;
 		unlock_kprobes();
 		if (regs->eflags & VM_MASK) {
 			/* We are in virtual-8086 mode. Return 0 */
 			goto no_kprobe;
 		}
 
-		if (*addr != BREAKPOINT_INSTRUCTION) {
+		instr = BREAKPOINT_INSTRUCTION;
+		if (user_mode(regs))
+			__get_user(instr, (unsigned char __user *) addr);
+		else
+			instr = *addr;
+			
+		if (instr != BREAKPOINT_INSTRUCTION) {
 			/*
 			 * The breakpoint instruction was removed right
 			 * after we hit it.  Another cpu has removed
Index: linux-2.6.14-zach-work/arch/i386/kernel/ptrace.c
===================================================================
--- linux-2.6.14-zach-work.orig/arch/i386/kernel/ptrace.c	2005-11-04 19:25:27.000000000 -0800
+++ linux-2.6.14-zach-work/arch/i386/kernel/ptrace.c	2005-11-04 19:26:37.000000000 -0800
@@ -146,46 +146,76 @@ static unsigned long getreg(struct task_
 	return retval;
 }
 
-static unsigned long convert_eip_to_linear(struct task_struct *child, struct pt_regs *regs)
+/*
+ * Get the GDT/LDT descriptor base.  When you look for races in this code
+ * remember that LDT and other horrors are only used in user space.  Must
+ * disable pre-emption to reading the GDT, and must take the LDT semaphore
+ * for LDT segments.  The fast path handles standard kernel and user CS
+ * as well as V8086 mode.
+ */
+unsigned long convert_eip_to_linear_slow(unsigned long eip, unsigned long seg,
+ 	mm_context_t *context, unsigned long *eip_limit)
 {
-	unsigned long addr, seg;
+	unsigned long base, seg_limit;
+ 	u32 seg_ar;
+	struct desc_struct *desc;
+	unsigned long flags;
 
-	addr = regs->eip;
-	seg = regs->xcs & 0xffff;
-	if (regs->eflags & VM_MASK) {
-		addr = (addr & 0xffff) + (seg << 4);
-		return addr;
+	if (segment_from_ldt(seg)) {
+		/*
+		 * Horrors abound.  Must enable IRQs to take the LDT
+		 * semaphore.  Ok, since LDT from a faulting CS can only
+		 * be from userspace, or this is from ptrace operating
+		 * on a child context directly from a system call.
+		 * This unfortunate mess is needed to deal with int3
+		 * kprobes which enter with IRQs disabled.
+		 */
+		local_save_flags(flags);
+		local_irq_enable();
+		down(&context->sem);
+		desc = get_ldt_desc(context, seg);
+	} else {
+		/* Must disable preemption while reading the GDT. */
+ 		desc = get_gdt_desc(get_cpu(), seg);
+		flags = 0; /* silence compiler */
 	}
 
-	/*
-	 * We'll assume that the code segments in the GDT
-	 * are all zero-based. That is largely true: the
-	 * TLS segments are used for data, and the PNPBIOS
-	 * and APM bios ones we just ignore here.
-	 */
-	if (seg & LDT_SEGMENT) {
-		u32 *desc;
-		unsigned long base;
-
-		down(&child->mm->context.sem);
-		desc = child->mm->context.ldt + (seg & ~7);
-		base = (desc[0] >> 16) | ((desc[1] & 0xff) << 16) | (desc[1] & 0xff000000);
-
-		/* 16-bit code segment? */
-		if (!((desc[1] >> 22) & 1))
-			addr &= 0xffff;
-		addr += base;
-		up(&child->mm->context.sem);
+	/* Check the segment exists, is within the current LDT/GDT size,
+	   that kernel/user (ring 0..3) has the appropriate privilege,
+	   that it's a code segment, and get the limit. */
+	asm ("larl %3,%0; lsll %3,%1"
+		 : "=&r" (seg_ar), "=r" (seg_limit) : "0" (0), "rm" (seg));
+	if ((~seg_ar & 0x9800) || eip > seg_limit) {
+		if (eip_limit)
+			*eip_limit = 0;
+		eip = 1;	 /* So that returned eip > *eip_limit. */
+		base = 0;
+	} else {
+		base = get_desc_base(desc);
+		eip += base;
+		seg_limit += base;
+	}
+
+	if (segment_from_ldt(seg)) {
+		up(&context->sem);
+		local_irq_restore(flags);
+	} else {
+		put_cpu();
 	}
-	return addr;
+
+	/* Adjust EIP and segment limit, and clamp at the kernel limit.
+	   It's legitimate for segments to wrap at 0xffffffff. */
+	if (eip_limit && seg_limit < *eip_limit && seg_limit >= base)
+		*eip_limit = seg_limit;
+
+	return eip;
 }
 
 static inline int is_at_popf(struct task_struct *child, struct pt_regs *regs)
 {
 	int i, copied;
 	unsigned char opcode[16];
-	unsigned long addr = convert_eip_to_linear(child, regs);
-
+ 	unsigned long addr = convert_eip_to_linear(regs, regs->eip, &child->mm->context, NULL);
 	copied = access_process_vm(child, addr, opcode, sizeof(opcode), 0);
 	for (i = 0; i < copied; i++) {
 		switch (opcode[i]) {
Index: linux-2.6.14-zach-work/arch/i386/mm/fault.c
===================================================================
--- linux-2.6.14-zach-work.orig/arch/i386/mm/fault.c	2005-11-04 19:25:27.000000000 -0800
+++ linux-2.6.14-zach-work/arch/i386/mm/fault.c	2005-11-04 19:26:37.000000000 -0800
@@ -56,77 +56,6 @@ void bust_spinlocks(int yes)
 	console_loglevel = loglevel_save;
 }
 
-/*
- * Return EIP plus the CS segment base.  The segment limit is also
- * adjusted, clamped to the kernel/user address space (whichever is
- * appropriate), and returned in *eip_limit.
- *
- * The segment is checked, because it might have been changed by another
- * task between the original faulting instruction and here.
- *
- * If CS is no longer a valid code segment, or if EIP is beyond the
- * limit, or if it is a kernel address when CS is not a kernel segment,
- * then the returned value will be greater than *eip_limit.
- * 
- * This is slow, but is very rarely executed.
- */
-static inline unsigned long get_segment_eip(struct pt_regs *regs,
-					    unsigned long *eip_limit)
-{
-	unsigned long eip = regs->eip;
-	unsigned seg = regs->xcs & 0xffff;
-	u32 seg_ar, seg_limit, base, *desc;
-
-	/* The standard kernel/user address space limit. */
-	*eip_limit = (seg & 3) ? USER_DS.seg : KERNEL_DS.seg;
-
-	/* Unlikely, but must come before segment checks. */
-	if (unlikely((regs->eflags & VM_MASK) != 0))
-		return eip + (seg << 4);
-	
-	/* By far the most common cases. */
-	if (likely(seg == __USER_CS || seg == __KERNEL_CS))
-		return eip;
-
-	/* Check the segment exists, is within the current LDT/GDT size,
-	   that kernel/user (ring 0..3) has the appropriate privilege,
-	   that it's a code segment, and get the limit. */
-	__asm__ ("larl %3,%0; lsll %3,%1"
-		 : "=&r" (seg_ar), "=r" (seg_limit) : "0" (0), "rm" (seg));
-	if ((~seg_ar & 0x9800) || eip > seg_limit) {
-		*eip_limit = 0;
-		return 1;	 /* So that returned eip > *eip_limit. */
-	}
-
-	/* Get the GDT/LDT descriptor base. 
-	   When you look for races in this code remember that
-	   LDT and other horrors are only used in user space. */
-	if (seg & (1<<2)) {
-		/* Must lock the LDT while reading it. */
-		down(&current->mm->context.sem);
-		desc = current->mm->context.ldt;
-		desc = (void *)desc + (seg & ~7);
-	} else {
-		/* Must disable preemption while reading the GDT. */
- 		desc = (u32 *)get_cpu_gdt_table(get_cpu());
-		desc = (void *)desc + (seg & ~7);
-	}
-
-	/* Decode the code segment base from the descriptor */
-	base = get_desc_base(desc);
-
-	if (seg & (1<<2)) { 
-		up(&current->mm->context.sem);
-	} else
-		put_cpu();
-
-	/* Adjust EIP and segment limit, and clamp at the kernel limit.
-	   It's legitimate for segments to wrap at 0xffffffff. */
-	seg_limit += base;
-	if (seg_limit < *eip_limit && seg_limit >= base)
-		*eip_limit = seg_limit;
-	return eip + base;
-}
 
 /* 
  * Sometimes AMD Athlon/Opteron CPUs report invalid exceptions on prefetch.
@@ -135,7 +64,8 @@ static inline unsigned long get_segment_
 static int __is_prefetch(struct pt_regs *regs, unsigned long addr)
 { 
 	unsigned long limit;
-	unsigned long instr = get_segment_eip (regs, &limit);
+	unsigned long instr = convert_eip_to_linear (regs, regs->eip,
+				&current->mm->context, &limit);
 	int scan_more = 1;
 	int prefetch = 0; 
 	int i;
Index: linux-2.6.14-zach-work/include/asm-i386/ptrace.h
===================================================================
--- linux-2.6.14-zach-work.orig/include/asm-i386/ptrace.h	2005-11-04 19:25:27.000000000 -0800
+++ linux-2.6.14-zach-work/include/asm-i386/ptrace.h	2005-11-04 19:26:37.000000000 -0800
@@ -54,9 +54,11 @@ struct pt_regs {
 #define PTRACE_GET_THREAD_AREA    25
 #define PTRACE_SET_THREAD_AREA    26
 
-#ifdef __KERNEL__
+#if defined(__KERNEL__) && !defined(__arch_um__)
 
 #include <asm/vm86.h>
+#include <asm/mmu.h>
+#include <asm/uaccess.h>
 
 struct task_struct;
 extern void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs, int error_code);
@@ -82,6 +84,45 @@ extern unsigned long profile_pc(struct p
 #else
 #define profile_pc(regs) instruction_pointer(regs)
 #endif
-#endif /* __KERNEL__ */
+
+/*
+ * Return EIP plus the CS segment base.  The segment limit is also
+ * adjusted, clamped to the kernel/user address space (whichever is
+ * appropriate), and returned in *eip_limit.
+ *
+ * The segment is checked, because it might have been changed by another
+ * task between the original faulting instruction and here.
+ *
+ * If CS is no longer a valid code segment, or if EIP is beyond the
+ * limit, or if it is a kernel address when CS is not a kernel segment,
+ * then the returned value will be greater than *eip_limit.
+ * 
+ * This is slow, but is very rarely executed.
+ */
+extern unsigned long convert_eip_to_linear_slow(unsigned long addr, unsigned long seg, mm_context_t *context, unsigned long *eip_limit);
+
+static inline unsigned long convert_eip_to_linear(struct pt_regs *regs, unsigned long addr, mm_context_t *context, unsigned long * const eip_limit)
+{
+	unsigned long seg;
+
+	seg = regs->xcs & 0xffff;
+
+	/* The standard kernel/user address space limit. */
+	if (eip_limit)
+		*eip_limit = user_mode(regs) ? USER_DS.seg : KERNEL_DS.seg;
+
+	if (regs->eflags & VM_MASK) {
+		addr = (addr & 0xffff) + (seg << 4);
+		return addr;
+	}
+
+	/* By far the most common cases. */
+	if (likely(seg == __USER_CS || seg == __KERNEL_CS))
+		return addr;
+
+	return convert_eip_to_linear_slow(addr, seg, context, eip_limit);
+}
+
+#endif /* defined(__KERNEL__) && !defined(__arch_um__) */
 
 #endif

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 19/21] i386 Kprobes semaphore fix
  2005-11-08  4:39 [PATCH 19/21] i386 Kprobes semaphore fix Zachary Amsden
@ 2005-11-08 13:12 ` Andi Kleen
  2005-11-08 13:36   ` Zachary Amsden
  0 siblings, 1 reply; 15+ messages in thread
From: Andi Kleen @ 2005-11-08 13:12 UTC (permalink / raw)
  To: virtualization
  Cc: Zachary Amsden, Andrew Morton, Chris Wright, Linus Torvalds,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Martin Bligh, Pratap Subrahmanyam, Christopher Li,
	Eric W. Biederman, Ingo Molnar

On Tuesday 08 November 2005 05:39, Zachary Amsden wrote:
> IA-32 linear address translation is loads of fun.

Thanks for doing that audit work. Can you please double check x86-64 code is
ok? 

Actually giving all that complexity maybe it would be better to just
stop handling the case and remove all that. I'm not sure what kprobes needs it 
for - it doesn't even handle user space yet and even if it ever does it is 
unlikely that handling 16bit code makes much sense. And the prefetch 
workaround does it, but 16bit DOS code is unlikely to contain prefetches 
anyways. And for ptrace - well, who cares? I suppose dosemu has an own
debugger anyways and it could be handled in user space (i suppose
they still have that code from 2.4 anyways)

> While cleaning up the LDT code, I noticed that kprobes code was very bogus
> with respect to segment handling.  Many, many bugs are fixed here.  I chose
> to combine the three separate functions that try to do linear address
> conversion into one, nice and working functions.  All of the versions had
> bugs.
>
> 1) Taking an int3 from v8086 mode could cause the kprobes code to read a
>    non-existent LDT.
>
> 2) The CS value was not truncated to 16 bit, which could cause an access
>    beyond the bounds of the LDT.

That's a (small) security hole, isn't it?


-Andi

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 19/21] i386 Kprobes semaphore fix
       [not found] <20051108074430.GG28201@elte.hu>
@ 2005-11-08 13:26 ` Zachary Amsden
  0 siblings, 0 replies; 15+ messages in thread
From: Zachary Amsden @ 2005-11-08 13:26 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: security, Linux Kernel Mailing List

Ingo Molnar wrote:

>could these bugs lead to local rootholes, if kprobes are enabled?
>
>The threat model we care about is: administrator has a few dozen kprobes 
>activated, which sample various aspects of the system. Unprivileged 
>userspace tries to exploit this fact: it has full control over its own 
>conduct (including the ability to create LDTs, vm86 mode and weird 
>segments), but unprivileged userspace has no ability to 
>activate/deactivate kprobes.
>
>could this cause problems?
>  
>

Good - it was thinking along these lines that led me to this patch.  It 
is still possible to defeat the EIP checking if you are crafty enough, 
because there is a window of opportunity from the time a fault is 
induced until the LDT is read where another CPU can come along and 
modify the LDT.  Note the GDT does not have this problem, as remote 
changes (via ptrace I think I saw code that makes it possible) will not 
take effect until the thread is rescheduled.  But with kernel 
preemption, even GDT based segments have the race.  Pardon my thinking 
out loud - this is a very sticky situation.

Actually, perhaps this window could be closed; we will have a pending 
LDT reload posted by IPI, or a reschedule (and thus TLS reload) pending, 
and we can hold processing of that even while enabling interrupts.  
There is a problem - the LDT will have already been modified, and we 
don't always re-allocate a new page for it, so the old data may be 
gone.  Seems like we could find ways to close the race, but it seems 
difficult to verify as correct, and may cost performance.

So my approach is to ignore the race; instead, I use limit checking.  
Note the limit here is specified in linear (non-segmented) virtual 
memory, and is clamped to MIN(limit,PAGE_OFFSET) for userspace probes.

+
+	/* Don't let userspace races re-address into kernel space */
+	if ((unsigned long)addr > limit)
+		return 0;
+

Also, added this test:

+		if (user_mode(regs))
+			__get_user(instr, (unsigned char __user *) addr);
+		else
+			instr = *addr;
+			


Which saves the kernel from faulting due to "impersonated non-linear 
breakpoints".  Note the OOPs should have been caught and safely dealt 
with anyway.

So there are still a couple of crafty tricks that can cause you to 
re-address into kernel range EIPs, but limit checking protects us, and 
in general, perhaps we need not worry about strict correctness of 
kprobes for these edge cases.  Now that I have the issue at attention, 
is it even possible / useful to set kprobes in a specific userspace 
context, or should only kernel breakpoints be considered for kprobes?  
Reading the code, it was not clear, and I could not find proper 
documentation, so I left the semantics as I saw them.

If indeed, as I suspect, kprobes are only for kernel code, then the code 
here can be greatly simplified by discarding user code segments right 
away and doing flat EIP conversion.

Zach

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 19/21] i386 Kprobes semaphore fix
  2005-11-08 13:12 ` Andi Kleen
@ 2005-11-08 13:36   ` Zachary Amsden
  2005-11-09 13:38     ` Andi Kleen
       [not found]     ` <20051109093755.GA10361@in.ibm.com>
  0 siblings, 2 replies; 15+ messages in thread
From: Zachary Amsden @ 2005-11-08 13:36 UTC (permalink / raw)
  To: Andi Kleen
  Cc: virtualization, Andrew Morton, Chris Wright, Linus Torvalds,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Martin Bligh, Pratap Subrahmanyam, Christopher Li,
	Eric W. Biederman, Ingo Molnar

Andi Kleen wrote:

>On Tuesday 08 November 2005 05:39, Zachary Amsden wrote:
>  
>
>>IA-32 linear address translation is loads of fun.
>>    
>>
>
>Thanks for doing that audit work. Can you please double check x86-64 code is
>ok? 
>
>Actually giving all that complexity maybe it would be better to just
>stop handling the case and remove all that. I'm not sure what kprobes needs it 
>for - it doesn't even handle user space yet and even if it ever does it is 
>unlikely that handling 16bit code makes much sense. And the prefetch 
>workaround does it, but 16bit DOS code is unlikely to contain prefetches 
>anyways. And for ptrace - well, who cares? I suppose dosemu has an own
>debugger anyways and it could be handled in user space (i suppose
>they still have that code from 2.4 anyways)
>  
>

I got the idea from the x86-64 code; prompted by you, I looked at it, 
and it appears correct, but I would like to give it a full audit as well 
(especially regarding 32-bit compatibility segments).

About the three cases here:

The prefetch workaround should be harmless, again because of limit 
checking, the kernel is safe even in the raceful cases.

One can imagine clever uses for ptrace to do, say user space 
virtualization (since I'm on the topic), or other neat things.  So there 
is nothing really wrong about having the fully correct EIP conversion 
(and here we shouldn't need to worry about races causing some issues 
with strict correctness, since there can be one external control thread).

But were kprobes even inteneded for userspace?  There are races here 
that are difficult to close without some heavy machinery, and I would 
rather not put the machinery in place if simplifying the code is the 
right answer.

Zach

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 19/21] i386 Kprobes semaphore fix
  2005-11-08 13:36   ` Zachary Amsden
@ 2005-11-09 13:38     ` Andi Kleen
  2005-11-09 16:46       ` Zachary Amsden
       [not found]     ` <20051109093755.GA10361@in.ibm.com>
  1 sibling, 1 reply; 15+ messages in thread
From: Andi Kleen @ 2005-11-09 13:38 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: virtualization, Andrew Morton, Chris Wright, Linus Torvalds,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Martin Bligh, Pratap Subrahmanyam, Christopher Li,
	Eric W. Biederman, Ingo Molnar

On Tuesday 08 November 2005 14:36, Zachary Amsden wrote:

> One can imagine clever uses for ptrace to do, say user space
> virtualization (since I'm on the topic), or other neat things.  So there
> is nothing really wrong about having the fully correct EIP conversion
> (and here we shouldn't need to worry about races causing some issues
> with strict correctness, since there can be one external control thread).

Well, the code still scaries me a bit, but ok. x86-64 left at least one case 
intentionally out.

> But were kprobes even inteneded for userspace?  There are races here
> that are difficult to close without some heavy machinery, and I would
> rather not put the machinery in place if simplifying the code is the
> right answer.

I believe user space kprobes are being worked on by some IBM India folks yes.

-Andi

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 19/21] i386 Kprobes semaphore fix
  2005-11-09 13:38     ` Andi Kleen
@ 2005-11-09 16:46       ` Zachary Amsden
  2005-11-09 16:58         ` Ingo Molnar
  2005-11-11 15:25         ` Andi Kleen
  0 siblings, 2 replies; 15+ messages in thread
From: Zachary Amsden @ 2005-11-09 16:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: virtualization, Andrew Morton, Chris Wright, Linus Torvalds,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Martin Bligh, Pratap Subrahmanyam, Christopher Li,
	Eric W. Biederman, Ingo Molnar, prasanna, ananth,
	anil.s.keshavamurthy, davem

Andi Kleen wrote:

>On Tuesday 08 November 2005 14:36, Zachary Amsden wrote:
>
>  
>
>>One can imagine clever uses for ptrace to do, say user space
>>virtualization (since I'm on the topic), or other neat things.  So there
>>is nothing really wrong about having the fully correct EIP conversion
>>(and here we shouldn't need to worry about races causing some issues
>>with strict correctness, since there can be one external control thread).
>>    
>>
>
>Well, the code still scaries me a bit, but ok. x86-64 left at least one case 
>intentionally out.
>
>  
>
>>But were kprobes even inteneded for userspace?  There are races here
>>that are difficult to close without some heavy machinery, and I would
>>rather not put the machinery in place if simplifying the code is the
>>right answer.
>>    
>>
>
>I believe user space kprobes are being worked on by some IBM India folks yes.
>  
>

I'm convinced this is pointless.  What does it buy you over a ptrace 
based debugger?  Why would you want extra code running in the kernel 
that can be done perfectly well in userspace?

Let me stress that if you are running on modified segment state, you 
have no way to safely determine the virtual address on which you took an 
instruction trap (int3, general protection, etc..).  If you can't 
determine the virtual address safely, you can't back out your code patch 
to remove the breakpoint.  At this point, you can't execute the next 
instruction; you must wait for a _policy_ decision to be made.  Adding 
policy decisions like this to the kernel surely seems like a bad idea.  
If the fallback is to have a debugger running in userspace that has a 
user or script attached that can make the interactive decision, then why 
not solve the entire problem in userspace from the start?  It's a lot 
safer, it simplifies the kernel kprobes code a lot, and it leaves the 
debugger implementation free to do anything it chooses, as well as not 
locking the solution to a particular kernel.

Zach

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 19/21] i386 Kprobes semaphore fix
  2005-11-09 16:46       ` Zachary Amsden
@ 2005-11-09 16:58         ` Ingo Molnar
  2005-11-09 17:52           ` Zachary Amsden
  2005-11-11 15:25         ` Andi Kleen
  1 sibling, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2005-11-09 16:58 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Andi Kleen, virtualization, Andrew Morton, Chris Wright,
	Linus Torvalds, Linux Kernel Mailing List, H. Peter Anvin,
	Zwane Mwaikambo, Martin Bligh, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, prasanna, ananth,
	anil.s.keshavamurthy, davem


* Zachary Amsden <zach@vmware.com> wrote:

> >I believe user space kprobes are being worked on by some IBM India folks 
> >yes.
> 
> I'm convinced this is pointless.  What does it buy you over a ptrace 
> based debugger?  Why would you want extra code running in the kernel 
> that can be done perfectly well in userspace?

kprobes are not just for 'debuggers', they are also used for tracing and 
other dynamic instrumentation in projects like systemtap. Ptrace is way 
too slow and limited for things like that.

	Ingo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 19/21] i386 Kprobes semaphore fix
  2005-11-09 16:58         ` Ingo Molnar
@ 2005-11-09 17:52           ` Zachary Amsden
  2005-11-10 18:09             ` Prasanna S Panchamukhi
  2005-11-11 15:27             ` Andi Kleen
  0 siblings, 2 replies; 15+ messages in thread
From: Zachary Amsden @ 2005-11-09 17:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, virtualization, Andrew Morton, Chris Wright,
	Linus Torvalds, Linux Kernel Mailing List, H. Peter Anvin,
	Zwane Mwaikambo, Martin Bligh, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, prasanna, ananth,
	anil.s.keshavamurthy, davem

Ingo Molnar wrote:

>* Zachary Amsden <zach@vmware.com> wrote:
>
>  
>
>>>I believe user space kprobes are being worked on by some IBM India folks 
>>>yes.
>>>      
>>>
>>I'm convinced this is pointless.  What does it buy you over a ptrace 
>>based debugger?  Why would you want extra code running in the kernel 
>>that can be done perfectly well in userspace?
>>    
>>
>
>kprobes are not just for 'debuggers', they are also used for tracing and 
>other dynamic instrumentation in projects like systemtap. Ptrace is way 
>too slow and limited for things like that.
>  
>

Well, if there is a justification for it, that means we really should 
handle all the nasty EIP conversion cases due to segmentation and v8086 
mode in the kprobes code.  I was hoping that might not be the case.

Zach

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 19/21] i386 Kprobes semaphore fix
  2005-11-10 18:09             ` Prasanna S Panchamukhi
@ 2005-11-10 14:58               ` Zachary Amsden
  2005-11-10 16:16               ` H. Peter Anvin
  1 sibling, 0 replies; 15+ messages in thread
From: Zachary Amsden @ 2005-11-10 14:58 UTC (permalink / raw)
  To: prasanna
  Cc: Ingo Molnar, Andi Kleen, virtualization, Andrew Morton,
	Chris Wright, Linus Torvalds, Linux Kernel Mailing List,
	H. Peter Anvin, Zwane Mwaikambo, Martin Bligh,
	Pratap Subrahmanyam, Christopher Li, Eric W. Biederman, ananth,
	anil.s.keshavamurthy, davem

Prasanna S Panchamukhi wrote:

>On Wed, Nov 09, 2005 at 09:52:40AM -0800, Zachary Amsden wrote:
>  
>
>>Ingo Molnar wrote:
>>
>>    
>>
>>>* Zachary Amsden <zach@vmware.com> wrote:
>>>
>>>
>>>
>>>      
>>>
>>>>>I believe user space kprobes are being worked on by some IBM India folks 
>>>>>yes.
>>>>>    
>>>>>
>>>>>          
>>>>>
>>>>I'm convinced this is pointless.  What does it buy you over a ptrace 
>>>>based debugger?  Why would you want extra code running in the kernel 
>>>>that can be done perfectly well in userspace?
>>>>  
>>>>
>>>>        
>>>>
>>>kprobes are not just for 'debuggers', they are also used for tracing and 
>>>other dynamic instrumentation in projects like systemtap. Ptrace is way 
>>>too slow and limited for things like that.
>>>
>>>
>>>      
>>>
>>Well, if there is a justification for it, that means we really should 
>>handle all the nasty EIP conversion cases due to segmentation and v8086 
>>mode in the kprobes code.  I was hoping that might not be the case.
>>
>>    
>>
>
>As Ingo mentioned above, Systemtap uses kprobes infrastructure to provide
>dynamic kernel instrumentation. Using which user can add lots of probes 
>easily, so we need to take care of this fast path.  
>
>Instead of calling convert_eip_to_linear() for all cases, you can
>just check if it is in kernel mode and calculate the address directly
>
>	if (kernel mode)
>                addr = regs->eip - sizeof(kprobe_opcode_t);
>        else
>                addr = convert_eip_to_linear(..);
>
>there by avoiding call to convert_eip_to_linear () for every kernel probes.
>
>As Andi mentioned user space probes support is in progress and 
>this address conversion will help in case of user space probes as well.
>

I like this better.  I have to rework that patch anyways, since it no 
longer applies cleanly.

Zach


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 19/21] i386 Kprobes semaphore fix
  2005-11-10 18:09             ` Prasanna S Panchamukhi
  2005-11-10 14:58               ` Zachary Amsden
@ 2005-11-10 16:16               ` H. Peter Anvin
  1 sibling, 0 replies; 15+ messages in thread
From: H. Peter Anvin @ 2005-11-10 16:16 UTC (permalink / raw)
  To: prasanna
  Cc: Zachary Amsden, Ingo Molnar, Andi Kleen, virtualization,
	Andrew Morton, Chris Wright, Linus Torvalds,
	Linux Kernel Mailing List, Zwane Mwaikambo, Martin Bligh,
	Pratap Subrahmanyam, Christopher Li, Eric W. Biederman, ananth,
	anil.s.keshavamurthy, davem

Prasanna S Panchamukhi wrote:
> 
> As Ingo mentioned above, Systemtap uses kprobes infrastructure to provide
> dynamic kernel instrumentation. Using which user can add lots of probes 
> easily, so we foreed to take care of this fast path.  
> 
> Instead of calling convert_eip_to_linear() for all cases, you can
> just check if it is in kernel mode and calculate the address directly
> 
> 	if (kernel mode)
>                 addr = regs->eip - sizeof(kprobe_opcode_t);
>         else
>                 addr = convert_eip_to_linear(..);
> 
> there by avoiding call to convert_eip_to_linear () for every kernel probes.
> 
> As Andi mentioned user space probes support is in progress and 
> this address conversion will help in case of user space probes as well.
> 

Would it make sense for this to be part of convert_eip_to_linear?

	-hpa

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 19/21] i386 Kprobes semaphore fix
       [not found]     ` <20051109093755.GA10361@in.ibm.com>
@ 2005-11-10 16:33       ` Prasanna S Panchamukhi
  0 siblings, 0 replies; 15+ messages in thread
From: Prasanna S Panchamukhi @ 2005-11-10 16:33 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Andi Kleen, virtualization, Andrew Morton, Chris Wright,
	Linus Torvalds, Linux Kernel Mailing List, H. Peter Anvin,
	Zwane Mwaikambo, Martin Bligh, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

resending this mail, since my earlier email did not reach lkml.
On Wed, Nov 09, 2005 at 03:07:55PM +0530, Prasanna S Panchamukhi wrote:
> Zach,
> 
> Thanks for doing this.
> 
> On Tue, Nov 08, 2005 at 05:36:53AM -0800, Zachary Amsden wrote:
> > Andi Kleen wrote:
> > 
> > >On Tuesday 08 November 2005 05:39, Zachary Amsden wrote:
> > > 
> > >
> > >>IA-32 linear address translation is loads of fun.
> > >>   
> > >>
> > >
> > >Thanks for doing that audit work. Can you please double check x86-64 code 
> > >is
> > >ok? 
> > >
> > >Actually giving all that complexity maybe it would be better to just
> > >stop handling the case and remove all that. I'm not sure what kprobes 
> > >needs it for - it doesn't even handle user space yet and even if it ever 
> > >does it is unlikely that handling 16bit code makes much sense. And the 
> 
> 
> The code was added to address the problem related to stealing of interrupts from
> VM86. Please see the discussion thread for more details from the URL below
> http://lkml.org/lkml/2004/11/9/214
> 
> > But were kprobes even inteneded for userspace?  There are races here 
> > that are difficult to close without some heavy machinery, and I would 
> > rather not put the machinery in place if simplifying the code is the 
> > right answer.
> 
> Presently kprobes supports only kernel space probes. Work is in progress
> for user space probes support.
> 
> >+       addr = (kprobe_opcode_t *)convert_eip_to_linear(regs,
> >+                                       regs->eip -
> >sizeof(kprobe_opcode_t),
> >+                                       &current->mm->context, &limit);
> >+
> 
> Instead you can check if it is in kernel mode and calculate the address directly 
> first, since it is in the fast path.
> 		addr = regs->eip - sizeof(kprobe_opcode_t);
> 	else
> 		addr = convert_eip_to_linear(..);
> 
> there by avoiding calling convert_eip_to_linear () in case of every kernel probes.
> 
> 
> >+       /* Don't let userspace races re-address into kernel space */
> >+       if ((unsigned long)addr > limit)
> >+               return 0;
> 
> there is no need for this check here in the fast path, because kprobes handles this 
> case by checking if the address is on the kprobes hash list and later returning 
> from that point.
> 
> Please make sure it pass the test case discussed in the thread, URL is below.
> http://lkml.org/lkml/2004/11/9/214
> 
> Thanks
> -Prasanna
> --
> Prasanna S Panchamukhi
> Linux Technology Center
> India Software Labs, IBM Bangalore
> Ph: 91-80-25044636
> <prasanna@in.ibm.com>

-- 
Have a Nice Day!

Thanks & Regards
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Ph: 91-80-25044636
<prasanna@in.ibm.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 19/21] i386 Kprobes semaphore fix
  2005-11-09 17:52           ` Zachary Amsden
@ 2005-11-10 18:09             ` Prasanna S Panchamukhi
  2005-11-10 14:58               ` Zachary Amsden
  2005-11-10 16:16               ` H. Peter Anvin
  2005-11-11 15:27             ` Andi Kleen
  1 sibling, 2 replies; 15+ messages in thread
From: Prasanna S Panchamukhi @ 2005-11-10 18:09 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Ingo Molnar, Andi Kleen, virtualization, Andrew Morton,
	Chris Wright, Linus Torvalds, Linux Kernel Mailing List,
	H. Peter Anvin, Zwane Mwaikambo, Martin Bligh,
	Pratap Subrahmanyam, Christopher Li, Eric W. Biederman, ananth,
	anil.s.keshavamurthy, davem

On Wed, Nov 09, 2005 at 09:52:40AM -0800, Zachary Amsden wrote:
> Ingo Molnar wrote:
> 
> >* Zachary Amsden <zach@vmware.com> wrote:
> >
> > 
> >
> >>>I believe user space kprobes are being worked on by some IBM India folks 
> >>>yes.
> >>>     
> >>>
> >>I'm convinced this is pointless.  What does it buy you over a ptrace 
> >>based debugger?  Why would you want extra code running in the kernel 
> >>that can be done perfectly well in userspace?
> >>   
> >>
> >
> >kprobes are not just for 'debuggers', they are also used for tracing and 
> >other dynamic instrumentation in projects like systemtap. Ptrace is way 
> >too slow and limited for things like that.
> > 
> >
> 
> Well, if there is a justification for it, that means we really should 
> handle all the nasty EIP conversion cases due to segmentation and v8086 
> mode in the kprobes code.  I was hoping that might not be the case.
> 

As Ingo mentioned above, Systemtap uses kprobes infrastructure to provide
dynamic kernel instrumentation. Using which user can add lots of probes 
easily, so we need to take care of this fast path.  

Instead of calling convert_eip_to_linear() for all cases, you can
just check if it is in kernel mode and calculate the address directly

	if (kernel mode)
                addr = regs->eip - sizeof(kprobe_opcode_t);
        else
                addr = convert_eip_to_linear(..);

there by avoiding call to convert_eip_to_linear () for every kernel probes.

As Andi mentioned user space probes support is in progress and 
this address conversion will help in case of user space probes as well.


Thanks
Prasanna
-- 

Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Ph: 91-80-25044636
<prasanna@in.ibm.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 19/21] i386 Kprobes semaphore fix
  2005-11-09 16:46       ` Zachary Amsden
  2005-11-09 16:58         ` Ingo Molnar
@ 2005-11-11 15:25         ` Andi Kleen
  2005-11-14  5:54           ` Prasanna S Panchamukhi
  1 sibling, 1 reply; 15+ messages in thread
From: Andi Kleen @ 2005-11-11 15:25 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: virtualization, Andrew Morton, Chris Wright, Linus Torvalds,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Martin Bligh, Pratap Subrahmanyam, Christopher Li,
	Eric W. Biederman, Ingo Molnar, prasanna, ananth,
	anil.s.keshavamurthy, davem

On Wednesday 09 November 2005 17:46, Zachary Amsden wrote:
> Andi Kleen wrote:
> >On Tuesday 08 November 2005 14:36, Zachary Amsden wrote:
> >>One can imagine clever uses for ptrace to do, say user space
> >>virtualization (since I'm on the topic), or other neat things.  So there
> >>is nothing really wrong about having the fully correct EIP conversion
> >>(and here we shouldn't need to worry about races causing some issues
> >>with strict correctness, since there can be one external control thread).
> >
> >Well, the code still scaries me a bit, but ok. x86-64 left at least one
> > case intentionally out.
> >
> >>But were kprobes even inteneded for userspace?  There are races here
> >>that are difficult to close without some heavy machinery, and I would
> >>rather not put the machinery in place if simplifying the code is the
> >>right answer.
> >
> >I believe user space kprobes are being worked on by some IBM India folks
> > yes.
>
> I'm convinced this is pointless.  What does it buy you over a ptrace
> based debugger?  Why would you want extra code running in the kernel
> that can be done perfectly well in userspace?

People might want to do something like "attach trace point to glibc function 
X" for all processes on the system. Attaching ptrace to everything would 
cause large overhead. systemtap really handles a different need - of just low 
overhead tracing, not heavy weight debugging.

> Let me stress that if you are running on modified segment state, you
> have no way to safely determine the virtual address on which you took an
> instruction trap (int3, general protection, etc..).  If you can't
> determine the virtual address safely, you can't back out your code patch
> to remove the breakpoint.  At this point, you can't execute the next

Kernel kprobes solves this by executing the code out of line. I don't know
how they want to do that in user space though (need a safe address for that),
but somehow that can be likely done.

> instruction; you must wait for a _policy_ decision to be made.  Adding
> policy decisions like this to the kernel surely seems like a bad idea.
> If the fallback is to have a debugger running in userspace that has a
> user or script attached that can make the interactive decision, then why
> not solve the entire problem in userspace from the start?  It's a lot

Doing it in user space would make it hard to do global tracing, and it 
also likely would have much higher overhead.

-Andi

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 19/21] i386 Kprobes semaphore fix
  2005-11-09 17:52           ` Zachary Amsden
  2005-11-10 18:09             ` Prasanna S Panchamukhi
@ 2005-11-11 15:27             ` Andi Kleen
  1 sibling, 0 replies; 15+ messages in thread
From: Andi Kleen @ 2005-11-11 15:27 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Ingo Molnar, virtualization, Andrew Morton, Chris Wright,
	Linus Torvalds, Linux Kernel Mailing List, H. Peter Anvin,
	Zwane Mwaikambo, Martin Bligh, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, prasanna, ananth,
	anil.s.keshavamurthy, davem

On Wednesday 09 November 2005 18:52, Zachary Amsden wrote:

> Well, if there is a justification for it, that means we really should
> handle all the nasty EIP conversion cases due to segmentation and v8086
> mode in the kprobes code.  I was hoping that might not be the case.

Or just forbid kprobes for 16bit processes (or anything running in a non GDT
code segment). Would be perfectly reasonable IMHO.

-Andi

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 19/21] i386 Kprobes semaphore fix
  2005-11-11 15:25         ` Andi Kleen
@ 2005-11-14  5:54           ` Prasanna S Panchamukhi
  0 siblings, 0 replies; 15+ messages in thread
From: Prasanna S Panchamukhi @ 2005-11-14  5:54 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Zachary Amsden, virtualization, Andrew Morton, Chris Wright,
	Linus Torvalds, Linux Kernel Mailing List, H. Peter Anvin,
	Zwane Mwaikambo, Martin Bligh, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar, ananth,
	anil.s.keshavamurthy, davem

> > Let me stress that if you are running on modified segment state, you
> > have no way to safely determine the virtual address on which you took an
> > instruction trap (int3, general protection, etc..).  If you can't
> > determine the virtual address safely, you can't back out your code patch
> > to remove the breakpoint.  At this point, you can't execute the next
> 
> Kernel kprobes solves this by executing the code out of line. I don't know
> how they want to do that in user space though (need a safe address for that),
> but somehow that can be likely done.

In case of user space probes we adopt a similar method for executing the code
out-of-line. In user space probes  we find free space in the current
 process address space and copy the original instruction to that location and
 execute that instruction from that location. User processes use stack space
 to store local variables, agruments and return values. Normally the stack
 space either below or above the stack pointer indicates the free stack space.
Also in case of no stack free space, we can expand the process stack, copy the 
instruction and execute the instruction from that location.
Detials about this method is discussed on systemtap mailing lists. URL is below.

http://sourceware.org/ml/systemtap/2005-q3/msg00542.html

Please let me know if you have any other solution to the above problem.

Thanks
Prasanna
-- 
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: prasanna@in.ibm.com
Ph: 91-80-25044636

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2005-11-14  5:52 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-11-08  4:39 [PATCH 19/21] i386 Kprobes semaphore fix Zachary Amsden
2005-11-08 13:12 ` Andi Kleen
2005-11-08 13:36   ` Zachary Amsden
2005-11-09 13:38     ` Andi Kleen
2005-11-09 16:46       ` Zachary Amsden
2005-11-09 16:58         ` Ingo Molnar
2005-11-09 17:52           ` Zachary Amsden
2005-11-10 18:09             ` Prasanna S Panchamukhi
2005-11-10 14:58               ` Zachary Amsden
2005-11-10 16:16               ` H. Peter Anvin
2005-11-11 15:27             ` Andi Kleen
2005-11-11 15:25         ` Andi Kleen
2005-11-14  5:54           ` Prasanna S Panchamukhi
     [not found]     ` <20051109093755.GA10361@in.ibm.com>
2005-11-10 16:33       ` Prasanna S Panchamukhi
     [not found] <20051108074430.GG28201@elte.hu>
2005-11-08 13:26 ` Zachary Amsden

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox