[PATCH] fix for futex_wait signal stack corruption

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] fix for futex_wait signal stack corruption
@ 2007-12-04 20:57 Steven Rostedt
  2007-12-04 21:09 ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Steven Rostedt @ 2007-12-04 20:57 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner
  Cc: LKML, David Holmes - Sun Microsystems, Andrew Morton,
	Linus Torvalds


David Holmes found a bug in the RT patch with respect to
pthread_cond_timedwait. After trying his test program on the latest git
from mainline, I found the bug was there too.  The bug he was seeing
that his test program showed, was that if one were to do a "Ctrl-Z" on a
process that was in the pthread_cond_timedwait, and then did a "bg" on
that process, it would return with a "-ETIMEDOUT" but early. That is,
the timer would go off early.

Looking into this, I found the source of the problem. And it is a rather
nasty bug at that.

Here's the relevant code from kernel/futex.c: (not in order in the file)

[...]
smlinkage long sys_futex(u32 __user *uaddr, int op, u32 val,
                          struct timespec __user *utime, u32 __user *uaddr2,
                          u32 val3)
{
        struct timespec ts;
        ktime_t t, *tp = NULL;
        u32 val2 = 0;
        int cmd = op & FUTEX_CMD_MASK;

        if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI)) {
                if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
                        return -EFAULT;
                if (!timespec_valid(&ts))
                        return -EINVAL;

                t = timespec_to_ktime(ts);
                if (cmd == FUTEX_WAIT)
                        t = ktime_add(ktime_get(), t);
                tp = &t;
        }
[...]
        return do_futex(uaddr, op, val, tp, uaddr2, val2, val3);
}

[...]

long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
                u32 __user *uaddr2, u32 val2, u32 val3)
{
        int ret;
        int cmd = op & FUTEX_CMD_MASK;
        struct rw_semaphore *fshared = NULL;

        if (!(op & FUTEX_PRIVATE_FLAG))
                fshared = &current->mm->mmap_sem;

        switch (cmd) {
        case FUTEX_WAIT:
                ret = futex_wait(uaddr, fshared, val, timeout);

[...]

static int futex_wait(u32 __user *uaddr, struct rw_semaphore *fshared,
                      u32 val, ktime_t *abs_time)
{
[...]
               struct restart_block *restart;
                restart = &current_thread_info()->restart_block;
                restart->fn = futex_wait_restart;
                restart->arg0 = (unsigned long)uaddr;
                restart->arg1 = (unsigned long)val;
                restart->arg2 = (unsigned long)abs_time;
                restart->arg3 = 0;
                if (fshared)
                        restart->arg3 |= ARG3_SHARED;
                return -ERESTART_RESTARTBLOCK;
[...]

static long futex_wait_restart(struct restart_block *restart)
{
        u32 __user *uaddr = (u32 __user *)restart->arg0;
        u32 val = (u32)restart->arg1;
        ktime_t *abs_time = (ktime_t *)restart->arg2;
        struct rw_semaphore *fshared = NULL;

        restart->fn = do_no_restart_syscall;
        if (restart->arg3 & ARG3_SHARED)
                fshared = &current->mm->mmap_sem;
        return (long)futex_wait(uaddr, fshared, val, abs_time);
}


So when the futex_wait is interrupt by a signal we break out of the
hrtimer code and set up or return from signal. This code does not return
back to userspace, so we set up a RESTARTBLOCK.  The bug here is that we
save the "abs_time" which is a pointer to the stack variable "ktime_t t"
from sys_futex.

This returns and unwinds the stack before we get to call our signal. On
return from the signal we go to futex_wait_restart, where we update all
the parameters for futex_wait and call it. But here we have a problem
where abs_time is no longer valid.

I verified this with print statements, and sure enough, what abs_time
was set to ends up being garbage when we get to futex_wait_restart.

The solution I did to solve this is to allocate a temporary buffer when
setting up the block and free it in futex_wait_restart. This patch
allows David's test program to actually pass.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

diff --git a/kernel/futex.c b/kernel/futex.c
index 9dc591a..74be1cb 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1288,11 +1288,27 @@ static int futex_wait(u32 __user *uaddr, struct rw_semaphore *fshared,
 		return -ERESTARTSYS;
 	else {
 		struct restart_block *restart;
+		ktime_t *tmp_time = NULL;
+
+		/*
+		 * abs_time is on the stack and is not a parameter.
+		 * If we save it, it will be overridden on return and
+		 * what is sent to futex_wait_restart will be corrupted.
+		 */
+		if (abs_time) {
+			tmp_time = kmalloc(sizeof(*tmp_time), GFP_KERNEL);
+			if (unlikely(!tmp_time))
+				/* Ahh, what else can we do?? */
+				printk(KERN_WARNING
+				       "Can't allocate temp timer storage\n");
+			else
+				*tmp_time = *abs_time;
+		}
 		restart = &current_thread_info()->restart_block;
 		restart->fn = futex_wait_restart;
 		restart->arg0 = (unsigned long)uaddr;
 		restart->arg1 = (unsigned long)val;
-		restart->arg2 = (unsigned long)abs_time;
+		restart->arg2 = (unsigned long)tmp_time;
 		restart->arg3 = 0;
 		if (fshared)
 			restart->arg3 |= ARG3_SHARED;
@@ -1312,13 +1328,19 @@ static long futex_wait_restart(struct restart_block *restart)
 {
 	u32 __user *uaddr = (u32 __user *)restart->arg0;
 	u32 val = (u32)restart->arg1;
-	ktime_t *abs_time = (ktime_t *)restart->arg2;
+	ktime_t *tmp_time = (ktime_t *)restart->arg2;
+	ktime_t t;
 	struct rw_semaphore *fshared = NULL;
 
+	if (tmp_time) {
+		t = *tmp_time;
+		kfree(tmp_time);
+		tmp_time = &t;
+	}
 	restart->fn = do_no_restart_syscall;
 	if (restart->arg3 & ARG3_SHARED)
 		fshared = &current->mm->mmap_sem;
-	return (long)futex_wait(uaddr, fshared, val, abs_time);
+	return (long)futex_wait(uaddr, fshared, val, tmp_time);
 }
 
 



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] fix for futex_wait signal stack corruption
  2007-12-04 20:57 [PATCH] fix for futex_wait signal stack corruption Steven Rostedt
@ 2007-12-04 21:09 ` Linus Torvalds
  2007-12-04 21:39   ` Steven Rostedt
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2007-12-04 21:09 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Thomas Gleixner, LKML,
	David Holmes - Sun Microsystems, Andrew Morton

On Tue, 4 Dec 2007, Steven Rostedt wrote:
>
> The solution I did to solve this is to allocate a temporary buffer when
> setting up the block and free it in futex_wait_restart. This patch
> allows David's test program to actually pass.

No. Unacceptable. This is a memory leak in case nobody retries it. It's 
basically not how you can do this thing.

The *only* thing you can pass for a system call restart is the argument 
block register state. If that is not enough, then you cannot restart it. 
It's that simple.

Andrew, please do *not* put this in any queues. It's fundamentally broken, 
and cannot be fixed as is. 

		Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] fix for futex_wait signal stack corruption
  2007-12-04 21:09 ` Linus Torvalds
@ 2007-12-04 21:39   ` Steven Rostedt
  2007-12-04 22:43     ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Steven Rostedt @ 2007-12-04 21:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Thomas Gleixner, LKML,
	David Holmes - Sun Microsystems, Andrew Morton

On Tue, Dec 04, 2007 at 01:09:36PM -0800, Linus Torvalds wrote:
> 
> 
> On Tue, 4 Dec 2007, Steven Rostedt wrote:
> >
> > The solution I did to solve this is to allocate a temporary buffer when
> > setting up the block and free it in futex_wait_restart. This patch
> > allows David's test program to actually pass.
> 
> No. Unacceptable. This is a memory leak in case nobody retries it. It's 
> basically not how you can do this thing.

Fair enough.

I was misguided, that the return func had to be called.

> 
> The *only* thing you can pass for a system call restart is the argument 
> block register state. If that is not enough, then you cannot restart it. 
> It's that simple.
> 
> Andrew, please do *not* put this in any queues. It's fundamentally broken, 
> and cannot be fixed as is. 

Yep, trash it.

Seems that arg3 is not used here and since the timer is 64 bits, we can
store the bottom 32 bits in arg2 and the top in arg3 (this will work for
both 32 and 64 bit archs).

This will eliminate the need for kmalloc (I didn't like that solution
anyway).

New patch on its way (after I get some food to eat).

Thanks,

-- Steve


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] fix for futex_wait signal stack corruption
  2007-12-04 21:39   ` Steven Rostedt
@ 2007-12-04 22:43     ` Linus Torvalds
  2007-12-05  1:23       ` Steven Rostedt
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2007-12-04 22:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Thomas Gleixner, LKML,
	David Holmes - Sun Microsystems, Andrew Morton

On Tue, 4 Dec 2007, Steven Rostedt wrote:
> 
> Seems that arg3 is not used here and since the timer is 64 bits, we can
> store the bottom 32 bits in arg2 and the top in arg3 (this will work for
> both 32 and 64 bit archs).

Yes. That should work fine.

The restart logic sometimes results in odd calling conventions, and quite 
frankly, we could just change how "restart_block" looks too. There is 
nothing that says that it has to be

	unsigned long arg0, arg1, arg2, arg3

and that particular layout was just picked on a whim. The only issue is:

 - we don't want the restart block to be *too* large, since it's part of 
   the thread info.
 - but we need to have enough room for all the system calls that want to 
   use the restart block, and preferably in a reasonable format.

So far, using "unsigned long" has been good enough, in that it's big 
enough for a pointer and all normal arguments, but if something really 
deeply wants another format or a guaranteed 64-bit word regardless of 
architecture, we could make one or more of the arguments be "u64" instead.

But in this case, since there is already unused argument space, I think 
that doing the "32 high bits + 32 low bits" is probably the best option.

			Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] fix for futex_wait signal stack corruption
  2007-12-04 22:43     ` Linus Torvalds
@ 2007-12-05  1:23       ` Steven Rostedt
  2007-12-05  1:46         ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Steven Rostedt @ 2007-12-05  1:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Thomas Gleixner, LKML,
	David Holmes - Sun Microsystems, Andrew Morton

On Tue, 4 Dec 2007, Linus Torvalds wrote:

> On Tue, 4 Dec 2007, Steven Rostedt wrote:
> >
> > Seems that arg3 is not used here and since the timer is 64 bits, we can
> > store the bottom 32 bits in arg2 and the top in arg3 (this will work for
> > both 32 and 64 bit archs).
>
> Yes. That should work fine.

Unfortunately, looking closer at the code, arg3 _is_ used. arg3 is used to
set a flag whether or not the mmap_sem is shared.
(I blame lack of food for not noticing this in the first place).

[...]

> So far, using "unsigned long" has been good enough, in that it's big
> enough for a pointer and all normal arguments, but if something really
> deeply wants another format or a guaranteed 64-bit word regardless of
> architecture, we could make one or more of the arguments be "u64" instead.
>
> But in this case, since there is already unused argument space, I think
> that doing the "32 high bits + 32 low bits" is probably the best option.

Since arg3 is out, which do you prefer? Creating an arg4 (and perhaps
more) in the block or having a u64 arg?  Changing all the args to u64 may
be the best. This way we don't need to worry about passing 64bit
arguments, and we don't waste more space on 64 bit archs by adding more
args of unsigned long.

Thoughts?

-- Steve

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] fix for futex_wait signal stack corruption
  2007-12-05  1:23       ` Steven Rostedt
@ 2007-12-05  1:46         ` Linus Torvalds
  2007-12-05  3:17           ` [PATCH -v2] " Steven Rostedt
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2007-12-05  1:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Thomas Gleixner, LKML,
	David Holmes - Sun Microsystems, Andrew Morton

On Tue, 4 Dec 2007, Steven Rostedt wrote:
> 
> Since arg3 is out, which do you prefer? Creating an arg4 (and perhaps
> more) in the block or having a u64 arg?  Changing all the args to u64 may
> be the best.

I suspect that the best option is probably to make that thing a unnamed 
union of the actual different types the different restart cases needs, 
which also allows you to name things appropriately and not have any wasted 
space.

Leave the arg0-3 ones around as one of the unions, both to avoid having to 
change other things and to have a "generic" one for stuff that simply 
doesn't much care (in order to not have tons and tons of substructures to 
the union when most users really don't need any fancy types).

I think we already depend on recent-enough gcc's that unnamed unions are 
ok and we don't need to play games with naming.

		Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH -v2] fix for futex_wait signal stack corruption
  2007-12-05  1:46         ` Linus Torvalds
@ 2007-12-05  3:17           ` Steven Rostedt
  2007-12-05  3:41             ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Steven Rostedt @ 2007-12-05  3:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Thomas Gleixner, LKML,
	David Holmes - Sun Microsystems, Andrew Morton


David Holmes found a bug in the RT patch with respect to
pthread_cond_timedwait. After trying his test program on the latest git
from mainline, I found the bug was there too.  The bug he was seeing
that his test program showed, was that if one were to do a "Ctrl-Z" on a
process that was in the pthread_cond_timedwait, and then did a "bg" on
that process, it would return with a "-ETIMEDOUT" but early. That is,
the timer would go off early.

Looking into this, I found the source of the problem. And it is a rather
nasty bug at that.

Here's the relevant code from kernel/futex.c: (not in order in the file)

[...]
smlinkage long sys_futex(u32 __user *uaddr, int op, u32 val,
                          struct timespec __user *utime, u32 __user *uaddr2,
                          u32 val3)
{
        struct timespec ts;
        ktime_t t, *tp = NULL;
        u32 val2 = 0;
        int cmd = op & FUTEX_CMD_MASK;

        if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI)) {
                if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
                        return -EFAULT;
                if (!timespec_valid(&ts))
                        return -EINVAL;

                t = timespec_to_ktime(ts);
                if (cmd == FUTEX_WAIT)
                        t = ktime_add(ktime_get(), t);
                tp = &t;
        }
[...]
        return do_futex(uaddr, op, val, tp, uaddr2, val2, val3);
}

[...]

long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
                u32 __user *uaddr2, u32 val2, u32 val3)
{
        int ret;
        int cmd = op & FUTEX_CMD_MASK;
        struct rw_semaphore *fshared = NULL;

        if (!(op & FUTEX_PRIVATE_FLAG))
                fshared = &current->mm->mmap_sem;

        switch (cmd) {
        case FUTEX_WAIT:
                ret = futex_wait(uaddr, fshared, val, timeout);

[...]

static int futex_wait(u32 __user *uaddr, struct rw_semaphore *fshared,
                      u32 val, ktime_t *abs_time)
{
[...]
               struct restart_block *restart;
                restart = &current_thread_info()->restart_block;
                restart->fn = futex_wait_restart;
                restart->arg0 = (unsigned long)uaddr;
                restart->arg1 = (unsigned long)val;
                restart->arg2 = (unsigned long)abs_time;
                restart->arg3 = 0;
                if (fshared)
                        restart->arg3 |= ARG3_SHARED;
                return -ERESTART_RESTARTBLOCK;
[...]

static long futex_wait_restart(struct restart_block *restart)
{
        u32 __user *uaddr = (u32 __user *)restart->arg0;
        u32 val = (u32)restart->arg1;
        ktime_t *abs_time = (ktime_t *)restart->arg2;
        struct rw_semaphore *fshared = NULL;

        restart->fn = do_no_restart_syscall;
        if (restart->arg3 & ARG3_SHARED)
                fshared = &current->mm->mmap_sem;
        return (long)futex_wait(uaddr, fshared, val, abs_time);
}


So when the futex_wait is interrupt by a signal we break out of the
hrtimer code and set up or return from signal. This code does not return
back to userspace, so we set up a RESTARTBLOCK.  The bug here is that we
save the "abs_time" which is a pointer to the stack variable "ktime_t t"
from sys_futex.

This returns and unwinds the stack before we get to call our signal. On
return from the signal we go to futex_wait_restart, where we update all
the parameters for futex_wait and call it. But here we have a problem
where abs_time is no longer valid.

I verified this with print statements, and sure enough, what abs_time
was set to ends up being garbage when we get to futex_wait_restart.

The solution I did to solve this (with input from Linus Torvalds)
was to add unions to the restart_block to allow system calls to
use the restart with specific parameters.  This way the futex code now
saves the time in a 64bit value in the restart block instead of storing
it on the stack.

Note: I'm a bit nervious to add "linux/types.h" and use u32 and u64
in thread_info.h, when there's a #ifdef __KERNEL__ just below that.
Not sure what that is there for.  If this turns out to be a problem, I've
tested this with using "unsigned int" for u32 and "unsigned long long" for
u64 and it worked just the same. I'm using u32 and u64 just to be
consistent with what the futex code uses.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

 include/linux/thread_info.h |   15 ++++++++++++++-
 kernel/futex.c              |   25 +++++++++++++------------
 2 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 1c4eb41..d97c874 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -7,12 +7,25 @@
 #ifndef _LINUX_THREAD_INFO_H
 #define _LINUX_THREAD_INFO_H

+#include <linux/types.h>
+
 /*
  * System call restart block.
  */
 struct restart_block {
 	long (*fn)(struct restart_block *);
-	unsigned long arg0, arg1, arg2, arg3;
+	union {
+		struct {
+			unsigned long arg0, arg1, arg2, arg3;
+		};
+		/* For futex_wait */
+		struct {
+			u32 *uaddr;
+			u32 val;
+			u32 flags;
+			u64 time;
+		} fu;
+	};
 };

 extern long do_no_restart_syscall(struct restart_block *parm);
diff --git a/kernel/futex.c b/kernel/futex.c
index 9dc591a..ad3b6e3 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1149,9 +1149,9 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,

 /*
  * In case we must use restart_block to restart a futex_wait,
- * we encode in the 'arg3' shared capability
+ * we encode in the 'flags' shared capability
  */
-#define ARG3_SHARED  1
+#define FLAGS_SHARED  1

 static long futex_wait_restart(struct restart_block *restart);

@@ -1290,12 +1290,13 @@ static int futex_wait(u32 __user *uaddr, struct rw_semaphore *fshared,
 		struct restart_block *restart;
 		restart = &current_thread_info()->restart_block;
 		restart->fn = futex_wait_restart;
-		restart->arg0 = (unsigned long)uaddr;
-		restart->arg1 = (unsigned long)val;
-		restart->arg2 = (unsigned long)abs_time;
-		restart->arg3 = 0;
+		restart->fu.uaddr = (u32*)uaddr;
+		restart->fu.val = val;
+		restart->fu.time = abs_time->tv64;
+		restart->fu.flags = 0;
+
 		if (fshared)
-			restart->arg3 |= ARG3_SHARED;
+			restart->fu.flags |= FLAGS_SHARED;
 		return -ERESTART_RESTARTBLOCK;
 	}

@@ -1310,15 +1311,15 @@ static int futex_wait(u32 __user *uaddr, struct rw_semaphore *fshared,

 static long futex_wait_restart(struct restart_block *restart)
 {
-	u32 __user *uaddr = (u32 __user *)restart->arg0;
-	u32 val = (u32)restart->arg1;
-	ktime_t *abs_time = (ktime_t *)restart->arg2;
+	u32 __user *uaddr = (u32 __user *)restart->fu.uaddr;
 	struct rw_semaphore *fshared = NULL;
+	ktime_t t;

+	t.tv64 = restart->fu.time;
 	restart->fn = do_no_restart_syscall;
-	if (restart->arg3 & ARG3_SHARED)
+	if (restart->fu.flags & FLAGS_SHARED)
 		fshared = &current->mm->mmap_sem;
-	return (long)futex_wait(uaddr, fshared, val, abs_time);
+	return (long)futex_wait(uaddr, fshared, restart->fu.val, &t);
 }



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH -v2] fix for futex_wait signal stack corruption
  2007-12-05  3:17           ` [PATCH -v2] " Steven Rostedt
@ 2007-12-05  3:41             ` Linus Torvalds
  2007-12-05  3:47               ` Randy Dunlap
                                 ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Linus Torvalds @ 2007-12-05  3:41 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Thomas Gleixner, LKML,
	David Holmes - Sun Microsystems, Andrew Morton

Patch looks fine to me.

On Tue, 4 Dec 2007, Steven Rostedt wrote:
> 
> Note: I'm a bit nervious to add "linux/types.h" and use u32 and u64
> in thread_info.h, when there's a #ifdef __KERNEL__ just below that.
> Not sure what that is there for.

Hmm. I'd not expect user-mode headers to ever include 
<linux/thread-info.h>, and if they do, they'd already get get totally 
invalid namespace pollution ("struct restart_block" at a minimum) along 
with stuff that simply isn't sensible in user-space at all, so I think 
this part is fine.

And I guess somebody will scream if it bites them ;)

Anyway, my gut feel is that this is potentially a real problem, and we 
should fix it asap (ie it should go into 2.6.24 even at this late stage in 
the game), but it would be nice to know if the problem actually hit any 
actual real program, and not just a test-setup.

So here's a question for David Holmes:  What caused you to actually notice 
this behaviour? Can this actually be seen in real life usage?

Anyway, at a minimum, here's an

	Acked-by: Linus Torvalds <torvalds@linux-foundation.org>

and I suspect I should just apply it directly. Any comments from anybody 
else?

		Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -v2] fix for futex_wait signal stack corruption
  2007-12-05  3:41             ` Linus Torvalds
@ 2007-12-05  3:47               ` Randy Dunlap
  2007-12-05  3:53                 ` Steven Rostedt
  2007-12-05  5:33               ` David Holmes - Sun Microsystems
  2007-12-05  5:54               ` Thomas Gleixner
  2 siblings, 1 reply; 15+ messages in thread
From: Randy Dunlap @ 2007-12-05  3:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Ingo Molnar, Thomas Gleixner, LKML,
	David Holmes - Sun Microsystems, Andrew Morton

On Tue, 4 Dec 2007 19:41:32 -0800 (PST) Linus Torvalds wrote:

> 
> Patch looks fine to me.
> 
> On Tue, 4 Dec 2007, Steven Rostedt wrote:
> > 
> > Note: I'm a bit nervious to add "linux/types.h" and use u32 and u64
> > in thread_info.h, when there's a #ifdef __KERNEL__ just below that.
> > Not sure what that is there for.
> 
> Hmm. I'd not expect user-mode headers to ever include 
> <linux/thread-info.h>, and if they do, they'd already get get totally 
> invalid namespace pollution ("struct restart_block" at a minimum) along 
> with stuff that simply isn't sensible in user-space at all, so I think 
> this part is fine.
> 
> And I guess somebody will scream if it bites them ;)
> 
> Anyway, my gut feel is that this is potentially a real problem, and we 
> should fix it asap (ie it should go into 2.6.24 even at this late stage in 
> the game), but it would be nice to know if the problem actually hit any 
> actual real program, and not just a test-setup.

Steve/David,

where can I find the test case, please?

> So here's a question for David Holmes:  What caused you to actually notice 
> this behaviour? Can this actually be seen in real life usage?
> 
> Anyway, at a minimum, here's an
> 
> 	Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
> 
> and I suspect I should just apply it directly. Any comments from anybody 
> else?

---
~Randy
Features and documentation: http://lwn.net/Articles/260136/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -v2] fix for futex_wait signal stack corruption
  2007-12-05  3:47               ` Randy Dunlap
@ 2007-12-05  3:53                 ` Steven Rostedt
  0 siblings, 0 replies; 15+ messages in thread
From: Steven Rostedt @ 2007-12-05  3:53 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, LKML,
	David Holmes - Sun Microsystems, Andrew Morton


On Tue, 4 Dec 2007, Randy Dunlap wrote:
>
> Steve/David,
>
> where can I find the test case, please?

David attached it to the RH Bugzilla:

https://bugzilla.redhat.com/show_bug.cgi?id=408321

-- Steve


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -v2] fix for futex_wait signal stack corruption
  2007-12-05  3:41             ` Linus Torvalds
  2007-12-05  3:47               ` Randy Dunlap
@ 2007-12-05  5:33               ` David Holmes - Sun Microsystems
  2007-12-05  6:06                 ` Linus Torvalds
  2007-12-05  5:54               ` Thomas Gleixner
  2 siblings, 1 reply; 15+ messages in thread
From: David Holmes - Sun Microsystems @ 2007-12-05  5:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Ingo Molnar, Thomas Gleixner, LKML, Andrew Morton

Linus Torvalds said the following on  5/12/07 01:41 PM:
> So here's a question for David Holmes:  What caused you to actually notice 
> this behaviour? Can this actually be seen in real life usage?

We observed an application "hang" that turned out to be caused by a 
clock mismatch between that used with the pthread_cond_t and that used 
to convert a relative wait time to an absolute one. When the program ran 
in the foreground and hung I used ctrl-Z to suspend it then "bg" to 
background it. As soon as I did that the application became unstuck.

While this was observed with process control signals, my concern was 
that other signals might cause pthread_cond_timedwait to return 
immediately in the same way. The test program allows for SIGUSR1 and 
SIGRTMIN testing as well, but these other signals did not cause the 
immediate return. But it would seem from Steven's analysis that this is 
just a fortuitous result. If I understand things correctly, any 
interruption of pthread_cond_timedwait by a signal, could result in 
waiting until an arbitrary time - depending on how the stack value was 
corrupted. Is that correct?

Thanks,
David Holmes
Senior Java Technologist
Java SE VM Real-time and Embedded Group
---------------------------------------

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -v2] fix for futex_wait signal stack corruption
  2007-12-05  5:33               ` David Holmes - Sun Microsystems
@ 2007-12-05  6:06                 ` Linus Torvalds
  2007-12-05  6:14                   ` David Holmes - Sun Microsystems
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2007-12-05  6:06 UTC (permalink / raw)
  To: David Holmes - Sun Microsystems
  Cc: Steven Rostedt, Ingo Molnar, Thomas Gleixner, LKML, Andrew Morton

On Wed, 5 Dec 2007, David Holmes - Sun Microsystems wrote:
> 
> While this was observed with process control signals, my concern was that
> other signals might cause pthread_cond_timedwait to return immediately in the
> same way. The test program allows for SIGUSR1 and SIGRTMIN testing as well,
> but these other signals did not cause the immediate return. But it would seem
> from Steven's analysis that this is just a fortuitous result. If I understand
> things correctly, any interruption of pthread_cond_timedwait by a signal,
> could result in waiting until an arbitrary time - depending on how the stack
> value was corrupted. Is that correct?

No, very few things can actually cause the restart_block path to be taken. 
An actual signal execution would turn that into an EINTR, the only case 
that should ever trigger this is a signal that causes some kernel action 
(ie the system call *is* interrupted), but does not actually result in any 
user-visible state changes.

The classic case is ^Z + bg, but iirc you can trigger it with ptrace too. 
And I think two threads racing to pick up the same signal can cause it 
too, for that matter (ie one thread takes the signal, the other one got 
interrupted but there's nothing there, so it just causes a system call 
restart).

There's basically two different system call restart mechanisms in the 
kernel:

 - returning -ERESTARTNOHAND will cause the system call to be restarted 
   with the *original* arguments if no signal handler was actually 
   invoked. This has been around for a long time, and is used by a lot of 
   system calls. It's fine for things that are idempotent, ie the argument 
   meaning doesn't change over time (things like a "read()" system call, 
   for example)

 - the "restart_block" model that returns -ERESTARTBLOCK, which will cause 
   the system call to be restarted with the arguments specified in the 
   system call restart block. This is for system calls that are *not* 
   idempotent, ie the argument might be a relative timeout or something 
   like that, where we need to actually behave *differently* when 
   restarting.

The latter case is "new" (it's been around for a while, but relative to 
the ERESTARTNOHAND one), and it relies on the system call itself setting 
up its restart point and the argument save area. And each such system call 
can obviously screw it up by saving/restoring the arguments with the 
incorrect semantics.

So this bug was really (a) specific to that particular futex restart 
mechanism, and (b) only triggers for the (rather unusual) case where the 
system call gets interrupted by a signal, but no signal handler actually 
happens. In practice, ^Z is the most common case by far (other signals are 
either ignored and don't even cause an interrupt event in the first place, 
or they are "real" signals, and cause a signal handler to be invoked).

			Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -v2] fix for futex_wait signal stack corruption
  2007-12-05  6:06                 ` Linus Torvalds
@ 2007-12-05  6:14                   ` David Holmes - Sun Microsystems
  0 siblings, 0 replies; 15+ messages in thread
From: David Holmes - Sun Microsystems @ 2007-12-05  6:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Ingo Molnar, Thomas Gleixner, LKML, Andrew Morton

Thanks for clarifying that Linus.

Regards,
David Holmes

Linus Torvalds said the following on  5/12/07 04:06 PM:
> 
> On Wed, 5 Dec 2007, David Holmes - Sun Microsystems wrote:
>> While this was observed with process control signals, my concern was that
>> other signals might cause pthread_cond_timedwait to return immediately in the
>> same way. The test program allows for SIGUSR1 and SIGRTMIN testing as well,
>> but these other signals did not cause the immediate return. But it would seem
>> from Steven's analysis that this is just a fortuitous result. If I understand
>> things correctly, any interruption of pthread_cond_timedwait by a signal,
>> could result in waiting until an arbitrary time - depending on how the stack
>> value was corrupted. Is that correct?
> 
> No, very few things can actually cause the restart_block path to be taken. 
> An actual signal execution would turn that into an EINTR, the only case 
> that should ever trigger this is a signal that causes some kernel action 
> (ie the system call *is* interrupted), but does not actually result in any 
> user-visible state changes.
> 
> The classic case is ^Z + bg, but iirc you can trigger it with ptrace too. 
> And I think two threads racing to pick up the same signal can cause it 
> too, for that matter (ie one thread takes the signal, the other one got 
> interrupted but there's nothing there, so it just causes a system call 
> restart).
> 
> There's basically two different system call restart mechanisms in the 
> kernel:
> 
>  - returning -ERESTARTNOHAND will cause the system call to be restarted 
>    with the *original* arguments if no signal handler was actually 
>    invoked. This has been around for a long time, and is used by a lot of 
>    system calls. It's fine for things that are idempotent, ie the argument 
>    meaning doesn't change over time (things like a "read()" system call, 
>    for example)
> 
>  - the "restart_block" model that returns -ERESTARTBLOCK, which will cause 
>    the system call to be restarted with the arguments specified in the 
>    system call restart block. This is for system calls that are *not* 
>    idempotent, ie the argument might be a relative timeout or something 
>    like that, where we need to actually behave *differently* when 
>    restarting.
> 
> The latter case is "new" (it's been around for a while, but relative to 
> the ERESTARTNOHAND one), and it relies on the system call itself setting 
> up its restart point and the argument save area. And each such system call 
> can obviously screw it up by saving/restoring the arguments with the 
> incorrect semantics.
> 
> So this bug was really (a) specific to that particular futex restart 
> mechanism, and (b) only triggers for the (rather unusual) case where the 
> system call gets interrupted by a signal, but no signal handler actually 
> happens. In practice, ^Z is the most common case by far (other signals are 
> either ignored and don't even cause an interrupt event in the first place, 
> or they are "real" signals, and cause a signal handler to be invoked).
> 
> 			Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -v2] fix for futex_wait signal stack corruption
  2007-12-05  3:41             ` Linus Torvalds
  2007-12-05  3:47               ` Randy Dunlap
  2007-12-05  5:33               ` David Holmes - Sun Microsystems
@ 2007-12-05  5:54               ` Thomas Gleixner
  2007-12-05 10:31                 ` Ingo Molnar
  2 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2007-12-05  5:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Ingo Molnar, LKML,
	David Holmes - Sun Microsystems, Andrew Morton, Stable Team

On Tue, 4 Dec 2007, Linus Torvalds wrote:
> 
> Patch looks fine to me.
> 
> On Tue, 4 Dec 2007, Steven Rostedt wrote:
> > 
> > Note: I'm a bit nervious to add "linux/types.h" and use u32 and u64
> > in thread_info.h, when there's a #ifdef __KERNEL__ just below that.
> > Not sure what that is there for.
> 
> Hmm. I'd not expect user-mode headers to ever include 
> <linux/thread-info.h>, and if they do, they'd already get get totally 
> invalid namespace pollution ("struct restart_block" at a minimum) along 
> with stuff that simply isn't sensible in user-space at all, so I think 
> this part is fine.
> 
> And I guess somebody will scream if it bites them ;)
> 
> Anyway, my gut feel is that this is potentially a real problem, and we 
> should fix it asap (ie it should go into 2.6.24 even at this late stage in 
> the game), but it would be nice to know if the problem actually hit any 
> actual real program, and not just a test-setup.
> 
> So here's a question for David Holmes:  What caused you to actually notice 
> this behaviour? Can this actually be seen in real life usage?
> 
> Anyway, at a minimum, here's an
> 
> 	Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
> 
> and I suspect I should just apply it directly. Any comments from anybody 
> else?

Doh, yes. I completely missed that stack dependency of the pointer
when I looked at the patch back then. The solution looks solid and
probably we should get rid of the unnamed union member and fixup the
other places which use restart_block in a similar way.

Just a minor nit. Can we please use "futex" instead of "fu" ? I'm just
envisioning the next union member named "ba".

Acked-by: Thomas Gleixner <tglx@linutronix.de>

Please apply with the s/fu/futex/ change. This needs to go into stable
.22/.23 as well.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -v2] fix for futex_wait signal stack corruption
  2007-12-05  5:54               ` Thomas Gleixner
@ 2007-12-05 10:31                 ` Ingo Molnar
  0 siblings, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2007-12-05 10:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Steven Rostedt, LKML,
	David Holmes - Sun Microsystems, Andrew Morton, Stable Team


* Thomas Gleixner <tglx@linutronix.de> wrote:

> Just a minor nit. Can we please use "futex" instead of "fu" ? I'm just 
> envisioning the next union member named "ba".
> 
> Acked-by: Thomas Gleixner <tglx@linutronix.de>
> 
> Please apply with the s/fu/futex/ change. This needs to go into stable 
> .22/.23 as well.

ok, i've tied it all up, collected the Acks and will push Steve's fix to 
Linus via the scheduler git tree - is that fine with everyone? I think 
there's enough time to get this tested - Linus's latest tree shows up in 
Fedora rawhide in 1-2 days and tons of apps use futexes.

	Ingo

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2007-12-05 10:31 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-12-04 20:57 [PATCH] fix for futex_wait signal stack corruption Steven Rostedt
2007-12-04 21:09 ` Linus Torvalds
2007-12-04 21:39   ` Steven Rostedt
2007-12-04 22:43     ` Linus Torvalds
2007-12-05  1:23       ` Steven Rostedt
2007-12-05  1:46         ` Linus Torvalds
2007-12-05  3:17           ` [PATCH -v2] " Steven Rostedt
2007-12-05  3:41             ` Linus Torvalds
2007-12-05  3:47               ` Randy Dunlap
2007-12-05  3:53                 ` Steven Rostedt
2007-12-05  5:33               ` David Holmes - Sun Microsystems
2007-12-05  6:06                 ` Linus Torvalds
2007-12-05  6:14                   ` David Holmes - Sun Microsystems
2007-12-05  5:54               ` Thomas Gleixner
2007-12-05 10:31                 ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox