[BUG] Fault handlers can deadlock

linux-arch.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [BUG] Fault handlers can deadlock
@ 2007-06-11 16:00 Russell King
  2007-06-11 16:41 ` Andrew Morton
  2007-06-11 17:39 ` Linus Torvalds
  0 siblings, 2 replies; 8+ messages in thread
From: Russell King @ 2007-06-11 16:00 UTC (permalink / raw)
  To: linux-arch, Linus Torvalds, Andrew Morton

Recently, a bug has been discovered on ARM whereby if futexes are
being used and the system is put under heavy load, a deadlock will
occur.

The deadlock involves mmap_sem having been taken by the futex code
and a page fault occuring in copy_from_user_inatomic().  We then
hit this:

        /*
         * As per x86, we may deadlock here.  However, since the kernel only
         * validly references user space from well defined areas of the code,
         * we can bug out early if this is from code which shouldn't.
         */
        if (!down_read_trylock(&mm->mmap_sem)) {
                if (!user_mode(regs) && !search_exception_tables(regs->ARM_pc))
                        goto no_context;
                down_read(&mm->mmap_sem);
        }

which is more or less identical across many architectures.  The
problem is that:

1. down_read_trylock() finds contention, so it returns 0.
2. we've been called from kernel mode
3. search_exception_tables() finds an entry in the exception
   tables (see note about "well defined areas of the code") and
   returns *non*-NULL
4. therefore, the second if statement is _false_.
5. we then deadlock on down_read(&mm->mmap_sem).

Note that search_exception_tables() is generic code, and I believe rwsems
are also generic code today.

For reference, he's the x86 version, which will perform in the same
way as the ARM version quoted above:

        /* When running in the kernel we expect faults to occur only to
         * addresses in user space.  All other faults represent errors in the
         * kernel and should generate an OOPS.  Unfortunatly, in the case of an
         * erroneous fault occurring in a code path which already holds mmap_sem
         * we will deadlock attempting to validate the fault against the
         * address space.  Luckily the kernel only validly references user
         * space from well defined areas of code, which are listed in the
         * exceptions table.
         *
         * As the vast majority of faults will be valid we will only perform
         * the source reference check when there is a possibilty of a deadlock.
         * Attempt to lock the address space, if we cannot we then validate the
         * source.  If this is invalid we can skip the address space check,
         * thus avoiding the deadlock.
         */
        if (!down_read_trylock(&mm->mmap_sem)) {
                if ((error_code & 4) == 0 &&
                    !search_exception_tables(regs->eip))
                        goto bad_area_nosemaphore;
                down_read(&mm->mmap_sem);
        }

Clearly, if we are expecting to deal with faults occuring from these "well
defined areas of code, which are listed in the exceptions table" the
sense if the test is wrong.

Therefore, I think the following patch is appropriate (and similar patches
are required for other architectures):

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 75d4914..e41efd9 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -238,7 +238,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	 * we can bug out early if this is from code which shouldn't.
 	 */
 	if (!down_read_trylock(&mm->mmap_sem)) {
-		if (!user_mode(regs) && !search_exception_tables(regs->ARM_pc))
+		if (!user_mode(regs) && search_exception_tables(regs->ARM_pc))
 			goto no_context;
 		down_read(&mm->mmap_sem);
 	}

Comments?

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [BUG] Fault handlers can deadlock
  2007-06-11 16:00 [BUG] Fault handlers can deadlock Russell King
@ 2007-06-11 16:41 ` Andrew Morton
  2007-06-11 16:53   ` Russell King
  2007-06-11 17:39 ` Linus Torvalds
  1 sibling, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2007-06-11 16:41 UTC (permalink / raw)
  To: Russell King; +Cc: linux-arch, Linus Torvalds

On Mon, 11 Jun 2007 17:00:27 +0100 Russell King <rmk@arm.linux.org.uk> wrote:

> Recently, a bug has been discovered on ARM whereby if futexes are
> being used and the system is put under heavy load, a deadlock will
> occur.
> 
> The deadlock involves mmap_sem having been taken by the futex code
> and a page fault occuring in copy_from_user_inatomic().  We then
> hit this:
> 
>         /*
>          * As per x86, we may deadlock here.  However, since the kernel only
>          * validly references user space from well defined areas of the code,
>          * we can bug out early if this is from code which shouldn't.
>          */
>         if (!down_read_trylock(&mm->mmap_sem)) {
>                 if (!user_mode(regs) && !search_exception_tables(regs->ARM_pc))
>                         goto no_context;
>                 down_read(&mm->mmap_sem);
>         }

We shouldn't get that far, if the caller is copy_from_user_inatomic():

	/*
	 * If we're in an interrupt or have no user
	 * context, we must not take the fault..
	 */
	if (in_atomic() || !mm)
**taken		goto no_context;

	/*
	 * As per x86, we may deadlock here.  However, since the kernel only
	 * validly references user space from well defined areas of the code,
	 * we can bug out early if this is from code which shouldn't.
	 */
	if (!down_read_trylock(&mm->mmap_sem)) {
		if (!user_mode(regs) && !search_exception_tables(regs->ARM_pc))
			goto no_context;
		down_read(&mm->mmap_sem);
	}


I assume this is the callsite:

static inline int get_futex_value_locked(u32 *dest, u32 __user *from)
{
	int ret;

	pagefault_disable();
	ret = __copy_from_user_inatomic(dest, from, sizeof(u32));
	pagefault_enable();

	return ret ? -EFAULT : 0;
}

it seems to be doing the right thing there, but for some reason it isn't
working as designed?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BUG] Fault handlers can deadlock
  2007-06-11 16:41 ` Andrew Morton
@ 2007-06-11 16:53   ` Russell King
  2007-06-11 17:31     ` Andrew Morton
  2007-06-11 17:51     ` Linus Torvalds
  0 siblings, 2 replies; 8+ messages in thread
From: Russell King @ 2007-06-11 16:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-arch, Linus Torvalds

On Mon, Jun 11, 2007 at 09:41:16AM -0700, Andrew Morton wrote:
> On Mon, 11 Jun 2007 17:00:27 +0100 Russell King <rmk@arm.linux.org.uk> wrote:
> 
> > Recently, a bug has been discovered on ARM whereby if futexes are
> > being used and the system is put under heavy load, a deadlock will
> > occur.
> > 
> > The deadlock involves mmap_sem having been taken by the futex code
> > and a page fault occuring in copy_from_user_inatomic().  We then
> > hit this:
> > 
> >         /*
> >          * As per x86, we may deadlock here.  However, since the kernel only
> >          * validly references user space from well defined areas of the code,
> >          * we can bug out early if this is from code which shouldn't.
> >          */
> >         if (!down_read_trylock(&mm->mmap_sem)) {
> >                 if (!user_mode(regs) && !search_exception_tables(regs->ARM_pc))
> >                         goto no_context;
> >                 down_read(&mm->mmap_sem);
> >         }
> 
> We shouldn't get that far, if the caller is copy_from_user_inatomic():
> 
> 	/*
> 	 * If we're in an interrupt or have no user
> 	 * context, we must not take the fault..
> 	 */
> 	if (in_atomic() || !mm)
> **taken		goto no_context;
> 
> 	/*
> 	 * As per x86, we may deadlock here.  However, since the kernel only
> 	 * validly references user space from well defined areas of the code,
> 	 * we can bug out early if this is from code which shouldn't.
> 	 */
> 	if (!down_read_trylock(&mm->mmap_sem)) {
> 		if (!user_mode(regs) && !search_exception_tables(regs->ARM_pc))
> 			goto no_context;
> 		down_read(&mm->mmap_sem);
> 	}
> 
> 
> I assume this is the callsite:
> 
> static inline int get_futex_value_locked(u32 *dest, u32 __user *from)
> {
> 	int ret;
> 
> 	pagefault_disable();
> 	ret = __copy_from_user_inatomic(dest, from, sizeof(u32));
> 	pagefault_enable();
> 
> 	return ret ? -EFAULT : 0;
> }
> 
> it seems to be doing the right thing there, but for some reason it isn't
> working as designed?

It's actually an older kernel, with a suggested fix (which the reporter
is refusing to apply because he thinks it'll still deadlock.)  I think
the only real answer is for the reporter to upgrade their kernel.

However, I still question whether those down_read_trylock games are
correct - they certainly do not agreement the comments directly above
which clearly indicates that the situation the code is trying to avoid
is one where the fault occurs from a location in the exception table
with mmap_sem held.

However (again) the comments in the commit (which can be viewed on
bkbits) indicates that the comment is probably wrong and the code is
correct; it's trying to early-detect conditions which will lead to an
oops.

So maybe it's a comment bug and the code is correct?

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BUG] Fault handlers can deadlock
  2007-06-11 16:53   ` Russell King
@ 2007-06-11 17:31     ` Andrew Morton
  2007-06-11 17:38       ` Russell King
  2007-06-11 17:51     ` Linus Torvalds
  1 sibling, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2007-06-11 17:31 UTC (permalink / raw)
  To: Russell King; +Cc: linux-arch, Linus Torvalds

On Mon, 11 Jun 2007 17:53:22 +0100
Russell King <rmk@arm.linux.org.uk> wrote:

> On Mon, Jun 11, 2007 at 09:41:16AM -0700, Andrew Morton wrote:
> > On Mon, 11 Jun 2007 17:00:27 +0100 Russell King <rmk@arm.linux.org.uk> wrote:
> > 
> > > Recently, a bug has been discovered on ARM whereby if futexes are
> > > being used and the system is put under heavy load, a deadlock will
> > > occur.
> > > 
> > > The deadlock involves mmap_sem having been taken by the futex code
> > > and a page fault occuring in copy_from_user_inatomic().  We then
> > > hit this:
> > > 
> > >         /*
> > >          * As per x86, we may deadlock here.  However, since the kernel only
> > >          * validly references user space from well defined areas of the code,
> > >          * we can bug out early if this is from code which shouldn't.
> > >          */
> > >         if (!down_read_trylock(&mm->mmap_sem)) {
> > >                 if (!user_mode(regs) && !search_exception_tables(regs->ARM_pc))
> > >                         goto no_context;
> > >                 down_read(&mm->mmap_sem);
> > >         }
> > 
> > We shouldn't get that far, if the caller is copy_from_user_inatomic():
> > 
> > 	/*
> > 	 * If we're in an interrupt or have no user
> > 	 * context, we must not take the fault..
> > 	 */
> > 	if (in_atomic() || !mm)
> > **taken		goto no_context;
> > 
> > 	/*
> > 	 * As per x86, we may deadlock here.  However, since the kernel only
> > 	 * validly references user space from well defined areas of the code,
> > 	 * we can bug out early if this is from code which shouldn't.
> > 	 */
> > 	if (!down_read_trylock(&mm->mmap_sem)) {
> > 		if (!user_mode(regs) && !search_exception_tables(regs->ARM_pc))
> > 			goto no_context;
> > 		down_read(&mm->mmap_sem);
> > 	}
> > 
> > 
> > I assume this is the callsite:
> > 
> > static inline int get_futex_value_locked(u32 *dest, u32 __user *from)
> > {
> > 	int ret;
> > 
> > 	pagefault_disable();
> > 	ret = __copy_from_user_inatomic(dest, from, sizeof(u32));
> > 	pagefault_enable();
> > 
> > 	return ret ? -EFAULT : 0;
> > }
> > 
> > it seems to be doing the right thing there, but for some reason it isn't
> > working as designed?
> 
> It's actually an older kernel, with a suggested fix (which the reporter
> is refusing to apply because he thinks it'll still deadlock.)  I think
> the only real answer is for the reporter to upgrade their kernel.
> 
> However, I still question whether those down_read_trylock games are
> correct - they certainly do not agreement the comments directly above
> which clearly indicates that the situation the code is trying to avoid
> is one where the fault occurs from a location in the exception table
> with mmap_sem held.
> 
> However (again) the comments in the commit (which can be viewed on
> bkbits) indicates that the comment is probably wrong and the code is
> correct; it's trying to early-detect conditions which will lead to an
> oops.
> 
> So maybe it's a comment bug and the code is correct?

The code is there to improve debuggability and for no other reason.  It is
to fix the situation where the kernel references an invalida address while
holding down_write(mmap_sem).  We used to take the fault and then deadlock
on do_page_fault()'s down_read().  Now, we do the trylock and if that fails
we chech that a) it was running kernel code and b) the kernel didn't expect
a fault to occur at that EIP.  If both of those are true, go off and report
the oops rather than deadlocking.

And I don't think I spot anything in either the ARM or i386 comments which
contradicts that? 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BUG] Fault handlers can deadlock
  2007-06-11 17:31     ` Andrew Morton
@ 2007-06-11 17:38       ` Russell King
  0 siblings, 0 replies; 8+ messages in thread
From: Russell King @ 2007-06-11 17:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-arch, Linus Torvalds

On Mon, Jun 11, 2007 at 10:31:42AM -0700, Andrew Morton wrote:
> The code is there to improve debuggability and for no other reason.  It is
> to fix the situation where the kernel references an invalida address while
> holding down_write(mmap_sem).  We used to take the fault and then deadlock
> on do_page_fault()'s down_read().  Now, we do the trylock and if that fails
> we chech that a) it was running kernel code and b) the kernel didn't expect
> a fault to occur at that EIP.  If both of those are true, go off and report
> the oops rather than deadlocking.
> 
> And I don't think I spot anything in either the ARM or i386 comments which
> contradicts that? 

No, I think ARM is right and there isn't a problem (apart from
ambiguous comments), provided people use a recent enough kernel.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BUG] Fault handlers can deadlock
  2007-06-11 16:00 [BUG] Fault handlers can deadlock Russell King
  2007-06-11 16:41 ` Andrew Morton
@ 2007-06-11 17:39 ` Linus Torvalds
  2007-06-11 18:08   ` Russell King
  1 sibling, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2007-06-11 17:39 UTC (permalink / raw)
  To: Russell King; +Cc: linux-arch, Andrew Morton



On Mon, 11 Jun 2007, Russell King wrote:
> 
> The deadlock involves mmap_sem having been taken by the futex code
> and a page fault occuring in copy_from_user_inatomic().  We then
> hit this:

No, we don't. If you hit that code, you're buggy.

Look at what x86 has just before that sequence:

        /*
         * If we're in an interrupt, have no user context or are running in an
         * atomic region then we must not take the fault..
         */
        if (in_atomic() || !mm)
                goto bad_area_nosemaphore;

and the whole *point* of "copy_from_user_inatomic()" is that it should 
trigger the "in_atomic()" check, and we will never even get to the code 
you point at:

>         /*
>          * As per x86, we may deadlock here.  However, since the kernel only
>          * validly references user space from well defined areas of the code,
>          * we can bug out early if this is from code which shouldn't.
>          */
>         if (!down_read_trylock(&mm->mmap_sem)) {

And as far as I can see, arm has the same "in_acomit()" checks too.

Maybe the arm implementation of in_atomic() or copy_from_user_inatomic() 
is broken some way?

		Linus

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BUG] Fault handlers can deadlock
  2007-06-11 16:53   ` Russell King
  2007-06-11 17:31     ` Andrew Morton
@ 2007-06-11 17:51     ` Linus Torvalds
  1 sibling, 0 replies; 8+ messages in thread
From: Linus Torvalds @ 2007-06-11 17:51 UTC (permalink / raw)
  To: Russell King; +Cc: Andrew Morton, linux-arch

On Mon, 11 Jun 2007, Russell King wrote:
> 
> However, I still question whether those down_read_trylock games are
> correct

That is certainly a valid question. It may well be that we should *always* 
do the "search_exception_tables()" for kernel accesses. As it is, we 
aren't doing all the validity checks that we *could* do.

However, the deadlock is real, and concerns kernel *bugs*:

 - if the kernel takes a (wild) page fault while holding the mmap_sem 
   semaphore, we at one point just deadlocked instead of printign a nice 
   oops message. Some people (including me) spent way too much time 
   debugging silent and very-hard-to-debug lockups that should have just 
   caused a nice oops and made the bug obvious.

However, I agree that the check itself conceptually should be done 
regardless of whether the mmap_sem is taken or not.

One solution migt be to just move the whole

	if (!user_mode(regs) && !search_exception_tables(pc))
		goto no_context;

to _outside_ the mmap_sem check, and change the code to basically be

        /*
         * If we're in an interrupt or have no user
         * context, we must not take the fault..
         */
        if (in_atomic() || !mm)
                goto no_context;

	/*
	 * If we're in kernel mode, and don't have the access
	 * listed in the exception tables, we must not take
	 * the fault...
	 */
	if (!user_mode(regs) && !search_exception_tables(pc))
		goto no_context;

	/* Do the fault */
	down_read(&mm->mmap_sem);
	fault = __do_page_fault(mm, addr, flags, tsk);
	up_read(&mm->mmap_sem);

which looks cleaner, I agree.

However, one issue that I don't know was ever part of the reason is that 
the exception table search may be expensive, and maybe skipping it when 
getting the mmap_sem is easy is the right thing for a _performance_ 
reason. I doubt it's a real issue (we should generally not be taking 
a lot of faults from kernel mode!), but it's a thing to keep in mind.

Comments?

		Linus

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BUG] Fault handlers can deadlock
  2007-06-11 17:39 ` Linus Torvalds
@ 2007-06-11 18:08   ` Russell King
  0 siblings, 0 replies; 8+ messages in thread
From: Russell King @ 2007-06-11 18:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-arch, Andrew Morton

On Mon, Jun 11, 2007 at 10:39:49AM -0700, Linus Torvalds wrote:
> 
> 
> On Mon, 11 Jun 2007, Russell King wrote:
> > 
> > The deadlock involves mmap_sem having been taken by the futex code
> > and a page fault occuring in copy_from_user_inatomic().  We then
> > hit this:
> 
> No, we don't. If you hit that code, you're buggy.
> 
> Look at what x86 has just before that sequence:
> 
>         /*
>          * If we're in an interrupt, have no user context or are running in an
>          * atomic region then we must not take the fault..
>          */
>         if (in_atomic() || !mm)
>                 goto bad_area_nosemaphore;
> 
> and the whole *point* of "copy_from_user_inatomic()" is that it should 
> trigger the "in_atomic()" check, and we will never even get to the code 
> you point at:
> 
> >         /*
> >          * As per x86, we may deadlock here.  However, since the kernel only
> >          * validly references user space from well defined areas of the code,
> >          * we can bug out early if this is from code which shouldn't.
> >          */
> >         if (!down_read_trylock(&mm->mmap_sem)) {
> 
> And as far as I can see, arm has the same "in_acomit()" checks too.
> 
> Maybe the arm implementation of in_atomic() or copy_from_user_inatomic() 
> is broken some way?

Or the guy's using a very old kernel (2.6.12) and I'm pointing him at
the wrong fix (which I was.)

However, as I said to Andrew (which didn't make it to the other recipients
due to my mailer refusing to do so - yes, group reply only sent my followup
to Andrew) I think the comments could do with some improvement.

It looks like whoever did the s/in_interrupt/in_atomic/ change added
additional commentry to x86 but omitted it from ARM, leading to the
comments being misleading.

So, (as I also said to Andrew) I'm going to apply this patch to fix the
comments and thereby remove the ambiguities.

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 75d4914..368c437 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -226,8 +226,9 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	mm  = tsk->mm;
 
 	/*
-	 * If we're in an interrupt or have no user
-	 * context, we must not take the fault..
+	 * If we're in an atomic context (eg, interrupt, pagefault_disable,
+	 * etc) or have no user context, we must not take the fault...
+	 * We still search the exception tables.
 	 */
 	if (in_atomic() || !mm)
 		goto no_context;
@@ -235,7 +236,8 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	/*
 	 * As per x86, we may deadlock here.  However, since the kernel only
 	 * validly references user space from well defined areas of the code,
-	 * we can bug out early if this is from code which shouldn't.
+	 * we can avoid taking the semaphore entirely if the fault will lead
+	 * to an OOPS.
 	 */
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		if (!user_mode(regs) && !search_exception_tables(regs->ARM_pc))


-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply related	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2007-06-11 18:08 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-06-11 16:00 [BUG] Fault handlers can deadlock Russell King
2007-06-11 16:41 ` Andrew Morton
2007-06-11 16:53   ` Russell King
2007-06-11 17:31     ` Andrew Morton
2007-06-11 17:38       ` Russell King
2007-06-11 17:51     ` Linus Torvalds
2007-06-11 17:39 ` Linus Torvalds
2007-06-11 18:08   ` Russell King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).