[PATCH/RFC] Futex mmap

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH/RFC] Futex mmap_sem deadlock
@ 2005-02-22 19:06 Olof Johansson
  2005-02-22 19:36 ` Linus Torvalds
  2005-02-22 19:55 ` Andrew Morton
  0 siblings, 2 replies; 38+ messages in thread
From: Olof Johansson @ 2005-02-22 19:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, torvalds, jamie, rusty

Hi,

Consider a small testcase that spawns off two threads, either thread
doing a loop of:

	buf = mmap /dev/zero MAP_SHARED for 0x100000 bytes
	call sys_futex (buf+page, FUTEX_WAIT, 1, NULL, NULL) for each page in said mmap
	munmap(buf)
	repeat

This will quickly lock up, since the futex_wait code dows a
down_read(mmap_sem), then a get_user().

The do_page_fault code on ppc64 (as well as other architectures) needs
to take the same semaphore for reading. This is all good until the
second thread comes into play: Its mmap call tries to take the same
semaphore for writing which causes in the do_page_fault down_read()
to get stuck. Classic deadlock.

Stacks are as follows:

a.out         S 000000000ff9772c 12288  6491   6484  6492               (NOTLB)
Call Trace:
[c0000003eb72fa10] [0000000030ecbd28] 0x30ecbd28 (unreliable)
[c0000003eb72fbe0] [c0000000000119ec] .__switch_to+0xa4/0xf0
[c0000003eb72fc70] [c00000000043ffbc] .schedule+0x4d4/0xe5c
[c0000003eb72fd90] [c0000000000247c8] .sys32_rt_sigsuspend+0xe0/0x120
[c0000003eb72fe30] [c00000000000d7e8] .ppc32_rt_sigsuspend+0x8/0x14
a.out         D 000000000ff0ad38 13632  6492   6491  6493               (NOTLB)
Call Trace:
[c00000003a1a7830] [c00000003a1a7850] 0xc00000003a1a7850 (unreliable)
[c00000003a1a7a00] [c0000000000119ec] .__switch_to+0xa4/0xf0
[c00000003a1a7a90] [c00000000043ffbc] .schedule+0x4d4/0xe5c
[c00000003a1a7bb0] [c000000000441ec0] .rwsem_down_write_failed+0x138/0x2f0
[c00000003a1a7c90] [c000000000098db0] .sys_mprotect+0x7bc/0x81c
[c00000003a1a7e30] [c00000000000d600] syscall_exit+0x0/0x18
a.out         D 0000000010000654 12832  6493   6492                     (NOTLB)
Call Trace:
[c00000000f2674f0] [c0000000000119ec] .__switch_to+0xa4/0xf0
[c00000000f267580] [c00000000043ffbc] .schedule+0x4d4/0xe5c
[c00000000f2676a0] [c0000000004421b4] .rwsem_down_read_failed+0x13c/0x2f4
[c00000000f267780] [c00000000003ba6c] .do_page_fault+0x66c/0x738
[c00000000f267900] [c00000000000b2dc] .handle_page_fault+0x20/0x54
--- Exception: 301 at .do_futex+0x5d4/0x720
    LR = .do_futex+0x1e0/0x720
[c00000000f267d60] [c000000000078124] .compat_sys_futex+0xa4/0x16c
[c00000000f267e30] [c00000000000d600] syscall_exit+0x0/0x18


One attempt to fix this is included below. It works, but I'm not entirely
happy with the fact that it's a bit messy solution. If anyone has a
better idea for how to solve it I'd be all ears.

Auditing other read-takers of mmap_sem, I found one more exposure that
could be solved by just moving code around.

-----

Some futex functions do get_user calls while holding mmap_sem for
reading. If get_user() faults, and another thread happens to be in mmap
(or somewhere else holding waiting on down_write for the same semaphore),
then do_page_fault will deadlock. Most architectures seem to be exposed
to this.

To avoid it, make sure the page is available. If not, release the
semaphore, fault it in and retake it.

I also found another exposure by inspection, moving some of the code
around avoids the possible deadlock there.

Signed-off-by: Olof Johansson <olof@austin.ibm.com>


Index: linux-2.5/kernel/futex.c
===================================================================
--- linux-2.5.orig/kernel/futex.c	2005-02-21 16:09:38.000000000 -0600
+++ linux-2.5/kernel/futex.c	2005-02-22 12:35:01.000000000 -0600
@@ -326,11 +326,22 @@
 	struct futex_hash_bucket *bh1, *bh2;
 	struct list_head *head1;
 	struct futex_q *this, *next;
-	int ret, drop_count = 0;
+	int ret, curval, drop_count = 0;
 	unsigned int nqueued;
 
 	down_read(&current->mm->mmap_sem);
 
+	/* Make sure we can access the page, since there's a risk to
+	 * deadlock with do_page_fault() if get_user() faults.
+	 */
+	while (!check_user_page_readable(current->mm, uaddr1)) {
+		up_read(&current->mm->mmap_sem);
+		/* Fault in the page through get_user() but discard result */
+		if (get_user(curval, (int __user *)uaddr1) != 0)
+			return -EFAULT;
+		down_read(&current->mm->mmap_sem);
+	}
+
 	ret = get_futex_key(uaddr1, &key1);
 	if (unlikely(ret != 0))
 		goto out;
@@ -343,8 +354,6 @@
 
 	nqueued = bh1->nqueued;
 	if (likely(valp != NULL)) {
-		int curval;
-
 		/* In order to avoid doing get_user while
 		   holding bh1->lock and bh2->lock, nqueued
 		   (monotonically increasing field) must be first
@@ -482,6 +491,19 @@
 
 	down_read(&current->mm->mmap_sem);
 
+	/* Make sure we can access the page, since there's a risk to
+	 * deadlock with do_page_fault() if get_user() faults.
+	 */
+	while (!check_user_page_readable(current->mm, uaddr)) {
+		up_read(&current->mm->mmap_sem);
+
+		/* Fault in the page through get_user() but discard result */
+		if (get_user(curval, (int __user *)uaddr) != 0)
+			return -EFAULT;
+
+		down_read(&current->mm->mmap_sem);
+	}
+
 	ret = get_futex_key(uaddr, &q.key);
 	if (unlikely(ret != 0))
 		goto out_release_sem;
Index: linux-2.5/mm/mempolicy.c
===================================================================
--- linux-2.5.orig/mm/mempolicy.c	2005-02-04 00:27:40.000000000 -0600
+++ linux-2.5/mm/mempolicy.c	2005-02-21 16:43:08.000000000 -0600
@@ -486,6 +486,7 @@
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma = NULL;
 	struct mempolicy *pol = current->mempolicy;
+	DECLARE_BITMAP(nodes, MAX_NUMNODES);
 
 	if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
 		return -EINVAL;
@@ -524,16 +525,21 @@
 	} else
 		pval = pol->policy;
 
+	if (nmask)
+		get_zonemask(pol, nodes);
+
+	if (vma) {
+		up_read(&current->mm->mmap_sem);
+		vma = NULL;
+	}
+
 	err = -EFAULT;
 	if (policy && put_user(pval, policy))
 		goto out;
 
 	err = 0;
-	if (nmask) {
-		DECLARE_BITMAP(nodes, MAX_NUMNODES);
-		get_zonemask(pol, nodes);
+	if (nmask)
 		err = copy_nodes_to_user(nmask, maxnode, nodes, sizeof(nodes));
-	}
 
  out:
 	if (vma)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 19:06 [PATCH/RFC] Futex mmap_sem deadlock Olof Johansson
@ 2005-02-22 19:36 ` Linus Torvalds
  2005-02-22 21:16   ` Benjamin Herrenschmidt
  2005-02-23 11:24   ` David Howells
  2005-02-22 19:55 ` Andrew Morton
  1 sibling, 2 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-02-22 19:36 UTC (permalink / raw)
  To: Olof Johansson
  Cc: Kernel Mailing List, Andrew Morton, jamie, rusty, David Howells

On Tue, 22 Feb 2005, Olof Johansson wrote:
> 
> Consider a small testcase that spawns off two threads, either thread
> doing a loop of:
> 
> 	buf = mmap /dev/zero MAP_SHARED for 0x100000 bytes
> 	call sys_futex (buf+page, FUTEX_WAIT, 1, NULL, NULL) for each page in said mmap
> 	munmap(buf)
> 	repeat
> 
> This will quickly lock up, since the futex_wait code dows a
> down_read(mmap_sem), then a get_user().
> 
> The do_page_fault code on ppc64 (as well as other architectures) needs
> to take the same semaphore for reading. This is all good until the
> second thread comes into play: Its mmap call tries to take the same
> semaphore for writing which causes in the do_page_fault down_read()
> to get stuck. Classic deadlock.

It shouldn't be. If one read writer is active, another should be able to 
come in, regardless of any pending writer trying to access it. At least 
that's always been the rule for the rw-spinlocks _exactly_ for this 
reaseon.

The rwsem code tries to be fairer, maybe it has broken this case in the 
name of fairness. I personally think fairness is overrated, and would 
rather have the rwsem implementation let readers come in. But if we really 
don't want that, then:

> One attempt to fix this is included below. It works, but I'm not entirely
> happy with the fact that it's a bit messy solution. If anyone has a
> better idea for how to solve it I'd be all ears.

We have a notion of "atomic get_user()" calls, which is a lot more 
efficient, and is already used for other cases where we have _real_
deadlocks (inode semaphore for reads into shared memory mappigns).

See "__copy_to_user_inatomic()", and in particular note that it
effectively _is_ legal to do user accesses with the preempt-count elevated 
or interrupts disabled to tell the page fault handler not to delay (then 
you have to check the error return, and if it returned an error you do it 
the slow way - drop all locks and do a non-atomic access).

> Auditing other read-takers of mmap_sem, I found one more exposure that
> could be solved by just moving code around.

DavidH - what's the word on nested read-semaphores like this? Are they 
supposed to work (like nested read-spinlocks), or do we need to do the 
things Olof does?

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 19:36 ` Linus Torvalds
@ 2005-02-22 21:16   ` Benjamin Herrenschmidt
  2005-02-22 21:19     ` Benjamin Herrenschmidt
  2005-02-22 21:31     ` Linus Torvalds
  2005-02-23 11:24   ` David Howells
  1 sibling, 2 replies; 38+ messages in thread
From: Benjamin Herrenschmidt @ 2005-02-22 21:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Olof Johansson, Kernel Mailing List, Andrew Morton, jamie,
	Rusty Russell, David Howells

On Tue, 2005-02-22 at 11:36 -0800, Linus Torvalds wrote:

> DavidH - what's the word on nested read-semaphores like this? Are they 
> supposed to work (like nested read-spinlocks), or do we need to do the 
> things Olof does?

Isn't Olof scheme racy ? Can't the stuff get swapped out between the
first get_user() and the "real" one ?

Ben.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 21:16   ` Benjamin Herrenschmidt
@ 2005-02-22 21:19     ` Benjamin Herrenschmidt
  2005-02-22 21:31     ` Linus Torvalds
  1 sibling, 0 replies; 38+ messages in thread
From: Benjamin Herrenschmidt @ 2005-02-22 21:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Olof Johansson, Kernel Mailing List, Andrew Morton, jamie,
	Rusty Russell, David Howells

On Wed, 2005-02-23 at 08:16 +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2005-02-22 at 11:36 -0800, Linus Torvalds wrote:
> 
> > DavidH - what's the word on nested read-semaphores like this? Are they 
> > supposed to work (like nested read-spinlocks), or do we need to do the 
> > things Olof does?
> 
> Isn't Olof scheme racy ? Can't the stuff get swapped out between the
> first get_user() and the "real" one ?

Forget it, I missed the check_user_page_readable() guy within the
semaphore protection. I didn't know that function ;)

Ben.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 21:16   ` Benjamin Herrenschmidt
  2005-02-22 21:19     ` Benjamin Herrenschmidt
@ 2005-02-22 21:31     ` Linus Torvalds
  2005-02-22 21:42       ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-02-22 21:31 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Olof Johansson, Kernel Mailing List, Andrew Morton, jamie,
	Rusty Russell, David Howells



On Wed, 23 Feb 2005, Benjamin Herrenschmidt wrote:
> 
> Isn't Olof scheme racy ? Can't the stuff get swapped out between the
> first get_user() and the "real" one ?

Yes. But see my suggested modification (which I still think is "the thing 
that Olof does", except it's more efficient and avoids the race).

If rwsems acted like rwlocks, we wouldn't have this issue at all.

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 21:31     ` Linus Torvalds
@ 2005-02-22 21:42       ` Benjamin Herrenschmidt
  2005-02-22 22:10         ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Benjamin Herrenschmidt @ 2005-02-22 21:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Olof Johansson, Kernel Mailing List, Andrew Morton, jamie,
	Rusty Russell, David Howells

On Tue, 2005-02-22 at 13:31 -0800, Linus Torvalds wrote:
> 
> On Wed, 23 Feb 2005, Benjamin Herrenschmidt wrote:
> > 
> > Isn't Olof scheme racy ? Can't the stuff get swapped out between the
> > first get_user() and the "real" one ?
> 
> Yes. But see my suggested modification (which I still think is "the thing 
> that Olof does", except it's more efficient and avoids the race).
> 
> If rwsems acted like rwlocks, we wouldn't have this issue at all.

Yours is probably the most efficient too. Note sure what is best for
rwsems tho, there seem to be some interest preventing readers from
starving writers for ever, this has been debated endlessly iirc,
though I have no personal opinion there.

Ben.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 21:42       ` Benjamin Herrenschmidt
@ 2005-02-22 22:10         ` Linus Torvalds
  2005-02-22 22:24           ` Benjamin Herrenschmidt
  2005-02-22 23:08           ` Greg KH
  0 siblings, 2 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-02-22 22:10 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Olof Johansson, Kernel Mailing List, Andrew Morton, jamie,
	Rusty Russell, David Howells

On Wed, 23 Feb 2005, Benjamin Herrenschmidt wrote:
> 
> Yours is probably the most efficient too. Note sure what is best for
> rwsems tho, there seem to be some interest preventing readers from
> starving writers for ever, this has been debated endlessly iirc,
> though I have no personal opinion there.

Yes, the starvation issue is potentially real. And thinking about it,
we've even had that in real life, with /proc and lots of page faults. So I
guess that's a strong argument for the fairness thing.

Oh, well. The reason I hate the rwsem behaviour is exactly because it
results in this very subtle class of deadlocks. This one case is certainly
solvable several ways, but do we have other issues somewhere else? Things
like kobject might be ripe with things like this. The mm semaphore tends
to be pretty well-behaved - and I'm not sure the same is true of the
kobject one.

Normal recursive deadlocks are wonderful - most of them show up
immediately, so assuming you just have enough coverage, you're fine. This
fairness-related deadlock requires a race to happen.

Maybe it would be sufficient to have a debugging version of rwsems that
just notice recursion?

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 22:10         ` Linus Torvalds
@ 2005-02-22 22:24           ` Benjamin Herrenschmidt
  2005-02-22 23:08           ` Greg KH
  1 sibling, 0 replies; 38+ messages in thread
From: Benjamin Herrenschmidt @ 2005-02-22 22:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Olof Johansson, Kernel Mailing List, Andrew Morton, jamie,
	Rusty Russell, David Howells

On Tue, 2005-02-22 at 14:10 -0800, Linus Torvalds wrote:

> Oh, well. The reason I hate the rwsem behaviour is exactly because it
> results in this very subtle class of deadlocks. This one case is certainly
> solvable several ways, but do we have other issues somewhere else? Things
> like kobject might be ripe with things like this. The mm semaphore tends
> to be pretty well-behaved - and I'm not sure the same is true of the
> kobject one.

We could detect those tho. When the appropriate DEBUG option is set, by
storing a cpumask in the semaphore we could detect if it's already taken
on this cpu...

> Normal recursive deadlocks are wonderful - most of them show up
> immediately, so assuming you just have enough coverage, you're fine. This
> fairness-related deadlock requires a race to happen.

Unless you consider that taking the read semaphore twice on the same CPU
is always a bug, thus the above stuff would work for catching them at
least more often...
 
> Maybe it would be sufficient to have a debugging version of rwsems that
> just notice recursion?
> 
> 		Linus
-- 
Benjamin Herrenschmidt <benh@kernel.crashing.org>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 22:10         ` Linus Torvalds
  2005-02-22 22:24           ` Benjamin Herrenschmidt
@ 2005-02-22 23:08           ` Greg KH
  1 sibling, 0 replies; 38+ messages in thread
From: Greg KH @ 2005-02-22 23:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin Herrenschmidt, Olof Johansson, Kernel Mailing List,
	Andrew Morton, jamie, Rusty Russell, David Howells

On Tue, Feb 22, 2005 at 02:10:58PM -0800, Linus Torvalds wrote:
> 
> Oh, well. The reason I hate the rwsem behaviour is exactly because it
> results in this very subtle class of deadlocks. This one case is certainly
> solvable several ways, but do we have other issues somewhere else? Things
> like kobject might be ripe with things like this. The mm semaphore tends
> to be pretty well-behaved - and I'm not sure the same is true of the
> kobject one.

I'm trying to get rid of the kobject (actually the subsystem) rwsem
right now, so it should be gone completly within a few kernel versions.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 19:36 ` Linus Torvalds
  2005-02-22 21:16   ` Benjamin Herrenschmidt
@ 2005-02-23 11:24   ` David Howells
  1 sibling, 0 replies; 38+ messages in thread
From: David Howells @ 2005-02-23 11:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Olof Johansson, Kernel Mailing List, Andrew Morton, jamie, rusty

Linus Torvalds <torvalds@osdl.org> wrote:

> It shouldn't be. If one read writer is active, another should be able to 
> come in, regardless of any pending writer trying to access it. At least 
> that's always been the rule for the rw-spinlocks _exactly_ for this 
> reaseon.

But not with rw-semaphores. I've designed them to be as fair as I can possibly
make them. This means holding reads up if there's a write pending.

> The rwsem code tries to be fairer, maybe it has broken this case in the 
> name of fairness. I personally think fairness is overrated, and would 
> rather have the rwsem implementation let readers come in.

I've seen writer starvation happening due to a continuous flow of overlapping
reads... That's one of the reasons I reimplemented rwsems.

Also, fair rwsems are easier to provide an assembly optimised form for that
doesn't involve the "thundering-herd" approach. Obviously, with the spinlock
approach, you can implement any semantics you desire once you're holding the
spinlock. With the optimised form as implemented, you can't determine how many
active readers there are. This information isn't stored anywhere.

> We have a notion of "atomic get_user()" calls, which is a lot more
> efficient, and is already used for other cases where we have _real_
> deadlocks (inode semaphore for reads into shared memory mappigns).

That's okay, provided the data you're trying to access isn't lurking on a disk
somewhere.

> > Auditing other read-takers of mmap_sem, I found one more exposure that
> > could be solved by just moving code around.
> 
> DavidH - what's the word on nested read-semaphores like this? Are they 
> supposed to work (like nested read-spinlocks), or do we need to do the 
> things Olof does?

Nesting rwsems like this is definitely on the not-recommended list.

For the special case of the mmap semaphore, I'd advocate following something
like Jamie Lokier's solution, and note when a task holds its own mmap_sem
semaphore read-locked such that page-fault can avoid taking it again.

Something like:

	struct task_struct {
		...
		unsigned mmsem_nest;
		...
	};

	static inline void lock_mm_for_read(struct mm_struct *mm) {
		if (current->mm == mm &&
		    current->mmsem_nest++ != 0)
			;
		else
			down_read(&mm->mmap_sem);
	}

	static inline void unlock_mm_for_read(struct mm_struct *mm) {
		if (current->mm == mm &&
		    --current->mmsem_nest != 0)
			;
		else
			up_read(&mm->mmap_sem);
	}

	static inline void lock_mm_for_write(struct mm_struct *mm) {
		down_write(&mm->mmap_sem);
	}

	static inline void unlock_mm_for_write(struct mm_struct *mm) {
		up_write(&mm->mmap_sem);
	}

Though I'd be tempted to say that faulting is the only case in which recursion
is permitted, and change the first two functions to reflect this and add an
extra pair specially for page-fault:

	static inline void lock_mm_for_read(struct mm_struct *mm) {
		down_read(&mm->mmap_sem);
		if (current->mm == mm &&
		    current->flags |= PF_MMAP_SEM)
	}

	static inline void unlock_mm_for_read(struct mm_struct *mm) {
		if (current->mm == mm &&
		    current->flags &= ~PF_MMAP_SEM)
		up_read(&mm->mmap_sem);
	}

	static inline void lock_mm_for_fault(struct mm_struct *mm) {
		if (!(current->flags & PF_MMAP_SEM))
			down_read(&mm->mmap_sem);
	}

	static inline void unlock_mm_for_fault(struct mm_struct *mm) {
		if (!(current->flags & PF_MMAP_SEM))
			up_read(&mm->mmap_sem);
	}

If current->mm is passed as the argument, the compiler's optimiser should
discard the first comparison in the if-statement.

Then futex.c would hold:

	lock_mm_for_read(&current->mm);
	get_futex_key(...) etc.
	queue_me(...) etc.
	ret = get_user(...);
	/* the rest */
	unlock_mm_for_read(&current->mm);

And do_page_fault() would hold:

	lock_mm_for_fault(&current->mm);
	...
	unlock_mm_for_fault(&current->mm);

David

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 19:06 [PATCH/RFC] Futex mmap_sem deadlock Olof Johansson
  2005-02-22 19:36 ` Linus Torvalds
@ 2005-02-22 19:55 ` Andrew Morton
  2005-02-22 21:07   ` Jamie Lokier
  1 sibling, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2005-02-22 19:55 UTC (permalink / raw)
  To: Olof Johansson; +Cc: linux-kernel, torvalds, jamie, rusty

olof@austin.ibm.com (Olof Johansson) wrote:
>
> Hi,
> 
> Consider a small testcase that spawns off two threads, either thread
> doing a loop of:
> 
> 	buf = mmap /dev/zero MAP_SHARED for 0x100000 bytes
> 	call sys_futex (buf+page, FUTEX_WAIT, 1, NULL, NULL) for each page in said mmap
> 	munmap(buf)
> 	repeat
> 
> This will quickly lock up, since the futex_wait code dows a
> down_read(mmap_sem), then a get_user().
> 
> The do_page_fault code on ppc64 (as well as other architectures) needs
> to take the same semaphore for reading. This is all good until the
> second thread comes into play: Its mmap call tries to take the same
> semaphore for writing which causes in the do_page_fault down_read()
> to get stuck. Classic deadlock.

Yup.  Jamie says that the futex code _has_ to hold mmap_sem across the
get_user().  I forget (but could probably locate) the details.

>
> One attempt to fix this is included below. It works, but I'm not entirely
> happy with the fact that it's a bit messy solution. If anyone has a
> better idea for how to solve it I'd be all ears.

It's fairly sane.  Style-wise I'd be inclined to turn this:

	down_read(&current->mm->mmap_sem);
	while (!check_user_page_readable(current->mm, uaddr1)) {
		up_read(&current->mm->mmap_sem);
		/* Fault in the page through get_user() but discard result */
		if (get_user(curval, (int __user *)uaddr1) != 0)
			return -EFAULT;
		down_read(&current->mm->mmap_sem);
	}

into a standalone helper function.

> --- linux-2.5.orig/mm/mempolicy.c	2005-02-04 00:27:40.000000000 -0600
> +++ linux-2.5/mm/mempolicy.c	2005-02-21 16:43:08.000000000 -0600
> @@ -486,6 +486,7 @@
>  	struct mm_struct *mm = current->mm;
>  	struct vm_area_struct *vma = NULL;
>  	struct mempolicy *pol = current->mempolicy;
> +	DECLARE_BITMAP(nodes, MAX_NUMNODES);
>  
>  	if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
>  		return -EINVAL;
> @@ -524,16 +525,21 @@
>  	} else
>  		pval = pol->policy;
>  
> +	if (nmask)
> +		get_zonemask(pol, nodes);
> +
> +	if (vma) {
> +		up_read(&current->mm->mmap_sem);
> +		vma = NULL;
> +	}
> +

OK.

>  	err = -EFAULT;
>  	if (policy && put_user(pval, policy))
>  		goto out;
>  
>  	err = 0;
> -	if (nmask) {
> -		DECLARE_BITMAP(nodes, MAX_NUMNODES);
> -		get_zonemask(pol, nodes);
> +	if (nmask)
>  		err = copy_nodes_to_user(nmask, maxnode, nodes, sizeof(nodes));
> -	}
>  
>   out:
>  	if (vma)

I don't think we need to hold mmap_sem while running get_zonemask().  `pol'
is a copy of current->mempolicy, and it won't be going away.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 19:55 ` Andrew Morton
@ 2005-02-22 21:07   ` Jamie Lokier
  2005-02-22 21:19     ` Olof Johansson
                       ` (3 more replies)
  0 siblings, 4 replies; 38+ messages in thread
From: Jamie Lokier @ 2005-02-22 21:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Olof Johansson, linux-kernel, torvalds, rusty

Andrew Morton wrote:
> > This will quickly lock up, since the futex_wait code dows a
> > down_read(mmap_sem), then a get_user().
> > 
> > The do_page_fault code on ppc64 (as well as other architectures) needs
> > to take the same semaphore for reading. This is all good until the
> > second thread comes into play: Its mmap call tries to take the same
> > semaphore for writing which causes in the do_page_fault down_read()
> > to get stuck. Classic deadlock.
> 
> Yup.  Jamie says that the futex code _has_ to hold mmap_sem across the
> get_user().  I forget (but could probably locate) the details.

It does - the "key" which identifies a futex depends on a vma
calculation, and the vma must not change between the calculation and
the get_user().

> > One attempt to fix this is included below. It works, but I'm not entirely
> > happy with the fact that it's a bit messy solution. If anyone has a
> > better idea for how to solve it I'd be all ears.
> 
> It's fairly sane.  Style-wise I'd be inclined to turn this:
> 
> 	down_read(&current->mm->mmap_sem);
> 	while (!check_user_page_readable(current->mm, uaddr1)) {
> 		up_read(&current->mm->mmap_sem);
> 		/* Fault in the page through get_user() but discard result */
> 		if (get_user(curval, (int __user *)uaddr1) != 0)
> 			return -EFAULT;
> 		down_read(&current->mm->mmap_sem);
> 	}

That won't work because the vma lock must be help between key
calculation and get_user() - otherwise futex is not reliable.  It
would work if the futex key calculation was inside the loop.

A much simpler solution (and sorry for not offering it earlier,
because Andrew Morton pointed out this bug long ago, but I was busy), is:

In futex.c:

	down_read(&current->mm->mmap_sem);
	get_futex_key(...) etc.
	queue_me(...) etc.
	current->flags |= PF_MMAP_SEM;             <- new
	ret = get_user(...);
	current->flags &= PF_MMAP_SEM;             <- new
	/* the rest */

And in arch/*/mm/fault.c, replace every one of these:

	down_read(&mm->mmap_sem);

	up_read(&mm->mmap_sem);

with these:

	if (!(current & PF_MMAP_SEM))
		down_read(&mm->mmap_sem);

	if (!(current & PF_MMAP_SEM))
		up_read(&mm->mmap_sem);

-- Jamie

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 21:07   ` Jamie Lokier
@ 2005-02-22 21:19     ` Olof Johansson
  2005-02-22 22:09       ` Jamie Lokier
  2005-02-22 21:19     ` Chris Friesen
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 38+ messages in thread
From: Olof Johansson @ 2005-02-22 21:19 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Andrew Morton, linux-kernel, torvalds, rusty

On Tue, Feb 22, 2005 at 09:07:52PM +0000, Jamie Lokier wrote:
> 
> That won't work because the vma lock must be help between key
> calculation and get_user() - otherwise futex is not reliable.  It
> would work if the futex key calculation was inside the loop.

Sure, but that's still true: It's just that the get_user() is done twice
instead. The semaphore is never released between the key calculation and
the "real" get_user().

> A much simpler solution (and sorry for not offering it earlier,
> because Andrew Morton pointed out this bug long ago, but I was busy), is:

Either way works for me. Andrew/Linus, got a preference? I'll either
post my refresh based on Andrews comments, or code up Jamie's
suggestion.


-Olof

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 21:19     ` Olof Johansson
@ 2005-02-22 22:09       ` Jamie Lokier
  0 siblings, 0 replies; 38+ messages in thread
From: Jamie Lokier @ 2005-02-22 22:09 UTC (permalink / raw)
  To: Olof Johansson; +Cc: Andrew Morton, linux-kernel, torvalds, rusty

Olof Johansson wrote:
> > That won't work because the vma lock must be help between key
> > calculation and get_user() - otherwise futex is not reliable.  It
> > would work if the futex key calculation was inside the loop.
> 
> Sure, but that's still true: It's just that the get_user() is done twice
> instead. The semaphore is never released between the key calculation and
> the "real" get_user().

Ah, I didn't look at where the loop is used and didn't think there'd
be _two_ get_user() calls in the fast case.  Not my instinct.

> > A much simpler solution (and sorry for not offering it earlier,
> > because Andrew Morton pointed out this bug long ago, but I was busy), is:
> 
> Either way works for me. Andrew/Linus, got a preference? I'll either
> post my refresh based on Andrews comments, or code up Jamie's
> suggestion.

Yours has a couple of problems.

   1. It'll make futex waits somewhat slower.  One of the nicer features
      of 2.6 futexes is that we got rid of the explicit page table lookup.

   2. It's broken because a page can be paged out by another thread
      after you've forced it in and before the get_user().  We only
      take mmap_sem, not the page table lock.

-- Jamie

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 21:07   ` Jamie Lokier
  2005-02-22 21:19     ` Olof Johansson
@ 2005-02-22 21:19     ` Chris Friesen
  2005-02-22 21:27       ` Jamie Lokier
  2005-02-22 21:30     ` Linus Torvalds
  2005-02-22 21:40     ` Andrew Morton
  3 siblings, 1 reply; 38+ messages in thread
From: Chris Friesen @ 2005-02-22 21:19 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Andrew Morton, Olof Johansson, linux-kernel, torvalds, rusty

Jamie Lokier wrote:

> In futex.c:
> 
> 	down_read(&current->mm->mmap_sem);
> 	get_futex_key(...) etc.
> 	queue_me(...) etc.
> 	current->flags |= PF_MMAP_SEM;             <- new
> 	ret = get_user(...);
> 	current->flags &= PF_MMAP_SEM;             <- new
> 	/* the rest */

Should the second new line be this (with the inverse)?

	current->flags &= ~PF_MMAP_SEM;


Chris

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 21:19     ` Chris Friesen
@ 2005-02-22 21:27       ` Jamie Lokier
  0 siblings, 0 replies; 38+ messages in thread
From: Jamie Lokier @ 2005-02-22 21:27 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Andrew Morton, Olof Johansson, linux-kernel, torvalds, rusty

Chris Friesen wrote:
> >	down_read(&current->mm->mmap_sem);
> >	get_futex_key(...) etc.
> >	queue_me(...) etc.
> >	current->flags |= PF_MMAP_SEM;             <- new
> >	ret = get_user(...);
> >	current->flags &= PF_MMAP_SEM;             <- new
> >	/* the rest */
> 
> Should the second new line be this (with the inverse)?
> 
> 	current->flags &= ~PF_MMAP_SEM;

Quiet!  I was trying to sneak in a security hole! :)

-- Jamie

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 21:07   ` Jamie Lokier
  2005-02-22 21:19     ` Olof Johansson
  2005-02-22 21:19     ` Chris Friesen
@ 2005-02-22 21:30     ` Linus Torvalds
  2005-02-22 22:34       ` Jamie Lokier
                         ` (2 more replies)
  2005-02-22 21:40     ` Andrew Morton
  3 siblings, 3 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-02-22 21:30 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Andrew Morton, Olof Johansson, linux-kernel, rusty



On Tue, 22 Feb 2005, Jamie Lokier wrote:
> 
> A much simpler solution (and sorry for not offering it earlier,
> because Andrew Morton pointed out this bug long ago, but I was busy), is:
> 
> In futex.c:
> 
> 	down_read(&current->mm->mmap_sem);
> 	get_futex_key(...) etc.
> 	queue_me(...) etc.
> 	current->flags |= PF_MMAP_SEM;             <- new
> 	ret = get_user(...);
> 	current->flags &= PF_MMAP_SEM;             <- new
> 	/* the rest */

That is uglee. 

We really have this already, and it's called "current->preempt". It 
handles any lock at all, and doesn't add yet another special case to all 
the architectures.

Just do

	repeat:
		down_read(&current->mm->mmap_sem);
		get_futex_key(...) etc.
		queue_me(...) etc.
		inc_preempt_count();
		ret = get_user(...);
		dec_preempt_count();
		if (unlikely(ret)) {
			up_read(&current->mm->mmap_sem);
			/* Re-do the access outside the lock */
			ret = get_user(...);
			if (!ret)
				goto repeat;
			return ret;
		}
		...

and you should be ok.

No new special cases, no new abstractions. At most, we should probably 
create a "get_user_inatomic()", to 

 - make it damn obvious what we're doing, and match the explicit
   "inatomic" in the other place where we depend on this (fs/filemap.c)

 - allow the regular "get_user()" to continue to do the normal
   "might_sleep()" checks.

That's assuming we can't just make rwsem's nest nicely.

			Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 21:30     ` Linus Torvalds
@ 2005-02-22 22:34       ` Jamie Lokier
  2005-02-22 22:42         ` Olof Johansson
  2005-02-23 11:39       ` David Howells
  2005-02-23 14:49       ` Joe Korty
  2 siblings, 1 reply; 38+ messages in thread
From: Jamie Lokier @ 2005-02-22 22:34 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, Olof Johansson, linux-kernel, rusty

Linus Torvalds wrote:
> > 	queue_me(...) etc.
> > 	current->flags |= PF_MMAP_SEM;             <- new
> > 	ret = get_user(...);
> > 	current->flags &= PF_MMAP_SEM;             <- new
> > 	/* the rest */
> 
> That is uglee. 
> 
> We really have this already, and it's called "current->preempt". It 
> handles any lock at all, and doesn't add yet another special case to all 
> the architectures.

Ooh, I didn't know current->preempt did that (been away).

> 	repeat:
> 		down_read(&current->mm->mmap_sem);
> 		get_futex_key(...) etc.
> 		queue_me(...) etc.
> 		inc_preempt_count();
> 		ret = get_user(...);
> 		dec_preempt_count();
> 		if (unlikely(ret)) {
> 			up_read(&current->mm->mmap_sem);
> 			/* Re-do the access outside the lock */
> 			ret = get_user(...);
> 			if (!ret)
> 				goto repeat;
> 			return ret;
>		}

That would work.  I like it. :)

Page faults will enter the fault handler twice (i.e. slower), but
that's not really a disadvantage, because a program always references
the memory just before calling futex_wait anyway.  A fault is rare.

There is one small but important error: the "return ret" mustn't just
return.  It must call unqueue_me(&q) just like the code at out_unqueue,
_including_ the conditional "ret = 0", but _excluding_ the up_read().

Alternatively, since it's a rare case, just shuffle the loop around:

 		down_read(&current->mm->mmap_sem);
 	repeat:
 		get_futex_key(...) etc.
 		queue_me(...) etc.
 		inc_preempt_count();
 		ret = get_user(...);
 		dec_preempt_count();
 		if (unlikely(ret)) {
 			up_read(&current->mm->mmap_sem);
 			/* Re-do the access outside the lock */
 			ret = get_user(...);
			down_read(&current->mm->mmap_sem);
 			if (!ret)
 				goto repeat;
 			goto out_unqueue;
		}

-- Jamie

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 22:34       ` Jamie Lokier
@ 2005-02-22 22:42         ` Olof Johansson
  2005-02-22 23:20           ` Andrew Morton
  0 siblings, 1 reply; 38+ messages in thread
From: Olof Johansson @ 2005-02-22 22:42 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Linus Torvalds, Andrew Morton, linux-kernel, rusty

On Tue, Feb 22, 2005 at 10:34:57PM +0000, Jamie Lokier wrote:
> There is one small but important error: the "return ret" mustn't just
> return.  It must call unqueue_me(&q) just like the code at out_unqueue,
> _including_ the conditional "ret = 0", but _excluding_ the up_read().

Not only that, but someone might already have dequeued us, right? It's
probably a pathological case though, i.e. someone did a wake on the same
(bad) address.

How's this patch? It's closer to Linus' pseudo-code than Andrew's, to
avoid the extra get_user() at function entry and keep the common case
path short.

It also includes the feedback from Andrew on the sys_get_mempolicy(),
making the patch even simpler there.

----


Some futex functions do get_user calls while holding mmap_sem for
reading. If get_user() faults, and another thread happens to be in mmap
(or somewhere else holding waiting on down_write for the same semaphore),
then do_page_fault will deadlock. Most architectures seem to be exposed
to this.

To avoid it, make sure the page is available. If not, release the
semaphore, fault it in and retry.

I also found another exposure by inspection, moving some of the code
around avoids the possible deadlock there.

Signed-off-by: Olof Johansson <olof@austin.ibm.com>


Index: linux-2.5/kernel/futex.c
===================================================================
--- linux-2.5.orig/kernel/futex.c	2005-02-21 16:09:38.000000000 -0600
+++ linux-2.5/kernel/futex.c	2005-02-22 16:38:24.000000000 -0600
@@ -329,6 +329,7 @@
 	int ret, drop_count = 0;
 	unsigned int nqueued;
 
+ retry:
 	down_read(&current->mm->mmap_sem);
 
 	ret = get_futex_key(uaddr1, &key1);
@@ -355,9 +356,19 @@
 		   before *uaddr1.  */
 		smp_mb();
 
-		if (get_user(curval, (int __user *)uaddr1) != 0) {
-			ret = -EFAULT;
-			goto out;
+		inc_preempt_count();
+		ret = get_user(curval, (int __user *)uaddr1);
+		dec_preempt_count();
+
+		if (unlikely(ret)) {
+			up_read(&current->mm->mmap_sem);
+			/* Re-do the access outside the lock */
+			ret = get_user(curval, (int __user *)uaddr1);
+
+			if (!ret)
+				goto retry;
+
+			return ret;
 		}
 		if (curval != *valp) {
 			ret = -EAGAIN;
@@ -480,6 +491,7 @@
 	int ret, curval;
 	struct futex_q q;
 
+ retry:
 	down_read(&current->mm->mmap_sem);
 
 	ret = get_futex_key(uaddr, &q.key);
@@ -508,9 +520,21 @@
 	 * We hold the mmap semaphore, so the mapping cannot have changed
 	 * since we looked it up in get_futex_key.
 	 */
-	if (get_user(curval, (int __user *)uaddr) != 0) {
-		ret = -EFAULT;
-		goto out_unqueue;
+	inc_preempt_count();
+	ret = get_user(curval, (int __user *)uaddr);
+	dec_preempt_count();
+	if (unlikely(ret)) {
+		up_read(&current->mm->mmap_sem);
+
+		if (!unqueue_me(&q)) /* There's a chance we got woken already */
+			return 0;
+
+		/* Re-do the access outside the lock */
+		ret = get_user(curval, (int __user *)uaddr);
+
+		if (!ret)
+			goto retry;
+		return ret;
 	}
 	if (curval != val) {
 		ret = -EWOULDBLOCK;
Index: linux-2.5/mm/mempolicy.c
===================================================================
--- linux-2.5.orig/mm/mempolicy.c	2005-02-04 00:27:40.000000000 -0600
+++ linux-2.5/mm/mempolicy.c	2005-02-22 14:34:19.000000000 -0600
@@ -524,9 +524,13 @@
 	} else
 		pval = pol->policy;
 
-	err = -EFAULT;
+	if (vma) {
+		up_read(&current->mm->mmap_sem);
+		vma = NULL;
+	}
+
 	if (policy && put_user(pval, policy))
-		goto out;
+		return -EFAULT;
 
 	err = 0;
 	if (nmask) {

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 22:42         ` Olof Johansson
@ 2005-02-22 23:20           ` Andrew Morton
  2005-02-22 23:23             ` Olof Johansson
  0 siblings, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2005-02-22 23:20 UTC (permalink / raw)
  To: Olof Johansson; +Cc: jamie, torvalds, linux-kernel, rusty

olof@austin.ibm.com (Olof Johansson) wrote:
>
> +		inc_preempt_count();
> +		ret = get_user(curval, (int __user *)uaddr1);
> +		dec_preempt_count();

That _should_ generate a might_sleep() warning, except it looks like we
forgot to add a check to get_user().

It would be better to use __copy_from_user_inatomic() here, I think.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 23:20           ` Andrew Morton
@ 2005-02-22 23:23             ` Olof Johansson
  0 siblings, 0 replies; 38+ messages in thread
From: Olof Johansson @ 2005-02-22 23:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: jamie, torvalds, linux-kernel, rusty

On Tue, Feb 22, 2005 at 03:20:27PM -0800, Andrew Morton wrote:
> olof@austin.ibm.com (Olof Johansson) wrote:
> >
> > +		inc_preempt_count();
> > +		ret = get_user(curval, (int __user *)uaddr1);
> > +		dec_preempt_count();
> 
> That _should_ generate a might_sleep() warning, except it looks like we
> forgot to add a check to get_user().
> 
> It would be better to use __copy_from_user_inatomic() here, I think.

Thanks for catching that. New rev below.

-----

Some futex functions do get_user calls while holding mmap_sem for
reading. If get_user() faults, and another thread happens to be in mmap
(or somewhere else holding waiting on down_write for the same semaphore),
then do_page_fault will deadlock. Most architectures seem to be exposed
to this.

To avoid it, make sure the page is available. If not, release the
semaphore, fault it in and retry.

I also found another exposure by inspection, moving some of the code
around avoids the possible deadlock there.

Signed-off-by: Olof Johansson <olof@austin.ibm.com>


Index: linux-2.5/kernel/futex.c
===================================================================
--- linux-2.5.orig/kernel/futex.c	2005-02-21 16:09:38.000000000 -0600
+++ linux-2.5/kernel/futex.c	2005-02-22 17:20:22.000000000 -0600
@@ -329,6 +329,7 @@
 	int ret, drop_count = 0;
 	unsigned int nqueued;
 
+ retry:
 	down_read(&current->mm->mmap_sem);
 
 	ret = get_futex_key(uaddr1, &key1);
@@ -355,9 +356,19 @@
 		   before *uaddr1.  */
 		smp_mb();
 
-		if (get_user(curval, (int __user *)uaddr1) != 0) {
-			ret = -EFAULT;
-			goto out;
+		inc_preempt_count();
+		ret = __copy_from_user_inatomic(&curval, (int __user *)uaddr1, sizeof(int));
+		dec_preempt_count();
+
+		if (unlikely(ret)) {
+			up_read(&current->mm->mmap_sem);
+			/* Re-do the access outside the lock */
+			ret = get_user(curval, (int __user *)uaddr1);
+
+			if (!ret)
+				goto retry;
+
+			return ret;
 		}
 		if (curval != *valp) {
 			ret = -EAGAIN;
@@ -480,6 +491,7 @@
 	int ret, curval;
 	struct futex_q q;
 
+ retry:
 	down_read(&current->mm->mmap_sem);
 
 	ret = get_futex_key(uaddr, &q.key);
@@ -508,9 +520,21 @@
 	 * We hold the mmap semaphore, so the mapping cannot have changed
 	 * since we looked it up in get_futex_key.
 	 */
-	if (get_user(curval, (int __user *)uaddr) != 0) {
-		ret = -EFAULT;
-		goto out_unqueue;
+	inc_preempt_count();
+	ret = __copy_from_user_inatomic(&curval, (int __user *)uaddr, sizeof(int));
+	dec_preempt_count();
+	if (unlikely(ret)) {
+		up_read(&current->mm->mmap_sem);
+
+		if (!unqueue_me(&q)) /* There's a chance we got woken already */
+			return 0;
+
+		/* Re-do the access outside the lock */
+		ret = get_user(curval, (int __user *)uaddr);
+
+		if (!ret)
+			goto retry;
+		return ret;
 	}
 	if (curval != val) {
 		ret = -EWOULDBLOCK;
Index: linux-2.5/mm/mempolicy.c
===================================================================
--- linux-2.5.orig/mm/mempolicy.c	2005-02-04 00:27:40.000000000 -0600
+++ linux-2.5/mm/mempolicy.c	2005-02-22 14:34:19.000000000 -0600
@@ -524,9 +524,13 @@
 	} else
 		pval = pol->policy;
 
-	err = -EFAULT;
+	if (vma) {
+		up_read(&current->mm->mmap_sem);
+		vma = NULL;
+	}
+
 	if (policy && put_user(pval, policy))
-		goto out;
+		return -EFAULT;
 
 	err = 0;
 	if (nmask) {

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 21:30     ` Linus Torvalds
  2005-02-22 22:34       ` Jamie Lokier
@ 2005-02-23 11:39       ` David Howells
  2005-02-23 16:22         ` Olof Johansson
  2005-02-23 14:49       ` Joe Korty
  2 siblings, 1 reply; 38+ messages in thread
From: David Howells @ 2005-02-23 11:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, Andrew Morton, Olof Johansson, linux-kernel, rusty

Linus Torvalds <torvalds@osdl.org> wrote:

> That is uglee. 

True. You could just wrap it up in inline functions and hide it in a header
file as I suggested in the email I've just sent.

> We really have this already, and it's called "current->preempt". It 
> handles any lock at all, and doesn't add yet another special case to all 
> the architectures.
> 
> Just do
> 
> 	repeat:

Alternately, you could just have do_page_fault() do:

	while (!down_read_trylock(&current->mm->mmap_sem))
		continue;

However, note that this can suffer from starvation due to a never ending flow
of mixed write-locks and read-locks on other CPUs. Unlikely, true, but not
impossible.

There is yet another way, now that I think about it... Another rwsem could be
added to mm_struct, one that you have to get immediately before mmap_sem and
release immediately after; one that page fault doesn't ever touch... but that
isn't very pleasant either:-/

A yet further way would be to make a second kind of rwsem; one that's
unfair. But if you do that, a few threads that're swapping a lot, combined
with a few tops could starve out another thread that's trying to do an mmap...

Actually, there's an added benefit to your suggested method... It doesn't
involve changing what's in task_struct or mm_struct and doesn't involve
changing the semantics of the mmap_sem lock.

> That's assuming we can't just make rwsem's nest nicely.

rwsems themselves? No, not really. It'd either mean keeing track of which task
holds what sort of lock on every rwsem (which'd be nice for debugging,
granted; but not something you really want in a normal system), or it'd mean
making rwsems unfair - something I think will be a really bad idea.

David

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-23 11:39       ` David Howells
@ 2005-02-23 16:22         ` Olof Johansson
  2005-02-23 18:44           ` David Howells
  0 siblings, 1 reply; 38+ messages in thread
From: Olof Johansson @ 2005-02-23 16:22 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Jamie Lokier, Andrew Morton, linux-kernel, rusty

On Wed, Feb 23, 2005 at 11:39:08AM +0000, David Howells wrote:
> Alternately, you could just have do_page_fault() do:
> 
> 	while (!down_read_trylock(&current->mm->mmap_sem))
> 		continue;
> 
> However, note that this can suffer from starvation due to a never ending flow
> of mixed write-locks and read-locks on other CPUs. Unlikely, true, but not
> impossible.

How can this help? 

The semaphore is held for reading by the thread that faulted in
futex_wait() -> get_user(), so no writers will be let through. Until the
writer has been let through, the down_read_trylock will never succeed
either. No forward progress can be made even with the above loop.


-Olof

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-23 16:22         ` Olof Johansson
@ 2005-02-23 18:44           ` David Howells
  0 siblings, 0 replies; 38+ messages in thread
From: David Howells @ 2005-02-23 18:44 UTC (permalink / raw)
  To: Olof Johansson
  Cc: Linus Torvalds, Jamie Lokier, Andrew Morton, linux-kernel, rusty

Olof Johansson <olof@austin.ibm.com> wrote:

> > Alternately, you could just have do_page_fault() do:
> > 
> > 	while (!down_read_trylock(&current->mm->mmap_sem))
> > 		continue;
> > 
> > However, note that this can suffer from starvation due to a never ending
> > flow of mixed write-locks and read-locks on other CPUs. Unlikely, true,
> > but not impossible.
> 
> How can this help? 
> 
> The semaphore is held for reading by the thread that faulted in
> futex_wait() -> get_user(), so no writers will be let through. Until the
> writer has been let through, the down_read_trylock will never succeed
> either. No forward progress can be made even with the above loop.

You're right. The "writers" would have to spin instead.

David

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 21:30     ` Linus Torvalds
  2005-02-22 22:34       ` Jamie Lokier
  2005-02-23 11:39       ` David Howells
@ 2005-02-23 14:49       ` Joe Korty
  2005-02-23 15:54         ` Linus Torvalds
  2 siblings, 1 reply; 38+ messages in thread
From: Joe Korty @ 2005-02-23 14:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, Andrew Morton, Olof Johansson, linux-kernel, rusty

On Tue, Feb 22, 2005 at 01:30:27PM -0800, Linus Torvalds wrote:
> 
> We really have this already, and it's called "current->preempt". It 
> handles any lock at all, and doesn't add yet another special case to all 
> the architectures.
> 
> Just do
> 
> 	repeat:
> 		down_read(&current->mm->mmap_sem);
> 		get_futex_key(...) etc.
> 		queue_me(...) etc.
> 		inc_preempt_count();
> 		ret = get_user(...);
> 		dec_preempt_count();

Perhaps this should be preempt_disable .... preempt_enable.

Otherwise, a preempt attempt in get_user would not be seen
until some future preempt_enable was executed.

Regards,
Joe
--
"Money can buy bandwidth, but latency is forever" -- John Mashey

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-23 14:49       ` Joe Korty
@ 2005-02-23 15:54         ` Linus Torvalds
  2005-02-23 17:10           ` Olof Johansson
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-02-23 15:54 UTC (permalink / raw)
  To: Joe Korty
  Cc: Jamie Lokier, Andrew Morton, Olof Johansson, linux-kernel, rusty

On Wed, 23 Feb 2005, Joe Korty wrote:
> 
> Perhaps this should be preempt_disable .... preempt_enable.

No, the problem with preempt_disable/enable is that they go away if 
preemption is not enabled. So you really do have to do it by hand with the 
"inc_preempt_count".

> Otherwise, a preempt attempt in get_user would not be seen
> until some future preempt_enable was executed.

True. I guess we should have a "preempt_check_resched()" there too. That's 
what "kunmap_atomic()" does too (which is what we rely on in the other 
case we do this..)

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-23 15:54         ` Linus Torvalds
@ 2005-02-23 17:10           ` Olof Johansson
  2005-02-23 17:37             ` Arjan van de Ven
  2005-02-23 18:22             ` Jamie Lokier
  0 siblings, 2 replies; 38+ messages in thread
From: Olof Johansson @ 2005-02-23 17:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Joe Korty, Jamie Lokier, Andrew Morton, linux-kernel, rusty

On Wed, Feb 23, 2005 at 07:54:06AM -0800, Linus Torvalds wrote:

> > Otherwise, a preempt attempt in get_user would not be seen
> > until some future preempt_enable was executed.
> 
> True. I guess we should have a "preempt_check_resched()" there too. That's 
> what "kunmap_atomic()" does too (which is what we rely on in the other 
> case we do this..)

Ok, this is getting complex enough to warrant get_user_inatomic(),
which means adding it to every arch's uaccess.h.

Below patch does so. Unfortunately I don't have a Viro setup with cross
compilers for nearly every arch, so I can't make sure it doesn't break
anything. But since I pasted the same code everywhere it shouldn't.


-----

Some futex functions do get_user calls while holding mmap_sem for
reading. If get_user() faults, and another thread happens to be in mmap
(or somewhere else holding waiting on down_write for the same semaphore),
then do_page_fault will deadlock. Most architectures seem to be exposed
to this.

To avoid it, make sure the page is available. If not, release the
semaphore, fault it in and retry.

I also found another exposure by inspection, moving some of the code
around avoids the possible deadlock there.

Signed-off-by: Olof Johansson <olof@austin.ibm.com>


Index: linux-2.5/kernel/futex.c
===================================================================
--- linux-2.5.orig/kernel/futex.c	2005-02-21 16:09:38.000000000 -0600
+++ linux-2.5/kernel/futex.c	2005-02-23 10:55:37.000000000 -0600
@@ -329,6 +329,7 @@
 	int ret, drop_count = 0;
 	unsigned int nqueued;
 
+ retry:
 	down_read(&current->mm->mmap_sem);
 
 	ret = get_futex_key(uaddr1, &key1);
@@ -355,9 +356,17 @@
 		   before *uaddr1.  */
 		smp_mb();
 
-		if (get_user(curval, (int __user *)uaddr1) != 0) {
-			ret = -EFAULT;
-			goto out;
+		ret = get_user_inatomic(curval, (int __user *)uaddr1);
+
+		if (unlikely(ret)) {
+			up_read(&current->mm->mmap_sem);
+			/* Re-do the access outside the lock */
+			ret = get_user(curval, (int __user *)uaddr1);
+
+			if (!ret)
+				goto retry;
+
+			return ret;
 		}
 		if (curval != *valp) {
 			ret = -EAGAIN;
@@ -480,6 +489,7 @@
 	int ret, curval;
 	struct futex_q q;
 
+ retry:
 	down_read(&current->mm->mmap_sem);
 
 	ret = get_futex_key(uaddr, &q.key);
@@ -508,9 +518,21 @@
 	 * We hold the mmap semaphore, so the mapping cannot have changed
 	 * since we looked it up in get_futex_key.
 	 */
-	if (get_user(curval, (int __user *)uaddr) != 0) {
-		ret = -EFAULT;
-		goto out_unqueue;
+
+	ret = get_user_inatomic(curval, (int __user *)uaddr);
+
+	if (unlikely(ret)) {
+		up_read(&current->mm->mmap_sem);
+
+		if (!unqueue_me(&q)) /* There's a chance we got woken already */
+			return 0;
+
+		/* Re-do the access outside the lock */
+		ret = get_user(curval, (int __user *)uaddr);
+
+		if (!ret)
+			goto retry;
+		return ret;
 	}
 	if (curval != val) {
 		ret = -EWOULDBLOCK;
Index: linux-2.5/mm/mempolicy.c
===================================================================
--- linux-2.5.orig/mm/mempolicy.c	2005-02-04 00:27:40.000000000 -0600
+++ linux-2.5/mm/mempolicy.c	2005-02-23 10:16:46.000000000 -0600
@@ -524,9 +524,13 @@
 	} else
 		pval = pol->policy;
 
-	err = -EFAULT;
+	if (vma) {
+		up_read(&current->mm->mmap_sem);
+		vma = NULL;
+	}
+
 	if (policy && put_user(pval, policy))
-		goto out;
+		return -EFAULT;
 
 	err = 0;
 	if (nmask) {
Index: linux-2.5/include/asm-ppc64/uaccess.h
===================================================================
--- linux-2.5.orig/include/asm-ppc64/uaccess.h	2005-02-04 00:27:13.000000000 -0600
+++ linux-2.5/include/asm-ppc64/uaccess.h	2005-02-23 11:01:44.000000000 -0600
@@ -217,6 +217,17 @@
 		: "=r"(err), "=r"(x)			\
 		: "b"(addr), "i"(errret), "0"(err))
 
+
+#define get_user_inatomic(x,ptr)					\
+({									\
+	int __ret;							\
+	inc_preempt_count();						\
+ 	__ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr)));	\
+ 	dec_preempt_count();						\
+ 	preempt_check_resched();					\
+	__ret;								\
+})
+
 /* more complex routines */
 
 extern unsigned long __copy_tofrom_user(void __user *to, const void __user *from,
Index: linux-2.5/include/asm-alpha/uaccess.h
===================================================================
--- linux-2.5.orig/include/asm-alpha/uaccess.h	2005-02-04 00:26:53.000000000 -0600
+++ linux-2.5/include/asm-alpha/uaccess.h	2005-02-23 10:59:33.000000000 -0600
@@ -341,6 +341,16 @@
 }
 #endif
 
+#define get_user_inatomic(x,ptr)					\
+({									\
+	int __ret;							\
+	inc_preempt_count();						\
+ 	__ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr)));	\
+ 	dec_preempt_count();						\
+ 	preempt_check_resched();					\
+	__ret;								\
+})
+
 
 /*
  * Complex access routines
Index: linux-2.5/include/asm-arm26/uaccess.h
===================================================================
--- linux-2.5.orig/include/asm-arm26/uaccess.h	2005-02-04 00:26:58.000000000 -0600
+++ linux-2.5/include/asm-arm26/uaccess.h	2005-02-23 10:59:49.000000000 -0600
@@ -224,6 +224,17 @@
         }                                                               \
 } while (0)
 
+
+#define get_user_inatomic(x,ptr)					\
+({									\
+	int __ret;							\
+	inc_preempt_count();						\
+ 	__ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr)));	\
+ 	dec_preempt_count();						\
+ 	preempt_check_resched();					\
+	__ret;								\
+})
+
 static __inline__ unsigned long copy_from_user(void *to, const void *from, unsigned long n)
 {
 	if (access_ok(VERIFY_READ, from, n))
Index: linux-2.5/include/asm-arm/uaccess.h
===================================================================
--- linux-2.5.orig/include/asm-arm/uaccess.h	2005-02-04 00:26:54.000000000 -0600
+++ linux-2.5/include/asm-arm/uaccess.h	2005-02-23 11:00:00.000000000 -0600
@@ -362,6 +362,16 @@
 	: "r" (x), "i" (-EFAULT)				\
 	: "cc")
 
+#define get_user_inatomic(x,ptr)					\
+({									\
+	int __ret;							\
+	inc_preempt_count();						\
+ 	__ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr)));	\
+ 	dec_preempt_count();						\
+ 	preempt_check_resched();					\
+	__ret;								\
+})
+
 extern unsigned long __arch_copy_from_user(void *to, const void __user *from, unsigned long n);
 extern unsigned long __arch_copy_to_user(void __user *to, const void *from, unsigned long n);
 extern unsigned long __arch_clear_user(void __user *addr, unsigned long n);
Index: linux-2.5/include/asm-cris/uaccess.h
===================================================================
--- linux-2.5.orig/include/asm-cris/uaccess.h	2005-02-04 00:26:58.000000000 -0600
+++ linux-2.5/include/asm-cris/uaccess.h	2005-02-23 11:00:11.000000000 -0600
@@ -210,6 +210,16 @@
 	__gu_err;							\
 })
 
+#define get_user_inatomic(x,ptr)					\
+({									\
+	int __ret;							\
+	inc_preempt_count();						\
+ 	__ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr)));	\
+ 	dec_preempt_count();						\
+ 	preempt_check_resched();					\
+	__ret;								\
+})
+
 extern long __get_user_bad(void);
 
 /* More complex functions.  Most are inline, but some call functions that
Index: linux-2.5/include/asm-frv/uaccess.h
===================================================================
--- linux-2.5.orig/include/asm-frv/uaccess.h	2005-02-04 00:26:59.000000000 -0600
+++ linux-2.5/include/asm-frv/uaccess.h	2005-02-23 11:00:21.000000000 -0600
@@ -249,6 +249,17 @@
 
 #endif
 
+#define get_user_inatomic(x,ptr)					\
+({									\
+	int __ret;							\
+	inc_preempt_count();						\
+ 	__ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr)));	\
+ 	dec_preempt_count();						\
+ 	preempt_check_resched();					\
+	__ret;								\
+})
+
+
 /*****************************************************************************/
 /*
  *
Index: linux-2.5/include/asm-h8300/uaccess.h
===================================================================
--- linux-2.5.orig/include/asm-h8300/uaccess.h	2005-02-04 00:26:59.000000000 -0600
+++ linux-2.5/include/asm-h8300/uaccess.h	2005-02-23 11:00:31.000000000 -0600
@@ -130,6 +130,17 @@
 
 #define copy_from_user_ret(to,from,n,retval) ({ if (copy_from_user(to,from,n)) return retval; })
 
+#define get_user_inatomic(x,ptr)					\
+({									\
+	int __ret;							\
+	inc_preempt_count();						\
+ 	__ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr)));	\
+ 	dec_preempt_count();						\
+ 	preempt_check_resched();					\
+	__ret;								\
+})
+
+
 /*
  * Copy a null terminated string from userspace.
  */
Index: linux-2.5/include/asm-i386/uaccess.h
===================================================================
--- linux-2.5.orig/include/asm-i386/uaccess.h	2005-02-04 00:27:00.000000000 -0600
+++ linux-2.5/include/asm-i386/uaccess.h	2005-02-23 11:00:41.000000000 -0600
@@ -377,6 +377,17 @@
 		: "m"(__m(addr)), "i"(errret), "0"(err))
 
 
+#define get_user_inatomic(x,ptr)					\
+({									\
+	int __ret;							\
+	inc_preempt_count();						\
+ 	__ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr)));	\
+ 	dec_preempt_count();						\
+ 	preempt_check_resched();					\
+	__ret;								\
+})
+
+
 unsigned long __must_check __copy_to_user_ll(void __user *to,
 				const void *from, unsigned long n);
 unsigned long __must_check __copy_from_user_ll(void *to,
Index: linux-2.5/include/asm-ia64/uaccess.h
===================================================================
--- linux-2.5.orig/include/asm-ia64/uaccess.h	2005-02-04 00:27:02.000000000 -0600
+++ linux-2.5/include/asm-ia64/uaccess.h	2005-02-23 11:00:48.000000000 -0600
@@ -299,6 +299,16 @@
 	__cu_len;						\
 })
 
+#define get_user_inatomic(x,ptr)					\
+({									\
+	int __ret;							\
+	inc_preempt_count();						\
+ 	__ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr)));	\
+ 	dec_preempt_count();						\
+ 	preempt_check_resched();					\
+	__ret;								\
+})
+
 
 /*
  * Returns: -EFAULT if exception before terminator, N if the entire buffer filled, else
Index: linux-2.5/include/asm-m32r/uaccess.h
===================================================================
--- linux-2.5.orig/include/asm-m32r/uaccess.h	2005-02-04 00:27:03.000000000 -0600
+++ linux-2.5/include/asm-m32r/uaccess.h	2005-02-23 11:00:56.000000000 -0600
@@ -706,6 +706,17 @@
 long __must_check __strncpy_from_user(char *dst,
 				const char __user *src, long count);
 
+
+#define get_user_inatomic(x,ptr)					\
+({									\
+	int __ret;							\
+	inc_preempt_count();						\
+ 	__ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr)));	\
+ 	dec_preempt_count();						\
+ 	preempt_check_resched();					\
+	__ret;								\
+})
+
 /**
  * __clear_user: - Zero a block of memory in user space, with less checking.
  * @to:   Destination address, in user space.
Index: linux-2.5/include/asm-m68knommu/uaccess.h
===================================================================
--- linux-2.5.orig/include/asm-m68knommu/uaccess.h	2005-02-04 00:27:04.000000000 -0600
+++ linux-2.5/include/asm-m68knommu/uaccess.h	2005-02-23 11:01:05.000000000 -0600
@@ -141,6 +141,17 @@
 
 #define copy_from_user_ret(to,from,n,retval) ({ if (copy_from_user(to,from,n)) return retval; })
 
+
+#define get_user_inatomic(x,ptr)					\
+({									\
+	int __ret;							\
+	inc_preempt_count();						\
+ 	__ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr)));	\
+ 	dec_preempt_count();						\
+ 	preempt_check_resched();					\
+	__ret;								\
+})
+
 /*
  * Copy a null terminated string from userspace.
  */
Index: linux-2.5/include/asm-m68k/uaccess.h
===================================================================
--- linux-2.5.orig/include/asm-m68k/uaccess.h	2005-02-04 00:27:04.000000000 -0600
+++ linux-2.5/include/asm-m68k/uaccess.h	2005-02-23 11:01:14.000000000 -0600
@@ -766,6 +766,17 @@
 #define __copy_from_user(to, from, n) copy_from_user(to, from, n)
 #define __copy_to_user(to, from, n) copy_to_user(to, from, n)
 
+#define get_user_inatomic(x,ptr)					\
+({									\
+	int __ret;							\
+	inc_preempt_count();						\
+ 	__ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr)));	\
+ 	dec_preempt_count();						\
+ 	preempt_check_resched();					\
+	__ret;								\
+})
+
+
 /*
  * Copy a null terminated string from userspace.
  */
Index: linux-2.5/include/asm-mips/uaccess.h
===================================================================
--- linux-2.5.orig/include/asm-mips/uaccess.h	2005-02-04 00:27:08.000000000 -0600
+++ linux-2.5/include/asm-mips/uaccess.h	2005-02-23 11:01:26.000000000 -0600
@@ -604,6 +604,17 @@
 	__cu_len;							\
 })
 
+#define get_user_inatomic(x,ptr)					\
+({									\
+	int __ret;							\
+	inc_preempt_count();						\
+ 	__ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr)));	\
+ 	dec_preempt_count();						\
+ 	preempt_check_resched();					\
+	__ret;								\
+})
+
+
 /*
  * __clear_user: - Zero a block of memory in user space, with less checking.
  * @to:   Destination address, in user space.
Index: linux-2.5/include/asm-parisc/uaccess.h
===================================================================
--- linux-2.5.orig/include/asm-parisc/uaccess.h	2005-02-04 00:27:11.000000000 -0600
+++ linux-2.5/include/asm-parisc/uaccess.h	2005-02-23 11:01:36.000000000 -0600
@@ -245,6 +245,16 @@
 
 #endif /* !__LP64__ */
 
+#define get_user_inatomic(x,ptr)					\
+({									\
+	int __ret;							\
+	inc_preempt_count();						\
+ 	__ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr)));	\
+ 	dec_preempt_count();						\
+ 	preempt_check_resched();					\
+	__ret;								\
+})
+
 
 /*
  * Complex access routines -- external declarations
Index: linux-2.5/include/asm-ppc/uaccess.h
===================================================================
--- linux-2.5.orig/include/asm-ppc/uaccess.h	2005-02-04 00:27:12.000000000 -0600
+++ linux-2.5/include/asm-ppc/uaccess.h	2005-02-23 11:01:53.000000000 -0600
@@ -296,6 +296,16 @@
 		: "=r"(err), "=&r"(x)			\
 		: "b"(addr), "i"(-EFAULT), "0"(err))
 
+#define get_user_inatomic(x,ptr)					\
+({									\
+	int __ret;							\
+	inc_preempt_count();						\
+ 	__ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr)));	\
+ 	dec_preempt_count();						\
+ 	preempt_check_resched();					\
+	__ret;								\
+})
+
 /* more complex routines */
 
 extern int __copy_tofrom_user(void __user *to, const void __user *from,
Index: linux-2.5/include/asm-s390/uaccess.h
===================================================================
--- linux-2.5.orig/include/asm-s390/uaccess.h	2005-02-04 00:27:14.000000000 -0600
+++ linux-2.5/include/asm-s390/uaccess.h	2005-02-23 11:02:02.000000000 -0600
@@ -278,6 +278,18 @@
 #define __copy_to_user_inatomic __copy_to_user
 #define __copy_from_user_inatomic __copy_from_user
 
+#define get_user_inatomic(x,ptr)					\
+({									\
+	int __ret;							\
+	inc_preempt_count();						\
+ 	__ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr)));	\
+ 	dec_preempt_count();						\
+ 	preempt_check_resched();					\
+	__ret;								\
+})
+
+
+
 /**
  * copy_to_user: - Copy a block of data into user space.
  * @to:   Destination address, in user space.
Index: linux-2.5/include/asm-sh64/uaccess.h
===================================================================
--- linux-2.5.orig/include/asm-sh64/uaccess.h	2005-02-04 00:27:15.000000000 -0600
+++ linux-2.5/include/asm-sh64/uaccess.h	2005-02-23 11:02:23.000000000 -0600
@@ -264,6 +264,17 @@
 #define __copy_to_user_inatomic __copy_to_user
 #define __copy_from_user_inatomic __copy_from_user
 
+#define get_user_inatomic(x,ptr)					\
+({									\
+	int __ret;							\
+	inc_preempt_count();						\
+ 	__ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr)));	\
+ 	dec_preempt_count();						\
+ 	preempt_check_resched();					\
+	__ret;								\
+})
+
+
 /* XXX: Not sure it works well..
    should be such that: 4byte clear and the rest. */
 extern __kernel_size_t __clear_user(void *addr, __kernel_size_t size);
Index: linux-2.5/include/asm-sh/uaccess.h
===================================================================
--- linux-2.5.orig/include/asm-sh/uaccess.h	2005-02-04 00:27:15.000000000 -0600
+++ linux-2.5/include/asm-sh/uaccess.h	2005-02-23 11:02:29.000000000 -0600
@@ -464,6 +464,16 @@
 	__copy_user((void *)(to),		\
 		    (void *)(from), n)
 
+#define get_user_inatomic(x,ptr)					\
+({									\
+	int __ret;							\
+	inc_preempt_count();						\
+ 	__ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr)));	\
+ 	dec_preempt_count();						\
+ 	preempt_check_resched();					\
+	__ret;								\
+})
+
 /*
  * Clear the area and return remaining number of bytes
  * (on failure.  Usually it's 0.)
Index: linux-2.5/include/asm-sparc64/uaccess.h
===================================================================
--- linux-2.5.orig/include/asm-sparc64/uaccess.h	2005-02-04 00:27:17.000000000 -0600
+++ linux-2.5/include/asm-sparc64/uaccess.h	2005-02-23 11:02:37.000000000 -0600
@@ -312,6 +312,16 @@
 
 #define clear_user __clear_user
 
+#define get_user_inatomic(x,ptr)					\
+({									\
+	int __ret;							\
+	inc_preempt_count();						\
+ 	__ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr)));	\
+ 	dec_preempt_count();						\
+ 	preempt_check_resched();					\
+	__ret;								\
+})
+
 extern long __must_check __strncpy_from_user(char *dest, const char __user *src, long count);
 
 #define strncpy_from_user __strncpy_from_user
Index: linux-2.5/include/asm-sparc/uaccess.h
===================================================================
--- linux-2.5.orig/include/asm-sparc/uaccess.h	2005-02-04 00:27:16.000000000 -0600
+++ linux-2.5/include/asm-sparc/uaccess.h	2005-02-23 11:02:46.000000000 -0600
@@ -325,6 +325,16 @@
 #define __copy_to_user_inatomic __copy_to_user
 #define __copy_from_user_inatomic __copy_from_user
 
+#define get_user_inatomic(x,ptr)					\
+({									\
+	int __ret;							\
+	inc_preempt_count();						\
+ 	__ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr)));	\
+ 	dec_preempt_count();						\
+ 	preempt_check_resched();					\
+	__ret;								\
+})
+
 static inline unsigned long __clear_user(void __user *addr, unsigned long size)
 {
 	unsigned long ret;
Index: linux-2.5/include/asm-um/uaccess.h
===================================================================
--- linux-2.5.orig/include/asm-um/uaccess.h	2005-02-04 00:27:19.000000000 -0600
+++ linux-2.5/include/asm-um/uaccess.h	2005-02-23 11:02:54.000000000 -0600
@@ -80,6 +80,17 @@
 	 __put_user(x, private_ptr) : -EFAULT); \
 })
 
+
+#define get_user_inatomic(x,ptr)					\
+({									\
+	int __ret;							\
+	inc_preempt_count();						\
+ 	__ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr)));	\
+ 	dec_preempt_count();						\
+ 	preempt_check_resched();					\
+	__ret;								\
+})
+
 #define strlen_user(str) strnlen_user(str, ~0UL >> 1)
 
 struct exception_table_entry
Index: linux-2.5/include/asm-v850/uaccess.h
===================================================================
--- linux-2.5.orig/include/asm-v850/uaccess.h	2005-02-04 00:27:21.000000000 -0600
+++ linux-2.5/include/asm-v850/uaccess.h	2005-02-23 11:03:02.000000000 -0600
@@ -124,6 +124,16 @@
 #define copy_from_user_ret(to,from,n,retval) \
   ({ if (copy_from_user (to,from,n)) return retval; })
 
+#define get_user_inatomic(x,ptr)					\
+({									\
+	int __ret;							\
+	inc_preempt_count();						\
+ 	__ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr)));	\
+ 	dec_preempt_count();						\
+ 	preempt_check_resched();					\
+	__ret;								\
+})
+
 /*
  * Copy a null terminated string from userspace.
  */
Index: linux-2.5/include/asm-x86_64/uaccess.h
===================================================================
--- linux-2.5.orig/include/asm-x86_64/uaccess.h	2005-02-04 00:27:21.000000000 -0600
+++ linux-2.5/include/asm-x86_64/uaccess.h	2005-02-23 11:03:09.000000000 -0600
@@ -235,6 +235,16 @@
 		: "=r"(err), ltype (x)				\
 		: "m"(__m(addr)), "i"(errno), "0"(err))
 
+#define get_user_inatomic(x,ptr)					\
+({									\
+	int __ret;							\
+	inc_preempt_count();						\
+ 	__ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr)));	\
+ 	dec_preempt_count();						\
+ 	preempt_check_resched();					\
+	__ret;								\
+})
+
 /*
  * Copy To/From Userspace
  */

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-23 17:10           ` Olof Johansson
@ 2005-02-23 17:37             ` Arjan van de Ven
  2005-02-23 18:22             ` Jamie Lokier
  1 sibling, 0 replies; 38+ messages in thread
From: Arjan van de Ven @ 2005-02-23 17:37 UTC (permalink / raw)
  To: Olof Johansson
  Cc: Linus Torvalds, Joe Korty, Jamie Lokier, Andrew Morton,
	linux-kernel, rusty

On Wed, 2005-02-23 at 11:10 -0600, Olof Johansson wrote:
> On Wed, Feb 23, 2005 at 07:54:06AM -0800, Linus Torvalds wrote:
> 
> > > Otherwise, a preempt attempt in get_user would not be seen
> > > until some future preempt_enable was executed.
> > 
> > True. I guess we should have a "preempt_check_resched()" there too. That's 
> > what "kunmap_atomic()" does too (which is what we rely on in the other 
> > case we do this..)
> 
> Ok, this is getting complex enough to warrant get_user_inatomic(),
> which means adding it to every arch's uaccess.h.
> 
> Below patch does so. Unfortunately I don't have a Viro setup with cross
> compilers for nearly every arch, so I can't make sure it doesn't break
> anything. But since I pasted the same code everywhere it shouldn't.

I hate to do this to you, but get_user is a bit horrible in that it is
an untyped interface. Fixing it is hard (ugh) but when adding new
variants should/could we consider to please make it a typed (eg inline
and not a define) interface please ?



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-23 17:10           ` Olof Johansson
  2005-02-23 17:37             ` Arjan van de Ven
@ 2005-02-23 18:22             ` Jamie Lokier
  2005-02-23 18:34               ` Linus Torvalds
  2005-02-23 18:37               ` Olof Johansson
  1 sibling, 2 replies; 38+ messages in thread
From: Jamie Lokier @ 2005-02-23 18:22 UTC (permalink / raw)
  To: Olof Johansson
  Cc: Linus Torvalds, Joe Korty, Andrew Morton, linux-kernel, rusty

Olof Johansson wrote:
> On Wed, Feb 23, 2005 at 07:54:06AM -0800, Linus Torvalds wrote:
> 
> > > Otherwise, a preempt attempt in get_user would not be seen
> > > until some future preempt_enable was executed.
> > 
> > True. I guess we should have a "preempt_check_resched()" there too. That's 
> > what "kunmap_atomic()" does too (which is what we rely on in the other 
> > case we do this..)
> 
> Ok, this is getting complex enough to warrant get_user_inatomic(),
> which means adding it to every arch's uaccess.h.
> 
> Below patch does so. Unfortunately I don't have a Viro setup with cross
> compilers for nearly every arch, so I can't make sure it doesn't break
> anything. But since I pasted the same code everywhere it shouldn't.

My turn to say uglee.

Firstly, get_user_inatomic is the wrong name.

"inatomic" in __copy_from_user_inatomic means it's called inside a
non-premptable region (in atomic...).

Your macro get_user_inatomic is _not_ called inside a
non-preemptable region, so it shouldn't be called "inatomic".

(A better name is get_user_no_paging).

Secondly, does this _one_ use (it's not likely to be used elsewhere)
justify copying & pasting the same code into every asm-*/uaccess,
especially when the code is not in any way arch-specific?

I suggest putting it into futex.c, and make it an inline function
which takes "u32 __user *".

-- Jamie

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-23 18:22             ` Jamie Lokier
@ 2005-02-23 18:34               ` Linus Torvalds
  2005-02-23 18:49                 ` Jamie Lokier
  2005-02-23 18:37               ` Olof Johansson
  1 sibling, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-02-23 18:34 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Olof Johansson, Joe Korty, Andrew Morton, linux-kernel, rusty



On Wed, 23 Feb 2005, Jamie Lokier wrote:
> 
> I suggest putting it into futex.c, and make it an inline function
> which takes "u32 __user *".

Agreed, except we've traditionally just made it "int __user *".

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-23 18:34               ` Linus Torvalds
@ 2005-02-23 18:49                 ` Jamie Lokier
  2005-02-23 19:12                   ` Olof Johansson
  0 siblings, 1 reply; 38+ messages in thread
From: Jamie Lokier @ 2005-02-23 18:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Olof Johansson, Joe Korty, Andrew Morton, linux-kernel, rusty

Linus Torvalds wrote:
> > I suggest putting it into futex.c, and make it an inline function
> > which takes "u32 __user *".
> 
> Agreed, except we've traditionally just made it "int __user *".

The type signatures in futex.c are a bit mixed up - most places say
"int __user *" but sys_futex() says "u32 __user *".  get_futex_key
uses sizeof(u32) to check the address.

-- Jamie

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-23 18:49                 ` Jamie Lokier
@ 2005-02-23 19:12                   ` Olof Johansson
  2005-02-23 22:00                     ` Linus Torvalds
  2005-02-24  0:00                     ` Jamie Lokier
  0 siblings, 2 replies; 38+ messages in thread
From: Olof Johansson @ 2005-02-23 19:12 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Linus Torvalds, Joe Korty, Andrew Morton, linux-kernel, rusty

On Wed, Feb 23, 2005 at 06:49:46PM +0000, Jamie Lokier wrote:
> Linus Torvalds wrote:
> > > I suggest putting it into futex.c, and make it an inline function
> > > which takes "u32 __user *".
> > 
> > Agreed, except we've traditionally just made it "int __user *".
> 
> The type signatures in futex.c are a bit mixed up - most places say
> "int __user *" but sys_futex() says "u32 __user *".  get_futex_key
> uses sizeof(u32) to check the address.

How's this? I went with get_val_no_fault(), since it isn't really a
get_user.*() any more (ptr being passed in), and no_paging is a little
misleading (not all faults are due to paging).


-----

Some futex functions do get_user calls while holding mmap_sem for
reading. If get_user() faults, and another thread happens to be in mmap
(or somewhere else holding waiting on down_write for the same semaphore),
then do_page_fault will deadlock. Most architectures seem to be exposed
to this.

To avoid it, make sure the page is available. If not, release the
semaphore, fault it in and retry.

I also found another exposure by inspection, moving some of the code
around avoids the possible deadlock there.

Signed-off-by: Olof Johansson <olof@austin.ibm.com>


Index: linux-2.5/kernel/futex.c
===================================================================
--- linux-2.5.orig/kernel/futex.c	2005-02-21 16:09:38.000000000 -0600
+++ linux-2.5/kernel/futex.c	2005-02-23 13:10:16.000000000 -0600
@@ -258,6 +258,18 @@
 	}
 }
 
+static inline int get_val_no_fault(int *dest, int __user *from)
+{
+	int ret;
+
+	inc_preempt_count();
+	ret = __copy_from_user_inatomic(dest, from, sizeof(int));
+	dec_preempt_count();
+	preempt_check_resched();
+
+	return ret;
+}
+
 /*
  * The hash bucket lock must be held when this is called.
  * Afterwards, the futex_q must not be accessed.
@@ -329,6 +341,7 @@
 	int ret, drop_count = 0;
 	unsigned int nqueued;
 
+ retry:
 	down_read(&current->mm->mmap_sem);
 
 	ret = get_futex_key(uaddr1, &key1);
@@ -355,9 +368,20 @@
 		   before *uaddr1.  */
 		smp_mb();
 
-		if (get_user(curval, (int __user *)uaddr1) != 0) {
-			ret = -EFAULT;
-			goto out;
+		ret = get_val_no_fault(&curval, (int __user *)uaddr1);
+
+		if (unlikely(ret)) {
+			/* If we would have faulted, release mmap_sem, fault
+			 * it in and start all over again.
+			 */
+			up_read(&current->mm->mmap_sem);
+
+			ret = get_user(curval, (int __user *)uaddr1);
+
+			if (!ret)
+				goto retry;
+
+			return ret;
 		}
 		if (curval != *valp) {
 			ret = -EAGAIN;
@@ -480,6 +504,7 @@
 	int ret, curval;
 	struct futex_q q;
 
+ retry:
 	down_read(&current->mm->mmap_sem);
 
 	ret = get_futex_key(uaddr, &q.key);
@@ -508,9 +533,23 @@
 	 * We hold the mmap semaphore, so the mapping cannot have changed
 	 * since we looked it up in get_futex_key.
 	 */
-	if (get_user(curval, (int __user *)uaddr) != 0) {
-		ret = -EFAULT;
-		goto out_unqueue;
+
+	ret = get_val_no_fault(&curval, (int __user *)uaddr);
+
+	if (unlikely(ret)) {
+		/* If we would have faulted, release mmap_sem, fault it in and
+		 * start all over again.
+		 */
+		up_read(&current->mm->mmap_sem);
+
+		if (!unqueue_me(&q)) /* There's a chance we got woken already */
+			return 0;
+
+		ret = get_user(curval, (int __user *)uaddr);
+
+		if (!ret)
+			goto retry;
+		return ret;
 	}
 	if (curval != val) {
 		ret = -EWOULDBLOCK;
Index: linux-2.5/mm/mempolicy.c
===================================================================
--- linux-2.5.orig/mm/mempolicy.c	2005-02-04 00:27:40.000000000 -0600
+++ linux-2.5/mm/mempolicy.c	2005-02-23 12:53:22.000000000 -0600
@@ -524,9 +524,13 @@
 	} else
 		pval = pol->policy;
 
-	err = -EFAULT;
+	if (vma) {
+		up_read(&current->mm->mmap_sem);
+		vma = NULL;
+	}
+
 	if (policy && put_user(pval, policy))
-		goto out;
+		return -EFAULT;
 
 	err = 0;
 	if (nmask) {

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-23 19:12                   ` Olof Johansson
@ 2005-02-23 22:00                     ` Linus Torvalds
  2005-02-24  0:00                     ` Jamie Lokier
  1 sibling, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-02-23 22:00 UTC (permalink / raw)
  To: Olof Johansson
  Cc: Jamie Lokier, Joe Korty, Andrew Morton, linux-kernel, rusty

On Wed, 23 Feb 2005, Olof Johansson wrote:
> 
> How's this? I went with get_val_no_fault(), since it isn't really a
> get_user.*() any more (ptr being passed in), and no_paging is a little
> misleading (not all faults are due to paging).

Applied with minor cosmetic changes. I'm like a dog who likes to pee on 
things to show his territory, so I changed "get_val_no_fault" to 
"get_futex_value_locked", and I made sure that the return value is 
sensible (return 0 or -EFAULT rather than the "__memcpy_from_user()" 
return value which is how many bytes we couldn't copy).

Not that we care (we just check the return value against zero anyway,
which is success in both cases), but the compiler should be able to
optimize it away, and it might avoid some confusion down the line..

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-23 19:12                   ` Olof Johansson
  2005-02-23 22:00                     ` Linus Torvalds
@ 2005-02-24  0:00                     ` Jamie Lokier
  1 sibling, 0 replies; 38+ messages in thread
From: Jamie Lokier @ 2005-02-24  0:00 UTC (permalink / raw)
  To: Olof Johansson
  Cc: Linus Torvalds, Joe Korty, Andrew Morton, linux-kernel, rusty

Olof Johansson wrote:
> How's this? I went with get_val_no_fault(), since it isn't really a
> get_user.*() any more (ptr being passed in), and no_paging is a little
> misleading (not all faults are due to paging).

How ironic: I deliberately didn't choose "no_fault" because that
function *does* take page faults.  It only disables paging operations! :)

-- Jamie

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-23 18:22             ` Jamie Lokier
  2005-02-23 18:34               ` Linus Torvalds
@ 2005-02-23 18:37               ` Olof Johansson
  1 sibling, 0 replies; 38+ messages in thread
From: Olof Johansson @ 2005-02-23 18:37 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Linus Torvalds, Joe Korty, Andrew Morton, linux-kernel, rusty,
	arjan

On Wed, Feb 23, 2005 at 06:22:04PM +0000, Jamie Lokier wrote:
> Olof Johansson wrote:
> > On Wed, Feb 23, 2005 at 07:54:06AM -0800, Linus Torvalds wrote:
> > 
> > > > Otherwise, a preempt attempt in get_user would not be seen
> > > > until some future preempt_enable was executed.
> > > 
> > > True. I guess we should have a "preempt_check_resched()" there too. That's 
> > > what "kunmap_atomic()" does too (which is what we rely on in the other 
> > > case we do this..)
> > 
> > Ok, this is getting complex enough to warrant get_user_inatomic(),
> > which means adding it to every arch's uaccess.h.
> > 
> > Below patch does so. Unfortunately I don't have a Viro setup with cross
> > compilers for nearly every arch, so I can't make sure it doesn't break
> > anything. But since I pasted the same code everywhere it shouldn't.
> 
> My turn to say uglee.

Yeah, I wasn't entirely happy about it, but it seems that suggestions
are coming in on how to do it better. :-)

> Firstly, get_user_inatomic is the wrong name.
> 
> "inatomic" in __copy_from_user_inatomic means it's called inside a
> non-premptable region (in atomic...).
> 
> Your macro get_user_inatomic is _not_ called inside a
> non-preemptable region, so it shouldn't be called "inatomic".
> 
> (A better name is get_user_no_paging).
>
> Secondly, does this _one_ use (it's not likely to be used elsewhere)
> justify copying & pasting the same code into every asm-*/uaccess,
> especially when the code is not in any way arch-specific?

Arjan suggested creating a linux/uaccess.h that includes asm/uaccess.h,
and start moving the users over since that's where the trend is moving
anyway (avoiding including asm/* from common code). futex.c would be
the first user, and could be followed by more later as a janitorial
patch. That'd mean only one addition of the common function instead of
having to add it to every arch.

> I suggest putting it into futex.c, and make it an inline function
> which takes "u32 __user *".

Sure, for now that's good enough. Above janitorial work could be done
later, if more users get introduced.


-Olof

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 21:07   ` Jamie Lokier
                       ` (2 preceding siblings ...)
  2005-02-22 21:30     ` Linus Torvalds
@ 2005-02-22 21:40     ` Andrew Morton
  2005-02-22 21:59       ` Linus Torvalds
  2005-02-23 11:42       ` David Howells
  3 siblings, 2 replies; 38+ messages in thread
From: Andrew Morton @ 2005-02-22 21:40 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: olof, linux-kernel, torvalds, rusty

Jamie Lokier <jamie@shareable.org> wrote:
>
> ...
> 
> > > One attempt to fix this is included below. It works, but I'm not entirely
> > > happy with the fact that it's a bit messy solution. If anyone has a
> > > better idea for how to solve it I'd be all ears.
> > 
> > It's fairly sane.  Style-wise I'd be inclined to turn this:
> > 
> > 	down_read(&current->mm->mmap_sem);
> > 	while (!check_user_page_readable(current->mm, uaddr1)) {
> > 		up_read(&current->mm->mmap_sem);
> > 		/* Fault in the page through get_user() but discard result */
> > 		if (get_user(curval, (int __user *)uaddr1) != 0)
> > 			return -EFAULT;
> > 		down_read(&current->mm->mmap_sem);
> > 	}
> 
> That won't work because the vma lock must be help between key
> calculation and get_user() - otherwise futex is not reliable.  It
> would work if the futex key calculation was inside the loop.

All the above is trying to do is to convert the initial down_read(mmap_sem)
into a function which, on exit, guarantees that

a) down_read(mmap_sem) is held and

b) the subsequent get_user() of that address will not generate a pagefault.

So it shouldn't affect the futex code's atomicity at all.

However the pte can get unmapped by memory reclaim so we could still take a
minor fault, and hit the same deadlock, yes?

> A much simpler solution (and sorry for not offering it earlier,
> because Andrew Morton pointed out this bug long ago, but I was busy), is:
> 
> In futex.c:
> 
> 	down_read(&current->mm->mmap_sem);
> 	get_futex_key(...) etc.
> 	queue_me(...) etc.
> 	current->flags |= PF_MMAP_SEM;             <- new
> 	ret = get_user(...);
> 	current->flags &= PF_MMAP_SEM;             <- new
> 	/* the rest */
> 
> And in arch/*/mm/fault.c, replace every one of these:
> 
> 	down_read(&mm->mmap_sem);
> 
> 	up_read(&mm->mmap_sem);
> 
> with these:
> 
> 	if (!(current & PF_MMAP_SEM))
> 		down_read(&mm->mmap_sem);
> 
> 	if (!(current & PF_MMAP_SEM))
> 		up_read(&mm->mmap_sem);
> 

Yes, that will work.  However I do feel that it's cleaner to localise this
nastiness into a single function which the futex code calls, rather than
spreading it all around and adding overhead to every pagefault.  If we can
work out how.

wrt this down_read/down_write/down_read deadlock: iirc, the reason why
down_write() takes precedence over down_read() is to avoid the permanent
writer starvation which would occur if there is heavy down_read() traffic.

As Linus points out, an alternative would be to do an inc_preempt_count()
around the offending get_user(), then use __copy_from_user_inatomic(), then
take some sort of remedial action if __copy_from_user_inatomic() returns a
fault.  Something like:

retry:
	if (get_user(uaddr) == -EFAULT)
		return -EFAULT;
	down_read(mmap_sem);
	inc_preempt_count();
	if (__copy_from_user_inatomic(..., uaddr)) {
		up_read(mmap_sem);
		dec_preempt_count();
		goto retry;
	}

	dec_preempt_count();
	up_read(mmap_sem);

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 21:40     ` Andrew Morton
@ 2005-02-22 21:59       ` Linus Torvalds
  2005-02-23 11:42       ` David Howells
  1 sibling, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-02-22 21:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jamie Lokier, olof, linux-kernel, rusty

On Tue, 22 Feb 2005, Andrew Morton wrote:
> 
> However the pte can get unmapped by memory reclaim so we could still take a
> minor fault, and hit the same deadlock, yes?

You _could_ fix that by getting the pagetable spinlock, I guess. Which
check_user_page_readable() assumes you'd be holding anyway (not holding it
would appear to be a bug).

At which point you might as well just walk the tables by hand and just do 
the read that way. Of course, then you have virtual aliasing issues etc.

Insane, but possible.

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH/RFC] Futex mmap_sem deadlock
  2005-02-22 21:40     ` Andrew Morton
  2005-02-22 21:59       ` Linus Torvalds
@ 2005-02-23 11:42       ` David Howells
  1 sibling, 0 replies; 38+ messages in thread
From: David Howells @ 2005-02-23 11:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jamie Lokier, olof, linux-kernel, torvalds, rusty

Andrew Morton <akpm@osdl.org> wrote:

> wrt this down_read/down_write/down_read deadlock: iirc, the reason why
> down_write() takes precedence over down_read() is to avoid the permanent
> writer starvation which would occur if there is heavy down_read() traffic.

down_write() doesn't actually take precedence over down_read() as such. The
rwsems try to implement a strict FIFO queue (the fairness thing).

David

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2005-02-24  0:12 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-02-22 19:06 [PATCH/RFC] Futex mmap_sem deadlock Olof Johansson
2005-02-22 19:36 ` Linus Torvalds
2005-02-22 21:16   ` Benjamin Herrenschmidt
2005-02-22 21:19     ` Benjamin Herrenschmidt
2005-02-22 21:31     ` Linus Torvalds
2005-02-22 21:42       ` Benjamin Herrenschmidt
2005-02-22 22:10         ` Linus Torvalds
2005-02-22 22:24           ` Benjamin Herrenschmidt
2005-02-22 23:08           ` Greg KH
2005-02-23 11:24   ` David Howells
2005-02-22 19:55 ` Andrew Morton
2005-02-22 21:07   ` Jamie Lokier
2005-02-22 21:19     ` Olof Johansson
2005-02-22 22:09       ` Jamie Lokier
2005-02-22 21:19     ` Chris Friesen
2005-02-22 21:27       ` Jamie Lokier
2005-02-22 21:30     ` Linus Torvalds
2005-02-22 22:34       ` Jamie Lokier
2005-02-22 22:42         ` Olof Johansson
2005-02-22 23:20           ` Andrew Morton
2005-02-22 23:23             ` Olof Johansson
2005-02-23 11:39       ` David Howells
2005-02-23 16:22         ` Olof Johansson
2005-02-23 18:44           ` David Howells
2005-02-23 14:49       ` Joe Korty
2005-02-23 15:54         ` Linus Torvalds
2005-02-23 17:10           ` Olof Johansson
2005-02-23 17:37             ` Arjan van de Ven
2005-02-23 18:22             ` Jamie Lokier
2005-02-23 18:34               ` Linus Torvalds
2005-02-23 18:49                 ` Jamie Lokier
2005-02-23 19:12                   ` Olof Johansson
2005-02-23 22:00                     ` Linus Torvalds
2005-02-24  0:00                     ` Jamie Lokier
2005-02-23 18:37               ` Olof Johansson
2005-02-22 21:40     ` Andrew Morton
2005-02-22 21:59       ` Linus Torvalds
2005-02-23 11:42       ` David Howells

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox