public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH/RFC] Futex mmap_sem deadlock
@ 2005-02-22 19:06 Olof Johansson
  2005-02-22 19:36 ` Linus Torvalds
  2005-02-22 19:55 ` Andrew Morton
  0 siblings, 2 replies; 38+ messages in thread
From: Olof Johansson @ 2005-02-22 19:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, torvalds, jamie, rusty

Hi,

Consider a small testcase that spawns off two threads, either thread
doing a loop of:

	buf = mmap /dev/zero MAP_SHARED for 0x100000 bytes
	call sys_futex (buf+page, FUTEX_WAIT, 1, NULL, NULL) for each page in said mmap
	munmap(buf)
	repeat

This will quickly lock up, since the futex_wait code dows a
down_read(mmap_sem), then a get_user().

The do_page_fault code on ppc64 (as well as other architectures) needs
to take the same semaphore for reading. This is all good until the
second thread comes into play: Its mmap call tries to take the same
semaphore for writing which causes in the do_page_fault down_read()
to get stuck. Classic deadlock.

Stacks are as follows:

a.out         S 000000000ff9772c 12288  6491   6484  6492               (NOTLB)
Call Trace:
[c0000003eb72fa10] [0000000030ecbd28] 0x30ecbd28 (unreliable)
[c0000003eb72fbe0] [c0000000000119ec] .__switch_to+0xa4/0xf0
[c0000003eb72fc70] [c00000000043ffbc] .schedule+0x4d4/0xe5c
[c0000003eb72fd90] [c0000000000247c8] .sys32_rt_sigsuspend+0xe0/0x120
[c0000003eb72fe30] [c00000000000d7e8] .ppc32_rt_sigsuspend+0x8/0x14
a.out         D 000000000ff0ad38 13632  6492   6491  6493               (NOTLB)
Call Trace:
[c00000003a1a7830] [c00000003a1a7850] 0xc00000003a1a7850 (unreliable)
[c00000003a1a7a00] [c0000000000119ec] .__switch_to+0xa4/0xf0
[c00000003a1a7a90] [c00000000043ffbc] .schedule+0x4d4/0xe5c
[c00000003a1a7bb0] [c000000000441ec0] .rwsem_down_write_failed+0x138/0x2f0
[c00000003a1a7c90] [c000000000098db0] .sys_mprotect+0x7bc/0x81c
[c00000003a1a7e30] [c00000000000d600] syscall_exit+0x0/0x18
a.out         D 0000000010000654 12832  6493   6492                     (NOTLB)
Call Trace:
[c00000000f2674f0] [c0000000000119ec] .__switch_to+0xa4/0xf0
[c00000000f267580] [c00000000043ffbc] .schedule+0x4d4/0xe5c
[c00000000f2676a0] [c0000000004421b4] .rwsem_down_read_failed+0x13c/0x2f4
[c00000000f267780] [c00000000003ba6c] .do_page_fault+0x66c/0x738
[c00000000f267900] [c00000000000b2dc] .handle_page_fault+0x20/0x54
--- Exception: 301 at .do_futex+0x5d4/0x720
    LR = .do_futex+0x1e0/0x720
[c00000000f267d60] [c000000000078124] .compat_sys_futex+0xa4/0x16c
[c00000000f267e30] [c00000000000d600] syscall_exit+0x0/0x18


One attempt to fix this is included below. It works, but I'm not entirely
happy with the fact that it's a bit messy solution. If anyone has a
better idea for how to solve it I'd be all ears.

Auditing other read-takers of mmap_sem, I found one more exposure that
could be solved by just moving code around.

-----

Some futex functions do get_user calls while holding mmap_sem for
reading. If get_user() faults, and another thread happens to be in mmap
(or somewhere else holding waiting on down_write for the same semaphore),
then do_page_fault will deadlock. Most architectures seem to be exposed
to this.

To avoid it, make sure the page is available. If not, release the
semaphore, fault it in and retake it.

I also found another exposure by inspection, moving some of the code
around avoids the possible deadlock there.

Signed-off-by: Olof Johansson <olof@austin.ibm.com>


Index: linux-2.5/kernel/futex.c
===================================================================
--- linux-2.5.orig/kernel/futex.c	2005-02-21 16:09:38.000000000 -0600
+++ linux-2.5/kernel/futex.c	2005-02-22 12:35:01.000000000 -0600
@@ -326,11 +326,22 @@
 	struct futex_hash_bucket *bh1, *bh2;
 	struct list_head *head1;
 	struct futex_q *this, *next;
-	int ret, drop_count = 0;
+	int ret, curval, drop_count = 0;
 	unsigned int nqueued;
 
 	down_read(&current->mm->mmap_sem);
 
+	/* Make sure we can access the page, since there's a risk to
+	 * deadlock with do_page_fault() if get_user() faults.
+	 */
+	while (!check_user_page_readable(current->mm, uaddr1)) {
+		up_read(&current->mm->mmap_sem);
+		/* Fault in the page through get_user() but discard result */
+		if (get_user(curval, (int __user *)uaddr1) != 0)
+			return -EFAULT;
+		down_read(&current->mm->mmap_sem);
+	}
+
 	ret = get_futex_key(uaddr1, &key1);
 	if (unlikely(ret != 0))
 		goto out;
@@ -343,8 +354,6 @@
 
 	nqueued = bh1->nqueued;
 	if (likely(valp != NULL)) {
-		int curval;
-
 		/* In order to avoid doing get_user while
 		   holding bh1->lock and bh2->lock, nqueued
 		   (monotonically increasing field) must be first
@@ -482,6 +491,19 @@
 
 	down_read(&current->mm->mmap_sem);
 
+	/* Make sure we can access the page, since there's a risk to
+	 * deadlock with do_page_fault() if get_user() faults.
+	 */
+	while (!check_user_page_readable(current->mm, uaddr)) {
+		up_read(&current->mm->mmap_sem);
+
+		/* Fault in the page through get_user() but discard result */
+		if (get_user(curval, (int __user *)uaddr) != 0)
+			return -EFAULT;
+
+		down_read(&current->mm->mmap_sem);
+	}
+
 	ret = get_futex_key(uaddr, &q.key);
 	if (unlikely(ret != 0))
 		goto out_release_sem;
Index: linux-2.5/mm/mempolicy.c
===================================================================
--- linux-2.5.orig/mm/mempolicy.c	2005-02-04 00:27:40.000000000 -0600
+++ linux-2.5/mm/mempolicy.c	2005-02-21 16:43:08.000000000 -0600
@@ -486,6 +486,7 @@
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma = NULL;
 	struct mempolicy *pol = current->mempolicy;
+	DECLARE_BITMAP(nodes, MAX_NUMNODES);
 
 	if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
 		return -EINVAL;
@@ -524,16 +525,21 @@
 	} else
 		pval = pol->policy;
 
+	if (nmask)
+		get_zonemask(pol, nodes);
+
+	if (vma) {
+		up_read(&current->mm->mmap_sem);
+		vma = NULL;
+	}
+
 	err = -EFAULT;
 	if (policy && put_user(pval, policy))
 		goto out;
 
 	err = 0;
-	if (nmask) {
-		DECLARE_BITMAP(nodes, MAX_NUMNODES);
-		get_zonemask(pol, nodes);
+	if (nmask)
 		err = copy_nodes_to_user(nmask, maxnode, nodes, sizeof(nodes));
-	}
 
  out:
 	if (vma)

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2005-02-24  0:12 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-02-22 19:06 [PATCH/RFC] Futex mmap_sem deadlock Olof Johansson
2005-02-22 19:36 ` Linus Torvalds
2005-02-22 21:16   ` Benjamin Herrenschmidt
2005-02-22 21:19     ` Benjamin Herrenschmidt
2005-02-22 21:31     ` Linus Torvalds
2005-02-22 21:42       ` Benjamin Herrenschmidt
2005-02-22 22:10         ` Linus Torvalds
2005-02-22 22:24           ` Benjamin Herrenschmidt
2005-02-22 23:08           ` Greg KH
2005-02-23 11:24   ` David Howells
2005-02-22 19:55 ` Andrew Morton
2005-02-22 21:07   ` Jamie Lokier
2005-02-22 21:19     ` Olof Johansson
2005-02-22 22:09       ` Jamie Lokier
2005-02-22 21:19     ` Chris Friesen
2005-02-22 21:27       ` Jamie Lokier
2005-02-22 21:30     ` Linus Torvalds
2005-02-22 22:34       ` Jamie Lokier
2005-02-22 22:42         ` Olof Johansson
2005-02-22 23:20           ` Andrew Morton
2005-02-22 23:23             ` Olof Johansson
2005-02-23 11:39       ` David Howells
2005-02-23 16:22         ` Olof Johansson
2005-02-23 18:44           ` David Howells
2005-02-23 14:49       ` Joe Korty
2005-02-23 15:54         ` Linus Torvalds
2005-02-23 17:10           ` Olof Johansson
2005-02-23 17:37             ` Arjan van de Ven
2005-02-23 18:22             ` Jamie Lokier
2005-02-23 18:34               ` Linus Torvalds
2005-02-23 18:49                 ` Jamie Lokier
2005-02-23 19:12                   ` Olof Johansson
2005-02-23 22:00                     ` Linus Torvalds
2005-02-24  0:00                     ` Jamie Lokier
2005-02-23 18:37               ` Olof Johansson
2005-02-22 21:40     ` Andrew Morton
2005-02-22 21:59       ` Linus Torvalds
2005-02-23 11:42       ` David Howells

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox