linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	alan@lxorguk.ukuu.org.uk, Darren Hart <dvhart@linux.intel.com>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: [ 48/57] futex: Handle futex_pi OWNER_DIED take over correctly
Date: Wed, 14 Nov 2012 20:11:56 -0800	[thread overview]
Message-ID: <20121115040936.100205267@linuxfoundation.org> (raw)
In-Reply-To: <20121115040933.223998671@linuxfoundation.org>

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Thomas Gleixner <tglx@linutronix.de>

commit 59fa6245192159ab5e1e17b8e31f15afa9cff4bf upstream.

Siddhesh analyzed a failure in the take over of pi futexes in case the
owner died and provided a workaround.
See: http://sourceware.org/bugzilla/show_bug.cgi?id=14076

The detailed problem analysis shows:

Futex F is initialized with PTHREAD_PRIO_INHERIT and
PTHREAD_MUTEX_ROBUST_NP attributes.

T1 lock_futex_pi(F);

T2 lock_futex_pi(F);
   --> T2 blocks on the futex and creates pi_state which is associated
       to T1.

T1 exits
   --> exit_robust_list() runs
       --> Futex F userspace value TID field is set to 0 and
           FUTEX_OWNER_DIED bit is set.

T3 lock_futex_pi(F);
   --> Succeeds due to the check for F's userspace TID field == 0
   --> Claims ownership of the futex and sets its own TID into the
       userspace TID field of futex F
   --> returns to user space

T1 --> exit_pi_state_list()
       --> Transfers pi_state to waiter T2 and wakes T2 via
       	   rt_mutex_unlock(&pi_state->mutex)

T2 --> acquires pi_state->mutex and gains real ownership of the
       pi_state
   --> Claims ownership of the futex and sets its own TID into the
       userspace TID field of futex F
   --> returns to user space

T3 --> observes inconsistent state

This problem is independent of UP/SMP, preemptible/non preemptible
kernels, or process shared vs. private. The only difference is that
certain configurations are more likely to expose it.

So as Siddhesh correctly analyzed the following check in
futex_lock_pi_atomic() is the culprit:

	if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {

We check the userspace value for a TID value of 0 and take over the
futex unconditionally if that's true.

AFAICT this check is there as it is correct for a different corner
case of futexes: the WAITERS bit became stale.

Now the proposed change

-	if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
+       if (unlikely(ownerdied ||
+                       !(curval & (FUTEX_TID_MASK | FUTEX_WAITERS)))) {

solves the problem, but it's not obvious why and it wreckages the
"stale WAITERS bit" case.

What happens is, that due to the WAITERS bit being set (T2 is blocked
on that futex) it enforces T3 to go through lookup_pi_state(), which
in the above case returns an existing pi_state and therefor forces T3
to legitimately fight with T2 over the ownership of the pi_state (via
pi_state->mutex). Probelm solved!

Though that does not work for the "WAITERS bit is stale" problem
because if lookup_pi_state() does not find existing pi_state it
returns -ERSCH (due to TID == 0) which causes futex_lock_pi() to
return -ESRCH to user space because the OWNER_DIED bit is not set.

Now there is a different solution to that problem. Do not look at the
user space value at all and enforce a lookup of possibly available
pi_state. If pi_state can be found, then the new incoming locker T3
blocks on that pi_state and legitimately races with T2 to acquire the
rt_mutex and the pi_state and therefor the proper ownership of the
user space futex.

lookup_pi_state() has the correct order of checks. It first tries to
find a pi_state associated with the user space futex and only if that
fails it checks for futex TID value = 0. If no pi_state is available
nothing can create new state at that point because this happens with
the hash bucket lock held.

So the above scenario changes to:

T1 lock_futex_pi(F);

T2 lock_futex_pi(F);
   --> T2 blocks on the futex and creates pi_state which is associated
       to T1.

T1 exits
   --> exit_robust_list() runs
       --> Futex F userspace value TID field is set to 0 and
           FUTEX_OWNER_DIED bit is set.

T3 lock_futex_pi(F);
   --> Finds pi_state and blocks on pi_state->rt_mutex

T1 --> exit_pi_state_list()
       --> Transfers pi_state to waiter T2 and wakes it via
       	   rt_mutex_unlock(&pi_state->mutex)

T2 --> acquires pi_state->mutex and gains ownership of the pi_state
   --> Claims ownership of the futex and sets its own TID into the
       userspace TID field of futex F
   --> returns to user space

This covers all gazillion points on which T3 might come in between
T1's exit_robust_list() clearing the TID field and T2 fixing it up. It
also solves the "WAITERS bit stale" problem by forcing the take over.

Another benefit of changing the code this way is that it makes it less
dependent on untrusted user space values and therefor minimizes the
possible wreckage which might be inflicted.

As usual after staring for too long at the futex code my brain hurts
so much that I really want to ditch that whole optimization of
avoiding the syscall for the non contended case for PI futexes and rip
out the maze of corner case handling code. Unfortunately we can't as
user space relies on that existing behaviour, but at least thinking
about it helps me to preserve my mental sanity. Maybe we should
nevertheless :)

Reported-and-tested-by: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1210232138540.2756@ionos
Acked-by: Darren Hart <dvhart@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 kernel/futex.c |   41 ++++++++++++++++++++++-------------------
 1 file changed, 22 insertions(+), 19 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -716,7 +716,7 @@ static int futex_lock_pi_atomic(u32 __us
 				struct futex_pi_state **ps,
 				struct task_struct *task, int set_waiters)
 {
-	int lock_taken, ret, ownerdied = 0;
+	int lock_taken, ret, force_take = 0;
 	u32 uval, newval, curval, vpid = task_pid_vnr(task);
 
 retry:
@@ -755,17 +755,15 @@ retry:
 	newval = curval | FUTEX_WAITERS;
 
 	/*
-	 * There are two cases, where a futex might have no owner (the
-	 * owner TID is 0): OWNER_DIED. We take over the futex in this
-	 * case. We also do an unconditional take over, when the owner
-	 * of the futex died.
-	 *
-	 * This is safe as we are protected by the hash bucket lock !
+	 * Should we force take the futex? See below.
 	 */
-	if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
-		/* Keep the OWNER_DIED bit */
+	if (unlikely(force_take)) {
+		/*
+		 * Keep the OWNER_DIED and the WAITERS bit and set the
+		 * new TID value.
+		 */
 		newval = (curval & ~FUTEX_TID_MASK) | vpid;
-		ownerdied = 0;
+		force_take = 0;
 		lock_taken = 1;
 	}
 
@@ -775,7 +773,7 @@ retry:
 		goto retry;
 
 	/*
-	 * We took the lock due to owner died take over.
+	 * We took the lock due to forced take over.
 	 */
 	if (unlikely(lock_taken))
 		return 1;
@@ -790,20 +788,25 @@ retry:
 		switch (ret) {
 		case -ESRCH:
 			/*
-			 * No owner found for this futex. Check if the
-			 * OWNER_DIED bit is set to figure out whether
-			 * this is a robust futex or not.
+			 * We failed to find an owner for this
+			 * futex. So we have no pi_state to block
+			 * on. This can happen in two cases:
+			 *
+			 * 1) The owner died
+			 * 2) A stale FUTEX_WAITERS bit
+			 *
+			 * Re-read the futex value.
 			 */
 			if (get_futex_value_locked(&curval, uaddr))
 				return -EFAULT;
 
 			/*
-			 * We simply start over in case of a robust
-			 * futex. The code above will take the futex
-			 * and return happy.
+			 * If the owner died or we have a stale
+			 * WAITERS bit the owner TID in the user space
+			 * futex is 0.
 			 */
-			if (curval & FUTEX_OWNER_DIED) {
-				ownerdied = 1;
+			if (!(curval & FUTEX_TID_MASK)) {
+				force_take = 1;
 				goto retry;
 			}
 		default:



  parent reply	other threads:[~2012-11-15  4:16 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-11-15  4:11 [ 00/57] 3.4.19-stable review Greg Kroah-Hartman
2012-11-15  4:11 ` [ 01/57] xen/gntdev: dont leak memory from IOCTL_GNTDEV_MAP_GRANT_REF Greg Kroah-Hartman
2012-11-15  4:11 ` [ 02/57] xen/mmu: Use Xen specific TLB flush instead of the generic one Greg Kroah-Hartman
2012-11-15  4:11 ` [ 03/57] Input: tsc40 - remove wrong announcement of pressure support Greg Kroah-Hartman
2012-11-15  4:11 ` [ 04/57] ath9k: fix stale pointers potentially causing access to freed skbs Greg Kroah-Hartman
2012-11-15  4:11 ` [ 05/57] ath9k: Test for TID only in BlockAcks while checking tx status Greg Kroah-Hartman
2012-11-15  4:11 ` [ 06/57] rt2800: validate step value for temperature compensation Greg Kroah-Hartman
2012-11-15  4:11 ` [ 07/57] target: Dont return success from module_init() if setup fails Greg Kroah-Hartman
2012-11-15  4:11 ` [ 08/57] target: Avoid integer overflow in se_dev_align_max_sectors() Greg Kroah-Hartman
2012-11-15  4:11 ` [ 09/57] iscsi-target: Fix missed wakeup race in TX thread Greg Kroah-Hartman
2012-11-15  4:11 ` [ 10/57] target: Fix incorrect usage of nested IRQ spinlocks in ABORT_TASK path Greg Kroah-Hartman
2012-11-15  4:11 ` [ 11/57] cfg80211: fix antenna gain handling Greg Kroah-Hartman
2012-11-15  4:11 ` [ 12/57] wireless: drop invalid mesh address extension frames Greg Kroah-Hartman
2012-11-15  4:11 ` [ 13/57] mac80211: use blacklist for duplicate IE check Greg Kroah-Hartman
2012-11-15  4:11 ` [ 14/57] mac80211: Only process mesh config header on frames that RA_MATCH Greg Kroah-Hartman
2012-11-15  4:11 ` [ 15/57] mac80211: dont inspect Sequence Control field on control frames Greg Kroah-Hartman
2012-11-15  4:11 ` [ 16/57] DRM/Radeon: Fix Load Detection on legacy primary DAC Greg Kroah-Hartman
2012-11-15  4:11 ` [ 17/57] drm/udl: fix stride issues scanning out stride != width*bpp Greg Kroah-Hartman
2012-11-15  4:11 ` [ 18/57] mac80211: check management frame header length Greg Kroah-Hartman
2012-11-15  4:11 ` [ 19/57] mac80211: verify that skb data is present Greg Kroah-Hartman
2012-11-15  4:11 ` [ 20/57] mac80211: make sure data is accessible in EAPOL check Greg Kroah-Hartman
2012-11-15  4:11 ` [ 21/57] mac80211: fix SSID copy on IBSS JOIN Greg Kroah-Hartman
2012-11-15  4:11 ` [ 22/57] nfsv3: Make v3 mounts fail with ETIMEDOUTs instead EIO on mountd timeouts Greg Kroah-Hartman
2012-11-15  4:11 ` [ 23/57] nfs: Show original device name verbatim in /proc/*/mount{s,info} Greg Kroah-Hartman
2012-11-15  4:11 ` [ 24/57] NFSv4: nfs4_locku_done must release the sequence id Greg Kroah-Hartman
2012-11-15  4:11 ` [ 25/57] NFSv4.1: We must release the sequence id when we fail to get a session slot Greg Kroah-Hartman
2012-11-15  4:11 ` [ 26/57] nfsd: add get_uint for u32s Greg Kroah-Hartman
2012-11-15  4:11 ` [ 27/57] NFS: fix bug in legacy DNS resolver Greg Kroah-Hartman
2012-11-15  4:11 ` [ 28/57] NFS: Fix Oopses in nfs_lookup_revalidate and nfs4_lookup_revalidate Greg Kroah-Hartman
2012-11-15  4:11 ` [ 29/57] drm: restore open_count if drm_setup fails Greg Kroah-Hartman
2012-11-15  4:11 ` [ 30/57] hwmon: (w83627ehf) Force initial bank selection Greg Kroah-Hartman
2012-11-15  4:11 ` [ 31/57] ALSA: PCM: Fix some races at disconnection Greg Kroah-Hartman
2012-11-15  4:11 ` [ 32/57] ALSA: usb-audio: Fix " Greg Kroah-Hartman
2012-11-15  4:11 ` [ 33/57] ALSA: usb-audio: Use rwsem for disconnect protection Greg Kroah-Hartman
2012-11-15  4:11 ` [ 34/57] ALSA: usb-audio: Fix races at disconnection in mixer_quirks.c Greg Kroah-Hartman
2012-11-15  4:11 ` [ 35/57] ALSA: Add a reference counter to card instance Greg Kroah-Hartman
2012-11-15  4:11 ` [ 36/57] ALSA: Avoid endless sleep after disconnect Greg Kroah-Hartman
2012-11-15  4:11 ` [ 37/57] sctp: fix call to SCTP_CMD_PROCESS_SACK in sctp_cmd_interpreter() Greg Kroah-Hartman
2012-11-15  4:11 ` [ 38/57] netlink: use kfree_rcu() in netlink_release() Greg Kroah-Hartman
2012-11-15  4:11 ` [ 39/57] tcp: fix FIONREAD/SIOCINQ Greg Kroah-Hartman
2012-11-15  4:11 ` [ 40/57] ipv6: Set default hoplimit as zero Greg Kroah-Hartman
2012-11-15  4:11 ` [ 41/57] net: usb: Fix memory leak on Tx data path Greg Kroah-Hartman
2012-11-15  4:11 ` [ 42/57] net: fix divide by zero in tcp algorithm illinois Greg Kroah-Hartman
2012-11-15  4:11 ` [ 43/57] drivers/net/ethernet/nxp/lpc_eth.c: Call mdiobus_unregister before mdiobus_free Greg Kroah-Hartman
2012-11-15  4:11 ` [ 44/57] l2tp: fix oops in l2tp_eth_create() error path Greg Kroah-Hartman
2012-11-15  4:11 ` [ 45/57] net: inet_diag -- Return error code if protocol handler is missed Greg Kroah-Hartman
2012-11-15  4:11 ` [ 46/57] af-packet: fix oops when socket is not present Greg Kroah-Hartman
2012-11-15  4:11 ` [ 47/57] ipv6: send unsolicited neighbour advertisements to all-nodes Greg Kroah-Hartman
2012-11-15  4:11 ` Greg Kroah-Hartman [this message]
2012-11-15  4:11 ` [ 49/57] mmc: sdhci: fix NULL dereference in sdhci_request() tuning Greg Kroah-Hartman
2012-11-15  4:11 ` [ 50/57] drm/vmwgfx: Fix hibernation device reset Greg Kroah-Hartman
2012-11-15  4:11 ` [ 51/57] drm/vmwgfx: Fix a case where the code would BUG when trying to pin GMR memory Greg Kroah-Hartman
2012-11-15  4:12 ` [ 52/57] drm/radeon/cayman: add some missing regs to the VM reg checker Greg Kroah-Hartman
2012-11-15  4:12 ` [ 53/57] drm/radeon/si: " Greg Kroah-Hartman
2012-11-15  4:12 ` [ 54/57] drm/i915: fixup infoframe support for sdvo Greg Kroah-Hartman
2012-11-15  4:12 ` [ 55/57] drm/i915: clear the entire sdvo infoframe buffer Greg Kroah-Hartman
2012-11-15  4:12 ` [ 56/57] USB: mos7840: remove unused variable Greg Kroah-Hartman
2012-11-15  4:12 ` [ 57/57] xfs: fix reading of wrapped log data Greg Kroah-Hartman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20121115040936.100205267@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=alan@lxorguk.ukuu.org.uk \
    --cc=dvhart@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).