All of lore.kernel.org
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	juri.lelli@arm.com, bigeasy@linutronix.de, xlpang@redhat.com,
	rostedt@goodmis.org, mathieu.desnoyers@efficios.com,
	jdesfossez@efficios.com, dvhart@infradead.org,
	bristot@redhat.com, Thomas Gleixner <tglx@linutronix.de>,
	Lee Jones <lee.jones@linaro.org>
Subject: [PATCH 4.9 14/49] futex: Change locking rules
Date: Mon, 22 Feb 2021 13:36:12 +0100	[thread overview]
Message-ID: <20210222121025.901487433@linuxfoundation.org> (raw)
In-Reply-To: <20210222121022.546148341@linuxfoundation.org>

From: Peter Zijlstra <peterz@infradead.org>

Currently futex-pi relies on hb->lock to serialize everything. But hb->lock
creates another set of problems, especially priority inversions on RT where
hb->lock becomes a rt_mutex itself.

The rt_mutex::wait_lock is the most obvious protection for keeping the
futex user space value and the kernel internal pi_state in sync.

Rework and document the locking so rt_mutex::wait_lock is held accross all
operations which modify the user space value and the pi state.

This allows to invoke rt_mutex_unlock() (including deboost) without holding
hb->lock as a next step.

Nothing yet relies on the new locking rules.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.751993333@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[Lee: Back-ported in support of a previous futex back-port attempt]
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 kernel/futex.c |  138 ++++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 112 insertions(+), 26 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1019,6 +1019,39 @@ static void exit_pi_state_list(struct ta
  * [10] There is no transient state which leaves owner and user space
  *	TID out of sync. Except one error case where the kernel is denied
  *	write access to the user address, see fixup_pi_state_owner().
+ *
+ *
+ * Serialization and lifetime rules:
+ *
+ * hb->lock:
+ *
+ *	hb -> futex_q, relation
+ *	futex_q -> pi_state, relation
+ *
+ *	(cannot be raw because hb can contain arbitrary amount
+ *	 of futex_q's)
+ *
+ * pi_mutex->wait_lock:
+ *
+ *	{uval, pi_state}
+ *
+ *	(and pi_mutex 'obviously')
+ *
+ * p->pi_lock:
+ *
+ *	p->pi_state_list -> pi_state->list, relation
+ *
+ * pi_state->refcount:
+ *
+ *	pi_state lifetime
+ *
+ *
+ * Lock order:
+ *
+ *   hb->lock
+ *     pi_mutex->wait_lock
+ *       p->pi_lock
+ *
  */
 
 /*
@@ -1026,10 +1059,12 @@ static void exit_pi_state_list(struct ta
  * the pi_state against the user space value. If correct, attach to
  * it.
  */
-static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
+static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
+			      struct futex_pi_state *pi_state,
 			      struct futex_pi_state **ps)
 {
 	pid_t pid = uval & FUTEX_TID_MASK;
+	int ret, uval2;
 
 	/*
 	 * Userspace might have messed up non-PI and PI futexes [3]
@@ -1037,9 +1072,34 @@ static int attach_to_pi_state(u32 uval,
 	if (unlikely(!pi_state))
 		return -EINVAL;
 
+	/*
+	 * We get here with hb->lock held, and having found a
+	 * futex_top_waiter(). This means that futex_lock_pi() of said futex_q
+	 * has dropped the hb->lock in between queue_me() and unqueue_me_pi(),
+	 * which in turn means that futex_lock_pi() still has a reference on
+	 * our pi_state.
+	 */
 	WARN_ON(!atomic_read(&pi_state->refcount));
 
 	/*
+	 * Now that we have a pi_state, we can acquire wait_lock
+	 * and do the state validation.
+	 */
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+
+	/*
+	 * Since {uval, pi_state} is serialized by wait_lock, and our current
+	 * uval was read without holding it, it can have changed. Verify it
+	 * still is what we expect it to be, otherwise retry the entire
+	 * operation.
+	 */
+	if (get_futex_value_locked(&uval2, uaddr))
+		goto out_efault;
+
+	if (uval != uval2)
+		goto out_eagain;
+
+	/*
 	 * Handle the owner died case:
 	 */
 	if (uval & FUTEX_OWNER_DIED) {
@@ -1054,11 +1114,11 @@ static int attach_to_pi_state(u32 uval,
 			 * is not 0. Inconsistent state. [5]
 			 */
 			if (pid)
-				return -EINVAL;
+				goto out_einval;
 			/*
 			 * Take a ref on the state and return success. [4]
 			 */
-			goto out_state;
+			goto out_attach;
 		}
 
 		/*
@@ -1070,14 +1130,14 @@ static int attach_to_pi_state(u32 uval,
 		 * Take a ref on the state and return success. [6]
 		 */
 		if (!pid)
-			goto out_state;
+			goto out_attach;
 	} else {
 		/*
 		 * If the owner died bit is not set, then the pi_state
 		 * must have an owner. [7]
 		 */
 		if (!pi_state->owner)
-			return -EINVAL;
+			goto out_einval;
 	}
 
 	/*
@@ -1086,11 +1146,29 @@ static int attach_to_pi_state(u32 uval,
 	 * user space TID. [9/10]
 	 */
 	if (pid != task_pid_vnr(pi_state->owner))
-		return -EINVAL;
-out_state:
+		goto out_einval;
+
+out_attach:
 	atomic_inc(&pi_state->refcount);
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 	*ps = pi_state;
 	return 0;
+
+out_einval:
+	ret = -EINVAL;
+	goto out_error;
+
+out_eagain:
+	ret = -EAGAIN;
+	goto out_error;
+
+out_efault:
+	ret = -EFAULT;
+	goto out_error;
+
+out_error:
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+	return ret;
 }
 
 /**
@@ -1183,6 +1261,9 @@ static int attach_to_pi_owner(u32 uval,
 
 	/*
 	 * No existing pi state. First waiter. [2]
+	 *
+	 * This creates pi_state, we have hb->lock held, this means nothing can
+	 * observe this state, wait_lock is irrelevant.
 	 */
 	pi_state = alloc_pi_state();
 
@@ -1207,7 +1288,8 @@ static int attach_to_pi_owner(u32 uval,
 	return 0;
 }
 
-static int lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
+static int lookup_pi_state(u32 __user *uaddr, u32 uval,
+			   struct futex_hash_bucket *hb,
 			   union futex_key *key, struct futex_pi_state **ps,
 			   struct task_struct **exiting)
 {
@@ -1218,7 +1300,7 @@ static int lookup_pi_state(u32 uval, str
 	 * attach to the pi_state when the validation succeeds.
 	 */
 	if (match)
-		return attach_to_pi_state(uval, match->pi_state, ps);
+		return attach_to_pi_state(uaddr, uval, match->pi_state, ps);
 
 	/*
 	 * We are the first waiter - try to look up the owner based on
@@ -1237,7 +1319,7 @@ static int lock_pi_update_atomic(u32 __u
 	if (unlikely(cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)))
 		return -EFAULT;
 
-	/*If user space value changed, let the caller retry */
+	/* If user space value changed, let the caller retry */
 	return curval != uval ? -EAGAIN : 0;
 }
 
@@ -1301,7 +1383,7 @@ static int futex_lock_pi_atomic(u32 __us
 	 */
 	match = futex_top_waiter(hb, key);
 	if (match)
-		return attach_to_pi_state(uval, match->pi_state, ps);
+		return attach_to_pi_state(uaddr, uval, match->pi_state, ps);
 
 	/*
 	 * No waiter and user TID is 0. We are here because the
@@ -1441,6 +1523,7 @@ static int wake_futex_pi(u32 __user *uad
 
 	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)) {
 		ret = -EFAULT;
+
 	} else if (curval != uval) {
 		/*
 		 * If a unconditional UNLOCK_PI operation (user space did not
@@ -1977,7 +2060,7 @@ retry_private:
 			 * If that call succeeds then we have pi_state and an
 			 * initial refcount on it.
 			 */
-			ret = lookup_pi_state(ret, hb2, &key2,
+			ret = lookup_pi_state(uaddr2, ret, hb2, &key2,
 					      &pi_state, &exiting);
 		}
 
@@ -2282,7 +2365,6 @@ static int __fixup_pi_state_owner(u32 __
 	int err = 0;
 
 	oldowner = pi_state->owner;
-
 	/* Owner died? */
 	if (!pi_state->owner)
 		newtid |= FUTEX_OWNER_DIED;
@@ -2305,11 +2387,10 @@ static int __fixup_pi_state_owner(u32 __
 	 * because we can fault here. Imagine swapped out pages or a fork
 	 * that marked all the anonymous memory readonly for cow.
 	 *
-	 * Modifying pi_state _before_ the user space value would
-	 * leave the pi_state in an inconsistent state when we fault
-	 * here, because we need to drop the hash bucket lock to
-	 * handle the fault. This might be observed in the PID check
-	 * in lookup_pi_state.
+	 * Modifying pi_state _before_ the user space value would leave the
+	 * pi_state in an inconsistent state when we fault here, because we
+	 * need to drop the locks to handle the fault. This might be observed
+	 * in the PID check in lookup_pi_state.
 	 */
 retry:
 	if (!argowner) {
@@ -2367,21 +2448,26 @@ retry:
 	return argowner == current;
 
 	/*
-	 * To handle the page fault we need to drop the hash bucket
-	 * lock here. That gives the other task (either the highest priority
-	 * waiter itself or the task which stole the rtmutex) the
-	 * chance to try the fixup of the pi_state. So once we are
-	 * back from handling the fault we need to check the pi_state
-	 * after reacquiring the hash bucket lock and before trying to
-	 * do another fixup. When the fixup has been done already we
-	 * simply return.
+	 * To handle the page fault we need to drop the locks here. That gives
+	 * the other task (either the highest priority waiter itself or the
+	 * task which stole the rtmutex) the chance to try the fixup of the
+	 * pi_state. So once we are back from handling the fault we need to
+	 * check the pi_state after reacquiring the locks and before trying to
+	 * do another fixup. When the fixup has been done already we simply
+	 * return.
+	 *
+	 * Note: we hold both hb->lock and pi_mutex->wait_lock. We can safely
+	 * drop hb->lock since the caller owns the hb -> futex_q relation.
+	 * Dropping the pi_mutex->wait_lock requires the state revalidate.
 	 */
 handle_fault:
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 	spin_unlock(q->lock_ptr);
 
 	err = fault_in_user_writeable(uaddr);
 
 	spin_lock(q->lock_ptr);
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 
 	/*
 	 * Check if someone else fixed it for us:



  parent reply	other threads:[~2021-02-22 13:54 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-22 12:35 [PATCH 4.9 00/49] 4.9.258-rc1 review Greg Kroah-Hartman
2021-02-22 12:35 ` [PATCH 4.9 01/49] mm: memcontrol: fix NULL pointer crash in test_clear_page_writeback() Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 02/49] fgraph: Initialize tracing_graph_pause at task creation Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 03/49] remoteproc: qcom_q6v5_mss: Validate MBA firmware size before load Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 04/49] af_key: relax availability checks for skb size calculation Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 05/49] iwlwifi: mvm: take mutex for calling iwl_mvm_get_sync_time() Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 06/49] iwlwifi: pcie: add a NULL check in iwl_pcie_txq_unmap Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 07/49] iwlwifi: mvm: guard against device removal in reprobe Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 08/49] SUNRPC: Move simple_get_bytes and simple_get_netobj into private header Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 09/49] SUNRPC: Handle 0 length opaque XDR object data properly Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 10/49] lib/string: Add strscpy_pad() function Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 11/49] include/trace/events/writeback.h: fix -Wstringop-truncation warnings Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 12/49] memcg: fix a crash in wb_workfn when a device disappears Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 13/49] futex: Ensure the correct return value from futex_lock_pi() Greg Kroah-Hartman
2021-02-22 12:36 ` Greg Kroah-Hartman [this message]
2021-02-22 12:36 ` [PATCH 4.9 15/49] futex: Cure exit race Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 16/49] squashfs: add more sanity checks in id lookup Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 17/49] squashfs: add more sanity checks in inode lookup Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 18/49] squashfs: add more sanity checks in xattr id lookup Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 19/49] tracing: Do not count ftrace events in top level enable output Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 20/49] tracing: Check length before giving out the filter buffer Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 21/49] ovl: skip getxattr of security labels Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 22/49] ARM: dts: lpc32xx: Revert set default clock rate of HCLK PLL Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 23/49] memblock: do not start bottom-up allocations with kernel_end Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 24/49] bpf: Check for integer overflow when using roundup_pow_of_two() Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 25/49] netfilter: xt_recent: Fix attempt to update deleted entry Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 26/49] xen/netback: avoid race in xenvif_rx_ring_slots_available() Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 27/49] netfilter: conntrack: skip identical origin tuple in same zone only Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 28/49] h8300: fix PREEMPTION build, TI_PRE_COUNT undefined Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 29/49] usb: dwc3: ulpi: fix checkpatch warning Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 30/49] usb: dwc3: ulpi: Replace CPU-based busyloop with Protocol-based one Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 31/49] net/vmw_vsock: improve locking in vsock_connect_timeout() Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 32/49] net: watchdog: hold device global xmit lock during tx disable Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 33/49] vsock/virtio: update credit only if socket is not closed Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 34/49] vsock: fix locking in vsock_shutdown() Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 35/49] x86/build: Disable CET instrumentation in the kernel for 32-bit too Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 36/49] trace: Use -mcount-record for dynamic ftrace Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 37/49] tracing: Fix SKIP_STACK_VALIDATION=1 build due to bad merge with -mrecord-mcount Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 38/49] tracing: Avoid calling cc-option -mrecord-mcount for every Makefile Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 39/49] Xen/x86: dont bail early from clear_foreign_p2m_mapping() Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 40/49] Xen/x86: also check kernel mapping in set_foreign_p2m_mapping() Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 41/49] Xen/gntdev: correct dev_bus_addr handling in gntdev_map_grant_pages() Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 42/49] Xen/gntdev: correct error checking " Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 43/49] xen/arm: dont ignore return errors from set_phys_to_machine Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 44/49] xen-blkback: dont "handle" error by BUG() Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 45/49] xen-netback: " Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 46/49] xen-scsiback: " Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 47/49] xen-blkback: fix error handling in xen_blkbk_map() Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 48/49] scsi: qla2xxx: Fix crash during driver load on big endian machines Greg Kroah-Hartman
2021-02-22 12:36 ` [PATCH 4.9 49/49] kvm: check tlbs_dirty directly Greg Kroah-Hartman
2021-02-22 18:24 ` [PATCH 4.9 00/49] 4.9.258-rc1 review Florian Fainelli
2021-02-22 21:27 ` Guenter Roeck
2021-02-23 12:04 ` Naresh Kamboju
2021-02-23 13:47 ` Jon Hunter
2021-02-23 21:19 ` Shuah Khan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210222121025.901487433@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=bigeasy@linutronix.de \
    --cc=bristot@redhat.com \
    --cc=dvhart@infradead.org \
    --cc=jdesfossez@efficios.com \
    --cc=juri.lelli@arm.com \
    --cc=lee.jones@linaro.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=stable@vger.kernel.org \
    --cc=tglx@linutronix.de \
    --cc=xlpang@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.