public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Waiman Long <longman@redhat.com>
To: Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Jonathan Corbet <corbet@lwn.net>
Cc: linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Mike Galbraith <umgwanakikbuti@gmail.com>,
	Scott J Norton <scott.norton@hpe.com>,
	Waiman Long <longman@redhat.com>
Subject: [PATCH v4 09/20] TP-futex: Implement lock handoff to prevent lock starvation
Date: Thu, 29 Dec 2016 11:13:35 -0500	[thread overview]
Message-ID: <1483028026-10305-10-git-send-email-longman@redhat.com> (raw)
In-Reply-To: <1483028026-10305-1-git-send-email-longman@redhat.com>

The current TP futexes has no guarantee that the top waiter
(serialization mutex owner) can get the lock within a finite time.
As a result, lock starvation can happen.

A lock handoff mechanism is added to the TP futexes to prevent lock
starvation from happening. The idea is that the top waiter can set a
special handoff_pid variable with its own pid. The futex owner will
then check this variable at unlock time to see if it should free the
futex or change its value to the designated pid to transfer the lock
directly to the top waiter.

This change does have the effect of limiting the amount of unfairness
and hence may adversely affect performance depending on the workloads.

Currently the handoff mechanism is triggered when the top waiter fails
to acquire the lock after about 5ms of spinning and sleeping. So the
maximum lock handoff rate is about 200 handoffs per second.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/futex.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 53 insertions(+), 6 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 2d3ec8d..711a2b4 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -63,6 +63,7 @@
 #include <linux/pid.h>
 #include <linux/nsproxy.h>
 #include <linux/ptrace.h>
+#include <linux/sched.h>
 #include <linux/sched/rt.h>
 #include <linux/hugetlb.h>
 #include <linux/freezer.h>
@@ -235,6 +236,8 @@ struct futex_state {
 	struct task_struct *mutex_owner;	/* For TP futexes only */
 	atomic_t refcount;
 
+	u32 handoff_pid;			/* For TP futexes only */
+
 	enum futex_type type;
 	union futex_key key;
 };
@@ -912,6 +915,7 @@ static int refill_futex_state_cache(void)
 	/* pi_mutex gets initialized later */
 	state->owner = NULL;
 	state->mutex_owner = NULL;
+	state->handoff_pid = 0;
 	atomic_set(&state->refcount, 1);
 	state->key = FUTEX_KEY_INIT;
 
@@ -3335,8 +3339,9 @@ void exit_robust_list(struct task_struct *curr)
  * and unlocking.
  *
  * The purpose of this FUTEX_WAITERS bit is to make the unlocker wake up the
- * serialization mutex owner. Not having the FUTEX_WAITERS bit set doesn't
- * mean there is no waiter in the kernel.
+ * serialization mutex owner or to hand off the lock directly to the top
+ * waiter. Not having the FUTEX_WAITERS bit set doesn't mean there is no
+ * waiter in the kernel.
  *
  * Like PI futexes, TP futexes are orthogonal to robust futexes can be
  * used together.
@@ -3352,6 +3357,16 @@ void exit_robust_list(struct task_struct *curr)
  * the ownership of the futex.
  */
 
+/*
+ * Timeout value for enabling lock handoff and prevent lock starvation.
+ *
+ * Currently, lock handoff will be enabled if the top waiter can't get the
+ * futex after about 5ms. To reduce time checking overhead, it is only done
+ * after every 128 spins or right after wakeup. As a result, the actual
+ * elapsed time will be a bit longer than the specified value.
+ */
+#define TP_HANDOFF_TIMEOUT	5000000	/* 5ms	*/
+
 /**
  * lookup_futex_state - Looking up the futex state structure.
  * @hb:		 hash bucket
@@ -3431,6 +3446,7 @@ static inline int put_futex_state_unlocked(struct futex_state *state)
  * If waiter is true
  * then
  *   don't preserve the flag bits;
+ *   check for handoff (futex word == own pid)
  * else
  *   preserve the flag bits
  * endif
@@ -3449,6 +3465,9 @@ static inline int futex_trylock(u32 __user *uaddr, const u32 vpid, u32 *puval,
 
 	uval = *puval;
 
+	if (waiter && (uval & FUTEX_TID_MASK) == vpid)
+		return 1;
+
 	if (uval & FUTEX_TID_MASK)
 		return 0;	/* Trylock fails */
 
@@ -3542,10 +3561,12 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 	"futex: owner pid %d of TP futex 0x%lx was %s.\n"	\
 	"\tLock is now acquired by pid %d!\n"
 
-	int ret;
+	int ret, loopcnt = 1;
+	bool handoff_set = false;
 	u32 uval;
 	u32 owner_pid = 0;
 	struct task_struct *owner_task = NULL;
+	u64 handoff_time = sched_clock() + TP_HANDOFF_TIMEOUT;
 
 	preempt_disable();
 	WRITE_ONCE(state->mutex_owner, current);
@@ -3600,6 +3621,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 		if (need_resched()) {
 			__set_current_state(TASK_RUNNING);
 			schedule_preempt_disabled();
+			loopcnt = 0;
 			continue;
 		}
 
@@ -3608,6 +3630,23 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 			break;
 		}
 
+		/*
+		 * Enable lock handoff if the elapsed time exceed the timeout
+		 * value. We also need to set the FUTEX_WAITERS bit to make
+		 * sure that futex lock holder will initiate the handoff at
+		 * unlock time.
+		 */
+		if (!handoff_set && !(loopcnt++ & 0x7f)) {
+			if (sched_clock() > handoff_time) {
+				WARN_ON(READ_ONCE(state->handoff_pid));
+				ret = futex_set_waiters_bit(uaddr, &uval);
+				if (ret)
+					break;
+				WRITE_ONCE(state->handoff_pid, vpid);
+				handoff_set = true;
+			}
+		}
+
 		if (owner_task->on_cpu) {
 			cpu_relax();
 			continue;
@@ -3650,8 +3689,10 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 		 * lock stealing happen in between setting the FUTEX_WAITERS
 		 * and setting state to TASK_INTERRUPTIBLE.
 		 */
-		if (!(uval & FUTEX_OWNER_DIED) && (uval & FUTEX_WAITERS))
+		if (!(uval & FUTEX_OWNER_DIED) && (uval & FUTEX_WAITERS)) {
 			schedule_preempt_disabled();
+			loopcnt = 0;
+		}
 		__set_current_state(TASK_RUNNING);
 	}
 
@@ -3673,6 +3714,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 	 * Cleanup futex state.
 	 */
 	WRITE_ONCE(state->mutex_owner, NULL);
+	WRITE_ONCE(state->handoff_pid, 0);
 
 	preempt_enable();
 	return ret;
@@ -3787,6 +3829,7 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags)
 static int futex_unlock(u32 __user *uaddr, unsigned int flags)
 {
 	u32 uval, vpid = task_pid_vnr(current);
+	u32 newpid = 0;
 	union futex_key key = FUTEX_KEY_INIT;
 	struct futex_hash_bucket *hb;
 	struct futex_state *state = NULL;
@@ -3820,6 +3863,10 @@ static int futex_unlock(u32 __user *uaddr, unsigned int flags)
 		goto out_put_key;
 	}
 
+	newpid = READ_ONCE(state->handoff_pid);
+	if (newpid)
+		WRITE_ONCE(state->handoff_pid, 0);
+
 	owner = READ_ONCE(state->mutex_owner);
 	if (owner)
 		wake_q_add(&wake_q, owner);
@@ -3827,13 +3874,13 @@ static int futex_unlock(u32 __user *uaddr, unsigned int flags)
 	spin_unlock(&hb->fs_lock);
 
 	/*
-	 * Unlock the futex.
+	 * Unlock the futex or handoff to the next owner.
 	 * The flag bits are not preserved to encourage more lock stealing.
 	 */
 	for (;;) {
 		u32 old = uval;
 
-		if (cmpxchg_futex_value(&uval, uaddr, old, 0)) {
+		if (cmpxchg_futex_value(&uval, uaddr, old, newpid)) {
 			ret = -EFAULT;
 			break;
 		}
-- 
1.8.3.1

  parent reply	other threads:[~2016-12-29 16:17 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-12-29 16:13 [PATCH v4 00/20] futex: Introducing throughput-optimized (TP) futexes Waiman Long
2016-12-29 16:13 ` [PATCH v4 01/20] futex: Consolidate duplicated timer setup code Waiman Long
2016-12-29 16:13 ` [PATCH v4 02/20] futex: Rename futex_pi_state to futex_state Waiman Long
2016-12-29 16:13 ` [PATCH v4 03/20] futex: Add helpers to get & cmpxchg futex value without lock Waiman Long
2016-12-29 16:13 ` [PATCH v4 04/20] futex: Consolidate pure pi_state_list add & delete codes to helpers Waiman Long
2016-12-29 16:13 ` [PATCH v4 05/20] futex: Add a new futex type field into futex_state Waiman Long
2016-12-29 16:13 ` [PATCH v4 06/20] futex: Allow direct attachment of futex_state objects to hash bucket Waiman Long
2016-12-29 16:13 ` [PATCH v4 07/20] futex: Introduce throughput-optimized (TP) futexes Waiman Long
2016-12-29 21:17   ` kbuild test robot
2016-12-29 16:13 ` [PATCH v4 08/20] TP-futex: Enable robust handling Waiman Long
2016-12-29 16:13 ` Waiman Long [this message]
2016-12-29 16:13 ` [PATCH v4 10/20] TP-futex: Return status code on FUTEX_LOCK calls Waiman Long
2016-12-29 16:13 ` [PATCH v4 11/20] TP-futex: Add timeout support Waiman Long
2016-12-29 16:13 ` [PATCH v4 12/20] TP-futex, doc: Add TP futexes documentation Waiman Long
2016-12-29 16:13 ` [PATCH v4 13/20] perf bench: New microbenchmark for userspace mutex performance Waiman Long
2017-01-02 17:16   ` Arnaldo Carvalho de Melo
2017-01-02 18:17     ` Waiman Long
2016-12-29 16:13 ` [PATCH v4 14/20] TP-futex: Support userspace reader/writer locks Waiman Long
2016-12-29 18:34   ` kbuild test robot
2016-12-29 16:13 ` [PATCH v4 15/20] TP-futex: Enable kernel reader lock stealing Waiman Long
2016-12-29 16:13 ` [PATCH v4 16/20] TP-futex: Group readers together in wait queue Waiman Long
2016-12-29 19:53   ` kbuild test robot
2016-12-29 20:16   ` kbuild test robot
2016-12-29 16:13 ` [PATCH v4 17/20] TP-futex, doc: Update TP futexes document on shared locking Waiman Long
2016-12-29 16:13 ` [PATCH v4 18/20] perf bench: New microbenchmark for userspace rwlock performance Waiman Long
2016-12-29 16:13 ` [PATCH v4 19/20] sched, TP-futex: Make wake_up_q() return wakeup count Waiman Long
2016-12-29 16:13 ` [PATCH v4 20/20] futex: Dump internal futex state via debugfs Waiman Long
2016-12-29 18:55   ` kbuild test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1483028026-10305-10-git-send-email-longman@redhat.com \
    --to=longman@redhat.com \
    --cc=acme@kernel.org \
    --cc=corbet@lwn.net \
    --cc=dave@stgolabs.net \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=scott.norton@hpe.com \
    --cc=tglx@linutronix.de \
    --cc=umgwanakikbuti@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox